Commit changes

2026-03-05 11:03:17 +03:00
parent 1ef0b4d68c
commit 417b8b6f72
261 changed files with 8215 additions and 332 deletions
--- a/docs/architecture/retrieval_inventory.md
+++ b/docs/architecture/retrieval_inventory.md
@@ -0,0 +1,457 @@
+# Retrieval Inventory
+
+## Scope and method
+
+This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:
+
+- `app/modules/rag/services/rag_service.py`
+- `app/modules/rag/persistence/*.py`
+- `app/modules/rag/indexing/code/**/*.py`
+- `app/modules/rag/indexing/docs/**/*.py`
+- `app/modules/rag_session/module.py`
+- `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
+- `app/modules/agent/engine/orchestrator/*.py`
+
+`ASSUMPTION:` the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only `C0` through `C3` are materially implemented today; `C4+` exist only as enum constants.
+
+## Current retrieval pipeline
+
+1. The retrieval entrypoint is `POST /internal/rag/retrieve` in `app/modules/rag_session/module.py`.
+2. The endpoint calls `RagService.retrieve(rag_session_id, query)`.
+3. `RagQueryRouter` chooses `docs` or `code` mode from the raw query text.
+4. `RagService` computes a single embedding for the full query via `GigaChatEmbedder`.
+5. `RagQueryRepository.retrieve(...)` runs one SQL query against `rag_chunks` in PostgreSQL with `pgvector`.
+6. Ranking order is:
+   - lexical rank
+   - test-file penalty
+   - layer rank
+   - vector distance `embedding <=> query_embedding`
+7. Response items are normalized to `{source, content, layer, title, metadata, score}`.
+8. If embedding computation fails, retrieval falls back to the latest chunks from the same layers.
+9. If code retrieval returns nothing, the service falls back to the docs layers.
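The ranking order in step 6 can be read as a lexicographic sort key. The following is a minimal sketch under stated assumptions, not the SQL that `RagQueryRepository.retrieve(...)` actually runs: the `Candidate` shape, the substring-based lexical match, the `"test" in path` penalty, and the rank `3` for `C0` are all hypothetical. Only the ranks of `C3`, `C1`, and `C2` and the order of the four criteria come from the inventory.

```python
from dataclasses import dataclass
from math import sqrt

# C3=0, C1=1, C2=2 are stated in the inventory.
# C0=3 is an assumption: the text only says C1 ranks before C2/C0.
LAYER_RANK = {
    "C3_ENTRYPOINTS": 0,
    "C1_SYMBOL_CATALOG": 1,
    "C2_DEPENDENCY_GRAPH": 2,
    "C0_SOURCE_CHUNKS": 3,
}


@dataclass
class Candidate:
    """Hypothetical row shape, not the real rag_chunks schema."""
    path: str
    layer: str
    text: str
    embedding: list[float]


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's `<=>` operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def sort_key(cand, terms, query_embedding):
    # Lower is better for every component; earlier components dominate.
    lexical_rank = 0 if any(t in cand.text.lower() for t in terms) else 1
    test_penalty = 1 if "test" in cand.path.lower() else 0  # hypothetical heuristic
    layer_rank = LAYER_RANK.get(cand.layer, 99)
    distance = cosine_distance(cand.embedding, query_embedding)
    return (lexical_rank, test_penalty, layer_rank, distance)


def rank(candidates, terms, query_embedding, limit=8):
    """Sort by the four criteria and keep the fixed retrieval limit."""
    ordered = sorted(candidates, key=lambda c: sort_key(c, terms, query_embedding))
    return ordered[:limit]
```

Because vector distance is the last component of the key, it only breaks ties among rows that already agree on lexical rank, test-file penalty, and layer rank.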
+
+## Storage and indices
+
+- Primary store: PostgreSQL from `DATABASE_URL`, configured in `app/modules/shared/db.py`.
+- Vector extension: `CREATE EXTENSION IF NOT EXISTS vector` in `app/modules/rag/persistence/schema_repository.py`.
+- Primary table: `rag_chunks`.
+- Cache tables:
+  - `rag_blob_cache`
+  - `rag_chunk_cache`
+  - `rag_session_chunk_map`
+- SQL indexes currently created:
+  - `(rag_session_id)`
+  - `(rag_session_id, layer)`
+  - `(rag_session_id, layer, path)`
+  - `(qname)`
+  - `(symbol_id)`
+  - `(module_id)`
+  - `(doc_kind)`
+  - `(entrypoint_type, framework)`
+
+`ASSUMPTION:` there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no `ivfflat`/`hnsw` index is defined here.
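For reference, this is roughly what an ANN index statement for the vector column would look like if one were added. It is a sketch only: the index name is hypothetical, the column name `embedding` is taken from the ranking expression `embedding <=> query_embedding`, and nothing in the schema code described here issues such a statement. `vector_cosine_ops` is the pgvector operator class that matches the `<=>` operator.

```python
def ann_index_ddl(method="hnsw", table="rag_chunks", column="embedding"):
    """Build a pgvector ANN index statement for cosine distance (`<=>`)."""
    if method not in ("hnsw", "ivfflat"):
        raise ValueError(f"unsupported pgvector index method: {method}")
    index_name = f"idx_{table}_{column}_{method}"  # hypothetical naming scheme
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING {method} ({column} vector_cosine_ops)"
    )
```

Without an index of this kind, ordering by `<=>` is an exact scan over the rows that survive the other filters.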
+
+## Layer: C0_SOURCE_CHUNKS
+
+### Implementation
+
+- Produced by `CodeIndexingPipeline.index_file(...)` in `app/modules/rag/indexing/code/pipeline.py`.
+- Chunking logic: `CodeTextChunker.chunk(...)` in `app/modules/rag/indexing/code/code_text/chunker.py`.
+- Document builder: `CodeTextDocumentBuilder.build(...)` in `app/modules/rag/indexing/code/code_text/document_builder.py`.
+- Persisted via `RagDocumentRepository.insert_documents(...)` into `rag_chunks`.
+
+### Input contract
+
+This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:
+
+- required:
+  - `path: str`
+  - `content: str`
+- optional:
+  - `commit_sha: str | None`
+  - `content_hash: str`
+  - metadata fields copied through by `RagService._document_metadata(...)`
+
+For retrieval, the layer is queried only indirectly through:
+
+- `rag_session_id: str`
+- `query: str`
+- inferred mode/layers from `RagQueryRouter`
+- fixed `limit=8`
+
+### Output contract
+
+Stored document shape:
+
+- top-level:
+  - `layer = "C0_SOURCE_CHUNKS"`
+  - `lang = "python"`
+  - `source.repo_id`
+  - `source.commit_sha`
+  - `source.path`
+  - `title`
+  - `text`
+  - `span.start_line`
+  - `span.end_line`
+  - `embedding`
+- metadata:
+  - `chunk_index`
+  - `chunk_type`: `symbol_block` or `window`
+  - `module_or_unit`
+  - `artifact_type = "CODE"`
+  - plus file-level metadata injected by `RagService`
+
+Returned retrieval item shape:
+
+- `source`
+- `content`
+- `layer`
+- `title`
+- `metadata`
+- `score`
+
+No `line_start` / `line_end` fields are returned to the caller; the spans remain in the DB columns `span_start` / `span_end` and are used only in logs.
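A small sketch of the normalization described above, showing where the span is dropped. The row keys other than `span_start` / `span_end` are assumptions, as is mapping `source` to the file path. The `include_span` flag is hypothetical and only illustrates how the stored span could be surfaced; it is not part of the current contract.

```python
RESPONSE_KEYS = ("source", "content", "layer", "title", "metadata", "score")


def to_response_item(row, include_span=False):
    """Map a stored chunk row (hypothetical dict shape) to the response item."""
    item = {
        "source": row["path"],  # assumption: `source` carries the file path
        "content": row["text"],
        "layer": row["layer"],
        "title": row["title"],
        "metadata": dict(row.get("metadata") or {}),
        "score": row.get("score"),
    }
    if include_span:
        # Hypothetical extension: today the span stays in the DB columns only.
        item["metadata"]["span"] = {
            "start_line": row.get("span_start"),
            "end_line": row.get("span_end"),
        }
    return item
```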
+
+### Defaults & limits
+
+- AST chunking prefers one chunk per top-level class/function/async function.
+- Fallback window chunking:
+  - `size = 80` lines
+  - `overlap = 15` lines
+- Global retrieval limit from `RagService.retrieve(...)`: `8`
+- Embedding batch size from env:
+  - `RAG_EMBED_BATCH_SIZE`
+  - default `16`
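The two chunking modes can be sketched with the standard `ast` module. This is an illustration under assumptions, not the code of `CodeTextChunker.chunk(...)`: the function names are hypothetical, and the rule for when the window fallback applies (no top-level blocks found, or the file fails to parse) is a guess. The `80`/`15` defaults and the `symbol_block` / `window` chunk types come from the inventory.

```python
import ast


def symbol_blocks(source):
    """1-based (start_line, end_line) spans of top-level classes and functions."""
    kinds = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    tree = ast.parse(source)
    return [(n.lineno, n.end_lineno) for n in tree.body if isinstance(n, kinds)]


def window_spans(total_lines, size=80, overlap=15):
    """Overlapping 1-based line windows; each new window starts size - overlap later."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    spans, start = [], 0
    while start < total_lines:
        end = min(start + size, total_lines)
        spans.append((start + 1, end))
        if end == total_lines:
            break
        start += size - overlap
    return spans


def chunk(source):
    """Prefer symbol blocks; fall back to windows (fallback rule is an assumption)."""
    try:
        blocks = symbol_blocks(source)
    except SyntaxError:
        blocks = []
    if blocks:
        return [("symbol_block", s, e) for s, e in blocks]
    return [("window", s, e) for s, e in window_spans(len(source.splitlines()))]
```

With the default parameters each window advances by 65 lines, so a 200-line file without top-level definitions yields three windows.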
+
+### Known issues
+
+- Nested methods/functions are not emitted as separate C0 chunks; they appear only inside the chunk of their enclosing top-level block.
+- Returned API payload omits line spans even though storage has them.
+- No direct filter by path, namespace, symbol, or `top_k` is exposed through the current endpoint.
+
+## Layer: C1_SYMBOL_CATALOG
+
+### Implementation
+
+- Symbol extraction: `SymbolExtractor.extract(...)` in `app/modules/rag/indexing/code/symbols/extractor.py`.
+- AST parsing: `PythonAstParser.parse_module(...)`.
+- Document builder: `SymbolDocumentBuilder.build(...)`.
+- Retrieval reads rows from `rag_chunks`; there is no dedicated symbol table.
+
+### Input contract
+
+Indexing input is the same per-file payload as C0.
+
+Observed symbol extraction source:
+
+- Python AST only
+- supported symbol kinds:
+  - `class`
+  - `function`
+  - `method`
+  - `const` for top-level imports/import aliases
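The symbol kinds listed above map naturally onto an `ast.NodeVisitor` with a scope stack. The sketch below is an assumption-laden illustration, not `SymbolExtractor.extract(...)`: the class name, the dotted `qname` format, and the output dict shape are hypothetical. It does mirror two behaviours the inventory reports: imports become `const` symbols only at module top level, and the parent reference is the parent's qname string taken from the stack.

```python
import ast


class SymbolSketch(ast.NodeVisitor):
    """Collects class / function / method / const symbols from one module."""

    def __init__(self):
        self.stack = []    # (kind, qname) of the enclosing symbols
        self.symbols = []

    def _emit(self, kind, name, node):
        parent = self.stack[-1][1] if self.stack else None
        qname = f"{parent}.{name}" if parent else name  # hypothetical qname format
        self.symbols.append({
            "kind": kind,
            "qname": qname,
            "parent_symbol_id": parent,  # qname string, as the inventory observes
            "span": (node.lineno, node.end_lineno),
        })
        return qname

    def _visit_scope(self, kind, node):
        qname = self._emit(kind, node.name, node)
        self.stack.append((kind, qname))
        self.generic_visit(node)
        self.stack.pop()

    def visit_ClassDef(self, node):
        self._visit_scope("class", node)

    def visit_FunctionDef(self, node):
        in_class = bool(self.stack) and self.stack[-1][0] == "class"
        self._visit_scope("method" if in_class else "function", node)

    visit_AsyncFunctionDef = visit_FunctionDef

    def _visit_import(self, node):
        if self.stack:  # only module-level imports become `const` symbols
            return
        for alias in node.names:
            self._emit("const", alias.asname or alias.name, node)

    visit_Import = visit_ImportFrom = _visit_import


def extract_symbols(source):
    visitor = SymbolSketch()
    visitor.visit(ast.parse(source))
    return visitor.symbols
```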
+
+Retrieval input is still the generic text query endpoint. Query terms are enriched by `extract_query_terms(...)`:
+
+- extracts identifier-like tokens from query text
+- normalizes camelCase/PascalCase to snake_case
+- adds special intent terms for management/control-related queries
+- max observed query terms: `6`
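A rough sketch of the term enrichment described above, assuming a plain regex tokenizer. It is not the body of `extract_query_terms(...)`: the minimum token length of 3 and the function names are hypothetical, and the special intent terms for management/control queries are not modeled. The camelCase/PascalCase normalization and the cap of `6` follow the inventory.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_CAMEL_WORD = re.compile(r"(.)([A-Z][a-z]+)")
_CAMEL_EDGE = re.compile(r"([a-z0-9])([A-Z])")


def to_snake_case(name):
    """RagService -> rag_service, retrieveChunks -> retrieve_chunks."""
    partial = _CAMEL_WORD.sub(r"\1_\2", name)
    return _CAMEL_EDGE.sub(r"\1_\2", partial).lower()


def extract_terms(query, max_terms=6, min_len=3):
    """Identifier-like tokens, normalized to snake_case, deduplicated, capped."""
    terms = []
    for token in _IDENT.findall(query):
        term = to_snake_case(token)
        if len(term) >= min_len and term not in terms:
            terms.append(term)
        if len(terms) == max_terms:
            break
    return terms
```

Usage: `extract_terms("Where does RagService call retrieveChunks?")` keeps the ordinary words as well, since this sketch has no stop-word list.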
+
+### Output contract
+
+Stored document shape:
+
+- top-level:
+  - `layer = "C1_SYMBOL_CATALOG"`
+  - `title = qname`
+  - `text = "<kind> <qname>\n<signature>\n<docstring?>"`
+  - `span.start_line`
+  - `span.end_line`
+- metadata:
+  - `symbol_id`
+  - `qname`
+  - `kind`
+  - `signature`
+  - `decorators_or_annotations`
+  - `docstring_or_javadoc`
+  - `parent_symbol_id`
+  - `package_or_module`
+  - `is_entry_candidate`
+  - `lang_payload`
+  - `artifact_type = "CODE"`
+
+Observed `lang_payload` variants:
+
+- class:
+  - `bases`
+- function/method:
+  - `async`
+- import alias:
+  - `imported_from`
+  - `import_alias`
+
+### Defaults & limits
+
+- Only Python source files are indexed into C-layers.
+- Import and import-from declarations are materialized as `const` symbols only at module top level.
+- Retrieval ranking gives C1 priority rank `1`, after C3 and before C2/C0.
+
+### Known issues
+
+- No explicit visibility/public-private model.
+- `parent_symbol_id` currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
+- Cross-file symbol resolution is not implemented; `dst_symbol_id` in edges resolves only against symbols extracted from the same file.
+
+## Layer: C2_DEPENDENCY_GRAPH
+
+### Implementation
+
+- Edge extraction: `EdgeExtractor.extract(...)` in `app/modules/rag/indexing/code/edges/extractor.py`.
+- Document builder: `EdgeDocumentBuilder.build(...)`.
+- Built during `CodeIndexingPipeline.index_file(...)`.
+
+### Input contract
+
+Indexing input is the same per-file source payload as C0/C1.
+
+Graph construction method:
+
+- static analysis only
+- Python AST walk only
+- no runtime tracing
+- no tree-sitter
+
+Observed edge types:
+
+- `calls`
+- `imports`
+- `inherits`
+
+### Output contract
+
+Stored document shape:
+
+- top-level:
+  - `layer = "C2_DEPENDENCY_GRAPH"`
+  - `title = "<src_qname>:<edge_type>"`
+  - `text = "<src_qname> <edge_type> <dst>"`
+  - `span.start_line`
+  - `span.end_line`
+  - `links` contains one evidence link of type `EDGE`
+- metadata:
+  - `edge_id`
+  - `edge_type`
+  - `src_symbol_id`
+  - `src_qname`
+  - `dst_symbol_id`
+  - `dst_ref`
+  - `resolution`: `resolved` or `partial`
+  - `lang_payload`
+  - `artifact_type = "CODE"`
+
+Observed `lang_payload` usage:
+
+- for calls: may include `callsite_kind = "function_call"`
+
+### Defaults & limits
+
+- Edge extraction is per-file only.
+- `imports` edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
+- Layer rank in retrieval SQL: `2`
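The per-file edge extraction and its scope-dependent import handling can be sketched as another `ast.NodeVisitor`. This is a hedged illustration, not `EdgeExtractor.extract(...)`: the class name, the edge dict shape, and the use of `ast.unparse` (Python 3.9+) for destination names are assumptions, as is emitting `calls` only inside a scope. The three edge types, the `resolved`/`partial` distinction against a file-local qname map, and the skipping of top-level imports follow the inventory.

```python
import ast


class EdgeSketch(ast.NodeVisitor):
    """Emits calls / imports / inherits edges for one file."""

    def __init__(self, local_qnames):
        self.local = set(local_qnames)  # qnames extracted from the same file only
        self.scope = []
        self.edges = []

    def _emit(self, edge_type, dst_ref):
        self.edges.append({
            "src_qname": ".".join(self.scope),
            "edge_type": edge_type,
            "dst_ref": dst_ref,
            "resolution": "resolved" if dst_ref in self.local else "partial",
        })

    def visit_ClassDef(self, node):
        self.scope.append(node.name)
        for base in node.bases:
            self._emit("inherits", ast.unparse(base))
        self.generic_visit(node)
        self.scope.pop()

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Call(self, node):
        if self.scope:  # assumption: calls are attributed to an enclosing symbol
            self._emit("calls", ast.unparse(node.func))
        self.generic_visit(node)

    def visit_Import(self, node):
        if not self.scope:  # top-level imports produce no C2 edge
            return
        for alias in node.names:
            self._emit("imports", alias.name)

    def visit_ImportFrom(self, node):
        if not self.scope:
            return
        for alias in node.names:
            prefix = f"{node.module}." if node.module else ""
            self._emit("imports", prefix + alias.name)


def extract_edges(source, local_qnames):
    visitor = EdgeSketch(local_qnames)
    visitor.visit(ast.parse(source))
    return visitor.edges
```

In this sketch a call to `json.dumps` stays `partial` because `json.dumps` is not in the file-local qname set, which is the same limitation the inventory describes for cross-file resolution.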
+
+### Known issues
+
+- There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in `rag_chunks`.
+- Destination resolution is local to the file-level qname map.
+- Top-level module import relationships are incompletely represented because `visit_Import` / `visit_ImportFrom` skip when there is no current scope.
+
+## Layer: C3_ENTRYPOINTS
+
+### Implementation
+
+- Detection registry: `EntrypointDetectorRegistry.detect_all(...)`.
+- Detectors:
+  - `FastApiEntrypointDetector`
+  - `FlaskEntrypointDetector`
+  - `TyperClickEntrypointDetector`
+- Document builder: `EntrypointDocumentBuilder.build(...)`.
+
+### Input contract
+
+Indexing input is the same per-file source payload as other C-layers.
+
+Detected entrypoint families today:
+
+- HTTP:
+  - FastAPI decorators such as `.get`, `.post`, `.put`, `.patch`, `.delete`, `.route`
+  - Flask `.route`
+- CLI:
+  - Typer/Click `.command`
+  - Typer/Click `.callback`
+
+Not detected:
+
+- Django routes
+- Celery tasks
+- RQ jobs
+- cron jobs / scheduler entries
+
+### Output contract
+
+Stored document shape:
+
+- top-level:
+  - `layer = "C3_ENTRYPOINTS"`
+  - `title = route_or_command`
+  - `text = "<framework> <entry_type> <route_or_command>"`
+  - `span.start_line`
+  - `span.end_line`
+  - `links` contains one evidence link of type `CODE_SPAN`
+- metadata:
+  - `entry_id`
+  - `entry_type`: observed `http` or `cli`
+  - `framework`: observed `fastapi`, `flask`, `typer`, `click`
+  - `route_or_command`
+  - `handler_symbol_id`
+  - `lang_payload`
+  - `artifact_type = "CODE"`
+
+FastAPI-specific observed payload:
+
+- `lang_payload.methods = [HTTP_METHOD]` for `.get/.post/...`
+
+### Defaults & limits
+
+- Retrieval layer rank: `0`, the highest priority among code layers.
+- Entrypoint mapping is handler-symbol centric:
+  - decorator match -> symbol -> `handler_symbol_id`
+  - physical location comes from symbol span
+
+### Known issues
+
+- Route parsing is string-based from decorator text, not semantic AST argument parsing.
+- No dedicated entrypoint tags beyond `entry_type`, `framework`, and raw decorator-derived payload.
+- Background jobs and non-decorator entrypoints are not indexed.
+
+## Dependency graph / trace current state
+
+### Exists or stub?
+
+- C2 exists and is populated.
+- It is not a stub.
+- It is also not a full-project dependency graph service; it is a set of per-edge documents stored in `rag_chunks`.
+
+### How the graph is built
+
+- static Python AST analysis
+- no runtime instrumentation
+- no import graph resolver across modules
+- no tree-sitter
+
+### Edge types in data
+
+- `calls`
+- `imports`
+- `inherits`
+
+### Traversal API
+
+- No traversal API was found in `app/modules/rag/*` or `app/modules/agent/*`.
+- No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
+- Current access path is only retrieval over indexed edge documents.
+
+## Entrypoints current state
+
+### Implemented extraction
+
+- HTTP routes:
+  - FastAPI
+  - Flask
+- CLI:
+  - Typer
+  - Click
+
+### Mapping model
+
+- `entrypoint -> handler_symbol_id -> symbol span/path`
+- The entrypoint record itself stores:
+  - framework
+  - entry type
+  - raw route/command string
+  - handler symbol id
+
+### Tags/types
+
+- `entry_type` is the main normalized tag.
+- Observed values: `http`, `cli`.
+- `framework` is the second discriminator.
+- There are no richer entrypoint taxonomies such as `job`, `worker`, `webhook`, `scheduler`.
+
+## Defaults and operational limits
+
+- Query mode default: `docs`
+- Code mode is enabled by keyword heuristics in `RagQueryRouter`
+- Retrieval hard limit: `8`
+- Fallback limit: `8`
+- Query term extraction limit: `6`
+- Ranked source bundle for project QA:
+  - top `12` RAG items
+  - top `10` file candidates
+- The retrieval endpoint exposes no `namespace`, `path_prefixes`, `top_k`, `max_chars`, `max_chunks`, or `max_depth` parameters
+
+`ASSUMPTION:` the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though `RagQueryRepository.retrieve(...)` has an internal `path_prefixes` parameter.
+
+## Known cross-cutting issues
+
+- Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
+- Response payload drops explicit line spans even though spans are stored.
+- Vector retrieval is coupled to a single provider-specific embedder.
+- Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends `по коду` (Russian, roughly "by code") to the query.
+- There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.
+
+## Where to plug ExplainPack pipeline
+
+### Option 1: replace or extend `project_qa/context_analysis`
+
+- Code location:
+  - `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
+- Why:
+  - retrieval is already complete at this step
+  - input bundle already contains ranked `rag_items` and `file_candidates`
+  - output is already a structured `analysis_brief`
+- Risk:
+  - low
+  - minimal invasion if ExplainPack consumes `source_bundle` and emits the same `analysis_brief` shape
+
+### Option 2: insert a new orchestrator step between `context_retrieval` and `context_analysis`
+
+- Code location:
+  - `app/modules/agent/engine/orchestrator/template_registry.py`
+  - `app/modules/agent/engine/orchestrator/step_registry.py`
+- Why:
+  - preserves current retrieval behavior
+  - makes ExplainPack an explicit pipeline stage with its own artifact
+  - cleanest for observability and future A/B migration
+- Risk:
+  - low to medium
+  - requires one new artifact contract and one extra orchestration step, but no change to retrieval storage
+
+### Option 3: introduce ExplainPack inside `ExplainActions.extract_logic`
+
+- Code location:
+  - `app/modules/agent/engine/orchestrator/actions/explain_actions.py`
+- Why:
+  - useful if ExplainPack is meant only for explain-style scenarios
+  - keeps general project QA untouched
+- Risk:
+  - medium
+  - narrower integration point; may create duplicate reasoning logic separate from project QA analysis path
+
+## Bottom line
+
+- C0-C3 are implemented and persisted in one physical store: `rag_chunks`.
+- Retrieval is a single hybrid SQL ranking: lexical rank, test-file penalty, and layer rank first, then pgvector distance.
+- C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
+- C3 covers FastAPI/Flask/Typer/Click only.
+- The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for `context_analysis`.