Commit changes
457
docs/architecture/retrieval_inventory.md
Normal file
@@ -0,0 +1,457 @@
# Retrieval Inventory

## Scope and method

This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:

- `app/modules/rag/services/rag_service.py`
- `app/modules/rag/persistence/*.py`
- `app/modules/rag/indexing/code/**/*.py`
- `app/modules/rag/indexing/docs/**/*.py`
- `app/modules/rag_session/module.py`
- `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
- `app/modules/agent/engine/orchestrator/*.py`

`ASSUMPTION:` the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only `C0` through `C3` are materially implemented today; `C4+` exist only as enum constants.

## Current retrieval pipeline

1. The retrieval entrypoint is `POST /internal/rag/retrieve` in `app/modules/rag_session/module.py`.
2. The endpoint calls `RagService.retrieve(rag_session_id, query)`.
3. `RagQueryRouter` chooses `docs` or `code` mode from the raw query text.
4. `RagService` computes a single embedding for the full query via `GigaChatEmbedder`.
5. `RagQueryRepository.retrieve(...)` runs one SQL query against `rag_chunks` in PostgreSQL with `pgvector`.
6. Ranking order is:
   - lexical rank
   - test-file penalty
   - layer rank
   - vector distance `embedding <=> query_embedding`
7. Response items are normalized to `{source, content, layer, title, metadata, score}`.
8. If embeddings fail, retrieval falls back to the latest chunks from the same layers.
9. If code retrieval returns nothing, the service falls back to docs layers.
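
The ranking order in step 6 can be read as a lexicographic sort key. The following is a minimal in-memory sketch of that reading, not the SQL that `RagQueryRepository.retrieve(...)` actually runs. `Candidate`, `is_test_file`, and the rank assigned to `C0_SOURCE_CHUNKS` are assumptions introduced for illustration; the ranks for C3, C1, and C2 are the ones stated later in this document.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical in-memory stand-in for one `rag_chunks` row."""
    path: str
    layer: str
    lexical_rank: int       # assumed encoding: 0 is the best lexical match
    vector_distance: float  # value of `embedding <=> query_embedding`


# C3/C1/C2 ranks follow this document; the C0 value is an assumption
# (the document only says C0 ranks after C1).
LAYER_RANK = {
    "C3_ENTRYPOINTS": 0,
    "C1_SYMBOL_CATALOG": 1,
    "C2_DEPENDENCY_GRAPH": 2,
    "C0_SOURCE_CHUNKS": 3,
}


def is_test_file(path: str) -> bool:
    # Assumed heuristic; the real SQL condition is not shown in this document.
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def sort_key(c: Candidate) -> tuple:
    # Earlier tuple positions dominate later ones, mirroring the listed order.
    return (
        c.lexical_rank,
        int(is_test_file(c.path)),
        LAYER_RANK.get(c.layer, 99),
        c.vector_distance,
    )


def rank(candidates, limit=8):
    """Return the top `limit` candidates; 8 is the fixed limit noted in this document."""
    return sorted(candidates, key=sort_key)[:limit]
```

Under this reading, vector distance only breaks ties among rows that already share the same lexical rank, test-file status, and layer.
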

## Storage and indices

- Primary store: PostgreSQL from `DATABASE_URL`, configured in `app/modules/shared/db.py`.
- Vector extension: `CREATE EXTENSION IF NOT EXISTS vector` in `app/modules/rag/persistence/schema_repository.py`.
- Primary table: `rag_chunks`.
- Cache tables:
  - `rag_blob_cache`
  - `rag_chunk_cache`
  - `rag_session_chunk_map`
- SQL indexes currently created:
  - `(rag_session_id)`
  - `(rag_session_id, layer)`
  - `(rag_session_id, layer, path)`
  - `(qname)`
  - `(symbol_id)`
  - `(module_id)`
  - `(doc_kind)`
  - `(entrypoint_type, framework)`

`ASSUMPTION:` there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no `ivfflat`/`hnsw` index is defined here.

## Layer: C0_SOURCE_CHUNKS

### Implementation

- Produced by `CodeIndexingPipeline.index_file(...)` in `app/modules/rag/indexing/code/pipeline.py`.
- Chunking logic: `CodeTextChunker.chunk(...)` in `app/modules/rag/indexing/code/code_text/chunker.py`.
- Document builder: `CodeTextDocumentBuilder.build(...)` in `app/modules/rag/indexing/code/code_text/document_builder.py`.
- Persisted via `RagDocumentRepository.insert_documents(...)` into `rag_chunks`.

### Input contract

This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:

- required:
  - `path: str`
  - `content: str`
- optional:
  - `commit_sha: str | None`
  - `content_hash: str`
  - metadata fields copied through by `RagService._document_metadata(...)`

For retrieval, the layer is queried only indirectly through:

- `rag_session_id: str`
- `query: str`
- inferred mode/layers from `RagQueryRouter`
- fixed `limit=8`

### Output contract

Stored document shape:

- top-level:
  - `layer = "C0_SOURCE_CHUNKS"`
  - `lang = "python"`
  - `source.repo_id`
  - `source.commit_sha`
  - `source.path`
  - `title`
  - `text`
  - `span.start_line`
  - `span.end_line`
  - `embedding`
- metadata:
  - `chunk_index`
  - `chunk_type`: `symbol_block` or `window`
  - `module_or_unit`
  - `artifact_type = "CODE"`
  - plus file-level metadata injected by `RagService`

Returned retrieval item shape:

- `source`
- `content`
- `layer`
- `title`
- `metadata`
- `score`

Neither `line_start` nor `line_end` is returned to the caller directly; the values remain in the DB columns `span_start` / `span_end` and are only used in logs.

### Defaults & limits

- AST chunking prefers one chunk per top-level class/function/async function.
- Fallback window chunking:
  - `size = 80` lines
  - `overlap = 15` lines
- Global retrieval limit from `RagService.retrieve(...)`: `8`
- Embedding batch size from env:
  - `RAG_EMBED_BATCH_SIZE`
  - default `16`
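
To make the fallback window parameters concrete, here is a minimal sketch of line-window chunking with `size = 80` and `overlap = 15`. It is not the code of `CodeTextChunker.chunk(...)`; the function name `window_chunks`, the tuple layout, and the handling of the final partial window are assumptions.

```python
def window_chunks(lines, size=80, overlap=15):
    """Return (start_line, end_line, text) windows with 1-based inclusive line numbers."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap  # each new window starts 65 lines after the previous one
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window reached the end of the file
        start += step
    return chunks
```

With these defaults, a 200-line file yields windows covering lines 1-80, 66-145, and 131-200, so consecutive windows share 15 lines.
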

### Known issues

- Nested methods/functions are not emitted as C0 chunks unless represented inside a selected top-level block.
- Returned API payload omits line spans even though storage has them.
- No direct filter by path, namespace, symbol, or `top_k` is exposed through the current endpoint.

## Layer: C1_SYMBOL_CATALOG

### Implementation

- Symbol extraction: `SymbolExtractor.extract(...)` in `app/modules/rag/indexing/code/symbols/extractor.py`.
- AST parsing: `PythonAstParser.parse_module(...)`.
- Document builder: `SymbolDocumentBuilder.build(...)`.
- Retrieval reads rows from `rag_chunks`; there is no dedicated symbol table.

### Input contract

Indexing input is the same per-file payload as C0.

Observed symbol extraction source:

- Python AST only
- supported symbol kinds:
  - `class`
  - `function`
  - `method`
  - `const` for top-level imports/import aliases

Retrieval input is still the generic text query endpoint. Query terms are enriched by `extract_query_terms(...)`:

- extracts identifier-like tokens from query text
- normalizes camelCase/PascalCase to snake_case
- adds special intent terms for management/control-related queries
- max observed query terms: `6`
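
The query-term enrichment listed above can be approximated as follows. This is a sketch under stated assumptions, not the body of `extract_query_terms(...)`: the function name `extract_terms_sketch`, the regular expressions, the minimum token length of 3, and the choice to keep both the original and the snake_case form are all hypothetical, and the special intent terms are omitted because their rules are not described here.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(token: str) -> str:
    """Insert `_` at lower-to-upper boundaries, then lowercase."""
    return _CAMEL_BOUNDARY.sub("_", token).lower()


def extract_terms_sketch(query: str, max_terms: int = 6) -> list:
    terms = []
    for token in _IDENT.findall(query):
        # Keep the token as written plus its snake_case form (assumed behavior).
        for candidate in (token, to_snake_case(token)):
            if len(candidate) >= 3 and candidate not in terms:
                terms.append(candidate)
    return terms[:max_terms]  # cap matches the observed limit of 6
```

For example, `extract_terms_sketch("RagService retrieve")` produces `RagService`, `rag_service`, and `retrieve`.
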

### Output contract

Stored document shape:

- top-level:
  - `layer = "C1_SYMBOL_CATALOG"`
  - `title = qname`
  - `text = "<kind> <qname>\n<signature>\n<docstring?>"`
  - `span.start_line`
  - `span.end_line`
- metadata:
  - `symbol_id`
  - `qname`
  - `kind`
  - `signature`
  - `decorators_or_annotations`
  - `docstring_or_javadoc`
  - `parent_symbol_id`
  - `package_or_module`
  - `is_entry_candidate`
  - `lang_payload`
  - `artifact_type = "CODE"`

Observed `lang_payload` variants:

- class:
  - `bases`
- function/method:
  - `async`
- import alias:
  - `imported_from`
  - `import_alias`

### Defaults & limits

- Only Python source files are indexed into C-layers.
- Import and import-from declarations are materialized as `const` symbols only at module top level.
- Retrieval ranking gives C1 priority rank `1`, after C3 and before C2/C0.

### Known issues

- No explicit visibility/public-private model.
- `parent_symbol_id` currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
- Cross-file symbol resolution is not implemented; `dst_symbol_id` in edges resolves only against symbols extracted from the same file.

## Layer: C2_DEPENDENCY_GRAPH

### Implementation

- Edge extraction: `EdgeExtractor.extract(...)` in `app/modules/rag/indexing/code/edges/extractor.py`.
- Document builder: `EdgeDocumentBuilder.build(...)`.
- Built during `CodeIndexingPipeline.index_file(...)`.

### Input contract

Indexing input is the same per-file source payload as C0/C1.

Graph construction method:

- static analysis only
- Python AST walk only
- no runtime tracing
- no tree-sitter

Observed edge types:

- `calls`
- `imports`
- `inherits`

### Output contract

Stored document shape:

- top-level:
  - `layer = "C2_DEPENDENCY_GRAPH"`
  - `title = "<src_qname>:<edge_type>"`
  - `text = "<src_qname> <edge_type> <dst>"`
  - `span.start_line`
  - `span.end_line`
  - `links` contains one evidence link of type `EDGE`
- metadata:
  - `edge_id`
  - `edge_type`
  - `src_symbol_id`
  - `src_qname`
  - `dst_symbol_id`
  - `dst_ref`
  - `resolution`: `resolved` or `partial`
  - `lang_payload`
  - `artifact_type = "CODE"`

Observed `lang_payload` usage:

- for calls: may include `callsite_kind = "function_call"`

### Defaults & limits

- Edge extraction is per-file only.
- `imports` edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
- Layer rank in retrieval SQL: `2`

### Known issues

- There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in `rag_chunks`.
- Destination resolution is local to the file-level qname map.
- Top-level module import relationships are incompletely represented because `visit_Import` / `visit_ImportFrom` skip when there is no current scope.

## Layer: C3_ENTRYPOINTS

### Implementation

- Detection registry: `EntrypointDetectorRegistry.detect_all(...)`.
- Detectors:
  - `FastApiEntrypointDetector`
  - `FlaskEntrypointDetector`
  - `TyperClickEntrypointDetector`
- Document builder: `EntrypointDocumentBuilder.build(...)`.

### Input contract

Indexing input is the same per-file source payload as other C-layers.

Detected entrypoint families today:

- HTTP:
  - FastAPI decorators such as `.get`, `.post`, `.put`, `.patch`, `.delete`, `.route`
  - Flask `.route`
- CLI:
  - Typer/Click `.command`
  - Typer/Click `.callback`

Not detected:

- Django routes
- Celery tasks
- RQ jobs
- cron jobs / scheduler entries

### Output contract

Stored document shape:

- top-level:
  - `layer = "C3_ENTRYPOINTS"`
  - `title = route_or_command`
  - `text = "<framework> <entry_type> <route_or_command>"`
  - `span.start_line`
  - `span.end_line`
  - `links` contains one evidence link of type `CODE_SPAN`
- metadata:
  - `entry_id`
  - `entry_type`: observed `http` or `cli`
  - `framework`: observed `fastapi`, `flask`, `typer`, `click`
  - `route_or_command`
  - `handler_symbol_id`
  - `lang_payload`
  - `artifact_type = "CODE"`

FastAPI-specific observed payload:

- `lang_payload.methods = [HTTP_METHOD]` for `.get/.post/...`

### Defaults & limits

- Retrieval layer rank: `0`, the highest priority among code layers.
- Entrypoint mapping is handler-symbol centric:
  - decorator match -> symbol -> `handler_symbol_id`
  - physical location comes from symbol span

### Known issues

- Route parsing is string-based from decorator text, not semantic AST argument parsing.
- No dedicated entrypoint tags beyond `entry_type`, `framework`, and raw decorator-derived payload.
- Background jobs and non-decorator entrypoints are not indexed.

## Dependency graph / trace current state

### Exists or stub?

- C2 exists and is populated.
- It is not a stub.
- It is also not a full-project dependency graph service; it is a set of per-edge documents stored in `rag_chunks`.

### How the graph is built

- static Python AST analysis
- no runtime instrumentation
- no import graph resolver across modules
- no tree-sitter

### Edge types in data

- `calls`
- `imports`
- `inherits`

### Traversal API

- No traversal API was found in `app/modules/rag/*` or `app/modules/agent/*`.
- No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
- Current access path is only retrieval over indexed edge documents.

## Entrypoints current state

### Implemented extraction

- HTTP routes:
  - FastAPI
  - Flask
- CLI:
  - Typer
  - Click

### Mapping model

- `entrypoint -> handler_symbol_id -> symbol span/path`
- The entrypoint record itself stores:
  - framework
  - entry type
  - raw route/command string
  - handler symbol id

### Tags/types

- `entry_type` is the main normalized tag.
- Observed values: `http`, `cli`.
- `framework` is the second discriminator.
- There are no richer endpoint taxonomies such as `job`, `worker`, `webhook`, `scheduler`.

## Defaults and operational limits

- Query mode default: `docs`
- Code mode is enabled by keyword heuristics in `RagQueryRouter`
- Retrieval hard limit: `8`
- Fallback limit: `8`
- Query term extraction limit: `6`
- Ranked source bundle for project QA:
  - top `12` RAG items
  - top `10` file candidates
- No exposed `namespace`, `path_prefixes`, `top_k`, `max_chars`, `max_chunks`, `max_depth` in the public/internal retrieval endpoint

`ASSUMPTION:` the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though `RagQueryRepository.retrieve(...)` has an internal `path_prefixes` parameter.
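
The docs-by-default routing described in this section can be pictured with a few lines of code. This is only a sketch of a keyword heuristic, not the logic of `RagQueryRouter`: the function `choose_mode` and every entry in `CODE_HINTS` except `по коду` (the prefix this document says the project QA graph may prepend) are hypothetical.

```python
# Hypothetical hint list; only "по коду" is taken from this document.
CODE_HINTS = ("по коду", "function", "class", "method", "endpoint")


def choose_mode(query: str) -> str:
    """Return "code" when a hint appears in the query, otherwise the default "docs"."""
    lowered = query.lower()
    return "code" if any(hint in lowered for hint in CODE_HINTS) else "docs"
```

A heuristic of this kind makes the outcome depend on phrasing: a code question that happens to avoid every hint is routed to `docs`.
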

## Known cross-cutting issues

- Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
- Response payload drops explicit line spans even though spans are stored.
- Vector retrieval is coupled to a single provider-specific embedder.
- Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends `по коду`.
- There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.

## Where to plug ExplainPack pipeline

### Option 1: replace or extend `project_qa/context_analysis`

- Code location:
  - `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
- Why:
  - retrieval is already complete at this step
  - input bundle already contains ranked `rag_items` and `file_candidates`
  - output is already a structured `analysis_brief`
- Risk:
  - low
  - minimally invasive if ExplainPack consumes `source_bundle` and emits the same `analysis_brief` shape

### Option 2: insert a new orchestrator step between `context_retrieval` and `context_analysis`

- Code location:
  - `app/modules/agent/engine/orchestrator/template_registry.py`
  - `app/modules/agent/engine/orchestrator/step_registry.py`
- Why:
  - preserves current retrieval behavior
  - makes ExplainPack an explicit pipeline stage with its own artifact
  - cleanest for observability and future A/B migration
- Risk:
  - low to medium
  - requires one new artifact contract and one extra orchestration step, but no change to retrieval storage

### Option 3: introduce ExplainPack inside `ExplainActions.extract_logic`

- Code location:
  - `app/modules/agent/engine/orchestrator/actions/explain_actions.py`
- Why:
  - useful if ExplainPack is meant only for explain-style scenarios
  - keeps general project QA untouched
- Risk:
  - medium
  - narrower integration point; may create duplicate reasoning logic separate from project QA analysis path

## Bottom line

- C0-C3 are implemented and persisted in one physical store: `rag_chunks`.
- Retrieval is a hybrid SQL ranking over lexical heuristics plus pgvector distance.
- C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
- C3 covers FastAPI/Flask/Typer/Click only.
- The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for `context_analysis`.