# Retrieval Inventory

## Scope and method

This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:

- `app/modules/rag/services/rag_service.py`
- `app/modules/rag/persistence/*.py`
- `app/modules/rag/indexing/code/**/*.py`
- `app/modules/rag/indexing/docs/**/*.py`
- `app/modules/rag_session/module.py`
- `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
- `app/modules/agent/engine/orchestrator/*.py`

`ASSUMPTION:` the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only `C0` through `C3` are materially implemented today; `C4+` exist only as enum constants.

## Current retrieval pipeline

1. Retrieval entrypoint is `POST /internal/rag/retrieve` in `app/modules/rag_session/module.py`.
2. The endpoint calls `RagService.retrieve(rag_session_id, query)`.
3. `RagQueryRouter` chooses `docs` or `code` mode from the raw query text.
4. `RagService` computes a single embedding for the full query via `GigaChatEmbedder`.
5. `RagQueryRepository.retrieve(...)` runs one SQL query against `rag_chunks` in PostgreSQL with `pgvector`.
6. Ranking order is:
   - lexical rank
   - test-file penalty
   - layer rank
   - vector distance `embedding <=> query_embedding`
7. Response items are normalized to `{source, content, layer, title, metadata, score}`.
8. If embeddings fail, retrieval falls back to latest chunks from the same layers.
9. If code retrieval returns nothing, service falls back to docs layers.

## Storage and indices

- Primary store: PostgreSQL from `DATABASE_URL`, configured in `app/modules/shared/db.py`.
- Vector extension: `CREATE EXTENSION IF NOT EXISTS vector` in `app/modules/rag/persistence/schema_repository.py`.
- Primary table: `rag_chunks`.
- Cache tables:
  - `rag_blob_cache`
  - `rag_chunk_cache`
  - `rag_session_chunk_map`
- SQL indexes currently created:
  - `(rag_session_id)`
  - `(rag_session_id, layer)`
  - `(rag_session_id, layer, path)`
  - `(qname)`
  - `(symbol_id)`
  - `(module_id)`
  - `(doc_kind)`
  - `(entrypoint_type, framework)`

`ASSUMPTION:` there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no `ivfflat`/`hnsw` index is defined here.

## Layer: C0_SOURCE_CHUNKS

### Implementation

- Produced by `CodeIndexingPipeline.index_file(...)` in `app/modules/rag/indexing/code/pipeline.py`.
- Chunking logic: `CodeTextChunker.chunk(...)` in `app/modules/rag/indexing/code/code_text/chunker.py`.
- Document builder: `CodeTextDocumentBuilder.build(...)` in `app/modules/rag/indexing/code/code_text/document_builder.py`.
- Persisted via `RagDocumentRepository.insert_documents(...)` into `rag_chunks`.

### Input contract

This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:

- required:
  - `path: str`
  - `content: str`
- optional:
  - `commit_sha: str | None`
  - `content_hash: str`
  - metadata fields copied through by `RagService._document_metadata(...)`

For retrieval, the layer is queried only indirectly through:

- `rag_session_id: str`
- `query: str`
- inferred mode/layers from `RagQueryRouter`
- fixed `limit=8`

### Output contract

Stored document shape:

- top-level:
  - `layer = "C0_SOURCE_CHUNKS"`
  - `lang = "python"`
  - `source.repo_id`
  - `source.commit_sha`
  - `source.path`
  - `title`
  - `text`
  - `span.start_line`
  - `span.end_line`
  - `embedding`
- metadata:
  - `chunk_index`
  - `chunk_type`: `symbol_block` or `window`
  - `module_or_unit`
  - `artifact_type = "CODE"`
  - plus file-level metadata injected by `RagService`

Returned retrieval item shape:

- `source`
- `content`
- `layer`
- `title`
- `metadata`
- `score`

No `line_start` / `line_end` are returned to the caller directly; they remain in DB columns
`span_start` / `span_end` and are only used in logs.

### Defaults & limits

- AST chunking prefers one chunk per top-level class/function/async function.
- Fallback window chunking:
  - `size = 80` lines
  - `overlap = 15` lines
- Global retrieval limit from `RagService.retrieve(...)`: `8`
- Embedding batch size from env:
  - `RAG_EMBED_BATCH_SIZE`
  - default `16`

### Known issues

- Nested methods/functions are not emitted as C0 chunks unless represented inside a selected top-level block.
- Returned API payload omits line spans even though storage has them.
- No direct filter by path, namespace, symbol, or `top_k` is exposed through the current endpoint.

## Layer: C1_SYMBOL_CATALOG

### Implementation

- Symbol extraction: `SymbolExtractor.extract(...)` in `app/modules/rag/indexing/code/symbols/extractor.py`.
- AST parsing: `PythonAstParser.parse_module(...)`.
- Document builder: `SymbolDocumentBuilder.build(...)`.
- Retrieval reads rows from `rag_chunks`; there is no dedicated symbol table.

### Input contract

Indexing input is the same per-file payload as C0.

Observed symbol extraction source:

- Python AST only
- supported symbol kinds:
  - `class`
  - `function`
  - `method`
  - `const` for top-level imports/import aliases

Retrieval input is still the generic text query endpoint.
Query terms are enriched by `extract_query_terms(...)`:

- extracts identifier-like tokens from query text
- normalizes camelCase/PascalCase to snake_case
- adds special intent terms for management/control-related queries
- max observed query terms: `6`

### Output contract

Stored document shape:

- top-level:
  - `layer = "C1_SYMBOL_CATALOG"`
  - `title = qname`
  - `text = " \n\n"`
  - `span.start_line`
  - `span.end_line`
- metadata:
  - `symbol_id`
  - `qname`
  - `kind`
  - `signature`
  - `decorators_or_annotations`
  - `docstring_or_javadoc`
  - `parent_symbol_id`
  - `package_or_module`
  - `is_entry_candidate`
  - `lang_payload`
  - `artifact_type = "CODE"`

Observed `lang_payload` variants:

- class:
  - `bases`
- function/method:
  - `async`
- import alias:
  - `imported_from`
  - `import_alias`

### Defaults & limits

- Only Python source files are indexed into C-layers.
- Import and import-from declarations are materialized as `const` symbols only at module top level.
- Retrieval ranking gives C1 priority rank `1`, after C3 and before C2/C0.

### Known issues

- No explicit visibility/public-private model.
- `parent_symbol_id` currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
- Cross-file symbol resolution is not implemented; `dst_symbol_id` in edges resolves only against symbols extracted from the same file.

## Layer: C2_DEPENDENCY_GRAPH

### Implementation

- Edge extraction: `EdgeExtractor.extract(...)` in `app/modules/rag/indexing/code/edges/extractor.py`.
- Document builder: `EdgeDocumentBuilder.build(...)`.
- Built during `CodeIndexingPipeline.index_file(...)`.

### Input contract

Indexing input is the same per-file source payload as C0/C1.
Graph construction method:

- static analysis only
- Python AST walk only
- no runtime tracing
- no tree-sitter

Observed edge types:

- `calls`
- `imports`
- `inherits`

### Output contract

Stored document shape:

- top-level:
  - `layer = "C2_DEPENDENCY_GRAPH"`
  - `title = ":"`
  - `text = " "`
  - `span.start_line`
  - `span.end_line`
  - `links` contains one evidence link of type `EDGE`
- metadata:
  - `edge_id`
  - `edge_type`
  - `src_symbol_id`
  - `src_qname`
  - `dst_symbol_id`
  - `dst_ref`
  - `resolution`: `resolved` or `partial`
  - `lang_payload`
  - `artifact_type = "CODE"`

Observed `lang_payload` usage:

- for calls: may include `callsite_kind = "function_call"`

### Defaults & limits

- Edge extraction is per-file only.
- `imports` edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
- Layer rank in retrieval SQL: `2`

### Known issues

- There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in `rag_chunks`.
- Destination resolution is local to the file-level qname map.
- Top-level module import relationships are incompletely represented because `visit_Import` / `visit_ImportFrom` skip when there is no current scope.

## Layer: C3_ENTRYPOINTS

### Implementation

- Detection registry: `EntrypointDetectorRegistry.detect_all(...)`.
- Detectors:
  - `FastApiEntrypointDetector`
  - `FlaskEntrypointDetector`
  - `TyperClickEntrypointDetector`
- Document builder: `EntrypointDocumentBuilder.build(...)`.

### Input contract

Indexing input is the same per-file source payload as other C-layers.
Detected entrypoint families today:

- HTTP:
  - FastAPI decorators such as `.get`, `.post`, `.put`, `.patch`, `.delete`, `.route`
  - Flask `.route`
- CLI:
  - Typer/Click `.command`
  - Typer/Click `.callback`

Not detected:

- Django routes
- Celery tasks
- RQ jobs
- cron jobs / scheduler entries

### Output contract

Stored document shape:

- top-level:
  - `layer = "C3_ENTRYPOINTS"`
  - `title = route_or_command`
  - `text = " "`
  - `span.start_line`
  - `span.end_line`
  - `links` contains one evidence link of type `CODE_SPAN`
- metadata:
  - `entry_id`
  - `entry_type`: observed `http` or `cli`
  - `framework`: observed `fastapi`, `flask`, `typer`, `click`
  - `route_or_command`
  - `handler_symbol_id`
  - `lang_payload`
  - `artifact_type = "CODE"`

FastAPI-specific observed payload:

- `lang_payload.methods = [HTTP_METHOD]` for `.get/.post/...`

### Defaults & limits

- Retrieval layer rank: `0` highest among code layers.
- Entrypoint mapping is handler-symbol centric:
  - decorator match -> symbol -> `handler_symbol_id`
  - physical location comes from symbol span

### Known issues

- Route parsing is string-based from decorator text, not semantic AST argument parsing.
- No dedicated entrypoint tags beyond `entry_type`, `framework`, and raw decorator-derived payload.
- Background jobs and non-decorator entrypoints are not indexed.

## Dependency graph / trace current state

### Exists or stub?

- C2 exists and is populated.
- It is not a stub.
- It is also not a full-project dependency graph service; it is a set of per-edge documents stored in `rag_chunks`.

### How the graph is built

- static Python AST analysis
- no runtime instrumentation
- no import graph resolver across modules
- no tree-sitter

### Edge types in data

- `calls`
- `imports`
- `inherits`

### Traversal API

- No traversal API was found in `app/modules/rag/*` or `app/modules/agent/*`.
- No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
- Current access path is only retrieval over indexed edge documents.

## Entrypoints current state

### Implemented extraction

- HTTP routes:
  - FastAPI
  - Flask
- CLI:
  - Typer
  - Click

### Mapping model

- `entrypoint -> handler_symbol_id -> symbol span/path`
- The entrypoint record itself stores:
  - framework
  - entry type
  - raw route/command string
  - handler symbol id

### Tags/types

- `entry_type` is the main normalized tag.
- Observed values: `http`, `cli`.
- `framework` is the second discriminator.
- There are no richer endpoint taxonomies such as `job`, `worker`, `webhook`, `scheduler`.

## Defaults and operational limits

- Query mode default: `docs`
- Code mode is enabled by keyword heuristics in `RagQueryRouter`
- Retrieval hard limit: `8`
- Fallback limit: `8`
- Query term extraction limit: `6`
- Ranked source bundle for project QA:
  - top `12` RAG items
  - top `10` file candidates
- No exposed `namespace`, `path_prefixes`, `top_k`, `max_chars`, `max_chunks`, `max_depth` in the public/internal retrieval endpoint

`ASSUMPTION:` the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though `RagQueryRepository.retrieve(...)` has an internal `path_prefixes` parameter.

## Known cross-cutting issues

- Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
- Response payload drops explicit line spans even though spans are stored.
- Vector retrieval is coupled to a single provider-specific embedder.
- Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends `по коду` (Russian, roughly "in the code").
- There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.
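
Because layer selection and ordering are implicit, it can help to see the ranking order from "Current retrieval pipeline" written out as a plain sort key. The following is a minimal in-memory sketch, not the actual SQL. The names `Candidate`, `is_test_file`, `sort_key`, and `rank` are hypothetical, the `lexical_rank` scale (0 is best) is an assumption, and the rank `3` for `C0_SOURCE_CHUNKS` is assumed, since the inventory only establishes that C0 comes after C1.

```python
from dataclasses import dataclass

# Ranks for C3, C1 and C2 are the ones stated in this inventory.
# The rank for C0 is an assumption (it is only known to come after C1).
LAYER_RANK = {
    "C3_ENTRYPOINTS": 0,
    "C1_SYMBOL_CATALOG": 1,
    "C2_DEPENDENCY_GRAPH": 2,
    "C0_SOURCE_CHUNKS": 3,
}


@dataclass(frozen=True)
class Candidate:
    path: str
    layer: str
    lexical_rank: int  # hypothetical scale: 0 is the best lexical match
    distance: float    # vector distance to the query, smaller is closer


def is_test_file(path: str) -> bool:
    """Hypothetical test-file heuristic; the real SQL predicate may differ."""
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def sort_key(c: Candidate) -> tuple[int, int, int, float]:
    # Tuple comparison applies the criteria in order:
    # lexical rank, test-file penalty, layer rank, vector distance.
    return (
        c.lexical_rank,
        int(is_test_file(c.path)),
        LAYER_RANK.get(c.layer, len(LAYER_RANK)),
        c.distance,
    )


def rank(candidates: list[Candidate], limit: int = 8) -> list[Candidate]:
    """Sort candidates and keep the first `limit` (8 mirrors the fixed retrieval limit)."""
    return sorted(candidates, key=sort_key)[:limit]
```

Under these assumptions, vector distance only breaks ties: a close vector match in a test file or a lower-priority layer still sorts after a more distant match with an equal lexical rank in `C3_ENTRYPOINTS`.
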
## Where to plug ExplainPack pipeline

### Option 1: replace or extend `project_qa/context_analysis`

- Code location:
  - `app/modules/agent/engine/graphs/project_qa_step_graphs.py`
- Why:
  - retrieval is already complete at this step
  - input bundle already contains ranked `rag_items` and `file_candidates`
  - output is already a structured `analysis_brief`
- Risk:
  - low
  - minimal invasion if ExplainPack consumes `source_bundle` and emits the same `analysis_brief` shape

### Option 2: insert a new orchestrator step between `context_retrieval` and `context_analysis`

- Code location:
  - `app/modules/agent/engine/orchestrator/template_registry.py`
  - `app/modules/agent/engine/orchestrator/step_registry.py`
- Why:
  - preserves current retrieval behavior
  - makes ExplainPack an explicit pipeline stage with its own artifact
  - cleanest for observability and future A/B migration
- Risk:
  - low to medium
  - requires one new artifact contract and one extra orchestration step, but no change to retrieval storage

### Option 3: introduce ExplainPack inside `ExplainActions.extract_logic`

- Code location:
  - `app/modules/agent/engine/orchestrator/actions/explain_actions.py`
- Why:
  - useful if ExplainPack is meant only for explain-style scenarios
  - keeps general project QA untouched
- Risk:
  - medium
  - narrower integration point; may create duplicate reasoning logic separate from project QA analysis path

## Bottom line

- C0-C3 are implemented and persisted in one physical store: `rag_chunks`.
- Retrieval is a hybrid SQL ranking over lexical heuristics plus pgvector distance.
- C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
- C3 covers FastAPI/Flask/Typer/Click only.
- The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for `context_analysis`.
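
To illustrate what Option 2 could look like in the abstract, here is a minimal sketch of inserting an explicit step between `context_retrieval` and `context_analysis`. It does not use the real `template_registry.py` or `step_registry.py` APIs, which are not described in this inventory. The step-list representation, `explain_pack_step`, `insert_after`, `run`, and the `explain_pack` artifact shape are all hypothetical. Only `source_bundle`, `rag_items`, `file_candidates`, the item `layer` field, and the 12/10 caps come from the text above.

```python
from typing import Any, Callable

State = dict[str, Any]
Step = Callable[[State], State]


def explain_pack_step(state: State) -> State:
    """Hypothetical step: derive an `explain_pack` artifact from `source_bundle`."""
    bundle = state["source_bundle"]
    by_layer: dict[str, list[dict[str, Any]]] = {}
    # The caps mirror the documented bundle sizes (12 RAG items, 10 file candidates).
    for item in bundle.get("rag_items", [])[:12]:
        by_layer.setdefault(item["layer"], []).append(item)
    pack = {
        "items_by_layer": by_layer,
        "files": list(bundle.get("file_candidates", []))[:10],
    }
    # Return a new state so earlier artifacts stay untouched.
    return {**state, "explain_pack": pack}


def insert_after(
    steps: list[tuple[str, Step]], anchor: str, name: str, step: Step
) -> list[tuple[str, Step]]:
    """Return a new step list with (name, step) placed right after `anchor`."""
    out: list[tuple[str, Step]] = []
    for step_name, fn in steps:
        out.append((step_name, fn))
        if step_name == anchor:
            out.append((name, step))
    return out


def run(steps: list[tuple[str, Step]], state: State) -> State:
    """Run the steps in order, threading the state through each one."""
    for _, fn in steps:
        state = fn(state)
    return state
```

In this shape the retrieval step is unchanged and the new artifact is additive, which matches the "no change to retrieval storage" risk note for Option 2. A downstream analysis step would read `state["explain_pack"]` instead of, or in addition to, `state["source_bundle"]`.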