Retrieval Inventory
Scope and method
This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:
- app/modules/rag/services/rag_service.py
- app/modules/rag/persistence/*.py
- app/modules/rag/indexing/code/**/*.py
- app/modules/rag/indexing/docs/**/*.py
- app/modules/rag_session/module.py
- app/modules/agent/engine/graphs/project_qa_step_graphs.py
- app/modules/agent/engine/orchestrator/*.py
ASSUMPTION: the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only C0 through C3 are materially implemented today; C4+ exist only as enum constants.
Current retrieval pipeline
- Retrieval entrypoint is POST /internal/rag/retrieve in app/modules/rag_session/module.py.
- The endpoint calls RagService.retrieve(rag_session_id, query).
- RagQueryRouter chooses docs or code mode from the raw query text.
- RagService computes a single embedding for the full query via GigaChatEmbedder.
- RagQueryRepository.retrieve(...) runs one SQL query against rag_chunks in PostgreSQL with pgvector.
- Ranking order is:
  - lexical rank
  - test-file penalty
  - layer rank
  - vector distance embedding <=> query_embedding
- Response items are normalized to {source, content, layer, title, metadata, score}.
- If embeddings fail, retrieval falls back to latest chunks from the same layers.
- If code retrieval returns nothing, service falls back to docs layers.
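The ranking order above can be pictured as a lexicographic sort key. The following is a minimal in-memory sketch, not the actual SQL: the Candidate type, the lexical_rank scale, and the is_test_file heuristic are hypothetical, and the rank of 3 for C0 is an assumption (this inventory only states the ranks for C3, C1, and C2).

```python
from dataclasses import dataclass

# Ranks for C3, C1 and C2 come from this inventory; C0 = 3 is an assumption.
LAYER_RANK = {
    "C3_ENTRYPOINTS": 0,
    "C1_SYMBOL_CATALOG": 1,
    "C2_DEPENDENCY_GRAPH": 2,
    "C0_SOURCE_CHUNKS": 3,
}


@dataclass
class Candidate:
    path: str
    layer: str
    lexical_rank: int  # hypothetical scale: lower is better
    distance: float    # stand-in for embedding <=> query_embedding


def is_test_file(path: str) -> bool:
    # Hypothetical heuristic; the real penalty rule lives in the SQL query.
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def sort_key(c: Candidate) -> tuple[int, int, int, float]:
    # Tuple comparison applies the criteria in the documented order.
    return (
        c.lexical_rank,
        1 if is_test_file(c.path) else 0,
        LAYER_RANK.get(c.layer, len(LAYER_RANK)),
        c.distance,
    )


def rank(candidates: list[Candidate], limit: int = 8) -> list[Candidate]:
    return sorted(candidates, key=sort_key)[:limit]
```

Because the key is a tuple, vector distance only breaks ties among rows that already agree on lexical rank, test-file penalty, and layer rank.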
Storage and indices
- Primary store: PostgreSQL from DATABASE_URL, configured in app/modules/shared/db.py.
- Vector extension: CREATE EXTENSION IF NOT EXISTS vector in app/modules/rag/persistence/schema_repository.py.
- Primary table: rag_chunks.
- Cache tables:
  - rag_blob_cache
  - rag_chunk_cache
  - rag_session_chunk_map
- SQL indexes currently created:
  - (rag_session_id)
  - (rag_session_id, layer)
  - (rag_session_id, layer, path)
  - (qname)
  - (symbol_id)
  - (module_id)
  - (doc_kind)
  - (entrypoint_type, framework)
ASSUMPTION: there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no ivfflat/hnsw index is defined here.
Layer: C0_SOURCE_CHUNKS
Implementation
- Produced by CodeIndexingPipeline.index_file(...) in app/modules/rag/indexing/code/pipeline.py.
- Chunking logic: CodeTextChunker.chunk(...) in app/modules/rag/indexing/code/code_text/chunker.py.
- Document builder: CodeTextDocumentBuilder.build(...) in app/modules/rag/indexing/code/code_text/document_builder.py.
- Persisted via RagDocumentRepository.insert_documents(...) into rag_chunks.
Input contract
This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:
- required:
  - path: str
  - content: str
- optional:
  - commit_sha: str | None
  - content_hash: str
  - metadata fields copied through by RagService._document_metadata(...)
For retrieval, the layer is queried only indirectly through:
- rag_session_id: str
- query: str
- inferred mode/layers from RagQueryRouter
- fixed limit=8
Output contract
Stored document shape:
- top-level:
  - layer = "C0_SOURCE_CHUNKS"
  - lang = "python"
  - source.repo_id
  - source.commit_sha
  - source.path
  - title
  - text
  - span.start_line
  - span.end_line
  - embedding
- metadata:
  - chunk_index
  - chunk_type: symbol_block or window
  - module_or_unit
  - artifact_type = "CODE"
  - plus file-level metadata injected by RagService
Returned retrieval item shape:
- source
- content
- layer
- title
- metadata
- score
Neither line_start nor line_end is returned to the caller; the spans remain in the DB columns span_start / span_end and are used only in logs.
Defaults & limits
- AST chunking prefers one chunk per top-level class/function/async function.
- Fallback window chunking:
  - size = 80 lines
  - overlap = 15 lines
- Global retrieval limit from RagService.retrieve(...): 8
- Embedding batch size from env:
  - RAG_EMBED_BATCH_SIZE
  - default 16
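To make the fallback window parameters concrete, here is a minimal sketch of overlapping line windows with size 80 and overlap 15. The function name and return shape are hypothetical, and how CodeTextChunker handles the final partial window is an assumption.

```python
def window_chunks(text: str, size: int = 80, overlap: int = 15) -> list[tuple[int, int, str]]:
    """Split text into overlapping line windows.

    Returns (start_line, end_line, chunk_text) with 1-based inclusive line numbers.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap  # each window starts 65 lines after the previous one
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # assumption: the last window is simply shorter
        start += step
    return chunks
```

With these defaults a 200-line file yields three windows covering lines 1-80, 66-145, and 131-200.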
Known issues
- Nested methods/functions are not emitted as C0 chunks unless represented inside a selected top-level block.
- Returned API payload omits line spans even though storage has them.
- No direct filter by path, namespace, symbol, or top_k is exposed through the current endpoint.
Layer: C1_SYMBOL_CATALOG
Implementation
- Symbol extraction: SymbolExtractor.extract(...) in app/modules/rag/indexing/code/symbols/extractor.py.
- AST parsing: PythonAstParser.parse_module(...).
- Document builder: SymbolDocumentBuilder.build(...).
- Retrieval reads rows from rag_chunks; there is no dedicated symbol table.
Input contract
Indexing input is the same per-file payload as C0.
Observed symbol extraction source:
- Python AST only
- supported symbol kinds:
  - class
  - function
  - method
  - const for top-level imports/import aliases
Retrieval input is still the generic text query endpoint. Query terms are enriched by extract_query_terms(...):
- extracts identifier-like tokens from query text
- normalizes camelCase/PascalCase to snake_case
- adds special intent terms for management/control-related queries
- max observed query terms: 6
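The term enrichment described above could look roughly like the sketch below. It is not the real extract_query_terms(...): the regexes, the minimum token length of 3, and the function names are assumptions, and the special intent terms for management/control queries are not modelled.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Word boundaries inside camelCase/PascalCase, including acronym runs (HTTPServer).
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def to_snake_case(token: str) -> str:
    return _CAMEL_BOUNDARY.sub("_", token).lower()


def extract_query_terms_sketch(query: str, max_terms: int = 6) -> list[str]:
    terms: list[str] = []
    for token in _IDENT.findall(query):
        term = to_snake_case(token)
        # Hypothetical filter: drop very short tokens and duplicates.
        if len(term) < 3 or term in terms:
            continue
        terms.append(term)
        if len(terms) == max_terms:
            break
    return terms
```

The cap of 6 matches the observed limit; everything else about token selection is a simplification.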
Output contract
Stored document shape:
- top-level:
  - layer = "C1_SYMBOL_CATALOG"
  - title = qname
  - text = "<kind> <qname>\n<signature>\n<docstring?>"
  - span.start_line
  - span.end_line
- metadata:
  - symbol_id
  - qname
  - kind
  - signature
  - decorators_or_annotations
  - docstring_or_javadoc
  - parent_symbol_id
  - package_or_module
  - is_entry_candidate
  - lang_payload
  - artifact_type = "CODE"
Observed lang_payload variants:
- class: bases
- function/method: async
- import alias: imported_from, import_alias
Defaults & limits
- Only Python source files are indexed into C-layers.
- Import and import-from declarations are materialized as const symbols only at module top level.
- Retrieval ranking gives C1 priority rank 1, after C3 and before C2/C0.
Known issues
- No explicit visibility/public-private model.
- parent_symbol_id currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
- Cross-file symbol resolution is not implemented; dst_symbol_id in edges resolves only against symbols extracted from the same file.
Layer: C2_DEPENDENCY_GRAPH
Implementation
- Edge extraction: EdgeExtractor.extract(...) in app/modules/rag/indexing/code/edges/extractor.py.
- Document builder: EdgeDocumentBuilder.build(...).
- Built during CodeIndexingPipeline.index_file(...).
Input contract
Indexing input is the same per-file source payload as C0/C1.
Graph construction method:
- static analysis only
- Python AST walk only
- no runtime tracing
- no tree-sitter
Observed edge types:
- calls
- imports
- inherits
Output contract
Stored document shape:
- top-level:
  - layer = "C2_DEPENDENCY_GRAPH"
  - title = "<src_qname>:<edge_type>"
  - text = "<src_qname> <edge_type> <dst>"
  - span.start_line
  - span.end_line
  - links contains one evidence link of type EDGE
- metadata:
  - edge_id
  - edge_type
  - src_symbol_id
  - src_qname
  - dst_symbol_id
  - dst_ref
  - resolution: resolved or partial
  - lang_payload
  - artifact_type = "CODE"
Observed lang_payload usage:
- for calls: may include callsite_kind = "function_call"
Defaults & limits
- Edge extraction is per-file only.
- imports edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
- Layer rank in retrieval SQL: 2
Known issues
- There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in rag_chunks.
- Destination resolution is local to the file-level qname map.
- Top-level module import relationships are incompletely represented because visit_Import/visit_ImportFrom skip when there is no current scope.
Layer: C3_ENTRYPOINTS
Implementation
- Detection registry: EntrypointDetectorRegistry.detect_all(...).
- Detectors:
  - FastApiEntrypointDetector
  - FlaskEntrypointDetector
  - TyperClickEntrypointDetector
- Document builder: EntrypointDocumentBuilder.build(...).
Input contract
Indexing input is the same per-file source payload as other C-layers.
Detected entrypoint families today:
- HTTP:
  - FastAPI decorators such as .get, .post, .put, .patch, .delete, .route
  - Flask .route
- CLI:
  - Typer/Click .command
  - Typer/Click .callback
Not detected:
- Django routes
- Celery tasks
- RQ jobs
- cron jobs / scheduler entries
Output contract
Stored document shape:
- top-level:
  - layer = "C3_ENTRYPOINTS"
  - title = route_or_command
  - text = "<framework> <entry_type> <route_or_command>"
  - span.start_line
  - span.end_line
  - links contains one evidence link of type CODE_SPAN
- metadata:
  - entry_id
  - entry_type: observed http or cli
  - framework: observed fastapi, flask, typer, click
  - route_or_command
  - handler_symbol_id
  - lang_payload
  - artifact_type = "CODE"
FastAPI-specific observed payload:
- lang_payload.methods = [HTTP_METHOD] for .get/.post/...
Defaults & limits
- Retrieval layer rank: 0, highest among code layers.
- Entrypoint mapping is handler-symbol centric:
  - decorator match -> symbol -> handler_symbol_id
  - physical location comes from symbol span
Known issues
- Route parsing is string-based from decorator text, not semantic AST argument parsing.
- No dedicated entrypoint tags beyond entry_type, framework, and raw decorator-derived payload.
- Background jobs and non-decorator entrypoints are not indexed.
Dependency graph / trace current state
Exists or stub?
- C2 exists and is populated.
- It is not a stub.
- It is also not a full-project dependency graph service; it is a set of per-edge documents stored in rag_chunks.
How the graph is built
- static Python AST analysis
- no runtime instrumentation
- no import graph resolver across modules
- no tree-sitter
Edge types in data
- calls
- imports
- inherits
Traversal API
- No traversal API was found in app/modules/rag/* or app/modules/agent/*.
- No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
- Current access path is only retrieval over indexed edge documents.
Entrypoints current state
Implemented extraction
- HTTP routes:
  - FastAPI
  - Flask
- CLI:
  - Typer
  - Click
Mapping model
- entrypoint -> handler_symbol_id -> symbol span/path
- The entrypoint record itself stores:
  - framework
  - entry type
  - raw route/command string
  - handler symbol id
Tags/types
- entry_type is the main normalized tag.
- Observed values: http, cli.
- framework is the second discriminator.
- There are no richer endpoint taxonomies such as job, worker, webhook, scheduler.
Defaults and operational limits
- Query mode default: docs
- Code mode is enabled by keyword heuristics in RagQueryRouter
- Retrieval hard limit: 8
- Fallback limit: 8
- Query term extraction limit: 6
- Ranked source bundle for project QA:
  - top 12 RAG items
  - top 10 file candidates
- No exposed namespace, path_prefixes, top_k, max_chars, max_chunks, max_depth in the public/internal retrieval endpoint
ASSUMPTION: the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though RagQueryRepository.retrieve(...) has an internal path_prefixes parameter.
Known cross-cutting issues
- Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
- Response payload drops explicit line spans even though spans are stored.
- Vector retrieval is coupled to a single provider-specific embedder.
- Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends по коду (Russian for "about the code").
- There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.
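The implicit mode selection and the code-to-docs fallback can be summarized in a few lines. This is a sketch only: the hint list is hypothetical apart from по коду, the search callable is a stand-in for the repository query, and RagQueryRouter's real heuristics are not reproduced here.

```python
from typing import Callable

# Hypothetical hint list; only "по коду" is mentioned in this inventory.
CODE_HINTS = ("по коду", "traceback", "endpoint", "def ", "class ")


def choose_mode(query: str) -> str:
    # Docs is the default; code mode needs an explicit hint in the query text.
    lowered = query.lower()
    return "code" if any(hint in lowered for hint in CODE_HINTS) else "docs"


def retrieve_with_fallback(
    query: str,
    search: Callable[[str, str], list[dict]],
) -> list[dict]:
    mode = choose_mode(query)
    items = search(mode, query)
    if mode == "code" and not items:
        # Mirrors the documented fallback from code layers to docs layers.
        items = search("docs", query)
    return items
```

Because the caller cannot pass a mode or layer list, the only way to influence selection through the endpoint is the wording of the query itself.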
Where to plug ExplainPack pipeline
Option 1: replace or extend project_qa/context_analysis
- Code location:
  - app/modules/agent/engine/graphs/project_qa_step_graphs.py
- Why:
  - retrieval is already complete at this step
  - input bundle already contains ranked rag_items and file_candidates
  - output is already a structured analysis_brief
- Risk:
  - low
  - minimal invasion if ExplainPack consumes source_bundle and emits the same analysis_brief shape
Option 2: insert a new orchestrator step between context_retrieval and context_analysis
- Code location:
  - app/modules/agent/engine/orchestrator/template_registry.py
  - app/modules/agent/engine/orchestrator/step_registry.py
- Why:
  - preserves current retrieval behavior
  - makes ExplainPack an explicit pipeline stage with its own artifact
  - cleanest for observability and future A/B migration
- Risk:
  - low to medium
  - requires one new artifact contract and one extra orchestration step, but no change to retrieval storage
Option 3: introduce ExplainPack inside ExplainActions.extract_logic
- Code location:
  - app/modules/agent/engine/orchestrator/actions/explain_actions.py
- Why:
  - useful if ExplainPack is meant only for explain-style scenarios
  - keeps general project QA untouched
- Risk:
  - medium
  - narrower integration point; may create duplicate reasoning logic separate from project QA analysis path
Bottom line
- C0-C3 are implemented and persisted in one physical store: rag_chunks.
- Retrieval is a hybrid SQL ranking over lexical heuristics plus pgvector distance.
- C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
- C3 covers FastAPI/Flask/Typer/Click only.
- The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for context_analysis.