Retrieval Inventory

Scope and method

This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:

  • app/modules/rag/services/rag_service.py
  • app/modules/rag/persistence/*.py
  • app/modules/rag/indexing/code/**/*.py
  • app/modules/rag/indexing/docs/**/*.py
  • app/modules/rag_session/module.py
  • app/modules/agent/engine/graphs/project_qa_step_graphs.py
  • app/modules/agent/engine/orchestrator/*.py

ASSUMPTION: the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only C0 through C3 are materially implemented today; C4+ exist only as enum constants.

Current retrieval pipeline

  1. Retrieval entrypoint is POST /internal/rag/retrieve in app/modules/rag_session/module.py.
  2. The endpoint calls RagService.retrieve(rag_session_id, query).
  3. RagQueryRouter chooses docs or code mode from the raw query text.
  4. RagService computes a single embedding for the full query via GigaChatEmbedder.
  5. RagQueryRepository.retrieve(...) runs one SQL query against rag_chunks in PostgreSQL with pgvector.
  6. Ranking order is:
    • lexical rank
    • test-file penalty
    • layer rank
    • vector distance (embedding <=> query_embedding, the pgvector cosine-distance operator)
  7. Response items are normalized to {source, content, layer, title, metadata, score}.
  8. If embedding the query fails, retrieval falls back to the latest chunks from the same layers.
  9. If code retrieval returns nothing, the service falls back to the docs layers.

Storage and indices

  • Primary store: PostgreSQL from DATABASE_URL, configured in app/modules/shared/db.py.
  • Vector extension: CREATE EXTENSION IF NOT EXISTS vector in app/modules/rag/persistence/schema_repository.py.
  • Primary table: rag_chunks.
  • Cache tables:
    • rag_blob_cache
    • rag_chunk_cache
    • rag_session_chunk_map
  • SQL indexes currently created:
    • (rag_session_id)
    • (rag_session_id, layer)
    • (rag_session_id, layer, path)
    • (qname)
    • (symbol_id)
    • (module_id)
    • (doc_kind)
    • (entrypoint_type, framework)

ASSUMPTION: there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no ivfflat/hnsw index is defined here.

Layer: C0_SOURCE_CHUNKS

Implementation

  • Produced by CodeIndexingPipeline.index_file(...) in app/modules/rag/indexing/code/pipeline.py.
  • Chunking logic: CodeTextChunker.chunk(...) in app/modules/rag/indexing/code/code_text/chunker.py.
  • Document builder: CodeTextDocumentBuilder.build(...) in app/modules/rag/indexing/code/code_text/document_builder.py.
  • Persisted via RagDocumentRepository.insert_documents(...) into rag_chunks.

Input contract

This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:

  • required:
    • path: str
    • content: str
  • optional:
    • commit_sha: str | None
    • content_hash: str
    • metadata fields copied through by RagService._document_metadata(...)

For retrieval, the layer is queried only indirectly through:

  • rag_session_id: str
  • query: str
  • inferred mode/layers from RagQueryRouter
  • fixed limit=8

Output contract

Stored document shape:

  • top-level:
    • layer = "C0_SOURCE_CHUNKS"
    • lang = "python"
    • source.repo_id
    • source.commit_sha
    • source.path
    • title
    • text
    • span.start_line
    • span.end_line
    • embedding
  • metadata:
    • chunk_index
    • chunk_type: symbol_block or window
    • module_or_unit
    • artifact_type = "CODE"
    • plus file-level metadata injected by RagService

Returned retrieval item shape:

  • source
  • content
  • layer
  • title
  • metadata
  • score

Neither line_start nor line_end is returned to the caller; the span stays in the DB columns span_start / span_end and is used only in logs.
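
The effect on the payload can be shown with a small normalization sketch. The row keys (path, text, score) and the function name normalize_row are hypothetical; only the six returned keys and the span_start / span_end column names are taken from this document.

```python
from typing import Any, TypedDict


class RetrievalItem(TypedDict):
    """The six keys this inventory lists for a returned retrieval item."""
    source: str
    content: str
    layer: str
    title: str
    metadata: dict
    score: float


def normalize_row(row: dict) -> RetrievalItem:
    """Map a hypothetical DB row to the response shape; span columns are not copied."""
    return RetrievalItem(
        source=row["path"],
        content=row["text"],
        layer=row["layer"],
        title=row.get("title") or "",
        metadata=dict(row.get("metadata") or {}),
        score=float(row.get("score", 0.0)),
    )
```

A caller that needs line spans would have to read them from storage, since nothing in this shape carries them.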

Defaults & limits

  • AST chunking prefers one chunk per top-level class/function/async function.
  • Fallback window chunking:
    • size = 80 lines
    • overlap = 15 lines
  • Global retrieval limit from RagService.retrieve(...): 8
  • Embedding batch size from env:
    • RAG_EMBED_BATCH_SIZE
    • default 16

Known issues

  • Nested methods/functions are not emitted as separate C0 chunks; they appear only as part of the top-level block that contains them.
  • Returned API payload omits line spans even though storage has them.
  • No direct filter by path, namespace, symbol, or top_k is exposed through the current endpoint.

Layer: C1_SYMBOL_CATALOG

Implementation

  • Symbol extraction: SymbolExtractor.extract(...) in app/modules/rag/indexing/code/symbols/extractor.py.
  • AST parsing: PythonAstParser.parse_module(...).
  • Document builder: SymbolDocumentBuilder.build(...).
  • Retrieval reads rows from rag_chunks; there is no dedicated symbol table.

Input contract

Indexing input is the same per-file payload as C0.

Observed symbol extraction source:

  • Python AST only
  • supported symbol kinds:
    • class
    • function
    • method
    • const for top-level imports/import aliases

Retrieval input is still the generic text query endpoint. Query terms are enriched by extract_query_terms(...):

  • extracts identifier-like tokens from query text
  • normalizes camelCase/PascalCase to snake_case
  • adds special intent terms for management/control-related queries
  • max observed query terms: 6
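
A rough sketch of this kind of term extraction is shown below. It is not the code of extract_query_terms(...): the regular expressions and the function names are assumptions, and the special intent terms are left out because their rules are not described here. Only the identifier extraction, the snake_case normalization, and the cap of 6 follow the list above.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(token: str) -> str:
    """Turn camelCase or PascalCase into snake_case; other tokens are lowercased."""
    return _CAMEL_BOUNDARY.sub("_", token).lower()


def extract_terms(query: str, max_terms: int = 6) -> list:
    """Collect unique identifier-like tokens in query order, capped at max_terms."""
    terms = []
    for token in _IDENT.findall(query):
        term = to_snake_case(token)
        if term not in terms:
            terms.append(term)
        if len(terms) == max_terms:
            break
    return terms
```

For example, a query that mentions RagService and retrieveChunks would contribute the terms rag_service and retrieve_chunks.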

Output contract

Stored document shape:

  • top-level:
    • layer = "C1_SYMBOL_CATALOG"
    • title = qname
    • text = "<kind> <qname>\n<signature>\n<docstring?>"
    • span.start_line
    • span.end_line
  • metadata:
    • symbol_id
    • qname
    • kind
    • signature
    • decorators_or_annotations
    • docstring_or_javadoc
    • parent_symbol_id
    • package_or_module
    • is_entry_candidate
    • lang_payload
    • artifact_type = "CODE"

Observed lang_payload variants:

  • class:
    • bases
  • function/method:
    • async
  • import alias:
    • imported_from
    • import_alias

Defaults & limits

  • Only Python source files are indexed into C-layers.
  • Import and import-from declarations are materialized as const symbols only at module top level.
  • Retrieval ranking gives C1 priority rank 1, after C3 and before C2/C0.

Known issues

  • No explicit visibility/public-private model.
  • parent_symbol_id currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
  • Cross-file symbol resolution is not implemented; dst_symbol_id in edges resolves only against symbols extracted from the same file.

Layer: C2_DEPENDENCY_GRAPH

Implementation

  • Edge extraction: EdgeExtractor.extract(...) in app/modules/rag/indexing/code/edges/extractor.py.
  • Document builder: EdgeDocumentBuilder.build(...).
  • Built during CodeIndexingPipeline.index_file(...).

Input contract

Indexing input is the same per-file source payload as C0/C1.

Graph construction method:

  • static analysis only
  • Python AST walk only
  • no runtime tracing
  • no tree-sitter

Observed edge types:

  • calls
  • imports
  • inherits

Output contract

Stored document shape:

  • top-level:
    • layer = "C2_DEPENDENCY_GRAPH"
    • title = "<src_qname>:<edge_type>"
    • text = "<src_qname> <edge_type> <dst>"
    • span.start_line
    • span.end_line
    • links contains one evidence link of type EDGE
  • metadata:
    • edge_id
    • edge_type
    • src_symbol_id
    • src_qname
    • dst_symbol_id
    • dst_ref
    • resolution: resolved or partial
    • lang_payload
    • artifact_type = "CODE"

Observed lang_payload usage:

  • for calls: may include callsite_kind = "function_call"

Defaults & limits

  • Edge extraction is per-file only.
  • imports edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
  • Layer rank in retrieval SQL: 2

Known issues

  • There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in rag_chunks.
  • Destination resolution is local to the file-level qname map.
  • Top-level module import relationships are incompletely represented because visit_Import / visit_ImportFrom skip when there is no current scope.
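
The scope-dependent behavior described above can be reproduced with a small ast.NodeVisitor. This is an illustrative sketch under stated assumptions, not EdgeExtractor: the class name EdgeSketch, the tuple format, and the handling of module-level calls are hypothetical. What it mirrors from this document is that edges are emitted only while a class or function scope is active, so top-level imports produce no edge.

```python
import ast


class EdgeSketch(ast.NodeVisitor):
    """Collect (src_qname, edge_type, dst_ref) tuples for a single file."""

    def __init__(self, module: str):
        self.module = module
        self.scope = []   # stack of enclosing class/function names
        self.edges = []

    def _emit(self, edge_type, dst):
        if not self.scope:
            return  # no current scope: the edge is skipped entirely
        src = ".".join([self.module, *self.scope])
        self.edges.append((src, edge_type, dst))

    def visit_ClassDef(self, node):
        self.scope.append(node.name)
        for base in node.bases:
            self._emit("inherits", ast.unparse(base))
        self.generic_visit(node)
        self.scope.pop()

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Import(self, node):
        for alias in node.names:
            self._emit("imports", alias.name)

    def visit_ImportFrom(self, node):
        for alias in node.names:
            self._emit("imports", f"{node.module or ''}.{alias.name}".lstrip("."))

    def visit_Call(self, node):
        self._emit("calls", ast.unparse(node.func))
        self.generic_visit(node)
```

Running the visitor over a module whose only imports are at the top of the file yields no imports edges at all, which is the gap noted in the last bullet.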

Layer: C3_ENTRYPOINTS

Implementation

  • Detection registry: EntrypointDetectorRegistry.detect_all(...).
  • Detectors:
    • FastApiEntrypointDetector
    • FlaskEntrypointDetector
    • TyperClickEntrypointDetector
  • Document builder: EntrypointDocumentBuilder.build(...).

Input contract

Indexing input is the same per-file source payload as other C-layers.

Detected entrypoint families today:

  • HTTP:
    • FastAPI decorators such as .get, .post, .put, .patch, .delete, .route
    • Flask .route
  • CLI:
    • Typer/Click .command
    • Typer/Click .callback

Not detected:

  • Django routes
  • Celery tasks
  • RQ jobs
  • cron jobs / scheduler entries

Output contract

Stored document shape:

  • top-level:
    • layer = "C3_ENTRYPOINTS"
    • title = route_or_command
    • text = "<framework> <entry_type> <route_or_command>"
    • span.start_line
    • span.end_line
    • links contains one evidence link of type CODE_SPAN
  • metadata:
    • entry_id
    • entry_type: observed http or cli
    • framework: observed fastapi, flask, typer, click
    • route_or_command
    • handler_symbol_id
    • lang_payload
    • artifact_type = "CODE"

FastAPI-specific observed payload:

  • lang_payload.methods = [HTTP_METHOD] for .get/.post/...

Defaults & limits

  • Retrieval layer rank: 0, the highest priority among code layers.
  • Entrypoint mapping is handler-symbol centric:
    • decorator match -> symbol -> handler_symbol_id
    • physical location comes from symbol span

Known issues

  • Route parsing is string-based from decorator text, not semantic AST argument parsing.
  • No dedicated entrypoint tags beyond entry_type, framework, and raw decorator-derived payload.
  • Background jobs and non-decorator entrypoints are not indexed.
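
To make the first issue concrete, the sketch below detects HTTP entrypoints by matching decorator text with regular expressions. It is an assumption-laden illustration, not the code of FastApiEntrypointDetector or FlaskEntrypointDetector: the patterns, the function name, and the returned dict are hypothetical. It requires Python 3.9+ for ast.unparse.

```python
import ast
import re

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}
# Matches decorator text such as: app.get('/health')
_DECORATOR = re.compile(r"^(?P<obj>[\w.]+)\.(?P<attr>\w+)\((?P<args>.*)\)$", re.S)
_FIRST_STRING = re.compile(r"""["']([^"']*)["']""")


def detect_http_entrypoints(source: str) -> list:
    """Find decorated handlers by inspecting decorator text, not AST arguments."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            match = _DECORATOR.match(ast.unparse(decorator))
            if not match:
                continue
            attr = match["attr"]
            if attr not in HTTP_METHODS and attr != "route":
                continue
            route = _FIRST_STRING.search(match["args"])
            found.append({
                "handler": node.name,
                "route_or_command": route.group(1) if route else "",
                "methods": [attr.upper()] if attr in HTTP_METHODS else [],
                "start_line": node.lineno,
            })
    return found
```

A decorator such as @router.post(PREFIX + "/items") is recorded with the route "/items" in this sketch: the first string literal is taken and the PREFIX part is lost, which is the kind of inaccuracy that string-based parsing allows.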

Dependency graph / trace current state

Exists or stub?

  • C2 exists and is populated.
  • It is not a stub.
  • It is also not a full-project dependency graph service; it is a set of per-edge documents stored in rag_chunks.

How the graph is built

  • static Python AST analysis
  • no runtime instrumentation
  • no import graph resolver across modules
  • no tree-sitter

Edge types in data

  • calls
  • imports
  • inherits

Traversal API

  • No traversal API was found in app/modules/rag/* or app/modules/agent/*.
  • No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
  • Current access path is only retrieval over indexed edge documents.

Entrypoints current state

Implemented extraction

  • HTTP routes:
    • FastAPI
    • Flask
  • CLI:
    • Typer
    • Click

Mapping model

  • entrypoint -> handler_symbol_id -> symbol span/path
  • The entrypoint record itself stores:
    • framework
    • entry type
    • raw route/command string
    • handler symbol id

Tags/types

  • entry_type is the main normalized tag.
  • Observed values: http, cli.
  • framework is the second discriminator.
  • There are no richer entrypoint taxonomies such as job, worker, webhook, or scheduler.

Defaults and operational limits

  • Query mode default: docs
  • Code mode is enabled by keyword heuristics in RagQueryRouter
  • Retrieval hard limit: 8
  • Fallback limit: 8
  • Query term extraction limit: 6
  • Ranked source bundle for project QA:
    • top 12 RAG items
    • top 10 file candidates
  • The public/internal retrieval endpoint exposes no namespace, path_prefixes, top_k, max_chars, max_chunks, or max_depth controls

ASSUMPTION: the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though RagQueryRepository.retrieve(...) has an internal path_prefixes parameter.

Known cross-cutting issues

  • Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
  • Response payload drops explicit line spans even though spans are stored.
  • Vector retrieval is coupled to a single provider-specific embedder.
  • Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends the Russian phrase "по коду" ("in the code") to the query.
  • There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.
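
The implicit mode selection can be pictured with the following sketch. The marker list and both function names are hypothetical; the actual keywords in RagQueryRouter are not listed in this inventory. It is also an assumption that the prefix "по коду" acts as one of the router's code markers, which is inferred from the bullet above rather than confirmed.

```python
# Assumed markers: only "по коду" is mentioned in this document.
CODE_MARKERS = ("по коду", "in the code")


def choose_mode(query: str) -> str:
    """Return "code" when a marker occurs in the query, otherwise the default "docs"."""
    lowered = query.lower()
    return "code" if any(marker in lowered for marker in CODE_MARKERS) else "docs"


def force_code_mode(query: str) -> str:
    """Prefix the query the way the project QA graph is described as doing."""
    return f"по коду {query}"
```

Under these assumptions, the same question routes to docs mode when sent as-is and to code mode once the prefix is added, which is why callers cannot select a layer explicitly.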

Where to plug ExplainPack pipeline

Option 1: replace or extend project_qa/context_analysis

  • Code location:
    • app/modules/agent/engine/graphs/project_qa_step_graphs.py
  • Why:
    • retrieval is already complete at this step
    • input bundle already contains ranked rag_items and file_candidates
    • output is already a structured analysis_brief
  • Risk:
    • low
    • minimally invasive if ExplainPack consumes source_bundle and emits the same analysis_brief shape

Option 2: insert a new orchestrator step between context_retrieval and context_analysis

  • Code location:
    • app/modules/agent/engine/orchestrator/template_registry.py
    • app/modules/agent/engine/orchestrator/step_registry.py
  • Why:
    • preserves current retrieval behavior
    • makes ExplainPack an explicit pipeline stage with its own artifact
    • cleanest for observability and future A/B migration
  • Risk:
    • low to medium
    • requires one new artifact contract and one extra orchestration step, but no change to retrieval storage

Option 3: introduce ExplainPack inside ExplainActions.extract_logic

  • Code location:
    • app/modules/agent/engine/orchestrator/actions/explain_actions.py
  • Why:
    • useful if ExplainPack is meant only for explain-style scenarios
    • keeps general project QA untouched
  • Risk:
    • medium
    • narrower integration point; may create duplicate reasoning logic separate from project QA analysis path

Bottom line

  • C0-C3 are implemented and persisted in one physical store: rag_chunks.
  • Retrieval is a hybrid SQL ranking over lexical heuristics plus pgvector distance.
  • C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
  • C3 covers FastAPI/Flask/Typer/Click only.
  • The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for context_analysis.