zosimovaa 417b8b6f72 Commit changes

2026-03-05 11:03:17 +03:00

Retrieval Inventory

Scope and method

This document describes the retrieval and indexing pipeline as implemented in code today. The inventory is based primarily on:

app/modules/rag/services/rag_service.py
app/modules/rag/persistence/*.py
app/modules/rag/indexing/code/**/*.py
app/modules/rag/indexing/docs/**/*.py
app/modules/rag_session/module.py
app/modules/agent/engine/graphs/project_qa_step_graphs.py
app/modules/agent/engine/orchestrator/*.py

ASSUMPTION: the intended layer semantics are the ones implied by code and tests, not by future architecture plans. This matters because only C0 through C3 are materially implemented today; C4+ exist only as enum constants.

Current retrieval pipeline

Retrieval entrypoint is POST /internal/rag/retrieve in app/modules/rag_session/module.py.
The endpoint calls RagService.retrieve(rag_session_id, query).
RagQueryRouter chooses docs or code mode from the raw query text.
RagService computes a single embedding for the full query via GigaChatEmbedder.
RagQueryRepository.retrieve(...) runs one SQL query against rag_chunks in PostgreSQL with pgvector.
Rows are ordered by these sort keys, in priority order:
- lexical rank
- test-file penalty
- layer rank
- vector distance (embedding <=> query_embedding)
Response items are normalized to {source, content, layer, title, metadata, score}.
If embeddings fail, retrieval falls back to latest chunks from the same layers.
If code retrieval returns nothing, service falls back to docs layers.
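
The ordering above can be sketched in plain Python as a tuple sort key. This is an illustration only, not the SQL that RagQueryRepository runs: the Candidate class, the is_test_path heuristic, the substring-based lexical match, and the rank 3 assigned to C0 are all assumptions made for the example. The `<=>` operator is modelled as cosine distance, which is its meaning in pgvector.

```python
import math
from dataclasses import dataclass

# C3=0, C1=1, C2=2 come from the document; C0=3 is an assumption.
LAYER_RANK = {
    "C3_ENTRYPOINTS": 0,
    "C1_SYMBOL_CATALOG": 1,
    "C2_DEPENDENCY_GRAPH": 2,
    "C0_SOURCE_CHUNKS": 3,
}


@dataclass
class Candidate:  # hypothetical stand-in for a rag_chunks row
    path: str
    layer: str
    text: str
    embedding: list


def cosine_distance(a, b):
    """Cosine distance, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_test_path(path):
    """Hypothetical test-file heuristic; the real rule may differ."""
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def sort_key(cand, terms, query_embedding):
    # Lower tuples sort first, so 0 means "better" for every component.
    lexical_rank = 0 if any(t in cand.text.lower() for t in terms) else 1
    test_penalty = 1 if is_test_path(cand.path) else 0
    layer_rank = LAYER_RANK.get(cand.layer, len(LAYER_RANK))
    distance = cosine_distance(cand.embedding, query_embedding)
    return (lexical_rank, test_penalty, layer_rank, distance)


def rank_candidates(cands, terms, query_embedding, limit=8):
    """Sort by the four keys in priority order and keep the top `limit`."""
    ordered = sorted(cands, key=lambda c: sort_key(c, terms, query_embedding))
    return ordered[:limit]
```

Because vector distance is the last key, it only breaks ties among rows that already agree on lexical rank, test penalty, and layer rank.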

Storage and indices

Primary store: PostgreSQL from DATABASE_URL, configured in app/modules/shared/db.py.
Vector extension: CREATE EXTENSION IF NOT EXISTS vector in app/modules/rag/persistence/schema_repository.py.
Primary table: rag_chunks.
Cache tables:
- rag_blob_cache
- rag_chunk_cache
- rag_session_chunk_map
SQL indexes currently created:
- (rag_session_id)
- (rag_session_id, layer)
- (rag_session_id, layer, path)
- (qname)
- (symbol_id)
- (module_id)
- (doc_kind)
- (entrypoint_type, framework)

ASSUMPTION: there is no explicit ANN index for the vector column in schema code. The code creates general SQL indexes, but no ivfflat/hnsw index is defined here.
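
For reference, a minimal sketch of what an explicit ANN index definition could look like if one were added. pgvector documents the hnsw and ivfflat index methods and the vector_cosine_ops operator class, which pairs with the `<=>` operator used in ranking. The helper name, the index naming scheme, and the column name embedding are assumptions; nothing like this exists in schema_repository.py today.

```python
def ann_index_ddl(table="rag_chunks", column="embedding", method="hnsw"):
    """Build DDL for a pgvector ANN index (hypothetical helper).

    vector_cosine_ops matches the <=> cosine-distance operator.
    """
    if method not in {"hnsw", "ivfflat"}:
        raise ValueError(f"unsupported pgvector index method: {method}")
    index_name = f"idx_{table}_{column}_{method}"  # naming is an assumption
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING {method} ({column} vector_cosine_ops)"
    )
```

pgvector can only build such an index when the vector column is declared with a fixed dimension, so the column definition would need checking before applying the statement.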

Layer: C0_SOURCE_CHUNKS

Implementation

Produced by CodeIndexingPipeline.index_file(...) in app/modules/rag/indexing/code/pipeline.py.
Chunking logic: CodeTextChunker.chunk(...) in app/modules/rag/indexing/code/code_text/chunker.py.
Document builder: CodeTextDocumentBuilder.build(...) in app/modules/rag/indexing/code/code_text/document_builder.py.
Persisted via RagDocumentRepository.insert_documents(...) into rag_chunks.

Input contract

This is an indexing layer, not a direct public retriever. The observed upstream indexing input is a file dict with at least:

required:
- path: str
- content: str
optional:
- commit_sha: str | None
- content_hash: str
- metadata fields copied through by RagService._document_metadata(...)

For retrieval, the layer is queried only indirectly through:

rag_session_id: str
query: str
inferred mode/layers from RagQueryRouter
fixed limit=8
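
The indexing input described above can be written down as a typed dict with a small validator. This is a sketch of the observed contract, not code from the repository: the class names and validate_file_payload are hypothetical, and the pass-through metadata fields are left out because the document does not enumerate them.

```python
from typing import Optional, TypedDict


class _RequiredFileFields(TypedDict):
    path: str
    content: str


class IndexFilePayload(_RequiredFileFields, total=False):
    # Optional keys observed in the indexing input.
    commit_sha: Optional[str]
    content_hash: str


def validate_file_payload(payload: dict) -> list:
    """Return human-readable contract violations (empty list if valid)."""
    errors = []
    for key in ("path", "content"):
        if not isinstance(payload.get(key), str):
            errors.append(f"{key} must be a str")
    commit_sha = payload.get("commit_sha")
    if commit_sha is not None and not isinstance(commit_sha, str):
        errors.append("commit_sha must be a str or None")
    if "content_hash" in payload and not isinstance(payload["content_hash"], str):
        errors.append("content_hash must be a str")
    return errors
```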

Output contract

Stored document shape:

top-level:
- layer = "C0_SOURCE_CHUNKS"
- lang = "python"
- source.repo_id
- source.commit_sha
- source.path
- title
- text
- span.start_line
- span.end_line
- embedding
metadata:
- chunk_index
- chunk_type: symbol_block or window
- module_or_unit
- artifact_type = "CODE"
- plus file-level metadata injected by RagService

Returned retrieval item shape:

source
content
layer
title
metadata
score

No line_start / line_end are returned to the caller directly; they remain in DB columns span_start / span_end and are only used in logs.
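
A minimal sketch of the normalization step shows where the spans are lost. The row keys path, text, and score are assumptions about the query result; only span_start and span_end are column names confirmed above, and normalize_row is a hypothetical name.

```python
RESPONSE_KEYS = ("source", "content", "layer", "title", "metadata", "score")


def normalize_row(row: dict) -> dict:
    """Map a DB row to the response item shape.

    span_start / span_end are not copied, which is why callers never
    see line numbers even though the row carries them.
    """
    return {
        "source": row["path"],    # assumed column name
        "content": row["text"],   # assumed column name
        "layer": row["layer"],
        "title": row["title"],
        "metadata": dict(row.get("metadata") or {}),
        "score": row.get("score"),
    }
```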

Defaults & limits

AST chunking prefers one chunk per top-level class/function/async function.
Fallback window chunking:
- size = 80 lines
- overlap = 15 lines
Global retrieval limit from RagService.retrieve(...): 8
Embedding batch size from env:
- RAG_EMBED_BATCH_SIZE
- default 16
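
The fallback window parameters can be illustrated with a short sketch. It assumes the window advances by size minus overlap lines and that spans are 1-based and inclusive; the actual stepping logic in CodeTextChunker may differ, and window_chunks is a hypothetical name.

```python
def window_chunks(lines, size=80, overlap=15):
    """Split lines into overlapping windows.

    Returns (start_line, end_line, text) tuples with 1-based inclusive spans.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window reached the end of the file
        start += step
    return chunks
```

With the defaults, each window after the first repeats the last 15 lines of its predecessor, so a 200-line file yields three chunks.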

Known issues

Nested methods and functions do not get their own C0 chunks; they appear only as part of the chunk for the enclosing top-level block.
Returned API payload omits line spans even though storage has them.
No direct filter by path, namespace, symbol, or top_k is exposed through the current endpoint.

Layer: C1_SYMBOL_CATALOG

Implementation

Symbol extraction: SymbolExtractor.extract(...) in app/modules/rag/indexing/code/symbols/extractor.py.
AST parsing: PythonAstParser.parse_module(...).
Document builder: SymbolDocumentBuilder.build(...).
Retrieval reads rows from rag_chunks; there is no dedicated symbol table.

Input contract

Indexing input is the same per-file payload as C0.

Observed symbol extraction source:

Python AST only
supported symbol kinds:
- class
- function
- method
- const for top-level imports/import aliases
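
The symbol kinds above can be reproduced with the standard ast module. The sketch below is an approximation of the described behaviour, not SymbolExtractor itself: the function name, the module-prefixed qname format, and the dict layout are assumptions.

```python
import ast


def extract_symbols(source: str, module: str) -> list:
    """Collect class/function/method symbols plus top-level import consts."""
    symbols = []

    def visit(body, parent_qname, in_class, top_level):
        for node in body:
            if isinstance(node, ast.ClassDef):
                qname = f"{parent_qname}.{node.name}"
                payload = {"bases": [ast.unparse(b) for b in node.bases]}
                symbols.append({"kind": "class", "qname": qname, "lang_payload": payload})
                visit(node.body, qname, True, False)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qname = f"{parent_qname}.{node.name}"
                kind = "method" if in_class else "function"
                payload = {"async": isinstance(node, ast.AsyncFunctionDef)}
                symbols.append({"kind": kind, "qname": qname, "lang_payload": payload})
                visit(node.body, qname, False, False)
            elif top_level and isinstance(node, (ast.Import, ast.ImportFrom)):
                # Imports become const symbols only at module top level.
                imported_from = node.module if isinstance(node, ast.ImportFrom) else None
                for alias in node.names:
                    name = alias.asname or alias.name
                    payload = {"imported_from": imported_from, "import_alias": alias.asname}
                    symbols.append(
                        {"kind": "const", "qname": f"{module}.{name}", "lang_payload": payload}
                    )

    visit(ast.parse(source).body, module, False, True)
    return symbols
```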

Retrieval input is still the generic text query endpoint. Query terms are enriched by extract_query_terms(...):

extracts identifier-like tokens from query text
normalizes camelCase/PascalCase to snake_case
adds special intent terms for management/control-related queries
max observed query terms: 6
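
A rough sketch of this kind of term extraction follows. It covers only identifier tokenization, camelCase/PascalCase normalization, de-duplication, and the cap of 6; the intent-term enrichment is omitted. The function names and regular expressions are assumptions, not the contents of extract_query_terms(...).

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(token: str) -> str:
    """Turn camelCase or PascalCase into snake_case."""
    return _CAMEL_BOUNDARY.sub("_", token).lower()


def extract_terms_sketch(query: str, max_terms: int = 6) -> list:
    """Return up to max_terms unique, normalized identifier-like terms."""
    terms = []
    for token in _IDENTIFIER.findall(query):
        term = to_snake_case(token)
        if term in terms:
            continue
        terms.append(term)
        if len(terms) == max_terms:
            break
    return terms
```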

Output contract

Stored document shape:

top-level:
- layer = "C1_SYMBOL_CATALOG"
- title = qname
- text = "<kind> <qname>\n<signature>\n<docstring?>"
- span.start_line
- span.end_line
metadata:
- symbol_id
- qname
- kind
- signature
- decorators_or_annotations
- docstring_or_javadoc
- parent_symbol_id
- package_or_module
- is_entry_candidate
- lang_payload
- artifact_type = "CODE"

Observed lang_payload variants:

class:
- bases
function/method:
- async
import alias:
- imported_from
- import_alias

Defaults & limits

Only Python source files are indexed into C-layers.
Import and import-from declarations are materialized as const symbols only at module top level.
Retrieval ranking gives C1 priority rank 1, after C3 and before C2/C0.

Known issues

No explicit visibility/public-private model.
parent_symbol_id currently stores the parent qname string from the stack, not the parent symbol hash. This is an observed implementation detail.
Cross-file symbol resolution is not implemented; dst_symbol_id in edges resolves only against symbols extracted from the same file.

Layer: C2_DEPENDENCY_GRAPH

Implementation

Edge extraction: EdgeExtractor.extract(...) in app/modules/rag/indexing/code/edges/extractor.py.
Document builder: EdgeDocumentBuilder.build(...).
Built during CodeIndexingPipeline.index_file(...).

Input contract

Indexing input is the same per-file source payload as C0/C1.

Graph construction method:

static analysis only
Python AST walk only
no runtime tracing
no tree-sitter

Observed edge types:

calls
imports
inherits

Output contract

Stored document shape:

top-level:
- layer = "C2_DEPENDENCY_GRAPH"
- title = "<src_qname>:<edge_type>"
- text = "<src_qname> <edge_type> <dst>"
- span.start_line
- span.end_line
- links contains one evidence link of type EDGE
metadata:
- edge_id
- edge_type
- src_symbol_id
- src_qname
- dst_symbol_id
- dst_ref
- resolution: resolved or partial
- lang_payload
- artifact_type = "CODE"

Observed lang_payload usage:

for calls: may include callsite_kind = "function_call"

Defaults & limits

Edge extraction is per-file only.
imports edges are emitted only while visiting a class/function scope; top-level imports do not become C2 edges.
Layer rank in retrieval SQL: 2

Known issues

There is no traversal API, graph repository, or query language over C2. Retrieval only treats edges as text/vector rows in rag_chunks.
Destination resolution is local to the file-level qname map.
Top-level module import relationships are incompletely represented because visit_Import / visit_ImportFrom skip when there is no current scope.
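
The scope-dependent behaviour described above is easy to see in a small ast.NodeVisitor sketch. It is a simplified model of the documented edge types, not EdgeExtractor itself: the class name, the dotted scope naming, and the tuple layout are assumptions.

```python
import ast


class EdgeSketch(ast.NodeVisitor):
    """Emit (src_qname, edge_type, dst) tuples for calls, imports, inherits."""

    def __init__(self):
        self.scope = []
        self.edges = []

    def _emit(self, edge_type, dst):
        if not self.scope:
            return  # no enclosing class/function: the edge is dropped
        self.edges.append((".".join(self.scope), edge_type, dst))

    def visit_ClassDef(self, node):
        self.scope.append(node.name)
        for base in node.bases:
            self._emit("inherits", ast.unparse(base))
        self.generic_visit(node)
        self.scope.pop()

    def visit_FunctionDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Import(self, node):
        for alias in node.names:
            self._emit("imports", alias.name)

    def visit_ImportFrom(self, node):
        for alias in node.names:
            self._emit("imports", f"{node.module or ''}.{alias.name}")

    def visit_Call(self, node):
        self._emit("calls", ast.unparse(node.func))
        self.generic_visit(node)
```

Running the visitor over a module with both a top-level import and a function-local import yields an edge only for the local one, which matches the gap listed in the known issues.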

Layer: C3_ENTRYPOINTS

Implementation

Detection registry: EntrypointDetectorRegistry.detect_all(...).
Detectors:
- FastApiEntrypointDetector
- FlaskEntrypointDetector
- TyperClickEntrypointDetector
Document builder: EntrypointDocumentBuilder.build(...).

Input contract

Indexing input is the same per-file source payload as other C-layers.

Detected entrypoint families today:

HTTP:
- FastAPI decorators such as .get, .post, .put, .patch, .delete, .route
- Flask .route
CLI:
- Typer/Click .command
- Typer/Click .callback
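
Decorator-based detection of these families can be sketched as follows. The example reads the decorator as text and pulls the first string literal out with a regular expression, in the spirit of the string-based route parsing noted under known issues. All names are hypothetical, falling back to the function name for CLI commands is an assumption, and framework identification is left out because the decorator attribute alone cannot distinguish FastAPI from Flask.

```python
import ast
import re

HTTP_ATTRS = {"get", "post", "put", "patch", "delete", "route"}
CLI_ATTRS = {"command", "callback"}
# First quoted string right after an opening parenthesis in the decorator text.
_FIRST_STRING_ARG = re.compile(r"""\(\s*(['"])(.*?)\1""")


def detect_entrypoints(source: str) -> list:
    """Find functions decorated with <obj>.<attr>(...) for known attrs."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            text = ast.unparse(decorator)
            head = text.split("(", 1)[0]
            if "." not in head:
                continue  # bare decorators such as @staticmethod
            attr = head.rsplit(".", 1)[-1]
            if attr in HTTP_ATTRS:
                entry_type = "http"
            elif attr in CLI_ATTRS:
                entry_type = "cli"
            else:
                continue
            match = _FIRST_STRING_ARG.search(text)
            found.append({
                "entry_type": entry_type,
                "decorator_attr": attr,
                # Assumption: use the handler name when no string is given.
                "route_or_command": match.group(2) if match else node.name,
                "handler": node.name,
                "line": node.lineno,
            })
    return sorted(found, key=lambda item: item["line"])
```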

Not detected:

Django routes
Celery tasks
RQ jobs
cron jobs / scheduler entries

Output contract

Stored document shape:

top-level:
- layer = "C3_ENTRYPOINTS"
- title = route_or_command
- text = "<framework> <entry_type> <route_or_command>"
- span.start_line
- span.end_line
- links contains one evidence link of type CODE_SPAN
metadata:
- entry_id
- entry_type: observed http or cli
- framework: observed fastapi, flask, typer, click
- route_or_command
- handler_symbol_id
- lang_payload
- artifact_type = "CODE"

FastAPI-specific observed payload:

lang_payload.methods = [HTTP_METHOD] for .get/.post/...

Defaults & limits

Retrieval layer rank: 0, the highest priority among code layers.
Entrypoint mapping is handler-symbol centric:
- decorator match -> symbol -> handler_symbol_id
- physical location comes from symbol span

Known issues

Route parsing is string-based from decorator text, not semantic AST argument parsing.
No dedicated entrypoint tags beyond entry_type, framework, and raw decorator-derived payload.
Background jobs and non-decorator entrypoints are not indexed.

Dependency graph / trace current state

Exists or stub?

C2 exists and is populated.
It is not a stub.
It is also not a full-project dependency graph service; it is a set of per-edge documents stored in rag_chunks.

How the graph is built

static Python AST analysis
no runtime instrumentation
no import graph resolver across modules
no tree-sitter

Edge types in data

calls
imports
inherits

Traversal API

No traversal API was found in app/modules/rag/* or app/modules/agent/*.
No method accepts graph traversal parameters such as depth, start node, edge filters, or BFS/DFS strategy.
Current access path is only retrieval over indexed edge documents.
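
To make the gap concrete, here is a sketch of the kind of traversal that does not exist today: a breadth-first walk over edge metadata with a start node, a depth limit, and an edge-type filter. It is purely hypothetical. It joins edges by matching dst_ref against src_qname, which would only work where resolution produced comparable names.

```python
from collections import deque


def traverse(edges, start, max_depth=2, edge_types=None):
    """Breadth-first walk over edge dicts; returns (node, depth) pairs.

    Each edge dict is expected to carry src_qname, dst_ref, and edge_type,
    mirroring the C2 metadata fields.
    """
    adjacency = {}
    for edge in edges:
        if edge_types is not None and edge["edge_type"] not in edge_types:
            continue
        adjacency.setdefault(edge["src_qname"], []).append(edge["dst_ref"])

    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for dst in adjacency.get(node, []):
            if dst not in seen:
                seen.add(dst)
                order.append((dst, depth + 1))
                queue.append((dst, depth + 1))
    return order
```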

Entrypoints current state

Implemented extraction

HTTP routes:
- FastAPI
- Flask
CLI:
- Typer
- Click

Mapping model

entrypoint -> handler_symbol_id -> symbol span/path
The entrypoint record itself stores:
- framework
- entry type
- raw route/command string
- handler symbol id

Tags/types

entry_type is the main normalized tag.
Observed values: http, cli.
framework is the second discriminator.
There are no richer endpoint taxonomies such as job, worker, webhook, scheduler.

Defaults and operational limits

Query mode default: docs
Code mode is enabled by keyword heuristics in RagQueryRouter
Retrieval hard limit: 8
Fallback limit: 8
Query term extraction limit: 6
Ranked source bundle for project QA:
- top 12 RAG items
- top 10 file candidates
The public/internal retrieval endpoint exposes no namespace, path_prefixes, top_k, max_chars, max_chunks, or max_depth controls.

ASSUMPTION: the absence of these controls in endpoint and service signatures means they are not part of the current supported contract, even though RagQueryRepository.retrieve(...) has an internal path_prefixes parameter.

Known cross-cutting issues

Retrieval contract is effectively text-only at API level; structured retrieval exists only as internal SQL parameters.
Response payload drops explicit line spans even though spans are stored.
Vector retrieval is coupled to a single provider-specific embedder.
Docs mode is the default, so code retrieval depends on heuristic query phrasing unless the project/qa graph prepends the Russian trigger phrase "по коду" ("in the code").
There is no separate retrieval contract per layer exposed over API; all layer selection is implicit.

Where to plug ExplainPack pipeline

Option 1: replace or extend `project_qa/context_analysis`

Code location:
- app/modules/agent/engine/graphs/project_qa_step_graphs.py
Why:
- retrieval is already complete at this step
- input bundle already contains ranked rag_items and file_candidates
- output is already a structured analysis_brief
Risk:
- low
- minimal invasion if ExplainPack consumes source_bundle and emits the same analysis_brief shape

Option 2: insert a new orchestrator step between `context_retrieval` and `context_analysis`

Code location:
- app/modules/agent/engine/orchestrator/template_registry.py
- app/modules/agent/engine/orchestrator/step_registry.py
Why:
- preserves current retrieval behavior
- makes ExplainPack an explicit pipeline stage with its own artifact
- cleanest for observability and future A/B migration
Risk:
- low to medium
- requires one new artifact contract and one extra orchestration step, but no change to retrieval storage

Option 3: introduce ExplainPack inside `ExplainActions.extract_logic`

Code location:
- app/modules/agent/engine/orchestrator/actions/explain_actions.py
Why:
- useful if ExplainPack is meant only for explain-style scenarios
- keeps general project QA untouched
Risk:
- medium
- narrower integration point; may create duplicate reasoning logic separate from project QA analysis path

Bottom line

C0-C3 are implemented and persisted in one physical store: rag_chunks.
Retrieval is a hybrid SQL ranking over lexical heuristics plus pgvector distance.
C2 exists, but only as retrievable edge documents, not as a traversable graph subsystem.
C3 covers FastAPI/Flask/Typer/Click only.
The least invasive ExplainPack integration point is after retrieval and before answer composition, preferably as a new explicit orchestrator artifact or as a replacement for context_analysis.
