New RAG

2026-03-01 14:21:33 +03:00
parent 2728c07ba9
commit 1ef0b4d68c
95 changed files with 3145 additions and 927 deletions

app/modules/rag/README.md Normal file

@@ -0,0 +1,300 @@
# The rag module
## 1. Module responsibilities
- A single RAG core for indexing and retrieval over the project's documentation and code.
- Two indexing families, `DOCS` and `CODE`, each with its own layers and pipelines.
- Storage for `rag_session` records, index jobs, multi-layer documents, cache layers, and retrieval queries.
- Snapshot and incremental (changes) indexing, with cache reuse keyed by `blob_sha`.
- Context for agent/chat, where `DOCS` is used by default and `CODE` is enabled for explicit implementation questions.
## 2. Class and relationship diagram
```mermaid
classDiagram
class RagService
class RagRepository
class RagSchemaRepository
class RagDocumentUpserter
class DocsIndexingPipeline
class CodeIndexingPipeline
class RagQueryRouter
class GigaChatEmbedder
RagService --> RagRepository
RagService --> DocsIndexingPipeline
RagService --> CodeIndexingPipeline
RagService --> RagQueryRouter
RagService --> GigaChatEmbedder
RagRepository --> RagSchemaRepository
RagService --> RagDocumentUpserter
```
## 3. Class descriptions
- `RagService`: the module's main application service.
Methods: `index_snapshot` (indexes the full file set); `index_changes` (applies incremental changes); `retrieve` (returns relevant context from `DOCS` or `CODE`).
- `RagRepository`: facade over the RAG persistence layer.
Methods: `ensure_tables` (creates/updates the schema); `upsert_session/get_session/session_exists` (operations on `rag_session`); `create_job/update_job/get_job` (operations on index jobs); `replace_documents/apply_document_changes` (operations on documents); `get_cached_documents/cache_documents` (cache handling); `retrieve/fallback_chunks` (retrieval).
- `RagSchemaRepository`: manages the RAG database schema.
Methods: `ensure_tables` (creates tables and indexes); `_ensure_columns` (adds new columns); `_ensure_indexes` (maintains the indexes used for retrieval and filtering).
- `RagDocumentUpserter`: batch writer for multi-layer `RagDocument` records.
Methods: `replace` (fully replaces a session's documents); `apply_changes` (applies upserts/deletes for changed paths).
- `DocsIndexingPipeline`: indexing pipeline for documentation.
Methods: `supports` (determines whether a file belongs to docs); `index_file` (builds `D1-D4` layer documents for a single file).
- `CodeIndexingPipeline`: indexing pipeline for Python code.
Methods: `supports` (determines whether a file belongs to code); `index_file` (builds `C0-C3` layer documents for a single file).
- `RagQueryRouter`: selects the retrieval mode and the active layers.
Methods: `resolve_mode` (decides between `docs` and `code`); `layers_for_mode` (returns the layer set for retrieval).
- `GigaChatEmbedder`: adapter for the embeddings model.
Methods: `embed` (returns embeddings for a list of texts).
## 4. API and execution sequence diagrams
### Snapshot indexing through the current `rag_session` facade
Purpose: create or update a `rag_session` and build a multi-layer index over the supplied project files.
```mermaid
sequenceDiagram
participant Router as RagModule.APIRouter
participant Sessions as RagSessionStore
participant Indexing as IndexingOrchestrator
participant Rag as RagService
participant Docs as DocsIndexingPipeline
participant Code as CodeIndexingPipeline
participant Repo as RagRepository
Router->>Sessions: create(project_id)
Sessions-->>Router: rag_session_id
Router->>Indexing: enqueue_snapshot(rag_session_id, files)
Indexing->>Rag: index_snapshot(rag_session_id, files)
loop for each file
Rag->>Docs: supports/index_file
Rag->>Code: supports/index_file
Rag->>Repo: cache_documents(...)
end
Rag->>Repo: replace_documents(...)
Indexing-->>Router: index_job_id,status
```
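The per-file loop in the diagram above amounts to dispatching each file to every pipeline that claims it. This is an illustrative sketch only: the stub pipelines, file contents, and the flat-dict document shape are invented stand-ins, not the real `DocsIndexingPipeline`/`CodeIndexingPipeline`.

```python
# Illustrative stand-ins for the snapshot-indexing loop; not the real pipelines.

class StubDocsPipeline:
    def supports(self, path: str) -> bool:
        return path.endswith(".md")

    def index_file(self, path: str, content: str) -> list[dict]:
        # The real pipeline would emit D1-D4 layer documents.
        return [{"layer": "D3_SECTION_INDEX", "path": path, "text": content}]


class StubCodePipeline:
    def supports(self, path: str) -> bool:
        return path.endswith(".py")

    def index_file(self, path: str, content: str) -> list[dict]:
        # The real pipeline would emit C0-C3 layer documents.
        return [{"layer": "C0_SOURCE_CHUNKS", "path": path, "text": content}]


def index_snapshot(files: dict[str, str]) -> list[dict]:
    # Each file is offered to every pipeline; pipelines decide via supports().
    pipelines = [StubDocsPipeline(), StubCodePipeline()]
    docs: list[dict] = []
    for path, content in files.items():
        for pipeline in pipelines:
            if pipeline.supports(path):
                docs.extend(pipeline.index_file(path, content))
    return docs


docs = index_snapshot({"README.md": "# hi", "app.py": "print('x')"})
```

In the real flow the resulting documents are then cached per file (`cache_documents`) and written in one batch (`replace_documents`).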
### Retrieval for agent/chat
Purpose: return relevant context from the appropriate layer family.
```mermaid
sequenceDiagram
participant Agent as GraphAgentRuntime
participant Rag as RagService
participant Router as RagQueryRouter
participant Repo as RagRepository
Agent->>Rag: retrieve(rag_session_id, query)
Rag->>Router: resolve_mode(query)
Router-->>Rag: docs|code + layers
Rag->>Repo: retrieve(query_embedding, query_text, layers)
Repo-->>Rag: ranked items
Rag-->>Agent: items
```
## 5. Layers captured in RAG
### 5.1. DOCS layers
#### `D1_MODULE_CATALOG`
Purpose: a catalog of documentation modules and the graph of links between them.
Key attributes:
- `module_id`
- `type`
- `domain`
- `title`
- `status`
- `version`
- `tags`
- `owners`
- `links`
- `calls_api`
- `called_by`
- `uses_logic`
- `used_by`
- `reads_db`
- `writes_db`
- `integrates_with`
- `emits_events`
- `consumes_events`
- `source_path`
- `summary_text`
#### `D2_FACT_INDEX`
Purpose: atomic `subject-predicate-object` facts with evidence.
Key attributes:
- `fact_id`
- `subject_id`
- `predicate`
- `object`
- `object_ref`
- `source_path`
- `anchor`
- `line_start`
- `line_end`
- `confidence`
- `tags`
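For illustration, a `D2_FACT_INDEX` record with these attributes could be represented as a plain dict. Every value below is invented; the real persisted shape may differ.

```python
# Hypothetical D2_FACT_INDEX fact record; all values are illustrative.
fact = {
    "fact_id": "fact-0001",
    "subject_id": "rag",                        # subject of the claim
    "predicate": "writes_db",                   # relation type
    "object": "rag_documents",                  # target of the relation
    "object_ref": None,                         # optional structured reference
    "source_path": "app/modules/rag/README.md",
    "anchor": "#1",                             # heading anchor for evidence
    "line_start": 16,
    "line_end": 18,
    "confidence": 0.9,
    "tags": ["persistence"],
}
```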
#### `D3_SECTION_INDEX`
Purpose: semantic documentation sections split at headings.
Key attributes:
- `chunk_id`
- `module_id`
- `section_path`
- `section_title`
- `content`
- `source_path`
- `order`
- `tags`
- `domain`
- `type`
- `embedding`
#### `D4_POLICY_INDEX`
Purpose: project-wide rules and conventions.
Key attributes:
- `policy_id`
- `applies_to`
- `rules`
- `default_behaviors`
- `source_path`
### 5.2. CODE layers
#### `C0_SOURCE_CHUNKS`
Purpose: raw code as the source of truth for citation and evidence.
Key attributes:
- `lang`
- `repo_id`
- `commit_sha`
- `path`
- `span`
- `title`
- `text`
- `module_or_unit`
- `chunk_type`
- `symbol_id`
- `hash`
#### `C1_SYMBOL_CATALOG`
Purpose: a catalog of code symbols and their declarations.
Key attributes:
- `lang`
- `repo_id`
- `commit_sha`
- `symbol_id`
- `qname`
- `kind`
- `decl.path`
- `decl.start_line`
- `decl.end_line`
- `text`
- `visibility`
- `signature`
- `decorators_or_annotations`
- `docstring_or_javadoc`
- `parent_symbol_id`
- `package_or_module`
- `is_entry_candidate`
- `lang_payload`
#### `C2_DEPENDENCY_GRAPH`
Purpose: relationships between code entities.
Key attributes:
- `lang`
- `repo_id`
- `commit_sha`
- `edge_id`
- `edge_type`
- `src_symbol_id`
- `dst_symbol_id`
- `dst_ref`
- `evidence.path`
- `evidence.start_line`
- `evidence.end_line`
- `text`
- `resolution`
- `callsite_kind`
- `lang_payload`
#### `C3_ENTRYPOINTS`
Purpose: application entry points and their handlers.
Key attributes:
- `lang`
- `repo_id`
- `commit_sha`
- `entry_id`
- `entry_type`
- `framework`
- `route_or_command`
- `handler_symbol_id`
- `evidence.path`
- `evidence.start_line`
- `evidence.end_line`
- `text`
- `http.methods`
- `http.auth`
- `request_model`
- `response_model`
- `cli.args_schema`
- `task.queue`
- `task.cron`
- `tags`
- `lang_payload`
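As a hedged example, a `C3_ENTRYPOINTS` record for an HTTP route could carry values like the following (every value is invented for illustration):

```python
# Hypothetical C3_ENTRYPOINTS record for an HTTP route; values are illustrative.
entrypoint = {
    "lang": "python",
    "repo_id": "example-repo",
    "commit_sha": "deadbeef",
    "entry_id": "entry-0001",
    "entry_type": "http",
    "framework": "fastapi",
    "route_or_command": "/api/v1/items",
    "handler_symbol_id": "sym-0042",
    "evidence": {"path": "app/api/items.py", "start_line": 10, "end_line": 25},
    "text": "fastapi http /api/v1/items",
    "http": {"methods": ["GET"], "auth": "bearer"},
    "tags": ["api"],
    "lang_payload": {},
}
```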
#### `C4_PUBLIC_API`
Purpose: the public surface of the API / exported symbols.
Key attributes:
- `api_id`
- `symbol_id`
- `stability`
- `source_of_truth`
- `versioning_tags`
- `lang_payload`
#### `C5_BEHAVIOR_SUMMARIES`
Purpose: behavioral summaries with mandatory evidence links.
Key attributes:
- `target_type`
- `target_id`
- `text`
- `claims`
- `evidence_links`
- `confidence`
- `generated_by`
- `generated_at`
#### `C6_RUNTIME_TRACES`
Purpose: a runtime/trace layer linking code to real execution.
Key attributes:
- `env`
- `trace_id`
- `span_id`
- `symbol_id`
- `entry_id`
- `text`
- `timings`
- `service`
- `host`
- `labels`
## 6. Retrieval rules
- Retrieval defaults to `DOCS`.
- `CODE` is used only for explicit questions about implementation, code structure, endpoints, handlers, or documentation derived from code.
- `DOCS` layer priority: `D1 -> D2 -> D3 -> D4`.
- `CODE` layer priority: `C3 -> C1 -> C2 -> C0`.
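A minimal sketch of these rules: the default mode and the layer priorities come from this section, while the keyword heuristics for detecting implementation questions are an assumption (the real `RagQueryRouter` may use different signals).

```python
# Sketch of the retrieval rules; marker keywords are assumed, not from the module.
DOCS_LAYERS = ["D1_MODULE_CATALOG", "D2_FACT_INDEX", "D3_SECTION_INDEX", "D4_POLICY_INDEX"]
CODE_LAYERS = ["C3_ENTRYPOINTS", "C1_SYMBOL_CATALOG", "C2_DEPENDENCY_GRAPH", "C0_SOURCE_CHUNKS"]
_CODE_MARKERS = ("endpoint", "handler", "implementation", "source code", "def ", "class ")


def resolve_mode(query: str) -> str:
    # DOCS is the default; CODE only for explicit implementation questions.
    lowered = query.lower()
    return "code" if any(marker in lowered for marker in _CODE_MARKERS) else "docs"


def layers_for_mode(mode: str) -> list[str]:
    # Layer lists are already ordered by retrieval priority.
    return CODE_LAYERS if mode == "code" else DOCS_LAYERS
```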
## 7. Current implementation status
- The first iteration implements `DOCS` layers `D1-D4`.
- The first iteration implements `CODE` layers `C0-C3`.
- `C4-C6` are fixed in the contract and reserved for subsequent stages.
- The current `rag_session` and `rag_repo` act as a facade/adapter over the new `rag` package.


@@ -0,0 +1,25 @@
from app.modules.rag.contracts import (
DocKind,
EvidenceLink,
EvidenceType,
RagDocument,
RagLayer,
RagSource,
RagSpan,
RetrievalItem,
RetrievalMode,
RetrievalQuery,
)
__all__ = [
"DocKind",
"EvidenceLink",
"EvidenceType",
"RagDocument",
"RagLayer",
"RagSource",
"RagSpan",
"RetrievalItem",
"RetrievalMode",
"RetrievalQuery",
]

Binary file not shown.


@@ -0,0 +1,17 @@
from app.modules.rag.contracts.documents import RagDocument, RagSource, RagSpan
from app.modules.rag.contracts.enums import DocKind, EvidenceType, RagLayer, RetrievalMode
from app.modules.rag.contracts.evidence import EvidenceLink
from app.modules.rag.contracts.retrieval import RetrievalItem, RetrievalQuery
__all__ = [
"DocKind",
"EvidenceLink",
"EvidenceType",
"RagDocument",
"RagLayer",
"RagSource",
"RagSpan",
"RetrievalItem",
"RetrievalMode",
"RetrievalQuery",
]


@@ -0,0 +1,77 @@
from __future__ import annotations
from dataclasses import asdict, dataclass, field
from hashlib import sha256
from app.modules.rag.contracts.evidence import EvidenceLink
@dataclass(slots=True)
class RagSource:
repo_id: str
commit_sha: str | None
path: str
def to_dict(self) -> dict:
return asdict(self)
@dataclass(slots=True)
class RagSpan:
start_line: int | None = None
end_line: int | None = None
def to_dict(self) -> dict:
return asdict(self)
@dataclass(slots=True)
class RagDocument:
layer: str
source: RagSource
title: str
text: str
metadata: dict = field(default_factory=dict)
links: list[EvidenceLink] = field(default_factory=list)
span: RagSpan | None = None
doc_id: str | None = None
lang: str | None = None
embedding: list[float] | None = None
def ensure_doc_id(self) -> str:
if self.doc_id:
return self.doc_id
span_key = ""
if self.span is not None:
span_key = f":{self.span.start_line}:{self.span.end_line}"
raw = "|".join(
[
self.layer,
self.lang or "",
self.source.repo_id,
self.source.commit_sha or "",
self.source.path,
self.metadata.get("symbol_id", "") or self.metadata.get("module_id", ""),
self.title,
span_key,
]
)
self.doc_id = sha256(raw.encode("utf-8")).hexdigest()
return self.doc_id
def to_record(self) -> dict:
return {
"doc_id": self.ensure_doc_id(),
"layer": self.layer,
"lang": self.lang,
"repo_id": self.source.repo_id,
"commit_sha": self.source.commit_sha,
"path": self.source.path,
"title": self.title,
"text": self.text,
"metadata": dict(self.metadata),
"links": [link.to_dict() for link in self.links],
"span_start": self.span.start_line if self.span else None,
"span_end": self.span.end_line if self.span else None,
"embedding": self.embedding or [],
}


@@ -0,0 +1,35 @@
from __future__ import annotations
class RagLayer:
DOCS_MODULE_CATALOG = "D1_MODULE_CATALOG"
DOCS_FACT_INDEX = "D2_FACT_INDEX"
DOCS_SECTION_INDEX = "D3_SECTION_INDEX"
DOCS_POLICY_INDEX = "D4_POLICY_INDEX"
CODE_SOURCE_CHUNKS = "C0_SOURCE_CHUNKS"
CODE_SYMBOL_CATALOG = "C1_SYMBOL_CATALOG"
CODE_DEPENDENCY_GRAPH = "C2_DEPENDENCY_GRAPH"
CODE_ENTRYPOINTS = "C3_ENTRYPOINTS"
CODE_PUBLIC_API = "C4_PUBLIC_API"
CODE_BEHAVIOR_SUMMARIES = "C5_BEHAVIOR_SUMMARIES"
CODE_RUNTIME_TRACES = "C6_RUNTIME_TRACES"
class RetrievalMode:
DOCS = "docs"
CODE = "code"
class DocKind:
SPEC = "spec"
RUNBOOK = "runbook"
README = "readme"
MISC = "misc"
class EvidenceType:
CODE_SPAN = "code_span"
SYMBOL = "symbol"
EDGE = "edge"
DOC_SECTION = "doc_section"
DOC_FACT = "doc_fact"


@@ -0,0 +1,16 @@
from __future__ import annotations
from dataclasses import asdict, dataclass
@dataclass(slots=True)
class EvidenceLink:
type: str
target_id: str
path: str | None = None
start_line: int | None = None
end_line: int | None = None
note: str | None = None
def to_dict(self) -> dict:
return asdict(self)


@@ -0,0 +1,23 @@
from __future__ import annotations
from dataclasses import dataclass, field
@dataclass(slots=True)
class RetrievalQuery:
text: str
mode: str
limit: int = 5
layers: list[str] = field(default_factory=list)
path_prefixes: list[str] = field(default_factory=list)
doc_kind: str | None = None
@dataclass(slots=True)
class RetrievalItem:
content: str
path: str
layer: str
title: str
score: float | None = None
metadata: dict | None = None


@@ -0,0 +1,57 @@
from __future__ import annotations
import ast
from dataclasses import dataclass
@dataclass(slots=True)
class CodeChunk:
title: str
text: str
start_line: int
end_line: int
chunk_type: str
class CodeTextChunker:
def chunk(self, path: str, text: str) -> list[CodeChunk]:
try:
tree = ast.parse(text)
except SyntaxError:
return self._window_chunks(path, text)
chunks: list[CodeChunk] = []
lines = text.splitlines()
for node in tree.body:
if not isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
continue
start = int(getattr(node, "lineno", 1))
end = int(getattr(node, "end_lineno", start))
body = "\n".join(lines[start - 1 : end]).strip()
if not body:
continue
chunks.append(
CodeChunk(
title=f"{path}:{getattr(node, 'name', 'block')}",
text=body,
start_line=start,
end_line=end,
chunk_type="symbol_block",
)
)
return chunks or self._window_chunks(path, text)
def _window_chunks(self, path: str, text: str) -> list[CodeChunk]:
lines = text.splitlines()
chunks: list[CodeChunk] = []
size = 80
overlap = 15
start = 0
while start < len(lines):
end = min(len(lines), start + size)
body = "\n".join(lines[start:end]).strip()
if body:
chunks.append(CodeChunk(f"{path}:{start + 1}-{end}", body, start + 1, end, "window"))
if end >= len(lines):
break
start = max(0, end - overlap)
return chunks


@@ -0,0 +1,22 @@
from __future__ import annotations
from app.modules.rag.contracts import RagDocument, RagLayer, RagSource, RagSpan
from app.modules.rag.indexing.code.code_text.chunker import CodeChunk
class CodeTextDocumentBuilder:
def build(self, source: RagSource, chunk: CodeChunk, *, chunk_index: int) -> RagDocument:
return RagDocument(
layer=RagLayer.CODE_SOURCE_CHUNKS,
lang="python",
source=source,
title=chunk.title,
text=chunk.text,
span=RagSpan(chunk.start_line, chunk.end_line),
metadata={
"chunk_index": chunk_index,
"chunk_type": chunk.chunk_type,
"module_or_unit": source.path.replace("/", ".").removesuffix(".py"),
"artifact_type": "CODE",
},
)


@@ -0,0 +1,29 @@
from __future__ import annotations
from app.modules.rag.contracts import EvidenceLink, EvidenceType, RagDocument, RagLayer, RagSource, RagSpan
from app.modules.rag.indexing.code.edges.extractor import PyEdge
class EdgeDocumentBuilder:
def build(self, source: RagSource, edge: PyEdge) -> RagDocument:
dst = edge.dst_ref or edge.dst_symbol_id or "unknown"
return RagDocument(
layer=RagLayer.CODE_DEPENDENCY_GRAPH,
lang="python",
source=source,
title=f"{edge.src_qname}:{edge.edge_type}",
text=f"{edge.src_qname} {edge.edge_type} {dst}",
span=RagSpan(edge.start_line, edge.end_line),
metadata={
"edge_id": edge.edge_id,
"edge_type": edge.edge_type,
"src_symbol_id": edge.src_symbol_id,
"src_qname": edge.src_qname,
"dst_symbol_id": edge.dst_symbol_id,
"dst_ref": edge.dst_ref,
"resolution": edge.resolution,
"lang_payload": edge.metadata,
"artifact_type": "CODE",
},
links=[EvidenceLink(type=EvidenceType.EDGE, target_id=edge.edge_id, path=source.path, start_line=edge.start_line, end_line=edge.end_line)],
)


@@ -0,0 +1,114 @@
from __future__ import annotations
import ast
from dataclasses import dataclass, field
from hashlib import sha256
@dataclass(slots=True)
class PyEdge:
edge_id: str
edge_type: str
src_symbol_id: str
src_qname: str
dst_symbol_id: str | None
dst_ref: str | None
path: str
start_line: int
end_line: int
resolution: str = "partial"
metadata: dict = field(default_factory=dict)
class EdgeExtractor:
def extract(self, path: str, ast_tree: ast.AST | None, symbols: list) -> list[PyEdge]:
if ast_tree is None:
return []
qname_map = {symbol.qname: symbol.symbol_id for symbol in symbols}
visitor = _EdgeVisitor(path, qname_map)
visitor.visit(ast_tree)
return visitor.edges
class _EdgeVisitor(ast.NodeVisitor):
def __init__(self, path: str, qname_map: dict[str, str]) -> None:
self._path = path
self._qname_map = qname_map
self._scope: list[str] = []
self.edges: list[PyEdge] = []
def visit_ClassDef(self, node: ast.ClassDef) -> None:
current = self._enter(node.name)
for base in node.bases:
self._add_edge("inherits", current, self._name(base), base)
self.generic_visit(node)
self._scope.pop()
def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
self._visit_function(node)
def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> None:
self._visit_function(node)
def visit_Import(self, node: ast.Import) -> None:
current = self._current_qname()
if not current:
return
for item in node.names:
self._add_edge("imports", current, item.name, node)
def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
current = self._current_qname()
if not current:
return
module = node.module or ""
for item in node.names:
self._add_edge("imports", current, f"{module}.{item.name}".strip("."), node)
def _visit_function(self, node) -> None:
current = self._enter(node.name)
for inner in ast.walk(node):
if isinstance(inner, ast.Call):
self._add_edge("calls", current, self._name(inner.func), inner, {"callsite_kind": "function_call"})
self.generic_visit(node)
self._scope.pop()
def _enter(self, name: str) -> str:
self._scope.append(name)
return self._current_qname() or name
def _current_qname(self) -> str | None:
if not self._scope:
return None
return ".".join(self._scope)
def _add_edge(self, edge_type: str, src_qname: str, dst_ref: str, node, extra: dict | None = None) -> None:
if not dst_ref:
return
src_symbol_id = self._qname_map.get(src_qname, sha256(src_qname.encode("utf-8")).hexdigest())
dst_symbol_id = self._qname_map.get(dst_ref)
edge_id = sha256(f"{self._path}|{src_qname}|{edge_type}|{dst_ref}|{getattr(node, 'lineno', 1)}".encode("utf-8")).hexdigest()
self.edges.append(
PyEdge(
edge_id=edge_id,
edge_type=edge_type,
src_symbol_id=src_symbol_id,
src_qname=src_qname,
dst_symbol_id=dst_symbol_id,
dst_ref=dst_ref,
path=self._path,
start_line=int(getattr(node, "lineno", 1)),
end_line=int(getattr(node, "end_lineno", getattr(node, "lineno", 1))),
resolution="resolved" if dst_symbol_id else "partial",
metadata=extra or {},
)
)
def _name(self, node) -> str:
if isinstance(node, ast.Name):
return node.id
if isinstance(node, ast.Attribute):
return f"{self._name(node.value)}.{node.attr}"
if isinstance(node, ast.Call):
return self._name(node.func)
return ""


@@ -0,0 +1,26 @@
from __future__ import annotations
from app.modules.rag.contracts import EvidenceLink, EvidenceType, RagDocument, RagLayer, RagSource, RagSpan
from app.modules.rag.indexing.code.entrypoints.registry import Entrypoint
class EntrypointDocumentBuilder:
def build(self, source: RagSource, entrypoint: Entrypoint) -> RagDocument:
return RagDocument(
layer=RagLayer.CODE_ENTRYPOINTS,
lang="python",
source=source,
title=entrypoint.route_or_command,
text=f"{entrypoint.framework} {entrypoint.entry_type} {entrypoint.route_or_command}",
span=RagSpan(entrypoint.start_line, entrypoint.end_line),
metadata={
"entry_id": entrypoint.entry_id,
"entry_type": entrypoint.entry_type,
"framework": entrypoint.framework,
"route_or_command": entrypoint.route_or_command,
"handler_symbol_id": entrypoint.handler_symbol_id,
"lang_payload": entrypoint.metadata,
"artifact_type": "CODE",
},
links=[EvidenceLink(type=EvidenceType.CODE_SPAN, target_id=entrypoint.entry_id, path=source.path, start_line=entrypoint.start_line, end_line=entrypoint.end_line)],
)


@@ -0,0 +1,34 @@
from __future__ import annotations
from hashlib import sha256
from app.modules.rag.indexing.code.entrypoints.registry import Entrypoint
class FastApiEntrypointDetector:
_METHODS = {"get", "post", "put", "patch", "delete"}
def detect(self, *, path: str, symbols: list) -> list[Entrypoint]:
items: list[Entrypoint] = []
for symbol in symbols:
decorators = symbol.decorators or []
for decorator in decorators:
name = decorator.lower()
tail = name.split(".")[-1]
if tail not in self._METHODS and ".route" not in name:
continue
route = decorator.split("(")[-1].rstrip(")") if "(" in decorator else decorator
items.append(
Entrypoint(
entry_id=sha256(f"{path}|fastapi|{symbol.symbol_id}|{decorator}".encode("utf-8")).hexdigest(),
entry_type="http",
framework="fastapi",
route_or_command=route,
handler_symbol_id=symbol.symbol_id,
path=path,
start_line=symbol.start_line,
end_line=symbol.end_line,
metadata={"methods": [tail.upper()] if tail in self._METHODS else []},
)
)
return items


@@ -0,0 +1,28 @@
from __future__ import annotations
from hashlib import sha256
from app.modules.rag.indexing.code.entrypoints.registry import Entrypoint
class FlaskEntrypointDetector:
def detect(self, *, path: str, symbols: list) -> list[Entrypoint]:
items: list[Entrypoint] = []
for symbol in symbols:
for decorator in symbol.decorators or []:
lowered = decorator.lower()
if ".route" not in lowered:
continue
items.append(
Entrypoint(
entry_id=sha256(f"{path}|flask|{symbol.symbol_id}|{decorator}".encode("utf-8")).hexdigest(),
entry_type="http",
framework="flask",
route_or_command=decorator,
handler_symbol_id=symbol.symbol_id,
path=path,
start_line=symbol.start_line,
end_line=symbol.end_line,
)
)
return items


@@ -0,0 +1,27 @@
from __future__ import annotations
from dataclasses import dataclass, field
@dataclass(slots=True)
class Entrypoint:
entry_id: str
entry_type: str
framework: str
route_or_command: str
handler_symbol_id: str
path: str
start_line: int
end_line: int
metadata: dict = field(default_factory=dict)
class EntrypointDetectorRegistry:
def __init__(self, detectors: list) -> None:
self._detectors = detectors
def detect_all(self, *, path: str, symbols: list) -> list[Entrypoint]:
items: list[Entrypoint] = []
for detector in self._detectors:
items.extend(detector.detect(path=path, symbols=symbols))
return items


@@ -0,0 +1,29 @@
from __future__ import annotations
from hashlib import sha256
from app.modules.rag.indexing.code.entrypoints.registry import Entrypoint
class TyperClickEntrypointDetector:
def detect(self, *, path: str, symbols: list) -> list[Entrypoint]:
items: list[Entrypoint] = []
for symbol in symbols:
for decorator in symbol.decorators or []:
lowered = decorator.lower()
if ".command" not in lowered and ".callback" not in lowered:
continue
framework = "typer" if "typer" in lowered else "click"
items.append(
Entrypoint(
entry_id=sha256(f"{path}|{framework}|{symbol.symbol_id}|{decorator}".encode("utf-8")).hexdigest(),
entry_type="cli",
framework=framework,
route_or_command=decorator,
handler_symbol_id=symbol.symbol_id,
path=path,
start_line=symbol.start_line,
end_line=symbol.end_line,
)
)
return items


@@ -0,0 +1,13 @@
from __future__ import annotations
from pathlib import PurePosixPath
class PythonFileFilter:
_EXCLUDE_PARTS = {"venv", ".venv", "__pycache__", "node_modules", ".git", "dist", "build"}
def should_index(self, path: str) -> bool:
candidate = PurePosixPath(path)
if candidate.suffix.lower() != ".py":
return False
return not any(part in self._EXCLUDE_PARTS for part in candidate.parts)


@@ -0,0 +1,52 @@
from __future__ import annotations
from app.modules.rag.contracts import RagDocument, RagSource
from app.modules.rag.indexing.code.code_text.chunker import CodeTextChunker
from app.modules.rag.indexing.code.code_text.document_builder import CodeTextDocumentBuilder
from app.modules.rag.indexing.code.edges.document_builder import EdgeDocumentBuilder
from app.modules.rag.indexing.code.edges.extractor import EdgeExtractor
from app.modules.rag.indexing.code.entrypoints.document_builder import EntrypointDocumentBuilder
from app.modules.rag.indexing.code.entrypoints.fastapi_detector import FastApiEntrypointDetector
from app.modules.rag.indexing.code.entrypoints.flask_detector import FlaskEntrypointDetector
from app.modules.rag.indexing.code.entrypoints.registry import EntrypointDetectorRegistry
from app.modules.rag.indexing.code.entrypoints.typer_click_detector import TyperClickEntrypointDetector
from app.modules.rag.indexing.code.file_filter import PythonFileFilter
from app.modules.rag.indexing.code.symbols.ast_parser import PythonAstParser
from app.modules.rag.indexing.code.symbols.document_builder import SymbolDocumentBuilder
from app.modules.rag.indexing.code.symbols.extractor import SymbolExtractor
class CodeIndexingPipeline:
def __init__(self) -> None:
self._filter = PythonFileFilter()
self._chunker = CodeTextChunker()
self._code_builder = CodeTextDocumentBuilder()
self._parser = PythonAstParser()
self._symbols = SymbolExtractor()
self._symbol_builder = SymbolDocumentBuilder()
self._edges = EdgeExtractor()
self._edge_builder = EdgeDocumentBuilder()
self._entrypoints = EntrypointDetectorRegistry(
[FastApiEntrypointDetector(), FlaskEntrypointDetector(), TyperClickEntrypointDetector()]
)
self._entrypoint_builder = EntrypointDocumentBuilder()
def supports(self, path: str) -> bool:
return self._filter.should_index(path)
def index_file(self, *, repo_id: str, commit_sha: str | None, path: str, content: str) -> list[RagDocument]:
source = RagSource(repo_id=repo_id, commit_sha=commit_sha, path=path)
docs: list[RagDocument] = []
code_chunks = self._chunker.chunk(path, content)
for index, chunk in enumerate(code_chunks):
docs.append(self._code_builder.build(source, chunk, chunk_index=index))
tree = self._parser.parse_module(content)
symbols = self._symbols.extract(path, content, tree)
for symbol in symbols:
docs.append(self._symbol_builder.build(source, symbol))
edges = self._edges.extract(path, tree, symbols)
for edge in edges:
docs.append(self._edge_builder.build(source, edge))
for entrypoint in self._entrypoints.detect_all(path=path, symbols=symbols):
docs.append(self._entrypoint_builder.build(source, entrypoint))
return docs


@@ -0,0 +1,11 @@
from __future__ import annotations
import ast
class PythonAstParser:
def parse_module(self, text: str) -> ast.AST | None:
try:
return ast.parse(text)
except SyntaxError:
return None


@@ -0,0 +1,32 @@
from __future__ import annotations
from app.modules.rag.contracts import RagDocument, RagLayer, RagSource, RagSpan
from app.modules.rag.indexing.code.symbols.extractor import PySymbol
class SymbolDocumentBuilder:
def build(self, source: RagSource, symbol: PySymbol) -> RagDocument:
body = [f"{symbol.kind} {symbol.qname}", symbol.signature]
if symbol.docstring:
body.append(symbol.docstring.strip())
return RagDocument(
layer=RagLayer.CODE_SYMBOL_CATALOG,
lang="python",
source=source,
title=symbol.qname,
text="\n".join(part for part in body if part),
span=RagSpan(symbol.start_line, symbol.end_line),
metadata={
"symbol_id": symbol.symbol_id,
"qname": symbol.qname,
"kind": symbol.kind,
"signature": symbol.signature,
"decorators_or_annotations": symbol.decorators,
"docstring_or_javadoc": symbol.docstring,
"parent_symbol_id": symbol.parent_symbol_id,
"package_or_module": source.path.replace("/", ".").removesuffix(".py"),
"is_entry_candidate": bool(symbol.decorators),
"lang_payload": symbol.lang_payload,
"artifact_type": "CODE",
},
)


@@ -0,0 +1,130 @@
from __future__ import annotations
import ast
from dataclasses import dataclass, field
from hashlib import sha256
@dataclass(slots=True)
class PySymbol:
symbol_id: str
qname: str
kind: str
path: str
start_line: int
end_line: int
signature: str
decorators: list[str] = field(default_factory=list)
docstring: str | None = None
parent_symbol_id: str | None = None
lang_payload: dict = field(default_factory=dict)
class SymbolExtractor:
def extract(self, path: str, text: str, ast_tree: ast.AST | None) -> list[PySymbol]:
if ast_tree is None:
return []
collector = _SymbolVisitor(path)
collector.visit(ast_tree)
return collector.symbols
class _SymbolVisitor(ast.NodeVisitor):
def __init__(self, path: str) -> None:
self._path = path
self._stack: list[tuple[str, str]] = []
self.symbols: list[PySymbol] = []
def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
if self._stack:
return
module = node.module or ""
for item in node.names:
local_name = item.asname or item.name
imported_name = f"{module}.{item.name}".strip(".")
self.symbols.append(
PySymbol(
symbol_id=sha256(f"{self._path}|{local_name}|import_alias".encode("utf-8")).hexdigest(),
qname=local_name,
kind="const",
path=self._path,
start_line=int(getattr(node, "lineno", 1)),
end_line=int(getattr(node, "end_lineno", getattr(node, "lineno", 1))),
signature=f"{local_name} = {imported_name}",
lang_payload={"imported_from": imported_name, "import_alias": True},
)
)
self.generic_visit(node)
def visit_Import(self, node: ast.Import) -> None:
if self._stack:
return
for item in node.names:
local_name = item.asname or item.name
self.symbols.append(
PySymbol(
symbol_id=sha256(f"{self._path}|{local_name}|import".encode("utf-8")).hexdigest(),
qname=local_name,
kind="const",
path=self._path,
start_line=int(getattr(node, "lineno", 1)),
end_line=int(getattr(node, "end_lineno", getattr(node, "lineno", 1))),
signature=f"import {item.name}",
lang_payload={"imported_from": item.name, "import_alias": bool(item.asname)},
)
)
self.generic_visit(node)
def visit_ClassDef(self, node: ast.ClassDef) -> None:
self._add_symbol(node, "class", {"bases": [self._expr_name(base) for base in node.bases]})
self.generic_visit(node)
self._stack.pop()
def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
self._add_function(node, is_async=False)
def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> None:
self._add_function(node, is_async=True)
def _add_function(self, node, *, is_async: bool) -> None:
kind = "method" if self._stack and self._stack[-1][0] == "class" else "function"
self._add_symbol(node, kind, {"async": is_async})
self.generic_visit(node)
self._stack.pop()
def _add_symbol(self, node, kind: str, lang_payload: dict) -> None:
# The stack keeps (kind, qname) for enclosing scopes. Build the new qname from
# the innermost parent only: its qname is already fully qualified, so joining
# every stack entry would duplicate prefixes for nested scopes (e.g. "A.A.b.c").
parent = self._stack[-1] if self._stack else None
qname = f"{parent[1]}.{node.name}" if parent else node.name
symbol_id = sha256(f"{self._path}|{qname}|{kind}".encode("utf-8")).hexdigest()
signature = self._signature(node)
symbol = PySymbol(
symbol_id=symbol_id,
qname=qname,
kind=kind,
path=self._path,
start_line=int(getattr(node, "lineno", 1)),
end_line=int(getattr(node, "end_lineno", getattr(node, "lineno", 1))),
signature=signature,
decorators=[self._expr_name(item) for item in getattr(node, "decorator_list", [])],
docstring=ast.get_docstring(node),
# Recompute the parent's symbol_id from its qname and kind so this field
# holds an actual symbol id rather than the parent's qname.
parent_symbol_id=sha256(f"{self._path}|{parent[1]}|{parent[0]}".encode("utf-8")).hexdigest() if parent else None,
lang_payload=lang_payload,
)
self.symbols.append(symbol)
self._stack.append((kind, qname))
def _signature(self, node) -> str:
if isinstance(node, ast.ClassDef):
bases = ", ".join(self._expr_name(base) for base in node.bases)
return f"{node.name}({bases})" if bases else node.name
args = [arg.arg for arg in getattr(node.args, "args", [])]
return f"{node.name}({', '.join(args)})"
def _expr_name(self, node) -> str:
if isinstance(node, ast.Name):
return node.id
if isinstance(node, ast.Attribute):
return f"{self._expr_name(node.value)}.{node.attr}"
if isinstance(node, ast.Call):
return self._expr_name(node.func)
return ast.dump(node, include_attributes=False)


@@ -0,0 +1,15 @@
from __future__ import annotations
from app.modules.rag.contracts import RagDocument
from app.modules.rag.persistence.repository import RagRepository
class RagDocumentUpserter:
def __init__(self, repository: RagRepository) -> None:
self._repository = repository
def replace(self, rag_session_id: str, docs: list[RagDocument]) -> None:
self._repository.replace_documents(rag_session_id, docs)
def apply_changes(self, rag_session_id: str, delete_paths: list[str], docs: list[RagDocument]) -> None:
self._repository.apply_document_changes(rag_session_id, delete_paths, docs)


@@ -0,0 +1,21 @@
from __future__ import annotations
from dataclasses import dataclass, field
@dataclass(slots=True)
class IndexReport:
indexed_files: int = 0
failed_files: int = 0
cache_hit_files: int = 0
cache_miss_files: int = 0
documents: int = 0
warnings: list[str] = field(default_factory=list)
def as_tuple(self) -> tuple[int, int, int, int]:
return (
self.indexed_files,
self.failed_files,
self.cache_hit_files,
self.cache_miss_files,
)

@@ -0,0 +1,77 @@
from __future__ import annotations
from dataclasses import dataclass
from app.modules.rag.indexing.docs.chunkers.text_chunker import DocTextChunker
@dataclass(slots=True)
class SectionChunk:
section_path: str
section_title: str
content: str
order: int
class MarkdownDocChunker:
def __init__(self, text_chunker: DocTextChunker | None = None) -> None:
self._fallback = text_chunker or DocTextChunker()
def chunk(self, text: str) -> list[SectionChunk]:
lines = text.splitlines()
sections: list[SectionChunk] = []
stack: list[tuple[int, str]] = []
current_title = "Document"
current_lines: list[str] = []
order = 0
for line in lines:
heading = self._heading(line)
if heading is None:
current_lines.append(line)
continue
self._flush_section(sections, stack, current_title, current_lines, order)
order += 1
level, title = heading
stack = [item for item in stack if item[0] < level]
stack.append((level, title))
current_title = title
current_lines = []
self._flush_section(sections, stack, current_title, current_lines, order)
if sections:
return sections
chunks = self._fallback.split(text)
return [
SectionChunk(section_path="Document", section_title="Document", content=chunk, order=index)
for index, chunk in enumerate(chunks)
]
def _flush_section(
self,
sections: list[SectionChunk],
stack: list[tuple[int, str]],
current_title: str,
current_lines: list[str],
order: int,
) -> None:
content = "\n".join(current_lines).strip()
if not content:
return
titles = [title for _, title in stack] or [current_title]
sections.append(
SectionChunk(
section_path=" > ".join(titles),
section_title=titles[-1],
content=content,
order=order,
)
)
def _heading(self, line: str) -> tuple[int, str] | None:
stripped = line.strip()
if not stripped.startswith("#"):
return None
level = len(stripped) - len(stripped.lstrip("#"))
title = stripped[level:].strip()
if not title:
return None
return level, title
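
The heading walk above can be sketched standalone. This is an illustrative re-implementation of the stack logic (not the class itself): each heading pops deeper-or-equal levels off the stack, so the joined titles mirror the document outline the way `section_path` does.

```python
# Sketch of MarkdownDocChunker's heading stack: a new heading at level N
# discards all stacked headings at level >= N, then pushes itself, so the
# stack always holds the current outline path from H1 down.
def section_paths(md: str) -> list[str]:
    stack: list[tuple[int, str]] = []
    paths: list[str] = []
    for line in md.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue
        level = len(stripped) - len(stripped.lstrip("#"))
        title = stripped[level:].strip()
        if not title:
            continue
        # keep only strictly shallower ancestors, then push this heading
        stack = [item for item in stack if item[0] < level]
        stack.append((level, title))
        paths.append(" > ".join(t for _, t in stack))
    return paths

print(section_paths("# A\n## B\ntext\n## C\n### D\n# E"))
```

Note how `## C` replaces its sibling `## B` on the stack while keeping `# A` as the shared ancestor.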

@@ -0,0 +1,21 @@
from __future__ import annotations
class DocTextChunker:
def __init__(self, max_chars: int = 4000, overlap_chars: int = 250) -> None:
self._max_chars = max_chars
self._overlap_chars = overlap_chars
def split(self, text: str) -> list[str]:
cleaned = text.strip()
if not cleaned:
return []
chunks: list[str] = []
start = 0
while start < len(cleaned):
end = min(len(cleaned), start + self._max_chars)
chunks.append(cleaned[start:end].strip())
if end >= len(cleaned):
break
start = max(start + 1, end - self._overlap_chars)  # advance at least one char so overlap_chars >= max_chars cannot loop forever
return [chunk for chunk in chunks if chunk]
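
The sliding-window split can be illustrated with a standalone sketch (small window sizes chosen for readability; the real defaults are 4000/250). The overlap means text cut at a chunk boundary reappears at the start of the next chunk, and the `start + 1` floor guarantees forward progress even with a degenerate overlap.

```python
# Standalone sketch of DocTextChunker.split: fixed-size windows with a
# trailing overlap carried into the next chunk.
def split(text: str, max_chars: int = 10, overlap_chars: int = 3) -> list[str]:
    cleaned = text.strip()
    chunks: list[str] = []
    start = 0
    while start < len(cleaned):
        end = min(len(cleaned), start + max_chars)
        chunks.append(cleaned[start:end].strip())
        if end >= len(cleaned):
            break
        # advance at least one char so overlap >= window cannot loop forever
        start = max(start + 1, end - overlap_chars)
    return [chunk for chunk in chunks if chunk]

print(split("abcdefghijklmnopqrst"))
```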

@@ -0,0 +1,18 @@
from __future__ import annotations
from pathlib import PurePosixPath
from app.modules.rag.contracts import DocKind
class DocsClassifier:
def classify(self, path: str) -> str:
upper = PurePosixPath(path).name.upper()
lowered = path.lower()
if "runbook" in lowered or upper.startswith("RUNBOOK"):
return DocKind.RUNBOOK
if upper.startswith("README"):
return DocKind.README
if "spec" in lowered or "architecture" in lowered:
return DocKind.SPEC
return DocKind.MISC

@@ -0,0 +1,115 @@
from __future__ import annotations
from hashlib import sha256
from app.modules.rag.contracts import EvidenceLink, EvidenceType, RagDocument, RagLayer, RagSource
from app.modules.rag.indexing.docs.chunkers.markdown_chunker import SectionChunk
class DocsDocumentBuilder:
def build_module_catalog(self, source: RagSource, frontmatter: dict, summary_text: str, doc_kind: str) -> RagDocument | None:
module_id = str(frontmatter.get("id") or "").strip()
module_type = str(frontmatter.get("type") or "").strip()
domain = str(frontmatter.get("domain") or "").strip()
if not module_id or not module_type or not domain:
return None
links = frontmatter.get("links") or {}
metadata = {
"module_id": module_id,
"type": module_type,
"domain": domain,
"status": frontmatter.get("status"),
"version": frontmatter.get("version"),
"tags": frontmatter.get("tags") or [],
"owners": frontmatter.get("owners") or [],
"links": links,
"source_path": source.path,
"summary_text": summary_text[:4000],
"doc_kind": doc_kind,
}
metadata.update({name: links.get(name, []) for name in (
"calls_api", "called_by", "uses_logic", "used_by", "reads_db", "writes_db",
"integrates_with", "emits_events", "consumes_events",
)})
return RagDocument(
layer=RagLayer.DOCS_MODULE_CATALOG,
source=source,
title=module_id,
text=summary_text[:4000] or module_id,
metadata=metadata,
)
def build_section(self, source: RagSource, chunk: SectionChunk, frontmatter: dict, doc_kind: str) -> RagDocument:
module_id = str(frontmatter.get("id") or source.path)
metadata = {
"module_id": module_id,
"type": frontmatter.get("type"),
"domain": frontmatter.get("domain"),
"tags": frontmatter.get("tags") or [],
"section_path": chunk.section_path,
"section_title": chunk.section_title,
"order": chunk.order,
"doc_kind": doc_kind,
"source_path": source.path,
"artifact_type": "DOCS",
}
return RagDocument(
layer=RagLayer.DOCS_SECTION_INDEX,
source=source,
title=f"{module_id}:{chunk.section_title}",
text=chunk.content,
metadata=metadata,
)
def build_policy(self, source: RagSource, frontmatter: dict, chunk: SectionChunk, doc_kind: str) -> RagDocument | None:
policy_id = str(frontmatter.get("id") or "").strip()
applies_to = frontmatter.get("applies_to") or frontmatter.get("type")
if not policy_id:
return None
metadata = {
"policy_id": policy_id,
"applies_to": applies_to,
"rules": chunk.content[:4000],
"default_behaviors": frontmatter.get("default_behaviors") or [],
"doc_kind": doc_kind,
"section_path": chunk.section_path,
"source_path": source.path,
}
return RagDocument(
layer=RagLayer.DOCS_POLICY_INDEX,
source=source,
title=policy_id,
text=chunk.content[:4000],
metadata=metadata,
)
def build_fact(
self,
source: RagSource,
*,
subject_id: str,
predicate: str,
obj: str,
object_ref: str | None,
anchor: str,
tags: list[str] | None = None,
) -> RagDocument:
fact_id = sha256(f"{subject_id}|{predicate}|{obj}|{source.path}|{anchor}".encode("utf-8")).hexdigest()
metadata = {
"fact_id": fact_id,
"subject_id": subject_id,
"predicate": predicate,
"object": obj,
"object_ref": object_ref,
"anchor": anchor,
"tags": tags or [],
"source_path": source.path,
}
return RagDocument(
layer=RagLayer.DOCS_FACT_INDEX,
source=source,
title=f"{subject_id}:{predicate}",
text=f"{subject_id} {predicate} {obj}".strip(),
metadata=metadata,
links=[EvidenceLink(type=EvidenceType.DOC_FACT, target_id=fact_id, path=source.path, note=anchor)],
)
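
The `fact_id` derivation in `build_fact` is worth a standalone sketch: hashing subject, predicate, object, source path and anchor together makes re-indexing idempotent (identical facts always map to the same id), while any differing component yields a distinct id.

```python
from hashlib import sha256

# Sketch of the deterministic fact id from DocsDocumentBuilder.build_fact.
def fact_id(subject_id: str, predicate: str, obj: str, path: str, anchor: str) -> str:
    return sha256(f"{subject_id}|{predicate}|{obj}|{path}|{anchor}".encode("utf-8")).hexdigest()

a = fact_id("auth", "calls_api", "billing", "docs/auth.md", "frontmatter.links")
b = fact_id("auth", "calls_api", "billing", "docs/auth.md", "frontmatter.links")
print(len(a), a == b)
```

(The module names here are hypothetical inputs for illustration.)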

@@ -0,0 +1,21 @@
from __future__ import annotations
from pathlib import PurePosixPath
class DocsFileFilter:
_EXTENSIONS = {".md", ".rst", ".txt", ".adoc"}
_NAMES = ("README", "CHANGELOG", "CONTRIBUTING", "ARCHITECTURE", "SPEC", "RUNBOOK")
_EXCLUDE_PARTS = {"node_modules", ".git", "vendor", "dist", "build", "target", "__pycache__"}
def should_index(self, path: str) -> bool:
candidate = PurePosixPath(path)
if any(part in self._EXCLUDE_PARTS for part in candidate.parts):
return False
if candidate.suffix.lower() in self._EXTENSIONS:
return True
upper_name = candidate.name.upper()
if any(upper_name.startswith(prefix) for prefix in self._NAMES):
return True
joined = "/".join(candidate.parts).lower()
return any(marker in joined for marker in ("docs/", "doc/", "documentation/"))

@@ -0,0 +1,135 @@
from __future__ import annotations
import re
import yaml
from app.modules.rag.contracts import RagDocument, RagSource
from app.modules.rag.indexing.docs.chunkers.markdown_chunker import MarkdownDocChunker
from app.modules.rag.indexing.docs.classifier import DocsClassifier
from app.modules.rag.indexing.docs.document_builder import DocsDocumentBuilder
from app.modules.rag.indexing.docs.file_filter import DocsFileFilter
class DocsIndexingPipeline:
def __init__(self) -> None:
self._filter = DocsFileFilter()
self._classifier = DocsClassifier()
self._chunker = MarkdownDocChunker()
self._builder = DocsDocumentBuilder()
def supports(self, path: str) -> bool:
return self._filter.should_index(path)
def index_file(self, *, repo_id: str, commit_sha: str | None, path: str, content: str) -> list[RagDocument]:
source = RagSource(repo_id=repo_id, commit_sha=commit_sha, path=path)
frontmatter, body = self._split_frontmatter(content)
doc_kind = self._classifier.classify(path)
sections = self._chunker.chunk(body)
summary_text = self._summary_from_sections(sections)
docs: list[RagDocument] = []
module_doc = self._builder.build_module_catalog(source, frontmatter, summary_text, doc_kind)
if module_doc is not None:
docs.append(module_doc)
for section in sections:
docs.append(self._builder.build_section(source, section, frontmatter, doc_kind))
if str(frontmatter.get("type") or "").strip() == "policy":
for section in sections[:1]:
policy = self._builder.build_policy(source, frontmatter, section, doc_kind)
if policy is not None:
docs.append(policy)
docs.extend(self._extract_facts(source, frontmatter, sections))
return docs
def _split_frontmatter(self, content: str) -> tuple[dict, str]:
if not content.startswith("---\n"):
return {}, content
parts = content.split("---", 2)
if len(parts) < 3:
return {}, content
payload = yaml.safe_load(parts[1]) or {}
return (payload if isinstance(payload, dict) else {}), parts[2].strip()
def _summary_from_sections(self, sections) -> str:
text = "\n\n".join(section.content for section in sections[:2]).strip()
return text[:4000]
def _extract_facts(self, source: RagSource, frontmatter: dict, sections) -> list[RagDocument]:
subject_id = str(frontmatter.get("id") or source.path)
docs: list[RagDocument] = []
links = frontmatter.get("links") or {}
for predicate, values in links.items():
for value in values or []:
docs.append(
self._builder.build_fact(
source,
subject_id=subject_id,
predicate=predicate,
obj=str(value),
object_ref=str(value),
anchor="frontmatter.links",
)
)
for section in sections:
docs.extend(self._facts_from_table(source, subject_id, section.section_title, section.content))
docs.extend(self._facts_from_lists(source, subject_id, section.section_title, section.content))
return docs
def _facts_from_table(self, source: RagSource, subject_id: str, title: str, content: str) -> list[RagDocument]:
lines = [line.strip() for line in content.splitlines() if line.strip()]
if len(lines) < 3 or "|" not in lines[0] or "-" not in lines[1]:
return []
headers = [part.strip().lower() for part in lines[0].strip("|").split("|")]
if not all(headers):
return []
docs: list[RagDocument] = []
for row in lines[2:]:
if "|" not in row:
continue
values = [part.strip() for part in row.strip("|").split("|")]
if len(values) != len(headers):
continue
payload = dict(zip(headers, values))
docs.extend(self._facts_from_row(source, subject_id, title, payload))
return docs
def _facts_from_row(self, source: RagSource, subject_id: str, title: str, payload: dict) -> list[RagDocument]:
docs: list[RagDocument] = []
name = payload.get("field") or payload.get("name") or payload.get("column") or payload.get("code")
if "request" in title.lower() or "response" in title.lower():
if name:
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="has_field", obj=name, object_ref=None, anchor=title))
if payload.get("required"):
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="field_required", obj=f"{name}:{payload['required']}", object_ref=None, anchor=title))
if payload.get("type"):
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="field_type", obj=f"{name}:{payload['type']}", object_ref=None, anchor=title))
if payload.get("validation"):
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="field_validation", obj=f"{name}:{payload['validation']}", object_ref=None, anchor=title))
if "error" in title.lower():
if payload.get("status"):
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="returns_status", obj=payload["status"], object_ref=None, anchor=title))
if payload.get("error") or payload.get("code"):
error_value = payload.get("error") or payload.get("code")
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="returns_error", obj=error_value, object_ref=None, anchor=title))
if payload.get("client action"):
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="client_action", obj=payload["client action"], object_ref=None, anchor=title))
if "constraint" in title.lower() and name:
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate="has_constraint", obj=name, object_ref=None, anchor=title))
return docs
def _facts_from_lists(self, source: RagSource, subject_id: str, title: str, content: str) -> list[RagDocument]:
docs: list[RagDocument] = []
for line in content.splitlines():
item = line.strip()
if not re.match(r"^(?:[-*]|\d+\.)", item):  # any bullet or numbered item, not just 1.-3.
continue
normalized = re.sub(r"^[-*0-9. ]+", "", item).strip()
lowered = normalized.lower()
if lowered.startswith("metric:"):
predicate = "emits_metric"
elif lowered.startswith("event:"):
predicate = "emits_analytics_event"
elif lowered.startswith("log:"):
predicate = "logs_event"
else:
predicate = "validates_rule" if "rule" in title.lower() else "client_action"
docs.append(self._builder.build_fact(source, subject_id=subject_id, predicate=predicate, obj=normalized, object_ref=None, anchor=title))
return docs
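
The markdown-table parse inside `_facts_from_table` follows a simple convention that a standalone sketch makes explicit: the first row supplies lowercased headers, the second row is assumed to be the `|---|` separator, and each later row zips into a dict only when its column count matches.

```python
# Standalone sketch of the table parse in DocsIndexingPipeline._facts_from_table.
def parse_table(content: str) -> list[dict]:
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if len(lines) < 3 or "|" not in lines[0]:
        return []
    headers = [part.strip().lower() for part in lines[0].strip("|").split("|")]
    rows: list[dict] = []
    for row in lines[2:]:  # skip the |---| separator row
        if "|" not in row:
            continue
        values = [part.strip() for part in row.strip("|").split("|")]
        if len(values) == len(headers):
            rows.append(dict(zip(headers, values)))
    return rows

table = "| Field | Type |\n|---|---|\n| user_id | str |"
print(parse_table(table))
```

Rows with a mismatched column count are silently skipped, which is the same tolerance the pipeline shows toward malformed tables.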

@@ -0,0 +1,189 @@
from __future__ import annotations
import json
from sqlalchemy import text
from app.modules.rag.contracts import EvidenceLink, RagDocument, RagSource, RagSpan
from app.modules.rag.contracts.enums import RagLayer
class RagCacheRepository:
def get_cached_documents(self, repo_id: str, blob_sha: str) -> list[RagDocument]:
with self._engine().connect() as conn:
rows = conn.execute(
text(
"""
SELECT layer, lang, path, title, content, metadata_json, links_json, span_start, span_end,
repo_id, commit_sha, embedding::text AS embedding_txt
FROM rag_chunk_cache
WHERE repo_id = :repo_id AND blob_sha = :blob_sha
ORDER BY chunk_index ASC
"""
),
{"repo_id": repo_id, "blob_sha": blob_sha},
).mappings().fetchall()
docs: list[RagDocument] = []
for row in rows:
metadata = self._loads(row.get("metadata_json"))
docs.append(
RagDocument(
layer=str(row["layer"]),
lang=row.get("lang"),
source=RagSource(
repo_id=str(row["repo_id"]),
commit_sha=row.get("commit_sha"),
path=str(row["path"]),
),
title=str(row["title"] or row["path"]),
text=str(row["content"] or ""),
metadata=metadata,
links=[EvidenceLink(**item) for item in self._loads(row.get("links_json"), default=[])],
span=RagSpan(row.get("span_start"), row.get("span_end")),
embedding=self._parse_vector(str(row["embedding_txt"] or "")),
)
)
return docs
def cache_documents(self, repo_id: str, path: str, blob_sha: str, docs: list[RagDocument]) -> None:
if not docs:
return
with self._engine().connect() as conn:
first = docs[0].to_record()
first_meta = first["metadata"]
conn.execute(
text(
"""
INSERT INTO rag_blob_cache (
repo_id, blob_sha, path, artifact_type, section, doc_id, doc_version, owner,
system_component, last_modified, staleness_score, layer, lang, metadata_json
)
VALUES (
:repo_id, :blob_sha, :path, :artifact_type, :section, :doc_id, :doc_version, :owner,
:system_component, :last_modified, :staleness_score, :layer, :lang, :metadata_json
)
ON CONFLICT (repo_id, blob_sha, path) DO UPDATE SET
artifact_type = EXCLUDED.artifact_type,
section = EXCLUDED.section,
doc_id = EXCLUDED.doc_id,
doc_version = EXCLUDED.doc_version,
owner = EXCLUDED.owner,
system_component = EXCLUDED.system_component,
last_modified = EXCLUDED.last_modified,
staleness_score = EXCLUDED.staleness_score,
layer = EXCLUDED.layer,
lang = EXCLUDED.lang,
metadata_json = EXCLUDED.metadata_json,
updated_at = CURRENT_TIMESTAMP
"""
),
{
"repo_id": repo_id,
"blob_sha": blob_sha,
"path": path,
"artifact_type": first_meta.get("artifact_type"),
"section": first_meta.get("section") or first_meta.get("section_title"),
"doc_id": first_meta.get("doc_id"),
"doc_version": first_meta.get("doc_version"),
"owner": first_meta.get("owner"),
"system_component": first_meta.get("system_component"),
"last_modified": first_meta.get("last_modified"),
"staleness_score": first_meta.get("staleness_score"),
"layer": first["layer"],
"lang": first["lang"],
"metadata_json": json.dumps(first_meta, ensure_ascii=True),
},
)
conn.execute(
text("DELETE FROM rag_chunk_cache WHERE repo_id = :repo_id AND blob_sha = :blob_sha"),
{"repo_id": repo_id, "blob_sha": blob_sha},
)
for idx, doc in enumerate(docs):
row = doc.to_record()
metadata = row["metadata"]
emb = row["embedding"] or []
emb_str = "[" + ",".join(str(x) for x in emb) + "]" if emb else None
conn.execute(
text(
"""
INSERT INTO rag_chunk_cache (
repo_id, blob_sha, chunk_index, content, embedding, section, layer, lang, path, title,
metadata_json, links_json, span_start, span_end, commit_sha
)
VALUES (
:repo_id, :blob_sha, :chunk_index, :content, CAST(:embedding AS vector), :section, :layer,
:lang, :path, :title, :metadata_json, :links_json, :span_start, :span_end, :commit_sha
)
"""
),
{
"repo_id": repo_id,
"blob_sha": blob_sha,
"chunk_index": idx,
"content": row["text"],
"embedding": emb_str,
"section": metadata.get("section") or metadata.get("section_title"),
"layer": row["layer"],
"lang": row["lang"],
"path": row["path"],
"title": row["title"],
"metadata_json": json.dumps(metadata, ensure_ascii=True),
"links_json": json.dumps(row["links"], ensure_ascii=True),
"span_start": row["span_start"],
"span_end": row["span_end"],
"commit_sha": row["commit_sha"],
},
)
conn.commit()
def record_repo_cache(
self,
*,
project_id: str,
commit_sha: str | None,
changed_files: list[str],
summary: str,
) -> None:
docs: list[RagDocument] = []
for idx, path in enumerate(changed_files):
docs.append(
RagDocument(
layer=RagLayer.CODE_SOURCE_CHUNKS,
lang="python" if path.endswith(".py") else None,
source=RagSource(project_id, commit_sha, path),
title=path,
text=f"repo_webhook:{path}:{summary[:300]}",
metadata={"chunk_index": idx, "artifact_type": "CODE", "section": "repo_webhook"},
)
)
for doc in docs:
blob_sha = self._blob_sha(commit_sha, doc.source.path)
doc.metadata["blob_sha"] = blob_sha
self.cache_documents(project_id, doc.source.path, blob_sha, [doc])
def _blob_sha(self, commit_sha: str | None, path: str) -> str:
from hashlib import sha256
return sha256(f"{commit_sha or 'no-commit'}:{path}".encode("utf-8")).hexdigest()
def _engine(self):
from app.modules.shared.db import get_engine
return get_engine()
def _loads(self, value, default=None):
if default is None:
default = {}
if not value:
return default
return json.loads(str(value))
def _parse_vector(self, value: str) -> list[float]:
text_value = value.strip()
if not text_value:
return []
if text_value.startswith("[") and text_value.endswith("]"):
text_value = text_value[1:-1]
if not text_value:
return []
return [float(part.strip()) for part in text_value.split(",") if part.strip()]

@@ -0,0 +1,122 @@
from __future__ import annotations
import json
from sqlalchemy import text
from app.modules.rag.contracts import RagDocument
class RagDocumentRepository:
def replace_documents(self, conn, rag_session_id: str, docs: list[RagDocument]) -> None:
conn.execute(text("DELETE FROM rag_chunks WHERE rag_session_id = :sid"), {"sid": rag_session_id})
conn.execute(text("DELETE FROM rag_session_chunk_map WHERE rag_session_id = :sid"), {"sid": rag_session_id})
self.insert_documents(conn, rag_session_id, docs)
def apply_document_changes(
self,
conn,
rag_session_id: str,
delete_paths: list[str],
docs: list[RagDocument],
) -> None:
if delete_paths:
conn.execute(
text("DELETE FROM rag_chunks WHERE rag_session_id = :sid AND path = ANY(:paths)"),
{"sid": rag_session_id, "paths": delete_paths},
)
conn.execute(
text("DELETE FROM rag_session_chunk_map WHERE rag_session_id = :sid AND path = ANY(:paths)"),
{"sid": rag_session_id, "paths": delete_paths},
)
if not docs:
return
paths = sorted({doc.source.path for doc in docs})
conn.execute(
text("DELETE FROM rag_chunks WHERE rag_session_id = :sid AND path = ANY(:paths)"),
{"sid": rag_session_id, "paths": paths},
)
conn.execute(
text("DELETE FROM rag_session_chunk_map WHERE rag_session_id = :sid AND path = ANY(:paths)"),
{"sid": rag_session_id, "paths": paths},
)
self.insert_documents(conn, rag_session_id, docs)
def insert_documents(self, conn, rag_session_id: str, docs: list[RagDocument]) -> None:
for doc in docs:
row = doc.to_record()
metadata = row["metadata"]
links = row["links"]
emb = row["embedding"] or []
emb_str = "[" + ",".join(str(x) for x in emb) + "]" if emb else None
conn.execute(
text(
"""
INSERT INTO rag_chunks (
rag_session_id, path, chunk_index, content, embedding, artifact_type, section, doc_id,
doc_version, owner, system_component, last_modified, staleness_score, created_at, updated_at,
rag_doc_id, layer, lang, repo_id, commit_sha, title, metadata_json, links_json, span_start,
span_end, symbol_id, qname, kind, framework, entrypoint_type, module_id, section_path, doc_kind
)
VALUES (
:sid, :path, :chunk_index, :content, CAST(:emb AS vector), :artifact_type, :section, :doc_id,
:doc_version, :owner, :system_component, :last_modified, :staleness_score, CURRENT_TIMESTAMP,
CURRENT_TIMESTAMP, :rag_doc_id, :layer, :lang, :repo_id, :commit_sha, :title, :metadata_json,
:links_json, :span_start, :span_end, :symbol_id, :qname, :kind, :framework, :entrypoint_type,
:module_id, :section_path, :doc_kind
)
"""
),
{
"sid": rag_session_id,
"path": row["path"],
"chunk_index": int(metadata.get("chunk_index", 0)),
"content": row["text"],
"emb": emb_str,
"artifact_type": metadata.get("artifact_type"),
"section": metadata.get("section") or metadata.get("section_title"),
"doc_id": metadata.get("doc_id"),
"doc_version": metadata.get("doc_version"),
"owner": metadata.get("owner"),
"system_component": metadata.get("system_component"),
"last_modified": metadata.get("last_modified"),
"staleness_score": metadata.get("staleness_score"),
"rag_doc_id": row["doc_id"],
"layer": row["layer"],
"lang": row["lang"],
"repo_id": row["repo_id"],
"commit_sha": row["commit_sha"],
"title": row["title"],
"metadata_json": json.dumps(metadata, ensure_ascii=True),
"links_json": json.dumps(links, ensure_ascii=True),
"span_start": row["span_start"],
"span_end": row["span_end"],
"symbol_id": metadata.get("symbol_id"),
"qname": metadata.get("qname"),
"kind": metadata.get("kind") or metadata.get("type"),
"framework": metadata.get("framework"),
"entrypoint_type": metadata.get("entry_type") or metadata.get("entrypoint_type"),
"module_id": metadata.get("module_id") or metadata.get("policy_id"),
"section_path": metadata.get("section_path"),
"doc_kind": metadata.get("doc_kind"),
},
)
repo_id = str(row["repo_id"] or "").strip()
blob_sha = str(metadata.get("blob_sha") or "").strip()
if repo_id and blob_sha:
conn.execute(
text(
"""
INSERT INTO rag_session_chunk_map (
rag_session_id, repo_id, blob_sha, chunk_index, path
) VALUES (:sid, :repo_id, :blob_sha, :chunk_index, :path)
"""
),
{
"sid": rag_session_id,
"repo_id": repo_id,
"blob_sha": blob_sha,
"chunk_index": int(metadata.get("chunk_index", 0)),
"path": row["path"],
},
)

@@ -0,0 +1,95 @@
from __future__ import annotations
from dataclasses import dataclass
from sqlalchemy import text
from app.modules.shared.db import get_engine
@dataclass
class RagJobRow:
index_job_id: str
rag_session_id: str
status: str
indexed_files: int
failed_files: int
cache_hit_files: int
cache_miss_files: int
error_code: str | None
error_desc: str | None
error_module: str | None
class RagJobRepository:
def create_job(self, index_job_id: str, rag_session_id: str, status: str) -> None:
with get_engine().connect() as conn:
conn.execute(
text(
"""
INSERT INTO rag_index_jobs (index_job_id, rag_session_id, status)
VALUES (:jid, :sid, :status)
"""
),
{"jid": index_job_id, "sid": rag_session_id, "status": status},
)
conn.commit()
def update_job(
self,
index_job_id: str,
*,
status: str,
indexed_files: int,
failed_files: int,
cache_hit_files: int = 0,
cache_miss_files: int = 0,
error_code: str | None = None,
error_desc: str | None = None,
error_module: str | None = None,
) -> None:
with get_engine().connect() as conn:
conn.execute(
text(
"""
UPDATE rag_index_jobs
SET status = :status,
indexed_files = :indexed,
failed_files = :failed,
cache_hit_files = :cache_hit_files,
cache_miss_files = :cache_miss_files,
error_code = :ecode,
error_desc = :edesc,
error_module = :emodule,
updated_at = CURRENT_TIMESTAMP
WHERE index_job_id = :jid
"""
),
{
"jid": index_job_id,
"status": status,
"indexed": indexed_files,
"failed": failed_files,
"cache_hit_files": cache_hit_files,
"cache_miss_files": cache_miss_files,
"ecode": error_code,
"edesc": error_desc,
"emodule": error_module,
},
)
conn.commit()
def get_job(self, index_job_id: str) -> RagJobRow | None:
with get_engine().connect() as conn:
row = conn.execute(
text(
"""
SELECT index_job_id, rag_session_id, status, indexed_files, failed_files,
cache_hit_files, cache_miss_files, error_code, error_desc, error_module
FROM rag_index_jobs
WHERE index_job_id = :jid
"""
),
{"jid": index_job_id},
).mappings().fetchone()
return RagJobRow(**dict(row)) if row else None

@@ -0,0 +1,111 @@
from __future__ import annotations
import json
from sqlalchemy import text
from app.modules.rag.retrieval.query_terms import extract_query_terms
from app.modules.shared.db import get_engine
class RagQueryRepository:
def retrieve(
self,
rag_session_id: str,
query_embedding: list[float],
*,
query_text: str = "",
limit: int = 5,
layers: list[str] | None = None,
path_prefixes: list[str] | None = None,
prefer_non_tests: bool = False,
) -> list[dict]:
emb = "[" + ",".join(str(x) for x in query_embedding) + "]"
filters = ["rag_session_id = :sid"]
params: dict = {"sid": rag_session_id, "emb": emb, "lim": limit}
if layers:
filters.append("layer = ANY(:layers)")
params["layers"] = layers
if path_prefixes:
or_filters = []
for idx, prefix in enumerate(path_prefixes):
key = f"path_{idx}"
params[key] = f"{prefix}%"
or_filters.append(f"path LIKE :{key}")
filters.append("(" + " OR ".join(or_filters) + ")")
term_filters = []
terms = extract_query_terms(query_text)
for idx, term in enumerate(terms):
exact_key = f"term_exact_{idx}"
prefix_key = f"term_prefix_{idx}"
contains_key = f"term_contains_{idx}"
params[exact_key] = term
params[prefix_key] = f"{term}%"
params[contains_key] = f"%{term}%"
term_filters.append(
"CASE "
f"WHEN lower(COALESCE(qname, '')) = :{exact_key} THEN 0 "
f"WHEN lower(COALESCE(symbol_id, '')) = :{exact_key} THEN 1 "
f"WHEN lower(COALESCE(title, '')) = :{exact_key} THEN 2 "
f"WHEN lower(COALESCE(qname, '')) LIKE :{prefix_key} THEN 3 "
f"WHEN lower(COALESCE(title, '')) LIKE :{prefix_key} THEN 4 "
f"WHEN lower(COALESCE(path, '')) LIKE :{contains_key} THEN 5 "
f"WHEN lower(COALESCE(content, '')) LIKE :{contains_key} THEN 6 "
"ELSE 100 END"
)
lexical_sql = "LEAST(" + ", ".join(term_filters) + ")" if term_filters else "100"
test_penalty_sql = (
"CASE "
"WHEN lower(path) LIKE 'tests/%' OR lower(path) LIKE '%/tests/%' OR lower(path) LIKE 'test_%' OR lower(path) LIKE '%/test_%' "
"THEN 1 ELSE 0 END"
if prefer_non_tests
else "0"
)
layer_rank_sql = (
"CASE "
"WHEN layer = 'C3_ENTRYPOINTS' THEN 0 "
"WHEN layer = 'C1_SYMBOL_CATALOG' THEN 1 "
"WHEN layer = 'C2_DEPENDENCY_GRAPH' THEN 2 "
"WHEN layer = 'C0_SOURCE_CHUNKS' THEN 3 "
"WHEN layer = 'D1_MODULE_CATALOG' THEN 0 "
"WHEN layer = 'D2_FACT_INDEX' THEN 1 "
"WHEN layer = 'D3_SECTION_INDEX' THEN 2 "
"WHEN layer = 'D4_POLICY_INDEX' THEN 3 "
"ELSE 10 END"
)
sql = f"""
SELECT path, content, layer, title, metadata_json, span_start, span_end,
{lexical_sql} AS lexical_rank,
{test_penalty_sql} AS test_penalty,
{layer_rank_sql} AS layer_rank,
(embedding <=> CAST(:emb AS vector)) AS distance
FROM rag_chunks
WHERE {' AND '.join(filters)}
ORDER BY lexical_rank ASC, test_penalty ASC, layer_rank ASC, embedding <=> CAST(:emb AS vector)
LIMIT :lim
"""
with get_engine().connect() as conn:
rows = conn.execute(text(sql), params).mappings().fetchall()
return [self._row_to_dict(row) for row in rows]
def fallback_chunks(self, rag_session_id: str, *, limit: int = 5, layers: list[str] | None = None) -> list[dict]:
filters = ["rag_session_id = :sid"]
params: dict = {"sid": rag_session_id, "lim": limit}
if layers:
filters.append("layer = ANY(:layers)")
params["layers"] = layers
sql = f"""
SELECT path, content, layer, title, metadata_json, span_start, span_end
FROM rag_chunks
WHERE {' AND '.join(filters)}
ORDER BY id DESC
LIMIT :lim
"""
with get_engine().connect() as conn:
rows = conn.execute(text(sql), params).mappings().fetchall()
return [self._row_to_dict(row) for row in rows]
def _row_to_dict(self, row) -> dict:
data = dict(row)
data["metadata"] = json.loads(str(data.pop("metadata_json") or "{}"))
return data
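
The tiered SQL `CASE` expression that computes `lexical_rank` is easier to read as a Python sketch: exact matches on `qname`, `symbol_id` and `title` rank before prefix and substring hits, and 100 means "no lexical signal", letting the vector distance decide the order.

```python
# Python sketch of the lexical-rank tiers in RagQueryRepository.retrieve;
# all comparisons are case-insensitive, mirroring lower(...) in the SQL.
def lexical_rank(term: str, *, qname: str = "", symbol_id: str = "",
                 title: str = "", path: str = "", content: str = "") -> int:
    t = term.lower()
    if qname.lower() == t:
        return 0
    if symbol_id.lower() == t:
        return 1
    if title.lower() == t:
        return 2
    if qname.lower().startswith(t):
        return 3
    if title.lower().startswith(t):
        return 4
    if t in path.lower():
        return 5
    if t in content.lower():
        return 6
    return 100  # no lexical match: fall back to embedding distance

print(lexical_rank("ragservice", qname="RagService"))
print(lexical_rank("rag", qname="RagService"))
```

With several query terms, the SQL takes `LEAST(...)` over the per-term ranks, so the best-matching term wins.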

@@ -0,0 +1,82 @@
from __future__ import annotations
from app.modules.rag.contracts import RagDocument
from app.modules.rag.persistence.cache_repository import RagCacheRepository
from app.modules.rag.persistence.document_repository import RagDocumentRepository
from app.modules.rag.persistence.job_repository import RagJobRepository, RagJobRow
from app.modules.rag.persistence.query_repository import RagQueryRepository
from app.modules.rag.persistence.schema_repository import RagSchemaRepository
from app.modules.rag.persistence.session_repository import RagSessionRepository
from app.modules.shared.db import get_engine
class RagRepository:
def __init__(self) -> None:
self._schema = RagSchemaRepository()
self._sessions = RagSessionRepository()
self._jobs = RagJobRepository()
self._documents = RagDocumentRepository()
self._cache = RagCacheRepository()
self._query = RagQueryRepository()
def ensure_tables(self) -> None:
self._schema.ensure_tables()
def upsert_session(self, rag_session_id: str, project_id: str) -> None:
self._sessions.upsert_session(rag_session_id, project_id)
def session_exists(self, rag_session_id: str) -> bool:
return self._sessions.session_exists(rag_session_id)
def get_session(self, rag_session_id: str) -> dict | None:
return self._sessions.get_session(rag_session_id)
def create_job(self, index_job_id: str, rag_session_id: str, status: str) -> None:
self._jobs.create_job(index_job_id, rag_session_id, status)
def update_job(self, index_job_id: str, **kwargs) -> None:
self._jobs.update_job(index_job_id, **kwargs)
def get_job(self, index_job_id: str) -> RagJobRow | None:
return self._jobs.get_job(index_job_id)
def replace_documents(self, rag_session_id: str, docs: list[RagDocument]) -> None:
with get_engine().connect() as conn:
self._documents.replace_documents(conn, rag_session_id, docs)
conn.commit()
def apply_document_changes(self, rag_session_id: str, delete_paths: list[str], docs: list[RagDocument]) -> None:
with get_engine().connect() as conn:
self._documents.apply_document_changes(conn, rag_session_id, delete_paths, docs)
conn.commit()
def get_cached_documents(self, repo_id: str, blob_sha: str) -> list[RagDocument]:
return self._cache.get_cached_documents(repo_id, blob_sha)
def cache_documents(self, repo_id: str, path: str, blob_sha: str, docs: list[RagDocument]) -> None:
self._cache.cache_documents(repo_id, path, blob_sha, docs)
def record_repo_cache(self, **kwargs) -> None:
self._cache.record_repo_cache(**kwargs)
def retrieve(
self,
rag_session_id: str,
query_embedding: list[float],
*,
query_text: str = "",
limit: int = 5,
layers: list[str] | None = None,
prefer_non_tests: bool = False,
) -> list[dict]:
return self._query.retrieve(
rag_session_id,
query_embedding,
query_text=query_text,
limit=limit,
layers=layers,
prefer_non_tests=prefer_non_tests,
)
def fallback_chunks(self, rag_session_id: str, limit: int = 5, layers: list[str] | None = None) -> list[dict]:
return self._query.fallback_chunks(rag_session_id, limit=limit, layers=layers)

@@ -0,0 +1,179 @@
from __future__ import annotations
from sqlalchemy import text
from app.modules.shared.db import get_engine
class RagSchemaRepository:
def ensure_tables(self) -> None:
engine = get_engine()
with engine.connect() as conn:
conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_sessions (
rag_session_id VARCHAR(64) PRIMARY KEY,
project_id VARCHAR(512) NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
)
"""
)
)
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_index_jobs (
index_job_id VARCHAR(64) PRIMARY KEY,
rag_session_id VARCHAR(64) NOT NULL,
status VARCHAR(16) NOT NULL,
indexed_files INTEGER NOT NULL DEFAULT 0,
failed_files INTEGER NOT NULL DEFAULT 0,
cache_hit_files INTEGER NOT NULL DEFAULT 0,
cache_miss_files INTEGER NOT NULL DEFAULT 0,
error_code VARCHAR(128) NULL,
error_desc TEXT NULL,
error_module VARCHAR(64) NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
)
"""
)
)
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_chunks (
id BIGSERIAL PRIMARY KEY,
rag_session_id VARCHAR(64) NOT NULL,
path TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
embedding vector NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
)
"""
)
)
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_blob_cache (
id BIGSERIAL PRIMARY KEY,
repo_id VARCHAR(512) NOT NULL,
blob_sha VARCHAR(128) NOT NULL,
path TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT uq_rag_blob_cache UNIQUE (repo_id, blob_sha, path)
)
"""
)
)
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_chunk_cache (
id BIGSERIAL PRIMARY KEY,
repo_id VARCHAR(512) NOT NULL,
blob_sha VARCHAR(128) NOT NULL,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
embedding vector NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT uq_rag_chunk_cache UNIQUE (repo_id, blob_sha, chunk_index)
)
"""
)
)
conn.execute(
text(
"""
CREATE TABLE IF NOT EXISTS rag_session_chunk_map (
id BIGSERIAL PRIMARY KEY,
rag_session_id VARCHAR(64) NOT NULL,
repo_id VARCHAR(512) NOT NULL,
blob_sha VARCHAR(128) NOT NULL,
chunk_index INTEGER NOT NULL,
path TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
)
"""
)
)
self._ensure_columns(conn)
self._ensure_indexes(conn)
conn.commit()
def _ensure_columns(self, conn) -> None:
for statement in (
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS artifact_type VARCHAR(16) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS section TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS doc_id TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS doc_version TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS owner TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS system_component TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS last_modified TIMESTAMPTZ NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS staleness_score DOUBLE PRECISION NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS rag_doc_id VARCHAR(128) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS layer VARCHAR(64) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS lang VARCHAR(32) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS repo_id VARCHAR(512) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS commit_sha VARCHAR(128) NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS title TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS metadata_json TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS links_json TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS span_start INTEGER NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS span_end INTEGER NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS symbol_id TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS qname TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS kind TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS framework TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS entrypoint_type TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS module_id TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS section_path TEXT NULL",
"ALTER TABLE rag_chunks ADD COLUMN IF NOT EXISTS doc_kind TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS artifact_type VARCHAR(16) NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS section TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS doc_id TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS doc_version TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS owner TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS system_component TEXT NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS last_modified TIMESTAMPTZ NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS staleness_score DOUBLE PRECISION NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS layer VARCHAR(64) NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS lang VARCHAR(32) NULL",
"ALTER TABLE rag_blob_cache ADD COLUMN IF NOT EXISTS metadata_json TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS section TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS layer VARCHAR(64) NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS lang VARCHAR(32) NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS path TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS title TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS metadata_json TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS links_json TEXT NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS span_start INTEGER NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS span_end INTEGER NULL",
"ALTER TABLE rag_chunk_cache ADD COLUMN IF NOT EXISTS commit_sha VARCHAR(128) NULL",
"ALTER TABLE rag_index_jobs ADD COLUMN IF NOT EXISTS cache_hit_files INTEGER NOT NULL DEFAULT 0",
"ALTER TABLE rag_index_jobs ADD COLUMN IF NOT EXISTS cache_miss_files INTEGER NOT NULL DEFAULT 0",
):
conn.execute(text(statement))
def _ensure_indexes(self, conn) -> None:
for statement in (
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_session ON rag_chunks (rag_session_id)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_layer ON rag_chunks (rag_session_id, layer)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_layer_path ON rag_chunks (rag_session_id, layer, path)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_qname ON rag_chunks (qname)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_symbol_id ON rag_chunks (symbol_id)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_module_id ON rag_chunks (module_id)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_doc_kind ON rag_chunks (doc_kind)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunks_entrypoint ON rag_chunks (entrypoint_type, framework)",
"CREATE INDEX IF NOT EXISTS idx_rag_blob_cache_repo_blob ON rag_blob_cache (repo_id, blob_sha)",
"CREATE INDEX IF NOT EXISTS idx_rag_chunk_cache_repo_blob ON rag_chunk_cache (repo_id, blob_sha, chunk_index)",
"CREATE INDEX IF NOT EXISTS idx_rag_session_chunk_map_session ON rag_session_chunk_map (rag_session_id, created_at DESC)",
):
conn.execute(text(statement))
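`ensure_tables` relies on every DDL statement being idempotent (`CREATE ... IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS`), so it can run on every startup. A runnable sketch of the same approach using stdlib `sqlite3` (only for the demo; `vector`, `TIMESTAMPTZ` and `ADD COLUMN IF NOT EXISTS` are Postgres/pgvector features):

```python
import sqlite3

# Idempotent-DDL sketch: every statement is safe to re-run, so schema setup
# can be invoked unconditionally at process start.
DDL = (
    """
    CREATE TABLE IF NOT EXISTS rag_sessions (
        rag_session_id TEXT PRIMARY KEY,
        project_id TEXT NOT NULL
    )
    """,
    "CREATE INDEX IF NOT EXISTS idx_rag_sessions_project ON rag_sessions (project_id)",
)

def ensure_tables(conn: sqlite3.Connection) -> None:
    for statement in DDL:
        conn.execute(statement)
    conn.commit()

conn = sqlite3.connect(":memory:")
ensure_tables(conn)
ensure_tables(conn)  # second run is a no-op thanks to IF NOT EXISTS
tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
assert "rag_sessions" in tables
```

This style trades a real migration tool for simplicity: columns are only ever added, never renamed or dropped, which is exactly what `_ensure_columns` above does.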

from __future__ import annotations
from sqlalchemy import text
from app.modules.shared.db import get_engine
class RagSessionRepository:
def upsert_session(self, rag_session_id: str, project_id: str) -> None:
with get_engine().connect() as conn:
conn.execute(
text(
"""
INSERT INTO rag_sessions (rag_session_id, project_id)
VALUES (:sid, :pid)
ON CONFLICT (rag_session_id) DO UPDATE SET project_id = EXCLUDED.project_id
"""
),
{"sid": rag_session_id, "pid": project_id},
)
conn.commit()
def session_exists(self, rag_session_id: str) -> bool:
with get_engine().connect() as conn:
row = conn.execute(
text("SELECT 1 FROM rag_sessions WHERE rag_session_id = :sid"),
{"sid": rag_session_id},
).fetchone()
return bool(row)
def get_session(self, rag_session_id: str) -> dict | None:
with get_engine().connect() as conn:
row = conn.execute(
text("SELECT rag_session_id, project_id FROM rag_sessions WHERE rag_session_id = :sid"),
{"sid": rag_session_id},
).mappings().fetchone()
return dict(row) if row else None
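The session upsert uses `INSERT ... ON CONFLICT DO UPDATE`. stdlib `sqlite3` (SQLite >= 3.24) supports the same clause, so the semantics can be checked without Postgres; table and column names mirror the repository above:

```python
import sqlite3

# Demonstrates the upsert semantics of RagSessionRepository.upsert_session:
# inserting the same rag_session_id twice updates project_id in place.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rag_sessions (rag_session_id TEXT PRIMARY KEY, project_id TEXT NOT NULL)"
)

def upsert_session(sid: str, pid: str) -> None:
    conn.execute(
        """
        INSERT INTO rag_sessions (rag_session_id, project_id)
        VALUES (?, ?)
        ON CONFLICT (rag_session_id) DO UPDATE SET project_id = excluded.project_id
        """,
        (sid, pid),
    )
    conn.commit()

upsert_session("s1", "proj-a")
upsert_session("s1", "proj-b")  # same session id: row is updated, not duplicated
row = conn.execute(
    "SELECT project_id FROM rag_sessions WHERE rag_session_id = 's1'"
).fetchone()
assert row == ("proj-b",)
```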

from __future__ import annotations
from app.modules.rag.contracts import RagLayer, RetrievalMode
class RagQueryRouter:
_CODE_HINTS = (
"как работает код",
"explain code",
"explain the code",
"по коду",
"из кода",
"построй документацию по коду",
"документацию по коду",
"where is implemented",
"где реализовано",
"endpoint",
"handler",
"symbol",
"function",
"class",
"method",
)
_DOCS_LAYERS = [
RagLayer.DOCS_MODULE_CATALOG,
RagLayer.DOCS_FACT_INDEX,
RagLayer.DOCS_SECTION_INDEX,
RagLayer.DOCS_POLICY_INDEX,
]
_CODE_LAYERS = [
RagLayer.CODE_ENTRYPOINTS,
RagLayer.CODE_SYMBOL_CATALOG,
RagLayer.CODE_DEPENDENCY_GRAPH,
RagLayer.CODE_SOURCE_CHUNKS,
]
def resolve_mode(self, query: str) -> str:
lowered = query.lower()
return RetrievalMode.CODE if any(hint in lowered for hint in self._CODE_HINTS) else RetrievalMode.DOCS
def layers_for_mode(self, mode: str) -> list[str]:
return list(self._CODE_LAYERS if mode == RetrievalMode.CODE else self._DOCS_LAYERS)
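The router defaults to `DOCS` and switches to `CODE` only on an explicit implementation hint. A self-contained sketch of `resolve_mode` (the hint list is a shortened copy of the one above; the plain strings `"docs"`/`"code"` stand in for the `RetrievalMode` constants):

```python
# Substring-based mode routing, as in RagQueryRouter.resolve_mode: DOCS is the
# default, CODE requires an explicit hint in the lowered query.
CODE_HINTS = ("explain code", "где реализовано", "endpoint", "handler", "function", "class")

def resolve_mode(query: str) -> str:
    lowered = query.lower()
    return "code" if any(hint in lowered for hint in CODE_HINTS) else "docs"

assert resolve_mode("Какие модули есть в проекте?") == "docs"
assert resolve_mode("Where is the /health endpoint handler?") == "code"
assert resolve_mode("EXPLAIN CODE of RagService") == "code"  # matching is case-insensitive
```

Plain substring matching is deliberately cheap; the cost is false positives (any query mentioning "class" routes to CODE), which the fallback-to-DOCS path in `RagService.retrieve` partly compensates for.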

from __future__ import annotations
import re
def extract_query_terms(query_text: str) -> list[str]:
raw_terms = re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", query_text or "")
normalized: list[str] = []
for term in raw_terms:
for variant in _identifier_variants(term):
if variant not in normalized:
normalized.append(variant)
for variant in _intent_variants(query_text):
if variant not in normalized:
normalized.append(variant)
return normalized[:6]
def _identifier_variants(term: str) -> list[str]:
lowered = term.lower()
variants = [lowered]
snake = _camel_to_snake(term)
if snake and snake not in variants:
variants.append(snake)
if lowered.endswith("manager") and len(lowered) > len("manager"):
manager_split = lowered[: -len("manager")] + "_manager"
if manager_split not in variants:
variants.append(manager_split)
compact = snake.replace("_", "") if snake else ""
if compact and compact not in variants:
variants.append(compact)
return variants
def _camel_to_snake(term: str) -> str:
first = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", term)
return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", first).lower()
def _intent_variants(query_text: str) -> list[str]:
lowered = (query_text or "").lower()
variants: list[str] = []
if any(token in lowered for token in ("управ", "control", "manage", "management")):
variants.extend(["control", "management", "start", "stop", "status"])
return variants

from __future__ import annotations
import asyncio
import hashlib
import os
from collections.abc import Awaitable, Callable
from inspect import isawaitable
from app.modules.rag.contracts import RagDocument
from app.modules.rag.indexing.code.pipeline import CodeIndexingPipeline
from app.modules.rag.indexing.common.report import IndexReport
from app.modules.rag.indexing.docs.pipeline import DocsIndexingPipeline
from app.modules.rag.persistence.repository import RagRepository
from app.modules.rag.retrieval.query_router import RagQueryRouter
from app.modules.rag_session.embedding.gigachat_embedder import GigaChatEmbedder
class RagService:
def __init__(
self,
embedder: GigaChatEmbedder,
repository: RagRepository,
chunker=None,
) -> None:
self._embedder = embedder
self._repo = repository
self._docs = DocsIndexingPipeline()
self._code = CodeIndexingPipeline()
self._queries = RagQueryRouter()
async def index_snapshot(
self,
rag_session_id: str,
files: list[dict],
progress_cb: Callable[[int, int, str], Awaitable[None] | None] | None = None,
) -> tuple[int, int, int, int]:
report = await self._index_files(rag_session_id, files, progress_cb=progress_cb)
self._repo.replace_documents(rag_session_id, report.documents_list)
return report.as_tuple()
async def index_changes(
self,
rag_session_id: str,
changed_files: list[dict],
progress_cb: Callable[[int, int, str], Awaitable[None] | None] | None = None,
) -> tuple[int, int, int, int]:
delete_paths: list[str] = []
upserts: list[dict] = []
for item in changed_files:
if str(item.get("op")) == "delete":
delete_paths.append(str(item.get("path", "")))
else:
upserts.append(item)
report = await self._index_files(rag_session_id, upserts, progress_cb=progress_cb)
self._repo.apply_document_changes(rag_session_id, delete_paths, report.documents_list)
return report.as_tuple()
async def retrieve(self, rag_session_id: str, query: str) -> list[dict]:
mode = self._queries.resolve_mode(query)
layers = self._queries.layers_for_mode(mode)
prefer_non_tests = mode == "code" and "test" not in query.lower() and "тест" not in query.lower()
try:
query_embedding = self._embedder.embed([query])[0]
rows = self._repo.retrieve(
rag_session_id,
query_embedding,
query_text=query,
limit=8,
layers=layers,
prefer_non_tests=prefer_non_tests,
)
except Exception:
rows = self._repo.fallback_chunks(rag_session_id, limit=8, layers=layers)
if not rows and mode != "docs":
rows = self._repo.fallback_chunks(rag_session_id, limit=8, layers=self._queries.layers_for_mode("docs"))
return [
{
"source": row["path"],
"content": row["content"],
"layer": row.get("layer"),
"title": row.get("title"),
"metadata": row.get("metadata", {}),
"score": row.get("distance"),
}
for row in rows
]
async def _index_files(
self,
rag_session_id: str,
files: list[dict],
progress_cb: Callable[[int, int, str], Awaitable[None] | None] | None = None,
) -> "_PipelineReport":
total_files = len(files)
report = _PipelineReport()
repo_id = self._resolve_repo_id(rag_session_id)
for index, file in enumerate(files, start=1):
path = str(file.get("path", ""))
try:
blob_sha = self._blob_sha(file)
cached = await asyncio.to_thread(self._repo.get_cached_documents, repo_id, blob_sha)
if cached:
report.documents_list.extend(self._with_file_metadata(cached, file, repo_id, blob_sha))
report.cache_hit_files += 1
else:
built = self._build_documents(repo_id, path, file)
embedded = await asyncio.to_thread(self._embed_documents, built, file, repo_id, blob_sha)
report.documents_list.extend(embedded)
await asyncio.to_thread(self._repo.cache_documents, repo_id, path, blob_sha, embedded)
report.cache_miss_files += 1
report.indexed_files += 1
except Exception as exc:
report.failed_files += 1
report.warnings.append(f"{path}: {exc}")
await self._notify_progress(progress_cb, index, total_files, path)
report.documents = len(report.documents_list)
return report
def _build_documents(self, repo_id: str, path: str, file: dict) -> list[RagDocument]:
content = str(file.get("content") or "")
commit_sha = file.get("commit_sha")
docs: list[RagDocument] = []
if self._docs.supports(path):
docs.extend(self._docs.index_file(repo_id=repo_id, commit_sha=commit_sha, path=path, content=content))
if self._code.supports(path):
docs.extend(self._code.index_file(repo_id=repo_id, commit_sha=commit_sha, path=path, content=content))
if not docs:
docs.extend(self._docs.index_file(repo_id=repo_id, commit_sha=commit_sha, path=path, content=content))
return docs
def _embed_documents(self, docs: list[RagDocument], file: dict, repo_id: str, blob_sha: str) -> list[RagDocument]:
if not docs:
return []
batch_size = max(1, int(os.getenv("RAG_EMBED_BATCH_SIZE", "16")))
metadata = self._document_metadata(file, repo_id, blob_sha)
for doc in docs:
doc.metadata.update(metadata)
for start in range(0, len(docs), batch_size):
batch = docs[start : start + batch_size]
vectors = self._embedder.embed([doc.text for doc in batch])
for doc, vector in zip(batch, vectors):
doc.embedding = vector
return docs
def _with_file_metadata(self, docs: list[RagDocument], file: dict, repo_id: str, blob_sha: str) -> list[RagDocument]:
metadata = self._document_metadata(file, repo_id, blob_sha)
for doc in docs:
doc.metadata.update(metadata)
doc.source.repo_id = repo_id
doc.source.path = str(file.get("path", doc.source.path))
return docs
def _document_metadata(self, file: dict, repo_id: str, blob_sha: str) -> dict:
return {
"blob_sha": blob_sha,
"repo_id": repo_id,
"artifact_type": file.get("artifact_type"),
"section": file.get("section"),
"doc_id": file.get("doc_id"),
"doc_version": file.get("doc_version"),
"owner": file.get("owner"),
"system_component": file.get("system_component"),
"last_modified": file.get("last_modified"),
"staleness_score": file.get("staleness_score"),
}
def _resolve_repo_id(self, rag_session_id: str) -> str:
session = self._repo.get_session(rag_session_id)
if not session:
return rag_session_id
return str(session.get("project_id") or rag_session_id)
def _blob_sha(self, file: dict) -> str:
raw = str(file.get("content_hash") or "").strip()
if raw:
return raw
content = str(file.get("content") or "")
return hashlib.sha256(content.encode("utf-8")).hexdigest()
async def _notify_progress(
self,
progress_cb: Callable[[int, int, str], Awaitable[None] | None] | None,
current_file_index: int,
total_files: int,
current_file_name: str,
) -> None:
if not progress_cb:
return
result = progress_cb(current_file_index, total_files, current_file_name)
if isawaitable(result):
await result
class _PipelineReport(IndexReport):
def __init__(self) -> None:
super().__init__()
self.documents_list: list[RagDocument] = []
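The `blob_sha`-keyed cache in `_index_files` is what makes re-indexing cheap: a file whose content hash is already cached skips chunking and embedding entirely. A stripped-down sketch of that flow (the cache and "embedding" are in-memory stand-ins; the real cache is additionally keyed by `repo_id`):

```python
import hashlib

# Content-addressed reuse, as in RagService._index_files: the blob sha of a
# file decides cache hit vs miss.
def blob_sha(file: dict) -> str:
    raw = str(file.get("content_hash") or "").strip()
    if raw:
        return raw  # trust a caller-provided hash when present
    content = str(file.get("content") or "")
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

cache: dict[str, list[str]] = {}
hits = misses = 0

def index_file(file: dict) -> list[str]:
    global hits, misses
    key = blob_sha(file)
    if key in cache:
        hits += 1
        return cache[key]  # reuse embedded documents, skip the embedder
    misses += 1
    docs = [f"embedded:{file['path']}"]  # stand-in for chunk + embed
    cache[key] = docs
    return docs

index_file({"path": "a.py", "content": "print('hi')"})
index_file({"path": "a_copy.py", "content": "print('hi')"})  # same blob -> cache hit
assert (hits, misses) == (1, 1)
```

Note the second file has a different path but identical content, so it hits the cache; in the real service `_with_file_metadata` then rewrites `path`/`repo_id` on the reused documents so per-session metadata stays correct.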