# Iteration 2 — Calibration Harness Report
## 1. Executive Summary
This iteration adds calibration and evaluation infrastructure for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or production runtime.
Added:

- A small deterministic fixture repository (`tests/fixtures/code_qa_repo/`) for reproducible tests.
- Golden case format and initial cases for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, negative).
- An evaluation harness that indexes a repo (fixture or user-provided path), runs golden cases through `CodeQAPipelineRunner` with the real retrieval adapter (`RagDbAdapter`), and compares actual vs expected (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
- Diagnostics artifact dumping per run (Markdown + JSON) under `tests/artifacts/code_qa_eval/<run_id>/`.
- A batch evaluation summary (Markdown table + failure list) for manual review.
- Two modes: fixture repo by default; optional `CODE_QA_REPO_PATH` for a local real repository.
Now possible:
- Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
- Run the same harness against a user-provided repo path (no hardcoded external repo).
- Inspect per-case diagnostics and batch summary to tune routing, retrieval, evidence gate, and answer mode.
Still manual / out of scope:
- Tuning prompts and retrieval heuristics (harness supports observation, not automatic tuning).
- UI integration, docs runtime retrieval, production router replacement.
- Exact LLM answer matching (we assert routing, retrieval alignment, evidence sufficiency, answer mode only).
## 2. Fixture Repository
Location: `tests/fixtures/code_qa_repo/`

Structure:

```text
tests/fixtures/code_qa_repo/
├── app/
│   └── main.py                    # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py                  # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py           # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py       # OrderRepository: save, find_by_id
├── domain/
│   └── order.py                   # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py     # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py                 # format_order_id
```
Purpose of each file:

| File | Purpose |
|---|---|
| `app/main.py` | Single clear entrypoint for FIND_ENTRYPOINTS and “open main” style queries. |
| `api/orders.py` | API/handler layer; distinct symbols `create_order`, `get_order`, `create_app`. |
| `services/order_service.py` | Service calling repository; symbol `OrderService`. |
| `repositories/order_repository.py` | Persistence; symbol `OrderRepository`. |
| `domain/order.py` | Domain model; symbol `Order`. |
| `tests/test_order_service.py` | Tests tied to production code for FIND_TESTS. |
| `utils/helpers.py` | Extra module for bounded GENERAL_QA and path/symbol variety. |
Scenarios covered:

- File by path: `app/main.py`, `api/orders.py` (OPEN_FILE).
- Symbol explanation: `Order`, `OrderService`, `create_order` (EXPLAIN).
- Import/call relations: service → repository → domain (EXPLAIN / GENERAL_QA).
- Entrypoint: `app/main.py` (FIND_ENTRYPOINTS).
- Related tests: `tests/test_order_service.py` for OrderService/Order (FIND_TESTS).
- Fallback: “Что делает этот проект?” (“What does this project do?”; GENERAL_QA with bounded context).
The fixture is small and structured so routing and retrieval expectations are unambiguous for calibration.
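As an illustration, the fixture entrypoint could look roughly like the sketch below. This is hypothetical — the actual `app/main.py` may differ; what matters for calibration is that it exposes a single unambiguous entrypoint symbol.

```python
# Hypothetical sketch of the fixture entrypoint (app/main.py); the actual
# fixture file may differ. What matters is one unambiguous entrypoint
# symbol for FIND_ENTRYPOINTS and "open main" style queries.
class App:
    """Minimal stand-in application object."""

    def run(self) -> str:
        # A real fixture might simply pass; returning a marker keeps the
        # sketch demonstrable.
        return "running"


def create_app() -> App:
    """Entrypoint factory targeted by FIND_ENTRYPOINTS routing."""
    return App()


if __name__ == "__main__":
    create_app().run()
```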
## 3. Real Adapter Integration
The canonical pipeline runs with the existing retrieval/index stack:
- Indexing: `RagSessionIndexer` (in `tests/pipeline_intent_rag/helpers/repo_indexer.py`) uses `RagService` and `LocalRepoFileCollector` to index a directory. The fixture (or `CODE_QA_REPO_PATH`) is indexed once per eval run.
- Retrieval: `RagDbAdapter` (in `tests/pipeline_intent_rag/helpers/rag_db_adapter.py`) implements the pipeline’s `RetrievalAdapter` protocol: `retrieve_with_plan`, `retrieve_exact_files`, `hydrate_resolved_symbol_sources`, `force_symbol_context_c0`, `consume_retrieval_report`. It uses `RagRepository` and the same layer logic as the rest of the project.
- Pipeline: `CodeQAPipelineRunner` (in `app/modules/rag/code_qa_pipeline/pipeline.py`) takes `IntentRouterV2` and this adapter, builds a `RetrievalRequest` from the router, runs retrieval, builds an `EvidenceBundle`, runs the evidence gate, and produces diagnostics.
Fixture repo: The harness indexes `tests/fixtures/code_qa_repo` by default and runs all golden cases against that index. No external repo is required.

User-provided repo: Set `CODE_QA_REPO_PATH` to a local directory. The harness indexes that path and runs the same golden cases (or the user can add repo-specific cases). Optional `CODE_QA_PROJECT_ID` sets the project id for the session. The codebase does not depend on any private or external repo being present.
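The two-mode resolution can be sketched as follows. This is a minimal illustration under assumptions — the real `EvalConfig.from_env` in `tests/code_qa_eval/config.py` may differ in field names and defaults (the `"code-qa-eval"` default project id here is invented for the sketch).

```python
# Hypothetical sketch of fixture-vs-env repo resolution; the real
# EvalConfig.from_env in tests/code_qa_eval/config.py may differ.
import os
from dataclasses import dataclass
from pathlib import Path

FIXTURE_REPO = Path("tests/fixtures/code_qa_repo")


@dataclass
class EvalConfig:
    repo_path: Path
    project_id: str

    @classmethod
    def from_env(cls) -> "EvalConfig":
        # CODE_QA_REPO_PATH switches the harness to a user-provided repo;
        # otherwise the deterministic fixture is used.
        repo = os.environ.get("CODE_QA_REPO_PATH")
        return cls(
            repo_path=Path(repo) if repo else FIXTURE_REPO,
            # "code-qa-eval" is an invented default for this sketch.
            project_id=os.environ.get("CODE_QA_PROJECT_ID", "code-qa-eval"),
        )
```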
## 4. Golden Case Format
Location: `tests/golden/code_qa/`

File: `cases.yaml`
Fields per case:

| Field | Meaning |
|---|---|
| `id` | Unique case id. |
| `query` | User query text. |
| `expected_intent` | Expected top-level intent (e.g. CODE_QA). |
| `expected_sub_intent` | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| `expected_answer_mode` | normal \| degraded \| insufficient. |
| `expected_target_hint` | Optional: path, symbol, or test-like. |
| `expected_path_scope_contains` | Optional list of substrings that must appear in path_scope. |
| `expected_symbol_candidates_contain` | Optional list of symbols that must appear in symbol_candidates. |
| `expected_layers` | Optional list of layer ids expected in the retrieval plan. |
| `notes` | Optional: borderline, negative, or calibration hint. |
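A case in this format could look like the following. The values are taken from the `open_file_main_positive` example later in this report, but the exact YAML shape is an assumption, not a verbatim entry from `cases.yaml`:

```yaml
# Illustrative sketch only -- not a verbatim entry from cases.yaml.
- id: open_file_main_positive
  query: "Открой файл app/main.py"
  expected_intent: CODE_QA
  expected_sub_intent: OPEN_FILE
  expected_answer_mode: normal
  expected_target_hint: path
  expected_path_scope_contains:
    - app/main.py
  notes: positive
```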
Expected results: We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do not assert exact LLM wording.
Not asserted (yet): Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.
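The comparison step can be sketched as below. This is a minimal illustration of the containment-plus-equality checks described above; the real `_compare` logic in `tests/code_qa_eval/runner.py` may differ in structure and field coverage.

```python
# Minimal sketch of golden-case comparison; field names follow this report,
# but the actual _compare implementation may differ.
from dataclasses import dataclass, field


@dataclass
class Expected:
    intent: str
    sub_intent: str
    answer_mode: str
    path_scope_contains: list = field(default_factory=list)


def compare(expected: Expected, actual: dict) -> list:
    """Return human-readable mismatch messages (empty list => pass)."""
    mismatches = []
    for fld in ("intent", "sub_intent", "answer_mode"):
        if getattr(expected, fld) != actual.get(fld):
            mismatches.append(
                f"{fld}: expected {getattr(expected, fld)}, got {actual.get(fld)}"
            )
    # Substring containment check against the router's path_scope.
    scope = actual.get("path_scope", [])
    for needle in expected.path_scope_contains:
        if not any(needle in p for p in scope):
            mismatches.append(f"path_scope missing substring: {needle}")
    return mismatches
```

For example, a matching actual result yields an empty mismatch list, while a routing error produces messages like `sub_intent: expected OPEN_FILE, got EXPLAIN`.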
## 5. Golden Runner / Evaluation Harness
Entrypoints:

- Programmatic: `tests.code_qa_eval.runner.run_eval(config)` — runs all golden cases and returns `list[EvalCaseResult]`.
- CLI: `python -m tests.code_qa_eval.run` (from project root) — loads config, runs eval, writes artifacts and summary, exits 0 only if all pass.
Fixture mode (default):

- Do not set `CODE_QA_REPO_PATH`.
- Run: `python -m tests.code_qa_eval.run` (or call `run_eval(EvalConfig.from_env())`).
- Repo used: `tests/fixtures/code_qa_repo`. It is indexed once; then each golden case is run through the pipeline and compared to expected.
User-provided repo:

- Set `CODE_QA_REPO_PATH` to the repository root (e.g. `export CODE_QA_REPO_PATH=/path/to/your/repo`).
- Optionally set `CODE_QA_PROJECT_ID`.
- Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different `cases.yaml` by changing `EvalConfig.golden_cases_path` in code).
Outputs:

- Per case: under `tests/artifacts/code_qa_eval/<run_id>/`: `<case_id>.md` and `<case_id>.json` (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
- Batch: `tests/artifacts/code_qa_eval/summary_<run_id>.md` — table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) and a failure list.
- Exit code: 0 if all cases pass, 1 otherwise; failures are printed to stderr.
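The exit-code aggregation can be sketched as follows. This is a hypothetical illustration; the real CLI in `tests/code_qa_eval/run.py` may differ, and the `EvalCaseResult` shown here carries only the fields needed for the sketch.

```python
# Hypothetical sketch of the CLI pass/fail aggregation; the real
# tests/code_qa_eval/run.py may differ.
import sys
from dataclasses import dataclass, field


@dataclass
class EvalCaseResult:
    case_id: str
    passed: bool
    mismatches: list = field(default_factory=list)


def exit_code_for(results: list) -> int:
    """Return 0 if all cases pass, 1 otherwise; print failures to stderr."""
    failures = [r for r in results if not r.passed]
    for r in failures:
        print(f"FAIL {r.case_id}: {'; '.join(r.mismatches)}", file=sys.stderr)
    return 0 if not failures else 1
```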
## 6. Diagnostics Artifacts
Generated artifacts:

- Per run (per case): `<run_id>/<case_id>.md` and `<case_id>.json`.
- Batch: `summary_<run_id>.md` in `tests/artifacts/code_qa_eval/`.

Location: `tests/artifacts/code_qa_eval/` (created if missing).
Markdown (per case) contains:
- Query, expected (intent, sub_intent, answer_mode), actual (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
- Pass/fail and list of mismatches.
- Router: path_scope, layers.
- Retrieval: requested_layers, chunk_count, layer_outcomes.
- Evidence gate: failure_reasons.
- Timings (ms).
JSON (per case) adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.
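Writing such a per-case artifact pair can be sketched as below. This is a simplified illustration; the real `dump_run_artifact` in `tests/code_qa_eval/artifacts.py` may differ and writes considerably more detail.

```python
# Hypothetical sketch of writing one per-case artifact pair (.md + .json);
# the real dump_run_artifact in tests/code_qa_eval/artifacts.py may differ.
import json
from pathlib import Path


def dump_case_artifact(out_dir: Path, case_id: str, detail: dict) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Machine-readable JSON with the full expected/actual detail.
    (out_dir / f"{case_id}.json").write_text(json.dumps(detail, indent=2))
    # Human-readable Markdown summary for quick review.
    status = "PASS" if detail["passed"] else "FAIL"
    lines = [
        f"# {case_id}",
        "## Query",
        detail["query"],
        "## Result",
        status,
    ]
    (out_dir / f"{case_id}.md").write_text("\n".join(lines))
```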
Useful for calibration:
- Router: path_scope and layers — confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and plan.
- Retrieval: layer_outcomes and chunk_count — see which layers returned hits.
- Evidence gate: failure_reasons and evidence_count — see why answer_mode is degraded/insufficient.
- Mismatches: quick list of what to fix (routing vs retrieval vs gate).
Example snippet (Markdown):

```markdown
# open_file_main_positive

## Query
Открой файл app/main.py

## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal

## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2

## Result
PASS
```

(The example query reads “Open the file app/main.py”.)
## 7. Tests Added
| File | What it validates |
|---|---|
| `tests/code_qa_eval/test_eval_harness.py` | Golden loader, compare logic, config, fixture-mode run structure. |
Test groups:

- Golden loader: `test_load_golden_cases_returns_list` — loads `cases.yaml`, checks count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
- Compare logic: `test_compare_passed_when_all_match`, `test_compare_fails_on_intent_mismatch`, `test_compare_fails_on_answer_mode_mismatch`, `test_compare_path_scope_contains` — assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, path_scope.
- Config: `test_eval_config_fixture_mode_by_default` — default config uses fixture path, golden path, and artifacts dir under `tests/`.
- Fixture-mode run: `test_run_eval_fixture_mode_structure` — runs `run_eval(config)` with fixture config; asserts the result list and that each item is an `EvalCaseResult` with case, pipeline_result, passed, mismatches. Skips if DB or dependencies (e.g. sqlalchemy) are unavailable.
Modes: Loader and compare tests are unit (no DB). Config test uses paths only. Fixture-mode test is integration-style with real adapter and DB; it is skipped when the environment cannot connect or import.
## 8. Known Limitations
- LLM answer: The harness does not call the LLM; `answer_mode` is derived from the evidence gate only. No assertion on final answer text.
- Routing stability: Golden expectations (especially borderline/negative) may need manual adjustment as the router or retrieval changes.
- Real DB required: Full eval (index + retrieve) needs a configured DB; otherwise the integration test and CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
- Single session per run: Each run indexes the repo once and reuses one RAG session for all cases. Cross-session or re-index behaviour is not exercised.
- Docs / cross-domain: Golden cases and harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
- Performance: No timings or regression assertions; artifacts are for manual inspection and tuning.
## 9. How to Use for Manual Calibration
1. Run fixture evaluation. From project root: `python -m tests.code_qa_eval.run`. Check exit code and console output (pass/fail counts and failure lines).
2. Inspect diagnostics. Open `tests/artifacts/code_qa_eval/<run_id>/*.md` for failing (or borderline) cases. Use router (path_scope, layers), retrieval (layer_outcomes, chunk_count), and evidence gate (failure_reasons) to see why a case failed.
3. Run against a real local repo. Set `CODE_QA_REPO_PATH=/path/to/repo`, then run the same command. Compare behaviour to the fixture run.
4. Compare mismatches. Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).
5. Adjust and re-run. Update router, retrieval, or evidence policy; add/edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.
## 10. Changed Files Index
| File | Purpose |
|---|---|
| `tests/fixtures/code_qa_repo/app/main.py` | Fixture entrypoint. |
| `tests/fixtures/code_qa_repo/api/orders.py` | Fixture API handlers. |
| `tests/fixtures/code_qa_repo/services/order_service.py` | Fixture service layer. |
| `tests/fixtures/code_qa_repo/repositories/order_repository.py` | Fixture repository. |
| `tests/fixtures/code_qa_repo/domain/order.py` | Fixture domain model. |
| `tests/fixtures/code_qa_repo/tests/test_order_service.py` | Fixture tests. |
| `tests/fixtures/code_qa_repo/utils/helpers.py` | Fixture utility. |
| `tests/golden/code_qa/README.md` | Golden case format description. |
| `tests/golden/code_qa/cases.yaml` | Golden cases for all MVP scenarios. |
| `tests/code_qa_eval/__init__.py` | Package init. |
| `tests/code_qa_eval/config.py` | `EvalConfig`: repo path (fixture vs `CODE_QA_REPO_PATH`), artifacts dir, golden path. |
| `tests/code_qa_eval/golden_loader.py` | Load and parse golden cases from YAML. |
| `tests/code_qa_eval/runner.py` | `run_eval`: index repo, run pipeline, compare to golden; `_compare` logic. |
| `tests/code_qa_eval/artifacts.py` | `dump_run_artifact` (md+json), `write_batch_summary`. |
| `tests/code_qa_eval/run.py` | CLI entrypoint: load config, run eval, write artifacts and summary. |
| `tests/code_qa_eval/test_eval_harness.py` | Tests for loader, compare, config, fixture-mode run. |
| `pytest.ini` | Added marker `code_qa_eval`. |
| `iteration2_calibration_harness_report.md` | This report. |
No changes were made to production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.