# Iteration 2 — Calibration Harness Report

## 1. Executive Summary

This iteration adds **calibration and evaluation infrastructure** for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or production runtime.

**Added:**

- A small **deterministic fixture repository** (`tests/fixtures/code_qa_repo/`) for reproducible tests.
- **Golden case format and initial cases** for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, negative).
- An **evaluation harness** that indexes a repo (fixture or user-provided path), runs golden cases through `CodeQAPipelineRunner` with the **real retrieval adapter** (`RagDbAdapter`), and compares actual vs expected (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
- **Diagnostics artifact dumping** per run (Markdown + JSON) under `tests/artifacts/code_qa_eval//`.
- A **batch evaluation summary** (Markdown table + failure list) for manual review.
- **Two modes:** fixture repo by default; optional `CODE_QA_REPO_PATH` for a local real repository.

**Now possible:**

- Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
- Run the same harness against a user-provided repo path (no hardcoded external repo).
- Inspect per-case diagnostics and the batch summary to tune routing, retrieval, the evidence gate, and answer mode.

**Still manual / out of scope:**

- Tuning prompts and retrieval heuristics (the harness supports observation, not automatic tuning).
- UI integration, docs runtime retrieval, production router replacement.
- Exact LLM answer matching (we assert routing, retrieval alignment, evidence sufficiency, and answer mode only).

---

## 2. Fixture Repository

**Location:** `tests/fixtures/code_qa_repo/`

**Structure:**

```
tests/fixtures/code_qa_repo/
├── app/
│   └── main.py                  # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py                # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py         # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py      # OrderRepository: save, find_by_id
├── domain/
│   └── order.py                 # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py    # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py               # format_order_id
```

**Purpose of each file:**

| File | Purpose |
|------|---------|
| `app/main.py` | Single clear entrypoint for FIND_ENTRYPOINTS and "open main" style queries. |
| `api/orders.py` | API/handler layer; distinct symbols `create_order`, `get_order`, `create_app`. |
| `services/order_service.py` | Service calling repository; symbol `OrderService`. |
| `repositories/order_repository.py` | Persistence; symbol `OrderRepository`. |
| `domain/order.py` | Domain model; symbol `Order`. |
| `tests/test_order_service.py` | Tests tied to production code for FIND_TESTS. |
| `utils/helpers.py` | Extra module for bounded GENERAL_QA and path/symbol variety. |

**Scenarios covered:**

- **File by path:** `app/main.py`, `api/orders.py` (OPEN_FILE).
- **Symbol explanation:** `Order`, `OrderService`, `create_order` (EXPLAIN).
- **Import/call relations:** service → repository → domain (EXPLAIN / GENERAL_QA).
- **Entrypoint:** `app/main.py` (FIND_ENTRYPOINTS).
- **Related tests:** `tests/test_order_service.py` for OrderService/Order (FIND_TESTS).
- **Fallback:** "Что делает этот проект?" ("What does this project do?") — GENERAL_QA with bounded context.

The fixture is small and structured so routing and retrieval expectations are unambiguous for calibration.

---

## 3. Real Adapter Integration

The canonical pipeline runs with the **existing** retrieval/index stack:

- **Indexing:** `RagSessionIndexer` (in `tests/pipeline_intent_rag/helpers/repo_indexer.py`) uses `RagService` and `LocalRepoFileCollector` to index a directory. The fixture (or `CODE_QA_REPO_PATH`) is indexed once per eval run.
- **Retrieval:** `RagDbAdapter` (in `tests/pipeline_intent_rag/helpers/rag_db_adapter.py`) implements the pipeline's `RetrievalAdapter` protocol: `retrieve_with_plan`, `retrieve_exact_files`, `hydrate_resolved_symbol_sources`, `force_symbol_context_c0`, `consume_retrieval_report`. It uses `RagRepository` and the same layer logic as the rest of the project.
- **Pipeline:** `CodeQAPipelineRunner` (in `app/modules/rag/code_qa_pipeline/pipeline.py`) takes `IntentRouterV2` and this adapter, builds a `RetrievalRequest` from the router, runs retrieval, builds an `EvidenceBundle`, runs the evidence gate, and produces diagnostics.

**Fixture repo:** The harness indexes `tests/fixtures/code_qa_repo` by default and runs all golden cases against that index. No external repo is required.

**User-provided repo:** Set `CODE_QA_REPO_PATH` to a local directory. The harness indexes that path and runs the same golden cases (or the user can add repo-specific cases). Optional `CODE_QA_PROJECT_ID` sets the project id for the session.

The codebase does **not** depend on any private or external repo being present.

---

## 4. Golden Case Format

**Location:** `tests/golden/code_qa/`

**File:** `cases.yaml`

**Fields per case:**

| Field | Meaning |
|-------|---------|
| `id` | Unique case id. |
| `query` | User query text. |
| `expected_intent` | Expected top-level intent (e.g. CODE_QA). |
| `expected_sub_intent` | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| `expected_answer_mode` | normal \| degraded \| insufficient. |
| `expected_target_hint` | Optional: path, symbol, or test-like. |
| `expected_path_scope_contains` | Optional list of substrings that must appear in path_scope. |
| `expected_symbol_candidates_contain` | Optional list of symbols that must appear in symbol_candidates. |
| `expected_layers` | Optional list of layer ids expected in the retrieval plan. |
| `notes` | Optional: borderline, negative, or calibration hint. |

**Expected results:** We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do **not** assert exact LLM wording.

**Not asserted (yet):** Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.

---

## 5. Golden Runner / Evaluation Harness

**Entrypoints:**

- **Programmatic:** `tests.code_qa_eval.runner.run_eval(config)` — runs all golden cases and returns `list[EvalCaseResult]`.
- **CLI:** `python -m tests.code_qa_eval.run` (from the project root) — loads config, runs eval, writes artifacts and summary, exits 0 only if all pass.

**Fixture mode (default):**

1. Do not set `CODE_QA_REPO_PATH`.
2. Run: `python -m tests.code_qa_eval.run` (or call `run_eval(EvalConfig.from_env())`).
3. Repo used: `tests/fixtures/code_qa_repo`. It is indexed once; then each golden case is run through the pipeline and compared to the expected values.

**User-provided repo:**

1. Set `CODE_QA_REPO_PATH` to the repository root (e.g. `export CODE_QA_REPO_PATH=/path/to/your/repo`).
2. Optionally set `CODE_QA_PROJECT_ID`.
3. Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different `cases.yaml` by changing `EvalConfig.golden_cases_path` in code).

**Outputs:**

- **Per case:** under `tests/artifacts/code_qa_eval//`: `.md` and `.json` (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
- **Batch:** `tests/artifacts/code_qa_eval/summary_.md` — a table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) and a failure list.
- **Exit code:** 0 if all cases pass, 1 otherwise; failures are printed to stderr.

---

## 6. Diagnostics Artifacts

**Generated artifacts:**

- **Per run (per case):** `/.md` and `.json`.
- **Batch:** `summary_.md` in `tests/artifacts/code_qa_eval/`.

**Location:** `tests/artifacts/code_qa_eval/` (created if missing).

**Markdown (per case) contains:**

- Query, expected (intent, sub_intent, answer_mode), actual (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
- Pass/fail and the list of mismatches.
- Router: path_scope, layers.
- Retrieval: requested_layers, chunk_count, layer_outcomes.
- Evidence gate: failure_reasons.
- Timings (ms).

**JSON (per case)** adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.

**Useful for calibration:**

- **Router:** path_scope and layers — confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and plan.
- **Retrieval:** layer_outcomes and chunk_count — see which layers returned hits.
- **Evidence gate:** failure_reasons and evidence_count — see why answer_mode is degraded/insufficient.
- **Mismatches:** quick list of what to fix (routing vs retrieval vs gate).

**Example snippet (Markdown):**

```markdown
# open_file_main_positive

## Query
Открой файл app/main.py

## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal

## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2

## Result
PASS
```

---

## 7. Tests Added

| File | What it validates |
|------|-------------------|
| `tests/code_qa_eval/test_eval_harness.py` | Golden loader, compare logic, config, fixture-mode run structure. |
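The compare step that these tests exercise can be sketched as follows. This is a simplified, hypothetical form: the `Expected` shape, the `compare` signature, and the mismatch-message format are assumptions for illustration, not the actual code of `runner._compare`.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified shape of a golden case's expectations.
# The real harness's field names and types may differ.
@dataclass
class Expected:
    intent: str
    sub_intent: str
    answer_mode: str
    path_scope_contains: list = field(default_factory=list)

def compare(expected: Expected, actual: dict) -> list[str]:
    """Return human-readable mismatch messages; an empty list means pass."""
    mismatches = []
    # Exact-match fields: routing and answer mode.
    for key in ("intent", "sub_intent", "answer_mode"):
        if getattr(expected, key) != actual.get(key):
            mismatches.append(
                f"{key}: expected {getattr(expected, key)}, got {actual.get(key)}"
            )
    # Substring containment: every expected fragment must occur in some
    # entry of the actual path_scope.
    for needle in expected.path_scope_contains:
        if not any(needle in p for p in actual.get("path_scope", [])):
            mismatches.append(f"path_scope missing substring: {needle}")
    return mismatches

exp = Expected("CODE_QA", "OPEN_FILE", "normal", ["app/main.py"])
ok = {"intent": "CODE_QA", "sub_intent": "OPEN_FILE", "answer_mode": "normal",
      "path_scope": ["app/main.py"]}
bad = {"intent": "CODE_QA", "sub_intent": "EXPLAIN", "answer_mode": "normal",
      "path_scope": []}
print(compare(exp, ok))   # passes: no mismatches
print(compare(exp, bad))  # fails: sub_intent and path_scope mismatches
```

Keeping the result a flat list of strings (rather than a boolean) is what makes the per-case artifacts and batch failure list cheap to produce.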
**Test groups:**

- **Golden loader:** `test_load_golden_cases_returns_list` — loads `cases.yaml`, checks count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
- **Compare logic:** `test_compare_passed_when_all_match`, `test_compare_fails_on_intent_mismatch`, `test_compare_fails_on_answer_mode_mismatch`, `test_compare_path_scope_contains` — assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, and path_scope.
- **Config:** `test_eval_config_fixture_mode_by_default` — the default config uses the fixture path, golden path, and artifacts dir under `tests/`.
- **Fixture-mode run:** `test_run_eval_fixture_mode_structure` — runs `run_eval(config)` with the fixture config; asserts the result list and that each item is an `EvalCaseResult` with case, pipeline_result, passed, mismatches. **Skips** if the DB or dependencies (e.g. sqlalchemy) are unavailable.

**Modes:** Loader and compare tests are unit tests (no DB). The config test uses paths only. The fixture-mode test is integration-style with the real adapter and DB; it is skipped when the environment cannot connect or import.

---

## 8. Known Limitations

- **LLM answer:** The harness does not call the LLM; `answer_mode` is derived from the evidence gate only. There is no assertion on final answer text.
- **Routing stability:** Golden expectations (especially borderline/negative) may need manual adjustment as the router or retrieval changes.
- **Real DB required:** Full eval (index + retrieve) needs a configured DB; otherwise the integration test and CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
- **Single session per run:** Each run indexes the repo once and reuses one RAG session for all cases. Cross-session or re-index behaviour is not exercised.
- **Docs / cross-domain:** Golden cases and the harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
- **Performance:** No timing or regression assertions; artifacts are for manual inspection and tuning.
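The environment-dependent skip behaviour noted above (the fixture-mode test skips when the DB or sqlalchemy is unavailable) can be sketched with standard library probes. The `db_available` helper, the `DATABASE_URL` variable, and the probe logic below are illustrative assumptions, not the harness's actual check:

```python
import importlib.util
import os

def db_available() -> bool:
    """Illustrative availability probe: dependency importable and a DSN set.

    The real harness may probe differently (e.g. attempt a connection);
    DATABASE_URL is an assumed variable name for this sketch.
    """
    if importlib.util.find_spec("sqlalchemy") is None:
        return False  # integration test would be skipped: import unavailable
    return bool(os.environ.get("DATABASE_URL"))

# In a real test module this would typically drive a pytest skip, e.g.:
#   pytestmark = pytest.mark.skipif(not db_available(), reason="DB unavailable")
print(db_available())
```

Probing via `importlib.util.find_spec` keeps the unit tests importable even on machines without the integration dependencies installed.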
---

## 9. How to Use for Manual Calibration

1. **Run fixture evaluation.** From the project root: `python -m tests.code_qa_eval.run`. Check the exit code and console output (pass/fail counts and failure lines).
2. **Inspect diagnostics.** Open `tests/artifacts/code_qa_eval//*.md` for failing (or borderline) cases. Use the router (path_scope, layers), retrieval (layer_outcomes, chunk_count), and evidence gate (failure_reasons) sections to see why a case failed.
3. **Run against a real local repo.** Set `CODE_QA_REPO_PATH=/path/to/repo`, then run the same command. Compare behaviour to the fixture run.
4. **Compare mismatches.** Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).
5. **Adjust and re-run.** Update the router, retrieval, or evidence policy; add or edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.

---

## 10. Changed Files Index

| File | Purpose |
|------|---------|
| `tests/fixtures/code_qa_repo/app/main.py` | Fixture entrypoint. |
| `tests/fixtures/code_qa_repo/api/orders.py` | Fixture API handlers. |
| `tests/fixtures/code_qa_repo/services/order_service.py` | Fixture service layer. |
| `tests/fixtures/code_qa_repo/repositories/order_repository.py` | Fixture repository. |
| `tests/fixtures/code_qa_repo/domain/order.py` | Fixture domain model. |
| `tests/fixtures/code_qa_repo/tests/test_order_service.py` | Fixture tests. |
| `tests/fixtures/code_qa_repo/utils/helpers.py` | Fixture utility. |
| `tests/golden/code_qa/README.md` | Golden case format description. |
| `tests/golden/code_qa/cases.yaml` | Golden cases for all MVP scenarios. |
| `tests/code_qa_eval/__init__.py` | Package init. |
| `tests/code_qa_eval/config.py` | EvalConfig: repo path (fixture vs CODE_QA_REPO_PATH), artifacts dir, golden path. |
| `tests/code_qa_eval/golden_loader.py` | Load and parse golden cases from YAML. |
| `tests/code_qa_eval/runner.py` | run_eval: index repo, run pipeline, compare to golden; _compare logic. |
| `tests/code_qa_eval/artifacts.py` | dump_run_artifact (md+json), write_batch_summary. |
| `tests/code_qa_eval/run.py` | CLI entrypoint: load config, run eval, write artifacts and summary. |
| `tests/code_qa_eval/test_eval_harness.py` | Tests for loader, compare, config, fixture-mode run. |
| `pytest.ini` | Added marker `code_qa_eval`. |
| `iteration2_calibration_harness_report.md` | This report. |

No changes were made to the production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.
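For orientation, a `cases.yaml` entry following the field table in section 4 might look like the sketch below. The id, query, intent, sub_intent, and answer_mode values are taken from the section 6 example artifact; the remaining fields and the top-level list layout are illustrative assumptions, not actual cases from the suite.

```yaml
# Hypothetical golden case (layout and optional fields are assumptions).
- id: open_file_main_positive
  query: "Открой файл app/main.py"   # "Open the file app/main.py"
  expected_intent: CODE_QA
  expected_sub_intent: OPEN_FILE
  expected_answer_mode: normal
  expected_target_hint: path
  expected_path_scope_contains:
    - app/main.py
  notes: positive
```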