# Iteration 2 — Calibration Harness Report
## 1. Executive Summary
This iteration adds **calibration and evaluation infrastructure** for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or production runtime.
**Added:**
- A small **deterministic fixture repository** (`tests/fixtures/code_qa_repo/`) for reproducible tests.
- **Golden case format and initial cases** for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, negative).
- An **evaluation harness** that indexes a repo (fixture or user-provided path), runs golden cases through `CodeQAPipelineRunner` with the **real retrieval adapter** (`RagDbAdapter`), and compares actual vs expected (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
- **Diagnostics artifact dumping** per run (Markdown + JSON) under `tests/artifacts/code_qa_eval/<run_id>/`.
- A **batch evaluation summary** (Markdown table + failure list) for manual review.
- **Two modes:** fixture repo by default; optional `CODE_QA_REPO_PATH` for a local real repository.
**Now possible:**
- Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
- Run the same harness against a user-provided repo path (no hardcoded external repo).
- Inspect per-case diagnostics and batch summary to tune routing, retrieval, evidence gate, and answer mode.
**Still manual / out of scope:**
- Tuning prompts and retrieval heuristics (harness supports observation, not automatic tuning).
- UI integration, docs runtime retrieval, production router replacement.
- Exact LLM answer matching (we assert routing, retrieval alignment, evidence sufficiency, answer mode only).
---
## 2. Fixture Repository
**Location:** `tests/fixtures/code_qa_repo/`
**Structure:**
```
tests/fixtures/code_qa_repo/
├── app/
│   └── main.py                   # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py                 # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py          # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py       # OrderRepository: save, find_by_id
├── domain/
│   └── order.py                  # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py     # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py                # format_order_id
```
**Purpose of each file:**
| File | Purpose |
|------|--------|
| `app/main.py` | Single clear entrypoint for FIND_ENTRYPOINTS and “open main” style queries. |
| `api/orders.py` | API/handler layer; distinct symbols `create_order`, `get_order`, `create_app`. |
| `services/order_service.py` | Service calling repository; symbol `OrderService`. |
| `repositories/order_repository.py` | Persistence; symbol `OrderRepository`. |
| `domain/order.py` | Domain model; symbol `Order`. |
| `tests/test_order_service.py` | Tests tied to production code for FIND_TESTS. |
| `utils/helpers.py` | Extra module for bounded GENERAL_QA and path/symbol variety. |
**Scenarios covered:**
- **File by path:** `app/main.py`, `api/orders.py` (OPEN_FILE).
- **Symbol explanation:** `Order`, `OrderService`, `create_order` (EXPLAIN).
- **Import/call relations:** service → repository → domain (EXPLAIN / GENERAL_QA).
- **Entrypoint:** `app/main.py` (FIND_ENTRYPOINTS).
- **Related tests:** `tests/test_order_service.py` for OrderService/Order (FIND_TESTS).
- **Fallback:** “What does this project do?” (GENERAL_QA with bounded context).
The fixture is small and structured so routing and retrieval expectations are unambiguous for calibration.
---
## 3. Real Adapter Integration
The canonical pipeline runs with the **existing** retrieval/index stack:
- **Indexing:** `RagSessionIndexer` (in `tests/pipeline_intent_rag/helpers/repo_indexer.py`) uses `RagService` and `LocalRepoFileCollector` to index a directory. The fixture (or `CODE_QA_REPO_PATH`) is indexed once per eval run.
- **Retrieval:** `RagDbAdapter` (in `tests/pipeline_intent_rag/helpers/rag_db_adapter.py`) implements the pipeline's `RetrievalAdapter` protocol: `retrieve_with_plan`, `retrieve_exact_files`, `hydrate_resolved_symbol_sources`, `force_symbol_context_c0`, `consume_retrieval_report`. It uses `RagRepository` and the same layer logic as the rest of the project.
- **Pipeline:** `CodeQAPipelineRunner` (in `app/modules/rag/code_qa_pipeline/pipeline.py`) takes `IntentRouterV2` and this adapter, builds `RetrievalRequest` from the router, runs retrieval, builds `EvidenceBundle`, runs the evidence gate, and produces diagnostics.
**Fixture repo:** The harness indexes `tests/fixtures/code_qa_repo` by default and runs all golden cases against that index. No external repo is required.
**User-provided repo:** Set `CODE_QA_REPO_PATH` to a local directory. The harness indexes that path and runs the same golden cases (or the user can add repo-specific cases). Optional `CODE_QA_PROJECT_ID` sets the project id for the session. The codebase does **not** depend on any private or external repo being present.
---
## 4. Golden Case Format
**Location:** `tests/golden/code_qa/`
**File:** `cases.yaml`
**Fields per case:**
| Field | Meaning |
|-------|--------|
| `id` | Unique case id. |
| `query` | User query text. |
| `expected_intent` | Expected top-level intent (e.g. CODE_QA). |
| `expected_sub_intent` | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| `expected_answer_mode` | normal \| degraded \| insufficient. |
| `expected_target_hint` | Optional: path, symbol, or test-like. |
| `expected_path_scope_contains` | Optional list of substrings that must appear in path_scope. |
| `expected_symbol_candidates_contain` | Optional list of symbols that must appear in symbol_candidates. |
| `expected_layers` | Optional list of layer ids expected in the retrieval plan. |
| `notes` | Optional: borderline, negative, or calibration hint. |
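Putting the fields together, a single case entry might look like this. The entry is illustrative (values are drawn from the `open_file_main_positive` example later in this report); the exact contents of `cases.yaml` may differ.

```yaml
- id: open_file_main_positive
  query: "Open the file app/main.py"
  expected_intent: CODE_QA
  expected_sub_intent: OPEN_FILE
  expected_answer_mode: normal
  expected_target_hint: path
  expected_path_scope_contains:
    - app/main.py
  notes: positive
```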
**Expected results:** We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do **not** assert exact LLM wording.
**Not asserted (yet):** Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.
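The comparison rules above can be sketched roughly as a pure function from expected/actual fields to a mismatch list. Function and field names here are illustrative assumptions, not the harness's actual API (the real logic lives in `tests/code_qa_eval/runner.py`).

```python
def compare_case(expected: dict, actual: dict) -> list[str]:
    """Return human-readable mismatch messages; an empty list means the case passed.

    Sketch of the golden-case comparison: exact-match fields first, then
    substring-containment checks for path_scope / symbol_candidates.
    """
    mismatches: list[str] = []
    for field in ("intent", "sub_intent", "answer_mode"):
        exp, act = expected.get(field), actual.get(field)
        if exp is not None and exp != act:
            mismatches.append(f"{field}: expected {exp!r}, got {act!r}")
    for field in ("path_scope", "symbol_candidates"):
        for needle in expected.get(f"{field}_contains", []):
            haystack = actual.get(field, [])
            if not any(needle in item for item in haystack):
                mismatches.append(f"{field}: missing {needle!r}")
    return mismatches
```

Returning messages rather than raising on the first failure lets the artifact dumper record every mismatch per case.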
---
## 5. Golden Runner / Evaluation Harness
**Entrypoints:**
- **Programmatic:** `tests.code_qa_eval.runner.run_eval(config)` — runs all golden cases and returns `list[EvalCaseResult]`.
- **CLI:** `python -m tests.code_qa_eval.run` (from project root) — loads config, runs eval, writes artifacts and summary, exits 0 only if all pass.
**Fixture mode (default):**
1. Do not set `CODE_QA_REPO_PATH`.
2. Run: `python -m tests.code_qa_eval.run` (or call `run_eval(EvalConfig.from_env())`).
3. Repo used: `tests/fixtures/code_qa_repo`. It is indexed once; then each golden case is run through the pipeline and compared to expected.
**User-provided repo:**
1. Set `CODE_QA_REPO_PATH` to the repository root (e.g. `export CODE_QA_REPO_PATH=/path/to/your/repo`).
2. Optionally set `CODE_QA_PROJECT_ID`.
3. Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different `cases.yaml` by changing `EvalConfig.golden_cases_path` in code).
**Outputs:**
- **Per case:** under `tests/artifacts/code_qa_eval/<run_id>/`: `<case_id>.md` and `<case_id>.json` (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
- **Batch:** `tests/artifacts/code_qa_eval/summary_<run_id>.md` — table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) and a failure list.
- **Exit code:** 0 if all cases pass, 1 otherwise; failures are printed to stderr.
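The exit-code contract can be sketched as below. This is a sketch only: the attribute names on the result objects are assumptions based on this report, not the actual `EvalCaseResult` shape.

```python
import sys

def main(results: list) -> int:
    """Print failures to stderr and return the process exit code.

    Sketch of the CLI contract described above: 0 only if every case passed.
    Assumes each result has .passed, .case_id and .mismatches attributes.
    """
    failures = [r for r in results if not r.passed]
    for r in failures:
        print(f"FAIL {r.case_id}: {'; '.join(r.mismatches)}", file=sys.stderr)
    print(f"{len(results) - len(failures)}/{len(results)} cases passed")
    return 1 if failures else 0
```

Keeping failure details on stderr and the pass count on stdout makes the command easy to wire into CI or shell pipelines.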
---
## 6. Diagnostics Artifacts
**Generated artifacts:**
- **Per run (per case):** `<run_id>/<case_id>.md` and `<case_id>.json`.
- **Batch:** `summary_<run_id>.md` in `tests/artifacts/code_qa_eval/`.
**Location:** `tests/artifacts/code_qa_eval/` (created if missing).
**Markdown (per case) contains:**
- Query, expected (intent, sub_intent, answer_mode), actual (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
- Pass/fail and list of mismatches.
- Router: path_scope, layers.
- Retrieval: requested_layers, chunk_count, layer_outcomes.
- Evidence gate: failure_reasons.
- Timings (ms).
**JSON (per case)** adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.
**Useful for calibration:**
- **Router:** path_scope and layers — confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and plan.
- **Retrieval:** layer_outcomes and chunk_count — see which layers returned hits.
- **Evidence gate:** failure_reasons and evidence_count — see why answer_mode is degraded/insufficient.
- **Mismatches:** quick list of what to fix (routing vs retrieval vs gate).
**Example snippet (Markdown):**
```markdown
# open_file_main_positive
## Query
Open the file app/main.py
## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2
## Result
PASS
```
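A renderer for the per-case Markdown in that shape might look like the following sketch. The real writer is `dump_run_artifact` in `tests/code_qa_eval/artifacts.py` and includes more sections (router, retrieval, gate, timings); the function name and dict keys here are illustrative.

```python
def render_case_markdown(case_id: str, query: str,
                         expected: dict, actual: dict, passed: bool) -> str:
    """Render a per-case diagnostics artifact in the shape shown above (sketch)."""
    lines = [
        f"# {case_id}",
        "## Query",
        query,
        "## Expected",
        f"- intent: {expected['intent']}, sub_intent: {expected['sub_intent']}",
        f"- answer_mode: {expected['answer_mode']}",
        "## Actual",
        f"- intent: {actual['intent']}, sub_intent: {actual['sub_intent']}",
        f"- answer_mode: {actual['answer_mode']}",
        f"- evidence_gate_passed: {actual['evidence_gate_passed']}",
        f"- evidence_count: {actual['evidence_count']}",
        "## Result",
        "PASS" if passed else "FAIL",
    ]
    return "\n".join(lines)
```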
---
## 7. Tests Added
| File | What it validates |
|------|-------------------|
| `tests/code_qa_eval/test_eval_harness.py` | Golden loader, compare logic, config, fixture-mode run structure. |
**Test groups:**
- **Golden loader:** `test_load_golden_cases_returns_list` — loads `cases.yaml`, checks count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
- **Compare logic:** `test_compare_passed_when_all_match`, `test_compare_fails_on_intent_mismatch`, `test_compare_fails_on_answer_mode_mismatch`, `test_compare_path_scope_contains` — assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, path_scope.
- **Config:** `test_eval_config_fixture_mode_by_default` — default config uses fixture path, golden path, and artifacts dir under `tests/`.
- **Fixture-mode run:** `test_run_eval_fixture_mode_structure` — runs `run_eval(config)` with fixture config; asserts result list and that each item is `EvalCaseResult` with case, pipeline_result, passed, mismatches. **Skips** if DB or dependencies (e.g. sqlalchemy) are unavailable.
**Modes:** Loader and compare tests are unit (no DB). Config test uses paths only. Fixture-mode test is integration-style with real adapter and DB; it is skipped when the environment cannot connect or import.
---
## 8. Known Limitations
- **LLM answer:** The harness does not call the LLM; `answer_mode` is derived from the evidence gate only. No assertion on final answer text.
- **Routing stability:** Golden expectations (especially borderline/negative) may need manual adjustment as the router or retrieval changes.
- **Real DB required:** Full eval (index + retrieve) needs a configured DB; otherwise the integration test and CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
- **Single session per run:** Each run indexes the repo once and reuses one RAG session for all cases. Cross-session or re-index behaviour is not exercised.
- **Docs / cross-domain:** Golden cases and harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
- **Performance:** No performance assertions or regression thresholds; per-case timings are captured in artifacts for manual inspection and tuning only.
---
## 9. How to Use for Manual Calibration
1. **Run fixture evaluation**
From project root: `python -m tests.code_qa_eval.run`. Check exit code and console output (pass/fail counts and failure lines).
2. **Inspect diagnostics**
Open `tests/artifacts/code_qa_eval/<run_id>/*.md` for failing (or borderline) cases. Use router (path_scope, layers), retrieval (layer_outcomes, chunk_count), and evidence gate (failure_reasons) to see why a case failed.
3. **Run against a real local repo**
Set `CODE_QA_REPO_PATH=/path/to/repo`, then run the same command. Compare behaviour to the fixture run.
4. **Compare mismatches**
Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).
5. **Adjust and re-run**
Update router, retrieval, or evidence policy; add/edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.
---
## 10. Changed Files Index
| File | Purpose |
|------|--------|
| `tests/fixtures/code_qa_repo/app/main.py` | Fixture entrypoint. |
| `tests/fixtures/code_qa_repo/api/orders.py` | Fixture API handlers. |
| `tests/fixtures/code_qa_repo/services/order_service.py` | Fixture service layer. |
| `tests/fixtures/code_qa_repo/repositories/order_repository.py` | Fixture repository. |
| `tests/fixtures/code_qa_repo/domain/order.py` | Fixture domain model. |
| `tests/fixtures/code_qa_repo/tests/test_order_service.py` | Fixture tests. |
| `tests/fixtures/code_qa_repo/utils/helpers.py` | Fixture utility. |
| `tests/golden/code_qa/README.md` | Golden case format description. |
| `tests/golden/code_qa/cases.yaml` | Golden cases for all MVP scenarios. |
| `tests/code_qa_eval/__init__.py` | Package init. |
| `tests/code_qa_eval/config.py` | EvalConfig: repo path (fixture vs CODE_QA_REPO_PATH), artifacts dir, golden path. |
| `tests/code_qa_eval/golden_loader.py` | Load and parse golden cases from YAML. |
| `tests/code_qa_eval/runner.py` | run_eval: index repo, run pipeline, compare to golden; _compare logic. |
| `tests/code_qa_eval/artifacts.py` | dump_run_artifact (md+json), write_batch_summary. |
| `tests/code_qa_eval/run.py` | CLI entrypoint: load config, run eval, write artifacts and summary. |
| `tests/code_qa_eval/test_eval_harness.py` | Tests for loader, compare, config, fixture-mode run. |
| `pytest.ini` | Added marker `code_qa_eval`. |
| `iteration2_calibration_harness_report.md` | This report. |
No changes were made to production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.