# Iteration 2 — Calibration Harness Report
## 1. Executive Summary
This iteration adds **calibration and evaluation infrastructure** for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or production runtime.
**Added:**
- A small **deterministic fixture repository** (`tests/fixtures/code_qa_repo/`) for reproducible tests.
- **Golden case format and initial cases** for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, negative).
- An **evaluation harness** that indexes a repo (fixture or user-provided path), runs golden cases through `CodeQAPipelineRunner` with the **real retrieval adapter** (`RagDbAdapter`), and compares actual vs expected (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
- **Diagnostics artifact dumping** per run (Markdown + JSON) under `tests/artifacts/code_qa_eval/<run_id>/`.
- A **batch evaluation summary** (Markdown table + failure list) for manual review.
- **Two modes:** fixture repo by default; optional `CODE_QA_REPO_PATH` for a local real repository.
**Now possible:**
- Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
- Run the same harness against a user-provided repo path (no hardcoded external repo).
- Inspect per-case diagnostics and batch summary to tune routing, retrieval, evidence gate, and answer mode.
**Still manual / out of scope:**
- Tuning prompts and retrieval heuristics (harness supports observation, not automatic tuning).
- UI integration, docs runtime retrieval, production router replacement.
- Exact LLM answer matching (we assert routing, retrieval alignment, evidence sufficiency, answer mode only).
---
## 2. Fixture Repository
**Location:** `tests/fixtures/code_qa_repo/`
**Structure:**
```
tests/fixtures/code_qa_repo/
├── app/
│   └── main.py                   # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py                 # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py          # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py       # OrderRepository: save, find_by_id
├── domain/
│   └── order.py                  # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py     # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py                # format_order_id
```
**Purpose of each file:**
| File | Purpose |
|------|--------|
| `app/main.py` | Single clear entrypoint for FIND_ENTRYPOINTS and “open main” style queries. |
| `api/orders.py` | API/handler layer; distinct symbols `create_order`, `get_order`, `create_app`. |
| `services/order_service.py` | Service calling repository; symbol `OrderService`. |
| `repositories/order_repository.py` | Persistence; symbol `OrderRepository`. |
| `domain/order.py` | Domain model; symbol `Order`. |
| `tests/test_order_service.py` | Tests tied to production code for FIND_TESTS. |
| `utils/helpers.py` | Extra module for bounded GENERAL_QA and path/symbol variety. |
**Scenarios covered:**
- **File by path:** `app/main.py`, `api/orders.py` (OPEN_FILE).
- **Symbol explanation:** `Order`, `OrderService`, `create_order` (EXPLAIN).
- **Import/call relations:** service → repository → domain (EXPLAIN / GENERAL_QA).
- **Entrypoint:** `app/main.py` (FIND_ENTRYPOINTS).
- **Related tests:** `tests/test_order_service.py` for OrderService/Order (FIND_TESTS).
- **Fallback:** “What does this project do?” (GENERAL_QA with bounded context).
The fixture is small and structured so routing and retrieval expectations are unambiguous for calibration.
---
## 3. Real Adapter Integration
The canonical pipeline runs with the **existing** retrieval/index stack:
- **Indexing:** `RagSessionIndexer` (in `tests/pipeline_intent_rag/helpers/repo_indexer.py`) uses `RagService` and `LocalRepoFileCollector` to index a directory. The fixture (or `CODE_QA_REPO_PATH`) is indexed once per eval run.
- **Retrieval:** `RagDbAdapter` (in `tests/pipeline_intent_rag/helpers/rag_db_adapter.py`) implements the pipeline's `RetrievalAdapter` protocol: `retrieve_with_plan`, `retrieve_exact_files`, `hydrate_resolved_symbol_sources`, `force_symbol_context_c0`, `consume_retrieval_report`. It uses `RagRepository` and the same layer logic as the rest of the project.
- **Pipeline:** `CodeQAPipelineRunner` (in `app/modules/rag/code_qa_pipeline/pipeline.py`) takes `IntentRouterV2` and this adapter, builds `RetrievalRequest` from the router, runs retrieval, builds `EvidenceBundle`, runs the evidence gate, and produces diagnostics.
**Fixture repo:** The harness indexes `tests/fixtures/code_qa_repo` by default and runs all golden cases against that index. No external repo is required.
**User-provided repo:** Set `CODE_QA_REPO_PATH` to a local directory. The harness indexes that path and runs the same golden cases (or the user can add repo-specific cases). Optional `CODE_QA_PROJECT_ID` sets the project id for the session. The codebase does **not** depend on any private or external repo being present.
---
## 4. Golden Case Format
**Location:** `tests/golden/code_qa/`
**File:** `cases.yaml`
**Fields per case:**
| Field | Meaning |
|-------|--------|
| `id` | Unique case id. |
| `query` | User query text. |
| `expected_intent` | Expected top-level intent (e.g. CODE_QA). |
| `expected_sub_intent` | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| `expected_answer_mode` | normal \| degraded \| insufficient. |
| `expected_target_hint` | Optional: path, symbol, or test-like. |
| `expected_path_scope_contains` | Optional list of substrings that must appear in path_scope. |
| `expected_symbol_candidates_contain` | Optional list of symbols that must appear in symbol_candidates. |
| `expected_layers` | Optional list of layer ids expected in the retrieval plan. |
| `notes` | Optional: borderline, negative, or calibration hint. |
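Putting the fields together, a single case entry might look like this. The entry is illustrative (values are drawn from the `open_file_main_positive` example later in this report); the exact contents of `cases.yaml` may differ.

```yaml
- id: open_file_main_positive
  query: "Open the file app/main.py"
  expected_intent: CODE_QA
  expected_sub_intent: OPEN_FILE
  expected_answer_mode: normal
  expected_target_hint: path
  expected_path_scope_contains:
    - app/main.py
  notes: positive
```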
**Expected results:** We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do **not** assert exact LLM wording.
**Not asserted (yet):** Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.
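The comparison rules above can be sketched roughly as a pure function from expected/actual fields to a mismatch list. Function and field names here are illustrative assumptions, not the harness's actual API (the real logic lives in `tests/code_qa_eval/runner.py`).

```python
def compare_case(expected: dict, actual: dict) -> list[str]:
    """Return human-readable mismatch messages; an empty list means the case passed.

    Sketch of the golden-case comparison: exact-match fields first, then
    substring-containment checks for path_scope / symbol_candidates.
    """
    mismatches: list[str] = []
    for field in ("intent", "sub_intent", "answer_mode"):
        exp, act = expected.get(field), actual.get(field)
        if exp is not None and exp != act:
            mismatches.append(f"{field}: expected {exp!r}, got {act!r}")
    for field in ("path_scope", "symbol_candidates"):
        for needle in expected.get(f"{field}_contains", []):
            haystack = actual.get(field, [])
            if not any(needle in item for item in haystack):
                mismatches.append(f"{field}: missing {needle!r}")
    return mismatches
```

Returning messages rather than raising on the first failure lets the artifact dumper record every mismatch per case.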
---
## 5. Golden Runner / Evaluation Harness
**Entrypoints:**
- **Programmatic:** `tests.code_qa_eval.runner.run_eval(config)` — runs all golden cases and returns `list[EvalCaseResult]`.
- **CLI:** `python -m tests.code_qa_eval.run` (from project root) — loads config, runs eval, writes artifacts and summary, exits 0 only if all pass.
**Fixture mode (default):**
1. Do not set `CODE_QA_REPO_PATH`.
2. Run: `python -m tests.code_qa_eval.run` (or call `run_eval(EvalConfig.from_env())`).
3. Repo used: `tests/fixtures/code_qa_repo`. It is indexed once; then each golden case is run through the pipeline and compared to expected.
**User-provided repo:**
1. Set `CODE_QA_REPO_PATH` to the repository root (e.g. `export CODE_QA_REPO_PATH=/path/to/your/repo`).
2. Optionally set `CODE_QA_PROJECT_ID`.
3. Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different `cases.yaml` by changing `EvalConfig.golden_cases_path` in code).
**Outputs:**
- **Per case:** under `tests/artifacts/code_qa_eval/<run_id>/`: `<case_id>.md` and `<case_id>.json` (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
- **Batch:** `tests/artifacts/code_qa_eval/summary_<run_id>.md` — table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) and a failure list.
- **Exit code:** 0 if all cases pass, 1 otherwise; failures are printed to stderr.
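The exit-code contract can be sketched as below. This is a sketch only: the attribute names on the result objects are assumptions based on this report, not the actual `EvalCaseResult` shape.

```python
import sys

def main(results: list) -> int:
    """Print failures to stderr and return the process exit code.

    Sketch of the CLI contract described above: 0 only if every case passed.
    Assumes each result has .passed, .case_id and .mismatches attributes.
    """
    failures = [r for r in results if not r.passed]
    for r in failures:
        print(f"FAIL {r.case_id}: {'; '.join(r.mismatches)}", file=sys.stderr)
    print(f"{len(results) - len(failures)}/{len(results)} cases passed")
    return 1 if failures else 0
```

Keeping failure details on stderr and the pass count on stdout makes the command easy to wire into CI or shell pipelines.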
---
## 6. Diagnostics Artifacts
**Generated artifacts:**
- **Per run (per case):** `<run_id>/<case_id>.md` and `<case_id>.json`.
- **Batch:** `summary_<run_id>.md` in `tests/artifacts/code_qa_eval/`.
**Location:** `tests/artifacts/code_qa_eval/` (created if missing).
**Markdown (per case) contains:**
- Query, expected (intent, sub_intent, answer_mode), actual (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
- Pass/fail and list of mismatches.
- Router: path_scope, layers.
- Retrieval: requested_layers, chunk_count, layer_outcomes.
- Evidence gate: failure_reasons.
- Timings (ms).
**JSON (per case)** adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.
**Useful for calibration:**
- **Router:** path_scope and layers — confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and plan.
- **Retrieval:** layer_outcomes and chunk_count — see which layers returned hits.
- **Evidence gate:** failure_reasons and evidence_count — see why answer_mode is degraded/insufficient.
- **Mismatches:** quick list of what to fix (routing vs retrieval vs gate).
**Example snippet (Markdown):**
```markdown
# open_file_main_positive
## Query
Open the file app/main.py
## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2
## Result
PASS
```
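A renderer for the per-case Markdown in that shape might look like the following sketch. The real writer is `dump_run_artifact` in `tests/code_qa_eval/artifacts.py` and includes more sections (router, retrieval, gate, timings); the function name and dict keys here are illustrative.

```python
def render_case_markdown(case_id: str, query: str,
                         expected: dict, actual: dict, passed: bool) -> str:
    """Render a per-case diagnostics artifact in the shape shown above (sketch)."""
    lines = [
        f"# {case_id}",
        "## Query",
        query,
        "## Expected",
        f"- intent: {expected['intent']}, sub_intent: {expected['sub_intent']}",
        f"- answer_mode: {expected['answer_mode']}",
        "## Actual",
        f"- intent: {actual['intent']}, sub_intent: {actual['sub_intent']}",
        f"- answer_mode: {actual['answer_mode']}",
        f"- evidence_gate_passed: {actual['evidence_gate_passed']}",
        f"- evidence_count: {actual['evidence_count']}",
        "## Result",
        "PASS" if passed else "FAIL",
    ]
    return "\n".join(lines)
```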
---
## 7. Tests Added
| File | What it validates |
|------|-------------------|
| `tests/code_qa_eval/test_eval_harness.py` | Golden loader, compare logic, config, fixture-mode run structure. |
**Test groups:**
- **Golden loader:** `test_load_golden_cases_returns_list` — loads `cases.yaml`, checks count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
- **Compare logic:** `test_compare_passed_when_all_match`, `test_compare_fails_on_intent_mismatch`, `test_compare_fails_on_answer_mode_mismatch`, `test_compare_path_scope_contains` — assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, path_scope.
- **Config:** `test_eval_config_fixture_mode_by_default` — default config uses fixture path, golden path, and artifacts dir under `tests/`.
- **Fixture-mode run:** `test_run_eval_fixture_mode_structure` — runs `run_eval(config)` with fixture config; asserts result list and that each item is `EvalCaseResult` with case, pipeline_result, passed, mismatches. **Skips** if DB or dependencies (e.g. sqlalchemy) are unavailable.
**Modes:** Loader and compare tests are unit (no DB). Config test uses paths only. Fixture-mode test is integration-style with real adapter and DB; it is skipped when the environment cannot connect or import.
---
## 8. Known Limitations
- **LLM answer:** The harness does not call the LLM; `answer_mode` is derived from the evidence gate only. No assertion on final answer text.
- **Routing stability:** Golden expectations (especially borderline/negative) may need manual adjustment as the router or retrieval changes.
- **Real DB required:** Full eval (index + retrieve) needs a configured DB; otherwise the integration test and CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
- **Single session per run:** Each run indexes the repo once and reuses one RAG session for all cases. Cross-session or re-index behaviour is not exercised.
- **Docs / cross-domain:** Golden cases and harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
- **Performance:** No performance assertions or regression thresholds; per-case timings are captured in artifacts for manual inspection and tuning only.
---
## 9. How to Use for Manual Calibration
1. **Run fixture evaluation**
From project root: `python -m tests.code_qa_eval.run`. Check exit code and console output (pass/fail counts and failure lines).
2. **Inspect diagnostics**
Open `tests/artifacts/code_qa_eval/<run_id>/*.md` for failing (or borderline) cases. Use router (path_scope, layers), retrieval (layer_outcomes, chunk_count), and evidence gate (failure_reasons) to see why a case failed.
3. **Run against a real local repo**
Set `CODE_QA_REPO_PATH=/path/to/repo`, then run the same command. Compare behaviour to the fixture run.
4. **Compare mismatches**
Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).
5. **Adjust and re-run**
Update router, retrieval, or evidence policy; add/edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.
---
## 10. Changed Files Index
| File | Purpose |
|------|--------|
| `tests/fixtures/code_qa_repo/app/main.py` | Fixture entrypoint. |
| `tests/fixtures/code_qa_repo/api/orders.py` | Fixture API handlers. |
| `tests/fixtures/code_qa_repo/services/order_service.py` | Fixture service layer. |
| `tests/fixtures/code_qa_repo/repositories/order_repository.py` | Fixture repository. |
| `tests/fixtures/code_qa_repo/domain/order.py` | Fixture domain model. |
| `tests/fixtures/code_qa_repo/tests/test_order_service.py` | Fixture tests. |
| `tests/fixtures/code_qa_repo/utils/helpers.py` | Fixture utility. |
| `tests/golden/code_qa/README.md` | Golden case format description. |
| `tests/golden/code_qa/cases.yaml` | Golden cases for all MVP scenarios. |
| `tests/code_qa_eval/__init__.py` | Package init. |
| `tests/code_qa_eval/config.py` | EvalConfig: repo path (fixture vs CODE_QA_REPO_PATH), artifacts dir, golden path. |
| `tests/code_qa_eval/golden_loader.py` | Load and parse golden cases from YAML. |
| `tests/code_qa_eval/runner.py` | run_eval: index repo, run pipeline, compare to golden; _compare logic. |
| `tests/code_qa_eval/artifacts.py` | dump_run_artifact (md+json), write_batch_summary. |
| `tests/code_qa_eval/run.py` | CLI entrypoint: load config, run eval, write artifacts and summary. |
| `tests/code_qa_eval/test_eval_harness.py` | Tests for loader, compare, config, fixture-mode run. |
| `pytest.ini` | Added marker `code_qa_eval`. |
| `iteration2_calibration_harness_report.md` | This report. |
No changes were made to production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.