# Iteration 2 — Calibration Harness Report

## 1. Executive Summary

This iteration adds **calibration and evaluation infrastructure** for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or the production runtime.

**Added:**

- A small **deterministic fixture repository** (`tests/fixtures/code_qa_repo/`) for reproducible tests.
- A **golden case format and initial cases** for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, and negative).
- An **evaluation harness** that indexes a repo (the fixture or a user-provided path), runs golden cases through `CodeQAPipelineRunner` with the **real retrieval adapter** (`RagDbAdapter`), and compares actual vs expected results (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
- **Diagnostics artifact dumping** per run (Markdown + JSON) under `tests/artifacts/code_qa_eval/<run_id>/`.
- A **batch evaluation summary** (Markdown table + failure list) for manual review.
- **Two modes:** the fixture repo by default; optional `CODE_QA_REPO_PATH` for a local real repository.

**Now possible:**

- Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
- Run the same harness against a user-provided repo path (no hardcoded external repo).
- Inspect per-case diagnostics and the batch summary to tune routing, retrieval, the evidence gate, and the answer mode.

**Still manual / out of scope:**

- Tuning prompts and retrieval heuristics (the harness supports observation, not automatic tuning).
- UI integration, docs runtime retrieval, and production router replacement.
- Exact LLM answer matching (we assert only routing, retrieval alignment, evidence sufficiency, and answer mode).

---

## 2. Fixture Repository

**Location:** `tests/fixtures/code_qa_repo/`

**Structure:**

```
tests/fixtures/code_qa_repo/
├── app/
│   └── main.py                # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py              # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py       # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py    # OrderRepository: save, find_by_id
├── domain/
│   └── order.py               # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py  # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py             # format_order_id
```

**Purpose of each file:**

| File | Purpose |
|------|---------|
| `app/main.py` | Single clear entrypoint for FIND_ENTRYPOINTS and “open main”-style queries. |
| `api/orders.py` | API/handler layer; distinct symbols `create_order`, `get_order`, `create_app`. |
| `services/order_service.py` | Service calling the repository; symbol `OrderService`. |
| `repositories/order_repository.py` | Persistence; symbol `OrderRepository`. |
| `domain/order.py` | Domain model; symbol `Order`. |
| `tests/test_order_service.py` | Tests tied to production code for FIND_TESTS. |
| `utils/helpers.py` | Extra module for bounded GENERAL_QA and path/symbol variety. |

**Scenarios covered:**

- **File by path:** `app/main.py`, `api/orders.py` (OPEN_FILE).
- **Symbol explanation:** `Order`, `OrderService`, `create_order` (EXPLAIN).
- **Import/call relations:** service → repository → domain (EXPLAIN / GENERAL_QA).
- **Entrypoint:** `app/main.py` (FIND_ENTRYPOINTS).
- **Related tests:** `tests/test_order_service.py` for OrderService/Order (FIND_TESTS).
- **Fallback:** “Что делает этот проект?” (“What does this project do?”; GENERAL_QA with bounded context).

The fixture is small and structured so that routing and retrieval expectations are unambiguous for calibration.

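For orientation, the layering the tree encodes can be condensed into a few lines of Python. This is an illustrative sketch, not the verbatim fixture code: the class and method names come from the tree above, while the bodies are assumed.

```python
# Illustrative condensation of the fixture's domain/repository/service layers.
# Names match the tree above; method bodies are assumptions, not the real code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:  # domain/order.py
    id: str
    product_id: str
    quantity: int
    status: str = "created"


class OrderRepository:  # repositories/order_repository.py
    def __init__(self) -> None:
        self._orders: dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.id] = order

    def find_by_id(self, order_id: str) -> Optional[Order]:
        return self._orders.get(order_id)


class OrderService:  # services/order_service.py
    def __init__(self, repository: OrderRepository) -> None:
        self._repository = repository

    def create_order(self, order_id: str, product_id: str, quantity: int) -> Order:
        order = Order(id=order_id, product_id=product_id, quantity=quantity)
        self._repository.save(order)
        return order

    def get_order(self, order_id: str) -> Optional[Order]:
        return self._repository.find_by_id(order_id)
```

The point of the shape, for calibration, is that each symbol lives in exactly one file, so routing and retrieval expectations have a single correct answer.
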
---

## 3. Real Adapter Integration

The canonical pipeline runs with the **existing** retrieval/index stack:

- **Indexing:** `RagSessionIndexer` (in `tests/pipeline_intent_rag/helpers/repo_indexer.py`) uses `RagService` and `LocalRepoFileCollector` to index a directory. The fixture (or `CODE_QA_REPO_PATH`) is indexed once per eval run.
- **Retrieval:** `RagDbAdapter` (in `tests/pipeline_intent_rag/helpers/rag_db_adapter.py`) implements the pipeline’s `RetrievalAdapter` protocol: `retrieve_with_plan`, `retrieve_exact_files`, `hydrate_resolved_symbol_sources`, `force_symbol_context_c0`, and `consume_retrieval_report`. It uses `RagRepository` and the same layer logic as the rest of the project.
- **Pipeline:** `CodeQAPipelineRunner` (in `app/modules/rag/code_qa_pipeline/pipeline.py`) takes `IntentRouterV2` and this adapter, builds a `RetrievalRequest` from the router output, runs retrieval, builds an `EvidenceBundle`, runs the evidence gate, and produces diagnostics.

**Fixture repo:** The harness indexes `tests/fixtures/code_qa_repo` by default and runs all golden cases against that index. No external repo is required.

**User-provided repo:** Set `CODE_QA_REPO_PATH` to a local directory. The harness indexes that path and runs the same golden cases (or you can add repo-specific cases). Optional `CODE_QA_PROJECT_ID` sets the project id for the session. The codebase does **not** depend on any private or external repo being present.

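As a rough picture of the contract, the adapter can be sketched as a structural protocol. The five method names below are the ones listed above; the signatures are simplified placeholders, since the real ones live in the pipeline module.

```python
# Structural sketch of the pipeline's RetrievalAdapter contract.
# Method names are from this report; signatures are simplified placeholders.
from typing import Protocol, runtime_checkable


@runtime_checkable
class RetrievalAdapter(Protocol):
    def retrieve_with_plan(self, request: object) -> object: ...
    def retrieve_exact_files(self, paths: list) -> object: ...
    def hydrate_resolved_symbol_sources(self, symbols: list) -> object: ...
    def force_symbol_context_c0(self, symbol: str) -> object: ...
    def consume_retrieval_report(self) -> object: ...


class DummyAdapter:
    """Hypothetical stand-in: lets the pipeline be unit-tested without a DB."""

    def retrieve_with_plan(self, request: object) -> object:
        return []

    def retrieve_exact_files(self, paths: list) -> object:
        return []

    def hydrate_resolved_symbol_sources(self, symbols: list) -> object:
        return []

    def force_symbol_context_c0(self, symbol: str) -> object:
        return None

    def consume_retrieval_report(self) -> object:
        return {}
```

Because the protocol is structural, `RagDbAdapter` satisfies it simply by implementing the five methods; no inheritance is required.
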
---

## 4. Golden Case Format

**Location:** `tests/golden/code_qa/`
**File:** `cases.yaml`

**Fields per case:**

| Field | Meaning |
|-------|---------|
| `id` | Unique case id. |
| `query` | User query text. |
| `expected_intent` | Expected top-level intent (e.g. CODE_QA). |
| `expected_sub_intent` | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| `expected_answer_mode` | normal \| degraded \| insufficient. |
| `expected_target_hint` | Optional: path, symbol, or test-like. |
| `expected_path_scope_contains` | Optional list of substrings that must appear in path_scope. |
| `expected_symbol_candidates_contain` | Optional list of symbols that must appear in symbol_candidates. |
| `expected_layers` | Optional list of layer ids expected in the retrieval plan. |
| `notes` | Optional: borderline, negative, or calibration hint. |

**Expected results:** We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, and layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do **not** assert exact LLM wording.

**Not asserted (yet):** Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.

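For concreteness, a golden case built from these fields might look as follows. The values are illustrative (the `id` and `query` match the example artifact shown later in this report), and the exact layout of `cases.yaml` may differ:

```yaml
- id: open_file_main_positive
  query: "Открой файл app/main.py"   # "Open the file app/main.py"
  expected_intent: CODE_QA
  expected_sub_intent: OPEN_FILE
  expected_answer_mode: normal
  expected_target_hint: path
  expected_path_scope_contains:
    - app/main.py
  notes: positive case; exact-path open
```
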
---

## 5. Golden Runner / Evaluation Harness

**Entrypoints:**

- **Programmatic:** `tests.code_qa_eval.runner.run_eval(config)` runs all golden cases and returns `list[EvalCaseResult]`.
- **CLI:** `python -m tests.code_qa_eval.run` (from the project root) loads the config, runs the eval, writes artifacts and the summary, and exits 0 only if all cases pass.

**Fixture mode (default):**

1. Do not set `CODE_QA_REPO_PATH`.
2. Run `python -m tests.code_qa_eval.run` (or call `run_eval(EvalConfig.from_env())`).
3. The repo used is `tests/fixtures/code_qa_repo`. It is indexed once; each golden case is then run through the pipeline and compared to its expected values.

**User-provided repo:**

1. Set `CODE_QA_REPO_PATH` to the repository root (e.g. `export CODE_QA_REPO_PATH=/path/to/your/repo`).
2. Optionally set `CODE_QA_PROJECT_ID`.
3. Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different `cases.yaml` by changing `EvalConfig.golden_cases_path` in code).

**Outputs:**

- **Per case:** `<case_id>.md` and `<case_id>.json` under `tests/artifacts/code_qa_eval/<run_id>/` (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
- **Batch:** `tests/artifacts/code_qa_eval/summary_<run_id>.md`, a table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) plus a failure list.
- **Exit code:** 0 if all cases pass, 1 otherwise; failures are printed to stderr.

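The mode selection described above amounts to a small piece of config logic. A minimal sketch, assuming `from_env` reads the two environment variables directly; the field names follow this report, while the defaults and exact shape of `tests/code_qa_eval/config.py` may differ:

```python
# Sketch of EvalConfig mode selection: CODE_QA_REPO_PATH switches the harness
# to a user-provided repo; otherwise the fixture repo is used. Defaults here
# are assumptions based on the paths named in this report.
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

FIXTURE_REPO = Path("tests/fixtures/code_qa_repo")


@dataclass
class EvalConfig:
    repo_path: Path
    project_id: Optional[str]
    golden_cases_path: Path = Path("tests/golden/code_qa/cases.yaml")
    artifacts_dir: Path = Path("tests/artifacts/code_qa_eval")

    @classmethod
    def from_env(cls) -> "EvalConfig":
        repo = os.environ.get("CODE_QA_REPO_PATH")
        return cls(
            repo_path=Path(repo) if repo else FIXTURE_REPO,
            project_id=os.environ.get("CODE_QA_PROJECT_ID"),
        )
```

With no environment variables set, `from_env()` yields the fixture-mode config; setting `CODE_QA_REPO_PATH` flips only the repo path while the golden cases and artifacts dir stay the same.
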
---

## 6. Diagnostics Artifacts

**Generated artifacts:**

- **Per case:** `<run_id>/<case_id>.md` and `<run_id>/<case_id>.json`.
- **Batch:** `summary_<run_id>.md` in `tests/artifacts/code_qa_eval/`.

**Location:** `tests/artifacts/code_qa_eval/` (created if missing).

**Markdown (per case) contains:**

- The query; expected values (intent, sub_intent, answer_mode); actual values (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
- Pass/fail and the list of mismatches.
- Router: path_scope, layers.
- Retrieval: requested_layers, chunk_count, layer_outcomes.
- Evidence gate: failure_reasons.
- Timings (ms).

**JSON (per case)** adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.

**Useful for calibration:**

- **Router:** path_scope and layers confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and the plan.
- **Retrieval:** layer_outcomes and chunk_count show which layers returned hits.
- **Evidence gate:** failure_reasons and evidence_count show why answer_mode is degraded or insufficient.
- **Mismatches:** a quick list of what to fix (routing vs retrieval vs gate).

**Example snippet (Markdown):**

```markdown
# open_file_main_positive

## Query
Открой файл app/main.py

## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal

## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2

## Result
PASS
```

(The query reads “Открой файл app/main.py”, i.e. “Open the file app/main.py”.)

---

## 7. Tests Added

| File | What it validates |
|------|-------------------|
| `tests/code_qa_eval/test_eval_harness.py` | Golden loader, compare logic, config, fixture-mode run structure. |

**Test groups:**

- **Golden loader:** `test_load_golden_cases_returns_list` loads `cases.yaml` and checks the case count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
- **Compare logic:** `test_compare_passed_when_all_match`, `test_compare_fails_on_intent_mismatch`, `test_compare_fails_on_answer_mode_mismatch`, and `test_compare_path_scope_contains` assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, and path_scope.
- **Config:** `test_eval_config_fixture_mode_by_default` checks that the default config uses the fixture path, the golden path, and an artifacts dir under `tests/`.
- **Fixture-mode run:** `test_run_eval_fixture_mode_structure` runs `run_eval(config)` with the fixture config and asserts that the result is a list whose items are `EvalCaseResult` instances with case, pipeline_result, passed, and mismatches. It **skips** if the DB or dependencies (e.g. sqlalchemy) are unavailable.

**Modes:** The loader and compare tests are unit tests (no DB). The config test uses paths only. The fixture-mode test is integration-style with the real adapter and DB; it is skipped when the environment cannot connect or import.

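The compare behaviour these tests pin down can be sketched as a pure function over expected/actual dicts: equality on the routing fields, substring containment on path_scope. This is an assumed reconstruction, not the actual `_compare` from `runner.py`:

```python
# Sketch of golden-case comparison: equality on intent/sub_intent/answer_mode,
# substring containment on path_scope. Field names follow the golden format;
# the real _compare in runner.py likely differs in detail.
def compare_case(expected: dict, actual: dict) -> list:
    mismatches: list = []

    # Routing fields: compare only when the golden case specifies them.
    for field in ("expected_intent", "expected_sub_intent", "expected_answer_mode"):
        key = field.removeprefix("expected_")
        if field in expected and expected[field] != actual.get(key):
            mismatches.append(
                f"{key}: expected {expected[field]!r}, got {actual.get(key)!r}"
            )

    # path_scope: every expected substring must occur in some actual path.
    for needle in expected.get("expected_path_scope_contains", []):
        if not any(needle in path for path in actual.get("path_scope", [])):
            mismatches.append(f"path_scope: missing {needle!r}")

    return mismatches
```

An empty mismatch list means the case passed; a non-empty list doubles as the failure messages that end up in the per-case artifact.
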
---

## 8. Known Limitations

- **LLM answer:** The harness does not call the LLM; `answer_mode` is derived from the evidence gate only. There is no assertion on the final answer text.
- **Routing stability:** Golden expectations (especially borderline/negative cases) may need manual adjustment as the router or retrieval changes.
- **Real DB required:** A full eval (index + retrieve) needs a configured DB; otherwise the integration test and the CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
- **Single session per run:** Each run indexes the repo once and reuses one RAG session for all cases. Cross-session and re-index behaviour are not exercised.
- **Docs / cross-domain:** The golden cases and harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
- **Performance:** No timing or regression assertions; artifacts are for manual inspection and tuning.

---

## 9. How to Use for Manual Calibration

1. **Run the fixture evaluation.**
   From the project root: `python -m tests.code_qa_eval.run`. Check the exit code and console output (pass/fail counts and failure lines).

2. **Inspect diagnostics.**
   Open `tests/artifacts/code_qa_eval/<run_id>/*.md` for failing (or borderline) cases. Use the router fields (path_scope, layers), retrieval fields (layer_outcomes, chunk_count), and evidence gate fields (failure_reasons) to see why a case failed.

3. **Run against a real local repo.**
   Set `CODE_QA_REPO_PATH=/path/to/repo`, then run the same command. Compare the behaviour to the fixture run.

4. **Compare mismatches.**
   Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).

5. **Adjust and re-run.**
   Update the router, retrieval, or evidence policy; add or edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.

---

## 10. Changed Files Index

| File | Purpose |
|------|---------|
| `tests/fixtures/code_qa_repo/app/main.py` | Fixture entrypoint. |
| `tests/fixtures/code_qa_repo/api/orders.py` | Fixture API handlers. |
| `tests/fixtures/code_qa_repo/services/order_service.py` | Fixture service layer. |
| `tests/fixtures/code_qa_repo/repositories/order_repository.py` | Fixture repository. |
| `tests/fixtures/code_qa_repo/domain/order.py` | Fixture domain model. |
| `tests/fixtures/code_qa_repo/tests/test_order_service.py` | Fixture tests. |
| `tests/fixtures/code_qa_repo/utils/helpers.py` | Fixture utility. |
| `tests/golden/code_qa/README.md` | Golden case format description. |
| `tests/golden/code_qa/cases.yaml` | Golden cases for all MVP scenarios. |
| `tests/code_qa_eval/__init__.py` | Package init. |
| `tests/code_qa_eval/config.py` | `EvalConfig`: repo path (fixture vs `CODE_QA_REPO_PATH`), artifacts dir, golden path. |
| `tests/code_qa_eval/golden_loader.py` | Loads and parses golden cases from YAML. |
| `tests/code_qa_eval/runner.py` | `run_eval`: index repo, run pipeline, compare to golden; `_compare` logic. |
| `tests/code_qa_eval/artifacts.py` | `dump_run_artifact` (md + json), `write_batch_summary`. |
| `tests/code_qa_eval/run.py` | CLI entrypoint: load config, run eval, write artifacts and summary. |
| `tests/code_qa_eval/test_eval_harness.py` | Tests for loader, compare, config, fixture-mode run. |
| `pytest.ini` | Added marker `code_qa_eval`. |
| `iteration2_calibration_harness_report.md` | This report. |

No changes were made to the production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.