agent/iteration2_calibration_harness_report.md
2026-03-12 16:55:23 +03:00

Iteration 2 — Calibration Harness Report

1. Executive Summary

This iteration adds calibration and evaluation infrastructure for the canonical CODE_QA pipeline. The pipeline remains test-first and is not integrated into the UI or production runtime.

Added:

  • A small deterministic fixture repository (tests/fixtures/code_qa_repo/) for reproducible tests.
  • Golden case format and initial cases for OPEN_FILE, EXPLAIN, FIND_TESTS, FIND_ENTRYPOINTS, and GENERAL_QA (positive, borderline, negative).
  • An evaluation harness that indexes a repo (fixture or user-provided path), runs golden cases through CodeQAPipelineRunner with the real retrieval adapter (RagDbAdapter), and compares actual vs expected (intent, sub_intent, answer_mode, path_scope, symbol_candidates).
  • Diagnostics artifact dumping per run (Markdown + JSON) under tests/artifacts/code_qa_eval/<run_id>/.
  • A batch evaluation summary (Markdown table + failure list) for manual review.
  • Two modes: fixture repo by default; optional CODE_QA_REPO_PATH for a local real repository.

Now possible:

  • Run the canonical pipeline end-to-end on the fixture repo with real indexing and retrieval.
  • Run the same harness against a user-provided repo path (no hardcoded external repo).
  • Inspect per-case diagnostics and batch summary to tune routing, retrieval, evidence gate, and answer mode.

Still manual / out of scope:

  • Tuning prompts and retrieval heuristics (harness supports observation, not automatic tuning).
  • UI integration, docs runtime retrieval, production router replacement.
  • Exact LLM answer matching (we assert only routing, retrieval alignment, evidence sufficiency, and answer mode).

2. Fixture Repository

Location: tests/fixtures/code_qa_repo/

Structure:

tests/fixtures/code_qa_repo/
├── app/
│   └── main.py          # Entrypoint: create_app(), app.run()
├── api/
│   └── orders.py        # Handlers: create_order, get_order; OrderService, OrderRepository
├── services/
│   └── order_service.py  # OrderService: create_order, get_order
├── repositories/
│   └── order_repository.py  # OrderRepository: save, find_by_id
├── domain/
│   └── order.py         # Order: id, product_id, quantity, status
├── tests/
│   └── test_order_service.py  # test_create_order, test_get_order_returns_saved_order
└── utils/
    └── helpers.py       # format_order_id
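The entrypoint file could look something like the following minimal sketch. These are hypothetical contents: the tree above only guarantees that create_app() and app.run() exist, so the App class and its behaviour here are assumptions for illustration.

```python
# tests/fixtures/code_qa_repo/app/main.py (hypothetical sketch).
# Deliberately tiny so FIND_ENTRYPOINTS routing is unambiguous.


class App:
    """Minimal stand-in application object (assumed, not a real framework)."""

    def __init__(self) -> None:
        self.running = False

    def run(self) -> None:
        # In the real fixture this would start the app; here it flips a flag.
        self.running = True


def create_app() -> App:
    """Single, clearly named factory the router can point at."""
    return App()


if __name__ == "__main__":
    app = create_app()
    app.run()
```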

Purpose of each file:

| File | Purpose |
| --- | --- |
| app/main.py | Single clear entrypoint for FIND_ENTRYPOINTS and “open main” style queries. |
| api/orders.py | API/handler layer; distinct symbols create_order, get_order, create_app. |
| services/order_service.py | Service calling the repository; symbol OrderService. |
| repositories/order_repository.py | Persistence; symbol OrderRepository. |
| domain/order.py | Domain model; symbol Order. |
| tests/test_order_service.py | Tests tied to production code for FIND_TESTS. |
| utils/helpers.py | Extra module for bounded GENERAL_QA and path/symbol variety. |

Scenarios covered:

  • File by path: app/main.py, api/orders.py (OPEN_FILE).
  • Symbol explanation: Order, OrderService, create_order (EXPLAIN).
  • Import/call relations: service → repository → domain (EXPLAIN / GENERAL_QA).
  • Entrypoint: app/main.py (FIND_ENTRYPOINTS).
  • Related tests: tests/test_order_service.py for OrderService/Order (FIND_TESTS).
  • Fallback: “Что делает этот проект?” (“What does this project do?”) (GENERAL_QA with bounded context).

The fixture is small and structured so routing and retrieval expectations are unambiguous for calibration.


3. Real Adapter Integration

The canonical pipeline runs with the existing retrieval/index stack:

  • Indexing: RagSessionIndexer (in tests/pipeline_intent_rag/helpers/repo_indexer.py) uses RagService and LocalRepoFileCollector to index a directory. The fixture (or CODE_QA_REPO_PATH) is indexed once per eval run.
  • Retrieval: RagDbAdapter (in tests/pipeline_intent_rag/helpers/rag_db_adapter.py) implements the pipeline's RetrievalAdapter protocol: retrieve_with_plan, retrieve_exact_files, hydrate_resolved_symbol_sources, force_symbol_context_c0, consume_retrieval_report. It uses RagRepository and the same layer logic as the rest of the project.
  • Pipeline: CodeQAPipelineRunner (in app/modules/rag/code_qa_pipeline/pipeline.py) takes IntentRouterV2 and this adapter, builds RetrievalRequest from the router, runs retrieval, builds EvidenceBundle, runs the evidence gate, and produces diagnostics.
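The adapter contract named above can be pictured as a structural Protocol. The five method names come from this report; the signatures, parameter names, and the FakeAdapter stand-in below are assumptions, not the project's actual code.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RetrievalAdapter(Protocol):
    """Sketch of the protocol RagDbAdapter implements (signatures assumed)."""

    def retrieve_with_plan(self, request: Any) -> list[Any]: ...
    def retrieve_exact_files(self, paths: list[str]) -> list[Any]: ...
    def hydrate_resolved_symbol_sources(self, symbols: list[str]) -> list[Any]: ...
    def force_symbol_context_c0(self, symbol: str) -> Any: ...
    def consume_retrieval_report(self) -> dict[str, Any]: ...


class FakeAdapter:
    """In-memory stand-in, useful for unit-testing the harness without a DB."""

    def retrieve_with_plan(self, request: Any) -> list[Any]:
        return []

    def retrieve_exact_files(self, paths: list[str]) -> list[Any]:
        return []

    def hydrate_resolved_symbol_sources(self, symbols: list[str]) -> list[Any]:
        return []

    def force_symbol_context_c0(self, symbol: str) -> Any:
        return None

    def consume_retrieval_report(self) -> dict[str, Any]:
        return {"layers": []}
```

A structural protocol keeps the harness decoupled from the DB-backed adapter: anything with these five methods can be dropped in for unit tests.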

Fixture repo: The harness indexes tests/fixtures/code_qa_repo by default and runs all golden cases against that index. No external repo is required.

User-provided repo: Set CODE_QA_REPO_PATH to a local directory. The harness indexes that path and runs the same golden cases (or the user can add repo-specific cases). Optional CODE_QA_PROJECT_ID sets the project id for the session. The codebase does not depend on any private or external repo being present.


4. Golden Case Format

Location: tests/golden/code_qa/
File: cases.yaml

Fields per case:

| Field | Meaning |
| --- | --- |
| id | Unique case id. |
| query | User query text. |
| expected_intent | Expected top-level intent (e.g. CODE_QA). |
| expected_sub_intent | OPEN_FILE \| EXPLAIN \| FIND_TESTS \| FIND_ENTRYPOINTS \| GENERAL_QA. |
| expected_answer_mode | normal \| degraded \| insufficient. |
| expected_target_hint | Optional: path, symbol, or test-like. |
| expected_path_scope_contains | Optional list of substrings that must appear in path_scope. |
| expected_symbol_candidates_contain | Optional list of symbols that must appear in symbol_candidates. |
| expected_layers | Optional list of layer ids expected in the retrieval plan. |
| notes | Optional: borderline, negative, or calibration hint. |
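Put together, a single case in cases.yaml might look like this. The field names follow the table above; the concrete id, query, and notes values are invented for illustration:

```yaml
- id: explain_order_service_positive        # unique case id (invented)
  query: "Explain what OrderService does"
  expected_intent: CODE_QA
  expected_sub_intent: EXPLAIN
  expected_answer_mode: normal
  expected_target_hint: symbol              # optional
  expected_path_scope_contains:             # optional
    - services/order_service.py
  expected_symbol_candidates_contain:       # optional
    - OrderService
  notes: calibration hint - symbol should resolve without an explicit path
```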

Expected results: We assert routing (intent, sub_intent), retrieval alignment (path_scope, symbol_candidates, layers when specified), evidence sufficiency (via answer_mode), and diagnostics shape. We do not assert exact LLM wording.

Not asserted (yet): Exact chunk content, relation counts, or full evidence bundle structure beyond what drives answer_mode and target hints.
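The assertion policy above can be sketched as a small compare step. This is a hypothetical function with assumed field names, not the project's actual _compare logic in tests/code_qa_eval/runner.py:

```python
def compare_case(expected: dict, actual: dict) -> list[str]:
    """Return a list of mismatch messages; an empty list means the case passed.

    Only routing, retrieval alignment, and answer mode are checked, never
    LLM wording, matching the policy in this section. Field names assumed.
    """
    mismatches: list[str] = []

    # Exact-match fields: routing and evidence-derived answer mode.
    for field in ("intent", "sub_intent", "answer_mode"):
        exp = expected.get(f"expected_{field}")
        if exp is not None and actual.get(field) != exp:
            mismatches.append(f"{field}: expected {exp!r}, got {actual.get(field)!r}")

    # Containment fields: every expected fragment must appear somewhere.
    scope = actual.get("path_scope", [])
    for fragment in expected.get("expected_path_scope_contains", []):
        if not any(fragment in path for path in scope):
            mismatches.append(f"path_scope missing {fragment!r}")

    symbols = actual.get("symbol_candidates", [])
    for symbol in expected.get("expected_symbol_candidates_contain", []):
        if symbol not in symbols:
            mismatches.append(f"symbol_candidates missing {symbol!r}")

    return mismatches
```

Returning messages rather than raising lets the harness collect every mismatch per case for the artifacts instead of stopping at the first one.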


5. Golden Runner / Evaluation Harness

Entrypoints:

  • Programmatic: tests.code_qa_eval.runner.run_eval(config) — runs all golden cases and returns list[EvalCaseResult].
  • CLI: python -m tests.code_qa_eval.run (from project root) — loads config, runs eval, writes artifacts and summary, exits 0 only if all pass.

Fixture mode (default):

  1. Do not set CODE_QA_REPO_PATH.
  2. Run: python -m tests.code_qa_eval.run (or call run_eval(EvalConfig.from_env())).
  3. Repo used: tests/fixtures/code_qa_repo. It is indexed once; then each golden case is run through the pipeline and compared to expected.

User-provided repo:

  1. Set CODE_QA_REPO_PATH to the repository root (e.g. export CODE_QA_REPO_PATH=/path/to/your/repo).
  2. Optionally set CODE_QA_PROJECT_ID.
  3. Run the same command. The harness indexes that path and runs the same golden cases (or you can point to a different cases.yaml by changing EvalConfig.golden_cases_path in code).
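The fixture-vs-real-repo switch can be sketched as a small config object. This is a hypothetical dataclass; the real EvalConfig in tests/code_qa_eval/config.py may have different fields and defaults:

```python
import os
from dataclasses import dataclass
from pathlib import Path

FIXTURE_REPO = Path("tests/fixtures/code_qa_repo")


@dataclass(frozen=True)
class EvalConfig:
    """Sketch of the eval configuration (field names assumed)."""

    repo_path: Path
    project_id: str
    golden_cases_path: Path = Path("tests/golden/code_qa/cases.yaml")
    artifacts_dir: Path = Path("tests/artifacts/code_qa_eval")

    @classmethod
    def from_env(cls) -> "EvalConfig":
        # CODE_QA_REPO_PATH selects a real repo; unset means fixture mode.
        repo = os.environ.get("CODE_QA_REPO_PATH")
        project_id = os.environ.get("CODE_QA_PROJECT_ID", "code-qa-eval")
        return cls(repo_path=Path(repo) if repo else FIXTURE_REPO,
                   project_id=project_id)
```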

Outputs:

  • Per case: under tests/artifacts/code_qa_eval/<run_id>/: <case_id>.md and <case_id>.json (query, expected/actual, router, retrieval, evidence gate, timings, mismatches).
  • Batch: tests/artifacts/code_qa_eval/summary_<run_id>.md — table (case id, query, expected/actual scenario, target, evidence, answer mode, pass/fail) and a failure list.
  • Exit code: 0 if all cases pass, 1 otherwise; failures are printed to stderr.
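The exit-code contract above can be sketched as a small finishing step. The helper and the result-dict shape ({'case_id', 'passed', 'mismatches'}) are assumptions for illustration:

```python
import sys


def finish_run(results: list[dict]) -> int:
    """Print failures to stderr and return the process exit code.

    Mirrors the contract above: 0 only when every case passed.
    The result dict shape is an assumption, not the project's actual type.
    """
    failures = [r for r in results if not r["passed"]]
    for failure in failures:
        print(f"FAIL {failure['case_id']}: {'; '.join(failure['mismatches'])}",
              file=sys.stderr)
    print(f"{len(results) - len(failures)}/{len(results)} cases passed")
    return 1 if failures else 0
```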

6. Diagnostics Artifacts

Generated artifacts:

  • Per run (per case): <run_id>/<case_id>.md and <case_id>.json.
  • Batch: summary_<run_id>.md in tests/artifacts/code_qa_eval/.

Location: tests/artifacts/code_qa_eval/ (created if missing).

Markdown (per case) contains:

  • Query, expected (intent, sub_intent, answer_mode), actual (intent, sub_intent, answer_mode, evidence_gate_passed, evidence_count).
  • Pass/fail and list of mismatches.
  • Router: path_scope, layers.
  • Retrieval: requested_layers, chunk_count, layer_outcomes.
  • Evidence gate: failure_reasons.
  • Timings (ms).

JSON (per case) adds machine-readable detail: full expected/actual, passed, mismatches, router_result, retrieval_request, per_layer_outcome, failure_reasons, timings_ms.

Useful for calibration:

  • Router: path_scope and layers — confirm OPEN_FILE vs EXPLAIN vs FIND_* routing and plan.
  • Retrieval: layer_outcomes and chunk_count — see which layers returned hits.
  • Evidence gate: failure_reasons and evidence_count — see why answer_mode is degraded/insufficient.
  • Mismatches: quick list of what to fix (routing vs retrieval vs gate).

Example snippet (Markdown; the query below means “Open the file app/main.py”):

# open_file_main_positive

## Query
Открой файл app/main.py

## Expected
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal

## Actual
- intent: CODE_QA, sub_intent: OPEN_FILE
- answer_mode: normal
- evidence_gate_passed: True
- evidence_count: 2

## Result
PASS

7. Tests Added

| File | What it validates |
| --- | --- |
| tests/code_qa_eval/test_eval_harness.py | Golden loader, compare logic, config, fixture-mode run structure. |

Test groups:

  • Golden loader: test_load_golden_cases_returns_list — loads cases.yaml, checks count and field presence (id, query, expected_intent, expected_sub_intent, expected_answer_mode).
  • Compare logic: test_compare_passed_when_all_match, test_compare_fails_on_intent_mismatch, test_compare_fails_on_answer_mode_mismatch, test_compare_path_scope_contains — assert pass/fail and mismatch messages for intent, sub_intent, answer_mode, path_scope.
  • Config: test_eval_config_fixture_mode_by_default — default config uses fixture path, golden path, and artifacts dir under tests/.
  • Fixture-mode run: test_run_eval_fixture_mode_structure — runs run_eval(config) with fixture config; asserts result list and that each item is EvalCaseResult with case, pipeline_result, passed, mismatches. Skips if DB or dependencies (e.g. sqlalchemy) are unavailable.

Modes: Loader and compare tests are unit (no DB). Config test uses paths only. Fixture-mode test is integration-style with real adapter and DB; it is skipped when the environment cannot connect or import.


8. Known Limitations

  • LLM answer: The harness does not call the LLM; answer_mode is derived from the evidence gate only. No assertion on final answer text.
  • Routing stability: Golden expectations (especially borderline/negative) may need manual adjustment as the router or retrieval changes.
  • Real DB required: Full eval (index + retrieve) needs a configured DB; otherwise the integration test and CLI run skip or fail. No in-memory SQLite path is implemented in this iteration.
  • Single session per run: Each run indexes the repo once and reuses one RAG session for all cases. Cross-session or re-index behaviour is not exercised.
  • Docs / cross-domain: Golden cases and harness are CODE_QA only; docs retrieval and cross-domain flows are out of scope.
  • Performance: No timings or regression assertions; artifacts are for manual inspection and tuning.

9. How to Use for Manual Calibration

  1. Run fixture evaluation
    From project root: python -m tests.code_qa_eval.run. Check exit code and console output (pass/fail counts and failure lines).

  2. Inspect diagnostics
    Open tests/artifacts/code_qa_eval/<run_id>/*.md for failing (or borderline) cases. Use router (path_scope, layers), retrieval (layer_outcomes, chunk_count), and evidence gate (failure_reasons) to see why a case failed.

  3. Run against a real local repo
    Set CODE_QA_REPO_PATH=/path/to/repo, then run the same command. Compare behaviour to the fixture run.

  4. Compare mismatches
    Use the batch summary and per-case mismatches to decide what to tune: intent/sub_intent (router/prompts), path_scope/symbol_candidates (router or retrieval), or evidence thresholds (evidence gate).

  5. Adjust and re-run
    Update router, retrieval, or evidence policy; add/edit golden cases if needed; re-run the harness and confirm improvements in the summary and artifacts.


10. Changed Files Index

| File | Purpose |
| --- | --- |
| tests/fixtures/code_qa_repo/app/main.py | Fixture entrypoint. |
| tests/fixtures/code_qa_repo/api/orders.py | Fixture API handlers. |
| tests/fixtures/code_qa_repo/services/order_service.py | Fixture service layer. |
| tests/fixtures/code_qa_repo/repositories/order_repository.py | Fixture repository. |
| tests/fixtures/code_qa_repo/domain/order.py | Fixture domain model. |
| tests/fixtures/code_qa_repo/tests/test_order_service.py | Fixture tests. |
| tests/fixtures/code_qa_repo/utils/helpers.py | Fixture utility. |
| tests/golden/code_qa/README.md | Golden case format description. |
| tests/golden/code_qa/cases.yaml | Golden cases for all MVP scenarios. |
| tests/code_qa_eval/__init__.py | Package init. |
| tests/code_qa_eval/config.py | EvalConfig: repo path (fixture vs CODE_QA_REPO_PATH), artifacts dir, golden path. |
| tests/code_qa_eval/golden_loader.py | Load and parse golden cases from YAML. |
| tests/code_qa_eval/runner.py | run_eval: index repo, run pipeline, compare to golden; _compare logic. |
| tests/code_qa_eval/artifacts.py | dump_run_artifact (md+json), write_batch_summary. |
| tests/code_qa_eval/run.py | CLI entrypoint: load config, run eval, write artifacts and summary. |
| tests/code_qa_eval/test_eval_harness.py | Tests for loader, compare, config, fixture-mode run. |
| pytest.ini | Added marker code_qa_eval. |
| iteration2_calibration_harness_report.md | This report. |

No changes were made to production router, UI, or docs retrieval. The canonical pipeline and existing retrieval/index stack are reused; the harness is test-side only.