CODE_QA evaluation harness
Runs the canonical CODE_QA pipeline (IntentRouterV2 → retrieval → evidence gate → diagnostics) over golden cases and writes artifacts for calibration.
Modes
- Fixture (default): Uses tests/pipeline_setup/suite_01_synthetic/fixtures/code_qa_repo. No env vars required.
- Local repo: Set CODE_QA_REPO_PATH to a directory; optionally set CODE_QA_PROJECT_ID.
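The mode selection described above can be pictured with a minimal sketch. This is not the harness's actual code: the helper name resolve_mode and its return shape are hypothetical, and the sketch assumes that an unset or empty CODE_QA_REPO_PATH means fixture mode.

```python
import os
from pathlib import Path

# Fixture repo path as given in this README.
FIXTURE_REPO = Path("tests/pipeline_setup/suite_01_synthetic/fixtures/code_qa_repo")


def resolve_mode(env=None):
    """Hypothetical helper: return (mode, repo_path, project_id) from env vars."""
    env = os.environ if env is None else env
    repo_path = env.get("CODE_QA_REPO_PATH")
    if not repo_path:
        # Assumption: no override set means the bundled fixture repo is used.
        return "fixture", FIXTURE_REPO, None
    # Local repo mode; CODE_QA_PROJECT_ID is optional and may be None.
    return "local", Path(repo_path), env.get("CODE_QA_PROJECT_ID")
```

Passing a plain dict instead of reading os.environ directly keeps the sketch easy to exercise without touching the real environment.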
Run
From the project root (agent repo):
python -m tests.pipeline_setup.suite_01_synthetic.code_qa_eval.run
Requires a configured database (same as pipeline_intent_rag router_rag tests).
Outputs:
- tests/pipeline_setup/test_results/code_qa_eval/<run_id>/*.md and *.json per case
- tests/pipeline_setup/test_results/code_qa_eval/summary_<run_id>.md batch summary
Exit code 0 if all golden cases pass, 1 otherwise.
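For scripting around the harness (for example in CI), the exit code and output layout above could be consumed roughly as follows. This is a sketch under assumptions: the helper names run_eval, case_artifacts, and summary_path are hypothetical, and the caller is assumed to already know the run_id, since this README does not say how it is generated.

```python
import subprocess
import sys
from pathlib import Path

# Output root and module path as given in this README.
RESULTS_ROOT = Path("tests/pipeline_setup/test_results/code_qa_eval")
EVAL_CMD = [sys.executable, "-m", "tests.pipeline_setup.suite_01_synthetic.code_qa_eval.run"]


def run_eval(cmd=None):
    """Run the harness from the project root; True if it exits with code 0."""
    result = subprocess.run(EVAL_CMD if cmd is None else cmd)
    return result.returncode == 0


def case_artifacts(run_id, root=RESULTS_ROOT):
    """Return the per-case .md and .json files for a run, sorted by path."""
    run_dir = Path(root) / run_id
    return sorted(p for p in run_dir.glob("*") if p.suffix in {".md", ".json"})


def summary_path(run_id, root=RESULTS_ROOT):
    """Return the expected location of the batch summary for a run."""
    return Path(root) / f"summary_{run_id}.md"
```

A caller might invoke run_eval() first and then read summary_path(run_id) only when the run failed, so that passing runs stay quiet.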
Golden cases
Edit tests/pipeline_setup/suite_01_synthetic/golden/code_qa/cases.yaml to add or change cases. See tests/pipeline_setup/suite_01_synthetic/golden/code_qa/README.md for the field format.
Tests
pytest tests/pipeline_setup/suite_01_synthetic/code_qa_eval/ -v
The fixture-mode integration test (test_run_eval_fixture_mode_structure) is skipped if the DB or dependencies are not available.