# RAG Agent (Postgres)

Custom RAG agent that indexes text files from a git repository into Postgres and answers queries using retrieval + LLM generation.

**Changes are always in the context of a Story**: the unit of work is the story, not individual commits. The agent indexes **all changes from all commits** in the story range (`base_ref..head_ref`); per-commit indexing is not used.

## Quick start

1. (Optional) Run Postgres and the app via Docker (clone the repo first):
   - `git clone git@git.lesha.spb.ru:alex/RagAgent.git && cd RagAgent`
   - `docker compose up -d` — starts Postgres and the RAG app in one network `rag_net`; app connects to DB at host `postgres`.
   - On first start (empty DB), scripts in `docker/postgres-init/` run automatically (extension + tables). To disable, comment out the init volume in `docker-compose.yml`.
   - Default DSN inside the app: `postgresql://rag:rag_secret@postgres:5432/rag`. Override with `POSTGRES_*` and `RAG_REPO_PATH` (path to your knowledge-base repo, mounted into the app container).
   - Run commands: `docker compose run --rm app index --story my-branch`, `docker compose run --rm app ask "Question?"`.
2. Configure environment variables:
   - `RAG_REPO_PATH` — path to git repo with text files
   - `RAG_DB_DSN` — Postgres DSN (e.g. `postgresql://rag:rag_secret@localhost:5432/rag`)
   - `RAG_EMBEDDINGS_DIM` — embedding vector dimension: **1024** for GigaChat Embeddings (default), 1536 for OpenAI
3. Create DB schema (only if not using Docker, or if init was disabled):
   - `python scripts/create_db.py` (or `psql "$RAG_DB_DSN" -f scripts/schema.sql`)
4. Index files for a story (e.g. branch name as story slug). Use the **full story range** so all commits in the story are included:
   - `rag-agent index --story my-branch --changed --base-ref main --head-ref HEAD`
   - Or `--base-ref auto` to use merge-base(default-branch, head-ref) as the start of the story.
5.
   Ask a question (optionally scoped to a story):
   - `rag-agent ask "What is covered?"`
   - `rag-agent ask "What is covered?" --story my-branch`

## Webhook: index on push to remote

When the app runs as a service in Docker, it can start a **webhook server** so that each push to the remote repository triggers a pull and incremental indexing.

1. Start the stack with the webhook server (default in Docker):
   - `docker compose up -d` — app runs `rag-agent serve` and listens on port 8000.
   - Repo is mounted at `RAG_REPO_PATH` (e.g. `/data`) **writable**, so the container can run `git fetch` + `git merge --ff-only` to pull changes.
2. Clone the knowledge-base repo into the mounted directory (once), e.g. on the host: `git clone ./data` so that `./data` is the worktree (or set `RAG_REPO_PATH` to that path and mount it).
3. In GitHub (or GitLab) add a **Webhook**:
   - URL: `http://:8000/webhook` (use HTTPS in production and put a reverse proxy in front).
   - Content type: `application/json`.
   - Secret: set a shared secret and export `WEBHOOK_SECRET` in the app environment (Docker: in `docker-compose.yml` or `.env`). If `WEBHOOK_SECRET` is empty, signature is not checked.
4. On each push to a branch, the server receives the webhook, pulls that branch into the worktree, and runs `rag-agent index --story --changed --base-ref --head-ref ` so only changed files are re-indexed.

Health check: `GET http://:8000/health` → `ok`. Port is configurable via `WEBHOOK_PORT` (default 8000) in docker-compose.

### Webhook diagnostics (202 Accepted but no new rows in DB)

1. **Logs** — After a push, check app logs. Each webhook logs `pull_and_index started branch=… repo_path=…`; then one of:
   - `not a git repo or missing` — `/data` in the container is not a git clone; clone the repo into the mounted dir.
   - `git fetch failed` — SSH/network (see `docker/ssh/README.md`) or wrong remote.
   - `git checkout … failed` — branch missing in the clone.
   - `git merge --ff-only failed` — non–fast-forward (e.g.
     force-push); index is skipped. Use a normal push or re-clone.
   - `no new commits for branch=…` — merge was a no-op (already up to date); nothing to index.
   - `running index story=…` then `index completed` — index ran; check tables for that story.
   - `index failed` — stderr shows the `rag-agent index` error (e.g. DB, embeddings, repo path).

   ```bash
   docker compose logs -f app
   # or: docker logs -f rag-agent
   ```

   Trigger a push and watch for the lines above.
2. **Story and tables** — Rows are per **story** (branch name). Query by story, e.g. `SELECT * FROM stories;` then `SELECT * FROM chunks WHERE story_id = (SELECT id FROM stories WHERE slug = 'main');`.
3. **Manual index** — Run index inside the container to confirm DB and repo work:

   ```bash
   docker compose exec app rag-agent index --story main --changed --base-ref main --head-ref HEAD
   ```

   If this inserts rows, the issue is in the webhook path (fetch/merge/refs).
4. **Allowed extensions** — Only `.md`, `.txt`, `.rst` (or `RAG_ALLOWED_EXTENSIONS`) are indexed; other files are skipped.
5. **"expected 1536 dimensions, not 1024"** — GigaChat Embeddings returns 1024-dim vectors; the default is now 1024. If the DB was created earlier with vector(1536), drop and recreate the tables so the app can create them with 1024: `psql "$RAG_DB_DSN" -c "DROP TABLE IF EXISTS chunks; DROP TABLE IF EXISTS documents;"` then restart the app (ensure_schema will recreate the tables).

## Git hook (index on commit)

Install the post-commit hook so changed files are indexed after each commit:

```bash
cp scripts/post-commit .git/hooks/post-commit && chmod +x .git/hooks/post-commit
```

Story for the commit is taken from (in order): env `RAG_STORY`, file `.rag-story` in repo root (one line = slug), or current branch name.

## Git hook (server-side)

Use `scripts/post-receive` in the **bare repo** on the server so that pushes trigger indexing.

1. On the server, create a **non-bare clone** (worktree) that the hook will update and use for indexing, e.g.
   `git clone /path/to/repo.git /var/rag-worktree/repo`.
2. In the bare repo, install the hook: `cp /path/to/RagAgent/scripts/post-receive /path/to/repo.git/hooks/post-receive && chmod +x .../post-receive`.
3. Set env for the hook (e.g. in the hook or via systemd/sshd): `RAG_REPO_PATH=/var/rag-worktree/repo`, `RAG_DB_DSN=...`, `RAG_EMBEDDINGS_DIM=...`. Optionally `RAG_AGENT_VENV` (path to venv with `rag-agent`) or `RAG_AGENT_SRC` + `RAG_AGENT_PYTHON` for `python -m rag_agent.cli`.
4. On each push the hook updates the worktree to the new commit, then runs `rag-agent index --changed --base-ref main --head-ref newrev --story ` so the story contains **all commits** on the branch (from main to newrev). Story is taken from the ref name (e.g. `refs/heads/main` → `main`).

## DB structure

- **stories** — story slug (e.g. branch name); documents and chunks are tied to a story. Optional: `indexed_base_ref`, `indexed_head_ref`, `indexed_at` record the git range that was indexed (all commits in that range belong to the story).
- **documents** — path + version per story; unique `(story_id, path)`.
- **chunks** — text chunks with embeddings (pgvector), plus:
  - `start_line`, `end_line` — position in the source file (for requirements/use-case files).
  - `change_type` — `added` | `modified` | `unchanged` (relative to base ref when indexing with `--changed`).
  - `previous_content` — for `modified` chunks, the content before the change (for test-case generation).

Indexing is **always per-story**: `base_ref..head_ref` defines the set of commits that belong to the story. Use `--base-ref main` (or `auto`) and `--head-ref HEAD` so the story contains all commits on the branch, not a single commit. When you run `index --changed`, the base ref is compared to head; each chunk is marked as added, modified, or unchanged.

### What changed in a story (for test cases)

To get only the chunks that were added or modified in a story (e.g.
to generate test cases for the changed part):

```python
from rag_agent.index import fetch_changed_chunks

# Assumes `conn` is an open Postgres connection and `story_id` is the id
# of the story's row in the `stories` table.
changed = fetch_changed_chunks(conn, story_id)
for r in changed:
    # r.path, r.content, r.change_type, r.start_line, r.end_line, r.previous_content
    ...
```

Scripts: `scripts/create_db.py` (Python, uses `ensure_schema` and `RAG_*` env), `scripts/schema.sql` (raw SQL).

## Embeddings (GigaChat)

If `GIGACHAT_CREDENTIALS` is set (e.g. in `.env` for local runs), embeddings use GigaChat API; otherwise the stub client is used. Optional env: `GIGACHAT_EMBEDDINGS_MODEL` (default `Embeddings`), `GIGACHAT_VERIFY_SSL` (`true`/`false`). Ensure `RAG_EMBEDDINGS_DIM` matches the model output (see GigaChat docs).

## Notes

- LLM client is still a stub; replace it in `src/rag_agent/agent/pipeline.py` for real answers.
- This project requires Postgres with the `pgvector` extension.
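
A note on the `WEBHOOK_SECRET` check from the webhook section: the sketch below shows one way a GitHub-style signature could be verified, and is not taken from this project's code. It assumes the sender signs the raw request body with HMAC-SHA256 and sends the result as `sha256=<hex>` in the `X-Hub-Signature-256` header, which is GitHub's scheme; GitLab passes its token differently, so this would not apply there as-is. The function name `verify_signature` is hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Check a GitHub-style `sha256=<hex>` signature against the raw body.

    Hypothetical helper: the real server may be structured differently.
    """
    if not secret:
        # Mirrors the documented behaviour: an empty secret disables the check.
        return True
    prefix = "sha256="
    if not signature_header or not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # Constant-time comparison; bytes are used so odd header values cannot raise.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

A handler would need to pass the raw request bytes (before any JSON parsing), since re-serialising the payload can change the bytes and invalidate the signature.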