Compare commits

..

10 Commits

1325 changed files with 2494 additions and 52 deletions

4
.gitignore vendored Normal file

@@ -0,0 +1,4 @@
src/rag_agent/.env
.env
docker/ssh
docker/postgres_test_data

22
Dockerfile Normal file

@@ -0,0 +1,22 @@
# RAG Agent app. Build from repo root (clone git@git.lesha.spb.ru:alex/RagAgent.git then docker compose build).
FROM python:3.12-slim
WORKDIR /app
# Install git for optional in-image clone; app is usually COPY'd from build context
RUN apt-get update -qq && apt-get install -y --no-install-recommends git openssh-client \
&& rm -rf /var/lib/apt/lists/*
# Copy repo (when built from cloned repo: docker compose build)
COPY pyproject.toml ./
COPY src ./src
COPY README.md ./
RUN pip install --no-cache-dir -e .
# Default: run webhook server (override in compose or when running)
ENV RAG_DB_DSN=""
ENV RAG_REPO_PATH="/data"
EXPOSE 8000
ENTRYPOINT ["rag-agent"]
CMD ["serve", "--host", "0.0.0.0", "--port", "8000"]

126
README.md

@@ -1,23 +1,77 @@
# RAG Agent (Postgres)
Custom RAG agent that indexes text files from a git repository into Postgres
and answers queries using retrieval + LLM generation. **Changes are always in the context of a Story**: the unit of work is the story, not individual commits. The agent indexes **all changes from all commits** in the story range (base_ref..head_ref); per-commit indexing is not used.
## Quick start
1. (Optional) Run Postgres and the app via Docker (clone the repo first):
- `git clone git@git.lesha.spb.ru:alex/RagAgent.git && cd RagAgent`
- `docker compose up -d` — starts Postgres and the RAG app in one network `rag_net`; app connects to DB at host `postgres`.
- On first start (empty DB), scripts in `docker/postgres-init/` run automatically (extension + tables). To disable, comment out the init volume in `docker-compose.yml`.
- Default DSN inside the app: `postgresql://rag:rag_secret@postgres:5432/rag`. Override with `POSTGRES_*` and `RAG_REPO_PATH` (path to your knowledge-base repo, mounted into the app container).
- Run commands: `docker compose run --rm app index --story my-branch`, `docker compose run --rm app ask "Question?"`.
2. Configure environment variables:
- `RAG_REPO_PATH` — path to git repo with text files
- `RAG_DB_DSN` — Postgres DSN (e.g. `postgresql://rag:rag_secret@localhost:5432/rag`)
- `RAG_EMBEDDINGS_DIM` — embedding vector dimension: **1024** for GigaChat Embeddings (default), 1536 for OpenAI
3. Create DB schema (only if not using Docker, or if init was disabled):
- `python scripts/create_db.py` (or `psql "$RAG_DB_DSN" -f scripts/schema.sql`)
4. Index files for a story (e.g. branch name as story slug). Use the **full story range** so all commits in the story are included:
- `rag-agent index --story my-branch --changed --base-ref main --head-ref HEAD`
- Or `--base-ref auto` to use merge-base(default-branch, head-ref) as the start of the story.
5. Ask a question (optionally scoped to a story):
- `rag-agent ask "What is covered?"` - `rag-agent ask "What is covered?"`
- `rag-agent ask "What is covered?" --story my-branch` - `rag-agent ask "What is covered?" --story my-branch`
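The `--base-ref auto` option is described above as merge-base(default-branch, head-ref). A minimal sketch of how such a resolution could work is shown below. The helper names `run_git` and `resolve_base_ref` and the default branch name `main` are assumptions for illustration, not the project's actual code; `git merge-base` itself is a standard git command.

```python
import subprocess
from typing import Callable, Sequence


def run_git(args: Sequence[str], repo_path: str = ".") -> str:
    """Run a git command inside repo_path and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo_path, *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def resolve_base_ref(
    base_ref: str,
    head_ref: str = "HEAD",
    default_branch: str = "main",  # assumption: stories branch off "main"
    git: Callable[[Sequence[str]], str] = run_git,
) -> str:
    """Return base_ref unchanged, or the merge-base commit when it is "auto"."""
    if base_ref != "auto":
        return base_ref
    # The merge-base is the commit where the story branch forked off.
    return git(["merge-base", default_branch, head_ref])
```

The `git` parameter is injectable so the function can be exercised without a real repository.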
## Webhook: index on push to remote
When the app runs as a service in Docker, it can start a **webhook server** so that each push to the remote repository triggers a pull and incremental indexing.
1. Start the stack with the webhook server (default in Docker):
- `docker compose up -d` — app runs `rag-agent serve` and listens on port 8000.
- Repo is mounted at `RAG_REPO_PATH` (e.g. `/data`) **writable**, so the container can run `git fetch` + `git merge --ff-only` to pull changes.
2. Clone the knowledge-base repo into the mounted directory (once), e.g. on the host: `git clone <url> ./data` so that `./data` is the worktree (or set `RAG_REPO_PATH` to that path and mount it).
3. In GitHub (or GitLab) add a **Webhook**:
- URL: `http://<your-server>:8000/webhook` (use HTTPS in production and put a reverse proxy in front).
- Content type: `application/json`.
- Secret: set a shared secret and export `WEBHOOK_SECRET` in the app environment (Docker: in `docker-compose.yml` or `.env`). If `WEBHOOK_SECRET` is empty, signature is not checked.
4. On each push to a branch, the server receives the webhook, pulls that branch into the worktree, and runs `rag-agent index --story <branch> --changed --base-ref <old_head> --head-ref <new_head>` so only changed files are re-indexed.
Health check: `GET http://<host>:8000/health` → `ok`. Port is configurable via `WEBHOOK_PORT` (default 8000) in docker-compose.
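For the shared-secret step above, GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex>`. The following is a sketch of such a check, not the server's actual code; the function name `verify_signature` is hypothetical, and GitLab uses a different mechanism that is not covered here.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body."""
    if not secret:
        # Mirrors the README: an empty WEBHOOK_SECRET disables the check.
        return True
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("utf-8")
    # compare_digest runs in constant time to avoid leaking the match position.
    return hmac.compare_digest(expected, signature_header.encode("utf-8"))
```

The signature must be computed over the raw bytes of the request, before any JSON parsing, because re-serialized JSON may differ from what was signed.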
### Webhook diagnostics (202 Accepted but no new rows in DB)
1. **Logs** — After a push, check app logs. Each webhook logs `pull_and_index started branch=… repo_path=…`; then one of:
- `not a git repo or missing` — `/data` in the container is not a git clone; clone the repo into the mounted dir.
- `git fetch failed` — SSH/network (see `docker/ssh/README.md`) or wrong remote.
- `git checkout … failed` — branch missing in the clone.
- `git merge --ff-only failed` — non-fast-forward (e.g. force-push); index is skipped. Use a normal push or re-clone.
- `no new commits for branch=…` — merge was a no-op (already up to date); nothing to index.
- `running index story=…` then `index completed` — index ran; check tables for that story.
- `index failed` — stderr shows the `rag-agent index` error (e.g. DB, embeddings, repo path).
```bash
docker compose logs -f app
# or: docker logs -f rag-agent
```
Trigger a push and watch for the lines above.
2. **Story and tables** — Rows are per **story** (branch name). Query by story, e.g. `SELECT * FROM stories;` then `SELECT * FROM chunks WHERE story_id = (SELECT id FROM stories WHERE slug = 'main');`.
3. **Manual index** — Run index inside the container to confirm DB and repo work:
```bash
docker compose exec app rag-agent index --story main --changed --base-ref main --head-ref HEAD
```
If this inserts rows, the issue is in the webhook path (fetch/merge/refs).
4. **Allowed extensions** — Only `.md`, `.txt`, `.rst` (or `RAG_ALLOWED_EXTENSIONS`) are indexed; other files are skipped.
5. **"expected 1536 dimensions, not 1024"** — GigaChat Embeddings returns 1024-dim vectors; the default is now 1024. If the DB was created earlier with vector(1536), drop and recreate the tables so the app can create them with 1024: `psql "$RAG_DB_DSN" -c "DROP TABLE IF EXISTS chunks; DROP TABLE IF EXISTS documents;"` then restart the app (ensure_schema will recreate the tables).
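The dimension mismatch in the last item can be caught before the insert reaches Postgres. A small sketch, assuming hypothetical helpers `expected_dim_from_env` and `check_embedding_dim` (neither is part of the project as shown); only the `RAG_EMBEDDINGS_DIM` variable and its default of 1024 come from the README:

```python
import os
from typing import Mapping, Sequence


def expected_dim_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Read RAG_EMBEDDINGS_DIM, falling back to 1024 (the GigaChat default)."""
    return int(env.get("RAG_EMBEDDINGS_DIM", "1024"))


def check_embedding_dim(vector: Sequence[float], expected_dim: int) -> None:
    """Raise early if the embedding length does not match the configured dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"but RAG_EMBEDDINGS_DIM is {expected_dim}"
        )
```

Failing in the application gives a clearer message than the database error and points directly at the configuration value to change.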
## Git hook (index on commit)
Install the post-commit hook so changed files are indexed after each commit:
@@ -28,16 +82,62 @@ cp scripts/post-commit .git/hooks/post-commit && chmod +x .git/hooks/post-commit
Story for the commit is taken from (in order): env `RAG_STORY`, file `.rag-story` in repo root (one line = slug), or current branch name.
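The lookup order above (env `RAG_STORY`, then `.rag-story`, then the current branch) could be sketched as follows. The function name `resolve_story` is hypothetical and the real hook script may be implemented differently (for example in shell); `git rev-parse --abbrev-ref HEAD` is the standard way to obtain the current branch name.

```python
import os
import subprocess
from pathlib import Path
from typing import Mapping, Optional


def resolve_story(repo_path: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Pick the story slug: RAG_STORY env, then .rag-story file, then branch name."""
    story = env.get("RAG_STORY", "").strip()
    if story:
        return story

    story_file = Path(repo_path) / ".rag-story"
    if story_file.is_file():
        lines = story_file.read_text(encoding="utf-8").splitlines()
        if lines and lines[0].strip():
            return lines[0].strip()  # one line = slug

    # Fall back to the branch that is currently checked out.
    result = subprocess.run(
        ["git", "-C", str(repo_path), "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True,
        text=True,
    )
    branch = result.stdout.strip()
    return branch if result.returncode == 0 and branch else None
```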
## Git hook (server-side)
Use `scripts/post-receive` in the **bare repo** on the server so that pushes trigger indexing.
1. On the server, create a **non-bare clone** (worktree) that the hook will update and use for indexing, e.g. `git clone /path/to/repo.git /var/rag-worktree/repo`.
2. In the bare repo, install the hook: `cp /path/to/RagAgent/scripts/post-receive /path/to/repo.git/hooks/post-receive && chmod +x .../post-receive`.
3. Set env for the hook (e.g. in the hook or via systemd/sshd): `RAG_REPO_PATH=/var/rag-worktree/repo`, `RAG_DB_DSN=...`, `RAG_EMBEDDINGS_DIM=...`. Optionally `RAG_AGENT_VENV` (path to venv with `rag-agent`) or `RAG_AGENT_SRC` + `RAG_AGENT_PYTHON` for `python -m rag_agent.cli`.
4. On each push the hook updates the worktree to the new commit, then runs `rag-agent index --changed --base-ref main --head-ref newrev --story <branch>` so the story contains **all commits** on the branch (from main to newrev).
Story is taken from the ref name (e.g. `refs/heads/main` → `main`).
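Git passes each updated ref to a `post-receive` hook on stdin as a line of the form `<oldrev> <newrev> <refname>`. The sketch below shows how those lines might be turned into the index command quoted in step 4. The names `RefUpdate`, `parse_post_receive`, and `index_command` are hypothetical; the actual `scripts/post-receive` is not shown in this diff and may work differently.

```python
from typing import Iterable, Iterator, List, NamedTuple


class RefUpdate(NamedTuple):
    oldrev: str
    newrev: str
    branch: str


def parse_post_receive(lines: Iterable[str]) -> Iterator[RefUpdate]:
    """Yield branch updates from post-receive stdin, skipping tags and deletions."""
    prefix = "refs/heads/"
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # malformed or empty line
        oldrev, newrev, refname = parts
        if not refname.startswith(prefix):
            continue  # tags and other non-branch refs
        if set(newrev) == {"0"}:
            continue  # an all-zero newrev means the branch was deleted
        yield RefUpdate(oldrev, newrev, refname[len(prefix):])


def index_command(update: RefUpdate, base_ref: str = "main") -> List[str]:
    """Build the rag-agent invocation for one branch update."""
    return [
        "rag-agent", "index", "--changed",
        "--base-ref", base_ref,
        "--head-ref", update.newrev,
        "--story", update.branch,
    ]
```

In a hook, `parse_post_receive(sys.stdin)` would supply the updates, and each command could be passed to `subprocess.run`.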
## DB structure
- **stories** — story slug (e.g. branch name); documents and chunks are tied to a story. Optional: `indexed_base_ref`, `indexed_head_ref`, `indexed_at` record the git range that was indexed (all commits in that range belong to the story).
- **documents** — path + version per story; unique `(story_id, path)`.
- **chunks** — text chunks with embeddings (pgvector), plus:
- `start_line`, `end_line` — position in the source file (for requirements/use-case files).
- `change_type` — `added` | `modified` | `unchanged` (relative to base ref when indexing with `--changed`).
- `previous_content` — for `modified` chunks, the content before the change (for test-case generation).
Indexing is **always per-story**: `base_ref..head_ref` defines the set of commits that belong to the story. Use `--base-ref main` (or `auto`) and `--head-ref HEAD` so the story contains all commits on the branch, not a single commit. When you run `index --changed`, the base ref is compared to head; each chunk is marked as added, modified, or unchanged.
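How the indexer decides between `added`, `modified`, and `unchanged` is not shown in this diff. One plausible approach is sketched here under the assumption that chunks are compared position by position using a content hash. The `hash` column in `chunks` suggests hashing is involved, but the exact rule is a guess, and `chunk_hash` and `classify_chunks` are hypothetical names.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple

# (content, change_type, previous_content)
ClassifiedChunk = Tuple[str, str, Optional[str]]


def chunk_hash(text: str) -> str:
    """Stable content hash used to detect changes between base and head."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_chunks(
    base_chunks: Sequence[str], head_chunks: Sequence[str]
) -> List[ClassifiedChunk]:
    """Label each head chunk relative to the chunk at the same index in base."""
    result: List[ClassifiedChunk] = []
    for i, content in enumerate(head_chunks):
        if i >= len(base_chunks):
            result.append((content, "added", None))
        elif chunk_hash(base_chunks[i]) == chunk_hash(content):
            result.append((content, "unchanged", None))
        else:
            # Keep the old text so it can be stored as previous_content.
            result.append((content, "modified", base_chunks[i]))
    return result
```

Positional comparison is simple but marks every chunk after an insertion as modified; a diff-based alignment would be more precise at the cost of more code.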
### What changed in a story (for test cases)
To get only the chunks that were added or modified in a story (e.g. to generate test cases for the changed part):
```python
from rag_agent.index import fetch_changed_chunks

# conn: an open DB connection; story_id: id of the row in `stories`
changed = fetch_changed_chunks(conn, story_id)
for r in changed:
    # r.path, r.content, r.change_type, r.start_line, r.end_line, r.previous_content
    ...
```
Scripts: `scripts/create_db.py` (Python, uses `ensure_schema` and `RAG_*` env), `scripts/schema.sql` (raw SQL).
## Embeddings (GigaChat)
If `GIGACHAT_CREDENTIALS` is set (e.g. in `.env` for local runs), embeddings use GigaChat API; otherwise the stub client is used. Optional env: `GIGACHAT_EMBEDDINGS_MODEL` (default `Embeddings`), `GIGACHAT_VERIFY_SSL` (`true`/`false`). Ensure `RAG_EMBEDDINGS_DIM` matches the model output (see GigaChat docs).
## Agent (GigaChat)
Answers are produced by a GigaChat-based agent: knowledge-base retrieval (RAG) plus text generation. If `GIGACHAT_CREDENTIALS` is set, `GigaChatLLMClient` in `src/rag_agent/agent/pipeline.py` is used; otherwise a stub is used. The chat model is set via `RAG_LLM_MODEL` (default `GigaChat`).
## Telegram bot
Users talk to the agent through a Telegram bot: the bot replies to text messages using knowledge from the database (RAG + GigaChat).
1. Create a bot via [@BotFather](https://t.me/BotFather) and obtain a token.
2. Add to `.env`: `TELEGRAM_BOT_TOKEN=<token>`.
3. Run: `rag-agent bot` (or `python -m rag_agent.telegram_bot`).
4. With Docker: `docker compose up -d` starts the DB, the webhook server, and the bot in separate containers; `TELEGRAM_BOT_TOKEN` must be set in `.env`.
Required: `RAG_DB_DSN`, `RAG_REPO_PATH`, `GIGACHAT_CREDENTIALS`, `TELEGRAM_BOT_TOKEN`. Verbose logging (incoming messages, number of embeddings, number of chunks from the DB, LLM response): `RAG_BOT_VERBOSE_LOGGING=true|false` (default `true`, for debugging).
## Notes
- The default embedding/LLM clients are stubs. Replace them in
`src/rag_agent/index/embeddings.py` and `src/rag_agent/agent/pipeline.py`.
- This project requires Postgres with the `pgvector` extension. - This project requires Postgres with the `pgvector` extension.

91
docker-compose.yml Normal file

@@ -0,0 +1,91 @@
# Postgres with pgvector + RAG Agent app (from repo git@git.lesha.spb.ru:alex/RagAgent.git).
# Clone the repo, then: docker compose up -d
# App and DB share network "rag_net"; app uses RAG_DB_DSN with host=postgres.
# DB init: scripts in docker/postgres-init/ run on first start (empty volume); to disable, comment out the init volume.
services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: rag-postgres
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-rag}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-rag_secret}
      POSTGRES_DB: ${POSTGRES_DB:-rag}
    ports:
      - "${POSTGRES_PORT:-5432}:5432"
    volumes:
      # PG 18+: mount at /var/lib/postgresql (data goes in versioned subdir). For pg16 use /var/lib/postgresql/data.
      - rag_pgdata:/var/lib/postgresql
      # Init scripts run once on first start (create extension, tables). Optional: comment out to skip.
      - ./docker/postgres-init:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-rag} -d ${POSTGRES_DB:-rag}"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - rag_net
  app:
    build:
      context: .
      dockerfile: Dockerfile
    image: rag-agent:latest
    container_name: rag-agent
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "${WEBHOOK_PORT:-8000}:8000"
    environment:
      RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
      # In container repo is always at /data (mounted below). Use RAG_REPO_HOST in .env for host path.
      RAG_REPO_PATH: "/data"
      # Accept host key on first connect; git fetch uses SSH from /root/.ssh (mounted below).
      GIT_SSH_COMMAND: "ssh -o StrictHostKeyChecking=accept-new"
      RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
      GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
      GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
      WEBHOOK_SECRET: ${WEBHOOK_SECRET:-}
    volumes:
      # Host path: set RAG_REPO_HOST in .env (e.g. /Users/you/repo). Falls back to RAG_REPO_PATH then ./data.
      - ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
      # SSH for git fetch (webhook): put deploy key and known_hosts in RAG_SSH_DIR. See docker/ssh/README.md.
      - ${RAG_SSH_DIR:-./docker/ssh}:/root/.ssh:ro
    entrypoint: ["rag-agent"]
    command: ["serve", "--host", "0.0.0.0", "--port", "8000"]
    networks:
      - rag_net
  bot:
    build:
      context: .
      dockerfile: Dockerfile
    image: rag-agent:latest
    container_name: rag-bot
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
      RAG_REPO_PATH: "/data"
      RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
      GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
      GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN:-}
      RAG_BOT_VERBOSE_LOGGING: ${RAG_BOT_VERBOSE_LOGGING:-true}
    volumes:
      - ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
    entrypoint: ["rag-agent"]
    command: ["bot"]
    networks:
      - rag_net
networks:
  rag_net:
    driver: bridge
volumes:
  rag_pgdata:


@@ -0,0 +1,7 @@
-- Example: create an extra DB user (e.g. read-only). Not executed — rename to 00-create-extra-user.sql to enable.
-- Scripts in this folder run in alphabetical order; 00-* runs before 01-schema.sql.
-- CREATE USER rag_readonly WITH PASSWORD 'change_me';
-- GRANT CONNECT ON DATABASE rag TO rag_readonly;
-- GRANT USAGE ON SCHEMA public TO rag_readonly;
-- GRANT SELECT ON ALL TABLES IN SCHEMA public TO rag_readonly;


@@ -0,0 +1,41 @@
-- RAG vector DB schema (runs automatically on first Postgres init).
-- GigaChat Embeddings = 1024; for OpenAI use vector(1536).
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS stories (
    id SERIAL PRIMARY KEY,
    slug TEXT UNIQUE NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
    indexed_base_ref TEXT,
    indexed_head_ref TEXT,
    indexed_at TIMESTAMPTZ
);
CREATE TABLE IF NOT EXISTS documents (
    id SERIAL PRIMARY KEY,
    story_id INTEGER NOT NULL REFERENCES stories(id) ON DELETE CASCADE,
    path TEXT NOT NULL,
    version TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL,
    UNIQUE(story_id, path)
);
CREATE TABLE IF NOT EXISTS chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    hash TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1024) NOT NULL,
    start_line INTEGER,
    end_line INTEGER,
    change_type TEXT NOT NULL DEFAULT 'added'
        CHECK (change_type IN ('added', 'modified', 'unchanged')),
    previous_content TEXT
);
CREATE INDEX IF NOT EXISTS idx_documents_story_id ON documents(story_id);
CREATE INDEX IF NOT EXISTS idx_chunks_document_id ON chunks(document_id);
CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS idx_chunks_change_type ON chunks(change_type);


@@ -0,0 +1,9 @@
# Postgres init scripts (optional)
Files here are mounted into the Postgres container at `/docker-entrypoint-initdb.d/` and run **only on first startup** (when the data volume is empty), in alphabetical order.
- `01-schema.sql` — creates pgvector extension and RAG tables (stories, documents, chunks).
- To add more users or other setup, add scripts with names like `00-create-user.sql` (they run before `01-schema.sql`).
- To disable init: in `docker-compose.yml`, comment out the postgres volume that mounts this folder, or remove/rename the `.sql` files.
After the first run, these scripts are not executed again. To re-run them, remove the volume: `docker compose down -v` (this deletes DB data), then `docker compose up -d`.


@@ -0,0 +1 @@
18

Binary file not shown.


Some files were not shown because too many files have changed in this diff