Compare commits

...

6 Commits

1322 changed files with 2312 additions and 68 deletions

3
.gitignore vendored

@@ -1 +1,4 @@
src/rag_agent/.env
.env
docker/ssh
docker/postgres_test_data


@@ -14,8 +14,9 @@ COPY README.md ./
RUN pip install --no-cache-dir -e .
# Default: run CLI (override in compose or when running)
# Default: run webhook server (override in compose or when running)
ENV RAG_DB_DSN=""
ENV RAG_REPO_PATH="/data"
EXPOSE 8000
ENTRYPOINT ["rag-agent"]
CMD ["ask", "--help"]
CMD ["serve", "--host", "0.0.0.0", "--port", "8000"]

107
README.md

@@ -1,8 +1,10 @@
# RAG Agent (Postgres)
Custom RAG agent that indexes text files from a git repository into Postgres
and answers queries using retrieval + LLM generation. Commits are tied to
**stories**; indexing and retrieval can be scoped by story.
and answers queries using retrieval + LLM generation. **Changes are always in the context of a Story**: the unit of work is the story, not individual commits. The agent indexes **all changes from all commits** in the story range (base_ref..head_ref); per-commit indexing is not used.
## Quick start
@@ -15,15 +17,61 @@ and answers queries using retrieval + LLM generation. Commits are tied to
2. Configure environment variables:
- `RAG_REPO_PATH` — path to git repo with text files
- `RAG_DB_DSN` — Postgres DSN (e.g. `postgresql://rag:rag_secret@localhost:5432/rag`)
- `RAG_EMBEDDINGS_DIM` — embedding vector dimension (e.g. `1536`)
- `RAG_EMBEDDINGS_DIM` — embedding vector dimension: **1024** for GigaChat Embeddings (default), 1536 for OpenAI
3. Create DB schema (only if not using Docker, or if init was disabled):
- `python scripts/create_db.py` (or `psql "$RAG_DB_DSN" -f scripts/schema.sql`)
4. Index files for a story (e.g. branch name as story slug):
- `rag-agent index --story my-branch --changed --base-ref HEAD~1 --head-ref HEAD`
4. Index files for a story (e.g. branch name as story slug). Use the **full story range** so all commits in the story are included:
- `rag-agent index --story my-branch --changed --base-ref main --head-ref HEAD`
- Or `--base-ref auto` to use merge-base(default-branch, head-ref) as the start of the story.
5. Ask a question (optionally scoped to a story):
- `rag-agent ask "What is covered?"`
- `rag-agent ask "What is covered?" --story my-branch`
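A minimal sketch of how `--base-ref auto` could be resolved, assuming it maps to `git merge-base <default-branch> <head-ref>` as described above. The function names and the `default_branch="main"` default are hypothetical; the project's actual implementation is not shown in this diff.

```python
import subprocess


def merge_base_command(default_branch: str, head_ref: str) -> list[str]:
    """Build the git invocation that finds the common ancestor of two refs."""
    return ["git", "merge-base", default_branch, head_ref]


def resolve_base_ref(
    repo_path: str,
    base_ref: str,
    head_ref: str,
    default_branch: str = "main",  # assumption: the default branch is "main"
) -> str:
    """Return base_ref unchanged unless it is 'auto'; then ask git for the merge-base."""
    if base_ref != "auto":
        return base_ref
    result = subprocess.run(
        merge_base_command(default_branch, head_ref),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails (e.g. unknown ref)
    )
    return result.stdout.strip()
```

With an explicit ref such as `main`, the value is passed through untouched; only `auto` triggers a git call, so the story range starts where the branch diverged from the default branch.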
## Webhook: index on push to remote
When the app runs as a service in Docker, it can start a **webhook server** so that each push to the remote repository triggers a pull and incremental indexing.
1. Start the stack with the webhook server (default in Docker):
- `docker compose up -d` — app runs `rag-agent serve` and listens on port 8000.
- Repo is mounted at `RAG_REPO_PATH` (e.g. `/data`) **writable**, so the container can run `git fetch` + `git merge --ff-only` to pull changes.
2. Clone the knowledge-base repo into the mounted directory (once), e.g. on the host: `git clone <url> ./data` so that `./data` is the worktree (or set `RAG_REPO_PATH` to that path and mount it).
3. In GitHub (or GitLab) add a **Webhook**:
- URL: `http://<your-server>:8000/webhook` (use HTTPS in production and put a reverse proxy in front).
- Content type: `application/json`.
- Secret: set a shared secret and export `WEBHOOK_SECRET` in the app environment (Docker: in `docker-compose.yml` or `.env`). If `WEBHOOK_SECRET` is empty, signature is not checked.
4. On each push to a branch, the server receives the webhook, pulls that branch into the worktree, and runs `rag-agent index --story <branch> --changed --base-ref <old_head> --head-ref <new_head>` so only changed files are re-indexed.
Health check: `GET http://<host>:8000/health` returns `ok`. Port is configurable via `WEBHOOK_PORT` (default 8000) in docker-compose.
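The secret check from step 3 can be sketched as follows, assuming GitHub-style signing: GitHub sends an `X-Hub-Signature-256` header of the form `sha256=<hex>`, an HMAC-SHA256 of the raw request body keyed with the shared secret. The function name `signature_is_valid` is hypothetical, and how the server handles GitLab deliveries is not shown in this diff.

```python
import hashlib
import hmac
from typing import Optional


def signature_is_valid(secret: str, body: bytes, header_value: Optional[str]) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body."""
    if not secret:
        # Mirrors the documented behavior: empty WEBHOOK_SECRET disables the check.
        return True
    expected = "sha256=" + hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = header_value or ""
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The comparison must run on the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the signature.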
### Webhook diagnostics (202 Accepted but no new rows in DB)
1. **Logs** — After a push, check app logs. Each webhook logs `pull_and_index started branch=… repo_path=…`; then one of:
- `not a git repo or missing`: `/data` in the container is not a git clone; clone the repo into the mounted dir.
- `git fetch failed` — SSH/network (see `docker/ssh/README.md`) or wrong remote.
- `git checkout … failed` — branch missing in the clone.
- `git merge --ff-only failed`: non-fast-forward update (e.g. force-push); indexing is skipped. Use a normal push or re-clone.
- `no new commits for branch=…` — merge was a no-op (already up to date); nothing to index.
- `running index story=…` then `index completed` — index ran; check tables for that story.
- `index failed` — stderr shows the `rag-agent index` error (e.g. DB, embeddings, repo path).
```bash
docker compose logs -f app
# or: docker logs -f rag-agent
```
Trigger a push and watch for the lines above.
2. **Story and tables** — Rows are per **story** (branch name). Query by story, e.g. `SELECT * FROM stories;` then `SELECT * FROM chunks WHERE story_id = (SELECT id FROM stories WHERE slug = 'main');`.
3. **Manual index** — Run index inside the container to confirm DB and repo work:
```bash
docker compose exec app rag-agent index --story main --changed --base-ref main --head-ref HEAD
```
If this inserts rows, the issue is in the webhook path (fetch/merge/refs).
4. **Allowed extensions** — Only `.md`, `.txt`, `.rst` (or `RAG_ALLOWED_EXTENSIONS`) are indexed; other files are skipped.
5. **"expected 1536 dimensions, not 1024"** — GigaChat Embeddings returns 1024-dim vectors; the default is now 1024. If the DB was created earlier with vector(1536), drop and recreate the tables so the app can create them with 1024: `psql "$RAG_DB_DSN" -c "DROP TABLE IF EXISTS chunks; DROP TABLE IF EXISTS documents;"` then restart the app (ensure_schema will recreate the tables).
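The dimension mismatch in item 5 can be caught before the insert reaches Postgres. The sketch below is an assumption about how such a guard might look; `check_embedding_dim` is a hypothetical helper, not part of the project, and it only reuses the documented `RAG_EMBEDDINGS_DIM` variable and its default of 1024.

```python
import os
from typing import Optional, Sequence


def check_embedding_dim(vector: Sequence[float], expected_dim: Optional[int] = None) -> Sequence[float]:
    """Raise early if the embedding length differs from the configured dimension."""
    if expected_dim is None:
        # Default of 1024 matches the documented GigaChat Embeddings size.
        expected_dim = int(os.environ.get("RAG_EMBEDDINGS_DIM", "1024"))
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, not {len(vector)}")
    return vector
```

Failing at this point gives a clearer error than a rejected `INSERT`, and it makes it obvious whether the model output or the table definition is the side that needs to change.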
## Git hook (index on commit)
Install the post-commit hook so changed files are indexed after each commit:
@@ -34,11 +82,40 @@ cp scripts/post-commit .git/hooks/post-commit && chmod +x .git/hooks/post-commit
Story for the commit is taken from (in order): env `RAG_STORY`, file `.rag-story` in repo root (one line = slug), or current branch name.
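The lookup order above can be sketched like this. `resolve_story` is a hypothetical name and the real hook script is not shown here; the sketch assumes the branch name is read with `git rev-parse --abbrev-ref HEAD`.

```python
import os
import subprocess
from pathlib import Path
from typing import Mapping, Optional


def resolve_story(repo_path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the story slug: RAG_STORY env, then .rag-story file, then branch name."""
    env = os.environ if env is None else env

    # 1. Explicit override through the environment.
    slug = env.get("RAG_STORY", "").strip()
    if slug:
        return slug

    # 2. First line of .rag-story in the repo root.
    story_file = Path(repo_path) / ".rag-story"
    if story_file.is_file():
        lines = story_file.read_text(encoding="utf-8").splitlines()
        if lines and lines[0].strip():
            return lines[0].strip()

    # 3. Fall back to the current branch name.
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Note that on a detached HEAD `git rev-parse --abbrev-ref HEAD` prints `HEAD`, so a real hook would probably want to treat that value as "no story" rather than use it as a slug.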
## Git hook (server-side)
Use `scripts/post-receive` in the **bare repo** on the server so that pushes trigger indexing.
1. On the server, create a **non-bare clone** (worktree) that the hook will update and use for indexing, e.g. `git clone /path/to/repo.git /var/rag-worktree/repo`.
2. In the bare repo, install the hook: `cp /path/to/RagAgent/scripts/post-receive /path/to/repo.git/hooks/post-receive && chmod +x .../post-receive`.
3. Set env for the hook (e.g. in the hook or via systemd/sshd): `RAG_REPO_PATH=/var/rag-worktree/repo`, `RAG_DB_DSN=...`, `RAG_EMBEDDINGS_DIM=...`. Optionally `RAG_AGENT_VENV` (path to venv with `rag-agent`) or `RAG_AGENT_SRC` + `RAG_AGENT_PYTHON` for `python -m rag_agent.cli`.
4. On each push the hook updates the worktree to the new commit, then runs `rag-agent index --changed --base-ref main --head-ref newrev --story <branch>` so the story contains **all commits** on the branch (from main to newrev).
Story is taken from the ref name (e.g. `refs/heads/main` → `main`).
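For illustration, a post-receive hook reads one line per updated ref from stdin in the form `<old-rev> <new-rev> <refname>`. The sketch below shows how the story slug could be derived from that line; both function names are hypothetical, and returning `None` for non-branch refs (tags, notes) is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple

BRANCH_PREFIX = "refs/heads/"


def story_from_ref(refname: str) -> Optional[str]:
    """Map refs/heads/<branch> to <branch>; return None for tags and other refs."""
    if refname.startswith(BRANCH_PREFIX):
        return refname[len(BRANCH_PREFIX):]
    return None


def parse_post_receive_line(line: str) -> Tuple[str, str, Optional[str]]:
    """Split one stdin line of a post-receive hook into (old_rev, new_rev, story)."""
    old_rev, new_rev, refname = line.split()
    return old_rev, new_rev, story_from_ref(refname)
```

Branch names containing slashes, such as `feature/login`, keep their full name as the slug because only the `refs/heads/` prefix is stripped.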
## DB structure
- **stories** — story slug (e.g. branch name); documents and chunks are tied to a story.
- **stories** — story slug (e.g. branch name); documents and chunks are tied to a story. Optional: `indexed_base_ref`, `indexed_head_ref`, `indexed_at` record the git range that was indexed (all commits in that range belong to the story).
- **documents** — path + version per story; unique `(story_id, path)`.
- **chunks** — text chunks with embeddings (pgvector); updated when documents are re-indexed.
- **chunks** — text chunks with embeddings (pgvector), plus:
- `start_line`, `end_line` — position in the source file (for requirements/use-case files).
- `change_type` — `added` | `modified` | `unchanged` (relative to base ref when indexing with `--changed`).
- `previous_content` — for `modified` chunks, the content before the change (for test-case generation).
Indexing is **always per-story**: `base_ref..head_ref` defines the set of commits that belong to the story. Use `--base-ref main` (or `auto`) and `--head-ref HEAD` so the story contains all commits on the branch, not a single commit. When you run `index --changed`, the base ref is compared to head; each chunk is marked as added, modified, or unchanged.
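To make the `added` / `modified` / `unchanged` marking concrete, here is a rough sketch that compares the chunks of one file at the base ref with its chunks at the head ref. Matching chunks by position and comparing SHA-256 hashes are assumptions made for illustration; the diff does not show how the agent actually pairs chunks, and `classify_chunks` is a hypothetical name.

```python
import hashlib
from typing import Dict, List, Optional


def chunk_hash(text: str) -> str:
    """Stable content hash used to detect whether a chunk changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_chunks(base_chunks: List[str], head_chunks: List[str]) -> List[Dict[str, Optional[str]]]:
    """Label each head chunk relative to the chunk at the same index in base."""
    results: List[Dict[str, Optional[str]]] = []
    for index, content in enumerate(head_chunks):
        if index >= len(base_chunks):
            # No counterpart at the base ref: the chunk is new.
            change_type, previous = "added", None
        elif chunk_hash(base_chunks[index]) == chunk_hash(content):
            change_type, previous = "unchanged", None
        else:
            # Keep the old text so it can be stored as previous_content.
            change_type, previous = "modified", base_chunks[index]
        results.append(
            {"content": content, "change_type": change_type, "previous_content": previous}
        )
    return results
```

For example, `classify_chunks(["a", "b"], ["a", "B", "c"])` labels the three head chunks `unchanged`, `modified` (with `previous_content` set to `"b"`), and `added`. Positional matching is simple but shifts every label when a chunk is inserted near the top of a file, which is one reason a real implementation might match on line ranges instead.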
### What changed in a story (for test cases)
To get only the chunks that were added or modified in a story (e.g. to generate test cases for the changed part):
```python
from rag_agent.index import fetch_changed_chunks

# conn: open DB connection; story_id: id of the story row to inspect
changed = fetch_changed_chunks(conn, story_id)
for r in changed:
    # r.path, r.content, r.change_type, r.start_line, r.end_line, r.previous_content
    ...
```
Scripts: `scripts/create_db.py` (Python, uses `ensure_schema` and `RAG_*` env), `scripts/schema.sql` (raw SQL).
@@ -46,7 +123,21 @@ Scripts: `scripts/create_db.py` (Python, uses `ensure_schema` and `RAG_*` env),
If `GIGACHAT_CREDENTIALS` is set (e.g. in `.env` for local runs), embeddings use GigaChat API; otherwise the stub client is used. Optional env: `GIGACHAT_EMBEDDINGS_MODEL` (default `Embeddings`), `GIGACHAT_VERIFY_SSL` (`true`/`false`). Ensure `RAG_EMBEDDINGS_DIM` matches the model output (see GigaChat docs).
## Agent (GigaChat)
Answers to questions are produced by a GigaChat-based agent: knowledge-base search (RAG) plus text generation. If the `GIGACHAT_CREDENTIALS` variable is set, `GigaChatLLMClient` in `src/rag_agent/agent/pipeline.py` is used; otherwise a stub is used. The chat model is set via `RAG_LLM_MODEL` (default `GigaChat`).
## Telegram bot
Users talk to the agent through a Telegram bot: the bot replies to text messages using knowledge from the database (RAG + GigaChat).
1. Create a bot via [@BotFather](https://t.me/BotFather) and obtain a token.
2. Add to `.env`: `TELEGRAM_BOT_TOKEN=<token>`.
3. Run: `rag-agent bot` (or `python -m rag_agent.telegram_bot`).
4. With Docker: `docker compose up -d` starts the DB, the webhook server, and the bot in separate containers; `TELEGRAM_BOT_TOKEN` must be set in `.env`.
Required: `RAG_DB_DSN`, `RAG_REPO_PATH`, `GIGACHAT_CREDENTIALS`, `TELEGRAM_BOT_TOKEN`. Verbose logging (incoming messages, number of embeddings, number of chunks from the DB, LLM response): `RAG_BOT_VERBOSE_LOGGING=true|false` (default `true`, for debugging).
## Notes
- LLM client is still a stub; replace it in `src/rag_agent/agent/pipeline.py` for real answers.
- This project requires Postgres with the `pgvector` extension.


@@ -14,7 +14,8 @@ services:
    ports:
      - "${POSTGRES_PORT:-5432}:5432"
    volumes:
      - rag_pgdata:/var/lib/postgresql/data
      # PG 18+: mount at /var/lib/postgresql (data goes in versioned subdir). For pg16 use /var/lib/postgresql/data.
      - rag_pgdata:/var/lib/postgresql
      # Init scripts run once on first start (create extension, tables). Optional: comment out to skip.
      - ./docker/postgres-init:/docker-entrypoint-initdb.d:ro
    healthcheck:
@@ -31,20 +32,54 @@ services:
      dockerfile: Dockerfile
    image: rag-agent:latest
    container_name: rag-agent
    restart: "no"
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "${WEBHOOK_PORT:-8000}:8000"
    environment:
      RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
      # In container repo is always at /data (mounted below). Use RAG_REPO_HOST in .env for host path.
      RAG_REPO_PATH: "/data"
      # Accept host key on first connect; git fetch uses SSH from /root/.ssh (mounted below).
      GIT_SSH_COMMAND: "ssh -o StrictHostKeyChecking=accept-new"
      RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
      GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
      GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
      WEBHOOK_SECRET: ${WEBHOOK_SECRET:-}
    volumes:
      # Host path: set RAG_REPO_HOST in .env (e.g. /Users/you/repo). Falls back to RAG_REPO_PATH then ./data.
      - ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
      # SSH for git fetch (webhook): put deploy key and known_hosts in RAG_SSH_DIR. See docker/ssh/README.md.
      - ${RAG_SSH_DIR:-./docker/ssh}:/root/.ssh:ro
    entrypoint: ["rag-agent"]
    command: ["serve", "--host", "0.0.0.0", "--port", "8000"]
    networks:
      - rag_net
  bot:
    build:
      context: .
      dockerfile: Dockerfile
    image: rag-agent:latest
    container_name: rag-bot
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
      RAG_REPO_PATH: ${RAG_REPO_PATH:-/data}
      RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1536}
      RAG_REPO_PATH: "/data"
      RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
      GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
      GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN:-}
      RAG_BOT_VERBOSE_LOGGING: ${RAG_BOT_VERBOSE_LOGGING:-true}
    volumes:
      - ${RAG_REPO_PATH:-./data}:/data:ro
      - ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
    entrypoint: ["rag-agent"]
    command: ["ask", "--help"]
    command: ["bot"]
    networks:
      - rag_net


@@ -1,12 +1,15 @@
-- RAG vector DB schema (runs automatically on first Postgres init).
-- If RAG_EMBEDDINGS_DIM is not 1536, change vector(1536) below.
-- GigaChat Embeddings = 1024; for OpenAI use vector(1536).
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS stories (
id SERIAL PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc')
created_at TIMESTAMPTZ NOT NULL DEFAULT (NOW() AT TIME ZONE 'utc'),
indexed_base_ref TEXT,
indexed_head_ref TEXT,
indexed_at TIMESTAMPTZ
);
CREATE TABLE IF NOT EXISTS documents (
@@ -24,9 +27,15 @@ CREATE TABLE IF NOT EXISTS chunks (
chunk_index INTEGER NOT NULL,
hash TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536) NOT NULL
embedding vector(1024) NOT NULL,
start_line INTEGER,
end_line INTEGER,
change_type TEXT NOT NULL DEFAULT 'added'
CHECK (change_type IN ('added', 'modified', 'unchanged')),
previous_content TEXT
);
CREATE INDEX IF NOT EXISTS idx_documents_story_id ON documents(story_id);
CREATE INDEX IF NOT EXISTS idx_chunks_document_id ON chunks(document_id);
CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS idx_chunks_change_type ON chunks(change_type);


@@ -0,0 +1 @@
18

Binary file not shown.

Some files were not shown because too many files have changed in this diff.