Compare commits

...

5 Commits

Author SHA1 Message Date
304e4dae6d Тестовый коммит 2026-01-31 23:57:27 +03:00
dce020d637 Настройка гитигнор 2026-01-31 23:54:01 +03:00
15f8a57d3a Stop tracking .env 2026-01-31 23:48:24 +03:00
a990e704d9 Бот работает 2026-01-31 23:46:08 +03:00
c8980abe2b Рабочий вариант 2026-01-31 20:19:44 +03:00
1314 changed files with 1632 additions and 45 deletions

3
.gitignore vendored
View File

@@ -1 +1,4 @@
src/rag_agent/.env
.env
docker/ssh
docker/postgres_test_data

View File

@@ -1,3 +1,6 @@
# RAG Agent (Postgres)
Custom RAG agent that indexes text files from a git repository into Postgres
@@ -14,7 +17,7 @@ and answers queries using retrieval + LLM generation. **Changes are always in th
2. Configure environment variables:
- `RAG_REPO_PATH` — path to git repo with text files
- `RAG_DB_DSN` — Postgres DSN (e.g. `postgresql://rag:rag_secret@localhost:5432/rag`)
- `RAG_EMBEDDINGS_DIM` — embedding vector dimension (e.g. `1536`)
- `RAG_EMBEDDINGS_DIM` — embedding vector dimension: **1024** for GigaChat Embeddings (default), 1536 for OpenAI
3. Create DB schema (only if not using Docker, or if init was disabled):
- `python scripts/create_db.py` (or `psql "$RAG_DB_DSN" -f scripts/schema.sql`)
4. Index files for a story (e.g. branch name as story slug). Use the **full story range** so all commits in the story are included:
@@ -40,6 +43,35 @@ When the app runs as a service in Docker, it can start a **webhook server** so t
Health check: `GET http://<host>:8000/health``ok`. Port is configurable via `WEBHOOK_PORT` (default 8000) in docker-compose.
### Webhook diagnostics (202 Accepted but no new rows in DB)
1. **Logs** — After a push, check app logs. Each webhook logs `pull_and_index started branch=… repo_path=…`; then one of:
- `not a git repo or missing``/data` in the container is not a git clone; clone the repo into the mounted dir.
- `git fetch failed` — SSH/network (see `docker/ssh/README.md`) or wrong remote.
- `git checkout … failed` — branch missing in the clone.
- `git merge --ff-only failed` — nonfast-forward (e.g. force-push); index is skipped. Use a normal push or re-clone.
- `no new commits for branch=…` — merge was a no-op (already up to date); nothing to index.
- `running index story=…` then `index completed` — index ran; check tables for that story.
- `index failed` — stderr shows the `rag-agent index` error (e.g. DB, embeddings, repo path).
```bash
docker compose logs -f app
# or: docker logs -f rag-agent
```
Trigger a push and watch for the lines above.
2. **Story and tables** — Rows are per **story** (branch name). Query by story, e.g. `SELECT * FROM stories;` then `SELECT * FROM chunks WHERE story_id = (SELECT id FROM stories WHERE slug = 'main');`.
3. **Manual index** — Run index inside the container to confirm DB and repo work:
```bash
docker compose exec app rag-agent index --story main --changed --base-ref main --head-ref HEAD
```
If this inserts rows, the issue is in the webhook path (fetch/merge/refs).
4. **Allowed extensions** — Only `.md`, `.txt`, `.rst` (or `RAG_ALLOWED_EXTENSIONS`) are indexed; other files are skipped.
5. **"expected 1536 dimensions, not 1024"** — GigaChat Embeddings returns 1024-dim vectors; the default is now 1024. If the DB was created earlier with vector(1536), drop and recreate the tables so the app can create them with 1024: `psql "$RAG_DB_DSN" -c "DROP TABLE IF EXISTS chunks; DROP TABLE IF EXISTS documents;"` then restart the app (ensure_schema will recreate the tables).
## Git hook (index on commit)
Install the post-commit hook so changed files are indexed after each commit:
@@ -91,7 +123,21 @@ Scripts: `scripts/create_db.py` (Python, uses `ensure_schema` and `RAG_*` env),
If `GIGACHAT_CREDENTIALS` is set (e.g. in `.env` for local runs), embeddings use GigaChat API; otherwise the stub client is used. Optional env: `GIGACHAT_EMBEDDINGS_MODEL` (default `Embeddings`), `GIGACHAT_VERIFY_SSL` (`true`/`false`). Ensure `RAG_EMBEDDINGS_DIM` matches the model output (see GigaChat docs).
## Agent (GigaChat)
Ответы на вопросы формирует агент на базе GigaChat: поиск по базе знаний (RAG) + генерация текста. Если задана переменная `GIGACHAT_CREDENTIALS`, используется `GigaChatLLMClient` в `src/rag_agent/agent/pipeline.py`; иначе — заглушка. Модель чата задаётся через `RAG_LLM_MODEL` (по умолчанию `GigaChat`).
## Telegram-бот
Общение с пользователем через бота в Telegram: бот отвечает на текстовые сообщения, используя знания из базы (RAG + GigaChat).
1. Создайте бота через [@BotFather](https://t.me/BotFather) и получите токен.
2. Добавьте в `.env`: `TELEGRAM_BOT_TOKEN=<токен>`.
3. Запуск: `rag-agent bot` (или `python -m rag_agent.telegram_bot`).
4. Через Docker: `docker compose up -d` поднимает БД, вебхук-сервер и бота в отдельных контейнерах; в `.env` должен быть задан `TELEGRAM_BOT_TOKEN`.
Требуются: `RAG_DB_DSN`, `RAG_REPO_PATH`, `GIGACHAT_CREDENTIALS`, `TELEGRAM_BOT_TOKEN`. Расширенное логирование (входящие сообщения, число эмбеддингов, число чанков из БД, ответ LLM): `RAG_BOT_VERBOSE_LOGGING=true|false` (по умолчанию `true` для отладки).
## Notes
- LLM client is still a stub; replace it in `src/rag_agent/agent/pipeline.py` for real answers.
- This project requires Postgres with the `pgvector` extension.

View File

@@ -42,18 +42,47 @@ services:
RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
# In container repo is always at /data (mounted below). Use RAG_REPO_HOST in .env for host path.
RAG_REPO_PATH: "/data"
RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1536}
# Accept host key on first connect; git fetch uses SSH from /root/.ssh (mounted below).
GIT_SSH_COMMAND: "ssh -o StrictHostKeyChecking=accept-new"
RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
WEBHOOK_SECRET: ${WEBHOOK_SECRET:-}
volumes:
# Host path: set RAG_REPO_HOST in .env (e.g. /Users/you/repo). Falls back to RAG_REPO_PATH then ./data.
- ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
# SSH for git fetch (webhook): put deploy key and known_hosts in RAG_SSH_DIR. See docker/ssh/README.md.
- ${RAG_SSH_DIR:-./docker/ssh}:/root/.ssh:ro
entrypoint: ["rag-agent"]
command: ["serve", "--host", "0.0.0.0", "--port", "8000"]
networks:
- rag_net
bot:
build:
context: .
dockerfile: Dockerfile
image: rag-agent:latest
container_name: rag-bot
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
environment:
RAG_DB_DSN: "postgresql://${POSTGRES_USER:-rag}:${POSTGRES_PASSWORD:-rag_secret}@postgres:5432/${POSTGRES_DB:-rag}"
RAG_REPO_PATH: "/data"
RAG_EMBEDDINGS_DIM: ${RAG_EMBEDDINGS_DIM:-1024}
GIGACHAT_CREDENTIALS: ${GIGACHAT_CREDENTIALS:-}
GIGACHAT_EMBEDDINGS_MODEL: ${GIGACHAT_EMBEDDINGS_MODEL:-Embeddings}
TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN:-}
RAG_BOT_VERBOSE_LOGGING: ${RAG_BOT_VERBOSE_LOGGING:-true}
volumes:
- ${RAG_REPO_HOST:-${RAG_REPO_PATH:-./data}}:/data
entrypoint: ["rag-agent"]
command: ["bot"]
networks:
- rag_net
networks:
rag_net:
driver: bridge

View File

@@ -1,5 +1,5 @@
-- RAG vector DB schema (runs automatically on first Postgres init).
-- If RAG_EMBEDDINGS_DIM is not 1536, change vector(1536) below.
-- GigaChat Embeddings = 1024; for OpenAI use vector(1536).
CREATE EXTENSION IF NOT EXISTS vector;
@@ -27,7 +27,7 @@ CREATE TABLE IF NOT EXISTS chunks (
chunk_index INTEGER NOT NULL,
hash TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536) NOT NULL,
embedding vector(1024) NOT NULL,
start_line INTEGER,
end_line INTEGER,
change_type TEXT NOT NULL DEFAULT 'added'

View File

@@ -0,0 +1 @@
18

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Some files were not shown because too many files have changed in this diff Show More