# LLM Inventory

## Provider and SDK

- Provider in code: GigaChat / Sber
- Local SDK style: custom thin HTTP client over `requests`
- Core files:
  - `app/modules/shared/gigachat/client.py`
  - `app/modules/shared/gigachat/settings.py`
  - `app/modules/shared/gigachat/token_provider.py`
  - `app/modules/agent/llm/service.py`

There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.

## Configuration

Model and endpoint configuration are read from the environment in `GigaChatSettings.from_env()`:

- `GIGACHAT_AUTH_URL` - default: `https://ngw.devices.sberbank.ru:9443/api/v2/oauth`
- `GIGACHAT_API_URL` - default: `https://gigachat.devices.sberbank.ru/api/v1`
- `GIGACHAT_SCOPE` - default: `GIGACHAT_API_PERS`
- `GIGACHAT_TOKEN` - required for auth
- `GIGACHAT_SSL_VERIFY` - default: `true`
- `GIGACHAT_MODEL` - default: `GigaChat`
- `GIGACHAT_EMBEDDING_MODEL` - default: `Embeddings`
- `AGENT_PROMPTS_DIR` - optional prompt directory override

PostgreSQL config for retrieval storage is separate:

- `DATABASE_URL` - default: `postgresql+psycopg://agent:agent@db:5432/agent`

## Default models

- Chat/completions model default: `GigaChat`
- Embedding model default: `Embeddings`

## Completion payload

Observed payload sent by `GigaChatClient.complete(...)`:

```json
{
  "model": "GigaChat",
  "messages": [
    {"role": "system", "content": ""},
    {"role": "user", "content": ""}
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/chat/completions`

Observed response handling:

- reads `choices[0].message.content`
- if there are no choices: returns an empty string

## Embeddings payload

Observed payload sent by `GigaChatClient.embed(...)`:

```json
{
  "model": "Embeddings",
  "input": [
    "",
    ""
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/embeddings`

Observed response handling:

- expects a `data` list
- maps each `item.embedding` to `list[float]`

## Parameters

### Explicitly implemented

- `model`
- `messages`
- `input`
- HTTP timeout:
  - completions: `90s`
  - embeddings: `90s`
  - auth: `30s`
- TLS verification flag: `verify=settings.ssl_verify`

### Not implemented in payload

- `temperature`
- `top_p`
- `max_tokens`
- `response_format`
- tools/function calling
- streaming
- seed
- stop sequences

`ASSUMPTION:` the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.

## Context and budget limits

There is no centralized token budget manager in the current code. Observed practical limits instead:

- prompt file text is loaded as-is from disk
- user input is passed as-is
- RAG context shaping happens outside the LLM client
- docs indexing summary truncation:
  - docs module catalog summary: `4000` chars
  - docs policy text: `4000` chars
- project QA source bundle caps:
  - top `12` RAG items
  - top `10` file candidates
- logging truncation only:
  - LLM input/output log lines capped at `1500` chars

`ASSUMPTION:` there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.

## Retry, backoff, timeout

### Timeouts

- auth: `30s`
- chat completion: `90s`
- embeddings: `90s`

### Retry

- A generic async retry wrapper exists in `app/modules/shared/retry_executor.py`
- It retries only:
  - `TimeoutError`
  - `ConnectionError`
  - `OSError`
- Retry constants:
  - `MAX_RETRIES = 5`
  - backoff: `0.1 * attempt` seconds

### Important current limitation

- `GigaChatClient` raises `GigaChatError` on HTTP and request failures.
- `RetryExecutor` does not catch `GigaChatError`.
- Result: LLM and embeddings calls are effectively not retried by this generic retry helper unless errors are converted upstream.
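The retry behavior and the `GigaChatError` gap described above can be sketched as follows. This is a minimal illustration under the constants documented here (`MAX_RETRIES = 5`, `0.1 * attempt` backoff), not the actual `RetryExecutor` implementation; the function name `retry_async` is an assumption.

```python
# Hedged sketch of the generic retry wrapper; names are illustrative,
# not the real RetryExecutor API.
import asyncio

MAX_RETRIES = 5
RETRYABLE = (TimeoutError, ConnectionError, OSError)


class GigaChatError(Exception):
    """Raised by GigaChatClient on HTTP/request failures (per this inventory)."""


async def retry_async(fn, *args, **kwargs):
    last_exc = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await fn(*args, **kwargs)
        except RETRYABLE as exc:  # only transport-level errors are retried
            last_exc = exc
            if attempt < MAX_RETRIES:
                await asyncio.sleep(0.1 * attempt)  # linear backoff
    raise last_exc
```

Because `GigaChatError` is not in the retryable tuple, an HTTP-level failure from the client propagates on the first attempt; this is the limitation the section above calls out.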
## Prompt formation

Prompt loading is handled by `PromptLoader`:

- base dir: `app/modules/agent/prompts`
- override: `AGENT_PROMPTS_DIR`
- file naming convention: `.txt`

Prompt composition model today:

- system prompt: full contents of the selected prompt file
- user prompt: raw runtime input string passed by the caller
- no separate developer prompt layer in the application payload

If a prompt file is missing:

- fallback system prompt: `You are a helpful assistant.`

## Prompt templates present

- `router_intent`
- `general_answer`
- `project_answer`
- `docs_detect`
- `docs_strategy`
- `docs_plan_sections`
- `docs_generation`
- `docs_self_check`
- `docs_execution_summary`
- `project_edits_plan`
- `project_edits_hunks`
- `project_edits_self_check`

## Key LLM call entrypoints

### Composition roots

- `app/modules/agent/module.py`
  - builds `GigaChatSettings`
  - builds `GigaChatTokenProvider`
  - builds `GigaChatClient`
  - builds `PromptLoader`
  - builds `AgentLlmService`
- `app/modules/rag_session/module.py`
  - builds the same provider stack for embeddings used by RAG

### Main abstraction

- `AgentLlmService.generate(prompt_name, user_input, log_context=None)`

### Current generate callsites

- `app/modules/agent/engine/router/intent_classifier.py`
  - `router_intent`
- `app/modules/agent/engine/graphs/base_graph.py`
  - `general_answer`
- `app/modules/agent/engine/graphs/project_qa_graph.py`
  - `project_answer`
- `app/modules/agent/engine/graphs/docs_graph_logic.py`
  - `docs_detect`
  - `docs_strategy`
  - `docs_plan_sections`
  - `docs_generation`
  - `docs_self_check`
  - `docs_execution_summary`-like usage via the summary step
- `app/modules/agent/engine/graphs/project_edits_logic.py`
  - `project_edits_plan`
  - `project_edits_self_check`
  - `project_edits_hunks`

## Logging and observability

`AgentLlmService` logs:

- input: `graph llm input: context=... prompt=... user_input=...`
- output: `graph llm output: context=... prompt=... output=...`

Log truncation: `1500` chars.

RAG retrieval is logged separately in `RagService`, without embedding vectors.

## Integration with retrieval

There are two distinct GigaChat usages:

1. Chat/completion path for agent reasoning and generation
2. Embedding path for RAG indexing and retrieval

The embedding adapter is `GigaChatEmbedder`, used by:

- `app/modules/rag/services/rag_service.py`

## Notable limitations

- Single-provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
- No model routing by scenario.
- No tool/function calling.
- No centralized prompt token budgeting.
- No explicit retry for `GigaChatError`.
- No streaming completions.
- No structured response mode beyond prompt conventions and downstream parsing.
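Taken together, the completion path this inventory describes (prompt file → system message, raw user input → user message, fallback system prompt when the file is missing, empty string when `choices` is absent) can be sketched roughly as below. The helper names and signatures are illustrative assumptions, not the actual `AgentLlmService`/`GigaChatClient` code; only the payload shape and fallback behaviors come from the sections above.

```python
# Hedged sketch of the observed generate() flow; names are assumptions.
from pathlib import Path

FALLBACK_SYSTEM_PROMPT = "You are a helpful assistant."


def load_prompt(prompts_dir: str, prompt_name: str) -> str:
    """Load `<prompt_name>.txt`; fall back as documented when it is missing."""
    path = Path(prompts_dir) / f"{prompt_name}.txt"
    if not path.exists():
        return FALLBACK_SYSTEM_PROMPT
    return path.read_text(encoding="utf-8")


def build_completion_payload(model: str, system_prompt: str, user_input: str) -> dict:
    # Matches the observed payload: only model + messages, no sampling fields.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }


def extract_completion(response_json: dict) -> str:
    choices = response_json.get("choices") or []
    if not choices:  # documented behavior: empty string when no choices
        return ""
    return choices[0]["message"]["content"]
```

The payload builder deliberately omits `temperature`, `top_p`, and `max_tokens`, mirroring the "Not implemented in payload" list: the provider's defaults apply.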