# LLM Inventory

## Provider and SDK

- Provider in code: GigaChat / Sber
- Local SDK style: custom thin HTTP client over `requests`
- Core files:
  - `app/modules/shared/gigachat/client.py`
  - `app/modules/shared/gigachat/settings.py`
  - `app/modules/shared/gigachat/token_provider.py`
  - `app/modules/agent/llm/service.py`

There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.

## Configuration

Model and endpoint configuration is read from the environment in `GigaChatSettings.from_env()`:

- `GIGACHAT_AUTH_URL`
  - default: `https://ngw.devices.sberbank.ru:9443/api/v2/oauth`
- `GIGACHAT_API_URL`
  - default: `https://gigachat.devices.sberbank.ru/api/v1`
- `GIGACHAT_SCOPE`
  - default: `GIGACHAT_API_PERS`
- `GIGACHAT_TOKEN`
  - required for auth
- `GIGACHAT_SSL_VERIFY`
  - default: `true`
- `GIGACHAT_MODEL`
  - default: `GigaChat`
- `GIGACHAT_EMBEDDING_MODEL`
  - default: `Embeddings`
- `AGENT_PROMPTS_DIR`
  - optional prompt directory override

PostgreSQL configuration for retrieval storage is separate:

- `DATABASE_URL`
  - default: `postgresql+psycopg://agent:agent@db:5432/agent`

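A minimal sketch of what such a `from_env()` reader might look like. The field names and the `KeyError` behavior for the required token are assumptions for illustration, not the actual contents of `settings.py`:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GigaChatSettings:
    # Field names are illustrative; the real settings.py may differ.
    auth_url: str
    api_url: str
    scope: str
    token: str
    ssl_verify: bool
    model: str
    embedding_model: str

    @classmethod
    def from_env(cls) -> "GigaChatSettings":
        return cls(
            auth_url=os.getenv("GIGACHAT_AUTH_URL", "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"),
            api_url=os.getenv("GIGACHAT_API_URL", "https://gigachat.devices.sberbank.ru/api/v1"),
            scope=os.getenv("GIGACHAT_SCOPE", "GIGACHAT_API_PERS"),
            token=os.environ["GIGACHAT_TOKEN"],  # required; raises KeyError if unset
            ssl_verify=os.getenv("GIGACHAT_SSL_VERIFY", "true").lower() == "true",
            model=os.getenv("GIGACHAT_MODEL", "GigaChat"),
            embedding_model=os.getenv("GIGACHAT_EMBEDDING_MODEL", "Embeddings"),
        )
```
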
## Default models

- Chat/completions model default: `GigaChat`
- Embedding model default: `Embeddings`

## Completion payload

Observed payload sent by `GigaChatClient.complete(...)`:

```json
{
  "model": "GigaChat",
  "messages": [
    {"role": "system", "content": "<prompt template text>"},
    {"role": "user", "content": "<runtime user input>"}
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/chat/completions`

Observed response handling:

- reads `choices[0].message.content`
- if no choices: returns an empty string

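The call above can be sketched as a single `requests.post` (the real `GigaChatClient` also handles token refresh and wraps failures in `GigaChatError`; the function name and signature here are assumptions):

```python
import requests


def complete(api_url: str, access_token: str, system_prompt: str,
             user_input: str, model: str = "GigaChat",
             ssl_verify: bool = True) -> str:
    """Illustrative sketch of the observed chat-completion request."""
    resp = requests.post(
        f"{api_url}/chat/completions",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        },
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=90,          # matches the observed 90s completion timeout
        verify=ssl_verify,   # matches verify=settings.ssl_verify
    )
    resp.raise_for_status()
    choices = resp.json().get("choices", [])
    # No choices -> empty string, mirroring the observed response handling.
    return choices[0]["message"]["content"] if choices else ""
```
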
## Embeddings payload

Observed payload sent by `GigaChatClient.embed(...)`:

```json
{
  "model": "Embeddings",
  "input": [
    "<text1>",
    "<text2>"
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/embeddings`

Observed response handling:

- expects a `data` list
- maps each `item.embedding` to `list[float]`

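The response-mapping step can be sketched as follows (the function name is illustrative, not from the codebase):

```python
def parse_embeddings(response_json: dict) -> list[list[float]]:
    """Map each item.embedding in the 'data' list to list[float],
    mirroring the observed response handling for /embeddings."""
    return [
        [float(x) for x in item["embedding"]]
        for item in response_json.get("data", [])
    ]
```
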
## Parameters

### Explicitly implemented

- `model`
- `messages`
- `input`
- HTTP timeout:
  - completions: `90s`
  - embeddings: `90s`
  - auth: `30s`
- TLS verification flag:
  - `verify=settings.ssl_verify`

### Not implemented in payload

- `temperature`
- `top_p`
- `max_tokens`
- `response_format`
- tools/function calling
- streaming
- seed
- stop sequences

`ASSUMPTION:` the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.

## Context and budget limits

There is no centralized token budget manager in the current code.

Observed practical limits instead:

- prompt file text is loaded as-is from disk
- user input is passed as-is
- RAG context shaping happens outside the LLM client
- docs indexing summary truncation:
  - docs module catalog summary: `4000` chars
  - docs policy text: `4000` chars
- project QA source bundle caps:
  - top `12` RAG items
  - top `10` file candidates
- logging truncation only:
  - LLM input/output log lines capped at `1500` chars

`ASSUMPTION:` there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.

## Retry, backoff, timeout

### Timeouts

- auth: `30s`
- chat completion: `90s`
- embeddings: `90s`

### Retry

- A generic async retry wrapper exists in `app/modules/shared/retry_executor.py`.
- It retries only:
  - `TimeoutError`
  - `ConnectionError`
  - `OSError`
- Retry constants:
  - `MAX_RETRIES = 5`
  - backoff: `0.1 * attempt` seconds

### Important current limitation

- `GigaChatClient` raises `GigaChatError` on HTTP and request failures.
- `RetryExecutor` does not catch `GigaChatError`.
- Result: LLM and embeddings calls are effectively never retried by this generic helper unless errors are converted upstream.

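The gap above is easiest to see in a sketch of the described retry behavior. The `run_with_retry` name and the local `GigaChatError` definition are illustrative; only the retried exception types, `MAX_RETRIES`, and the linear backoff come from the inventory:

```python
import asyncio

MAX_RETRIES = 5


class GigaChatError(Exception):
    """Raised by the client on HTTP/request failures (per the inventory)."""


async def run_with_retry(coro_factory):
    """Retry only TimeoutError/ConnectionError/OSError with linear
    backoff; a GigaChatError therefore escapes on the first attempt."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await coro_factory()
        except (TimeoutError, ConnectionError, OSError):
            if attempt == MAX_RETRIES:
                raise
            await asyncio.sleep(0.1 * attempt)  # 0.1s, 0.2s, ...
```
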
## Prompt formation

Prompt loading is handled by `PromptLoader`:

- base dir: `app/modules/agent/prompts`
- override: `AGENT_PROMPTS_DIR`
- file naming convention: `<prompt_name>.txt`

Prompt composition model today:

- system prompt:
  - full contents of the selected prompt file
- user prompt:
  - raw runtime input string passed by the caller
- no separate developer prompt layer in the application payload

If a prompt file is missing:

- fallback system prompt: `You are a helpful assistant.`

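A minimal sketch of the described loading behavior, assuming the real `PromptLoader` resolves files this way (the function name is illustrative):

```python
import os
from pathlib import Path

DEFAULT_PROMPT = "You are a helpful assistant."


def load_prompt(prompt_name: str) -> str:
    """Resolve <prompt_name>.txt under the base prompt directory
    (or the AGENT_PROMPTS_DIR override); fall back to a default
    system prompt when the file is missing."""
    base = Path(os.getenv("AGENT_PROMPTS_DIR", "app/modules/agent/prompts"))
    path = base / f"{prompt_name}.txt"
    if path.is_file():
        return path.read_text(encoding="utf-8")
    return DEFAULT_PROMPT
```
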
## Prompt templates present

- `router_intent`
- `general_answer`
- `project_answer`
- `docs_detect`
- `docs_strategy`
- `docs_plan_sections`
- `docs_generation`
- `docs_self_check`
- `docs_execution_summary`
- `project_edits_plan`
- `project_edits_hunks`
- `project_edits_self_check`

## Key LLM call entrypoints

### Composition roots

- `app/modules/agent/module.py`
  - builds `GigaChatSettings`
  - builds `GigaChatTokenProvider`
  - builds `GigaChatClient`
  - builds `PromptLoader`
  - builds `AgentLlmService`
- `app/modules/rag_session/module.py`
  - builds the same provider stack for the embeddings used by RAG

### Main abstraction

- `AgentLlmService.generate(prompt_name, user_input, log_context=None)`

### Current generate callsites

- `app/modules/agent/engine/router/intent_classifier.py`
  - `router_intent`
- `app/modules/agent/engine/graphs/base_graph.py`
  - `general_answer`
- `app/modules/agent/engine/graphs/project_qa_graph.py`
  - `project_answer`
- `app/modules/agent/engine/graphs/docs_graph_logic.py`
  - `docs_detect`
  - `docs_strategy`
  - `docs_plan_sections`
  - `docs_generation`
  - `docs_self_check`
  - `docs_execution_summary`-like usage via the summary step
- `app/modules/agent/engine/graphs/project_edits_logic.py`
  - `project_edits_plan`
  - `project_edits_self_check`
  - `project_edits_hunks`

## Logging and observability

`AgentLlmService` logs:

- input:
  - `graph llm input: context=... prompt=... user_input=...`
- output:
  - `graph llm output: context=... prompt=... output=...`

Log truncation:

- `1500` chars

RAG retrieval is logged separately in `RagService`, without embedding vectors.

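The 1500-char cap implies a helper along these lines (this function is illustrative, not from the codebase):

```python
LOG_LIMIT = 1500  # observed cap for LLM input/output log lines


def truncate_for_log(text: str, limit: int = LOG_LIMIT) -> str:
    """Cap a log payload at the limit, marking that it was cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + "...[truncated]"
```
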
## Integration with retrieval

There are two distinct GigaChat usages:

1. Chat/completion path for agent reasoning and generation
2. Embedding path for RAG indexing and retrieval

The embedding adapter is `GigaChatEmbedder`, used by:

- `app/modules/rag/services/rag_service.py`

## Notable limitations

- Single-provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
- No model routing by scenario.
- No tool/function calling.
- No centralized prompt token budgeting.
- No explicit retry for `GigaChatError`.
- No streaming completions.
- No structured response mode beyond prompt conventions and downstream parsing.