# LLM Inventory

## Provider and SDK

- Provider in code: GigaChat / Sber
- Local SDK style: custom thin HTTP client over `requests`
- Core files:
  - `app/modules/shared/gigachat/client.py`
  - `app/modules/shared/gigachat/settings.py`
  - `app/modules/shared/gigachat/token_provider.py`
  - `app/modules/agent/llm/service.py`

There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.

## Configuration

Model and endpoint configuration is read from the environment in `GigaChatSettings.from_env()`:

- `GIGACHAT_AUTH_URL`
  - default: `https://ngw.devices.sberbank.ru:9443/api/v2/oauth`
- `GIGACHAT_API_URL`
  - default: `https://gigachat.devices.sberbank.ru/api/v1`
- `GIGACHAT_SCOPE`
  - default: `GIGACHAT_API_PERS`
- `GIGACHAT_TOKEN`
  - required for auth
- `GIGACHAT_SSL_VERIFY`
  - default: `true`
- `GIGACHAT_MODEL`
  - default: `GigaChat`
- `GIGACHAT_EMBEDDING_MODEL`
  - default: `Embeddings`
- `AGENT_PROMPTS_DIR`
  - optional prompt directory override

PostgreSQL config for retrieval storage is separate:

- `DATABASE_URL`
  - default: `postgresql+psycopg://agent:agent@db:5432/agent`
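As a sketch of how these variables map onto a settings object (field names and the dataclass shape are assumptions; only the variable names and defaults above come from the code):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GigaChatSettings:
    auth_url: str
    api_url: str
    scope: str
    token: str
    ssl_verify: bool
    model: str
    embedding_model: str

    @classmethod
    def from_env(cls) -> "GigaChatSettings":
        # Defaults mirror the documented values; GIGACHAT_TOKEN has no
        # default and is required for auth.
        return cls(
            auth_url=os.getenv("GIGACHAT_AUTH_URL", "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"),
            api_url=os.getenv("GIGACHAT_API_URL", "https://gigachat.devices.sberbank.ru/api/v1"),
            scope=os.getenv("GIGACHAT_SCOPE", "GIGACHAT_API_PERS"),
            token=os.environ["GIGACHAT_TOKEN"],
            ssl_verify=os.getenv("GIGACHAT_SSL_VERIFY", "true").lower() == "true",
            model=os.getenv("GIGACHAT_MODEL", "GigaChat"),
            embedding_model=os.getenv("GIGACHAT_EMBEDDING_MODEL", "Embeddings"),
        )
```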

## Default models

- Chat/completions model default: `GigaChat`
- Embedding model default: `Embeddings`

## Completion payload

Observed payload sent by `GigaChatClient.complete(...)`:

```json
{
  "model": "GigaChat",
  "messages": [
    {"role": "system", "content": "<prompt template text>"},
    {"role": "user", "content": "<runtime user input>"}
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/chat/completions`

Observed response handling:

- reads `choices[0].message.content`
- if there are no choices, returns an empty string
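A minimal sketch of this call, assuming a bearer-token header and the documented timeout and TLS flag (function names here are illustrative, not the real client's API):

```python
import requests


def build_completion_payload(model: str, system_prompt: str, user_input: str) -> dict:
    # Mirrors the observed request body exactly.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }


def complete(api_url: str, access_token: str, payload: dict, ssl_verify: bool = True) -> str:
    resp = requests.post(
        f"{api_url}/chat/completions",
        headers={"Authorization": f"Bearer {access_token}"},
        json=payload,
        timeout=90,         # documented completion timeout
        verify=ssl_verify,  # TLS verification flag from settings
    )
    resp.raise_for_status()
    choices = resp.json().get("choices", [])
    # Observed handling: empty string when the response has no choices.
    return choices[0]["message"]["content"] if choices else ""
```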

## Embeddings payload

Observed payload sent by `GigaChatClient.embed(...)`:

```json
{
  "model": "Embeddings",
  "input": [
    "<text1>",
    "<text2>"
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/embeddings`

Observed response handling:

- expects a `data` list
- maps each `item.embedding` to `list[float]`
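The request body and the response mapping can be sketched as two small helpers (hypothetical names; the real logic sits inside `GigaChatClient.embed(...)`):

```python
def build_embeddings_payload(model: str, texts: list[str]) -> dict:
    # Mirrors the observed request body for POST {GIGACHAT_API_URL}/embeddings.
    return {"model": model, "input": list(texts)}


def parse_embeddings_response(body: dict) -> list[list[float]]:
    # Observed handling: expects a "data" list and maps each
    # item's "embedding" field to list[float].
    return [[float(x) for x in item["embedding"]] for item in body["data"]]
```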

## Parameters

### Explicitly implemented

- `model`
- `messages`
- `input`
- HTTP timeout:
  - completions: `90s`
  - embeddings: `90s`
  - auth: `30s`
- TLS verification flag:
  - `verify=settings.ssl_verify`

### Not implemented in payload

- `temperature`
- `top_p`
- `max_tokens`
- `response_format`
- tools/function calling
- streaming
- seed
- stop sequences

`ASSUMPTION:` the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.

## Context and budget limits

There is no centralized token budget manager in the current code.

Observed practical limits instead:

- prompt file text is loaded as-is from disk
- user input is passed as-is
- RAG context shaping happens outside the LLM client
- docs indexing summary truncation:
  - docs module catalog summary: `4000` chars
  - docs policy text: `4000` chars
- project QA source bundle caps:
  - top `12` RAG items
  - top `10` file candidates
- logging truncation only:
  - LLM input/output log entries capped at `1500` chars

`ASSUMPTION:` there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.
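The character caps above amount to plain string slicing. A hypothetical helper (not taken from the code; the marker suffix is an illustration) shows the shape of such a cap:

```python
def truncate(text: str, limit: int) -> str:
    # Hypothetical sketch of the character caps described above
    # (e.g. 4000 chars for docs summaries, 1500 for log entries).
    if len(text) <= limit:
        return text
    return text[:limit] + "…[truncated]"
```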

## Retry, backoff, timeout

### Timeouts

- auth: `30s`
- chat completion: `90s`
- embeddings: `90s`

### Retry

- A generic async retry wrapper exists in `app/modules/shared/retry_executor.py`
- It retries only:
  - `TimeoutError`
  - `ConnectionError`
  - `OSError`
- Retry constants:
  - `MAX_RETRIES = 5`
  - backoff: `0.1 * attempt` seconds

### Important current limitation

- `GigaChatClient` raises `GigaChatError` on HTTP and request failures.
- `RetryExecutor` does not catch `GigaChatError`.
- Result: LLM and embeddings calls are effectively not retried by this generic retry helper unless errors are converted upstream.
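A minimal sketch of the described retry behaviour, assuming the documented constants (the function name is illustrative; the real wrapper is `RetryExecutor`). Because only the three built-in exception types are caught, a `GigaChatError` would propagate on the first attempt, which is exactly the limitation noted above:

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")

MAX_RETRIES = 5  # documented constant


async def run_with_retry(fn: Callable[[], Awaitable[T]]) -> T:
    # Retries only TimeoutError / ConnectionError / OSError, with
    # linear backoff of 0.1 * attempt seconds between attempts.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await fn()
        except (TimeoutError, ConnectionError, OSError):
            if attempt == MAX_RETRIES:
                raise
            await asyncio.sleep(0.1 * attempt)
    raise AssertionError("unreachable")
```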

## Prompt formation

Prompt loading is handled by `PromptLoader`:

- base dir: `app/modules/agent/prompts`
- override: `AGENT_PROMPTS_DIR`
- file naming convention: `<prompt_name>.txt`

Prompt composition model today:

- system prompt:
  - full contents of the selected prompt file
- user prompt:
  - raw runtime input string passed by the caller
- no separate developer prompt layer in the application payload

If a prompt file is missing:

- fallback system prompt: `You are a helpful assistant.`
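The loading rules above can be sketched as follows (a sketch only; the real loader's method names and error handling may differ):

```python
import os
from pathlib import Path

FALLBACK_PROMPT = "You are a helpful assistant."  # documented fallback


class PromptLoader:
    def __init__(self, base_dir: str = "app/modules/agent/prompts") -> None:
        # AGENT_PROMPTS_DIR overrides the default prompt directory.
        self.base_dir = Path(os.getenv("AGENT_PROMPTS_DIR", base_dir))

    def load(self, prompt_name: str) -> str:
        # Naming convention: <prompt_name>.txt, read as-is from disk.
        path = self.base_dir / f"{prompt_name}.txt"
        try:
            return path.read_text(encoding="utf-8")
        except FileNotFoundError:
            return FALLBACK_PROMPT
```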

## Prompt templates present

- `router_intent`
- `general_answer`
- `project_answer`
- `docs_detect`
- `docs_strategy`
- `docs_plan_sections`
- `docs_generation`
- `docs_self_check`
- `docs_execution_summary`
- `project_edits_plan`
- `project_edits_hunks`
- `project_edits_self_check`

## Key LLM call entrypoints

### Composition roots

- `app/modules/agent/module.py`
  - builds `GigaChatSettings`
  - builds `GigaChatTokenProvider`
  - builds `GigaChatClient`
  - builds `PromptLoader`
  - builds `AgentLlmService`
- `app/modules/rag_session/module.py`
  - builds the same provider stack for the embeddings used by RAG

### Main abstraction

- `AgentLlmService.generate(prompt_name, user_input, log_context=None)`

### Current generate callsites

- `app/modules/agent/engine/router/intent_classifier.py`
  - `router_intent`
- `app/modules/agent/engine/graphs/base_graph.py`
  - `general_answer`
- `app/modules/agent/engine/graphs/project_qa_graph.py`
  - `project_answer`
- `app/modules/agent/engine/graphs/docs_graph_logic.py`
  - `docs_detect`
  - `docs_strategy`
  - `docs_plan_sections`
  - `docs_generation`
  - `docs_self_check`
  - `docs_execution_summary`-like usage via the summary step
- `app/modules/agent/engine/graphs/project_edits_logic.py`
  - `project_edits_plan`
  - `project_edits_self_check`
  - `project_edits_hunks`

## Logging and observability

`AgentLlmService` logs:

- input:
  - `graph llm input: context=... prompt=... user_input=...`
- output:
  - `graph llm output: context=... prompt=... output=...`

Log truncation:

- `1500` chars

RAG retrieval is logged separately in `RagService`, but without embedding vectors.

## Integration with retrieval

There are two distinct GigaChat usages:

1. Chat/completion path for agent reasoning and generation
2. Embedding path for RAG indexing and retrieval

The embedding adapter is `GigaChatEmbedder`, used by:

- `app/modules/rag/services/rag_service.py`

## Notable limitations

- Single provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
- No model routing by scenario.
- No tool/function calling.
- No centralized prompt token budgeting.
- No explicit retry for `GigaChatError`.
- No streaming completions.
- No structured response mode beyond prompt conventions and downstream parsing.