LLM Inventory
Provider and SDK
- Provider in code: GigaChat / Sber
- Local SDK style: custom thin HTTP client over requests
- Core files:
  - app/modules/shared/gigachat/client.py
  - app/modules/shared/gigachat/settings.py
  - app/modules/shared/gigachat/token_provider.py
  - app/modules/agent/llm/service.py
There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.
Configuration
Model and endpoint configuration are read from environment in GigaChatSettings.from_env():
- GIGACHAT_AUTH_URL - default: https://ngw.devices.sberbank.ru:9443/api/v2/oauth
- GIGACHAT_API_URL - default: https://gigachat.devices.sberbank.ru/api/v1
- GIGACHAT_SCOPE - default: GIGACHAT_API_PERS
- GIGACHAT_TOKEN - required for auth
- GIGACHAT_SSL_VERIFY - default: true
- GIGACHAT_MODEL - default: GigaChat
- GIGACHAT_EMBEDDING_MODEL - default: Embeddings
- AGENT_PROMPTS_DIR - optional prompt directory override
PostgreSQL config for retrieval storage is separate:
- DATABASE_URL - default: postgresql+psycopg://agent:agent@db:5432/agent
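The environment-driven configuration above can be sketched as follows. This is a hypothetical reconstruction of what GigaChatSettings.from_env() might look like; the field names and class shape are assumptions, only the variable names and defaults come from the notes above.

```python
# Hypothetical sketch of GigaChatSettings.from_env(); field names are
# assumptions, env var names and defaults mirror the inventory above.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class GigaChatSettings:
    auth_url: str
    api_url: str
    scope: str
    token: str
    ssl_verify: bool
    model: str
    embedding_model: str

    @classmethod
    def from_env(cls) -> "GigaChatSettings":
        return cls(
            auth_url=os.getenv(
                "GIGACHAT_AUTH_URL",
                "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"),
            api_url=os.getenv(
                "GIGACHAT_API_URL",
                "https://gigachat.devices.sberbank.ru/api/v1"),
            scope=os.getenv("GIGACHAT_SCOPE", "GIGACHAT_API_PERS"),
            token=os.environ["GIGACHAT_TOKEN"],  # required, no default
            ssl_verify=os.getenv("GIGACHAT_SSL_VERIFY", "true").lower() == "true",
            model=os.getenv("GIGACHAT_MODEL", "GigaChat"),
            embedding_model=os.getenv("GIGACHAT_EMBEDDING_MODEL", "Embeddings"),
        )
```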
Default models
- Chat/completions model default: GigaChat
- Embedding model default: Embeddings
Completion payload
Observed payload sent by GigaChatClient.complete(...):
{
"model": "GigaChat",
"messages": [
{"role": "system", "content": "<prompt template text>"},
{"role": "user", "content": "<runtime user input>"}
]
}
Endpoint:
POST {GIGACHAT_API_URL}/chat/completions
Observed response handling:
- reads choices[0].message.content
- if there are no choices: returns an empty string
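A minimal sketch of this request/response handling, assuming the observed payload shape and the 90s timeout. The function names here are illustrative, not the actual GigaChatClient methods; the bearer token is stubbed as a plain string instead of going through GigaChatTokenProvider.

```python
# Hypothetical sketch of GigaChatClient.complete(...); names are
# assumptions based on the observed payload and response handling.
import requests

def build_chat_payload(model: str, system_prompt: str, user_input: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }

def parse_chat_response(data: dict) -> str:
    # Observed handling: read choices[0].message.content,
    # fall back to an empty string when there are no choices.
    choices = data.get("choices") or []
    if not choices:
        return ""
    return choices[0]["message"]["content"]

def complete(api_url: str, token: str, payload: dict,
             ssl_verify: bool = True) -> str:
    resp = requests.post(
        f"{api_url}/chat/completions",
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=90,        # observed completions timeout
        verify=ssl_verify, # mirrors verify=settings.ssl_verify
    )
    resp.raise_for_status()
    return parse_chat_response(resp.json())
```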
Embeddings payload
Observed payload sent by GigaChatClient.embed(...):
{
"model": "Embeddings",
"input": [
"<text1>",
"<text2>"
]
}
Endpoint:
POST {GIGACHAT_API_URL}/embeddings
Observed response handling:
- expects a data list
- maps each item.embedding to list[float]
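The embeddings payload and response mapping can be sketched the same way; these helper names are hypothetical, only the field names come from the observed payload.

```python
# Hypothetical sketch of GigaChatClient.embed(...) payload building and
# response handling; helper names are assumptions.
def build_embeddings_payload(model: str, texts: list[str]) -> dict:
    return {"model": model, "input": list(texts)}

def parse_embeddings_response(data: dict) -> list[list[float]]:
    # Observed handling: expects a "data" list and maps each
    # item.embedding to list[float].
    return [[float(x) for x in item["embedding"]] for item in data["data"]]
```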
Parameters
Explicitly implemented
- payload fields: model, messages, input
- HTTP timeouts:
  - completions: 90s
  - embeddings: 90s
  - auth: 30s
- TLS verification flag: verify=settings.ssl_verify
Not implemented in payload
- temperature
- top_p
- max_tokens
- response_format
- tools/function calling
- streaming
- seed
- stop sequences
ASSUMPTION: the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.
Context and budget limits
There is no centralized token budget manager in the current code.
Observed practical limits instead:
- prompt file text is loaded as-is from disk
- user input is passed as-is
- RAG context shaping happens outside the LLM client
- docs indexing summary truncation:
  - docs module catalog summary: 4000 chars
  - docs policy text: 4000 chars
- project QA source bundle caps:
  - top 12 RAG items
  - top 10 file candidates
- logging truncation only:
  - LLM input/output logs capped at 1500 chars
ASSUMPTION: there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.
Retry, backoff, timeout
Timeouts
- auth: 30s
- chat completion: 90s
- embeddings: 90s
Retry
- A generic async retry wrapper exists in app/modules/shared/retry_executor.py
- It retries only: TimeoutError, ConnectionError, OSError
- Retry constants:
  - MAX_RETRIES = 5
  - backoff: 0.1 * attempt seconds
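A minimal sketch of this retry behavior under the constants above; the function name and exact shape are assumptions, not the real RetryExecutor API.

```python
# Hypothetical sketch of the generic async retry wrapper described above;
# the real helper lives in app/modules/shared/retry_executor.py.
import asyncio

MAX_RETRIES = 5
# Only these error types are retried; anything else (including a
# provider-specific GigaChatError) propagates immediately.
RETRYABLE = (TimeoutError, ConnectionError, OSError)

async def execute_with_retry(fn, *args, **kwargs):
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await fn(*args, **kwargs)
        except RETRYABLE as exc:
            last_error = exc
            # Linear backoff: 0.1 * attempt seconds
            await asyncio.sleep(0.1 * attempt)
    raise last_error
```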
Important current limitation
- GigaChatClient raises GigaChatError on HTTP and request failures.
- RetryExecutor does not catch GigaChatError.
- Result: LLM and embeddings calls are effectively not retried by this generic retry helper unless errors are converted upstream.
Prompt formation
Prompt loading is handled by PromptLoader:
- base dir: app/modules/agent/prompts
- override: AGENT_PROMPTS_DIR
- file naming convention: <prompt_name>.txt
Prompt composition model today:
- system prompt: full contents of the selected prompt file
- user prompt: raw runtime input string passed by the caller
- no separate developer prompt layer in the application payload
If a prompt file is missing:
- fallback system prompt: You are a helpful assistant.
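The lookup behavior above can be sketched as follows. The class shape and method name are assumptions; the base directory, override variable, naming convention, and fallback string come from the notes above.

```python
# Hypothetical sketch of PromptLoader's lookup behavior; only the paths,
# env override, and fallback text are taken from the inventory above.
import os
from pathlib import Path

FALLBACK_SYSTEM_PROMPT = "You are a helpful assistant."

class PromptLoader:
    def __init__(self, base_dir: str = "app/modules/agent/prompts"):
        # AGENT_PROMPTS_DIR, when set, overrides the packaged prompt dir
        self.base_dir = Path(os.getenv("AGENT_PROMPTS_DIR", base_dir))

    def load(self, prompt_name: str) -> str:
        # <prompt_name>.txt naming convention
        path = self.base_dir / f"{prompt_name}.txt"
        if not path.is_file():
            return FALLBACK_SYSTEM_PROMPT
        return path.read_text(encoding="utf-8")
```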
Prompt templates present
- router_intent
- general_answer
- project_answer
- docs_detect
- docs_strategy
- docs_plan_sections
- docs_generation
- docs_self_check
- docs_execution_summary
- project_edits_plan
- project_edits_hunks
- project_edits_self_check
Key LLM call entrypoints
Composition roots
- app/modules/agent/module.py builds:
  - GigaChatSettings
  - GigaChatTokenProvider
  - GigaChatClient
  - PromptLoader
  - AgentLlmService
- app/modules/rag_session/module.py builds the same provider stack for embeddings used by RAG
Main abstraction
AgentLlmService.generate(prompt_name, user_input, log_context=None)
Current generate callsites
- app/modules/agent/engine/router/intent_classifier.py: router_intent
- app/modules/agent/engine/graphs/base_graph.py: general_answer
- app/modules/agent/engine/graphs/project_qa_graph.py: project_answer
- app/modules/agent/engine/graphs/docs_graph_logic.py: docs_detect, docs_strategy, docs_plan_sections, docs_generation, docs_self_check, docs_execution_summary-like usage via summary step
- app/modules/agent/engine/graphs/project_edits_logic.py: project_edits_plan, project_edits_self_check, project_edits_hunks
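How AgentLlmService.generate(...) composes these pieces can be sketched as below. The loader/client interfaces are stand-ins with the responsibilities described in this document, and the 1500-char log cap mirrors the logging notes; the actual implementation may differ.

```python
# Hypothetical sketch of AgentLlmService.generate(...); collaborator
# interfaces and log format follow this document's notes, not real code.
import logging

logger = logging.getLogger("agent.llm")
LOG_CAP = 1500  # observed log truncation limit

class AgentLlmService:
    def __init__(self, loader, client):
        self.loader = loader  # resolves prompt_name -> system prompt text
        self.client = client  # sends the chat-completion request

    def generate(self, prompt_name: str, user_input: str, log_context=None) -> str:
        system_prompt = self.loader.load(prompt_name)
        logger.info("graph llm input: context=%s prompt=%s user_input=%s",
                    log_context, prompt_name, user_input[:LOG_CAP])
        output = self.client.complete(system_prompt, user_input)
        logger.info("graph llm output: context=%s prompt=%s output=%s",
                    log_context, prompt_name, output[:LOG_CAP])
        return output
```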
Logging and observability
AgentLlmService logs:
- input: graph llm input: context=... prompt=... user_input=...
- output: graph llm output: context=... prompt=... output=...
Log truncation:
- input and output values are capped at 1500 chars
RAG retrieval is logged separately in RagService, without embedding vectors.
Integration with retrieval
There are two distinct GigaChat usages:
- Chat/completion path for agent reasoning and generation
- Embedding path for RAG indexing and retrieval
The embedding adapter is GigaChatEmbedder, used by app/modules/rag/services/rag_service.py.
Notable limitations
- Single provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
- No model routing by scenario.
- No tool/function calling.
- No centralized prompt token budgeting.
- No explicit retry for GigaChatError.
- No streaming completions.
- No structured response mode beyond prompt conventions and downstream parsing.