
LLM Inventory

Provider and SDK

  • Provider in code: GigaChat / Sber
  • Local SDK style: custom thin HTTP client over requests
  • Core files:
    • app/modules/shared/gigachat/client.py
    • app/modules/shared/gigachat/settings.py
    • app/modules/shared/gigachat/token_provider.py
    • app/modules/agent/llm/service.py

There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.

Configuration

Model and endpoint configuration is read from the environment in GigaChatSettings.from_env():

  • GIGACHAT_AUTH_URL
    • default: https://ngw.devices.sberbank.ru:9443/api/v2/oauth
  • GIGACHAT_API_URL
    • default: https://gigachat.devices.sberbank.ru/api/v1
  • GIGACHAT_SCOPE
    • default: GIGACHAT_API_PERS
  • GIGACHAT_TOKEN
    • required for auth
  • GIGACHAT_SSL_VERIFY
    • default: true
  • GIGACHAT_MODEL
    • default: GigaChat
  • GIGACHAT_EMBEDDING_MODEL
    • default: Embeddings
  • AGENT_PROMPTS_DIR
    • optional prompt directory override

PostgreSQL config for retrieval storage is separate:

  • DATABASE_URL
    • default: postgresql+psycopg://agent:agent@db:5432/agent
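The environment-driven configuration above can be sketched as a settings object. This is a hypothetical reconstruction of what GigaChatSettings.from_env() likely looks like: the dataclass shape and field names are assumptions, while the env var names and defaults come from the inventory.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GigaChatSettings:
    """Hypothetical sketch of the settings object; field names are illustrative."""
    auth_url: str
    api_url: str
    scope: str
    token: str
    ssl_verify: bool
    model: str
    embedding_model: str

    @classmethod
    def from_env(cls) -> "GigaChatSettings":
        return cls(
            auth_url=os.getenv(
                "GIGACHAT_AUTH_URL",
                "https://ngw.devices.sberbank.ru:9443/api/v2/oauth",
            ),
            api_url=os.getenv(
                "GIGACHAT_API_URL",
                "https://gigachat.devices.sberbank.ru/api/v1",
            ),
            scope=os.getenv("GIGACHAT_SCOPE", "GIGACHAT_API_PERS"),
            # Required: raises KeyError when unset, matching "required for auth".
            token=os.environ["GIGACHAT_TOKEN"],
            ssl_verify=os.getenv("GIGACHAT_SSL_VERIFY", "true").lower() == "true",
            model=os.getenv("GIGACHAT_MODEL", "GigaChat"),
            embedding_model=os.getenv("GIGACHAT_EMBEDDING_MODEL", "Embeddings"),
        )
```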

Default models

  • Chat/completions model default: GigaChat
  • Embedding model default: Embeddings

Completion payload

Observed payload sent by GigaChatClient.complete(...):

{
  "model": "GigaChat",
  "messages": [
    {"role": "system", "content": "<prompt template text>"},
    {"role": "user", "content": "<runtime user input>"}
  ]
}

Endpoint:

  • POST {GIGACHAT_API_URL}/chat/completions

Observed response handling:

  • reads choices[0].message.content
  • if no choices: returns empty string
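The payload shape and response handling above can be sketched as a pair of pure helpers plus a thin POST wrapper. Function names, the error class, and the header layout are assumptions; the payload fields, endpoint, 90s timeout, verify flag, and empty-choices fallback come from the inventory.

```python
from typing import Any


class GigaChatError(RuntimeError):
    """Stand-in for the client's error type, raised on HTTP/request failures."""


def build_completion_payload(model: str, system_prompt: str, user_input: str) -> dict[str, Any]:
    # Exactly the observed payload shape: model plus system/user messages only.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }


def parse_completion_response(data: dict[str, Any]) -> str:
    # Observed handling: choices[0].message.content, or "" when choices is empty.
    choices = data.get("choices") or []
    if not choices:
        return ""
    return choices[0]["message"]["content"]


def complete(api_url: str, access_token: str, payload: dict[str, Any], ssl_verify: bool = True) -> str:
    # Imported lazily so the pure helpers above stay dependency-free in this sketch.
    import requests

    try:
        resp = requests.post(
            f"{api_url}/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {access_token}"},
            timeout=90,  # completions timeout from the inventory
            verify=ssl_verify,
        )
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise GigaChatError(str(exc)) from exc
    return parse_completion_response(resp.json())
```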

Embeddings payload

Observed payload sent by GigaChatClient.embed(...):

{
  "model": "Embeddings",
  "input": [
    "<text1>",
    "<text2>"
  ]
}

Endpoint:

  • POST {GIGACHAT_API_URL}/embeddings

Observed response handling:

  • expects data list
  • maps each item.embedding to list[float]
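The embeddings request and response handling can be sketched the same way. Helper names are hypothetical; the payload fields and the data/embedding mapping come from the inventory.

```python
from typing import Any


def build_embeddings_payload(model: str, texts: list[str]) -> dict[str, Any]:
    # Observed payload shape: model plus a list of input strings.
    return {"model": model, "input": list(texts)}


def parse_embeddings_response(data: dict[str, Any]) -> list[list[float]]:
    # Observed handling: expects a "data" list and maps each item's
    # "embedding" field to list[float], preserving order.
    return [[float(x) for x in item["embedding"]] for item in data["data"]]
```

The request itself is a POST to {GIGACHAT_API_URL}/embeddings with the same 90s timeout and verify flag as the completion call.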

Parameters

Explicitly implemented

  • model
  • messages
  • input
  • HTTP timeout:
    • completions: 90s
    • embeddings: 90s
    • auth: 30s
  • TLS verification flag:
    • verify=settings.ssl_verify

Not implemented in payload

  • temperature
  • top_p
  • max_tokens
  • response_format
  • tools/function calling
  • streaming
  • seed
  • stop sequences

ASSUMPTION: the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.

Context and budget limits

There is no centralized token budget manager in the current code.

Observed practical limits instead:

  • prompt file text is loaded as-is from disk
  • user input is passed as-is
  • RAG context shaping happens outside the LLM client
  • docs indexing summary truncation:
    • docs module catalog summary: 4000 chars
    • docs policy text: 4000 chars
  • project QA source bundle caps:
    • top 12 rag items
    • top 10 file candidates
  • logging truncation only:
    • LLM input/output log lines are capped at 1500 chars

ASSUMPTION: there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.
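The scattered character caps above could be collected into one place. The constant names below are hypothetical; only the values come from the inventory, and the helper shows that all of these are plain character caps, not token-aware budgeting.

```python
# Hypothetical constant names; values are the caps listed above.
DOCS_CATALOG_SUMMARY_MAX = 4000   # docs module catalog summary
DOCS_POLICY_TEXT_MAX = 4000       # docs policy text
QA_RAG_TOP_ITEMS = 12             # top RAG items in the project QA bundle
QA_FILE_CANDIDATES_TOP = 10       # top file candidates
LOG_TRUNCATE_CHARS = 1500         # log-only truncation


def truncate(text: str, limit: int) -> str:
    """Hard character cap; no token counting happens anywhere in the code."""
    return text if len(text) <= limit else text[:limit]
```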

Retry, backoff, timeout

Timeouts

  • auth: 30s
  • chat completion: 90s
  • embeddings: 90s

Retry

  • Generic async retry wrapper exists in app/modules/shared/retry_executor.py
  • It retries only:
    • TimeoutError
    • ConnectionError
    • OSError
  • Retry constants:
    • MAX_RETRIES = 5
    • backoff: 0.1 * attempt seconds

Important current limitation

  • GigaChatClient raises GigaChatError on HTTP and request failures.
  • RetryExecutor does not catch GigaChatError.
  • Result: LLM and embeddings calls are effectively not retried by this generic retry helper unless errors are converted upstream.
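The retry behavior and its gap can be sketched as follows. The wrapper's name and signature are assumptions about retry_executor.py; the retried exception types, MAX_RETRIES = 5, and the 0.1 * attempt backoff come from the inventory, and the GigaChatError stand-in demonstrates why provider errors bypass the retry loop.

```python
import asyncio

MAX_RETRIES = 5
# Only these are retried; note ConnectionError is itself a subclass of OSError.
RETRYABLE = (TimeoutError, ConnectionError, OSError)


class GigaChatError(RuntimeError):
    """Stand-in for the client's error type; NOT in RETRYABLE, so never retried."""


async def run_with_retry(fn, *args, **kwargs):
    """Hypothetical sketch of the generic async retry wrapper."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await fn(*args, **kwargs)
        except RETRYABLE:
            if attempt == MAX_RETRIES:
                raise
            await asyncio.sleep(0.1 * attempt)  # linear backoff: 0.1 * attempt seconds
```

A transient ConnectionError is retried up to five times, but a GigaChatError raised on the first attempt propagates immediately, which is the limitation described above.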

Prompt formation

Prompt loading is handled by PromptLoader:

  • base dir: app/modules/agent/prompts
  • override: AGENT_PROMPTS_DIR
  • file naming convention: <prompt_name>.txt

Prompt composition model today:

  • system prompt:
    • full contents of selected prompt file
  • user prompt:
    • raw runtime input string passed by the caller
  • no separate developer prompt layer in the application payload

If a prompt file is missing:

  • fallback system prompt: You are a helpful assistant.

Prompt templates present

  • router_intent
  • general_answer
  • project_answer
  • docs_detect
  • docs_strategy
  • docs_plan_sections
  • docs_generation
  • docs_self_check
  • docs_execution_summary
  • project_edits_plan
  • project_edits_hunks
  • project_edits_self_check

Key LLM call entrypoints

Composition roots

  • app/modules/agent/module.py
    • builds GigaChatSettings
    • builds GigaChatTokenProvider
    • builds GigaChatClient
    • builds PromptLoader
    • builds AgentLlmService
  • app/modules/rag_session/module.py
    • builds the same provider stack for embeddings used by RAG

Main abstraction

  • AgentLlmService.generate(prompt_name, user_input, log_context=None)

Current generate callsites

  • app/modules/agent/engine/router/intent_classifier.py
    • router_intent
  • app/modules/agent/engine/graphs/base_graph.py
    • general_answer
  • app/modules/agent/engine/graphs/project_qa_graph.py
    • project_answer
  • app/modules/agent/engine/graphs/docs_graph_logic.py
    • docs_detect
    • docs_strategy
    • docs_plan_sections
    • docs_generation
    • docs_self_check
    • docs_execution_summary-style usage via the summary step
  • app/modules/agent/engine/graphs/project_edits_logic.py
    • project_edits_plan
    • project_edits_self_check
    • project_edits_hunks

Logging and observability

AgentLlmService logs:

  • input:
    • graph llm input: context=... prompt=... user_input=...
  • output:
    • graph llm output: context=... prompt=... output=...

Log truncation:

  • 1500 chars

RAG retrieval is logged separately in RagService, but without embedding vectors.

Integration with retrieval

There are two distinct GigaChat usages:

  1. Chat/completion path for agent reasoning and generation
  2. Embedding path for RAG indexing and retrieval

The embedding adapter is GigaChatEmbedder, used by:

  • app/modules/rag/services/rag_service.py

Notable limitations

  • Single provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
  • No model routing by scenario.
  • No tool/function calling.
  • No centralized prompt token budgeting.
  • No explicit retry for GigaChatError.
  • No streaming completions.
  • No structured response mode beyond prompt conventions and downstream parsing.