# LLM Inventory

## Provider and SDK

- Provider in code: GigaChat / Sber
- Local SDK style: custom thin HTTP client over `requests`
- Core files:
  - `app/modules/shared/gigachat/client.py`
  - `app/modules/shared/gigachat/settings.py`
  - `app/modules/shared/gigachat/token_provider.py`
  - `app/modules/agent/llm/service.py`

There is no OpenAI SDK, Azure SDK, or local model runtime in the current implementation.

## Configuration

Model and endpoint configuration is read from the environment in `GigaChatSettings.from_env()`:

- `GIGACHAT_AUTH_URL`
  - default: `https://ngw.devices.sberbank.ru:9443/api/v2/oauth`
- `GIGACHAT_API_URL`
  - default: `https://gigachat.devices.sberbank.ru/api/v1`
- `GIGACHAT_SCOPE`
  - default: `GIGACHAT_API_PERS`
- `GIGACHAT_TOKEN`
  - required for auth
- `GIGACHAT_SSL_VERIFY`
  - default: `true`
- `GIGACHAT_MODEL`
  - default: `GigaChat`
- `GIGACHAT_EMBEDDING_MODEL`
  - default: `Embeddings`
- `AGENT_PROMPTS_DIR`
  - optional prompt directory override

PostgreSQL configuration for retrieval storage is separate:

- `DATABASE_URL`
  - default: `postgresql+psycopg://agent:agent@db:5432/agent`

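A minimal sketch of what such a `from_env()` reader might look like. The field names and the `KeyError` behavior for the required token are assumptions for illustration, not the actual contents of `settings.py`:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GigaChatSettings:
    # Field names are illustrative; the real settings.py may differ.
    auth_url: str
    api_url: str
    scope: str
    token: str
    ssl_verify: bool
    model: str
    embedding_model: str

    @classmethod
    def from_env(cls) -> "GigaChatSettings":
        return cls(
            auth_url=os.getenv("GIGACHAT_AUTH_URL", "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"),
            api_url=os.getenv("GIGACHAT_API_URL", "https://gigachat.devices.sberbank.ru/api/v1"),
            scope=os.getenv("GIGACHAT_SCOPE", "GIGACHAT_API_PERS"),
            token=os.environ["GIGACHAT_TOKEN"],  # required; raises KeyError if unset
            ssl_verify=os.getenv("GIGACHAT_SSL_VERIFY", "true").lower() == "true",
            model=os.getenv("GIGACHAT_MODEL", "GigaChat"),
            embedding_model=os.getenv("GIGACHAT_EMBEDDING_MODEL", "Embeddings"),
        )
```
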
## Default models

- Chat/completions model default: `GigaChat`
- Embedding model default: `Embeddings`

## Completion payload

Observed payload sent by `GigaChatClient.complete(...)`:

```json
{
  "model": "GigaChat",
  "messages": [
    {"role": "system", "content": "<prompt template text>"},
    {"role": "user", "content": "<runtime user input>"}
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/chat/completions`

Observed response handling:

- reads `choices[0].message.content`
- if no choices: returns an empty string

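The call above can be sketched as a single `requests.post` (the real `GigaChatClient` also handles token refresh and wraps failures in `GigaChatError`; the function name and signature here are assumptions):

```python
import requests


def complete(api_url: str, access_token: str, system_prompt: str,
             user_input: str, model: str = "GigaChat",
             ssl_verify: bool = True) -> str:
    """Illustrative sketch of the observed chat-completion request."""
    resp = requests.post(
        f"{api_url}/chat/completions",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        },
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=90,          # matches the observed 90s completion timeout
        verify=ssl_verify,   # matches verify=settings.ssl_verify
    )
    resp.raise_for_status()
    choices = resp.json().get("choices", [])
    # No choices -> empty string, mirroring the observed response handling.
    return choices[0]["message"]["content"] if choices else ""
```
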
## Embeddings payload

Observed payload sent by `GigaChatClient.embed(...)`:

```json
{
  "model": "Embeddings",
  "input": [
    "<text1>",
    "<text2>"
  ]
}
```

Endpoint:

- `POST {GIGACHAT_API_URL}/embeddings`

Observed response handling:

- expects a `data` list
- maps each `item.embedding` to `list[float]`

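The response-mapping step can be sketched as follows (the function name is illustrative, not from the codebase):

```python
def parse_embeddings(response_json: dict) -> list[list[float]]:
    """Map each item.embedding in the 'data' list to list[float],
    mirroring the observed response handling for /embeddings."""
    return [
        [float(x) for x in item["embedding"]]
        for item in response_json.get("data", [])
    ]
```
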
## Parameters

### Explicitly implemented

- `model`
- `messages`
- `input`
- HTTP timeout:
  - completions: `90s`
  - embeddings: `90s`
  - auth: `30s`
- TLS verification flag:
  - `verify=settings.ssl_verify`

### Not implemented in payload

- `temperature`
- `top_p`
- `max_tokens`
- `response_format`
- tools/function calling
- streaming
- seed
- stop sequences

`ASSUMPTION:` the service uses provider defaults for sampling and output length because these fields are not sent in the request payload.

## Context and budget limits

There is no centralized token budget manager in the current code.

Observed practical limits instead:

- prompt file text is loaded as-is from disk
- user input is passed as-is
- RAG context shaping happens outside the LLM client
- docs indexing summary truncation:
  - docs module catalog summary: `4000` chars
  - docs policy text: `4000` chars
- project QA source bundle caps:
  - top `12` RAG items
  - top `10` file candidates
- logging truncation only:
  - LLM input/output log lines capped at `1500` chars

`ASSUMPTION:` there is no explicit max-context enforcement before chat completion requests. The current system relies on upstream graph logic to keep inputs small enough.

## Retry, backoff, timeout

### Timeouts

- auth: `30s`
- chat completion: `90s`
- embeddings: `90s`

### Retry

- A generic async retry wrapper exists in `app/modules/shared/retry_executor.py`.
- It retries only:
  - `TimeoutError`
  - `ConnectionError`
  - `OSError`
- Retry constants:
  - `MAX_RETRIES = 5`
  - backoff: `0.1 * attempt` seconds

### Important current limitation

- `GigaChatClient` raises `GigaChatError` on HTTP and request failures.
- `RetryExecutor` does not catch `GigaChatError`.
- Result: LLM and embeddings calls are effectively never retried by this generic helper unless errors are converted upstream.

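The gap above is easiest to see in a sketch of the described retry behavior. The `run_with_retry` name and the local `GigaChatError` definition are illustrative; only the retried exception types, `MAX_RETRIES`, and the linear backoff come from the inventory:

```python
import asyncio

MAX_RETRIES = 5


class GigaChatError(Exception):
    """Raised by the client on HTTP/request failures (per the inventory)."""


async def run_with_retry(coro_factory):
    """Retry only TimeoutError/ConnectionError/OSError with linear
    backoff; a GigaChatError therefore escapes on the first attempt."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await coro_factory()
        except (TimeoutError, ConnectionError, OSError):
            if attempt == MAX_RETRIES:
                raise
            await asyncio.sleep(0.1 * attempt)  # 0.1s, 0.2s, ...
```
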
## Prompt formation

Prompt loading is handled by `PromptLoader`:

- base dir: `app/modules/agent/prompts`
- override: `AGENT_PROMPTS_DIR`
- file naming convention: `<prompt_name>.txt`

Prompt composition model today:

- system prompt:
  - full contents of the selected prompt file
- user prompt:
  - raw runtime input string passed by the caller
- no separate developer prompt layer in the application payload

If a prompt file is missing:

- fallback system prompt: `You are a helpful assistant.`

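A minimal sketch of the described loading behavior, assuming the real `PromptLoader` resolves files this way (the function name is illustrative):

```python
import os
from pathlib import Path

DEFAULT_PROMPT = "You are a helpful assistant."


def load_prompt(prompt_name: str) -> str:
    """Resolve <prompt_name>.txt under the base prompt directory
    (or the AGENT_PROMPTS_DIR override); fall back to a default
    system prompt when the file is missing."""
    base = Path(os.getenv("AGENT_PROMPTS_DIR", "app/modules/agent/prompts"))
    path = base / f"{prompt_name}.txt"
    if path.is_file():
        return path.read_text(encoding="utf-8")
    return DEFAULT_PROMPT
```
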
## Prompt templates present

- `router_intent`
- `general_answer`
- `project_answer`
- `docs_detect`
- `docs_strategy`
- `docs_plan_sections`
- `docs_generation`
- `docs_self_check`
- `docs_execution_summary`
- `project_edits_plan`
- `project_edits_hunks`
- `project_edits_self_check`

## Key LLM call entrypoints

### Composition roots

- `app/modules/agent/module.py`
  - builds `GigaChatSettings`
  - builds `GigaChatTokenProvider`
  - builds `GigaChatClient`
  - builds `PromptLoader`
  - builds `AgentLlmService`
- `app/modules/rag_session/module.py`
  - builds the same provider stack for the embeddings used by RAG

### Main abstraction

- `AgentLlmService.generate(prompt_name, user_input, log_context=None)`

### Current generate callsites

- `app/modules/agent/engine/router/intent_classifier.py`
  - `router_intent`
- `app/modules/agent/engine/graphs/base_graph.py`
  - `general_answer`
- `app/modules/agent/engine/graphs/project_qa_graph.py`
  - `project_answer`
- `app/modules/agent/engine/graphs/docs_graph_logic.py`
  - `docs_detect`
  - `docs_strategy`
  - `docs_plan_sections`
  - `docs_generation`
  - `docs_self_check`
  - `docs_execution_summary`-like usage via the summary step
- `app/modules/agent/engine/graphs/project_edits_logic.py`
  - `project_edits_plan`
  - `project_edits_self_check`
  - `project_edits_hunks`

## Logging and observability

`AgentLlmService` logs:

- input:
  - `graph llm input: context=... prompt=... user_input=...`
- output:
  - `graph llm output: context=... prompt=... output=...`

Log truncation:

- `1500` chars

RAG retrieval is logged separately in `RagService`, without embedding vectors.

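The 1500-char cap implies a helper along these lines (this function is illustrative, not from the codebase):

```python
LOG_LIMIT = 1500  # observed cap for LLM input/output log lines


def truncate_for_log(text: str, limit: int = LOG_LIMIT) -> str:
    """Cap a log payload at the limit, marking that it was cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + "...[truncated]"
```
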
## Integration with retrieval

There are two distinct GigaChat usages:

1. Chat/completion path for agent reasoning and generation
2. Embedding path for RAG indexing and retrieval

The embedding adapter is `GigaChatEmbedder`, used by:

- `app/modules/rag/services/rag_service.py`

## Notable limitations

- Single-provider coupling: chat and embeddings both depend on GigaChat-specific endpoints.
- No model routing by scenario.
- No tool/function calling.
- No centralized prompt token budgeting.
- No explicit retry for `GigaChatError`.
- No streaming completions.
- No structured response mode beyond prompt conventions and downstream parsing.