diff --git a/USAGE.md b/USAGE.md
index c8e7b096..d7149aa5 100644
--- a/USAGE.md
+++ b/USAGE.md
@@ -306,7 +306,9 @@ Reasoning variants (`qwen-qwq-*`, `qwq-*`, `*-thinking`) automatically strip `te
 The OpenAI-compatible backend also serves as the gateway for **OpenRouter**, **Ollama**, and any other service that speaks the OpenAI `/v1/chat/completions` wire format — just point `OPENAI_BASE_URL` at the service.
 
-**Model-name prefix routing:** If a model name starts with `openai/`, `gpt-`, `qwen/`, or `qwen-`, the provider is selected by the prefix regardless of which env vars are set. This prevents accidental misrouting to Anthropic when multiple credentials exist in the environment.
+**Model-name prefix routing:** If a model name starts with `openai/`, `gpt-`, `qwen/`, `qwen-`, `kimi/`, or `kimi-`, the provider is selected by the prefix regardless of which env vars are set. This prevents accidental misrouting to Anthropic when multiple credentials exist in the environment. Kimi and Qwen prefixes route to DashScope's OpenAI-compatible mode; the `openai/` prefix is stripped before the Chat Completions request is sent.
+
+**Token and cost accounting:** Anthropic usage fields are recorded directly, including cache-creation and cache-read tokens. OpenAI-compatible responses normalize `prompt_tokens` / `completion_tokens` into the same internal usage fields; when a provider reports `prompt_tokens_details.cached_tokens`, cached prompt tokens are counted as cache reads and subtracted from uncached input tokens so totals are not double-counted. `/status`, `/cost`, and `/usage` expose cumulative token totals and an estimated cost; models without known prices use the built-in estimated-default pricing marker.
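The cached-token subtraction described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: `normalize_openai_usage` is a hypothetical helper, and the `Usage` struct here only mirrors the field names the docs mention.

```rust
// Shared usage shape (field names follow the Anthropic-style fields named in
// the docs; the real struct lives in the provider crate).
#[derive(Debug, Default, PartialEq)]
struct Usage {
    input_tokens: u64,               // uncached prompt tokens
    output_tokens: u64,
    cache_creation_input_tokens: u64,
    cache_read_input_tokens: u64,
}

impl Usage {
    fn total_tokens(&self) -> u64 {
        self.input_tokens
            + self.output_tokens
            + self.cache_creation_input_tokens
            + self.cache_read_input_tokens
    }
}

/// Hypothetical normalizer: maps OpenAI-style `prompt_tokens` /
/// `completion_tokens` (plus optional `prompt_tokens_details.cached_tokens`)
/// into the shared shape without double-counting cached tokens.
fn normalize_openai_usage(prompt: u64, completion: u64, cached: Option<u64>) -> Usage {
    let cached = cached.unwrap_or(0).min(prompt); // cached tokens are a subset of the prompt
    Usage {
        input_tokens: prompt - cached,            // record only the uncached portion
        output_tokens: completion,
        cache_creation_input_tokens: 0,           // Chat Completions reports no cache-write figure
        cache_read_input_tokens: cached,
    }
}

fn main() {
    let u = normalize_openai_usage(1200, 300, Some(1000));
    assert_eq!(u.input_tokens, 200);
    assert_eq!(u.cache_read_input_tokens, 1000);
    assert_eq!(u.total_tokens(), 1500); // 1200 prompt + 300 completion, no double-count
    println!("{u:?}");
}
```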
 ### Tested models and aliases
@@ -320,6 +322,9 @@ These are the models registered in the built-in alias table with known token lim
 | `grok` / `grok-3` | `grok-3` | xAI | 64 000 | 131 072 |
 | `grok-mini` / `grok-3-mini` | `grok-3-mini` | xAI | 64 000 | 131 072 |
 | `grok-2` | `grok-2` | xAI | — | — |
+| `kimi` | `kimi-k2.5` | DashScope | 16 384 | 256 000 |
+| `gpt-4.1` / `gpt-4.1-mini` / `gpt-4.1-nano` | same | OpenAI-compatible | 32 768 | 1 047 576 |
+| `gpt-5.4` / `gpt-5.4-mini` / `gpt-5.4-nano` | same | OpenAI-compatible | 128 000 | 1 000 000 / 400 000 |
 
 Any model name that does not match an alias is passed through verbatim. This is how you use OpenRouter model slugs (`openai/gpt-4.1-mini`), Ollama tags (`llama3.2`), or full Anthropic model IDs (`claude-sonnet-4-20250514`).
@@ -343,8 +348,10 @@ Local project settings override user-level settings. Aliases resolve through the
 1. If the resolved model name starts with `claude` → Anthropic.
 2. If it starts with `grok` → xAI.
-3. Otherwise, `claw` checks which credential is set: `ANTHROPIC_API_KEY`/`ANTHROPIC_AUTH_TOKEN` first, then `OPENAI_API_KEY`, then `XAI_API_KEY`.
-4. If nothing matches, it defaults to Anthropic.
+3. If it starts with `openai/` or `gpt-` → OpenAI-compatible.
+4. If it starts with `qwen/`, `qwen-`, `kimi/`, or `kimi-` → DashScope compatible mode.
+5. If `OPENAI_BASE_URL` is set with `OPENAI_API_KEY`, route unprefixed custom/local model names to OpenAI-compatible.
+6. Otherwise, `claw` checks which credential is set: Anthropic first, then OpenAI, then xAI; if nothing matches, it defaults to Anthropic.
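The precedence list above can be condensed into a small routing function. This is an illustrative sketch only — the real resolver also consults settings files and the alias table, and `route` is a hypothetical name:

```rust
// Hypothetical condensation of the routing precedence described in the docs.
#[derive(Debug, PartialEq)]
enum Provider { Anthropic, Xai, OpenAiCompat, DashScope }

/// Params after `model` stand in for env state: is OPENAI_BASE_URL set, and
/// which of the Anthropic / OpenAI / xAI credentials are present.
fn route(model: &str, base_url: bool, anthropic: bool, openai: bool, xai: bool) -> Provider {
    if model.starts_with("claude") { return Provider::Anthropic; }          // step 1
    if model.starts_with("grok") { return Provider::Xai; }                  // step 2
    if model.starts_with("openai/") || model.starts_with("gpt-") {          // step 3
        return Provider::OpenAiCompat;
    }
    if ["qwen/", "qwen-", "kimi/", "kimi-"].iter().any(|p| model.starts_with(*p)) {
        return Provider::DashScope;                                         // step 4
    }
    if base_url && openai { return Provider::OpenAiCompat; }                // step 5
    if anthropic { Provider::Anthropic }                                    // step 6: credential order,
    else if openai { Provider::OpenAiCompat }                               // Anthropic default at the end
    else if xai { Provider::Xai }
    else { Provider::Anthropic }
}

fn main() {
    // Prefix wins even when every credential is set.
    assert_eq!(route("kimi-k2.5", false, true, true, true), Provider::DashScope);
    assert_eq!(route("gpt-5.4-mini", false, true, false, false), Provider::OpenAiCompat);
    // Unprefixed local tag with OPENAI_BASE_URL + OPENAI_API_KEY.
    assert_eq!(route("llama3.2", true, false, true, false), Provider::OpenAiCompat);
    // Credential fallback, then the Anthropic default.
    assert_eq!(route("mystery", false, false, false, true), Provider::Xai);
    assert_eq!(route("mystery", false, false, false, false), Provider::Anthropic);
}
```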
 ## FAQ
diff --git a/docs/MODEL_COMPATIBILITY.md b/docs/MODEL_COMPATIBILITY.md
index fef8b98d..90b101dd 100644
--- a/docs/MODEL_COMPATIBILITY.md
+++ b/docs/MODEL_COMPATIBILITY.md
@@ -9,7 +9,8 @@ This document describes model-specific handling in the OpenAI-compatible provide
   - [Kimi Models (is_error Exclusion)](#kimi-models-is_error-exclusion)
   - [Reasoning Models (Tuning Parameter Stripping)](#reasoning-models-tuning-parameter-stripping)
   - [GPT-5 (max_completion_tokens)](#gpt-5-max_completion_tokens)
-  - [Qwen Models (DashScope Routing)](#qwen-models-dashscope-routing)
+  - [Qwen and Kimi Models (DashScope Routing)](#qwen-and-kimi-models-dashscope-routing)
+  - [OpenAI-Compatible Usage Accounting](#openai-compatible-usage-accounting)
   - [Implementation Details](#implementation-details)
   - [Adding New Models](#adding-new-models)
   - [Testing](#testing)
@@ -46,7 +47,7 @@ The `openai_compat.rs` provider translates Claude Code's internal message format
 fn model_rejects_is_error_field(model: &str) -> bool {
     let lowered = model.to_ascii_lowercase();
     let canonical = lowered.rsplit('/').next().unwrap_or(lowered.as_str());
-    canonical.starts_with("kimi-")
+    canonical.starts_with("kimi")
 }
 ```
@@ -120,13 +121,13 @@ let max_tokens_key = if wire_model.starts_with("gpt-5") {
 
 ---
 
-### Qwen Models (DashScope Routing)
+### Qwen and Kimi Models (DashScope Routing)
 
-**Affected models:** All models with `qwen` prefix
+**Affected models:** All models with `qwen` or `kimi` prefixes, including `qwen/`, `qwen-`, `kimi/`, and `kimi-` forms.
 
-**Behavior:** Routed to DashScope (`https://dashscope.aliyuncs.com/compatible-mode/v1`) rather than default providers.
+**Behavior:** Routed to DashScope (`https://dashscope.aliyuncs.com/compatible-mode/v1`) rather than ambient-credential fallback providers. Known routing prefixes are stripped before sending the wire model.
 
-**Rationale:** Qwen models are hosted by Alibaba Cloud's DashScope service, not OpenAI or Anthropic.
+**Rationale:** Qwen and Kimi compatible-mode models are hosted through Alibaba Cloud's DashScope service, not OpenAI or Anthropic.
 
 **Configuration:**
 
 ```rust
@@ -137,6 +138,17 @@ pub const DEFAULT_DASHSCOPE_BASE_URL: &str = "https://dashscope.aliyuncs.com/com
 
 **Note:** Some Qwen models are also reasoning models (see [Reasoning Models](#reasoning-models-tuning-parameter-stripping) above) and receive both treatments.
+
+---
+
+### OpenAI-Compatible Usage Accounting
+
+**Affected providers:** OpenAI-compatible, xAI, DashScope, and OpenRouter/Ollama/local gateways that return Chat Completions `usage`.
+
+**Behavior:** `prompt_tokens` and `completion_tokens` are normalized into the shared `Usage` shape used by Anthropic. If a provider includes `prompt_tokens_details.cached_tokens`, cached prompt tokens are recorded as `cache_read_input_tokens` and subtracted from uncached `input_tokens`, preserving an accurate `total_tokens()` without double-counting. Streaming OpenAI responses request `stream_options.include_usage` where supported so final usage chunks feed the same accounting path.
+
+**Status/cost surfaces:** `/status`, `/cost`, and JSON output expose cumulative input/output/cache/total token fields plus an estimated cost marker. Unknown third-party model pricing uses the `estimated-default` pricing label rather than pretending provider-specific prices are known.
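The prefix stripping mentioned in the Behavior note can be sketched as below. This is an assumption-laden illustration (`strip_routing_prefix` is a hypothetical name): only the slash-separated routing prefixes are stripped, since dash forms such as `qwen-max` are themselves valid wire model names.

```rust
// Hypothetical sketch: remove a known slash-style routing prefix before the
// model name goes on the wire; dash-prefixed names pass through unchanged.
fn strip_routing_prefix(model: &str) -> &str {
    for prefix in ["openai/", "qwen/", "kimi/"] {
        if let Some(rest) = model.strip_prefix(prefix) {
            return rest;
        }
    }
    model
}

fn main() {
    assert_eq!(strip_routing_prefix("openai/gpt-4.1-mini"), "gpt-4.1-mini");
    assert_eq!(strip_routing_prefix("kimi/kimi-k2.5"), "kimi-k2.5");
    assert_eq!(strip_routing_prefix("qwen-max"), "qwen-max"); // dash form kept as-is
}
```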
+
 ## Implementation Details
 
 ### File Location
@@ -153,6 +165,7 @@ rust/crates/api/src/providers/openai_compat.rs
 | `is_reasoning_model()` | Detects reasoning models that need tuning param stripping |
 | `translate_message()` | Converts internal messages to OpenAI format (applies `is_error` logic) |
 | `build_chat_completion_request()` | Constructs full request payload (applies all model-specific logic) |
+| `OpenAiUsage::normalized()` | Maps Chat Completions usage and cached prompt token details into shared token/cost accounting fields |
 
 ### Provider Prefix Handling
diff --git a/rust/crates/rusty-claude-cli/src/main.rs b/rust/crates/rusty-claude-cli/src/main.rs
index b5be5a32..91af6665 100644
--- a/rust/crates/rusty-claude-cli/src/main.rs
+++ b/rust/crates/rusty-claude-cli/src/main.rs
@@ -4067,6 +4067,8 @@ fn run_resume_command(
             "cache_creation_input_tokens": usage.cache_creation_input_tokens,
             "cache_read_input_tokens": usage.cache_read_input_tokens,
             "total_tokens": usage.total_tokens(),
+            "estimated_cost_usd": format_usd(usage.estimate_cost_usd().total_cost_usd()),
+            "pricing": "estimated-default",
         })),
     })
 }
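The `estimated_cost_usd` / `pricing` fields added in the main.rs hunk can be sketched standalone. `format_usd` and `estimate_cost_usd` are the names the diff uses, but the bodies and rates below are placeholders, not the project's real price table:

```rust
// Hedged sketch of the cost fields surfaced in the resume-command JSON.
fn format_usd(v: f64) -> String {
    format!("${v:.4}") // four decimal places, matching small per-session costs
}

/// Placeholder estimator: assumed default per-million-token rates, used only
/// to illustrate why the output carries the "estimated-default" label.
fn estimate_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    const IN_PER_MTOK: f64 = 1.0;  // illustrative rate, not a real price
    const OUT_PER_MTOK: f64 = 4.0; // illustrative rate, not a real price
    input_tokens as f64 / 1e6 * IN_PER_MTOK + output_tokens as f64 / 1e6 * OUT_PER_MTOK
}

fn main() {
    let cost = estimate_cost_usd(200_000, 50_000); // 0.2 + 0.2 under the placeholder rates
    assert_eq!(format_usd(cost), "$0.4000");
    println!(
        "{{\"estimated_cost_usd\": \"{}\", \"pricing\": \"estimated-default\"}}",
        format_usd(cost)
    );
}
```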