How LLM Training Works
Before diving into Kimi-specific details, here's the mental model that the rest of this page builds on. Training a large language model happens in two big phases: pre-training (Phase 1) and post-training (Phase 2, which combines SFT and RL), plus an architecture choice that decides what the model is.
Pre-training is like having someone read the entire internet and library — they absorb enormous knowledge but have no idea how to act on it usefully.
SFT is an apprenticeship: they watch thousands of worked examples of the job done well, and imitate.
RL is on-the-job practice with a coach: they attempt real tasks, get scored, and gradually learn which strategies actually work — including ones no human demonstrated.
K2.7-Code reuses an existing pre-trained base (from Kimi K2) and an existing architecture. Almost all of its improvements come from better Phase 2 — more coding-focused SFT and RL. That's why this page spends most of its time on post-training: that's where the model actually changed.
Release Context
Kimi K2.7-Code is the latest open-weight coding model from Moonshot AI. It is explicitly described as "a coding-focused agentic model built on Kimi K2.6" and ships under a Modified MIT license with weights on Hugging Face.
Headline claims over its predecessor:
Architecture
Moonshot's Hugging Face deployment guide states that K2.7-Code has the same architecture as K2.5 / K2.6 and that deployment methods can be reused directly. The architecture traces back to the original Kimi K2 design.
The full parameter count across all experts. It describes stored capability, not the amount of compute used for each token.
Only a sparse slice of the model runs for each token. This is the number that more closely tracks inference cost.
The model keeps many specialized expert networks available, then routes each token to a small subset of them.
MLA compresses attention state into a latent cache, making long-context inference more memory efficient.
Each layer gives the token another pass through attention, routing, and feed-forward computation before output.
Dense layers run the same feed-forward path for every token. Kimi keeps this mostly sparse, with one dense layer.
The router chooses 8 experts from 384 for each token, plus a shared expert that provides common capacity.
This is the internal feature width used by attention to represent token relationships across the context.
Each selected expert uses this hidden size inside its feed-forward block, shaping how much specialized work it can do.
Heads let the model attend to different relationships at once. Kimi cuts heads versus DeepSeek-V3 to improve efficiency.
The tokenizer maps text and code into this set of pieces. A larger vocabulary can represent code symbols and languages more directly.
The model can attend across up to 256K tokens, useful for repository-scale coding tasks and long agent traces.
SwiGLU is the nonlinear gating function inside feed-forward layers, helping the model control which features pass through.
MoonViT converts image or video inputs into representations the language model can process alongside text.
Moonshot derived this shape from empirical scaling-law analysis. Compared with DeepSeek-V3, Kimi K2 raises expert sparsity from 32 to 48 (384 total experts, 8 routed per token) while cutting attention heads from 128 to 64 to preserve long-context inference efficiency.
One token through the Kimi stack
flowchart LR
A["Input<br/>tokens"] --> B["Token<br/>embeddings"]
B --> D["Repeated layer:<br/>MLA context mix"]
D --> E["Repeated layer:<br/>dense or routed MoE"]
E --> F["Residual +<br/>normalization"]
F --> G["Final norm +<br/>LM head"]
G --> H["Next-token<br/>logits"]
At inference time, each transformer layer mixes context with MLA, then applies a feed-forward path. Sparse MoE layers route the token through a small active expert set instead of running all 384 experts.
A normal ("dense") model runs every parameter for every word it processes — expensive. An MoE model instead contains many small sub-networks called experts (here, 384 of them). For each token, a tiny "router" picks just the 8 most relevant experts to actually run.
That's why Kimi has 1 trillion total parameters but only activates 32 billion per token: it has the knowledge capacity of a huge model but the running cost of a much smaller one. The ratio of total to active experts is called sparsity — higher sparsity gave Moonshot better results per unit of compute.
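The two ratios in that paragraph are simple arithmetic. A minimal sketch using only the figures quoted on this page (the function names are illustrative, not taken from any Moonshot code):

```python
# Figures quoted in the text above (rounded, as the page states them).
TOTAL_PARAMS = 1_000_000_000_000   # ~1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # ~32 billion activated per token
ROUTED_EXPERTS = 384               # experts the router can choose from
EXPERTS_PER_TOKEN = 8              # routed experts selected per token


def sparsity(total_experts: int, active_experts: int) -> float:
    """Ratio of total routed experts to routed experts active per token."""
    return total_experts / active_experts


def active_fraction(active_params: int, total_params: int) -> float:
    """Share of all parameters that actually run for one token."""
    return active_params / total_params


if __name__ == "__main__":
    print(f"sparsity: {sparsity(ROUTED_EXPERTS, EXPERTS_PER_TOKEN):.0f}")
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

With these inputs the sparsity is 48 and roughly 3.2% of the parameters run for any single token.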
How one token wakes only part of the model
flowchart LR
T["Layer state"] --> R["Router scores<br/>384 routed experts"]
R --> S["Select 8<br/>routed experts"]
T --> X["Shared expert<br/>always active"]
S --> M["Combine expert outputs<br/>(router-weighted)"]
X --> M
M --> O["Layer output"]
The router selects 8 routed experts for each token, while the shared expert provides common capacity. The layer combines those outputs instead of waking every expert.
Think of MoE like a hospital. A dense model is one doctor who must personally know everything. An MoE model is a hospital with 384 specialists, where a triage nurse (the router) sends each patient to the 8 specialists who can actually help. You keep all the expertise on staff, but you don't wake up all 384 for every patient.
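The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not Moonshot's implementation: the experts are scalar functions, and normalizing the selected scores with a softmax is an assumption made for this sketch (real routers differ in how they score and normalize).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, k=8):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected scores only.
    """
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_scores, experts, shared_expert, k=8):
    """Combine the shared expert with the k routed experts chosen for x."""
    out = shared_expert(x)                      # always active
    for idx, w in route_token(router_scores, k):
        out += w * experts[idx](x)              # router-weighted contribution
    return out


if __name__ == "__main__":
    random.seed(0)
    num_experts = 384
    # Toy scalar "experts": each just scales its input by a fixed factor.
    experts = [lambda x, s=i / num_experts: s * x for i in range(num_experts)]
    shared = lambda x: 0.5 * x
    scores = [random.gauss(0.0, 1.0) for _ in range(num_experts)]
    chosen = route_token(scores, k=8)
    print("experts used:", sorted(i for i, _ in chosen))
    print("layer output:", moe_layer(2.0, scores, experts, shared))
```

Only 8 of the 384 expert functions are ever called for a given token, which is where the compute saving comes from.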
Attention is how a model decides which earlier words matter when producing the next one. The catch: it has to store a running memory (the "KV cache") for every token, which balloons on long inputs. Multi-head Latent Attention (MLA) compresses that memory so the model can handle very long contexts (256K tokens here) without running out of GPU RAM — essential for coding agents that read whole codebases.
Query, key, and value in plain terms
flowchart LR
P["Context<br/>tokens"] --> K["Keys<br/>what is here"]
P --> V["Values<br/>what to carry forward"]
N["Current<br/>position"] --> Q["Query<br/>what to look for"]
Q --> A["Similarity<br/>scores"]
K --> A
A --> S["Softmax<br/>normalization"]
S --> W["Attention<br/>weights"]
W --> C["Weighted<br/>value mix"]
V --> C
C --> O["Context-aware<br/>state"]
Attention is lookup by relevance. The current position forms a query, compares it with keys from the available context, normalizes those scores, and mixes the corresponding values into the next layer.
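The query / key / value flow described above corresponds to standard single-head scaled dot-product attention. A minimal pure-Python sketch for one query position (the `attend` helper and the toy vectors are illustrative only; the 1/sqrt(d) scaling is the usual convention, not something this page specifies):

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query position.

    Returns the attention weights and the weighted mix of the values.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Similarity score between the query and every key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
    weights, mixed = attend([2.0, 0.0], keys, values)
    print("weights:", [round(w, 3) for w in weights])
    print("mixed:  ", [round(m, 3) for m in mixed])
```

Keys that align with the query receive larger weights, so their values dominate the mix.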
Why latent attention helps long context
flowchart LR
A["Standard cache"] --> B["Materialize K/V<br/>per head"]
B --> C["Cache grows with<br/>heads x context"]
C --> D["GPU memory<br/>pressure"]
E["MLA"] --> F["Store compressed<br/>latent cache"]
F --> G["Project needed<br/>K/V features"]
G --> H["Long context<br/>lower cache cost"]
MLA is not just an accuracy trick; it is a memory-layout decision. By caching a compressed latent representation, Kimi can keep long contexts practical for agentic coding runs.
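A back-of-the-envelope comparison shows why the cache layout matters. The 64 heads and the 262,144-token context come from this page; the layer count (61), head dimension (128), latent width (512), rotary width (64), and 2-byte values are assumptions chosen for illustration and may not match the real configuration.

```python
def mha_cache_bytes(tokens, layers, heads, head_dim, bytes_per_value=2):
    """Standard cache: per-head K and V stored for every token in every layer."""
    return tokens * layers * 2 * heads * head_dim * bytes_per_value


def latent_cache_bytes(tokens, layers, latent_dim, rope_dim, bytes_per_value=2):
    """Latent cache: one compressed vector plus a shared rotary key per token per layer."""
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_value


if __name__ == "__main__":
    TOKENS = 262_144      # 256K context (from the text)
    HEADS = 64            # attention heads (from the text)
    LAYERS = 61           # assumption for illustration
    HEAD_DIM = 128        # assumption for illustration
    LATENT_DIM = 512      # assumption for illustration
    ROPE_DIM = 64         # assumption for illustration

    standard = mha_cache_bytes(TOKENS, LAYERS, HEADS, HEAD_DIM)
    latent = latent_cache_bytes(TOKENS, LAYERS, LATENT_DIM, ROPE_DIM)
    gib = 1024 ** 3
    print(f"standard cache: {standard / gib:.1f} GiB")
    print(f"latent cache:   {latent / gib:.1f} GiB")
    print(f"reduction:      {standard / latent:.1f}x")
```

The estimate ignores the extra compute needed to project K/V features back out of the latent, so it describes memory pressure only.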
Pre-training: The K2 Foundation
The base model underlying K2.7-Code was pre-trained as part of Kimi K2 on 15.5 trillion high-quality tokens. Public details come from the Kimi K2 technical report.
MuonClip optimizer
Training a model means nudging billions of numbers (weights) a little bit at a time so its predictions improve. The optimizer is the algorithm that decides how big each nudge should be. For years the default was AdamW; Moonshot uses Muon, which learns more from each token (better "token efficiency") — valuable because high-quality training text is finite.
Moonshot trains with the token-efficient Muon optimizer, but scaling it up caused attention-logit explosions (logits exceeding 1,000) that produced loss spikes and divergence.
"Loss" measures how wrong the model is; training tries to push it steadily down. A loss spike is a sudden jump where the model briefly goes haywire — at trillion-parameter scale a bad spike can wreck weeks of compute. Keeping the curve smooth is a major engineering goal.
Their fix is MuonClip, which adds a QK-Clip step after each optimizer update:
- After the standard Muon Newton-Schulz update, compute the maximum attention logit per head.
- If it exceeds a threshold τ (set to 100), rescale the query/key projection weights for that head.
- For MLA, only the unshared head components (qC, kC, qR) are clipped; the shared rotary component kR is left alone.
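The clipping rule in the steps above can be sketched for a single plain attention head. This is a simplified illustration, not the MLA-specific variant: it ignores the split into qC / kC / qR / kR, and dividing the correction evenly between the query and key weights (a factor of sqrt(γ) each) is an assumption of this sketch.

```python
import math


def max_attention_logit(queries, keys):
    """Largest scaled query-key dot product observed for one head."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return max(scale * sum(a * b for a, b in zip(q, k))
               for q in queries for k in keys)


def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """Rescale one head's query/key weights if its max logit exceeds tau.

    Scaling each side by sqrt(gamma) scales every q.k product by gamma,
    which brings the observed maximum back down to tau.
    """
    if max_logit <= tau:
        return w_q, w_k                      # head is healthy: leave it alone
    gamma = tau / max_logit
    root = math.sqrt(gamma)
    clipped_q = [[root * v for v in row] for row in w_q]
    clipped_k = [[root * v for v in row] for row in w_k]
    return clipped_q, clipped_k


if __name__ == "__main__":
    w_q, w_k = [[4.0, 0.0]], [[2.0, 0.0]]
    new_q, new_k = qk_clip(w_q, w_k, max_logit=400.0, tau=100.0)
    print(new_q, new_k)   # both scaled by sqrt(100 / 400) = 0.5
```

Because the check is per head, heads whose logits stay under τ keep their weights untouched.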
MuonClip as an attention guardrail
flowchart LR
G["Gradients"] --> M["Muon<br/>update"]
M --> L["Check per-head<br/>max attention logit"]
L --> C{"Above threshold?"}
C -->|No| N["Keep<br/>update"]
C -->|Yes| Q["Rescale that head's<br/>Q/K weights"]
Q --> N
N --> S["Next<br/>step"]
MuonClip is a guardrail around attention dynamics. It lets Muon keep its token-efficiency benefits while preventing query-key dot products from exploding during trillion-parameter training.
With MuonClip, Kimi K2 completed its 15.5 T-token pre-training with zero loss spikes and no manual τ adjustments.
QK-Clip is like a governor on an engine. Muon is a powerful but twitchy engine that occasionally over-revs (exploding logits). Rather than swap the engine, Moonshot added a limiter that only kicks in on the specific cylinders that are red-lining, leaving everything else at full power.
Training recipe
Learning rate (LR) = how big each weight update is. Warm-up starts it small to avoid early instability; decay shrinks it near the end so the model settles. Batch size = how much text the model reads before each update (67 M tokens here is enormous). YaRN is a trick to stretch a model trained on short text so it can handle much longer inputs.
- Optimizer: MuonClip with weight decay 0.1 and consistent RMS matching.
- LR schedule: WSD (Warmup-Stable-Decay).
- First 10 T tokens: constant LR 2e-4 after a 500-step warm-up.
- Next 5.5 T tokens: cosine decay from 2e-4 to 2e-5.
- Global batch size: 67 M tokens throughout.
- Annealing phase: LR decayed 2e-5 → 7e-6 on 400 B tokens at 4 k length + 60 B tokens at 32 k length.
- Long context: YaRN used to extend to 128 k (base report); K2.7-Code is listed at 256 k context.
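The schedule in the list above can be written as a function of tokens seen. The numbers come from the recipe; three things are assumptions made for this sketch: the warm-up is a linear ramp from zero, the annealing decay is linear, and the annealing tokens are placed after the 15.5 T mark.

```python
import math

BATCH_TOKENS = 67e6          # global batch size in tokens
WARMUP_STEPS = 500
PEAK_LR, DECAY_LR, FINAL_LR = 2e-4, 2e-5, 7e-6
STABLE_END = 10e12           # constant phase ends at 10 T tokens
DECAY_END = 15.5e12          # cosine decay ends at 15.5 T tokens
ANNEAL_TOKENS = 460e9        # 400 B at 4k length + 60 B at 32k length


def learning_rate(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    warmup_tokens = WARMUP_STEPS * BATCH_TOKENS
    if tokens_seen < warmup_tokens:
        # Assumption: linear ramp from 0 up to the peak.
        return PEAK_LR * tokens_seen / warmup_tokens
    if tokens_seen <= STABLE_END:
        return PEAK_LR
    if tokens_seen <= DECAY_END:
        progress = (tokens_seen - STABLE_END) / (DECAY_END - STABLE_END)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return DECAY_LR + (PEAK_LR - DECAY_LR) * cosine
    # Assumption: linear decay across the annealing phase.
    progress = min((tokens_seen - DECAY_END) / ANNEAL_TOKENS, 1.0)
    return DECAY_LR + (FINAL_LR - DECAY_LR) * progress


if __name__ == "__main__":
    for t in (0.0, 1e12, 10e12, 12.75e12, 15.5e12, 15.96e12):
        print(f"{t / 1e12:6.2f} T tokens -> lr {learning_rate(t):.2e}")
```

At 67 M tokens per step, the 500-step warm-up covers about 33.5 B tokens, a tiny fraction of the run.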
Data & token efficiency
There's a finite amount of high-quality human-written text in the world, and the biggest models have nearly exhausted it. So labs increasingly squeeze more value out of what they have. Rephrasing rewrites existing good text in new styles, multiplying useful training signal — but carefully, because rewriting can introduce errors or repetition that makes the model worse.
Because high-quality human text is scarce, Moonshot emphasizes token utility rather than raw token count. Two rephrasing strategies are highlighted:
- Knowledge rephrasing: chunk-wise autoregressive rewriting with style/perspective diversity and fidelity verification. Each corpus is rephrased at most twice.
- Math rephrasing: high-quality math documents are rewritten into a "learning-note" style, and materials from other languages are translated into English.
The full pre-training corpus spans Web Text, Code, Mathematics, and Knowledge.
Infrastructure
K2 was trained on NVIDIA H800 GPUs with 16-way Pipeline Parallelism, 16-way Expert Parallelism, and ZeRO-1 Data Parallelism. To fit activations, Moonshot uses selective recomputation, FP8 storage for insensitive activations, and CPU offloading of remaining activations, overlapping transfers with compute.
Post-training Pipeline
K2.7-Code's gains come from post-training, not from re-pre-training. The K2 technical report describes the multi-stage pipeline that K2.7-Code inherits and specializes.
A freshly pre-trained model is a brilliant autocomplete — it can continue text, but it won't reliably answer your question, use a tool, or refuse a harmful request. Post-training (SFT + RL) is what turns that raw capability into a helpful, safe, tool-using agent. Since K2.7-Code starts from an already-pre-trained base, this is the part Moonshot actually reworked.
Supervised Fine-Tuning (SFT)
Feed the model a large set of (instruction, ideal response) pairs and train it to imitate the ideal responses. The hard part isn't the training — it's creating millions of high-quality examples, especially for multi-step tool use, which is exactly what the pipeline below does.
- Moonshot uses the standard Muon optimizer for post-training, since a Muon-pre-trained checkpoint works best with Muon fine-tuning.
- The instruction dataset maximizes prompt diversity and response quality, using human annotation, prompt engineering, and LLM/human judges.
- For agentic tasks, they built a large-scale synthetic data pipeline that generates tool-use demonstrations.
Agentic data synthesis
The pipeline has three stages:
- Tool spec generation: 3,000+ real MCP tools from GitHub plus 20,000+ LLM-evolved synthetic tools across domains.
- Agent and task generation: diverse agents are paired with tasks and explicit rubrics.
- Trajectory generation: a user simulator + tool-execution simulator generate multi-turn interactions; an LLM judge filters trajectories against rubrics.
Real execution sandboxes (especially for coding) complement simulation, providing ground-truth signals such as test-suite pass rates.
Reinforcement Learning
In RL the model attempts a task, a reward scores the attempt, and training pushes it toward higher-scoring behavior. The crucial question is: who decides the reward?
Verifiable rewards come from objective checks: does the code pass its unit tests? Is the math answer correct? These are the gold standard because they are hard to fake. Self-critic / rubric rewards are for fuzzy tasks (e.g. "was this essay helpful?") where the model judges its own output against guidelines. Moonshot's trick is to keep grounding the fuzzy judge using the objective signals, so it stays honest.
K2 uses a unified, Gym-like RL framework. It supports tasks with verifiable rewards and tasks with subjective preferences via a self-critic rubric reward.
How coding behavior gets rewarded
flowchart LR
T["Coding /<br/>agentic task"] --> R["Multiple<br/>rollouts"]
R --> S["Sandbox or<br/>simulator"]
S --> V["Tests and<br/>verifiers"]
S --> G["Self-critic<br/>rubrics"]
V --> W["Reward<br/>signal"]
G --> W
W --> U["Regularized<br/>policy update"]
U --> R
The post-training loop teaches behavior, not just facts. Coding tasks can be scored by tests and sandboxes, while open-ended agent behavior needs rubrics or reward models layered on top.
Verifiable-reward RL is like training a coder with a test suite: green check or red X, no arguing. Self-critic RL is like teaching writing, where there's no unit test — so you train an internal editor with a rubric. Moonshot periodically sends that editor back to "objective school" (the verifiable tasks) so its taste doesn't drift.
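The two reward sources can be sketched as follows. This is a hypothetical illustration: the task dictionary layout, the use of test pass rate as the verifiable score, and the plain average over rubric scores are all assumptions, not details from the K2 report.

```python
from statistics import mean


def verifiable_reward(test_results):
    """Fraction of unit tests that passed (1.0 means fully green)."""
    if not test_results:
        return 0.0
    return sum(1 for passed in test_results if passed) / len(test_results)


def rubric_reward(rubric_scores):
    """Average of per-rubric scores in [0, 1] assigned by the critic."""
    return mean(rubric_scores.values()) if rubric_scores else 0.0


def reward(task):
    """Prefer the objective signal whenever the task carries test results."""
    if task.get("test_results") is not None:
        return verifiable_reward(task["test_results"])
    return rubric_reward(task.get("rubric_scores", {}))


if __name__ == "__main__":
    coding_task = {"test_results": [True, True, False, True]}
    essay_task = {"rubric_scores": {"helpful": 1.0, "clear": 0.5}}
    print("coding reward:", reward(coding_task))
    print("essay reward: ", reward(essay_task))
```

The sketch leaves out the step where the critic itself is refined against verifiable rollouts; it only shows how a single rollout would be scored.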
Verifiable-reward domains
- Math / STEM / logic: curated QA with moderate difficulty selected by SFT pass@k.
- Instruction following: deterministic code-interpreter checks plus LLM-as-judge and hack-check layers.
- Faithfulness: a sentence-level faithfulness judge model detects unsupported factual claims.
- Coding & software engineering: competition problems with unit tests; GitHub issues/PRs executed in Kubernetes sandboxes.
- Safety: adversarial prompt evolution with attack/target/judge models and rubric-based labels.
Self-critique rubric reward
For tasks without automatic verifiers, the model evaluates its own outputs with:
- Core rubrics representing Kimi's fundamental assistant values.
- Prescriptive rubrics to eliminate reward hacking.
- Human-annotated rubrics for specific instructions.
During RL, the critic is refined using verifiable-reward rollouts, transferring objective signals into the critic's subjective judgments.
RL algorithm tweaks
- Budget control: per-sample maximum token budgets; responses that exceed them are truncated and penalized.
- PTX loss: auxiliary supervised loss on high-quality hand-curated samples to prevent forgetting.
- Temperature decay: high temperature early for exploration, decaying later for stable outputs.
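The three tweaks above are easy to express as small helpers. All constants here (penalty size, PTX coefficient, temperature range, linear decay shape) are placeholder assumptions; the source does not give the actual values.

```python
def apply_budget(response_tokens, max_tokens, reward, penalty=1.0):
    """Truncate an over-budget response and subtract a penalty from its reward."""
    if len(response_tokens) <= max_tokens:
        return response_tokens, reward
    return response_tokens[:max_tokens], reward - penalty


def total_loss(rl_loss, ptx_loss, ptx_coef=0.1):
    """RL objective plus an auxiliary supervised (PTX) term on curated samples."""
    return rl_loss + ptx_coef * ptx_loss


def temperature(step, total_steps, start=1.0, end=0.6):
    """Sampling temperature that decays linearly from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


if __name__ == "__main__":
    tokens, r = apply_budget(list(range(10)), max_tokens=4, reward=1.0)
    print("kept tokens:", tokens, "reward:", r)
    print("loss:", total_loss(rl_loss=2.0, ptx_loss=1.0, ptx_coef=0.5))
    print("temperatures:", [round(temperature(s, 100), 2) for s in (0, 50, 100)])
```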
The K2.6 Stepping Stone
The Hugging Face model card for Kimi-K2.6 cites the K2.5 technical report (arXiv:2602.02276) as its academic source. K2.6 is therefore best understood as an incremental post-training release on the K2.5 architecture, not a new model family.
Kimi K2.6 was released on April 20, 2026 (preview began April 13). It is a native multimodal agentic model built on the same 1 T / 32 B MoE backbone as K2.5, with four public variants: Instant, Thinking, Agent, and Agent Swarm.
What K2.6 added
- Longer autonomous runs: up to 12 hours, 4,000 coordinated steps, and 300 parallel sub-agents in a swarm.
- 256K context window (262,144 tokens) as the default for agentic variants.
- Native INT4 quantization via quantization-aware training (QAT), cutting GPU memory by ~50% and making inference ~2× faster than FP16.
- More post-training compute on long-horizon stability, instruction following, and swarm coordination.
- Native video input alongside images and text.
K2.5 / K2.6 architecture & training highlights
Because K2.6 reuses the K2.5 architecture, the relevant technical details are in the K2.5 report:
- Joint text-vision pre-training: K2.5 continues from a near-end K2 checkpoint and processes ~15 T additional vision-text tokens. Vision and text tokens are mixed at a moderate, constant ratio from early in training, rather than injected late at high ratios.
- MoonViT-3D: a native-resolution vision encoder initialized from SigLIP-SO-400M, using NaViT patch packing. For video, four consecutive frames are packed as a spatiotemporal volume and temporally pooled 4×, sharing all weights with image encoding.
- Zero-vision SFT: text-only supervised fine-tuning is enough to activate visual tool use because joint pre-training already aligned the modalities. Adding human-designed visual trajectories at this stage actually hurt generalization.
- Joint multimodal RL: outcome-based visual RL improves not only vision benchmarks but also text-only benchmarks such as MMLU-Pro, GPQA-Diamond, and LongBench v2.
Agent Swarm & PARL
K2.5/K2.6 introduces Agent Swarm, trained via Parallel-Agent Reinforcement Learning (PARL):
- A trainable orchestrator dynamically creates and delegates to frozen sub-agents.
- Only the orchestrator is updated by RL; sub-agent trajectories are treated as environmental observations. This avoids credit-assignment ambiguity and training instability.
- The PARL reward has three terms: task-level outcome, an instantiation reward to prevent serial collapse, and a finish reward to prevent spurious parallelism.
- Training optimizes critical steps (the longest parallel branch) rather than total steps, so the model learns to minimize wall-clock latency, not just maximize concurrency.
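The difference between total steps and critical steps is easiest to see with numbers. In this sketch a run is modeled as a list of stages, each with some orchestrator steps followed by sub-agents running in parallel; that stage model and the example figures are assumptions for illustration.

```python
def total_steps(stages):
    """Every step taken by the orchestrator and all sub-agents."""
    return sum(main + sum(subs) for main, subs in stages)


def critical_steps(stages):
    """Orchestrator steps plus only the longest sub-agent branch of each stage.

    Parallel branches overlap in time, so the slowest one sets the latency.
    """
    return sum(main + max(subs, default=0) for main, subs in stages)


if __name__ == "__main__":
    # (orchestrator_steps, [steps taken by each parallel sub-agent])
    run = [
        (1, [5, 3, 8]),   # three sub-agents in parallel
        (1, []),          # orchestrator works alone
        (2, [4, 4]),      # two sub-agents in parallel
    ]
    print("total steps:   ", total_steps(run))      # 28
    print("critical steps:", critical_steps(run))   # 16
```

Splitting the 8-step branch in the first stage would lower the critical-step count, while adding more short sub-agents would not, which is the incentive the last bullet describes.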
Parallel agents as context management
flowchart LR
O["Trainable<br/>orchestrator"] --> P["Create and delegate<br/>to frozen subagents"]
P --> A["Code<br/>search"]
P --> B["Test<br/>repair"]
P --> C["Docs<br/>lookup"]
P --> D["Verification"]
A --> R["Selected<br/>results"]
B --> R
C --> R
D --> R
R --> F["Coordinated<br/>answer"]
Agent Swarm is also context management. Sub-agents keep local histories bounded, and the orchestrator receives selected results instead of every intermediate trace.
Toggle: the token-efficiency technique that foreshadows K2.7-Code
The K2.5 report describes a training heuristic called Toggle that alternates between two RL phases every m iterations:
- Phase 0 (budget-limited): the model must solve problems within a task-dependent token budget, estimated from the ρ-th percentile of correct rollout lengths.
- Phase 1 (standard scaling): the model uses the full token budget to encourage inference-time scaling.
On K2 Thinking, Toggle reduced output tokens by 25–30% with negligible performance loss. This same trade-off — shorter reasoning chains without sacrificing accuracy — is exactly what Moonshot highlights in K2.7-Code ("~30% fewer thinking tokens").
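The two Toggle phases can be sketched as a budget selector. The nearest-rank percentile, the helper names, and the example numbers are assumptions of this sketch; the source only states that the budget comes from the ρ-th percentile of correct rollout lengths and that phases alternate every m iterations.

```python
import math


def percentile(values, rho):
    """Nearest-rank percentile for rho in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(rho * len(ordered) / 100))
    return ordered[rank - 1]


def toggle_phase(iteration, m):
    """0 = budget-limited, 1 = standard scaling; switches every m iterations."""
    return (iteration // m) % 2


def max_tokens_for(iteration, m, correct_rollout_lengths, rho, full_budget):
    """Token budget to enforce for rollouts at this training iteration."""
    if toggle_phase(iteration, m) == 0:
        return min(full_budget, percentile(correct_rollout_lengths, rho))
    return full_budget


if __name__ == "__main__":
    lengths = [1200, 800, 3000, 1500, 2200]   # tokens used by correct rollouts
    for it in range(6):
        budget = max_tokens_for(it, m=2, correct_rollout_lengths=lengths,
                                rho=60, full_budget=8000)
        print(f"iteration {it}: phase {toggle_phase(it, 2)}, budget {budget}")
```

In the budget-limited phase the model only gets as many tokens as most previously correct answers needed, which pushes it toward shorter reasoning without removing the full-budget phase.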
What Makes K2.7-Code Different
Moonshot has not published a dedicated K2.7-Code paper, but the model card and release notes make clear that the model is the result of a coding-specialized continuation of K2.6:
- Same architecture as K2.6 — no new topology, no new pre-training.
- Revised post-training pipeline targeting long-horizon coding, multi-step software engineering, and IDE/terminal agent workflows.
- Reduced thinking-token usage (~30% fewer than K2.6), suggesting additional length/reasoning-budget regularization during RL, likely building on the Toggle-style token-efficiency training from K2.5.
- Multimodal support — text, image, and experimental video input via MoonViT.
- Forced thinking modes: thinking=True and preserve_thinking=True cannot be disabled.
In other words, K2.7-Code is not a from-scratch trained model. It is the K2-family base architecture further aligned for coding agents, using the same MoE backbone, Muon-family optimizers, synthetic agentic data, verifiable-reward RL, Agent Swarm orchestration, and token-efficiency heuristics described in the K2 and K2.5 reports.
Source Documents
The public technical reports Moonshot released for the K2 family. Open them directly:
There is no separate Kimi K2.6 PDF; the K2.6 model card explicitly references the K2.5 report above. For K2.7-Code specifically, the canonical source is the Hugging Face model card and the MoonshotAI/Kimi-K2 GitHub repository.
Bottom Line
Moonshot did not train a new base model named "Kimi 2.7." Kimi K2.7-Code is a coding-focused post-train of Kimi K2.6 on the existing 1 T / 32 B ultra-sparse MoE architecture. Its improvements come from additional coding-centric supervised fine-tuning, synthetic agentic trajectory generation, and reinforcement learning with verifiable rewards (unit tests, sandboxes) and self-critic rubrics, with explicit regularization to cut reasoning-token overhead.
Glossary of Key Terms
Quick reference for the jargon used above, in plain language.
- Token: A chunk of text (roughly ¾ of a word) that the model reads and generates one at a time. Training and context length are both measured in tokens.
- Parameters (weights): The billions of internal numbers the model adjusts during training. More parameters = more capacity to store patterns. Kimi has 1 trillion total.
- Pre-training: The first, longest phase: learning language and knowledge by predicting the next token across trillions of tokens. Produces the "base model."
- Post-training: Everything after pre-training (SFT + RL) that turns a raw base model into a helpful, safe, tool-using assistant.
- SFT (Supervised Fine-Tuning): Training the model to imitate curated examples of good (instruction → response) pairs.
- RL (Reinforcement Learning): Training by trial, scoring, and reward: the model practices tasks and is pushed toward higher-scoring behavior.
- Reward (verifiable vs. self-critic): The score that guides RL. Verifiable rewards come from objective checks (tests pass, math correct). Self-critic rewards have the model grade its own work against rubrics for fuzzy tasks.
- MoE (Mixture-of-Experts): An architecture with many expert sub-networks where only a few run per token, giving huge capacity at low running cost.
- Sparsity: The ratio of total experts to active experts. Higher sparsity (Kimi uses 48) gave better performance per unit of compute.
- Attention / MLA: Attention lets the model weigh which earlier tokens matter. Multi-head Latent Attention compresses attention memory so long contexts fit in GPU RAM.
- Optimizer (Muon / MuonClip / AdamW): The algorithm that decides how to update weights each step. Muon is more token-efficient than the classic AdamW; MuonClip adds stability.
- Loss: A number measuring how wrong the model's predictions are. Training drives it down; sudden "loss spikes" signal instability.
- Learning rate: How big each weight update is. Warmed up at the start, decayed toward the end.
- Context window: The maximum amount of text (in tokens) the model can consider at once. Kimi K2.7-Code: 256K tokens.
- Quantization (INT4 / QAT): Storing weights at lower numeric precision to save memory and speed up inference. QAT trains the model to tolerate this so quality stays high.
- Agentic / Agent Swarm: "Agentic" means the model plans and acts using tools over many steps. Agent Swarm runs many sub-agents in parallel under a trainable orchestrator.
- Trajectory: A full recorded sequence of an agent's steps (thoughts, tool calls, results) for one task; the raw material for agentic training data.