Benchmarks

Coming soon

How the models compare

Evaluation scores across four standard benchmarks plus speed and context. All scores are from internal pre-release runs and are subject to change at launch.

  • Flash
  • Code
  • Pro

Benchmark     Measures               Flash   Code   Pro
MMLU          General knowledge      78.0    82.0   86.0
HumanEval     Code generation        71.0    88.0   84.0
MATH / AIME   Math reasoning         42.0    58.0   67.0
HellaSwag     Commonsense reasoning  81.0    83.0   85.0

Scores are from internal pre-release evaluations. Higher is better. All benchmarks are run with greedy decoding unless otherwise noted. Final launch numbers may differ.

Methodology

How we evaluate

Benchmarks only matter if they're reproducible. Here's exactly how each number on this page was produced.

Standard suites

MMLU, HumanEval, MATH/AIME, and HellaSwag — the same public evaluation sets used across the industry. No custom filters.

Greedy decoding

All scores use greedy decoding (temperature 0) unless noted. This isolates model capability from sampling luck.
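As a rough illustration of what greedy decoding means, the sketch below picks the single highest-scoring token at every step, so the same prompt always produces the same output. The `toy_logits` function and its five-token vocabulary are hypothetical stand-ins for a real model, and this is not the evaluation harness behind the scores on this page.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Extend `prompt` by always choosing the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)
        # Greedy step: take the argmax instead of sampling from the distribution.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_logits(tokens: Sequence[int]) -> List[float]:
    """Hypothetical model: always prefers (last token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_logits, prompt=[0], max_new_tokens=10, eos_id=4))
# prints [0, 1, 2, 3, 4]
```

Because no random draw is involved, repeating the call returns an identical sequence, which is the property that keeps sampling luck out of a score.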

Internal, pre-release

Runs are conducted internally on launch-candidate builds. We'll publish full methodology and prompts at release.

Reproducible prompts

Evaluation prompts and scoring scripts will be shared at launch so you can re-run every number yourself.

Speed at p50

Throughput and latency are p50 figures from a controlled workload, not best-case peaks. Real-world times depend on load.
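To make the difference between a p50 figure and a best-case peak concrete, here is a minimal sketch using Python's standard `statistics` module. The latency samples are made up for illustration and are not measurements of any model on this page.

```python
import statistics
from typing import Sequence


def p50(samples: Sequence[float]) -> float:
    """Return the median (50th percentile) of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return statistics.median(samples)


# Made-up latencies in milliseconds, including one slow outlier.
latencies_ms = [120.0, 95.0, 110.0, 480.0, 105.0]

typical = p50(latencies_ms)    # middle value once the samples are sorted
best_case = min(latencies_ms)  # the kind of peak figure a p50 avoids
print(f"p50={typical} ms, best case={best_case} ms")
# prints p50=110.0 ms, best case=95.0 ms
```

The median ignores how extreme the outlier is, so it reflects what a typical request sees under the tested workload, not the fastest or slowest one.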

No cherry-picking

We report every model on every benchmark — including the ones where a model is weaker. No hidden comparisons.
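One way to sanity-check that claim against the pre-release scores listed on this page is to hold them in a grid and confirm that no model and benchmark pair is missing. The helper names `missing_cells` and `leaders` are illustrative, not part of any published tooling.

```python
from typing import Dict, List, Sequence, Tuple

# Pre-release scores as listed on this page (subject to change at launch).
SCORES: Dict[str, Dict[str, float]] = {
    "MMLU": {"Flash": 78.0, "Code": 82.0, "Pro": 86.0},
    "HumanEval": {"Flash": 71.0, "Code": 88.0, "Pro": 84.0},
    "MATH / AIME": {"Flash": 42.0, "Code": 58.0, "Pro": 67.0},
    "HellaSwag": {"Flash": 81.0, "Code": 83.0, "Pro": 85.0},
}
MODELS = ("Flash", "Code", "Pro")


def missing_cells(
    scores: Dict[str, Dict[str, float]], models: Sequence[str]
) -> List[Tuple[str, str]]:
    """List every (benchmark, model) pair that has no reported score."""
    return [(b, m) for b, row in scores.items() for m in models if m not in row]


def leaders(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Name the top-scoring model on each benchmark."""
    return {b: max(row, key=row.get) for b, row in scores.items()}


print(missing_cells(SCORES, MODELS))  # prints []
print(leaders(SCORES)["HumanEval"])   # prints Code
```

The second check shows the grid includes a case where the largest model is not the leader: Code scores 88.0 on HumanEval against 84.0 for Pro.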

Coming soon

Benchmarks update at launch.

Final scores and full methodology ship with the models. Join the waitlist to run your own evals during early access.