Benchmarks

Coming soon

How the models compare

Evaluation scores across four standard benchmarks plus speed and context. All scores are from internal pre-release runs and may change at launch.

Benchmark     Measures                Flash   Code   Pro
MMLU          General knowledge       78.0    82.0   86.0
HumanEval     Code generation         71.0    88.0   84.0
MATH / AIME   Math reasoning          42.0    58.0   67.0
HellaSwag     Commonsense reasoning   81.0    83.0   85.0

Scores from internal pre-release evaluations. Higher is better. All benchmarks run with greedy decoding unless otherwise noted. Final launch numbers may differ.

Methodology

How we evaluate

Benchmarks only matter if they're reproducible. Here's how each number on this page was produced.

Standard suites

MMLU, HumanEval, MATH/AIME, and HellaSwag — the same public evaluation sets used across the industry. No custom filters.

Greedy decoding

All scores use greedy decoding (temperature 0) unless noted. This isolates model capability from sampling luck.
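To make the distinction concrete, here is a minimal sketch of greedy selection versus temperature sampling over a single set of next-token logits. The logit values and helper names (`greedy_pick`, `sample_pick`) are illustrative assumptions, not part of our evaluation harness. Greedy decoding is treated as an argmax, which is the limit of sampling as temperature approaches 0.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given (non-zero) temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Temperature sampling: draw a token index according to softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.5, 0.3]

# Greedy returns the same token on every call.
greedy_choices = {greedy_pick(logits) for _ in range(100)}

# Sampling at temperature 1.0 spreads across several tokens.
rng = random.Random(0)
sampled_choices = {sample_pick(logits, 1.0, rng) for _ in range(100)}
```

Because the greedy pick never varies between runs, two runs of the same benchmark with the same prompts should produce the same outputs, which is what removes sampling luck from the score.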

Internal, pre-release

Runs are conducted internally on launch-candidate builds. We'll publish the full methodology and prompts at release.

Reproducible prompts

Evaluation prompts and scoring scripts will be shared at launch so you can re-run every number yourself.
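Until the real scripts ship, the following is only a rough sketch of the kind of scoring such a script might perform: turning per-item pass/fail outcomes into a percentage. The function name `score_percent` and the sample outcomes are hypothetical, and the published scripts may differ.

```python
def score_percent(outcomes):
    """Return the share of correct items as a percentage, rounded to one decimal.

    `outcomes` is a list of booleans, one per benchmark item
    (True = the model's answer was judged correct).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return round(100.0 * sum(outcomes) / len(outcomes), 1)


# Made-up outcomes for four items, purely for illustration.
outcomes = [True, True, False, True]
print(score_percent(outcomes))  # 75.0
```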

Speed at p50

Throughput and latency are p50 figures from a controlled workload, not best-case peaks. Real-world times depend on load.
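As a small illustration of why p50 differs from a best-case figure, the sketch below computes the median, minimum, and mean of a set of latency samples. The sample values are invented for the example and are not measurements from our models.

```python
import statistics

# Hypothetical request latencies in milliseconds, including one slow outlier.
latencies_ms = [120, 135, 128, 950, 131, 126, 140]

p50 = statistics.median(latencies_ms)   # the typical request: half are faster, half slower
best = min(latencies_ms)                # the best-case peak, which p50 deliberately avoids
mean = statistics.fmean(latencies_ms)   # pulled upward by the single outlier

print(p50, best, round(mean, 1))  # 131 120 247.1
```

The median stays close to what most requests see, while the minimum flatters the system and the mean is distorted by the outlier.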

No cherry-picking

We report every model on every benchmark — including the ones where a model is weaker. No hidden comparisons.

Coming soon

Benchmarks update at launch.

Final scores and full methodology ship with the models. Join the waitlist to run your own evals during early access.
