Benchmarks
Evaluation scores across four standard benchmarks, plus speed and context. All scores come from internal pre-release runs and are subject to change at launch.
Scores are from internal pre-release evaluations; higher is better. All benchmarks are run with greedy decoding unless otherwise noted. Final launch numbers may differ.
Methodology
Benchmarks only matter if they're reproducible. Here's exactly how each number on this page was produced.
MMLU, HumanEval, MATH/AIME, and HellaSwag — the same public evaluation sets used across the industry. No custom filters.
All scores use greedy decoding (temperature 0) unless noted. This isolates model capability from sampling luck.
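To make the greedy-decoding point concrete, here is a minimal sketch of token selection. It is an illustration only, not the evaluation harness: the `pick_token` helper and the logit values are hypothetical. At temperature 0 the choice is a plain argmax, so repeated runs return the same token, while any positive temperature draws from a distribution and can vary from run to run.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=0.0, rng=None):
    """Hypothetical helper: return the index of the next token.

    temperature == 0 is treated as greedy decoding (argmax), which is
    deterministic. Any positive temperature samples from the softmax.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for a four-token vocabulary (illustrative values only).
logits = [1.2, 3.4, 0.5, 3.1]

# Greedy decoding: 100 calls, one distinct answer.
greedy_picks = {pick_token(logits) for _ in range(100)}

# Sampling at temperature 1.0 with different seeds: answers can differ.
sampled_picks = {
    pick_token(logits, temperature=1.0, rng=random.Random(seed))
    for seed in range(100)
}

print("greedy:", greedy_picks)
print("sampled:", sampled_picks)
```

With sampling, a benchmark answer can flip between runs for reasons unrelated to the model, which is the "sampling luck" that greedy decoding removes.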
Runs are conducted internally on launch-candidate builds. We'll publish full methodology and prompts at release.
Evaluation prompts and scoring scripts will be shared at launch so you can re-run every number yourself.
Throughput and latency are p50 figures from a controlled workload, not best-case peaks. Real-world times depend on load.
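The difference between a p50 figure and a best-case peak is easy to show with a few numbers. The sketch below uses made-up latency samples and a hypothetical `summarize` helper; it is not the measurement code behind this page. The p50 is simply the median of the samples, so it reflects a typical request rather than the fastest one, and a single slow outlier does not drag it around the way it drags the mean.

```python
import statistics


def summarize(latencies_ms):
    """Hypothetical helper: summarize a list of latency samples (ms)."""
    return {
        "p50": statistics.median(latencies_ms),  # typical request
        "best_case": min(latencies_ms),          # fastest single request
        "mean": statistics.mean(latencies_ms),   # sensitive to outliers
    }


# Made-up samples, including one slow outlier (illustrative values only).
samples_ms = [212, 198, 240, 205, 1250, 220, 201]

summary = summarize(samples_ms)
print(summary)
```

Here the best case is faster than the p50, and the one slow request pushes the mean well above it; reporting the p50 avoids both distortions.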
We report every model on every benchmark — including the ones where a model is weaker. No hidden comparisons.
Final scores and full methodology ship with the models. Join the waitlist to run your own evals during early access.