FrontierMath: AI versus advanced mathematical research

A benchmark of several hundred unpublished, expert-level mathematics problems that take specialists hours to days to solve. Difficulty tiers 1-3 range from undergraduate level to early graduate level, while tier 4 covers research-level mathematics. The project is led by Epoch and OpenAI.

AI model performance on FrontierMath

[Figure legend: OpenAI, Anthropic, Google, xAI]

CC-BY

FrontierMath leaderboard

Model	Accuracy	# Correct	Organization
GPT-5 (high)	12.5%±4.8%	6 / 48	OpenAI
GPT-5 Pro	12.5%±4.8%	6 / 48	OpenAI
Gemini 2.5 Deep Think	10.4%±4.4%	5 / 48	Google, Google DeepMind
GPT-5 mini (high)	6.3%±3.5%	3 / 48	OpenAI
GPT-5 (medium)	6.3%±3.5%	3 / 48	OpenAI
o4-mini (high)	6.3%±3.5%	3 / 48	OpenAI
Claude Sonnet 4.5 (32K thinking)	4.2%±2.9%	2 / 48	Anthropic
GPT-5 mini (medium)	4.2%±2.9%	2 / 48	OpenAI
Claude Opus 4.1 (27K thinking)	4.2%±2.9%	2 / 48	Anthropic
Gemini 2.5 Pro	4.2%±2.9%	2 / 48	Google DeepMind
o3-mini (high)	4.2%±2.9%	2 / 48	OpenAI
Claude Opus 4 (27K thinking)	4.2%±2.9%	2 / 48	Anthropic
Claude Haiku 4.5 (32K thinking)	2.1%±2.1%	1 / 48	Anthropic
Claude Sonnet 4.5 (no thinking)	2.1%±2.1%	1 / 48	Anthropic
Grok 4	2.1%±2.1%	1 / 48	xAI
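
The ± values in the table above are consistent with a binomial standard error computed from the number of correct answers out of 48. The page does not state the method, so this is an assumption. A minimal sketch of that calculation follows; the function name `accuracy_with_stderr` is hypothetical and is not part of any published FrontierMath tooling.

```python
import math


def accuracy_with_stderr(correct: int, total: int) -> tuple[float, float]:
    """Return (accuracy, standard error), both as percentages.

    Assumption: each problem is an independent pass/fail trial, so the
    standard error of the proportion is sqrt(p * (1 - p) / n).
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * se


# A few rows from the table above (counts of correct answers out of 48).
for name, correct in [
    ("GPT-5 (high)", 6),
    ("Gemini 2.5 Deep Think", 5),
    ("Grok 4", 1),
]:
    acc, se = accuracy_with_stderr(correct, 48)
    print(f"{name}: {acc:.1f}% ± {se:.1f}%")
```

With only 48 problems, one additional correct answer moves accuracy by about 2.1 percentage points, which is smaller than most of the standard errors listed. Under this assumption, neighbouring rows in the leaderboard are therefore hard to distinguish statistically.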