FormationEval
72 language models evaluated on 505 petroleum geoscience MCQs
March 2026 suite update
The original FormationEval OG benchmark (505 MCQs) was created at Christmas 2025. FormationEval now also includes two imported tracks: DISKOS-QA (added 17 March 2026) and SPE MCQ (added 21 March 2026). The public leaderboard and quiz still reflect only the evaluated 505-question MCQ v0.1 track. A full rerun on the expanded suite is pending: this is a self-funded, one-person project, and evaluating the expanded suite requires materially more token spend.
If you want to collaborate, support reruns, or discuss related research and engineering work, contact almaz.ermilov@gmail.com.
Model comparison
All 72 models ranked by accuracy on the evaluated MCQ v0.1 track. Filter by company or type, or view domain-specific performance.
Leaderboard
| # | Model | Company | Open | Price (In/Out, per 1M tokens) | Accuracy | Correct |
|---|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | Google | No | $2.00/$12.00 | 99.8% | 504/505 |
| 2 | glm-4.7 | Zhipu | Yes | $0.40/$1.50 | 98.6% | 498/505 |
| 3 | gemini-3-flash-preview | Google | No | $0.50/$3.00 | 98.2% | 496/505 |
| 4 | gemini-2.5-pro | Google | No | $1.25/$10.00 | 97.8% | 494/505 |
| 5 | grok-4.1-fast | xAI | No | $0.20/$0.50 | 97.6% | 493/505 |
| 6 | gpt-5.2-chat-medium | OpenAI | No | $1.75/$14.00 | 97.4% | 492/505 |
| 7 | kimi-k2-thinking | Moonshot | No | $0.40/$1.75 | 97.2% | 491/505 |
| 8 | claude-opus-4.5 | Anthropic | No | $5.00/$25.00 | 97.0% | 490/505 |
| 9 | gpt-5.2-chat-high | OpenAI | No | $1.75/$14.00 | 96.8% | 489/505 |
| 10 | gpt-5.2-chat-low | OpenAI | No | $1.75/$14.00 | 96.8% | 489/505 |
| 11 | gpt-5-mini-medium | OpenAI | No | $0.25/$2.00 | 96.4% | 487/505 |
| 12 | gpt-5.1-chat-medium | OpenAI | No | $1.25/$10.00 | 96.4% | 487/505 |
| 13 | deepseek-r1 | DeepSeek | Yes | $0.30/$1.20 | 96.2% | 486/505 |
| 14 | grok-4-fast | xAI | No | $0.20/$0.50 | 96.0% | 485/505 |
| 15 | gpt-5-mini-high | OpenAI | No | $0.25/$2.00 | 95.6% | 483/505 |
| 16 | gpt-5-mini-low | OpenAI | No | $0.25/$2.00 | 95.2% | 481/505 |
| 17 | o4-mini-high | OpenAI | No | $1.10/$4.40 | 95.2% | 481/505 |
| 18 | gemini-2.5-flash | Google | No | $0.30/$2.50 | 95.0% | 480/505 |
| 19 | o4-mini-medium | OpenAI | No | $1.10/$4.40 | 95.0% | 480/505 |
| 20 | grok-3-mini | xAI | No | $0.30/$0.50 | 95.0% | 480/505 |
| 21 | deepseek-v3.2 | DeepSeek | Yes | $0.22/$0.32 | 94.9% | 479/505 |
| 22 | gpt-5.1-chat-low | OpenAI | No | $1.25/$10.00 | 94.9% | 479/505 |
| 23 | o3-mini-low | OpenAI | No | $1.10/$4.40 | 94.9% | 479/505 |
| 24 | o3-mini-medium | OpenAI | No | $1.10/$4.40 | 94.9% | 479/505 |
| 25 | claude-3.7-sonnet | Anthropic | No | $3.00/$15.00 | 94.7% | 478/505 |
| 26 | o3-mini-high | OpenAI | No | $1.10/$4.40 | 94.7% | 478/505 |
| 27 | gpt-5-chat | OpenAI | No | $1.25/$10.00 | 94.5% | 477/505 |
| 28 | o4-mini-low | OpenAI | No | $1.10/$4.40 | 94.3% | 476/505 |
| 29 | gpt-5.1-chat-high | OpenAI | No | $1.25/$10.00 | 93.9% | 474/505 |
| 30 | gpt-4.1 | OpenAI | No | $2.00/$8.00 | 93.7% | 473/505 |
| 31 | gemini-2.0-flash-001 | Google | No | $0.10/$0.40 | 93.3% | 471/505 |
| 32 | gpt-5-nano-low | OpenAI | No | $0.05/$0.40 | 93.3% | 471/505 |
| 33 | llama-4-scout | Meta | Yes | $0.08/$0.30 | 93.1% | 470/505 |
| 34 | mistral-medium-3.1 | Mistral | Yes | $0.40/$2.00 | 93.1% | 470/505 |
| 35 | qwen3-235b-a22b-2507 | Alibaba | Yes | $0.07/$0.46 | 93.1% | 470/505 |
| 36 | qwen3-30b-a3b-thinking-2507 | Alibaba | Yes | $0.05/$0.34 | 93.1% | 470/505 |
| 37 | gpt-4o | OpenAI | No | $2.50/$10.00 | 92.9% | 469/505 |
| 38 | gpt-5-nano-high | OpenAI | No | $0.05/$0.40 | 92.9% | 469/505 |
| 39 | gpt-5-nano-medium | OpenAI | No | $0.05/$0.40 | 92.9% | 469/505 |
| 40 | minimax-m2 | MiniMax | No | $0.20/$1.00 | 92.9% | 469/505 |
| 41 | qwen3-14b | Alibaba | Yes | $0.05/$0.22 | 92.9% | 469/505 |
| 42 | qwen3-32b | Alibaba | Yes | $0.08/$0.24 | 92.1% | 465/505 |
| 43 | gpt-4.1-mini | OpenAI | No | $0.40/$1.60 | 91.7% | 463/505 |
| 44 | claude-haiku-4.5 | Anthropic | No | $1.00/$5.00 | 91.5% | 462/505 |
| 45 | gemini-2.5-flash-lite | Google | No | $0.10/$0.40 | 91.3% | 461/505 |
| 46 | gpt-oss-120b | OpenAI | Yes | $0.04/$0.19 | 90.7% | 458/505 |
| 47 | qwen3-vl-8b-thinking | Alibaba | Yes | $0.18/$2.10 | 90.3% | 456/505 |
| 48 | mistral-small-3.2-24b-instruct | Mistral | Yes | $0.06/$0.18 | 89.3% | 451/505 |
| 49 | gpt-oss-20b | OpenAI | Yes | $0.03/$0.14 | 89.3% | 451/505 |
| 50 | claude-sonnet-4.5 | Anthropic | No | $3.00/$15.00 | 89.1% | 450/505 |
| 51 | mistral-small-24b-instruct-2501 | Mistral | Yes | $0.03/$0.11 | 88.7% | 448/505 |
| 52 | qwen3-8b | Alibaba | Yes | $0.03/$0.11 | 88.7% | 448/505 |
| 53 | phi-4-reasoning-plus | Microsoft | Yes | $0.07/$0.35 | 87.7% | 443/505 |
| 54 | ministral-14b-2512 | Mistral | Yes | $0.20/$0.20 | 87.7% | 443/505 |
| 55 | qwen3-vl-8b-instruct | Alibaba | Yes | $0.06/$0.40 | 87.5% | 442/505 |
| 56 | glm-4-32b | Zhipu | Yes | $0.10/$0.10 | 87.3% | 441/505 |
| 57 | ministral-8b-2512 | Mistral | Yes | $0.15/$0.15 | 86.9% | 439/505 |
| 58 | gpt-4.1-nano | OpenAI | No | $0.10/$0.40 | 86.1% | 435/505 |
| 59 | gemma-3-27b-it | Google | Yes | $0.04/$0.15 | 85.3% | 431/505 |
| 60 | deepseek-r1-0528-qwen3-8b | DeepSeek | Yes | $0.02/$0.10 | 85.1% | 430/505 |
| 61 | gpt-4o-mini | OpenAI | No | $0.15/$0.60 | 84.8% | 428/505 |
| 62 | claude-3.5-haiku | Anthropic | No | $0.80/$4.00 | 84.0% | 424/505 |
| 63 | gemma-3-12b-it | Google | Yes | $0.03/$0.10 | 82.2% | 415/505 |
| 64 | nemotron-nano-9b-v2 | Nvidia | Yes | $0.04/$0.16 | 79.6% | 402/505 |
| 65 | ministral-3b-2512 | Mistral | Yes | $0.10/$0.10 | 79.2% | 400/505 |
| 66 | mistral-nemo | Mistral | Yes | $0.02/$0.04 | 78.8% | 398/505 |
| 67 | nemotron-3-nano-30b-a3b | Nvidia | Yes | $0.06/$0.24 | 77.4% | 391/505 |
| 68 | nemotron-nano-12b-v2-vl | Nvidia | Yes | $0.20/$0.60 | 77.4% | 391/505 |
| 69 | gemma-3n-e4b-it | Google | Yes | $0.02/$0.04 | 75.2% | 380/505 |
| 70 | llama-3.1-8b-instruct | Meta | Yes | $0.02/$0.03 | 72.5% | 366/505 |
| 71 | gemma-3-4b-it | Google | Yes | $0.02/$0.07 | 71.3% | 360/505 |
| 72 | llama-3.2-3b-instruct | Meta | Yes | $0.02/$0.02 | 57.6% | 291/505 |
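The Accuracy column is simply Correct/505. A minimal sketch of recomputing it from a few rows copied out of the table above (the listed input/output prices are carried along but not used in the accuracy itself):

```python
# A few rows copied from the leaderboard:
# (model, input $/1M tokens, output $/1M tokens, correct, total)
rows = [
    ("gemini-3-pro-preview", 2.00, 12.00, 504, 505),
    ("glm-4.7", 0.40, 1.50, 498, 505),
    ("llama-3.2-3b-instruct", 0.02, 0.02, 291, 505),
]

# Recompute the Accuracy column: correct answers / total questions,
# rounded to one decimal place as shown in the table.
for model, p_in, p_out, correct, total in rows:
    accuracy = round(correct / total * 100, 1)
    print(f"{model}: {accuracy}% ({correct}/{total})")
```

This reproduces the displayed figures, e.g. 504/505 rounds to 99.8%.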
Visualizations
Accuracy vs price
Higher-accuracy models tend to be more expensive. Green dots are open-weight models.
Charts: top 33 models overall, and top 33 open-weight models.
Domain performance heatmap
Accuracy (%) breakdown by domain for the top 50 models.
| Model | Drilling | Geophysics | Petroleum Geology | Petrophysics | Production | Reservoir | Sedimentology |
|---|---|---|---|---|---|---|---|
| gemini-3-pro-preview | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| glm-4.7 | 100 | 100 | 99 | 98 | 100 | 100 | 99 |
| gemini-3-flash-preview | 100 | 99 | 99 | 97 | 100 | 100 | 99 |
| gemini-2.5-pro | 96 | 99 | 99 | 97 | 93 | 100 | 100 |
| grok-4.1-fast | 96 | 100 | 99 | 96 | 100 | 100 | 100 |
| gpt-5.2-chat-medium | 96 | 100 | 99 | 96 | 100 | 100 | 99 |
| kimi-k2-thinking | 96 | 99 | 99 | 96 | 100 | 98 | 98 |
| claude-opus-4.5 | 96 | 96 | 98 | 96 | 100 | 100 | 97 |
| gpt-5.2-chat-high | 96 | 100 | 99 | 95 | 100 | 100 | 98 |
| gpt-5.2-chat-low | 96 | 99 | 99 | 96 | 100 | 98 | 98 |
| gpt-5-mini-medium | 96 | 100 | 98 | 95 | 93 | 100 | 99 |
| gpt-5.1-chat-medium | 96 | 98 | 99 | 95 | 100 | 100 | 98 |
| deepseek-r1 | 96 | 98 | 99 | 95 | 100 | 100 | 97 |
| grok-4-fast | 96 | 100 | 99 | 93 | 100 | 100 | 99 |
| gpt-5-mini-high | 96 | 100 | 99 | 93 | 93 | 100 | 100 |
| gpt-5-mini-low | 96 | 100 | 97 | 92 | 100 | 98 | 99 |
| o4-mini-high | 96 | 100 | 97 | 92 | 100 | 100 | 100 |
| gemini-2.5-flash | 88 | 98 | 99 | 93 | 100 | 100 | 98 |
| o4-mini-medium | 92 | 99 | 98 | 92 | 93 | 100 | 99 |
| grok-3-mini | 96 | 98 | 98 | 92 | 100 | 98 | 98 |
| deepseek-v3.2 | 92 | 96 | 97 | 92 | 100 | 100 | 97 |
| gpt-5.1-chat-low | 92 | 93 | 97 | 95 | 100 | 93 | 98 |
| o3-mini-low | 96 | 99 | 98 | 92 | 100 | 98 | 97 |
| o3-mini-medium | 96 | 99 | 99 | 92 | 100 | 100 | 97 |
| claude-3.7-sonnet | 92 | 94 | 95 | 93 | 100 | 100 | 96 |
| o3-mini-high | 96 | 99 | 98 | 92 | 100 | 95 | 97 |
| gpt-5-chat | 96 | 91 | 97 | 93 | 100 | 98 | 97 |
| o4-mini-low | 96 | 99 | 97 | 91 | 93 | 98 | 99 |
| gpt-5.1-chat-high | 96 | 89 | 96 | 93 | 100 | 93 | 99 |
| gpt-4.1 | 96 | 90 | 95 | 92 | 100 | 95 | 97 |
| gemini-2.0-flash-001 | 100 | 96 | 97 | 90 | 93 | 98 | 99 |
| gpt-5-nano-low | 100 | 95 | 97 | 90 | 86 | 95 | 98 |
| llama-4-scout | 88 | 98 | 96 | 90 | 100 | 98 | 98 |
| mistral-medium-3.1 | 96 | 95 | 97 | 89 | 100 | 100 | 98 |
| qwen3-235b-a22b-2507 | 92 | 93 | 97 | 91 | 79 | 95 | 96 |
| qwen3-30b-a3b-thinking-2507 | 100 | 96 | 98 | 89 | 93 | 98 | 97 |
| gpt-4o | 92 | 90 | 96 | 90 | 100 | 98 | 97 |
| gpt-5-nano-high | 96 | 96 | 98 | 89 | 86 | 100 | 97 |
| gpt-5-nano-medium | 96 | 95 | 98 | 89 | 93 | 100 | 96 |
| minimax-m2 | 96 | 94 | 95 | 90 | 86 | 98 | 96 |
| qwen3-14b | 96 | 95 | 97 | 90 | 93 | 95 | 96 |
| qwen3-32b | 88 | 96 | 95 | 89 | 86 | 100 | 97 |
| gpt-4.1-mini | 88 | 90 | 95 | 89 | 100 | 95 | 98 |
| claude-haiku-4.5 | 92 | 95 | 95 | 88 | 93 | 100 | 96 |
| gemini-2.5-flash-lite | 100 | 94 | 93 | 89 | 79 | 93 | 95 |
| gpt-oss-120b | 88 | 94 | 94 | 88 | 100 | 95 | 91 |
| qwen3-vl-8b-thinking | 92 | 94 | 93 | 87 | 93 | 95 | 95 |
| mistral-small-3.2-24b-instruct | 92 | 91 | 93 | 86 | 93 | 95 | 95 |
| gpt-oss-20b | 92 | 96 | 93 | 85 | 100 | 93 | 91 |
| claude-sonnet-4.5 | 88 | 86 | 89 | 91 | 100 | 95 | 83 |
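Heatmap rows like these make it easy to spot where a model underperforms. A small sketch, using two rows copied from the table, that picks out each model's weakest domain (the domain order matches the column order above):

```python
# Domain columns in the same order as the heatmap table.
domains = ["Drilling", "Geophysics", "Petroleum Geology", "Petrophysics",
           "Production", "Reservoir", "Sedimentology"]

# Two rows copied from the heatmap (per-domain accuracy in %).
heatmap = {
    "gemini-2.5-pro": [96, 99, 99, 97, 93, 100, 100],
    "claude-sonnet-4.5": [88, 86, 89, 91, 100, 95, 83],
}

# min() over (score, domain) pairs finds the lowest-scoring domain.
for model, scores in heatmap.items():
    score, domain = min(zip(scores, domains))
    print(f"{model}: weakest in {domain} ({score}%)")
```

For example, gemini-2.5-pro's weakest domain in this table is Production at 93%, while claude-sonnet-4.5 bottoms out in Sedimentology at 83%.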