# MolecularIQ Leaderboard

*Characterizing chemical reasoning capabilities through symbolic verification on molecular graphs*
| Rank | Model | Type | Size | Reasoning | Overall | Counting | Indexing | Generation |
|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-OSS 120B (High) | △ | 120B (A5B) | Yes | 0.475 | 0.468 | 0.425 | 0.537 |
| 🥈 | Qwen-3 235B | △ | 235B (A22B) | Yes | 0.392 | 0.371 | 0.345 | 0.467 |
| 🥉 | DeepSeek-R1-0528 | △ | 671B (A37B) | Yes | 0.371 | 0.364 | 0.302 | 0.455 |
| 4 | GPT-OSS 20B (High) | △ | 20B (A4B) | Yes | 0.357 | 0.391 | 0.327 | 0.352 |
| 5 | Qwen-3 Next 80B | △ | 80B (A3B) | Yes | 0.323 | 0.307 | 0.269 | 0.400 |
| 6 | GPT-OSS 20B (Med) | △ | 20B (A4B) | Yes | 0.272 | 0.279 | 0.222 | 0.321 |
| 7 | GPT-OSS 120B (Med) | △ | 120B (A5B) | Yes | 0.265 | 0.291 | 0.223 | 0.280 |
| 8 | SEED-OSS | △ | 36B | Yes | 0.243 | 0.268 | 0.256 | 0.200 |
| 9 | Qwen-3 30B | △ | 30B (A3B) | Yes | 0.227 | 0.230 | 0.165 | 0.294 |
| 10 | GLM-4.6 | △ | 355B (A32B) | Yes | 0.162 | 0.159 | 0.113 | 0.220 |
| 11 | GPT-OSS 120B (Low) | △ | 120B (A5B) | Yes | 0.159 | 0.171 | 0.107 | 0.203 |
| 12 | Qwen-3 32B | △ | 32B | Yes | 0.156 | 0.170 | 0.091 | 0.210 |
| 13 | Qwen-3 14B | △ | 14B | Yes | 0.131 | 0.144 | 0.075 | 0.177 |
| 14 | Qwen-3 8B | △ | 8B | Yes | 0.112 | 0.125 | 0.058 | 0.158 |
| 15 | Nemotron-Nano 9B v2 | △ | 9B | Yes | 0.111 | 0.122 | 0.066 | 0.148 |
| 16 | GPT-OSS 20B (Low) | △ | 20B (A4B) | Yes | 0.108 | 0.130 | 0.055 | 0.140 |
| 17 | LLaMA-3.3 70B | △ | 70B | No | 0.090 | 0.111 | 0.016 | 0.147 |
| 18 | ChemDFM-R-14B | ⌬ | 14B | Yes | 0.087 | 0.129 | 0.028 | 0.105 |
| 19 | Mistral-Small | △ | 24B | No | 0.083 | 0.097 | 0.016 | 0.141 |
| 20 | Gemma-3 27B | △ | 27B | No | 0.078 | 0.099 | 0.012 | 0.127 |
| 21 | Gemma-2 27B | △ | 27B | No | 0.069 | 0.066 | 0.009 | 0.139 |
| 22 | Qwen-2.5 14B | △ | 14B | No | 0.066 | 0.076 | 0.012 | 0.115 |
| 23 | Ether0 | ⌬ | 24B | Yes | 0.065 | 0.032 | 0.001 | 0.175 |
| 24 | Gemma-3 12B | △ | 12B | No | 0.051 | 0.064 | 0.008 | 0.084 |
| 25 | Qwen-2.5 7B | △ | 7B | No | 0.051 | 0.060 | 0.019 | 0.075 |
| 26 | TxGemma-27B | ⌬ | 27B | No | 0.050 | 0.070 | 0.018 | 0.062 |
| 27 | LLaMA-3 8B | △ | 8B | No | 0.047 | 0.060 | 0.002 | 0.081 |
| 28 | Gemma-2 9B | △ | 9B | No | 0.040 | 0.039 | 0.019 | 0.063 |
| 29 | ChemDFM-8B | ⌬ | 8B | No | 0.036 | 0.049 | 0.001 | 0.060 |
| 30 | MolReasoner-Cap | ⌬ | 7B | Yes | 0.035 | 0.056 | 0.008 | 0.040 |
| 31 | MolReasoner-Gen | ⌬ | 7B | Yes | 0.032 | 0.049 | 0.015 | 0.032 |
| 32 | LLaMA-2 13B | △ | 13B | No | 0.031 | 0.039 | 0.002 | 0.053 |
| 33 | TxGemma-9B | ⌬ | 9B | No | 0.026 | 0.041 | 0.005 | 0.031 |
| 34 | ChemDFM-13B | ⌬ | 13B | No | 0.024 | 0.041 | 0.002 | 0.027 |
| 35 | LLaMA-3 MolInst | ⌬ | 8B | No | 0.020 | 0.046 | 0.005 | 0.007 |
| 36 | LlaSMol | ⌬ | 7B | No | 0.015 | 0.016 | 0.018 | 0.011 |
| 37 | ChemLLM | ⌬ | 7B | No | 0.014 | 0.020 | 0.003 | 0.019 |
| 38 | Mistral-7B v0.1 | △ | 7B | No | 0.012 | 0.020 | 0.003 | 0.012 |
- **Scoring:** Each question receives a binary reward (1 for correct, 0 for incorrect). The per-question score is the average reward across three rollouts, and each column reports the mean per-question score over all questions in that category; a code sketch follows below.
- **Rank:** Ordered by Overall score, descending.
- **Model Type:** △ denotes generalist/base models; ⌬ denotes chemistry-finetuned models.
- **Size:** Total parameter count; a parenthesized value such as (A22B) gives the active parameters per token for mixture-of-experts models.
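
A minimal sketch of this scoring scheme, assuming results are keyed by (category, question) with one binary reward per rollout; the function name and data layout are illustrative, not taken from the benchmark's codebase:

```python
from collections import defaultdict

def category_scores(results):
    """Compute per-category scores from rollout outcomes.

    `results` maps (category, question_id) to a list of binary rewards
    (1 = correct, 0 = incorrect), one entry per rollout.
    """
    per_category = defaultdict(list)
    for (category, _question_id), rewards in results.items():
        # Per-question score: mean of the binary rewards across rollouts.
        per_category[category].append(sum(rewards) / len(rewards))
    # Category score: mean of per-question scores over all questions.
    return {cat: sum(s) / len(s) for cat, s in per_category.items()}

# Example: a Counting question answered correctly in 2 of 3 rollouts scores 2/3.
print(category_scores({("Counting", "q1"): [1, 0, 1]}))  # {'Counting': 0.666...}
```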