MolecularIQ Leaderboard

Characterizing chemical reasoning capabilities through symbolic verification on molecular graphs

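Rank | Model               | Type | Size        | Reasoning | Overall | Counting | Indexing | Generation
1 🥇 | GPT-OSS 120B (High) |      | 120B (A5B)  | Yes       | 0.475   | 0.468    | 0.425    | 0.537
2 🥈 | Qwen-3 235B         |      | 235B (A22B) | Yes       | 0.392   | 0.371    | 0.345    | 0.467
3 🥉 | DeepSeek-R1-0528    |      | 671B (A37B) | Yes       | 0.371   | 0.364    | 0.302    | 0.455
4    | GPT-OSS 20B (High)  |      | 20B (A4B)   | Yes       | 0.357   | 0.391    | 0.327    | 0.352
5    | Qwen-3 Next 80B     |      | 80B (A3B)   | Yes       | 0.323   | 0.307    | 0.269    | 0.400
6    | GPT-OSS 20B (Med)   |      | 20B (A4B)   | Yes       | 0.272   | 0.279    | 0.222    | 0.321
7    | GPT-OSS 120B (Med)  |      | 120B (A5B)  | Yes       | 0.265   | 0.291    | 0.223    | 0.280
8    | SEED-OSS            |      | 36B         | Yes       | 0.243   | 0.268    | 0.256    | 0.200
9    | Qwen-3 30B          |      | 30B (A3B)   | Yes       | 0.227   | 0.230    | 0.165    | 0.294
10   | GLM-4.6             |      | 355B (A32B) | Yes       | 0.162   | 0.159    | 0.113    | 0.220
11   | GPT-OSS 120B (Low)  |      | 120B (A5B)  | Yes       | 0.159   | 0.171    | 0.107    | 0.203
12   | Qwen-3 32B          |      | 32B         | Yes       | 0.156   | 0.170    | 0.091    | 0.210
13   | Qwen-3 14B          |      | 14B         | Yes       | 0.131   | 0.144    | 0.075    | 0.177
14   | Qwen-3 8B           |      | 8B          | Yes       | 0.112   | 0.125    | 0.058    | 0.158
15   | Nemotron-Nano 9B v2 |      | 9B          | Yes       | 0.111   | 0.122    | 0.066    | 0.148
16   | GPT-OSS 20B (Low)   |      | 20B (A4B)   | Yes       | 0.108   | 0.130    | 0.055    | 0.140
17   | LLaMA-3.3 70B       |      | 70B         | No        | 0.090   | 0.111    | 0.016    | 0.147
18   | ChemDFM-R-14B       |      | 14B         | Yes       | 0.087   | 0.129    | 0.028    | 0.105
19   | Mistral-Small       |      | 24B         | No        | 0.083   | 0.097    | 0.016    | 0.141
20   | Gemma-3 27B         |      | 27B         | No        | 0.078   | 0.099    | 0.012    | 0.127
21   | Gemma-2 27B         |      | 27B         | No        | 0.069   | 0.066    | 0.009    | 0.139
22   | Qwen-2.5 14B        |      | 14B         | No        | 0.066   | 0.076    | 0.012    | 0.115
23   | Ether0              |      | 24B         | Yes       | 0.065   | 0.032    | 0.001    | 0.175
24   | Gemma-3 12B         |      | 12B         | No        | 0.051   | 0.064    | 0.008    | 0.084
25   | Qwen-2.5 7B         |      | 7B          | No        | 0.051   | 0.060    | 0.019    | 0.075
26   | TxGemma-27B         |      | 27B         | No        | 0.050   | 0.070    | 0.018    | 0.062
27   | LLaMA-3 8B          |      | 8B          | No        | 0.047   | 0.060    | 0.002    | 0.081
28   | Gemma-2 9B          |      | 9B          | No        | 0.040   | 0.039    | 0.019    | 0.063
29   | ChemDFM-8B          |      | 8B          | No        | 0.036   | 0.049    | 0.001    | 0.060
30   | MolReasoner-Cap     |      | 7B          | Yes       | 0.035   | 0.056    | 0.008    | 0.040
31   | MolReasoner-Gen     |      | 7B          | Yes       | 0.032   | 0.049    | 0.015    | 0.032
32   | LLaMA-2 13B         |      | 13B         | No        | 0.031   | 0.039    | 0.002    | 0.053
33   | TxGemma-9B          |      | 9B          | No        | 0.026   | 0.041    | 0.005    | 0.031
34   | ChemDFM-13B         |      | 13B         | No        | 0.024   | 0.041    | 0.002    | 0.027
35   | Llama-3 MolInst     |      | 8B          | No        | 0.020   | 0.046    | 0.005    | 0.007
36   | LlaSMol             |      | 7B          | No        | 0.015   | 0.016    | 0.018    | 0.011
37   | ChemLLM             |      | 7B          | No        | 0.014   | 0.020    | 0.003    | 0.019
38   | Mistral-7B v0.1     |      | 7B          | No        | 0.012   | 0.020    | 0.003    | 0.012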
Scoring: Each rollout receives a binary reward (1 for correct, 0 for incorrect). A question's score is the mean reward over its three rollouts, and each column value is the mean of these per-question scores over all questions in that category.
Rank: Based on Overall score (descending)
Model Type: △ for generalist/base models, for chemistry-finetuned models
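
The scoring and ranking rules above can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual evaluation code: the record layout `(question_id, category, is_correct)`, the function names `category_scores` and `rank_models`, and the sample rollouts are all hypothetical. It also assumes that the Overall column is the mean of per-question scores over all questions, which the notes above do not state explicitly.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rollout records: (question_id, category, is_correct).
# The question IDs and outcomes are invented for illustration only.
example_rollouts = [
    ("q1", "Counting", True), ("q1", "Counting", True), ("q1", "Counting", False),
    ("q2", "Counting", False), ("q2", "Counting", False), ("q2", "Counting", False),
    ("q3", "Indexing", True), ("q3", "Indexing", True), ("q3", "Indexing", True),
    ("q4", "Generation", False), ("q4", "Generation", True), ("q4", "Generation", False),
]


def category_scores(rollouts):
    """Average binary rewards per question, then per category."""
    rewards_by_question = defaultdict(list)
    category_of = {}
    for question_id, category, is_correct in rollouts:
        # Binary reward: 1 for a correct rollout, 0 for an incorrect one.
        rewards_by_question[question_id].append(1.0 if is_correct else 0.0)
        category_of[question_id] = category

    # Per-question score: mean reward over that question's rollouts.
    question_score = {q: mean(r) for q, r in rewards_by_question.items()}

    # Per-category score: mean of per-question scores in the category.
    scores_by_category = defaultdict(list)
    for question_id, score in question_score.items():
        scores_by_category[category_of[question_id]].append(score)
    scores = {cat: mean(vals) for cat, vals in scores_by_category.items()}

    # Assumption: Overall averages over all questions, not over categories.
    scores["Overall"] = mean(question_score.values())
    return scores


def rank_models(overall_by_model):
    """Return (model, overall) pairs sorted by Overall score, descending."""
    return sorted(overall_by_model.items(), key=lambda item: item[1], reverse=True)


example_scores = category_scores(example_rollouts)

# Overall values for three models, taken from the table above.
example_ranking = rank_models({
    "GPT-OSS 20B (High)": 0.357,
    "GPT-OSS 120B (High)": 0.475,
    "Qwen-3 235B": 0.392,
})
```

Averaging per question first and per category second means every question carries equal weight within its category, regardless of how many rollouts were actually recorded for it.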