ARB: Advanced Reasoning Benchmark for Large Language Models
Tomohiro Sawada1,2, Daniel Paleka1,3, Alexander Havrilla1,2, Pranav Tadepalli1,2, Paula Vidas1, Alexander Kranias1,2, John J Nay4,5, Kshitij Gupta1,6, Aran Komatsuzaki1,2
1 DuckAI, 2 Georgia Tech, 3 ETH Zürich, 4 Nomos AI, 5 Stanford University Center for Legal Informatics, 6 MILA
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks, such as MMLU and MATH. However, many of these benchmarks are losing utility as LLMs achieve increasingly high scores, even though the models have not yet reached expert-level performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems designed to evaluate LLMs on text comprehension and expert domain reasoning. ARB presents a more challenging test than prior benchmarks, featuring questions that test deeper knowledge of mathematics, physics, biology, chemistry, and law.
As a subset of ARB, we introduce a challenging set of math and physics problems that require advanced symbolic reasoning and domain knowledge. To improve both automatic and assisted symbolic evaluation capabilities, we introduce a rubric-based self-evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps.
We evaluated recent models such as GPT-4 and Claude on ARB and demonstrated that, even with chain-of-thought prompting, current models score well below 50% on the more demanding expert tasks. Further, we conducted a human evaluation of the symbolic subset of ARB, finding close agreement between annotators and GPT-4 self-evaluation scores.
Evaluation Results
Our evaluation of current large language models (LLMs) focuses on text-only problems, with no multimodal tasks, and covers ChatGPT, GPT-3.5, GPT-4, and Claude. Each question type is assessed with task-specific instructions and chain-of-thought prompting. For multiple-choice questions, the model's choice is compared directly with the correct answer. Numerical, symbolic, and proof-like problems require extracting and parsing the model's answer, often with the help of mathematical libraries and manual grading due to their complexity. We also tested two model-based grading approaches: asking GPT-4 to judge the equivalence of two symbolic expressions, and a rubric-based evaluation method. Both showed promising results, facilitating the evaluation of increasingly unstructured answers.
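To illustrate the library-assisted checking step for symbolic answers, the sketch below uses SymPy to test whether an extracted answer is equivalent to a reference expression. This is a minimal example of the kind of check described above, not the exact ARB grading pipeline; the symbolically_equivalent helper name is ours.

import sympy as sp

def symbolically_equivalent(model_answer: str, reference: str) -> bool:
    """Return True if the two expressions simplify to the same quantity."""
    try:
        difference = sp.sympify(model_answer) - sp.sympify(reference)
    except (sp.SympifyError, TypeError):
        return False  # unparsable answers fall back to manual grading
    return sp.simplify(difference) == 0

# Example: an extracted answer "2*sin(x)*cos(x)" matches the reference "sin(2*x)".
print(symbolically_equivalent("2*sin(x)*cos(x)", "sin(2*x)"))  # True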

Model-based Rubric Evaluation
As the complexity of reasoning tasks for large language models (LLMs) grows, reliable evaluation becomes challenging: symbolic answers are difficult to grade automatically, and intermediate reasoning steps are difficult to assess. We propose an approach in which the model generates and uses rubrics to evaluate solutions, based on reference solutions and examples of human-crafted rubrics. Our evaluation revealed that GPT-4 creates effective rubrics that cover the key solution steps well, though it struggles with point allocation; it clearly outperforms its predecessor, GPT-3.5-turbo, at this task.
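To make the rubric-based evaluation concrete, here is a minimal sketch of how a grader model could score a candidate solution against a point-based rubric. The prompt wording, the "Score: X/Y" output format, and the query_model callable are illustrative assumptions, not the exact prompts or interface used in ARB.

import re
from typing import Callable

def rubric_grade(problem: str, rubric: str, candidate_solution: str,
                 query_model: Callable[[str], str]) -> float:
    """Score a candidate solution against a point-based rubric using a grader LLM."""
    prompt = (
        "You are grading a solution to the following problem.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Rubric (award partial credit per item):\n{rubric}\n\n"
        f"Candidate solution:\n{candidate_solution}\n\n"
        "Go through the rubric item by item, justify each deduction, and "
        "report the total on the last line as 'Score: X/Y'."
    )
    response = query_model(prompt)  # caller supplies the grader model (e.g., GPT-4)
    match = re.search(r"Score:\s*([\d.]+)\s*/\s*([\d.]+)", response)
    if match is None:
        raise ValueError("Grader did not return a parsable score")
    points, total = float(match.group(1)), float(match.group(2))
    return points / total

The returned fraction can then be compared against scores from human annotators, as in the agreement study described in the abstract.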


BibTex
@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Perikles Kranias and John J Nay and Kshitij Gupta and Aran Komatsuzaki},
  year={2023},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={cs.LG, cs.CL}
}
Acknowledgements
We thank Jeffrey Deng for developing and documenting the API, and building the project website. We would also like to thank Raunak Chowdhuri for helpful comments, and Zhangir Azerbayev for useful discussions early on in the project. TS is supported by NSF grant 1745583.