LLM Reasoners
Research on LLM reasoning mostly relies on final-answer accuracy as a proxy metric for the reasoning process. However...
  • A correct final answer does not always indicate a correct reasoning chain; such cases are referred to as "false positives".
  • LLMs may hallucinate when generating reasoning chains, which causes problems in many scenarios, such as using LLMs as tutors.
  • Existing reasoning-chain evaluation methods typically require massive training data, or lack accuracy and interpretability.
In this leaderboard, we focus on the direct evaluation of reasoning chains with our newly proposed metric AutoRace (Automated Reasoning Chain Evaluation). It evaluates reasoning chains and provides explanations, without any human effort. For a detailed explanation of the evaluation metric and an analysis of the results, please refer to our blog and paper.
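
For intuition, here is a minimal sketch of how an LLM-based reasoning-chain evaluator can be set up, assuming the `openai` Python client. The prompt, criteria, and function names below are illustrative placeholders, not the actual AutoRace prompts (AutoRace constructs its task-specific criteria automatically; see the blog and paper for the real procedure).

```python
# Illustrative sketch only: the prompt and criteria below are placeholders,
# not the automatically constructed criteria used by AutoRace.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVALUATION_PROMPT = """Below is a question and a student's step-by-step solution.
Check the reasoning chain step by step. If any step violates the criteria,
answer "INCORRECT"; otherwise answer "CORRECT". Then explain your judgment.

Criteria (placeholders):
1. Every step must follow logically from the question and the previous steps.
2. All calculations must be accurate.
3. The final answer must be consistent with the reasoning.

Question: {question}
Solution: {chain}
"""

def evaluate_chain(question: str, chain: str) -> dict:
    """Ask GPT-4 to judge a reasoning chain and return a verdict with an explanation."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": EVALUATION_PROMPT.format(question=question, chain=chain)}],
    )
    explanation = response.choices[0].message.content
    return {"correct": "INCORRECT" not in explanation.upper(),
            "explanation": explanation}
```

The main way AutoRace differs from a plain LLM judge like this sketch is that its evaluation criteria are derived automatically rather than written by hand, which is what removes the human effort.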

Try AutoRace in OpenAI GPTs.

Note: AutoRace is an automatic evaluation method based on GPT-4, so its evaluation results are not guaranteed to be perfectly accurate.

Tasks are grouped by category: Math (GSM8k*, AQuA*, Game24), Logical (PrOntoQA), Commonsense (StrategyQA*), and Embodied (Blocksworld). The metric is AutoRace for GSM8k, AQuA, and StrategyQA, and an oracle evaluator for Game24, PrOntoQA, and Blocksworld.

| Model | Average | GSM8k* | AQuA* | Game24 | PrOntoQA | StrategyQA* | Blocksworld |
|---|---|---|---|---|---|---|---|
| GPT-4 turbo | 0.61 | 0.86 | 0.59 | 0.09 | 0.75 | 0.91 | 0.45 |
| Claude-3 Opus | 0.60 | 0.90 | 0.57 | 0.07 | 0.88 | 0.78 | 0.41 |
| Gemini Pro | 0.36 | 0.67 | 0.28 | 0.08 | 0.52 | 0.46 | 0.15 |
| InternLM-2 7B | 0.28 | 0.61 | 0.17 | 0.03 | 0.45 | 0.31 | 0.10 |
| Mixtral 8x7B | 0.27 | 0.49 | 0.19 | 0.04 | 0.44 | 0.32 | 0.11 |
| Mistral 7B | 0.26 | 0.38 | 0.41 | 0.02 | 0.40 | 0.28 | 0.09 |
| Llama-2 70B | 0.24 | 0.37 | 0.09 | 0.04 | 0.58 | 0.34 | 0.05 |
| Gemma 7B | 0.23 | 0.48 | 0.16 | 0.02 | 0.34 | 0.30 | 0.10 |
| Qwen-1.5 7B | 0.22 | 0.53 | 0.17 | 0.05 | 0.21 | 0.33 | 0.06 |
| Llama-2 13B | 0.18 | 0.24 | 0.06 | 0.04 | 0.42 | 0.28 | 0.05 |
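
The Average column is consistent with the unweighted mean of the six task scores, which can be verified directly from the table (the variable below simply holds the GPT-4 turbo row):

```python
# Quick check: Average appears to be the unweighted mean of the six task scores.
gpt4_turbo = [0.86, 0.59, 0.09, 0.75, 0.91, 0.45]   # GSM8k .. Blocksworld
print(round(sum(gpt4_turbo) / len(gpt4_turbo), 2))  # 0.61, matching the table
```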

To evaluate the reasoning chains, we apply AutoRace to the open-domain tasks, including GSM8k, AQuA, and StrategyQA. For the other, closed-domain tasks, we test the reasoning chains with oracle evaluators (rule-based programs). By clicking the "show accuracy" button, you can see the final-answer accuracy on some tasks for reference.
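
To illustrate what such an oracle evaluator can look like, below is a minimal rule-based checker for Game24-style reasoning chains. The "a op b = c" step format, the puzzle in the usage example, and all function names are assumptions made for this sketch; the oracle programs used by the leaderboard may parse different output formats.

```python
# Minimal sketch of a rule-based (oracle) checker for Game24 reasoning chains.
# The "a op b = c" step format and all names here are illustrative assumptions.
import re

STEP_RE = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def _take(pool: list, value: float) -> bool:
    """Remove one occurrence of `value` from the pool of available numbers."""
    for i, x in enumerate(pool):
        if abs(x - value) < 1e-6:
            pool.pop(i)
            return True
    return False

def check_game24_chain(numbers: list, steps: list) -> bool:
    """Return True iff every step is a valid operation on the remaining
    numbers and exactly one number, 24, is left at the end."""
    pool = list(numbers)
    for step in steps:
        m = STEP_RE.search(step)
        if m is None:
            return False                      # unparsable step
        a, op, b, c = float(m[1]), m[2], float(m[3]), float(m[4])
        if op == "/" and abs(b) < 1e-9:
            return False                      # division by zero
        if not (_take(pool, a) and _take(pool, b)):
            return False                      # operand not available
        if abs(OPS[op](a, b) - c) > 1e-6:
            return False                      # arithmetic error
        pool.append(c)                        # the result becomes available
    return len(pool) == 1 and abs(pool[0] - 24) < 1e-6

# A correct chain for the puzzle (4, 6, 8, 8): 6 * 8 / (8 / 4) = 24
print(check_game24_chain([4, 6, 8, 8],
                         ["6 * 8 = 48", "8 / 4 = 2", "48 / 2 = 24"]))  # True
```

A checker like this is deterministic and exact for closed-domain tasks, which is why oracle evaluation is used there instead of AutoRace.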