Research on LLM reasoning mostly relies on final-answer accuracy as a proxy metric for the quality of the reasoning process. However...
- A correct final answer does not always indicate a correct reasoning chain; such cases are referred to as "false positives".
- LLMs may hallucinate when generating reasoning chains, which causes problems in many scenarios, such as using LLMs as tutors.
- Existing reasoning-chain evaluation methods typically require massive training data, or lack evaluation accuracy and interpretability.
Try AutoRace in OpenAI GPTs.
Note: AutoRace is an automatic evaluation method based on GPT-4. It does not guarantee absolute accuracy in the evaluation results.
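As a rough illustration of how such an LLM-based evaluator can be wired up, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and the `evaluate_chain` helper are illustrative assumptions, not the actual AutoRace prompts or criteria.

```python
# A minimal sketch of LLM-based reasoning-chain evaluation, in the spirit of
# AutoRace. The prompt below is an illustrative placeholder, not the actual
# AutoRace prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate_chain(question: str, reasoning_chain: str) -> bool:
    """Ask GPT-4 to judge a reasoning chain step by step.

    Returns True if the judge finds no error in the chain.
    """
    prompt = (
        "Below is a question and a student's step-by-step solution.\n"
        "Check every step carefully and point out any error.\n\n"
        f"Question: {question}\n"
        f"Solution:\n{reasoning_chain}\n\n"
        "End your reply with 'INCORRECT' if any step is wrong, "
        "otherwise end with 'CORRECT'."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip()
    # "INCORRECT" also ends with "CORRECT", so check for it explicitly.
    return verdict.endswith("CORRECT") and not verdict.endswith("INCORRECT")
```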
| Model | Average | GSM8k* | AQuA* | Game24 | PrOntoQA | StrategyQA* | Blocksworld |
|---|---|---|---|---|---|---|---|
| Category | – | Math | Math | Math | Logical | Commonsense | Embodied |
| Metric | – | AutoRace | AutoRace | Oracle | Oracle | AutoRace | Oracle |
| GPT-4 Turbo | 0.61 | 0.86 | 0.59 | 0.09 | 0.75 | 0.91 | 0.45 |
| Claude-3 Opus | 0.60 | 0.90 | 0.57 | 0.07 | 0.88 | 0.78 | 0.41 |
| Gemini Pro | 0.36 | 0.67 | 0.28 | 0.08 | 0.52 | 0.46 | 0.15 |
| InternLM-2 7B | 0.28 | 0.61 | 0.17 | 0.03 | 0.45 | 0.31 | 0.10 |
| Mixtral 8x7B | 0.27 | 0.49 | 0.19 | 0.04 | 0.44 | 0.32 | 0.11 |
| Mistral 7B | 0.26 | 0.38 | 0.41 | 0.02 | 0.40 | 0.28 | 0.09 |
| Llama-2 70B | 0.24 | 0.37 | 0.09 | 0.04 | 0.58 | 0.34 | 0.05 |
| Gemma 7B | 0.23 | 0.48 | 0.16 | 0.02 | 0.34 | 0.30 | 0.10 |
| Qwen-1.5 7B | 0.22 | 0.53 | 0.17 | 0.05 | 0.21 | 0.33 | 0.06 |
| Llama-2 13B | 0.18 | 0.24 | 0.06 | 0.04 | 0.42 | 0.28 | 0.05 |
To evaluate the reasoning chains, we apply AutoRace to the open-domain tasks, including GSM8k, AQuA, and StrategyQA (marked with * above). For the remaining closed-domain tasks, we test the reasoning chains with oracle evaluators (rule-based programs), as sketched below. Final-answer accuracy on some tasks is also reported for reference.
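Here is a minimal sketch of what a rule-based oracle evaluator can look like, using Game24 as an example. The `check_game24` helper and the assumed output format (a final line such as `(10 - 4) * (8 - 4) = 24`) are illustrative assumptions, not the benchmark's official checker.

```python
# A minimal sketch of a rule-based "oracle" evaluator for Game24, one of the
# closed-domain tasks above. It assumes the chain's final line is an
# arithmetic expression such as "(10 - 4) * (8 - 4) = 24".
import re


def check_game24(chain: str, numbers: list[int]) -> bool:
    """Return True if the chain's final expression uses each given number
    exactly once and evaluates to 24."""
    # Take the left-hand side of the final line's equation.
    expr = chain.strip().splitlines()[-1].split("=")[0]
    # The expression must use exactly the four given numbers.
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return False
    # Only allow digits, the four arithmetic operators, parentheses, spaces.
    if not re.fullmatch(r"[\d+\-*/(). ]+", expr):
        return False
    try:
        return abs(eval(expr) - 24) < 1e-6
    except (SyntaxError, ZeroDivisionError):
        return False
```

For example, `check_game24("(10 - 4) * (8 - 4) = 24", [4, 4, 8, 10])` returns `True`, while a chain whose final expression reuses or drops one of the given numbers is rejected before evaluation.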