Long-horizon Reasoning and Chain-of-Thought Evaluation refers to methods used to assess how effectively large language models (LLMs) can engage in complex, multi-step reasoning over extended contexts. These evaluations test a model’s ability to logically connect ideas across several steps, maintain coherence, and reach correct conclusions. By analyzing the model’s chain-of-thought, researchers can identify strengths and weaknesses in its reasoning processes, guiding improvements in LLM performance and reliability.
What is long-horizon reasoning?
Reasoning that requires planning and maintaining information across many steps or over extended timeframes, often involving multi-step deductions and goal-directed strategies.
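To make "multi-step" concrete, the toy Python sketch below answers a two-hop question by carrying an intermediate result from one lookup into the next. The fact table and figures are purely illustrative; real long-horizon tasks chain far more of these dependencies over much longer contexts.

```python
# Toy two-hop question: "What is the population of the capital of France?"
# Answering requires resolving an intermediate entity (the capital) and
# carrying it into a second lookup. Facts and figures are illustrative.
facts = {
    "capital_of": {"France": "Paris"},
    "population_of": {"Paris": 2_100_000},
}

city = facts["capital_of"]["France"]   # step 1: intermediate deduction
answer = facts["population_of"][city]  # step 2: depends on step 1's result
print(answer)  # 2100000
```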
What is chain-of-thought evaluation?
Assessing the quality of the reasoning steps a model provides to reach a solution, focusing on coherence, relevance, and faithfulness to evidence—not just the final answer.
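As a rough illustration of step-level evaluation, the sketch below scores each reasoning step by word overlap with supporting evidence. The overlap heuristic is a crude stand-in for what, in practice, is usually a human rater or an LLM judge; all names and data here are hypothetical.

```python
# Minimal sketch of step-level chain-of-thought scoring. The word-overlap
# heuristic is only a stand-in for a human rater or LLM judge.

def step_supported(step: str, evidence: list[str]) -> bool:
    """Crude faithfulness check: does any evidence snippet share
    at least two words with the step?"""
    step_words = set(step.lower().split())
    return any(len(step_words & set(e.lower().split())) >= 2 for e in evidence)

def score_chain(steps: list[str], evidence: list[str]) -> float:
    """Fraction of reasoning steps supported by the evidence."""
    if not steps:
        return 0.0
    return sum(step_supported(s, evidence) for s in steps) / len(steps)

chain = [
    "The order shipped on March 3 and delivery takes 5 days.",
    "So it should arrive around March 8.",
]
docs = ["Order #112 shipped on March 3.", "Standard delivery takes 5 days."]
print(score_chain(chain, docs))  # 0.5: the second step has no lexical support
```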
How does chain-of-thought prompting differ from standard prompting?
Chain-of-thought prompting asks the model to write out its step-by-step reasoning before answering, while standard prompting requests only the final answer. Eliciting the steps can improve performance on complex tasks, but the visible reasoning may be unfaithful or contain inaccurate steps.
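A minimal sketch of the two prompt styles follows; `query_model` is a hypothetical placeholder for whatever LLM API is in use, not a real library call.

```python
# Illustrative prompts only; query_model is a hypothetical placeholder.

question = "A train leaves at 9:00 and the trip takes 2.5 hours. When does it arrive?"

standard_prompt = f"Q: {question}\nA:"                       # final answer only
cot_prompt = f"Q: {question}\nA: Let's think step by step."  # elicit reasoning

# answer = query_model(standard_prompt)  # expected: "11:30"
# chain = query_model(cot_prompt)        # expected: visible steps, then "11:30"
print(standard_prompt)
print(cot_prompt)
```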
What criteria are used to evaluate long-horizon reasoning?
Final answer accuracy; coherence and relevance of the reasoning steps; faithfulness to evidence; consistency across steps; and avoidance of propagating errors.
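One way to operationalize these criteria is a per-chain record like the sketch below; the aggregation formula is illustrative only, since real evaluations weight and combine criteria in task-specific ways.

```python
from dataclasses import dataclass

@dataclass
class ChainEvaluation:
    """One record per evaluated chain, mirroring the criteria above."""
    final_answer_correct: bool   # final answer accuracy
    coherence: float             # 0-1: steps follow logically from one another
    faithfulness: float          # 0-1: steps grounded in the evidence
    consistency: float           # 0-1: no step contradicts an earlier one
    error_propagated: bool       # did an early mistake corrupt later steps?

    def summary(self) -> float:
        """Illustrative aggregate; real evaluations weight criteria differently."""
        process = (self.coherence + self.faithfulness + self.consistency) / 3
        penalty = 0.5 if self.error_propagated else 1.0
        return (0.5 * float(self.final_answer_correct) + 0.5 * process) * penalty

record = ChainEvaluation(True, 0.9, 0.8, 1.0, False)
print(round(record.summary(), 2))  # 0.95
```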
What are common challenges in evaluating long-horizon reasoning?
Verifying multi-step reasoning at scale is difficult; human raters may introduce bias; the visible steps may not reflect how the model actually reached its answer; and it can be hard to ensure that evaluation metrics match real-world task goals.
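The misleading-steps problem is easy to demonstrate: in the toy chain below, an arithmetic slip plus a compensating error still yield the gold answer, so answer-only grading passes while a step-level check fails. All values are illustrative.

```python
# A chain that reaches the gold answer through a wrong intermediate step
# plus a compensating error; values are illustrative.
steps = [
    {"claim": "21 * 2 = 44", "correct": 21 * 2 == 44},  # False: arithmetic slip
    {"claim": "44 - 2 = 42", "correct": 44 - 2 == 42},  # True, but built on the slip
]
final_answer, gold = 42, 42

print(final_answer == gold)           # True: answer-only grading passes
print([s["correct"] for s in steps])  # [False, True]: step-level check catches it
```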