"Program Synthesis Benchmarks Beyond Pass@k (LLM Evaluations)" refers to evaluating large language models (LLMs) on code generation with more than the traditional "Pass@k" metric, which measures whether at least one of k generated solutions is correct. The approach calls for broader benchmarks that also assess code quality, robustness, efficiency, and other dimensions, giving a fuller picture of a model's program synthesis capabilities.
What is Pass@k in program synthesis?
Pass@k is the probability that at least one of k generated programs passes all of the unit tests provided for a task. It measures success when multiple candidate solutions are sampled for the same problem.
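In practice Pass@k is usually estimated from n >= k samples per task rather than from exactly k. A minimal sketch of the commonly used unbiased estimator, 1 - C(n - c, k) / C(n, k), where c is the number of samples that pass; the function name pass_at_k and its argument names are illustrative choices, not part of any particular benchmark's API:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task from n samples, c of which passed all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct, k = 2 gives 1 - C(3,2)/C(4,2) = 0.5
print(pass_at_k(n=4, c=1, k=2))
```

The benchmark-level score is then typically the mean of this value over all tasks.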
Why should benchmarks go beyond Pass@k?
Pass@k captures test-passing success but depends on test design and ignores generalization, code quality, and efficiency. Benchmarks should address these aspects for a fuller evaluation.
What other metrics and benchmarks complement Pass@k in program synthesis?
Complementary measures include functional correctness across diverse test suites, robustness to partial or noisy specifications, generalization to unseen tasks, runtime and resource usage, code quality and readability, and reproducibility across random seeds. Some evaluations also use formal verification or equivalence checks.
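As a rough illustration of measuring more than one facet at once, the sketch below scores a single candidate on a base test suite, on an independent extra suite, and on wall-clock time. The names pass_rate, evaluate, and abs_buggy are hypothetical, and a real harness would run untrusted generated code in a sandbox with timeouts rather than calling it directly:

```python
import time
from typing import Callable, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def pass_rate(candidate: Callable, tests: Sequence[TestCase]) -> float:
    """Fraction of test cases for which the candidate returns the expected value."""
    if not tests:
        raise ValueError("test suite must not be empty")
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def evaluate(candidate: Callable,
             base_tests: Sequence[TestCase],
             extra_tests: Sequence[TestCase]) -> dict:
    """Report correctness on two independent suites plus total elapsed time."""
    start = time.perf_counter()
    base = pass_rate(candidate, base_tests)
    extra = pass_rate(candidate, extra_tests)
    elapsed = time.perf_counter() - start
    return {"base_pass_rate": base, "extra_pass_rate": extra, "seconds": elapsed}


def abs_buggy(x):
    # Hypothetical generated solution: wrong for negative inputs.
    return x


report = evaluate(
    abs_buggy,
    base_tests=[((3,), 3), ((0,), 0)],    # only non-negative inputs
    extra_tests=[((-2,), 2), ((5,), 5)],  # adds a negative input
)
print(report)
```

Here the candidate passes every base test but only half of the extra suite, which is the kind of gap a single test-passing score hides.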
How can you design and interpret benchmarks beyond Pass@k?
Define a representative task distribution, create varied, independent test suites, measure multiple facets (correctness, efficiency, robustness, quality), specify a reproducible protocol (seeds, time budget, k values), and report results with analyses of failures and uncertainty.
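One way to report uncertainty, sketched here under the assumption that each task already has a per-task score in [0, 1] (for example a per-task Pass@k estimate), is a percentile bootstrap over tasks with a fixed seed so the interval is reproducible. The function name bootstrap_ci, its defaults, and the example scores are illustrative, not measured results:

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over tasks."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    n = len(scores)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(scores), lower, upper


# Made-up per-task scores, for illustration only.
example_scores = [0.0, 0.5, 1.0, 1.0, 0.25, 0.75, 0.0, 1.0]
point, lower, upper = bootstrap_ci(example_scores)
print(f"mean={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Reporting the interval alongside the seed, the number of samples per task, and the k values used makes it easier to tell a real difference between models from sampling noise.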