Every data selection method inherently has a target. In practice, these targets typically emerge implicitly through benchmark-driven iteration: researchers develop selection methods, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to the benchmarks, then trains a lightweight classifier to predict these scores for the full corpus.
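A minimal sketch of the pipeline described above, not the paper's implementation: the embedding model, the max-cosine-similarity scoring rule, the top-fraction positive labeling, and the hashed n-gram classifier are all illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression


def betr_rank(benchmark_examples, document_sample, full_corpus, top_fraction=0.1):
    # 1. Embed benchmark examples and a sample of pretraining documents in a shared space.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    bench_emb = embedder.encode(benchmark_examples, normalize_embeddings=True)
    sample_emb = embedder.encode(document_sample, normalize_embeddings=True)

    # 2. Score each sampled document by similarity to the benchmarks
    #    (here: maximum cosine similarity over all benchmark examples).
    scores = (sample_emb @ bench_emb.T).max(axis=1)

    # 3. Train a lightweight classifier on cheap hashed n-gram features so the
    #    full corpus never needs to be embedded; the top-scoring fraction of
    #    the sample serves as the positive class.
    featurizer = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
    labels = (scores >= np.quantile(scores, 1.0 - top_fraction)).astype(int)
    clf = LogisticRegression(max_iter=1000).fit(featurizer.transform(document_sample), labels)

    # 4. Predict benchmark-similarity scores for every document in the full corpus.
    return clf.predict_proba(featurizer.transform(full_corpus))[:, 1]
```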
We compare data selection methods by training over 500 models spanning 10¹⁹ to 10²² FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
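A hedged sketch of how a compute multiplier can be read off fitted scaling laws; the power-law form L(C) = E + A·C^(−α) and the initial-guess values are assumptions, not quantities taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit


def fit_scaling_law(flops, losses):
    # Fit loss as a function of training compute: L(C) = E + A * C**(-alpha).
    def law(c, E, A, alpha):
        return E + A * c ** (-alpha)

    (E, A, alpha), _ = curve_fit(law, flops, losses, p0=[1.5, 1e3, 0.1], maxfev=10000)
    return E, A, alpha


def compute_multiplier(params_baseline, params_method, target_loss):
    # Compute needed by each method to reach the same target loss, obtained by
    # inverting L(C) = E + A * C**(-alpha); target_loss must exceed the
    # irreducible term E for both fits.
    def compute_for(params):
        E, A, alpha = params
        return (A / (target_loss - E)) ** (1.0 / alpha)

    return compute_for(params_baseline) / compute_for(params_method)
```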
- † University of Washington
- ‡ Stanford
- § Anthropic
- ** Work performed while at Apple