January 16, 2025

Test-Time Scaling has emerged as an exciting area of research, with recent models like QwQ, o1, and DeepSeek-R1 demonstrating impressive reasoning capabilities. Test-Time Scaling can be approached through various methods, including Best-of-N sampling, Monte Carlo Tree Search, and reflective tuning. In this post, we share our preliminary findings from applying test-time scaling techniques to OLAF2-14B, our flagship model.
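To make the first of these methods concrete, here is a minimal sketch of Best-of-N sampling. The `generate` and `score` callables are hypothetical stand-ins (e.g. a sampler and a reward model); they are not part of any specific library or of our actual stack.

```python
# Minimal Best-of-N sketch: sample n candidate solutions and keep the one
# a scoring function (e.g. a reward model) prefers.
# `generate` and `score` are hypothetical stand-ins, not a real API.
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # draws one candidate completion
    score: Callable[[str, str], float],   # higher score = better candidate
    n: int = 16,
) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

The key design point is that compute scales linearly with n while latency stays roughly constant, since the n candidates can be sampled in parallel.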
Experimental Setup
The figure below shows DeepSeek-R1's reported results on test-time scaling, with the average number of thought tokens plotted on the X-axis.

In contrast, our experiments apply multiple scaling methods simultaneously, which complicates token counting: tokens produced by different methods are not directly comparable, as some are inherently more costly to produce than others. To address this, we use FLOPs (floating-point operations) as a more consistent metric. FLOPs are computed following the approach outlined in the Scaling Laws for Neural Language Models paper (Kaplan et al., 2020). Specifically, the compute for a single forward pass is approximated as:

C_{forward} ≈ 2N + 2 n_{layer} n_{ctx} d_{model}

Here:
N: Number of non-embedding model parameters
n_{layer}: Number of layers
d_{model}: Dimension of the residual stream
n_{ctx}: Number of tokens in the input context
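As a concrete illustration, here is a minimal Python sketch of this estimate. The helper names and the 14B-scale configuration values in the example are our own illustrative assumptions, not OLAF2-14B's actual architecture.

```python
# Sketch of the per-token forward-pass FLOPs estimate from Kaplan et al.
# (Scaling Laws for Neural Language Models). The configuration numbers
# below are illustrative placeholders, not OLAF2-14B's real specs.

def forward_flops_per_token(n_params: float, n_layer: int,
                            d_model: int, n_ctx: int) -> float:
    """C_forward ≈ 2N + 2 * n_layer * n_ctx * d_model (per token)."""
    return 2 * n_params + 2 * n_layer * n_ctx * d_model

def generation_flops(n_params: float, n_layer: int, d_model: int,
                     prompt_len: int, gen_len: int) -> float:
    """Total forward FLOPs for gen_len decoded tokens, with the
    context growing by one token at each decoding step."""
    return sum(
        forward_flops_per_token(n_params, n_layer, d_model, prompt_len + t)
        for t in range(gen_len)
    )

if __name__ == "__main__":
    # Assumed 14B-scale settings for illustration only:
    flops = generation_flops(n_params=14e9, n_layer=40, d_model=5120,
                             prompt_len=512, gen_len=1024)
    print(f"{flops:.3e} FLOPs")  # roughly 2.9e13 for these settings
```

Note that the 2N term dominates at these context lengths, which is why the common shorthand of ~2N FLOPs per forward token is usually a good approximation.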
We benchmark our methods on the GSM8K and Omni-MATH subsets of HRM8K. While it would have been ideal to include more subsets and benchmarks, compute constraints limit us to these two. This selection is motivated by two key reasons:
Diversity in Difficulty: GSM8K consists of relatively easy, grade-school-level math word problems, while Omni-MATH includes highly challenging, olympiad-level problems.
Simplified Evaluation: Both subsets have been pre-filtered to include only questions with digit-based answers, which simplifies the evaluation process (a sketch of this check follows below).
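The sketch below shows why digit-only answers make scoring simple: extract the last number in the model's output and compare it numerically to the gold answer. The regex heuristic is our own assumption, not the paper's exact pipeline.

```python
# Minimal scoring sketch for digit-based answers: pull the last number out
# of the model's output and compare it to the gold answer. The extraction
# heuristic here is an assumption, not the exact evaluation code.
import re

def extract_last_number(text: str) -> str | None:
    """Return the last integer/decimal in the text, ignoring thousands commas."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def is_correct(model_output: str, gold_answer: str) -> bool:
    pred = extract_last_number(model_output)
    return pred is not None and float(pred) == float(gold_answer)

print(is_correct("... so the final answer is 1,024.", "1024"))  # True
```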
For more details about the benchmark, please refer to our paper.
Evaluation Results
To our surprise, increasing test-time compute significantly enhances the performance of OLAF2-14B. The efficiency of this scaling, however, depends heavily on how the compute is spent: some methods are far more effective than others. At the largest compute budgets, OLAF2-14B surpasses GPT-4o on both benchmarks.

