Jan 13, 2025 · 4 min read

Introducing OLAF2

Today, we are excited to announce the release of OLAF2, a Korean language model built on Qwen2.5. OLAF2 is available in two variants, 14B and 1.5B parameters. Both feature a specialized reasoning mode designed to tackle advanced math and STEM questions with precision, and both support a maximum context length of 32K tokens, making them well suited for Retrieval-Augmented Generation (RAG) and tool-based applications. The training process incorporates iterative data generation and a strong focus on safety and refusal mechanisms, reducing hallucination and enhancing reliability.
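
If the checkpoints are published on the Hugging Face Hub, loading them should follow the standard recipe for Qwen2.5-based chat models. Below is a minimal sketch; the repository id `onelineai/OLAF2-14B` is a placeholder, not a confirmed release name:

```python
# Minimal loading sketch for a Qwen2.5-based chat model.
# NOTE: the repo id below is hypothetical -- check the official release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "onelineai/OLAF2-14B"  # hypothetical; the 1.5B variant loads the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]  # "What is the capital of Korea?"
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
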
Reasoning Mode

Motivated by recent advancements in controllable inference-time scaling, OLAF2 is designed to generate longer and more detailed reasoning chains when operating in reasoning mode. By focusing on scaling test-time compute rather than train-time compute, we enable smaller models to perform at their full potential. This approach is particularly advantageous in environments with severe hardware constraints, where deploying large-scale models may be impractical. For further details, refer to the "Scaling Test-Time Compute" section in our recent HRM8K blog (https://www.onelineai.com/en/blog/hrm8k).
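
The post does not specify how reasoning mode is triggered, so the sketch below assumes a simple system-prompt toggle paired with a larger generation budget; the actual switch (a special token, a chat-template flag, etc.) may differ, so consult the model card. The point it illustrates is the test-time trade: standard mode keeps decoding short, while reasoning mode spends more tokens on an explicit chain of thought.

```python
# Hypothetical reasoning-mode toggle. The real mechanism is not documented
# here; this assumes a system-prompt instruction elicits longer chains.
def build_messages(question: str, reasoning: bool = False) -> list[dict]:
    messages = []
    if reasoning:
        # Assumed system instruction for eliciting detailed step-by-step reasoning.
        messages.append({
            "role": "system",
            "content": "Reason step by step in detail before giving the final answer.",
        })
    messages.append({"role": "user", "content": question})
    return messages

# Reasoning mode needs a larger decoding budget so the longer chain of
# thought is not truncated -- this is the test-time compute being scaled.
standard_kwargs = {"max_new_tokens": 512}
reasoning_kwargs = {"max_new_tokens": 2048}
```
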

OLAF2's reasoning mode has proven highly effective, lifting its HRM8K score from 43.8 to 45.8. This capability lets smaller models approach the reasoning performance of larger ones, making high-quality language modeling accessible even on modest in-house servers and giving users robust reasoning capabilities in resource-constrained settings.
Evaluation

We evaluate our models in three categories: Reasoning, Knowledge, and Korean Fluency. A minimal scoring sketch follows the benchmark descriptions below.
Reasoning — HRM8K
→ HRM8K is a bilingual benchmark comprising 8,011 parallel mathematics problems in Korean and English. The problems are drawn from diverse sources, including English mathematics benchmarks and Korean mathematics competitions, ensuring robust evaluation of problem-solving in both languages. By alternating the input and reasoning languages, HRM8K assesses a model's multilingual mathematical reasoning capabilities.
Knowledge — KMMLU
→ KMMLU is a comprehensive benchmark consisting of 35,030 expert-level multiple-choice questions across 45 subjects, designed to evaluate language model proficiency in Korean. The benchmark sources its content from original Korean exams, reflecting authentic linguistic and cultural nuances, unlike previous benchmarks based on translated datasets. It assesses models on domain-specific reasoning and general knowledge, with an emphasis on localized knowledge and Korean cultural contexts.
Fluency — LogicKor
→ LogicKor is a Korean multi-domain reasoning benchmark designed to measure a model’s thinking ability in six key areas: reasoning, math, writing, coding, comprehension, and grammar. It uses an LLM-as-a-judge approach to evaluate performance on 42 multi-turn prompts covering diverse tasks in each of these categories. Through this design, LogicKor provides a comprehensive assessment of a model’s capacity to handle complex and varied challenges in Korean.
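
As a rough illustration of how a score like HRM8K accuracy might be computed, here is a minimal exact-match loop. The dataset id, subset name, and field names are assumptions about the published schema, and the naive final-number extraction is a stand-in for the official scoring scripts:

```python
# Sketch of an HRM8K-style exact-match evaluation. Dataset id, subset, and
# field names ("question", "answer") are assumptions -- adjust to the
# published schema, and prefer the official scoring scripts.
import re
from datasets import load_dataset

ds = load_dataset("HAERAE-HUB/HRM8K", "GSM8K", split="test")

def final_number(text: str) -> str:
    """Naively take the last number in a completion as the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else ""

def generate(question: str) -> str:
    # Reuses the tokenizer/model from the loading sketch above; swapping in
    # build_messages(question, reasoning=True) would score reasoning mode.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=1024)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

correct = sum(final_number(generate(r["question"])) == str(r["answer"]) for r in ds)
print(f"exact-match accuracy: {correct / len(ds):.3f}")
```
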

* denotes scores taken from the EXAONE 3.5 Technical Report
** denotes scores imported from the official LogicKor leaderboard
All remaining scores were reproduced using each benchmark's official implementation
Key Takeaways
OLAF2

Standard Mode:
In its standard mode, OLAF2 (14B parameters) delivers a competitive HRM8K reasoning score of 43.8, outperforming EXAONE-3.5-32B-Instruct (41.4), a model more than twice its size. This underscores OLAF2's strong reasoning capabilities even without its dedicated reasoning mode. On KMMLU (Knowledge), OLAF2 achieves 54.21, ranking just below the much larger Llama-3.1-70B-Instruct (60.83) but well ahead of EXAONE-3.5-32B-Instruct (47.63). On LogicKor (Fluency), OLAF2 scores 8.51, placing it among the top performers, second only to EXAONE-3.5-32B-Instruct (9.06). This highlights OLAF2's remarkable ability to generate fluent, coherent Korean despite being significantly smaller than its counterparts.

Reasoning Mode:
In reasoning mode, OLAF2 achieves the highest HRM8K score of all evaluated models at 45.8, outperforming even larger models such as Llama-3.1-70B-Instruct (45.6) and Qwen2.5-32B-Instruct (44.4). This demonstrates the model's exceptional ability to handle complex reasoning tasks.
OLAF2-Mini

Standard Mode:
In its standard mode, OLAF2-Mini achieves an HRM8K reasoning score of 35.9, impressive given its small size and resource efficiency. On KMMLU (Knowledge), it scores 44.77, outperforming other small models such as EXAONE-3.5-2.4B-Instruct (42.39) and showcasing strong, efficient knowledge representation. On LogicKor (Fluency), OLAF2-Mini scores 7.4, delivering reasonable fluency for a mini-model, though it trails the top-performing larger models.

Reasoning Mode:
Reasoning mode is equally applicable to smaller models: OLAF2-Mini reaches a strong HRM8K score of 38.0 in reasoning mode, up from 35.9 in standard mode, further demonstrating the benefits of test-time compute scaling.