2024. 12. 5.
We are excited to introduce HRM8K (HAE-RAE Math 8K), the first publicly available benchmark for mathematical reasoning in Korean. HRM8K comprises 8,011 instances, sourced through a combination of translations from established English benchmarks (e.g., GSM8K, MATH, NuminaMath, MMMLU) and original problems curated from existing Korean math exams by our team.
Benchmark Details
The HRM8K benchmark consists of two subsets.
Korean School Math (KSM): This subset includes 1,428 challenging mathematical problems from Korean examinations and competitions. Each problem is manually captured as a screenshot by a human labeler, transcribed via OCR using the GPT-4o API, and cross-checked by two labelers.
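For illustration, the OCR step could be implemented along the lines of the sketch below, using the official openai Python client; the prompt wording and settings here are our assumptions, not the exact pipeline used for KSM.

```python
# A minimal sketch of the OCR step, assuming the official openai client.
# The instruction text is illustrative, not the exact prompt used for KSM.
import base64

from openai import OpenAI

client = OpenAI()

def ocr_screenshot(image_path: str) -> str:
    """Transcribe a problem screenshot into text, keeping math as LaTeX."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this Korean math problem exactly as written. "
                         "Use LaTeX for all mathematical expressions."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```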
Prior Sets: This subset comprises 6,583 math problems translated from existing English benchmarks: GSM8K, MATH, NuminaMath, and MMLU. For GSM8K, MATH, and NuminaMath, translations are produced with the GPT-4o API and followed by a human quality check. For MMLU, we use the human-translated version provided by OpenAI (MMMLU).
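The translation step admits a similarly small sketch; again, the system prompt below is an illustrative guess rather than the exact instruction we used.

```python
# A minimal sketch of the translation step, assuming the openai client.
# The system prompt is illustrative; the actual prompt was not released.
from openai import OpenAI

client = OpenAI()

def translate_to_korean(problem: str) -> str:
    """Translate an English math problem into Korean with GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Translate the following math problem into Korean. "
                        "Keep all numbers, variables, and LaTeX expressions unchanged."},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content
```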
Evaluation Settings
LLMs are prompted in the following manner for our experiments. In cases where a system role is not available, the system_message is included in the user query.
We include the instruction “Respond in Korean” in our prompt, as the ability of interest is each model's capability to solve and explain math questions in Korean. Moreover, while a low temperature is ideal for pass@1 settings, we observe that at low temperatures multiple LLMs tend to reason in their preferred language (e.g., English or Chinese). Accordingly, we use the following sampling parameters as our default setting unless otherwise specified.
temperature = 0.7
top_p = 0.95
min_tokens = 8
max_tokens = 2048
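As a concrete illustration, the sketch below wires these parameters and the system-message fallback together with vLLM; the system message wording is a placeholder built around the “Respond in Korean” instruction, not our verbatim prompt.

```python
# A sketch of the evaluation setup using vLLM. The system message is a
# paraphrase around "Respond in Korean"; treat it as illustrative.
from vllm import LLM, SamplingParams

SYSTEM_MESSAGE = "You are a helpful assistant. Respond in Korean."

sampling_params = SamplingParams(
    temperature=0.7,   # low temperature caused reasoning drift into English/Chinese
    top_p=0.95,
    min_tokens=8,
    max_tokens=2048,
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

def build_messages(question: str, has_system_role: bool = True) -> list[dict]:
    """Fold the system message into the user query for chat templates
    that do not support a system role."""
    if has_system_role:
        return [{"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": question}]
    return [{"role": "user", "content": f"{SYSTEM_MESSAGE}\n\n{question}"}]

question = "2x + 3 = 11을 만족하는 x를 구하시오."  # sample HRM8K-style question
outputs = llm.chat(build_messages(question), sampling_params)
print(outputs[0].outputs[0].text)
```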
Performance Report
Here, we report the performance of existing LLMs on HRM8K.
To our surprise, we observe that Qwen2.5-72B-Instruct performs on par with GPT-4o, with GPT-4o-Mini and Llama-3.1-70B-Instruct following the two. Llama-3.1-405B-Instruct is excluded due to hardware constraints.
Frontier Models (Proprietary / Large & Open)
We also present the evaluation results for smaller LLMs (<20B), including several models developed by Korean companies. Below are key details about the models:
VARCO-8B-Instruct: A fine-tuned version of Llama-3.1-8B.
Solar Pro (preview) Instruct: A derivative of Phi-3-Medium, scaled through model merging and further trained. Please note that this version of the model does not officially support Korean.
Exaone-3.0-7.8B-Instruct: A model trained from scratch on 8T tokens, with a mix of Korean and English data.
In terms of performance, Qwen2.5-14B-Instruct leads with a score of 43.5, followed by Llama-3.1-8B-Instruct (39.4), Qwen2.5-7B-Instruct (37.0), and Exaone-3.0-7.8B-Instruct (36.0). Notably, despite not being specifically pre-trained on Korean data, Qwen2.5 and Llama-3.1 demonstrate competitive performance.
Open LLMs (<20B)
We further compare the performance of Qwen2.5-7B, Llama-3.1-8B, and Exaone-3.0-7.8B relative to their computational budgets during pretraining. Following Kaplan et al. (2020), we estimate pretraining compute as 6 × #parameters × #tokens. For brevity, we report compute in ExaFLOPs, where one ExaFLOP corresponds to 10^{18} floating-point operations (FLOPs); the estimate is worked out in the short sketch below.
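For concreteness, here is the estimate in a few lines of Python. The 8T-token figure for Exaone is stated above; the Llama-3.1 and Qwen2.5 token counts are the publicly reported figures and should be read as approximations.

```python
# The compute estimate C = 6 * N * D (Kaplan et al., 2020), worked out in
# ExaFLOPs (1 ExaFLOP = 1e18 FLOPs). Token counts for Llama-3.1 and
# Qwen2.5 are publicly reported figures, used here as approximations.
def pretraining_exaflops(n_params: float, n_tokens: float) -> float:
    """Estimate pretraining compute as 6 * N * D, in ExaFLOPs."""
    return 6 * n_params * n_tokens / 1e18

models = {
    "Exaone-3.0-7.8B": (7.8e9, 8e12),   # 8T tokens (stated above)
    "Llama-3.1-8B":    (8e9, 15e12),    # ~15T tokens (reported by Meta)
    "Qwen2.5-7B":      (7e9, 18e12),    # ~18T tokens (reported by Qwen)
}
for name, (n, d) in models.items():
    print(f"{name}: {pretraining_exaflops(n, d):,.0f} ExaFLOPs")
# Exaone-3.0-7.8B: 374,400 ExaFLOPs
# Llama-3.1-8B:    720,000 ExaFLOPs
# Qwen2.5-7B:      756,000 ExaFLOPs
```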
The table below shows that Exaone-3.0-7.8B-Instruct is trained with a considerably smaller compute budget than its counterparts. Notably, each ExaFLOP of pretraining compute contributes 0.00058 points to its HRM8K score, making it more compute-efficient than Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We hypothesize that this is due to its focus on Korean data during pretraining data curation.
If you are interested in the performance of more models, we operate the HRM8K leaderboard here.
Scaling Test-Time Compute
Considering the recent hype around scaling test-time compute, we detail relevant experiments in this section. A common approach is training a dedicated LLM for Long-CoT (Chain-of-Thought) reasoning; a notable example is QwQ-preview by the Qwen Team. Interestingly, as shown in the table below, QwQ-preview underperforms Qwen2.5-32B-Instruct on HRM8K. While little is known about QwQ-preview's training recipe, our educated guess is that it is primarily trained on English or Chinese datasets, making it less effective for Korean questions. Indeed, we observe the model responding in either English or Chinese on most instances, even when prompted to explain in Korean.
To mitigate this issue, we train two Long-CoT models of our own: OLV-0.1 and OLV-0.2.
For OLV-0.1, we train Qwen2.5-7B to generate a two-step CoT. In the first step, the model takes time to understand the question and plan a solution; in the second, it generates the solution.
OLV-0.2 is the highlight. We train it to generate a three-step CoT: to the steps above, we add a final step in which the model revisits its response to check whether it has made any mistakes. To our surprise, the model successfully learns to fix its own errors.
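To make the structure concrete, a trace format along the lines of the sketch below could represent the three steps; the section markers and the parsing helper are simplified placeholders of our own, not the exact training format.

```python
# A simplified illustration of a three-step OLV-0.2-style trace. The
# section markers are hypothetical placeholders, not the actual format.
OLV_TRACE_TEMPLATE = (
    "<plan>\n{plan}\n</plan>\n"              # step 1: understand and plan
    "<solution>\n{solution}\n</solution>\n"  # step 2: carry out the plan
    "<review>\n{review}\n</review>\n"        # step 3: re-check and fix mistakes
    "Final answer: {answer}"
)

def extract_final_answer(trace: str) -> str:
    """Take the value after the last 'Final answer:' marker, so that a
    correction made in the review step wins over the original solution."""
    return trace.rsplit("Final answer:", 1)[-1].strip()
```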
We detail the performance of Qwen2.5-7B-Instruct, OLV-0.1, and OLV-0.2 in the table below. The OLV models greatly outperform their baseline, and the added re-iteration step pushes performance further still.
Below, we provide examples of OLV-0.2 correcting itself. (Yes, these are cherry-picked examples, and OLV sometimes “fixes” responses that were already correct.)
Final Remarks
The OLV series will take some time to be made public, as we are still actively working toward longer and more dynamic reasoning chains; we simply wanted to share our initial results and give a glimpse of what we are working on. HRM8K, however, will be released soon (during December). If you would like your models evaluated before the release, please contact us at spthsrbwls123@yonsei.ac.kr.