2025. 1. 16.

We are sharing two new Korean reasoning benchmarks: HRM8K (HAE-RAE Math 8K) and HRMCR (HAE-RAE Multi-Step Commonsense Reasoning). These benchmarks are carefully designed to assess the Korean mathematical reasoning and commonsense reasoning capabilities of Large Language Models (LLMs). In this post, we first highlight the importance of evaluating the reasoning capabilities of Korean LLMs, explain each benchmark in detail, and share insights from the evaluation results.
The paper and benchmark datasets are publicly available below:
📝 Paper (HRM8K): https://arxiv.org/abs/2501.02448
📝 Paper (HRMCR): https://arxiv.org/abs/2501.05712
🤗 Dataset (HRM8K): https://huggingface.co/datasets/HAERAE-HUB/HRM8K
🤗 Dataset (HRMCR): https://huggingface.co/datasets/HAERAE-HUB/HRMCR
The Lack of Korean Reasoning Evaluation
The recent advancements in powerful Large Language Models (LLMs), such as OpenAI's o1 and Anthropic's Claude-3.5, are driving a significant evolution in the field. These developments are not confined to English but are also fostering the development of multilingual LLMs, including Korean ones. Consequently, the need for benchmarks that can systematically and fairly evaluate LLM performance across diverse linguistic environments, including Korean, has become increasingly critical.
To address this, various benchmarks have been developed specifically to assess LLMs' Korean language capabilities. The following table describes widely used benchmarks for evaluating LLMs' Korean capabilities across three primary categories: Knowledge, Fluency, and Reasoning.

Knowledge: Benchmarks designed to evaluate how well language models comprehend and process Korean language and cultural knowledge. These benchmarks feature question sets covering various aspects of Korean knowledge and culture, such as society, history, and law: CLIcK, HAE-RAE Bench, KMMLU, Ko-H5, etc.
Fluency: Benchmarks designed to assess a language model's Korean instruction-following ability and fluency. These benchmarks often employ the LLM-as-a-Judge approach, where large language models act as evaluators to measure the quality and accuracy of responses: KoMT-Bench, LogicKor, etc.
Reasoning: Benchmarks used to evaluate a language model's ability to reason in Korean. These tasks range from basic Korean knowledge to complex scenarios that require specialized knowledge and logical inference. However, these benchmarks are outdated or limited in domain coverage, making it challenging to evaluate a model's complex reasoning capabilities: KoBEST, KMMLU, Ko-H5, KoCommonGen v2, etc.
These benchmarks primarily target language understanding, general knowledge, and commonsense reasoning, making it difficult to assess reasoning capability, one of the most critical capabilities of language models. Furthermore, most existing reasoning benchmarks focus on commonsense reasoning, and their limited accessibility remains a significant obstacle to evaluation.
To address these challenges, we introduce HRM8K and HRMCR, two fully public Korean reasoning benchmarks focusing on mathematical reasoning and cultural commonsense reasoning, respectively.
HRM8K: Korean Mathematical Reasoning Benchmark
HRM8K is the first publicly available benchmark for mathematical reasoning in Korean. It comprises 8,011 instances sourced through a combination of translations from established English benchmarks (GSM8K, MATH, Omni-MATH, and MMMLU) and original problems curated from Korean math exams and competitions.
Benchmark Overview
The HRM8K benchmark consists of two subsets, and each subset is available in both Korean and English. To create a bilingual (English-Korean) dataset, we translated each instance in both subsets using GPT-4o, followed by human review of the translated samples to ensure quality:
Korean School Math (KSM): This subset comprises 1,428 challenging mathematical problems from Korean sources. We collect problems only from Olympiad- or competition-level exams, regardless of the target age group; consequently, even problems aimed at younger students require a certain level of reasoning ability to solve. We manually capture the problems from the following sources as screenshots, process them through OCR using the GPT-4o API, and validate the results.
Sources: KMO (한국수학올림피아드, Korean Mathematical Olympiad), KJMO (한국주니어수학올림피아드, Korean Junior Mathematical Olympiad), CSAT (대학수학능력시험, College Scholastic Ability Test), KMS (한국대학수학경시대회, Korean University Mathematics Competition), and TQ (교원임용경쟁시험, Teacher Employment Examination).
Prior Sets: This subset comprises 6,583 problems from existing English mathematics benchmarks (GSM8K, MATH, Omni-MATH, and MMMLU). For MATH and Omni-MATH, we retain only instances with numeric answers, excluding those whose final answers are text, equations, or proofs. From MMMLU, we select only the three math-related subsets: abstract_algebra, college_mathematics, and high_school_mathematics.
Experimental Setup
We evaluate models across three cross-lingual setups (input language → reasoning language) to analyze how performance varies with the input and reasoning language. The evaluated settings are Korean-to-Korean (K2K), Korean-to-English (K2E), and English-to-English (E2E). We exclude the English-to-Korean (E2K) scenario because models typically fail to maintain Korean reasoning when the input is given in English.
We experiment with six multilingual language models with decent Korean performance: three Qwen2.5 Instruct models (1.5B, 7B, and 72B parameters) and three Llama-3.1/3.2 Instruct models (1B, 8B, and 70B parameters). We set the sampling parameters to temperature=0.7 and top_p=0.95.
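As an illustration, the sketch below shows how the K2E setting (Korean input, English reasoning) might be run with one of the evaluated model families using the sampling parameters above. The instruction wording and the toy question are our own illustration, not the paper's exact prompt or an HRM8K item.

```python
# A sketch of the K2E setup (Korean input, English reasoning) with the sampling
# parameters stated above. The instruction wording and the toy question are our
# own illustration, not the paper's exact prompt or an HRM8K item.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

# Toy Korean word problem: "Cheolsu has 3 apples and Younghee has twice as many.
# How many apples do they have in total?"
question_ko = "철수는 사과를 3개 가지고 있고, 영희는 그보다 2배 많이 가지고 있습니다. 두 사람이 가진 사과는 모두 몇 개입니까?"

messages = [{
    "role": "user",
    "content": f"{question_ko}\n\nSolve the problem step by step in English and give the final numeric answer.",
}]

outputs = generator(messages, max_new_tokens=1024,
                    do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"][-1]["content"])  # the model's English reasoning
```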
Results & Findings

The table above shows the performance of the Qwen2.5 and Llama-3.1/3.2 models on the HRM8K benchmark depending on the input and reasoning language. We highlight three findings from these results:
Effect of input language: Switching from Korean input (K2E) to an entirely English setup (E2E) yields an average improvement of 11%. In particular, Qwen2.5-7B and Llama-3.1-8B drop by 10% and 13%, respectively, when forced to process Korean input. This underscores the significance of input language in model performance.
Effect of reasoning language: In contrast, comparing K2K to K2E shows an average difference of only 1%, suggesting that the language of the reasoning process has a relatively small impact once the model has already ingested Korean input. Simply allowing the model to produce its chain-of-thought in English does not fully recover performance lost from reading a Korean prompt.
Multilingual reasoning gap: Comparing problem-solving in Korean (K2K) with English (E2E), we observe an approximate 11% average improvement in E2E. This finding highlights the multilingual reasoning gap, discussed in various studies, which is also evident in the HRM8K dataset.
To recap, we identify a clear multilingual reasoning gap between English and Korean, and the model's ability to comprehend the problem appears to be a critical factor.
HRMCR: Korean Cultural Reasoning Benchmark
HRMCR is a benchmark consisting of cultural multi-step reasoning questions automatically generated using templates and algorithms. The questions require LLMs to recall diverse aspects of Korean culture and perform multiple reasoning steps to solve them.
Benchmark Overview
HRMCR comprises two subsets, each containing 50 systematically designed questions that require multi-step reasoning. The details and an example of each subset are as follows:
Date: Questions in this subset are composed of two sentences and require recalling knowledge about Korean holidays and date expressions. The model must perform simple arithmetic and convert between lunar and solar years using this knowledge. Each question requires a four-step solution.
Zodiac: This subset contains longer questions, each spanning 10 to 12 lines. To solve these, the model must recall concepts such as the Korean age and various age expressions used in Korean conversations, understand honorifics, and make logical inferences based on the provided premises. Additionally, it must perform arithmetic to deduce the zodiac sign from the age. The gold response consists of six steps, making this subset more challenging than the Date subset.

The above figure shows generated example questions (left) alongside their automatically generated solutions (right). The top panel represents the "Date" subset, while the bottom corresponds to the "Zodiac" subset. Questions are translated into Korean to enhance accessibility.

All questions in each subset are created by a dedicated algorithm with templates. Each algorithm natively includes a solution generator that solves the question step by step as it is generated and produces a gold solution. These characteristics enable quick and easy generation of new data compared to other private and public datasets.
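As a simplified illustration of this idea (not the authors' actual generator), the sketch below pairs a Zodiac-style question template with a solution generator that records the gold reasoning steps while filling in the template. The person's name and the number ranges are hypothetical; the Korean-age formula and the 12-year zodiac cycle are the only facts it relies on.

```python
# A simplified, hypothetical illustration of HRMCR-style template generation:
# each question is produced together with a step-by-step gold solution.
# This is NOT the authors' generator; it only mirrors the Zodiac subset's idea.
import random

# The 12 zodiac animals, indexed so that (year - 4) % 12 == 0 maps to Rat
# (e.g., 2020 was the year of the Rat).
ZODIAC = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
          "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]

def generate_zodiac_question(seed: int):
    rng = random.Random(seed)
    year = rng.randint(2000, 2025)      # reference year mentioned in the question
    korean_age = rng.randint(20, 60)    # traditional Korean age

    question = (f"In {year}, Minsu is {korean_age} years old in Korean age. "
                f"What is Minsu's zodiac sign?")

    # Solution generator: solve the question step by step as it is created.
    birth_year = year - korean_age + 1  # Korean age = year - birth year + 1
    animal = ZODIAC[(birth_year - 4) % 12]
    solution = [
        f"Step 1: Korean age counts a newborn as 1, so birth year = {year} - {korean_age} + 1 = {birth_year}.",
        f"Step 2: The zodiac repeats every 12 years; {birth_year} corresponds to the {animal}.",
        f"Answer: {animal}",
    ]
    return question, solution

q, sol = generate_zodiac_question(seed=42)
print(q)
print("\n".join(sol))
```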
Experimental Setup
We evaluate a total of 20 LLMs, including both proprietary (o1, GPT-4o, Claude-3.5) and open-source (Qwen2.5, Llama3, EXAONE-3.5, DeepSeek-V3) models. For the evaluation process, we employ GPT-4o as an LLM-as-a-Judge. The judge reviews each question alongside the model-generated response and the gold step-by-step solution. It first provides a brief comparison with the gold solution and then assesses whether the model's response is correct. For incorrect responses, the judge pinpoints the specific step where the error occurred. All evaluations are conducted using greedy decoding.
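Below is a minimal sketch of what such a judging call might look like with the OpenAI Python SDK. The judge prompt wording and the use of temperature=0 for the judge are our assumptions, not the paper's exact rubric.

```python
# A minimal sketch of an LLM-as-a-Judge call with GPT-4o. The judge prompt
# wording below is an assumption for illustration, not the paper's exact rubric.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, gold_solution: str, model_response: str) -> str:
    prompt = (
        "You are grading a model's answer to a multi-step reasoning question.\n"
        f"Question:\n{question}\n\n"
        f"Gold step-by-step solution:\n{gold_solution}\n\n"
        f"Model response:\n{model_response}\n\n"
        "Briefly compare the response with the gold solution, state whether it is "
        "correct, and if not, identify the first step where the error occurs."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # assumption: deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Toy usage:
# verdict = judge("In 2024, ...?", "Step 1: ...", "The answer is Dragon.")
# print(verdict)
```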
Results & Findings

The table above shows the evaluation results on HRMCR. We display only the top-performing models per model family. We derive three key observations from the evaluation results:
HRMCR is highly challenging: The leading models such as GPT-4o, DeepSeek-V3, and Claude-3.5-Sonnet all score under 30%. This is particularly noteworthy given that the benchmark is built on fixed, deterministic rules rather than specialized domain knowledge.
The effectiveness of inference-time scaling: o1 achieves an average score of 45. This suggests that inference-time scaling can generalize effectively to previously unseen domains.
Computational resources: EXAONE-3.5-32B, despite its size, shows near-zero performance on the benchmark. On the other hand, as shown in the following figure (x-axis: training compute in ExaFLOPs, i.e., 10^18 FLOPs; y-axis: HRMCR performance), Qwen2.5-14B, despite its much smaller size, outperforms EXAONE-3.5-32B thanks to its higher training compute. This indicates that solving HRMCR requires not just model scale but also advanced training strategies and sufficient computational resources.

Conclusion
We introduce HRM8K and HRMCR, designed to evaluate mathematical reasoning and cultural commonsense reasoning, respectively, to address the lack of reasoning benchmarks in Korean. We evaluate various models against these benchmarks and analyze their reasoning capabilities in Korean and in multilingual setups. Along the way, we uncover insights such as the multilingual reasoning gap and the role of computational resources in model performance.
Our findings underscore the significant potential for further research into the Korean reasoning capabilities of LLMs. We hope that the introduction of HRM8K and HRMCR contributes to the Korean LLM community and provides valuable resources for future research in this domain. If you have any questions about our benchmarks, feel free to contact us at spthsrbwls123@yonsei.ac.kr.