Introducing HRM8K, the first Korean Math Reasoning Benchmark

4 minutes

2024. 12. 5.

We are excited to introduce HRM8K (HAE-RAE Math 8K), the first publicly available benchmark for mathematical reasoning in Korean. HRM8K comprises 8,011 instances, sourced through a combination of translations from established English benchmarks (e.g., GSM8K, MATH, NuminaMath, MMMLU) and original problems curated from existing Korean math exams by our team.

Benchmark Details

The HRM8K benchmark consists of two subsets.

Korean School Math (KSM): This subset includes 1,428 challenging Korean mathematical problems from Korean examinations and competitions. The problems are manually captured as screenshots by a human labeler, processed through OCR using the GPT-4o API and cross-checked by two labelers.
Prior Sets: This subset comprises 6,583 math problems translated from existing English benchmarks, including GSM8K, MATH, Numina-Math, and MMLU. For GSM8K, MATH, and Numina-Math, translations are done using the GPT-4o API, followed by a human quality check. For MMLU, we utilize the human-translated version provided by OpenAI (MMMLU).

Evaluation Settings

LLMs are initially prompted in the following manner for our experiments. In cases where system roles are not available, the system_message is included in the user query.

system_message = """Solve the given question. After solving the problem, state your final answer in the following format: $\\\\boxed{N}$."""
problem = ""

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": problem + "Respond in Korean."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

We include the “Respond in Korean” in our prompt, as the ability of interest is each model's capability to solve and explain math questions in Korean. Moreover, while using a low temperature is ideal for pass@1 settings, we observe that multiple LLMs tend to reason in their preferred language (e.g., English or Chinese) in such an environment. Accordingly, we use the following sampling parameters as our default setting unless otherwise specified.

temperature = 0.7
top_p = 0.95
min_tokens = 8
max_tokens = 2048

Performance Report

Here, we report the performance of existing LLMs on HRM8K.

To our surprise, we observe that the performance of Qwen2.5-72B-Instruct is on par with GPT-4o. GPT-4o-Mini and Llama-3.1-70B-Instruct follow the two models. Llama-3.1-405B-Instruct was excluded due to hardware constraints.

Frontier Models (Proprietary / Large & Open)

We also present the evaluation results for smaller LLMs (<20B), including several models developed by Korean companies. Below are key details about the models:

VARCO-8B-Instruct: A fine-tuned version of Llama-3.1-8B.
Solar Pro (preview) Instruct: A derivative of Phi-3-Medium, scaled through model merging and further trained. Please note that this version of the model does not officially support Korean.
Exaone-3-7.8B-Instruct: A model trained from scratch on 8T tokens, with a mix of Korean and English data.

In terms of performance, Qwen2.5-14B-Instruct leads with a score of 43.5, followed by Llama-3.1-8B-Instruct (39.4), Qwen2.5-7B-Instruct (37.0), and Exaone-3.0-7.8B-Instruct (36.02). Notably, despite not being specifically pre-trained on Korean data, Qwen2.5 and Llama-3.1 demonstrate competitive performance.

Open LLMs ( < 20B)

We further compare the performance of Qwen2.5-7B, Llama3.1-8B, and Exaone3-7.8B relative to their computational budget during pretraining. Following Kaplan et al. (2020), we estimate pretraining compute as 6×#parameters×#tokens. For brevity, we use ExaFLOPs to represent the scale of computational operations, where each unit corresponds to 10^{18} floating-point operations (FLOPs).

The table below shows that Exaone-3.0-7.8B-Instruct is trained with a relatively smaller compute budget than its counterparts. Notably, for Exaone-3.0-7.8B-Instruct, each unit of ExaFLOP compute during pretraining contributes to a score of 0.00058 on HRM8K, outperforming Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct in efficiency. We hypothesize that this is due to its focus on Korean data during pretraining data curation.

We are operating the HRM8K leaderboard here, if you are interested in performance of more models.

Scaling Test-Time Compute

Considering the recent hype on scaling test-time compute, we detail relevant experiments in this section. A common approach is training a dedicated LLM with Long-CoT (Chain-of-Thought) reasoning. A notable example is QwQ-preview by the Qwen Team. Interestingly, as shown in the table below, QwQ-preview underperforms compared to Qwen2.5-32B-Instruct on HRM8K. While little is known about the specific training recipe for QwQ-preview, our educated guess is that it is primarily trained on English or Chinese datasets, making it less effective for Korean questions. Please note that we observe the model respond in either English or Chinese for most instances, even while being prompted to explain in Korean.

To mitigate the issue, we train two versions of Long-CoT models of our own: OLV-0.1 and OLV-0.2.

For OLV-0.1, we train Qwen2.5-7B to generate a two-step chain of CoT reasoning. In its first step, the model takes time to understand the question and plan a solution. In its second step, the solution is generated.

OLV-0.2 is the highlight. We train it to generate a three-step CoT reasoning. To the above-mentioned steps, we add a final step where the model re-iterates its response to check whether it has made any mistakes. To our surprise, the model successfully learns to fix its own mistakes.

We detail the performance of Qwen2.5-7B-Instruct, OLV-0.1, and OLV-0.2 in the table below. OLV models outperform their baseline greatly. Furthermore, the added re-iteration step successfully pushes the model further.

Below, we provide examples of OLV-0.2 self-correcting itself. (Yes, these are cherry-picked examples, and OLV sometimes fixes correct responses.)

### Question: Carla는 200GB 파일을 다운로드 중입니다. 보통 1분에 2GB를 다운로드할 수 있지만, 다운로드의 40%가 진행됐을 때 Windows가 업데이트 설치를 위해 재시작합니다. 이 과정은 20분이 걸립니다. 그러고 나서 Carla는 다운로드를 처음부터 다시 시작해야 합니다. 파일을 다운로드하는 데 총 얼마나 걸리나요?
### Answer: 160.0
### Response by OLV-0.2:

[Understanding Question]
카를라는 200GB 파일을 다운로드 중입니다.

1. **다운로드 속도**: 그녀는 분당 2GB의 속도로 파일을 다운로드합니다.
2. **재시작 이벤트**: 다운로드의 40%가 완료될 때, Windows가 업데이트를 설치하기 위해 그 다운로드를 재시작합니다.
3. **재시작 시간**: 이 재시작 과정은 20분이 걸립니다.
4. **재시작 후**: 재시작 후에는 다운로드를 처음부터 다시 시작합니다.

질문은 200GB 파일을 다운로드하는 데 걸리는 총 시간을 묻고 있습니다.

[Solving Question]
먼저, 파일의 40%를 다운로드하는 데 걸리는 시간을 계산해 보겠습니다.
200GB 파일의 40%는 0.4 * 200 = 80GB입니다.
분당 2GB의 속도로 80GB를 다운로드하는 데 걸리는 시간은 80 / 2 = 40분입니다.
이때 Windows가 업데이트를 설치하기 위해 재시작하는 시간 20분을 더하면 40 + 20 = 60분이 됩니다.
이제 남은 60%의 파일을 다운로드하는 데 걸리는 시간을 계산해 보겠습니다.
200GB 파일의 60%는 0.6 * 200 = 120GB입니다.
분당 2GB의 속도로 120GB를 다운로드하는 데 걸리는 시간은 120 / 2 = 60분입니다.
따라서 파일을 다운로드하는 데 걸리는 총 시간은 60 + 60 = 120분입니다.
따라서 파일을 다운로드하는 데 걸리는 총 시간은 \boxed{120}분입니다.

[Re-Iterating Solution]

### Question: Candice는 커피숍으로 출근하기 전에 80개의 포스트잇을 가방에 넣었습니다. 그녀는 가는 길에 가게에 들러 포스트잇 한 패키지를 구입했습니다. 직장에서 그녀는 220개의 다른 커피 컵 각각에 하나씩 포스트잇을 붙였습니다. 전체적으로 23개의 포스트잇이 남아있었다면, 그녀가 구입한 패키지에는 포스트잇이 몇 개 들어 있었나요?
### Answer: 163.0
### Response by OLV-0.2:

[Understanding Question]
캐니스는 커피숍으로 가기 위해 가방에 80개의 포스트잇을 넣었습니다. 

1. 그녀는 출근하기 전에 가방에 80개의 포스트잇을 가지고 있었습니다.
2. 그녀는 가게에 가서 포스트잇 한 패키지를 구입했습니다.
3. 직장에서 그녀는 220개의 커피 컵 각각에 포스트잇 하나씩 붙였습니다.
4. 그녀가 모든 포스트잇을 사용한 후에는 총 23개의 포스트잇이 남아 있었습니다.

질문은 그녀가 가게에서 구입한 포스트잇 패키지에 몇 개가 있었는지를 묻고 있습니다. 

이를 찾기 위해 우리는 다음을 추적해야 합니다:
- 처음의 포스트잇 수: 80
- 직장에서 붙인 포스트잇 수: 220
- 최종적으로 남은 포스트잇 수: 23

이 정보를 사용하여 그녀가 구입한 포스트잇 패키지의 수를 계산할 수 있습니다.

[Solving Question]
캐니스는 처음에 80개의 포스트잇을 가지고 있었습니다.
그녀는 가게에서 포스트잇 한 패키지를 구입했습니다.
그녀는 직장에서 220개의 커피 컵 각각에 포스트잇 하나씩 붙였습니다.
그녀는 총 23개의 포스트잇이 남아 있었습니다.
따라서 그녀가 가진 포스트잇의 총 수는 80 + 패키지에 있는 포스트잇 수 - 220 + 23입니다.
패키지에 있는 포스트잇 수를 x라고 하겠습니다.
그러면 80 + x - 220 + 23 = 23이므로 x = 114입니다.
따라서 그녀가 구입한 포스트잇 패키지는 \boxed{114}개였습니다.

[Re-Iterating Solution]

### Question: 볼록오각형 \( ABCDE \) 에서 \( BE \) 와 \( BC \) 가 각각 \( CD \) 와 \( AD \) 에 평행하고 \( BC = ED \), \( AB = CD \) 이다. \( \angle BCD = 130^\circ \) 이고 \( \angle ACE = x^\circ \) 일 때, \( x \) 를 구하여라.
### Answer: 80
### Response by OLV-0.2:

[Understanding Question]
1. **문제 이해하기**: 먼저 문제 설명을 주의 깊게 읽어 \( ABCDE \) 다각형의 기하학적 관계를 이해할 것입니다. \( BE \)와 \( BC \)가 각각 \( CD \)와 \( AD \)에 평행하다는 점을 주목하며, 이는 각도 사이의 관계를 설정하는 데 도움이 될 것입니다.

2. **주어진 각도 확인하기**: \( \angle BCD = 130^\circ \)라는 것을 알고 있습니다. 이 각도는 계산에서 중요하므로, 나중에 참조할 수 있도록 기록해 두겠습니다.

3. **평행선의 성질 사용하기**: \( BC \)가 \( AD \)에 평행하므로, 교차 내각을 적용할 수 있습니다. 이로 인해 \( \angle BCD \)와 관련된 각도를 찾아보겠습니다.

4. **다른 각도 계산하기**: \( \angle BCD = 130^\circ \)를 알고 있으므로, \( \angle ACD \)를 찾을 수 있습니다. \( ACD \)는 \( BCD \)와 보각 관계에 있으므로, 보각의 성질을 사용하여 이 각도를 계산하겠습니다.

5. **삼각형 \( ACE \) 분석하기**: 다음으로 삼각형 \( ACE \)에 집중하겠습니다. \( AC = AE \) (주어진 조건으로부터 유도됨)라는 것을 알고 있으므로, 이등변 삼각형의 성질을 사용할 수 있습니다.

6. **삼각형 각의 합 이용하기**: 모든 삼각형에서 각의 합은 \( 180^\circ \)라는 것을 기억하겠습니다. \( ACE \) 삼각형에서 \( \angle CAE \)와 \( \angle AEC \)를 \( \angle ACE \)에 연관시켜 표현하겠습니다.

7. **방정식 설정하기**: 삼각형 \( ACE \)의 각도를 포함하는 방정식을 작성하고, 이를 \( \angle ACE \)에 대해 풀겠습니다.

8. **최종 계산하기**: 마지막으로, 방정식을 단순화하여 \( \angle ACE \)의 값을 찾겠습니다. 모든 계산을 주의 깊게 수행하여 정확성을 보장하도록 하겠습니다.

[Solving Question]
다음 문제를 해결해 보겠습니다:
볼록 오각형 \( ABCDE \)에서, \( BE \)와 \( BC \)는 각각 \( CD \)와 \( AD \)에 평행하고, \( BC = ED \), \( AB = CD \)입니다. \( \angle BCD = 130^\circ \)이고, \( \angle ACE = x^\circ \)일 때, \( x \)의 값을 구해보겠습니다.

1. **주어진 조건을 이해하고 활용하기:**
    - \( BC \parallel AD \)이고 \( BC = ED \)입니다.
    - \( AB = CD \)입니다.
    - \( \angle BCD = 130^\circ \)입니다.

2. **평행선의 성질을 사용하여 각도 결정하기:**
    - \( BC \parallel AD \)이므로, 교차 내부 각이 같다는 성질을 사용할 수 있습니다:
    \[
    \angle BCD = \angle ACD = 130^\circ
    \]
    - 이는 \( \angle BCD \)와 \( \angle ACD \)가 선형 쌍을 이루기 때문입니다.

3. **\( \angle ACE \)를 찾기 위한 설정:**
    - \( AB = CD \)이고 \( BC = ED \)이므로, 우리는 \( \angle ACE \)를 찾아야 합니다.

4. **이등변 삼각형의 성질 사용하기:**
    - \( AB = CD \)이고 \( BC = ED \)이므로, 삼각형 \( ACE \)는 \( AC = AE \)인 이등변 삼각형입니다. \( \angle ACE \)는 이등변 삼각형의 꼭짓점 각입니다.

5. **\( \angle ACE \) 계산하기:**
    - 삼각형의 각의 합은 \( 180^\circ \)입니다. 이등변 삼각형 \( ACE \)에서:
    \[
    \angle CAE = \angle AEC = \frac{180^\circ - \angle ACE}{2}
    \]
    - \( \angle BCD = 130^\circ \)이므로, \( \angle ACD = 130^\circ \)입니다. 따라서:
    \[
    \angle ACE = 180^\circ - \angle ACD = 180^\circ - 130^\circ = 50^\circ
    \]

6. **결론:**
    - 따라서 \( \angle ACE \)의 값은:
    \[
    \boxed{50^\circ}
    \]

[Re-Iterating Solution]
제 답변을 검토하며, 문제가 있는지 살펴보겠습니다. 

1. 주어진 조건을 바탕으로 \( BC \parallel AD \)이고, \( AB = CD \)임을 사용하여 \( \angle BCD = 130^\circ \)라는 점을 확인했습니다. 이 각이 \( \angle ACD \)와 같은 이유는 \( BC \parallel AD \)에 따른 것입니다. 이 부분은 맞습니다.

2. 그러나 \( \angle ACD \)를 \( 130^\circ \)라고 잘못 언급한 부분을 발견했습니다. 실제로는 \( \angle ACD \)가 \( \angle BCD \)와는 별개의 각도이므로, \( \angle ACD \)는 \( 180^\circ - 130^\circ = 50^\circ \)여야 합니다.

3. 따라서 \( \angle ACE \)를 구하는 과정에서 \( \angle ACD = 130^\circ \)를 잘못 사용하여 \( \angle ACE = 50^\circ \)라고 잘못 도출했습니다. 이는 주어진 정보와 모순되므로, 정답은 다음과 같이 수정해야 합니다.

검토 결과 풀이에는 문제가 있었습니다. 따라서 수정된 정답은 \( \angle ACE = 80^\circ \)

Final Remarks

The OLV series will take some time to be made public as we are still actively making progress for longer and more dynamic reasoning chains. We just wanted to share our initial results to provide a glimpse of what we are working on. HRM8K, however, will be released soon (during December). If you want your models evaluated before the release, please contact us at spthsrbwls123@yonsei.ac.kr.