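Lecture 1

Hello, everyone! I'm Justin, your paper master and star instructor. Starting today, we're going to chew through a very hot paper, “Absolute Zero: Reinforced Self-play Reasoning with Zero Data”, line by line. Stay sharp and follow along!

Let's start with the title.

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

“Absolute Zero”: even the name sounds serious, doesn't it? It means a state of complete zero, and what that refers to is explained later. The core is “Reinforced Self-play Reasoning”, that is, reasoning learned through reinforcement learning with self-play, and the most important point is “Zero Data”. Learning to reason without any external data, isn't that remarkable? This is the central idea of the paper.

Next, the authors.

Andrew Zhao¹, Yiran Wu³, Yang Yue¹, Tong Wu², Quentin Xu¹, Yang Yue¹, Matthieu Lin¹, Shenzhi Wang¹, Qingyun Wu³, Zilong Zheng², and Gao Huang¹

¹ Tsinghua University  ² Beijing Institute for General Artificial Intelligence  ³ Pennsylvania State University

Researchers from leading institutions took part: Tsinghua University, the Beijing Institute for General Artificial Intelligence, and Pennsylvania State University. Professor Gao Huang of Tsinghua in particular is known for famous work such as DenseNet. The paper also kindly lists contact email addresses, so if you have questions you could reach out directly. (Politely, of course!)

Now let's look at the abstract, which summarizes what the paper is about. I'll go through it carefully, one sentence at a time.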

Abstract

“Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.” Here's the first sentence. It opens by saying that RLVR, “reinforcement learning with verifiable rewards”, looks promising for improving the reasoning ability of large language models (LLMs). The important part is “outcome-based rewards”: the model learns directly from a reward on the final outcome. It learns from the result rather than from the process.

“Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training.” Recent “zero setting” RLVR work avoids supervision on the reasoning process, but it still depends on question and answer collections that people curated by hand. That is the limitation being pointed out. The “zero setting” here means applying RL directly to a base model, with no labeled reasoning traces to imitate first.

“The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining.” High-quality, human-made examples are scarce, and that raises concerns about the long-term scalability of relying on human supervision. The authors note that the same problem has already shown up in LLM pretraining. Making data is no small job.

“Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system.” Going further, if a future arrives in which AI surpasses human intelligence, the tasks humans provide may offer limited learning potential for a superintelligent system. Once AI is smarter than we are, the problems we hand it may simply be too easy.

“To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data.” Here, at last, is the paper's core proposal! To address these concerns, the authors propose a new RLVR paradigm called “Absolute Zero”. In this paradigm a single model, with no external data, “proposes tasks” that maximize its own learning progress and improves its reasoning by “solving” them. It sets its own problems and gets smarter by solving them! This is where the meaning of “Zero Data” becomes clear.

“Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning.” Under this paradigm they introduce the “Absolute Zero Reasoner”, AZR for short. AZR uses a code executor both to validate the code reasoning tasks it proposes and to verify its answers, and in this way it evolves its own training curriculum and reasoning ability. The executor becomes a unified source of verifiable reward that guides open-ended yet grounded learning. The “code executor” clearly plays the central role.

“Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples.” Surprisingly, even though it is trained entirely without external data, AZR achieves overall SOTA (state-of-the-art) performance on coding and mathematical reasoning tasks! It outperforms existing zero-setting models that rely on tens of thousands of in-domain, human-curated examples. Doing this well without data is a striking result.

“Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.” On top of that, the authors show that AZR can be applied effectively across model scales and is compatible with various model classes. It is not tied to one specific model, and that generality is a big advantage.

“Code Project Page Logs Models” The paper links to its code, project page, logs, and models for transparency, so take a look if you're interested.

Next, let's look at Figure 1 to confirm once more how impressive AZR is.

Figure 1. Absolute Zero Reasoner (AZR) achieves state-of-the-art performance with ZERO DATA. This is the title of Figure 1. It stresses again that AZR reaches SOTA performance with “ZERO DATA”.

“Without relying on any gold labels or human-defined queries, Absolute Zero Reasoner trained using our proposed self-play approach demonstrates impressive general reasoning capabilities improvements in both math and coding, despite operating entirely out-of-distribution.” Without relying on any gold labels or human-defined queries, AZR trained with the proposed “self-play” approach shows impressive improvements in general reasoning in both math and coding. And it does so while operating entirely out-of-distribution! That points to strong generalization.

“Remarkably, AZR surpasses models trained on tens of thousands of expert-labeled in-domain examples in the combined average score across both domains.” Remarkably, in the combined average score across math and coding, AZR surpasses models trained on tens of thousands of expert-labeled in-domain examples. The figure shows visually what the abstract already claimed.

“Corresponding author(s)” There is also a marker for the corresponding author(s).

That's it for Lecture 1. Through the title, the author information, the abstract, and Figure 1, we got a taste of how innovative this work is. “An AI that learns on its own without data”: isn't that exciting?

Lecture 2

All right, everyone! Lecture 2 of conquering this paper with star instructor Justin starts now! Last time, through the title and abstract, we got a taste of “Absolute Zero”, a remarkable approach that learns on its own without data. Today we dig into the Introduction to see why this research was needed and what big picture the paper is drawing. Focus and follow along!

Let's start with Figure 2. It shows how the Absolute Zero paradigm differs from existing learning methods.

Figure 2. Absolute Zero Paradigm.

“Supervised learning relies on human-curated reasoning traces for behavior cloning.” The first panel is “supervised learning”. The model is trained to follow human-written reasoning traces exactly as given. This is also called “behavior cloning”.

“Reinforcement learning from verified rewards, enables agents to self-learn reasoning, but still depends on expert-defined learning distribution and a respective set of curated QA pairs, demanding domain expertise and manual effort.” The second is “reinforcement learning from verified rewards”, that is, RLVR. It lets the agent learn to reason by itself, but it still depends on an expert-defined learning distribution and a curated set of QA pairs. In other words, it needs domain expertise and manual effort. We touched on this in Lecture 1, remember?

“In contrast, we introduce a new paradigm, Absolute Zero, for training reasoning models without any human-curated data.” In contrast, here is the new paradigm this paper proposes: “Absolute Zero”! It trains reasoning models without human-made data. “without any human-curated data”: that is the key.

“We envision that the agent should autonomously propose tasks optimized for learnability and learn how to solve them using an unified model.” In this paradigm the agent should “propose on its own” tasks optimized for learnability, and learn how to solve them with a single unified model. It writes the problems and solves them too.

“The agent learns by interacting with an environment that provides verifiable feedback, enabling reliable and continuous self-improvement entirely without human intervention.” The agent learns by interacting with an environment that provides verifiable feedback. That makes reliable, continuous self-improvement possible without human intervention. Sounds like a dream, doesn't it?

Now let's read the Introduction text itself.

1. Introduction

“Large language models (LLMs) have recently achieved remarkable improvements in reasoning capabilities by employing Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024).” The first sentence. LLMs have recently made remarkable progress in reasoning by using RLVR, citing Lambert et al. (2024). This is background: RLVR has contributed to better LLM reasoning.

“Unlike methods that explicitly imitate intermediate reasoning steps, RLVR uses only outcome-based feedback, enabling large-scale reinforcement learning over vast task datasets (DeepSeek-AI et al., 2025; Team et al., 2025; Jaech et al., 2024; OpenAI, 2025b; a).” Unlike methods that explicitly imitate intermediate reasoning steps, RLVR uses only “outcome-based feedback”. That is what makes large-scale reinforcement learning over vast task datasets possible. The sentence cites recent work from DeepSeek-AI, Team, Jaech, OpenAI and others. It stresses once more that RLVR focuses on the result rather than the process.

“A particularly compelling variant is the “zero” RLVR paradigm (DeepSeek-AI et al., 2025), which forgoes any cold-start distillation data, using neither human-generated nor AI-generated reasoning traces, and applies RLVR directly on the base model with task rewards.” A particularly compelling variant is the “zero” RLVR paradigm, attributed to DeepSeek-AI et al. (2025). It uses no “cold-start distillation data”, meaning no reasoning traces prepared in advance to warm up training, whether written by humans or generated by AI. RLVR is applied directly to the base model using task rewards. Training starts with no demonstrations at all.

“However, these methods still depend heavily on expertly curated distributions of reasoning question–answer pairs, which raises serious concerns about their long-term scalability (Villalobos et al., 2024).” But these methods still depend heavily on distributions of question-answer pairs that experts carefully curated. That raises serious concerns about long-term scalability, with a citation to Villalobos et al. (2024). In the end, someone still has to make the data.

“As reasoning models continue to advance, the effort required to construct large-scale, high-quality datasets may soon become unsustainable (Yue et al., 2025).” As reasoning models keep advancing, the effort needed to build large-scale, high-quality datasets may soon become unsustainable, citing Yue et al. (2025). Models improve, but data creation can't keep up.

“A similar scalability bottleneck has already been identified in the domain of LLM pretraining (Sutskever et al., 2024).” A similar scalability bottleneck has already been identified in LLM pretraining, citing Sutskever et al. (2024). Data scarcity is a worry across the whole LLM field.

“Furthermore, as AI systems continue to evolve and potentially exceed human intellect, an exclusive dependence on human-designed tasks risks imposing constraints on their capacity for autonomous learning and growth (Hughes et al., 2024).” Moreover, as AI systems keep evolving and potentially exceed human intellect, depending exclusively on human-designed tasks risks constraining their capacity for autonomous learning and growth, citing Hughes et al. (2024). Once AI is smarter than we are, the homework we assign may be too trivial to help it grow.

“This underscores the need for a new paradigm that begins to explore possibilities beyond the constraints of human-designed tasks and prepares for a future in which AI systems may surpass human intelligence.” This underscores the need for a new paradigm that starts exploring possibilities beyond human-designed tasks and prepares for a future in which AI systems may surpass human intelligence. In short, we need a new setup in which AI can improve by itself.

“To this end, we propose “Absolute Zero”, a new paradigm for reasoning models in which the model simultaneously learns to define tasks that maximize learnability and to solve them effectively, enabling self-evolution through self-play without relying on external data.” To this end, the paper proposes “Absolute Zero”, a new paradigm for reasoning models. In it, the model learns to “define” tasks that maximize learnability and, at the same time, to “solve” them effectively. It evolves through self-play without relying on external data. This restates the core point we saw in the abstract.

“In contrast to prior self-play methods that are limited to narrow domains, fixed functionalities, or learned reward models that are prone to hacking (Silver et al., 2017; Chen et al., 2025; 2024), the Absolute Zero paradigm is designed to operate in open-ended settings while remaining grounded in a real environment.” Earlier self-play methods were limited to narrow domains, fixed functionalities, or learned reward models that are prone to hacking (exploiting loopholes in the reward so the score goes up without the intended behavior being learned). The work of Silver, Chen and others is cited as examples. The Absolute Zero paradigm, by contrast, is designed to operate in open-ended settings while staying grounded in a real environment. It can be applied to broader, more realistic problems.

“It relies on feedback from the environment as a verifiable source of reward, mirroring how humans learn and reason through interaction with the world, and helps prevent issues such as hacking with neural reward models (Hughes et al., 2024).” The paradigm uses feedback from the environment as a verifiable source of reward. This mirrors how humans learn and reason by interacting with the world, and it helps prevent problems such as hacking of neural reward models. Hughes et al. (2024) is cited again. The emphasis is on a natural way of learning and on stability.

“Similar to AlphaZero (Silver et al., 2017), which improves through self-play, our proposed paradigm requires no human supervision and learns entirely through self-interaction.” Like AlphaZero, which improves through self-play, the proposed paradigm needs no human supervision and learns entirely through interaction with itself. Think of how AlphaZero learned Go without human game records and still reached the top.

“We believe the Absolute Zero paradigm represents a promising step toward enabling large language models to autonomously achieve superhuman reasoning capabilities.” The authors believe the Absolute Zero paradigm is a promising step toward letting LLMs reach superhuman reasoning capabilities autonomously. They are stating their ambition: an important step toward AI that goes beyond humans.

“Building on this new reasoning paradigm, we introduce the Absolute Zero Reasoner (AZR), which proposes and solves coding tasks.” Building on this new paradigm, the paper introduces the Absolute Zero Reasoner (AZR), which proposes and solves coding tasks. Now we start to see what AZR concretely does. The focus is on “coding tasks”.

“We cast code executor as an open-ended yet grounded environment, sufficient to both validate task integrity and also provide verifiable feedback for stable training.” The code executor is cast as an open-ended yet grounded environment. It is enough both to validate the integrity of tasks and to provide verifiable feedback for stable training. The code executor is AZR's playground and teacher at once.

“We let AZR construct three types of coding tasks: infer and reason about one particular element in a program, input, output triplet, which corresponds to three complementary modes of reasoning: induction, abduction, and deduction.” AZR constructs three types of coding tasks. Each one asks the model to infer one particular element of a (program, input, output) triplet. These correspond to three complementary modes of reasoning: induction, abduction, and deduction. The three basic modes of reasoning are mapped onto coding tasks. More detail on this comes later.

“We train the entire system end-to-end with a newly proposed reinforcement learning advantage estimator tailored to the multitask nature of the proposed approach.” The whole system is trained end-to-end with a newly proposed reinforcement learning advantage estimator, tailored to the multitask nature of the approach. So the authors also developed a new training method.

“Despite being trained entirely without any in-distribution data, AZR demonstrates remarkable capabilities across diverse reasoning tasks in math and coding.” Even though it is trained without any in-distribution data, AZR shows remarkable capabilities across diverse reasoning tasks in math and coding. This is what the abstract emphasized: it does well without data!

“In mathematics, AZR achieves competitive performance compared to zero reasoner models explicitly fine-tuned with domain-specific supervision.” In mathematics, AZR is competitive with zero reasoner models that were explicitly fine-tuned with domain-specific supervision. It handles math problems well too.

“In coding tasks, AZR establishes a new state-of-the-art performance, surpassing models specifically trained with code datasets using RLVR.” In coding, AZR sets a new state of the art, surpassing models trained specifically on code datasets with RLVR. It is even stronger at coding.

“Furthermore, AZR outperforms all previous models by an average of 1.8 absolute points compared to models trained in the “zero” setting using in-domain data.” Furthermore, compared with models trained in the “zero” setting using in-domain data, AZR outperforms all previous models by an average of 1.8 absolute points. The overall gain is backed by a number.

“These surprising results highlight that general reasoning skills can emerge without human-curated domain targeted data, positioning Absolute Zero as an promising research direction and AZR as a first pivotal milestone.” These surprising results highlight that general reasoning skills can emerge without human-curated, domain-targeted data. They position Absolute Zero as a promising research direction and AZR as a first pivotal milestone. The Introduction closes by restating the significance of the work.

That covers the whole Introduction. You should now have the big picture: why the Absolute Zero idea came about, how it differs from earlier work, how AZR operates, and what it achieved. The key is that the model “creates and solves its own problems without data, and learns and improves by interacting with a real environment (the code executor)”!

Lecture 3

In Lecture 2, the Introduction gave us a feel for why Absolute Zero is needed and what makes it impressive. Today we finally enter the section that holds the paper's core philosophy, “2. The Absolute Zero Paradigm”.

2. The Absolute Zero Paradigm

This section defines exactly what the Absolute Zero paradigm is and how it differs from existing methods.

2.1. Preliminaries

“Preliminaries”, that is, background. The paper explains two concepts we need first in order to understand Absolute Zero: SFT and RLVR.

“Supervised Fine-Tuning (SFT). SFT requires the datasets of task-rationale-answer demonstrations D = {(x, c⋆, y⋆)}, where x is the query, c⋆ is the gold chain-of-thought (CoT) and y⋆ is the gold answer, all provided by human experts or superior AI models.” The first is SFT, “supervised fine-tuning”. SFT needs a dataset D made of triples (x, c⋆, y⋆). Here x is the query, c⋆ is the gold chain-of-thought (CoT), and y⋆ is the gold answer. The important point is that all of these are provided by “human experts” or by “superior AI models”. The model learns with an answer key in hand.

“The model trains to imitate the reference responses to minimize the conditional negative log-likelihood (Ouyang et al., 2022):” The model is trained to imitate this answer key, with the goal of minimizing the conditional negative log-likelihood. The formula is as follows.

$$L_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, c^\star,\, y^\star) \sim D} \, \log \pi_\theta(c^\star, y^\star \mid x) \tag{1}$$

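Don't panic just because a formula showed up! Put simply, given the query x, we measure how much probability the model π_θ assigns to the gold reasoning c⋆ and the gold answer y⋆, and we want that to be as high as possible. Because of the minus sign in front, maximizing the log-likelihood is the same as minimizing this loss. In other words, the model is pushed to reproduce the reference as closely as it can. The formulation is cited from Ouyang et al. (2022).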
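To make the minus sign concrete, here is a minimal Python sketch of Equation (1) for a single example. It is my own illustration, not code from the paper: the function name `sft_nll` and the per-token probabilities are made up, and a real implementation would work on model logits over a whole batch.

```python
import math

def sft_nll(token_probs):
    """Negative log-likelihood of one reference sequence (c*, y*) given x.

    token_probs: the probability the model assigns to each reference token,
    in order. The sequence probability is their product, so the log is a sum.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for the same reference under two models.
confident = [0.9, 0.8, 0.95]   # model that already imitates the reference well
unsure = [0.5, 0.4, 0.6]       # model that is far from the reference

loss_confident = sft_nll(confident)
loss_unsure = sft_nll(unsure)
print(loss_confident < loss_unsure)  # higher likelihood means lower loss
```

The better a model imitates the reference, the smaller the loss, which is exactly why SFT minimizes it.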

“However, at the frontier level, there’s no stronger model to distill from, and expert human labeling doesn’t scale well.” At the frontier, however, there is no stronger model left to distill from (distillation means transferring knowledge from a stronger, usually larger, model into another model), and labeling by human experts doesn't scale. Having people write every answer by hand is just too much work.

“Reinforcement Learning with Verifiable Rewards (RLVR). To move beyond the limits of pure imitation, RLVR only requires a dataset of task and answer D = {(x, y⋆)}, without labeled rationale.” That is where RLVR, “reinforcement learning with verifiable rewards”, comes in. To go beyond pure imitation, RLVR only needs a dataset of (x, y⋆) pairs, a query and its answer. No labels for the rationale are required.

“RLVR allows the model to generate its own CoT and calculate a verifiable reward with the golden answer r(y, y⋆).” RLVR lets the model generate its own chain-of-thought, then compares the generated answer y with the gold answer y⋆ to compute a “verifiable reward” r(y, y⋆). The model thinks for itself and is rewarded on the result.

“However, the learning task distribution D, with its set of queries and gold answers are still labeled by human experts.” Even so, the learning task distribution D, meaning the queries and gold answers themselves, is still labeled by human experts. People still have to write the problems.

“The trainable policy π_θ is optimized to maximize expected reward:” The trainable policy π_θ is optimized to maximize expected reward. The formula is as follows.

$$J_{\mathrm{RLVR}}(\theta) = \mathbb{E}_{(x,\, y^\star) \sim D,\; y \sim \pi_\theta(\cdot \mid x)} \big[\, r(y, y^\star) \,\big] \tag{2}$$

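This one is simple too. Draw a query x and its gold answer y⋆ from the dataset D, let the model π_θ generate an answer y for x, and maximize the expected value of the reward r(y, y⋆) between that answer and the gold answer. In other words, the model is trained in the direction that earns more reward.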
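The expectation in Equation (2) can be estimated by sampling. The sketch below is an illustration under my own assumptions: `toy_policy` is a made-up stand-in for π_θ that answers addition queries correctly about 70% of the time, and the reward is a plain exact-match check. It only estimates the objective; it does not perform any policy update.

```python
import random

def reward(y, y_star):
    """Verifiable outcome reward r(y, y*): 1 for a correct answer, else 0."""
    return 1.0 if y == y_star else 0.0

def estimate_j_rlvr(dataset, policy, samples_per_query=100, seed=0):
    """Monte Carlo estimate of E_{(x, y*) ~ D, y ~ pi(.|x)} [ r(y, y*) ]."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x, y_star in dataset:
        for _ in range(samples_per_query):
            y = policy(x, rng)          # y ~ pi_theta(. | x)
            total += reward(y, y_star)
            count += 1
    return total / count

def toy_policy(x, rng):
    """Hypothetical policy: adds two numbers, but is off by one 30% of the time."""
    a, b = x
    return a + b if rng.random() < 0.7 else a + b + 1

# A tiny hand-written dataset D of (query, gold answer) pairs.
dataset = [((1, 2), 3), ((5, 7), 12)]
j_hat = estimate_j_rlvr(dataset, toy_policy)
print(round(j_hat, 2))  # should land near the policy's 0.7 accuracy
```

Notice that the dataset is still written by hand here, which is precisely the dependency the paper wants to remove.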

“In summary, both SFT and RLVR still rely on human-curated datasets of either queries, demonstrations, or verifiers, which ultimately limit scalability.” In summary, both SFT and RLVR still rely on human-curated datasets of queries, demonstrations, or verifiers. That is what ultimately limits scalability. Making the data is the bottleneck.

2.2. Absolute Zero

Now, finally, the heart of the paper: the Absolute Zero paradigm!

“We propose the Absolute Zero paradigm, where during training, the model simultaneously proposes tasks, solves them, and learns from both stages.” The authors propose the Absolute Zero paradigm, in which, during training, the model simultaneously “proposes tasks”, “solves them”, and “learns from both stages”. It writes the problems, solves them, and learns from the whole process.

“No external data is required and the model learns entirely through self-play and experience, aided by some environment.” “No external data is required”, and the model learns entirely through “self-play” and “experience”, with help from some “environment”. Pretty radical, right?

“We illustrate this paradigm in Figure 2, which contrasts Absolute Zero with supervised learning and RLVR, highlighting how our approach eliminates the need for any human-curated data by enabling self-improving task proposal and solution through self-play.” We saw this paradigm illustrated back in Figure 2. By contrasting it with supervised learning and RLVR, the figure highlights how the approach removes the need for human-curated data: tasks are proposed and solved through self-play.

“To make the Absolute Zero setting concrete, we now define how one model can act both as the proposer and solver role.” To make the Absolute Zero setting concrete, the paper now defines how a single model can play both the “proposer” role and the “solver” role. One model does both jobs.

“To aid understanding, we include an illustration in Figure 3.” An illustration is included in Figure 3 to help. So let's look at Figure 3 together!

(Explanation of Figure 3. The Absolute Zero Loop)

Look at Figure 3. It's titled “The Absolute Zero Loop”. The model improves itself through this repeated cycle.

“The Absolute Zero loop begins with the agent π proposing task τ, which is transformed by f with the environment e into a validated problem (x, y⋆), and also emits a reward r_propose for learnability.” The loop begins with the agent π “proposing” a task τ. The proposed task τ is transformed by the function f, together with the environment e, into a validated problem (x, y⋆). At the same time a learnability reward r_propose is emitted. So when the model writes a problem, the environment checks whether it is usable and scores how good it is to learn from.

“Then, a standard RL step follows: the agent solves x by producing y, receiving reward r_solve from e by matching with y⋆.” After that comes a standard reinforcement learning (RL) step. The agent solves the problem x by producing an answer y, and receives the reward r_solve from the environment e by matching y against the gold answer y⋆. It solves the problem and is scored on the result.

“π_propose and π_solve are jointly trained and this process can be repeated indefinitely.” The important point is that the proposer role (π_propose) and the solver role (π_solve) are “jointly trained”, and the process can be repeated indefinitely. The model keeps writing and solving problems and keeps getting smarter.

(Back to the main text)

“Let π_θ be our parameterized language model, it is used to play two roles, proposer π_propose_θ and solver π_solve_θ during training.” Let π_θ be our parameterized language model. During training it plays two roles: the proposer π_propose_θ and the solver π_solve_θ.

“The proposer first samples a proposed task conditioned on variable z: τ ∼ π_propose_θ(·|z), which will then be validated and used to construct a valid reasoning task together with the environment e: (x, y⋆) ∼ f_e(·|τ), where x is the task query and y⋆ is the gold label.” The proposer first samples a proposed task τ conditioned on a variable z (τ ∼ π_propose_θ(·|z)). The task is then validated and used, together with the environment e, to construct a valid reasoning task (x, y⋆) ((x, y⋆) ∼ f_e(·|τ)). Here x is the task query and y⋆ is the gold label. You can think of z as a kind of seed for generating problems.

“Then the solver produces an answer y ∼ π_solve_θ(·|x).” Then the solver produces an answer y to the query x (y ∼ π_solve_θ(·|x)).

“Each proposed task τ is scored by a learnability reward r_propose_e(τ, π_θ), which captures the expected improvement in π_θ after training on the task query x.” Each proposed task τ is scored by a learnability reward r_propose_e(τ, π_θ). This reward captures how much the model π_θ is expected to improve after training on the task query x. In other words, it puts a number on “how much smarter will I get by solving this?”

“Moreover, the same policy also receives a solution reward r_solve_e(y, y⋆) for its answer to the task query x, with the environment again serving as the verifier.” The same policy (model) also receives a solution reward r_solve_e(y, y⋆) for its answer to the task query x. Again the environment acts as the verifier. This is the score for whether the problem was solved correctly.

“A non-negative coefficient λ balances the trade-off between exploring new, learnable tasks and improving the model’s reasoning and problem-solving abilities.” A non-negative coefficient λ (lambda) balances the trade-off between exploring new, learnable tasks and improving the model's reasoning and problem-solving ability. Concretely, it sets how much weight the solving reward gets relative to the proposing reward.

“We formally define the absolute zero setting’s objective as follows:” The objective of the Absolute Zero setting is formally defined as follows.

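$$J(\theta) := \max_{\theta} \; \mathbb{E}_{z \sim p(z)} \Big[\, \mathbb{E}_{(x,\, y^\star) \sim f_e(\cdot \mid \tau),\; \tau \sim \pi_\theta^{\mathrm{propose}}(\cdot \mid z)} \Big[\, r_e^{\mathrm{propose}}(\tau, \pi_\theta) + \lambda \, \mathbb{E}_{y \sim \pi_\theta^{\mathrm{solve}}(\cdot \mid x)} \big[\, r_e^{\mathrm{solve}}(y, y^\star) \,\big] \Big] \Big] \tag{3}$$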

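This is one of the most important formulas in the paper! The goal is to maximize J(θ). Unpacking it step by step:

- Sample z from p(z) (the initial condition for writing a problem).
- The proposer π_propose_θ proposes a task τ based on z, and from this τ the environment f_e produces a problem (x, y⋆).
- The proposed task τ earns a “learnability reward” (r_propose).
- The solver π_solve_θ produces an answer y to the problem x and earns a “solution reward” (r_solve).
- The model parameters θ are trained to maximize the expected value of the sum of these two rewards, with λ weighting the solve term.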
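The steps above can be sketched as a single function that draws one sample of the inner term of Equation (3). This is a toy illustration under my own assumptions: every `toy_*` helper, the constant propose reward, and the addition task are hypothetical stand-ins, not AZR's actual components, and no gradient update is shown.

```python
import random

def absolute_zero_sample(z, proposer, environment, solver,
                         propose_reward, solve_reward, lam, rng):
    """Draw one sample of r_propose + lam * r_solve from Equation (3).

    Returns None when the environment rejects the proposed task.
    """
    tau = proposer(z, rng)            # tau ~ pi_propose(. | z)
    problem = environment(tau)        # (x, y*) ~ f_e(. | tau)
    if problem is None:
        return None
    x, y_star = problem
    y = solver(x, rng)                # y ~ pi_solve(. | x)
    return propose_reward(tau) + lam * solve_reward(y, y_star)

# --- Hypothetical toy stand-ins, for illustration only ---
def toy_proposer(z, rng):
    # "Task": add two small integers, loosely seeded by z.
    return (z + rng.randint(0, 9), rng.randint(0, 9))

def toy_environment(tau):
    a, b = tau
    if a < 0 or b < 0:                # pretend negative operands are invalid
        return None
    return tau, a + b                 # query x and gold answer y*

def toy_solver(x, rng):
    a, b = x
    return a + b                      # a solver that always answers correctly

def constant_propose_reward(tau):
    return 0.5                        # placeholder value, not the paper's reward

def exact_match(y, y_star):
    return 1.0 if y == y_star else 0.0

rng = random.Random(0)
value = absolute_zero_sample(3, toy_proposer, toy_environment, toy_solver,
                             constant_propose_reward, exact_match,
                             lam=2.0, rng=rng)
print(value)  # 0.5 + 2.0 * 1.0
```

Averaging many such samples over different z would give a Monte Carlo estimate of the bracketed expectation; training then adjusts θ to push that average up.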

“Notice that we shift the burden of scaling data away from human experts and onto the proposer policy π_propose_θ and the environment e.” The key point is that the burden of scaling data has moved away from human experts and onto the proposer policy π_propose_θ and the environment e. The AI, not a person, is now the one producing data.

“These two roles are both responsible for defining/evolving the learning task distribution, validating proposed tasks, and providing grounded feedback that supports stable and self-sustainable training.” These two roles (the proposer and the environment) are together responsible for defining and evolving the learning task distribution, validating proposed tasks, and providing grounded feedback that supports stable, self-sustaining training.

“When proposing, z acts as a conditional variable that seeds generation of tasks. Practically, z can be instantiated by sampling a small subset of past (task, answer) pairs from a continually updated task memory, yet there is no specific implementation tied to the paradigm.” When proposing, z is a conditioning variable that seeds task generation. In practice, z can be instantiated by sampling a small subset of past (task, answer) pairs from a continually updated task memory, but the paradigm is not tied to any specific implementation. It leaves room for flexibility.
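Here is one way the “task memory” idea could look in Python. Since the paper says no specific implementation is tied to the paradigm, treat this as a sketch only: the class name `TaskMemory`, its methods, and the example entries are all my own hypothetical choices.

```python
import random

class TaskMemory:
    """A continually updated store of past (task, answer) pairs."""

    def __init__(self, initial_items=()):
        self.items = list(initial_items)

    def add(self, task, answer):
        """Record a newly validated (task, answer) pair."""
        self.items.append((task, answer))

    def sample_z(self, k, rng):
        """Sample up to k distinct past pairs to condition the proposer on."""
        k = min(k, len(self.items))
        return rng.sample(self.items, k)

rng = random.Random(0)
memory = TaskMemory()
memory.add("reverse the list [1, 2, 3]", [3, 2, 1])
memory.add("sum the list [4, 5]", 9)
memory.add("length of 'abc'", 3)

z = memory.sample_z(2, rng)   # a small subset that seeds the next proposal
print(len(z))
```

Because the memory grows as training proceeds, the conditioning variable z drifts along with the tasks the model has already produced.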

“To guide the proposing process, we use a learnability reward r_propose(τ, π_θ), which measures how much the model is expected to improve by solving a proposed task τ.” To guide the proposing process, a learnability reward r_propose(τ, π_θ) is used. It measures how much the model is expected to improve by solving a proposed task τ. The yardstick is “how much is there to learn from this problem?”

“Moreover, the solver reward r_solve(y, y⋆) evaluates the correctness of the model’s output.” The solver reward r_solve(y, y⋆) evaluates the correctness of the model's output. It asks “how well was the problem solved?”

“Together, these two signals guide the model to propose tasks that are both challenging and learnable, while also enhancing its reasoning abilities, ultimately enabling continuous improvement through self-play.” Together, these two signals (the learnability reward and the solution reward) guide the model to propose tasks that are both challenging and learnable while also strengthening its reasoning, which ultimately enables continuous improvement through self-play. A clever system, isn't it?

In today's Lecture 3 we looked closely at the core philosophy and the mathematical definition of the Absolute Zero paradigm. You should now understand the limits of SFT and RLVR, how Absolute Zero improves by creating and solving its own problems without external data, and how its objective function is put together. The loop in Figure 3 and Equation (3) are the heart of the paradigm!

Lecture 4

Last time we saw the big picture of the Absolute Zero paradigm. Starting today, we dig into the concrete details of the “Absolute Zero Reasoner”, AZR for short, which actually implements it. Keep your eyes open as we see how AZR plays two roles at once and how it learns to reason!

3. Absolute Zero Reasoner

“In this section, we present Absolute Zero Reasoner (AZR) as the first attempt to embrace the Absolute Zero Paradigm.” This section presents the Absolute Zero Reasoner (AZR) as the first attempt to put the Absolute Zero paradigm into practice. Theory turned into something real!

“In AZR, an unified LLM serves as both a proposer and a solver: it generates tasks to evolve its learning curriculum and attempts to solve them to improve its reasoning capabilities.” In AZR, “one unified LLM” acts as both the “proposer” and the “solver”. It generates tasks to evolve its own learning curriculum, and it tries to solve those tasks to improve its reasoning. One model runs the whole show.

“The model is trained jointly with both roles, learning to create tasks that push the boundary of reasoning capacity while enhancing its ability to solve them effectively (Section 3.1).” The model is “trained jointly” in both roles. It learns to create tasks that push the boundary of its reasoning capacity while also getting better at solving them. Section 3.1 covers this in detail.

“Within this self-play training paradigm, the model learns from three distinct types of coding tasks, which corresponding to three fundamental modes of reasoning: abduction, deduction and induction (Section 3.2).” Within this self-play training paradigm, the model learns from three distinct types of “coding tasks”. They correspond to three fundamental modes of reasoning: abduction, deduction, and induction. Section 3.2 covers the details. Learning to reason through coding is an interesting angle!

“Using coding tasks is motivated by the Turing-completeness of programming languages (Stuart, 2015) and empirical evidence that code-based training improves reasoning (Aryabumi et al., 2024).” Why coding tasks in particular? Because programming languages are “Turing-complete”, meaning they can express any computable function (Stuart, 2015). There is also empirical evidence that code-based training improves reasoning (Aryabumi et al., 2024). So code can express anything computable, and training on it helps reasoning.

“We adopt code as an open-ended, expressive, and verifiable medium for enabling reliable task construction and verification (Section 3.3).” So code is adopted as an open-ended, expressive, and verifiable medium, which makes reliable task construction and verification possible. Section 3.3 goes deeper.

“Finally, the model is updated using a newly proposed advantage estimator designed for multitask learning (Section 3.3.5).” Finally, the model is updated with a newly proposed advantage estimator designed for multitask learning. That is covered in Section 3.3.5.

“We outline the overall algorithm in Algorithm 1 and highlight an illustration of our Absolute Zero Reasoner approach in Figure 4.” The overall algorithm is outlined in Algorithm 1, and the AZR approach is illustrated in Figure 4. Let's start with Figure 4 to get the big picture of how AZR runs.

(Explanation of Figure 4. Absolute Zero Reasoner Training Overview)

Look at Figure 4, titled “Absolute Zero Reasoner Training Overview”. It is the overall flow of AZR training.

At the top is the unified model, the “Absolute Zero Reasoner”. It carries out two key phases: “PROPOSE” and “SOLVE”.

PROPOSE:
- Under “Task Types”, one of three task types is selected: Abduction, Deduction, or Induction.
- The model “proposes” new tasks conditioned on the task type and on a buffer of past self-generated triplets.
- Python is used to filter the proposals and construct valid code-based reasoning questions.
- A “learnability reward” (r_propose) is computed for each proposed task. It evaluates how much the task is worth learning from. (Defined in Equation 4.)
SOLVE:
- The model tries to “solve” the reasoning questions built in the propose phase.
- Python is used again to verify the generated answers and compute the “accuracy reward” (r_solve). (Described in Equation 5.)
Joint Update:
- Finally, both rewards, r_propose and r_solve, are used to “jointly update” the Absolute Zero Reasoner across all three task types. A method called TRR++ is used for this. (Section 3.3.5)

This PROPOSE, SOLVE, Joint Update cycle repeats, and the model gets smarter each time around. On the right, the figure also gives a short description of each task type.

The figure uses a compact shorthand for each task type; I'll state them the way the body of the paper defines them.
Induction: given input/output examples, infer (synthesize) the program.
Abduction: given a program and an output, infer an input that produces that output.
Deduction: given a program and an input, infer the output.
(Note: wherever the figure's shorthand and the body text seem to differ, go by the body text. That is what I follow here.)
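A small Python example makes the three task types easier to tell apart. Everything here is my own illustration: the toy `program`, the checker function names, and the exact checking rules are assumptions, not the paper's implementation. Each checker hides one element of the (program, input, output) triplet and runs code to verify a guess.

```python
def program(xs):
    """A toy program: sort a list in descending order."""
    return sorted(xs)[::-1]

program_input = [3, 1, 2]
output = program(program_input)          # the third element of the triplet

def check_deduction(predicted_output):
    """Deduction: program and input are given; the output is predicted."""
    return predicted_output == program(program_input)

def check_abduction(candidate_input):
    """Abduction: program and output are given; an input is proposed.

    Any input that reproduces the output is accepted in this sketch.
    """
    return program(candidate_input) == output

def check_induction(candidate_program, examples):
    """Induction: input/output examples are given; a program is proposed."""
    return all(candidate_program(i) == o for i, o in examples)

examples = [([3, 1, 2], [3, 2, 1]), ([5, 4], [5, 4])]

print(check_deduction([3, 2, 1]))
print(check_abduction([2, 3, 1]))    # a different input with the same output
print(check_induction(lambda xs: sorted(xs, reverse=True), examples))
```

In all three cases the check is done by executing code, which is what makes the reward verifiable.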

“To expedite future exploration in this area, we also present several attempts that did not yield fruitful results but still warrant discussion in Appendix D.” To speed up future exploration in this area, the authors also present, in Appendix D, several attempts that did not produce good results but are still worth discussing. Sharing failures too is a very healthy attitude!

3.1. Two Roles in One: Proposer and Solver

Now let's look closely at how AZR really works: how one model plays both the proposer and the solver.

“Large language models are naturally suited for implementing AZR in a multitask learning context (Radford et al., 2019), as both the formulation of reasoning tasks and their solutions occur within a unified language space.” LLMs are naturally suited to implementing AZR in a multitask learning setting (Radford et al., 2019), because both formulating reasoning tasks and solving them happen within a “unified language space”. Problems are written in language and answers are given in language, so an LLM can handle both.

“To this end, we propose rewarding a single model for both generating high learning potential tasks and solving them effectively, as specified by the Absolute Zero objective in Equation (3).” To this end, the authors propose rewarding a single model both for generating tasks with high learning potential and for solving them effectively. This is exactly what the Absolute Zero objective (Equation 3) from Lecture 3 specifies.

“At each iteration of the online rollout, AZR proposes new reasoning tasks by conditioning on the task type (as defined in Section 3.2) and K past self-generated examples.” At each iteration of the online rollout, AZR proposes new reasoning tasks conditioned on the task type (defined in Section 3.2) and K past self-generated examples. It builds new problems on top of past experience.

“The model is explicitly prompted to generate tasks that differ from these examples, promoting diversity and broader coverage of the task space.” The model is explicitly prompted to generate tasks that “differ” from these examples. The aim is diversity and broader coverage of the task space. Solving the same problem over and over is no fun, right?

“These task proposals are filtered and transformed into valid reasoning tasks that can be verified using the environment, outlined later in Section 3.3.” These proposals are filtered and transformed into valid reasoning tasks that can be verified with the environment. Section 3.3 explains how.

“AZR then attempts to solve these newly proposed tasks, receiving grounded feedback for its model responses.” AZR then tries to solve the newly proposed tasks and receives grounded feedback on its responses.

“Both task proposal and problem solving are trained using reinforcement learning. We now outline the rewards used for each role.” Both task proposal and problem solving are trained with reinforcement learning. Next come the rewards used for each role.

“Reward Design. Prior work has shown that setting appropriate task difficulty is critical for promoting effective learning in reasoning systems (Zeng et al., 2025b).” On to reward design. Prior work (Zeng et al., 2025b) showed that choosing an appropriate task difficulty is critical for effective learning in reasoning systems. Too easy or too hard, and the learning effect drops.

“Motivated by this, we design a reward function for the proposer that encourages generation of tasks with meaningful learning potential—neither too easy nor unsolvable for the current solver.” With that motivation, the authors design a proposer reward that favors tasks with real learning potential: neither too easy nor unsolvable for the current solver. The proposer is nudged toward problems of just the right difficulty.

“Concretely, we use the same language model in its solver role to estimate the learnability of a proposed task, a similar type of reward used in unsupervised environment design literature (Sukhbaatar et al., 2018).” Concretely, the "same language model", now in its solver role, is used to estimate how learnable a proposed task is. A similar type of reward appears in the unsupervised environment design literature (Sukhbaatar et al., 2018). The learning value of a problem is judged by how well the solver does when it actually attempts it.

“We perform n Monte Carlo rollouts of the solver and compute the average success rate: r̄_solve = (1/n) Σ r^(i)_solve.” The solver is run n times (n Monte Carlo rollouts) and the average success rate r̄_solve is computed. Let it try several times and see how often it succeeds on average.

“The proposer’s reward is then defined as: r_propose = 0, if r̄_solve = 0 or r̄_solve = 1; 1 − r̄_solve, otherwise. (4)” This is equation (4), the proposer's reward r_propose. If the average success rate r̄_solve is 0 (never solved) or 1 (always solved), the reward is 0. Otherwise the reward is 1 − r̄_solve. What does that mean? Tasks at either extreme pay nothing. For everything in between (0 < r̄_solve < 1), the lower the success rate, the higher the reward, so the proposer earns the most for hard tasks that the solver can still solve at least once.

“The intuition is that if a task is either trivial to solve (r̄_solve = 1) or unsolvable (r̄_solve = 0), the task provides little to no learning signal for the proposer.” The intuition: a task that is trivial (r̄_solve = 1) or unsolvable (r̄_solve = 0) gives the proposer little or no learning signal.

“In contrast, tasks of moderate difficulty, where the solver occasionally succeeds are rewarded the most, as they offer the richest feedback and greatest potential for learning.” By contrast, tasks of moderate difficulty, where the solver succeeds only occasionally, are rewarded the most because they offer the richest feedback and the greatest learning potential. This lines up with equation (4): "occasionally succeeds" means a low but nonzero success rate, which is exactly where 1 − r̄_solve is largest. The reward does not peak at a 50% success rate; it peaks just above 0% and collapses to 0 at exactly 0%. The key point is that the two extremes are avoided.
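To make equation (4) concrete, here is a minimal sketch of the proposer reward computed from a list of solver rollout outcomes. The function name `proposer_reward` and the 0/1 list representation are illustrative assumptions, not the authors' code.

```python
def proposer_reward(solve_outcomes):
    """Sketch of equation (4): learnability reward from n solver rollouts.

    solve_outcomes is a list of 0/1 values, one per Monte Carlo rollout
    (hypothetical representation chosen for this example).
    """
    if not solve_outcomes:
        raise ValueError("need at least one rollout")
    mean_success = sum(solve_outcomes) / len(solve_outcomes)
    # Never solved or always solved: no learning signal, so no reward.
    if mean_success == 0 or mean_success == 1:
        return 0.0
    # Otherwise, harder (lower success rate) means higher reward.
    return 1.0 - mean_success


print(proposer_reward([1, 0, 0, 0]))  # solved once in four tries
print(proposer_reward([1, 1, 1, 0]))  # solved three times in four tries
```

With four rollouts, a task solved once scores higher than a task solved three times, and a task solved zero or four times scores nothing.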

“For the solver, we assign a simple binary reward based on the correctness of its final output, r_solve = I(y = y⋆). (5)” The solver gets a simple binary reward based on whether its final output is correct (equation 5): r_solve = I(y = y⋆). Here I is the indicator function: 1 if y (the model's answer) equals y⋆ (the reference answer), 0 otherwise. One point for a hit, zero for a miss. Very simple.

“where y⋆ is the ground-truth answer, and equality is evaluated based on value equality in Python.” y⋆ is the ground-truth answer, and "equal" means value equality in Python, not string matching.

“With the primary rewards for the proposing and solving roles defined, we adopt the following composite reward structure, which integrates r_propose and r_solve with a format-aware penalty inspired by DeepSeek-AI et al. (2025):” With the main rewards for the two roles defined, the authors combine r_propose and r_solve with a format-aware penalty inspired by DeepSeek-AI et al. (2025), giving the following composite reward.

R(y_π) =
    r_role, if the response is passable, role ∈ {propose, solve};
    -0.5, if the response is wrong but well-formatted;
    -1, if the answer has formatting errors. (6)

Let's read equation (6). It is the final reward R(y_π) for a model response y_π.

If the response is "passable", it receives the reward for its role (propose or solve), r_role.
If the response is wrong but well-formatted, it receives -0.5.
If the answer has formatting errors, it receives -1. So getting the answer right matters, but the reward also teaches the model to follow the required format.

“where y_π is the response of the language model. The main format that the proposing and solving tasks need to follow is the DeepSeek R1 <think> and <answer> format, as shown in Figure 33.” y_π is the language model's response. The main format that both proposing and solving must follow is the DeepSeek R1 <think> and <answer> format shown in Figure 33: the reasoning goes inside one tag and the final answer inside the other.

“Moreover, for the proposer, the reward criterion for format goes beyond simply following the XML structure. As detailed in Section 3.3.3, only responses that produce valid triplets and pass the filtering stage are considered to be correctly formatted.” For the proposer, "well-formatted" means more than following the XML structure. As Section 3.3.3 explains, only responses that yield a valid (program, input, output) triplet and pass the filtering stage count as correctly formatted. The proposer has to get the content right as well as the wrapper.
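The three branches of equation (6) can be sketched as a small function. How "passable" is decided is not spelled out in the text above, so this sketch simply takes two booleans computed elsewhere; the name `composite_reward` and both flags are assumptions made for illustration.

```python
def composite_reward(role_reward, well_formatted, passable):
    """Sketch of equation (6).

    role_reward: r_propose or r_solve for this response.
    well_formatted, passable: hypothetical flags produced by upstream checks;
    the source does not define exactly how "passable" is determined.
    """
    if not well_formatted:
        return -1.0   # formatting errors
    if not passable:
        return -0.5   # wrong but well-formatted
    return role_reward  # passable: use the role-specific reward


print(composite_reward(0.75, well_formatted=True, passable=True))
```

Ordering the checks this way makes the format penalty take precedence, which matches the idea that a malformed response cannot be scored for content at all.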

3.2. Learning Different Modes of Reasoning: Deduction, Induction, and Abduction

Now let's look at the three modes of reasoning AZR learns: deduction, induction, and abduction.

“AZR uses code executor as both a flexible interface and a verifiable environment.” AZR uses a code executor both as a flexible interface and as a verifiable environment. This is a very important point.

“This setup enables automatic construction, execution, and validation of code reasoning tasks (Stuart, 2015; Aryabumi et al., 2024).” With this setup, code reasoning tasks can be constructed, executed, and validated automatically. These are references we met earlier.

“Given program space P, input space I and output space O of a coding language, we define an AZR reasoning task as a triplet (p, i, o), where p ∈ P is a program, i ∈ I is an input, and o ∈ O is the corresponding output produced by running program on input, o = p(i).” Given a programming language's program space P, input space I, and output space O, an AZR reasoning task is a triplet (p, i, o): p is a program, i is an input, and o is the output produced by running p on i (o = p(i)). This triplet is the basic building block of every task.

“AZR learns by reasoning about different parts of this task triplet, using three distinct core reasoning modes, each of which focuses on inferring one part of the triplet given the others:” AZR learns by reasoning about different parts of this triplet. There are three core modes, and each one infers one element of the triplet from the other two.

Deduction: “predicting the output o given a program p and input i, capturing step-by-step logical reasoning.” Given a program p and an input i, predict the output o. This captures step-by-step logical reasoning.
- “As a proposer, AZR is conditioned on the task type α = deduction and K reference examples from the deduction buffer D_deduction (all task buffers are outlined in Section 3.3), and generates a pair (p, i).” As proposer, AZR is conditioned on the task type α = deduction and K reference examples from the deduction buffer D_deduction, and generates a pair (p, i).
- “The environment e then executes p(i) to compute o, completing the triplet (p, i, o), which is added to the buffer if non-error output was produced.” The environment e then runs p(i) to compute o and completes the triplet (p, i, o). If the run produces an output without error, the triplet goes into the buffer.
- “As a solver, the model receives (p, i) and predicts the output o_π.” As solver, the model receives (p, i) and predicts the output o_π.
- “The predicted output is verified using type-aware value equality in python to account for possible variations (such as set ordering or fractions).” The prediction is checked with type-aware value equality in Python, which tolerates variations such as set ordering or fractions. It is a comparison of values, not of strings.
Abduction: “inferring a plausible input i given the program p and an output o, resembling trial-and-error or online search.” Given a program p and an output o, infer a plausible input i. This resembles trial-and-error or online search.
- “As a proposer, the policy π_propose’s input and output is almost the same as the proposer for the deduction task, except that the task type α = abduction is changed as an input.” As proposer, the policy π_propose has almost the same input and output as in the deduction case; only the task type changes to α = abduction. It again generates a pair (p, i).
- “Then we execute p(i) and get the triplet (p, i, o).” Running p(i) then yields the triplet (p, i, o).
- “As a solver, the model receives (p, o) and predicts i_π.” As solver, the model receives (p, o) and predicts an input i_π.
- “The solution is verified by checking whether p(i_π) = o. Since programs may not be bijective, we use output value equivalence rather than requiring exact input matches.” The solution is verified by checking whether p(i_π) = o. Because a program need not be bijective (several inputs can map to the same output), the check compares outputs instead of demanding the exact original input. A different input that produces the same output counts as correct.
Induction: “synthesizing a program p from a set of in-out examples {(i_n, o_n)}, requiring generalization from partial information.” From a set of input-output examples {(i_n, o_n)}, synthesize a program p. This requires generalizing from partial information.
- “As a proposer, AZR samples a valid program p from D_abduction ∪ D_deduction, generates N new inputs and a message m, and uses the environment to compute corresponding outputs. This forms an extended task representation (p, {(i_n, o_n)}, m), which is stored in the induction buffer D_induction.” As proposer, AZR samples a valid program p from the abduction or deduction buffer, generates N new inputs and a message m, and uses the environment to compute the matching outputs. The resulting extended representation (p, {(i_n, o_n)}, m) is stored in the induction buffer D_induction.
- “Since infinitely many functions can map the inputs to the outputs, making the induction task under-constrained, the message m helps properly condition the problem for the solver.” Infinitely many functions can map the given inputs to the given outputs, so the induction task is under-constrained. The message m conditions the problem for the solver; think of it as a hint that narrows down the possible answers.
- “As a solver, the model is shown the first half of the input-output pairs and the message m, and must synthesize a program p_π that correctly maps the remaining hidden inputs to their outputs.” As solver, the model sees the first half of the input-output pairs plus the message m, and must synthesize a program p_π that maps the remaining hidden inputs to their outputs correctly.
- “The use of held-out examples discourages overfitting through if-else logic and promotes generalized induction.” Using held-out examples discourages overfitting through if-else logic and promotes genuine generalization. It blocks trick programs that only memorize the pairs they were shown.
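The three modes above all come from the same triplet, just with a different element hidden. The sketch below shows one way to represent that; the `Triplet` class, `solver_view`, and `induction_task` are hypothetical names for this example, and programs are plain Python callables rather than source strings to keep it short.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Triplet:
    program: Callable[[Any], Any]
    task_input: Any
    output: Any


def make_triplet(program, task_input):
    """Complete (p, i) into (p, i, o) by executing the program: o = p(i)."""
    return Triplet(program, task_input, program(task_input))


def solver_view(triplet, mode):
    """What the solver is shown; the missing element is what it must infer."""
    if mode == "deduction":   # given (p, i), predict o
        return {"program": triplet.program, "input": triplet.task_input}
    if mode == "abduction":   # given (p, o), predict i
        return {"program": triplet.program, "output": triplet.output}
    raise ValueError(f"unknown mode: {mode}")


def induction_task(program, inputs, message):
    """Given examples and a message, the solver must synthesize p.

    Returns (shown, hidden): the first half of the pairs plus the message
    is shown, the second half is held out for verification.
    """
    pairs = [(x, program(x)) for x in inputs]
    half = len(pairs) // 2
    return {"examples": pairs[:half], "message": message}, pairs[half:]


def double(x):
    return x * 2


shown, hidden = induction_task(double, [1, 2, 3, 4], "doubles the input")
print(shown["examples"], hidden)
```

Holding back the second half is what makes a hard-coded if-else over the visible pairs fail at verification time.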

Now, let's take a quick look at Figure 5.

(Walkthrough of Figure 5. The Seed AZR Zero Triplet)

Figure 5 shows the "Seed AZR Zero Triplet":

    Input: "HelloWorld"
    Program:
        def f(x):
            return x
    Output: "HelloWorld"

This is the "only" triplet handed to AZR so that it can self-bootstrap, that is, start learning on its own. The input is "HelloWorld", the program is the identity function f(x) = x that returns its input unchanged, and so the output is of course "HelloWorld" as well.

“The above identity function triplet was the only triplet provided to AZR to initiate its self-bootstrap propose-and-solve RLVR loop.” This identity-function triplet was the only triplet given to AZR to kick off its propose-and-solve RLVR loop. It really does start from close to "Zero".

“We note that the base LLM is fully capable of initiating the AZR loop without any seed program; its inclusion illustrates our approach’s flexibility: we can optionally initialize seed programs with existing datasets of varying complexity, and we initialized ours with the simplest program.” The authors note that the base LLM could start the AZR loop without any seed program at all. The seed is included to show the approach's flexibility: seed programs could optionally be initialized from existing datasets of varying complexity, and this work chose the simplest possible program.

Back to the main text.

“Each reasoning task type leverages code as an expressive and verifiable medium, aligning with the AbsoluteZero Paradigm’s goals of fully self-improving systems in open-ended domains (DeepSeek-AI et al., 2025; Lambert et al., 2024).” Each task type uses code as an expressive and verifiable medium, which fits the AbsoluteZero paradigm's goal of fully self-improving systems in open-ended domains.

“All prompts used by three different task types and two types of roles within a task type are shown in Figures 34 to 39.” All prompts for the three task types and the two roles within each (proposer and solver) appear in Figures 34 to 39. If you are curious about the exact instructions the model receives, those figures are the place to look.

“Next, we outline exact details of our algorithm.” Next come the exact details of the algorithm. We will cover those in Lecture 5!

In Lecture 4 we saw how the AbsoluteZero Reasoner (AZR) plays both proposer and solver with a single LLM, and which reward scheme it uses. We also saw how it learns three modes of reasoning (deduction, abduction, and induction) through coding tasks. Figure 4 showed the whole process at a glance, and Figure 5 showed that all of it starts from one very simple seed.

Lecture 5

Today we dig, step by step, into how the AbsoluteZero Reasoner learning algorithm actually runs. With Algorithm 1 as our map, we go from buffer initialization through task proposal and validation to the final model update!

3.3. AbsoluteZero Reasoner Learning Algorithm

“In this section, we will discuss details of our AZR self-play algorithm, including initialization of buffers 3.3.1, usage of these buffers 3.3.2, construction of valid tasks 3.3.3, validating solutions 3.3.4, and finally advantage estimator calculation 3.3.5.” This section covers the details of the AZR self-play algorithm in order: buffer initialization (3.3.1), how the buffers are used (3.3.2), constructing valid tasks (3.3.3), validating solutions (3.3.4), and finally computing the advantage estimator (3.3.5).

“We outline the overall recipe of the self-play procedure of AZR in Algorithm 1.” The overall recipe for AZR's self-play procedure is summarized in Algorithm 1. So let's read Algorithm 1 and get the big picture!

(Walkthrough of Algorithm 1. Self-Play Training of AbsoluteZero Reasoner (AZR))

The title of Algorithm 1 is "Self-Play Training of AbsoluteZero Reasoner (AZR)".

Require:

Pretrained base LLM π_θ: the pretrained base language model
batch size B: the batch size
#references K: the number of reference examples, K
iterations T: the total number of iterations, T

1: D_ded, D_abd, D_ind ← InitSeeding(π_θ) (▷ see §3.3.1) First, the buffers for the three reasoning types (deduction, abduction, induction), D_ded, D_abd, and D_ind, are initialized by the InitSeeding function. Section 3.3.1 has the details.

2: for t ← 1 to T do (repeat T times) This is the main training loop.

3: for b ← 1 to B do (repeat for the batch size B) ▷ PROPOSE PHASE

4: p ∼ D_abd ∪ D_ded (▷ sample a program for induction task proposal) Sample one program p from the abduction (D_abd) or deduction (D_ded) buffer. It will be used to propose an induction task.

5: {i_π^(n)}_(n=1)^N, m_π ← π_propose_θ(ind, p) (▷ generate N inputs and a description) The proposer π_propose_θ, conditioned on the induction task type (ind) and the sampled program p, generates N inputs {i_π^(n)} and a description m_π.

6: if {(i_π^(n), o_π^(n))}_(n=1)^N ← ValidateByExecuting(p, {i_π^(n)}, syntax) then (▷ validate I/Os, see §3.3.3) The generated inputs are run through program p to check that they give valid input/output (I/O) pairs with no syntax errors. (See Section 3.3.3.)

7: D_ind ← D_ind ∪ (p, {(i_π^(n), o_π^(n))}, m_π) (▷ update induction buffer) If validation passes, (program p, the I/O pairs, description m_π) is added to the induction buffer D_ind.

8: for α ∈ {ded, abd} do (run once for deduction and once for abduction; this is still inside the b loop)

9: {(p_k, i_k, o_k)}_(k=1)^K ∼ D_α (▷ sample K reference examples) Sample K reference examples (program p_k, input i_k, output o_k) from the buffer D_α of the current task type α.

10: (p_π, i_π) ← π_propose_θ(α, {(p_k, i_k, o_k)}) (▷ propose new task) The proposer π_propose_θ, conditioned on the task type α and the K reference examples, proposes a new task (program p_π, input i_π).

11: if o_π ← ValidateByExecuting(p_π, i_π, syntax, safety, determinism) then (▷ see §3.3.3) The proposed program p_π is executed on input i_π, checking syntax, safety, and determinism, and yielding the output o_π. (See Section 3.3.3.)

12: D_α ← D_α ∪ (p_π, i_π, o_π) (▷ if valid, update deduction or abduction buffers) If validation passes, the triplet (program p_π, input i_π, output o_π) is added to the buffer D_α of that task type.

13: for all α ∈ {ded, abd, ind} do (run for all three task types) ▷ SOLVE PHASE

14: (x, y⋆) ← SamplePrepareTasks(D_α, B, t) (▷ x, y⋆ prepared based on α, see §3.3.3 & 3.3.4) From the buffer D_α of each task type, B problems are sampled and prepared for the solve phase as a question x and a reference answer y⋆. (See Sections 3.3.3 and 3.3.4.)

15: y_π ∼ π_solve_θ(x) The solver π_solve_θ generates an answer y_π to the question x.

16: Reward: Use proposed task triplets and solved answers to get r_propose & r_solve (▷ see §3.1) The proposed triplets and the solver's answers are used to compute the proposer reward r_propose and the solver reward r_solve. (See Section 3.1.)

17: RL update: use TaskRelativeREINFORCE++ to update π_θ (▷ see §3.3.5) The model π_θ is updated with a method called Task-Relative REINFORCE++ (TRR++). (See Section 3.3.5.)

The PROPOSE phase, the SOLVE phase, and the final RL update repeat over and over, and that is how AZR learns. It is more involved than you might expect, but each step is clear, right?
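The control flow of one outer iteration (lines 3 to 17) can be sketched with every model call replaced by a stub. This is a simplified illustration, not the authors' implementation: all function and parameter names are hypothetical, programs are Python callables, and the solve phase samples uniformly from the buffers instead of prioritizing the tasks proposed in the current step.

```python
import random


def azr_iteration(buffers, propose, propose_induction, validate,
                  solve, score, update, batch_size, num_refs):
    """One simplified pass over the propose phase, solve phase, and update."""
    # PROPOSE PHASE
    for _ in range(batch_size):
        # Induction: reuse an existing program, ask for inputs and a message.
        program, _, _ = random.choice(buffers["abd"] + buffers["ded"])
        inputs, message = propose_induction(program)
        results = [validate(program, x) for x in inputs]
        if all(ok for ok, _ in results):
            io_pairs = [(x, out) for x, (_, out) in zip(inputs, results)]
            buffers["ind"].append((program, io_pairs, message))
        # Deduction and abduction: propose a new (program, input) pair.
        for mode in ("ded", "abd"):
            refs = random.sample(buffers[mode], min(num_refs, len(buffers[mode])))
            new_program, new_input = propose(mode, refs)
            ok, output = validate(new_program, new_input)
            if ok:
                buffers[mode].append((new_program, new_input, output))
    # SOLVE PHASE
    rollouts = []
    for mode in ("ded", "abd", "ind"):
        tasks = random.sample(buffers[mode], min(batch_size, len(buffers[mode])))
        for task in tasks:
            answer = solve(mode, task)
            rollouts.append((mode, task, answer, score(mode, task, answer)))
    # RL UPDATE (stubbed): hand all rollouts to the optimizer.
    update(rollouts)
    return rollouts


def identity(x):
    return x


demo_buffers = {"ded": [(identity, 1, 1)], "abd": [(identity, 1, 1)], "ind": []}
demo_rollouts = azr_iteration(
    demo_buffers,
    propose=lambda mode, refs: (identity, 2),
    propose_induction=lambda program: ([1, 2], "returns its input"),
    validate=lambda program, x: (True, program(x)),
    solve=lambda mode, task: None,
    score=lambda mode, task, answer: 0.0,
    update=lambda rollouts: None,
    batch_size=2,
    num_refs=1,
)
print(len(demo_rollouts), {k: len(v) for k, v in demo_buffers.items()})
```

Passing the model calls in as arguments keeps the loop structure visible without pretending to know how the real proposer, solver, or optimizer are wired up.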

Now let's look at each component of the algorithm in detail.

3.3.1. BUFFER INITIALIZATION

“To initialize AZR self-play, we first generate a seed set of valid triplets using the base language model.” To initialize AZR self-play, the base language model first generates a "seed set" of valid triplets. Training cannot start from an empty pool, so the base model writes a few problems first.

“Each prompt samples up to K triplets from the current seed buffer D_seed as references.” Each prompt samples up to K triplets from the current seed buffer D_seed as references.

“When D_seed is empty at time 0, we fallback to the zero triplet shown in Figure 5.” If D_seed is empty at time 0, the "zero triplet" from Figure 5 (the identity function) is used instead. It is the minimal starting point for when there is truly nothing else.

“During the seeding stage, we use the same proposer prompts detailed in Figures 34 to 36.” The seeding stage uses the same proposer prompts detailed in Figures 34 to 36.

“First, for deduction and abduction tasks, the LLM is prompted to generate (p, i) pairs, which are filtered, executed, and stored as valid triplets.” First, for deduction and abduction tasks, the LLM is prompted to generate (program, input) pairs, which are filtered, executed, and stored as valid triplets.

“We initialize D^0_abduction = D^0_deduction = D_seed, where |D_seed| = B × S, where B is the batch size, and S=4 is a factor we fix in all experiments.” The initial abduction and deduction buffers are both set to D_seed, whose size is the batch size B times S. S is a factor fixed at 4 in all experiments.

“All seed triplet’s program are stripped of global variables and comments (Appendix C), but subsequent iterations of adding new triplets to the buffers are unaltered.” Every seed triplet's program has its global variables and comments stripped (see Appendix C). Triplets added to the buffers in later iterations are stored unaltered. The intent seems to be to start from cleaner problems.

“No model updates occur during this phase.” No model updates happen during seeding. It is purely about building the initial problem pool.

“Similarly, to initialize the induction buffer, we sample programs from D_seed, generate matching input sets and messages, and collect valid examples until |D^0_induction| = B × S.” Likewise, to initialize the induction buffer, programs are sampled from D_seed, matching input sets and messages are generated, and valid examples are collected until the induction buffer reaches size B × S.

3.3.2. Task Proposal Inputs and Buffer Management

“During the actual self-play stage of AZR, we use the task buffer in three ways.” During the actual self-play stage, the task buffer is used in three ways.

“First, for the proposer of abduction and deduction tasks, we uniformly sample K past triplets from the buffer, present them as in-context examples to the proposer and let it generate a new task.” First, for the proposer of abduction and deduction tasks, K past triplets are sampled uniformly from the buffer and shown to the proposer as in-context examples, and the proposer then generates a new task.

“The design is to show it past examples, and prompt it to generate a different one to promote diversity (Zhao et al., 2025a).” The design intent is to show past examples and then prompt for a "different" one, to promote diversity (Zhao et al., 2025a). So the message is "here are old problems, now write one that is not like them", not "write more of the same".

“Second, we sample one triplet from the union of abduction and deduction buffers D_abd ∪ D_ded, and present the program p from that triplet to the induction proposer to generate a set of N matching inputs and a natural language message m.” Second, one triplet is sampled from the union of the abduction and deduction buffers, and its program p is given to the induction proposer, which generates a set of N matching inputs {i_n} and a natural-language message m. Induction problems are built on top of programs that already exist.

“Lastly, to maintain stable training, if a batch of solver problems contains fewer than B valid proposed tasks (proposer not adhering to formatting), we fill the remainder by uniformly sampling from the corresponding task buffer of previously validated triplets.” Lastly, to keep training stable, if a batch of solver problems contains fewer than B valid proposed tasks (because the proposer did not follow the format), the remainder is filled by sampling uniformly from the corresponding buffer of previously validated triplets. If there are not enough new problems, old ones top up the batch so training can continue.

“The buffer grows for abduction and deduction tasks whenever π proposes a valid triplet (p, i, o), regardless if it gets any task reward.” The abduction and deduction buffers grow whenever the proposer produces a valid triplet (p, i, o), whether or not it earns any task reward. Any usable problem gets saved.

“Similarly, for induction tasks, all valid triplets (p, {(i_n, o_n)}, m) are added to the buffer.” Likewise, for induction tasks every valid triplet (p, {(i_n, o_n)}, m) is added to the buffer.
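The batch top-up rule in the third use of the buffer can be sketched as follows. The name `fill_batch` is hypothetical, and sampling with replacement via `random.choices` is an assumption; the text only says the sampling is uniform.

```python
import random


def fill_batch(new_tasks, buffer, batch_size, rng=random):
    """Use the newly proposed valid tasks first, then top up from the buffer.

    Assumption: uniform sampling WITH replacement, so the top-up also works
    when the buffer holds fewer tasks than the shortfall.
    """
    batch = list(new_tasks[:batch_size])
    shortfall = batch_size - len(batch)
    if shortfall > 0 and buffer:
        batch.extend(rng.choices(buffer, k=shortfall))
    return batch


print(fill_batch(["new_task"], ["old_a", "old_b"], batch_size=3))
```

Keeping the batch at a fixed size means the number of solver rollouts per update does not shrink just because the proposer had a bad round.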

3.3.3. Constructing Valid Tasks

“Proposal Task Validation. We first describe how we construct valid tasks from the proposals generated by the policy π.” Proposal task validation. We start with how valid tasks are constructed from the proposals generated by the policy π.

“For deduction and abduction tasks, each proposal consists of a program and an input (p, i). To validate the task, we use the task validation procedure (steps shown below) on the input to obtain the correct output o, resulting in a complete triplet (p, i, o).” For deduction and abduction tasks, each proposal is a program and an input, (p, i). To validate it, the task validation procedure described below is applied to the input to obtain the correct output o, which gives a complete triplet (p, i, o).

“For induction tasks, given a program p the policy proposes a set of inputs {i_n} and message m. We also use the task validation procedure on each of the input i_n in the set to obtain a corresponding output o_n, forming a set of input-output pairs {(i_n, o_n)}. We do not impose any constraints on m.” For induction tasks, the policy is given a program p and proposes a set of inputs {i_n} and a message m. The validation procedure is run on each input i_n to obtain its output o_n, forming the set of input-output pairs {(i_n, o_n)}. No constraints are placed on the message m.

“The resulting task is considered valid only when all inputs yield valid outputs and the formatting requirements are satisfied.” The resulting task counts as valid only when every input yields a valid output and the formatting requirements are met.

“The task validation procedure entails:” The task validation procedure consists of the following checks.

Program Integrity: “We first use Python to run the program p with the input i. If no errors are raised and something is returned, we then gather the output o of that (p, i) pair and determine that the program at least has valid syntax.” First, the program p is run on the input i in Python. If no error is raised and something is returned, the output o of that (p, i) pair is collected and the program is judged to have at least valid syntax. It has to run, to begin with.
Program Safety: “We also check whether a program is safe for execution by restricting the use of certain sensitive packages that might cause harm to the Python environment, i.e., os.sys, sys, shutil. The list of packages used to filter out invalid programs is provided in Figure 8. This list is also included in the instructions when prompting the language model to generate questions. See Figures 34 to 36.” The program is also checked for safety by restricting sensitive packages that could harm the Python environment (for example os.sys, sys, shutil). The list of packages used for filtering is given in Figure 8, and the same list is included in the instructions when the language model is prompted to generate questions. Dangerous code is blocked before it can be written.
Check for Determinism:
- “In our setting, we only consider deterministic programs, i.e., p ∈ P_deterministic ⊂ P, where P is the space of all valid programs and I is the space of all valid inputs:” (Equation 7 is omitted here.) “That is, for all inputs i, the output of p(i) remains identical with any independent execution of the program.” Only deterministic programs are considered: for every input i, the output of p(i) must be identical across independent executions. If the result changed from run to run, verification would be impossible.
- “A valid program/input/output triplet (p, i, o) is defined such that o = p(i), where p ∈ P_deterministic.” A valid (p, i, o) triplet is one where o = p(i) and p is a deterministic program.
- “To implement the filtering of invalid probabilistic programs, … we approximate this procedure by independently running the program j finite times and checking that all the outputs are equal. For computational budget reasons, we fixed j=2 for all experiments.” To filter out probabilistic programs, the check is approximated by running the program independently j times and confirming that all outputs are equal. For compute-budget reasons, j = 2 in all experiments: run it twice, and if the two outputs match, treat the program as deterministic.
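The three checks can be combined into one function, sketched below. Several things here are assumptions made for illustration: the program is a source string that defines a function named `f` (as in Figure 5), the banned list contains only the three packages named in the text rather than the full list from Figure 8, the safety check is a crude textual match on import lines, and the input is deep-copied so that a program mutating its argument cannot affect the second run. Note that Python's `exec` is not a sandbox; a real system would need proper isolation.

```python
import copy
import re

# Illustrative subset only; the full list is in the paper's Figure 8.
BANNED_MODULES = ("os", "sys", "shutil")


def validate_task(program_src, task_input, runs=2):
    """Return (is_valid, output) for a proposed (program, input) pair."""
    # Program safety: reject source with an import line naming a banned module.
    pattern = r"^\s*(?:import|from)\s+(?:" + "|".join(BANNED_MODULES) + r")\b"
    if re.search(pattern, program_src, re.MULTILINE):
        return False, None
    outputs = []
    for _ in range(runs):
        namespace = {}
        try:
            # Program integrity: the source must execute and f must return.
            exec(program_src, namespace)
            result = namespace["f"](copy.deepcopy(task_input))
        except Exception:
            return False, None
        if result is None:
            return False, None  # "something is returned"
        outputs.append(result)
    # Determinism: every independent run must give the same output (j = runs).
    if any(out != outputs[0] for out in outputs[1:]):
        return False, None
    return True, outputs[0]


print(validate_task("def f(x):\n    return x", "HelloWorld"))
print(validate_task("import os\ndef f(x):\n    return x", 1))
```

With `runs=2`, a program that calls a random number generator is rejected only if the two draws happen to differ, which is why this is an approximation of determinism and not a proof.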

“Solving Task Construction. If a task proposal passes these three checks, we deem it a valid task and apply appropriate procedures to present part of the triplet to the solver.” Solving task construction. A proposal that passes all three checks is deemed a valid task, and part of its triplet is then presented to the solver.

“Specifically, we set x = (p, i) for deduction;” (deduction: the solver sees the program and the input)
“x = (p, o) for abduction;” (abduction: the solver sees the program and the output)
“and x = ({(i_n, o_n)}_(n=1)^(N//2), m) for induction, where half of the test cases and a program description m is used.” (induction: the solver sees the first half of the test cases and the program description)

“We use all valid tasks from timestep t; if the batch B is not full, we uniformly sample from previously validated tasks to fill the batch.” All valid tasks from timestep t are used. If the batch of size B is not full, it is filled by sampling uniformly from previously validated tasks.

3.3.4. Answer Verification

“For abduction task, we receive i_π from the solver policy, then we equivalence match using p(i_π) = p(i⋆), where ⋆ refers to the privileged gold information. The reason we do not just match i_π and i⋆ is because p is not necessarily bijective.” For abduction, the solver policy returns i_π, and the check is whether p(i_π) = p(i⋆), where ⋆ marks the privileged gold information (i⋆ is the original input). i_π is not compared with i⋆ directly because p is not necessarily bijective. A different input that gives the same output is accepted.

“For deduction task, we match o_π = o⋆.” For deduction, the model's output o_π is compared with the gold output o⋆.

“For induction, we match all ({p_π(i⋆_n) = o⋆_n}_N).” For induction, the synthesized program p_π must produce the correct output o⋆_n for every gold test input i⋆_n.

“This part might be convoluted to explain in language, therefore we recommend the reader to see how we did abduction, deduction and induction verification in code in Figures 10 to 12, respectively.” The authors admit this is awkward to explain in words and recommend reading the verification code for abduction, deduction, and induction in Figures 10 to 12, respectively.
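The three verification rules are short enough to sketch directly. These functions are an illustration written for this lecture, not the code from Figures 10 to 12; the names are hypothetical and programs are plain Python callables.

```python
from fractions import Fraction


def verify_deduction(predicted_output, gold_output):
    """o_pi == o_star, using Python value equality."""
    return predicted_output == gold_output


def verify_abduction(program, predicted_input, gold_output):
    """p(i_pi) == p(i_star); any input that reproduces the output passes."""
    try:
        return program(predicted_input) == gold_output
    except Exception:
        return False


def verify_induction(candidate_program, gold_pairs):
    """p_pi must reproduce the gold output for every gold input."""
    try:
        return all(candidate_program(i) == o for i, o in gold_pairs)
    except Exception:
        return False


# abs is not bijective: -3 and 3 both map to 3, so either input is accepted.
print(verify_abduction(abs, 3, abs(-3)))
# Value equality ignores set ordering and matches equal numeric values.
print(verify_deduction({2, 1}, {1, 2}), verify_deduction(Fraction(1, 2), 0.5))
```

The `abs` case shows why comparing inputs directly would be too strict: the solver's answer 3 differs from the original input -3, yet it is a perfectly good explanation of the output.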

3.3.5. Task-Relative REINFORCE++

“Since AZR trains the combination of roles and task types, it operates in a multitask reinforcement learning setup…” Because AZR trains every combination of role and task type, it operates in a multitask reinforcement learning setup.

“Instead of computing a single global baseline as in REINFORCE++ (Hu, 2025) (Appendix A), we compute separate baselines for each of the six task-role configurations.” Instead of a single global baseline as in REINFORCE++ (Hu, 2025; Appendix A), a "separate baseline" is computed for each of the six task-role configurations (3 task types × 2 roles = 6).

“This can be viewed as an interpolation between per-question baselines, as in GRPO (Shao et al., 2024), and a global baseline, allowing for more structured variance reduction tailored to each task setup.” This can be seen as an interpolation between per-question baselines, as in GRPO (Shao et al., 2024), and a global baseline, giving more structured variance reduction tailored to each task setup. Each situation gets its own baseline, which makes training more stable.

“We refer to this variant as Task-Relative REINFORCE++ (TRR++).” The variant is called TRR++.

“The normalized advantage A_norm is computed as: A_norm^(task,role) = (r − µ^(task,role)) / σ^(task,role), task ∈ {ind, ded, abd}, role ∈ {propose, solve}. (8)” The normalized advantage A_norm is computed as in equation (8): take the reward r, subtract the mean reward µ of that (task, role) pair, and divide by its standard deviation σ.

“where the mean and standard deviation are computed within each task type and role, yielding six baselines.” The mean and standard deviation are computed within each task type and role, which yields six baselines.
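Equation (8) amounts to a group-wise standardization, which the sketch below implements with the standard library. Two details are assumptions of this example, not statements from the text: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards does not divide by zero.

```python
from collections import defaultdict
from statistics import mean, pstdev


def task_relative_advantages(samples, eps=1e-8):
    """samples: list of (task, role, reward) tuples.

    Returns one normalized advantage per sample, in the same order, using
    the mean and standard deviation of that sample's (task, role) group.
    """
    groups = defaultdict(list)
    for task, role, reward in samples:
        groups[(task, role)].append(reward)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    advantages = []
    for task, role, reward in samples:
        mu, sigma = stats[(task, role)]
        advantages.append((reward - mu) / (sigma + eps))
    return advantages


rollouts = [
    ("ded", "solve", 1.0),
    ("ded", "solve", 0.0),
    ("abd", "propose", 0.5),
    ("abd", "propose", 0.5),
]
print(task_relative_advantages(rollouts))
```

Because each group is normalized on its own, a proposer reward of 0.5 is never compared against a solver reward of 1.0; each is judged only against its peers in the same task and role.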

Lecture 6

Through Lecture 5 we took apart the AbsoluteZero paradigm and the inner workings of its implementation, AZR. Now comes the most exciting part! How well does this AZR thing actually perform? Let's find out in the "Experiments" section.

4. Experiments

4.1. Experiment Setup

“Training Details. For all experiments, we initialize the buffers as described in Section 3.1. AZR models are trained using a batch size of 64 × 6 (2 roles × 3 task types).” Training details. In all experiments the buffers are initialized as described in Section 3.1, per the paper's wording. (The initialization procedure itself is the one we walked through under 3.3.1.) AZR models are trained with a batch size of 64 × 6 (2 roles × 3 task types): 64 problems for each of the six role-task combinations, 384 instances per step in total.

“We use constant learning rate = 1e-6 and the AdamW optimizer (Loshchilov & Hutter, 2019).” The learning rate is held constant at 1e-6 (one millionth) with the AdamW optimizer (Loshchilov & Hutter, 2019). That is a fairly small learning rate.

“Complete list of hyperparameters is provided in Table 3.” The full list of hyperparameters is in Table 3. (In the paper, Table 3 is in Appendix B.)

“For the main experiments, we train AZR models on Qwen2.5-7B and Qwen2.5-7B-Coder, resulting in AbsoluteZeroReasoner-base-7B and AbsoluteZeroReasoner-Coder-7B, respectively.” For the main experiments, AZR is trained on Qwen2.5-7B and Qwen2.5-7B-Coder, producing AbsoluteZeroReasoner-base-7B (AZR-Base-7B) and AbsoluteZeroReasoner-Coder-7B (AZR-Coder-7B), respectively. The base is Qwen, an LLM family from China, and the "Coder" variant is the one specialized for coding.

“Additional experiments include training Qwen2.5-Coder-3B, Qwen2.5-Coder-14B, Qwen2.5-14B, Llama-3.1-8B (Yang et al., 2024a; Hui et al., 2024; Dubey et al., 2024).” Additional experiments train the 3-billion (3B) and 14-billion (14B) parameter versions of Qwen2.5-Coder, the general Qwen2.5-14B, and Llama-3.1-8B. The method is tried across several model sizes and families.

“Evaluation Protocol. To evaluate our models, we divide the datasets into in-distribution (ID) and out-of-distribution (OOD) categories.” Evaluation protocol. The datasets are split into "in-distribution (ID)" and "out-of-distribution (OOD)" categories. Think of ID as tasks similar to what the model trained on, and OOD as new kinds of tasks it never trained on.

“For OOD benchmarks, which we emphasize more, we further categorize them into coding and mathematical reasoning benchmarks.” The OOD benchmarks get more emphasis and are further divided into coding and mathematical reasoning benchmarks. AZR learned through code-reasoning tasks, so the question is whether it can also handle a quite different domain such as math. That is the real test of generalization.

“For coding tasks, we evaluate using Evalplus (Liu et al., 2023) on the HumanEval+ and MBPP+ benchmarks, as well as LiveCodeBench Generation (v1-5, May23-Feb25) (Jain et al., 2024).” Coding is evaluated with Evalplus on the HumanEval+ and MBPP+ benchmarks, plus LiveCodeBench Generation. All are widely used standard benchmarks for coding ability.

“For mathematical reasoning, we utilize six standard benchmarks commonly used in recent zero-shot trained reasoners: AIME’24, AIME’25, OlympiadBench (He et al., 2024), Minerva, Math500 (Hendrycks et al., 2021), and AMC’23.” Math reasoning is evaluated on six standard benchmarks: AIME'24 and AIME'25 (the American Invitational Mathematics Examination), OlympiadBench (olympiad-level problems), Minerva, Math500, and AMC'23 (the American Mathematics Competitions). These are seriously hard math problems.

“For ID benchmarks, we use CruxEval-I(nput), CruxEval-O(utput), and LiveCodeBench-Execution (Gu et al., 2024; Jain et al., 2024), which assess reasoning capabilities regarding the input and output of programs (Li et al., 2025).” The ID benchmarks are CruxEval-I (input prediction), CruxEval-O (output prediction), and LiveCodeBench-Execution, which assess reasoning about the inputs and outputs of programs. They map directly onto the abduction and deduction tasks AZR trains on.

“Greedy decoding is used for all baseline methods and AZR results to ensure reproducibility.” Greedy decoding is used for every baseline and for all AZR results to ensure reproducibility. Greedy decoding always picks the most probable next token, so it is simple and deterministic.

“Baselines. For our main results, we use Qwen2.5-7B as the base model, along with its specialized base model variants: Qwen2.5-7B-Coder, Qwen2.5-7B-Instruct, and Qwen2.5-Math-7B (Yang et al., 2024a; Hui et al., 2024; Yang et al., 2024b).” Baselines. The main results use Qwen2.5-7B as the base model and also compare its specialized variants: Qwen2.5-7B-Coder (coding), Qwen2.5-7B-Instruct (instruction following), and Qwen2.5-Math-7B (math).

“Furthermore, the zero-style models are usually trained specifically on either code or math data; and only Eurus-2-7B-PRIME-Zero (Cui et al., 2025) was trained jointly on both domains.” The "zero-style" models are usually trained on either code data or math data; only Eurus-2-7B-PRIME-Zero was trained jointly on both domains.

“For code data models, we present four variants of the AceCoder (Zeng et al., 2025a) and two different CodeR1 models (Liu & Zhang, 2025).” For models trained on code data, there are four AceCoder variants and two different CodeR1 models.

“For math data models, we have Qwen2.5-Math-7B-Oat-Zero (Liu et al., 2025), Open-Reasoner-Zero-7B (ORZ) (Hu et al., 2025), Qwen-2.5-7B-SimpleRL-Zoo (Zeng et al., 2025b).” For models trained on math data, there are Qwen2.5-Math-7B-Oat-Zero, Open-Reasoner-Zero-7B (ORZ), and Qwen-2.5-7B-SimpleRL-Zoo. All are recent reasoners trained in the zero setting.

“All baseline models’ training data and initializations settings are summarized in Table 4.” The training data and initialization settings of every baseline are summarized in Table 4. (In the paper, Table 4 is in Appendix B.)

“For follow-up scaling experiments, we compare each AZR model against its own corresponding base model, due to the lack of established baselines across different parameter scales.” In the follow-up scaling experiments, each AZR model is compared with its own base model, because established baselines do not exist at the other parameter scales. The 3B model is compared with the 3B base, the 14B model with the 14B base.

“Finally, we compare our Llama3.1-8B-trained model with Llama-3.1-8B-SimpleRL-Zoo (Zeng et al., 2025b) and the base model.” Finally, the AZR model trained from Llama3.1-8B is compared with Llama-3.1-8B-SimpleRL-Zoo and with the base Llama model.

4.2. Results

Now, at long last, the results! We will focus on Table 1.

“Research Question 1: How does AZR compare to other zero setting models trained with human expert data?” Research Question 1: how does AZR compare with other zero-setting models that were trained on human expert data?

“We present the main results of reasoning models trained under both the standard zero and our proposed absolute zero settings in Table 1.” Table 1 presents the main results for reasoning models trained under the standard zero setting and under the absolute zero setting proposed in this paper.

(Walkthrough of Table 1. Performance of RL-Trained Reasoner on Reasoning Benchmarks Based on Qwen2.5-7B Models.)

Look at Table 1. A lot of numbers and model names, right? Stay sharp! The table shows the performance of RL-trained reasoners built on Qwen2.5-7B models. It covers three coding benchmarks (HumanEval+, MBPP+, LCB v1-5) and six math benchmarks (AIME'24, AIME'25, AMC'23, MATH500, Minerva, OlympiadBench), along with per-domain averages (CAvg for coding, MAvg for math) and the overall average (AVG). A "+" marks the absolute improvement over the base model.

Pay attention to the bottom two rows, "AbsoluteZero Training w/ No Curated Data (Ours)"!

AZR (Ours) Base: the general Qwen2.5-7B model trained with AZR. The "Data" column says "0"! No external data was used at all.
AZR (Ours) Coder: the coding-specialized Qwen2.5-7B-Coder model trained with AZR. Its "Data" column is also "0".

“Notably, AbsoluteZeroReasoner-Coder-7B achieves state-of-the-art performance in both the 7B overall average and the coding average categories.” Notably, AZR-Coder-7B reaches state-of-the-art (SOTA) performance among 7-billion-parameter models on both the overall average (AVG) and the coding average (CAvg). And it does so without any curated data!

“Despite being entirely out-of-distribution for both math and code reasoning benchmarks, it surpasses the previous best model by 1.8 absolute percentages.” Even though both the math and the code reasoning benchmarks are entirely out-of-distribution for it, it beats the previous best model by 1.8 absolute percentage points. A remarkable result.

“Even more strikingly, it outperforms models trained with expert-curated human data in the coding category by 0.3 absolute percentages, while never having access to such data itself.” Even more striking, in the coding category it outperforms models trained on expert-curated human data by 0.3 absolute percentage points, without ever having seen such data. A model that learned by writing and solving its own problems edges out models trained on human-made data.

“Strong Cross-domain Generalization. To assess cross-domain generalization after RLVR, we evaluate math performance before and after training, comparing AZR models with other expert code models, since AZR was trained in coding environments.” Strong cross-domain generalization. Because AZR was trained in coding environments, cross-domain generalization after RLVR is assessed by measuring math performance before and after training and comparing AZR with the other expert code models.

“After training, most expert code models showed minimal changes or even declines in performance compared to their base versions, with an average increase of only 0.65 points across these models, indicating very limited cross-domain generalization.” After training, most expert code models changed little or even declined relative to their base versions, with an average gain of only 0.65 points across them, which indicates very limited cross-domain generalization. In other words, RL on curated code data barely moved these models' math scores.

“In contrast, AZR base and coder models achieved gains of 10.9 and 15.2 percentage points, respectively, demonstrating substantially stronger generalized reasoning improvements.” By contrast, the AZR base and coder models gained 10.9 and 15.2 percentage points, respectively, a far stronger improvement in general reasoning. They trained only on coding tasks, and their math ability jumped. That is what makes AZR formidable.

“Similarly, although also out-of-distribution on human-defined code generation tasks, our AZR models improved by 3.2 and 5.0 points, while the math models on average showed just a moderate increase in coding (+2.0 on average).” Similarly, although human-defined code generation tasks are also out-of-distribution, the AZR models improved by 3.2 and 5.0 points on them, while the math models showed only a moderate coding gain (+2.0 on average).

“Overall, these results highlight the surprising effectiveness of our approach. Unlike other RLVR models trained and evaluated on human-defined tasks, our AZR models demonstrate strong general reasoning capabilities without any direct training on downstream human-defined math or coding data, only had access to self-proposed tasks during training.” Overall, these results highlight how surprisingly effective the approach is. Unlike other RLVR models, which are trained and evaluated on human-defined tasks, AZR models show strong general reasoning without any direct training on downstream human-defined math or coding data; during training they only ever saw tasks they proposed themselves. Impressive!

“Research Question 2: How do initializing from different base model variants (base vs. coder) affect performance?” Research Question 2: how does initializing from a different base model variant (general base vs. coder) affect performance?

“As shown in Table 1, the coder variant achieved better overall performance in both math and coding after the AZR self-play process.” As Table 1 shows, the coder variant (AZR-Coder-7B) ends up with better overall performance in both math and coding after AZR self-play.

“Strikingly, although the coder base model variant started with a lower average performance in math than the vanilla base model (23.9 vs. 27.5), it ultimately outperformed it after AZR training.” Strikingly, the coder base model started out with a lower math average than the vanilla base model (23.9 vs. 27.5), but after AZR training it came out ahead. So taking a model that is already good at coding and training it with AZR gave the better math result as well.
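To make that reversal concrete, here is a quick arithmetic check using only numbers quoted in this lecture: the starting math averages (27.5 and 23.9) and the reported math gains (+10.9 and +15.2). It is a rough sketch that assumes both gains are measured on the same math average; the variable names are made up for illustration.

```python
# Math averages before AZR training and the reported gains, as quoted above.
base_math = {"base": 27.5, "coder": 23.9}
azr_gain = {"base": 10.9, "coder": 15.2}

# Assumption: each gain is measured on the same math average as the starting score.
after_math = {name: round(base_math[name] + azr_gain[name], 1) for name in base_math}

for name in base_math:
    print(f"{name}: {base_math[name]} -> {after_math[name]} (+{azr_gain[name]})")
```

Under that assumption the coder variant starts 3.6 points behind and finishes slightly ahead, and the base result lands on the 38.4 math average quoted later for the full AZR-Base-7B in the ablation table.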

“This highlights the importance of initial code competency as a catalyst for enhancing broader reasoning abilities within the Absolute Zero Reasoner approach.” This highlights how much initial coding competence matters: within the Absolute Zero Reasoner approach it acts as a catalyst for broader reasoning ability. A model that codes well from the start can grow more through AZR training.

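Lecture 7

In Lecture 6 we saw AZR's remarkable performance, in particular that it reaches SOTA without curated data and generalizes strongly across domains. Today we go through the rest of the experiments section to look at AZR from more angles. What happens when the model size changes? Does it work on other kinds of models? What interesting behaviors show up during training? And which components actually matter for AZR's performance (the ablation study)? We will dig through all of it, centered on Figure 6 and Table 2!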

4.2. Results (Continued)

“Research Question 3: How does varying model size effect AZR’s in-distribution and out-of-distribution capabilities?” Research Question 3: how does changing the model size affect AZR's in-distribution (ID) and out-of-distribution (OOD) capabilities?

“We examine the effects of scaling model size and present both in-distribution and out-of-distribution results in Figure 6(a) and (b), respectively.” The authors examine the effect of scaling the model size and report the in-distribution and out-of-distribution results in Figure 6(a) and (b), respectively.

(Figure 6. (a) In-Distribution & (b) Out-of-Distribution Reasoning Task Performances. Description)

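Let's look at Figure 6. Panel (a) is in-distribution (ID) performance and panel (b) is out-of-distribution (OOD) performance. ID performance in (a) is measured with CruxEval-I, CruxEval-O, and LiveCodeBench-Execution scores, which correspond to the abduction, deduction, and deduction task types, respectively. The plot shows how performance changes over training steps for each model (the 3B, 7B, and 14B Coder models plus Llama3.1-8B). Overall, ID performance improves as training proceeds. The larger models (7B, 14B) in particular improve more steadily than the small one (3B).

OOD performance in (b) is reported as the coding-task average, the math-task average, and the overall average of the two.

For Llama3.1-8B, applying AZR improves the coding, math, and overall averages over the base model. The gains are a little smaller than those of the SimpleRL-trained model, but they are gains nonetheless.
For the Qwen2.5-Coder models, the improvement from AZR (the + value next to Total Avg) clearly grows as the size goes from 3B to 7B to 14B! (+5.7 → +10.2 → +13.2)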

“Given the strong performance of coder models in the 7B category, we extend the analysis by evaluating smaller and larger variants: Qwen2.5-3B-Coder and Qwen2.5-14B-Coder.” Since the coder model did so well at 7B, the authors extend the analysis to a smaller and a larger variant: Qwen2.5-3B-Coder and Qwen2.5-14B-Coder.

“Due to the absence of existing baselines for these zero-style reasoner models, we compare each model’s performance to its corresponding base coder model.” There are no existing baselines for zero-style reasoners at these sizes, so each model is compared against its own base coder model.

“The results reveal a clear trend: our method delivers greater gains on larger, more capable models.” The trend is clear: AZR delivers bigger gains on larger, more capable models.

“In the in-distribution setting, the 7B and 14B models continue to improve beyond 200 training steps, whereas the smaller 3B model appears to plateau.” In the in-distribution setting, the 7B and 14B models keep improving past 200 training steps, while the smaller 3B model appears to plateau.

“For out-of-distribution domains, larger models also show greater overall performance improvements than smaller ones: +5.7, +10.2, +13.2 overall performance gains, respectively for 3B, 7B and 14B.” The same holds out-of-distribution: the overall gains are +5.7, +10.2, and +13.2 for 3B, 7B, and 14B, respectively, so the bigger the model, the bigger the improvement.
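A tiny check of the three OOD gains quoted above makes the trend explicit. The sketch below only does arithmetic on those three numbers; it is not a fit of any scaling law, and the variable names are made up for illustration.

```python
# Overall OOD gains over the corresponding base coder model, as quoted above.
gains = {"3B": 5.7, "7B": 10.2, "14B": 13.2}

sizes = list(gains)  # insertion order: 3B, 7B, 14B
pairs = list(zip(sizes, sizes[1:]))

# The gain grows at every step up in model size.
is_monotonic = all(gains[small] < gains[large] for small, large in pairs)

# How much extra gain each step up in size buys.
increments = {f"{small}->{large}": round(gains[large] - gains[small], 1) for small, large in pairs}

print(is_monotonic)  # True
print(increments)    # {'3B->7B': 4.5, '7B->14B': 3.0}
```

The gain rises with size, although the extra gain per step is smaller going from 7B to 14B (3.0) than from 3B to 7B (4.5). With only three data points, that is an observation and not a law.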

“This is an encouraging sign, since base models continue to improve and also suggesting that scaling enhances the effectiveness of AZR.” This is an encouraging sign: base models keep getting better, and the result suggests that scaling makes AZR more effective. The bigger the model, the more it gets out of AZR!

“In future work, we aim to investigate the scaling laws that govern performance in the Absolute Zero paradigm.” For future work, the authors say they plan to investigate the scaling laws that govern performance in the Absolute Zero paradigm.

“Research Question 4: Any interesting observations by changing the model class?” Research Question 4: does anything interesting happen when the model class changes?

“We also evaluate our method on a different model class, using Llama3.1-8B as the base shown in Figure 6.” The method is also evaluated on a different model class, with Llama3.1-8B as the base model, as shown in Figure 6.

“Unlike the 3B and 14B categories, this setting has an existing baseline, SimpleRL (Zeng et al., 2025b), which enables a direct comparison.” Unlike the 3B and 14B categories, this setting has an existing baseline, SimpleRL, so a direct comparison is possible.

“Although Llama3.1-8B is less capable than the Qwen2.5 models, our method still produces moderate improvements (+3.2), demonstrating AZR’s effectiveness even on relatively weaker models.” Llama3.1-8B is less capable than the Qwen2.5 models, but AZR still produces a moderate improvement (+3.2), so it works even on a relatively weaker model.

“However, these gains appear more limited, which aligns with our earlier observation that performance improvements tend to scale with initial base model potency.” The gains do look more limited, though, which matches the earlier observation that improvements tend to scale with how strong the base model is to begin with. Put simply, AZR pays off more when you apply it to a model that is already fairly smart.

“Research Question 5: Any interesting behaviors or patterns observed during AZR training?” Research Question 5: were any interesting behaviors or patterns observed during AZR training?

“We observed interesting response patterns in both the proposal and solution stages.” Interesting response patterns showed up in both the proposal stage and the solution stage.

“The model is capable of proposing diverse programs, such as string manipulation tasks, dynamic programming problems, and practical cases (e.g., calculating a triangle’s area using Heron’s formula).” The model can propose a diverse range of programs: string manipulation tasks, dynamic programming problems, and practical cases such as computing a triangle's area with Heron's formula. It comes up with fairly intricate and useful problems on its own.

“We show a concrete example in Figure 7, where AZR proposes a code problem that searches for the sum of continuous sub-arrays matching a target value and solves it through trial-and-error.” Figure 7 (a figure in the main text) gives a concrete example: AZR proposes a code problem that searches for the sum of continuous sub-arrays matching a target value, and then solves it by trial and error.
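To give a feel for what a “contiguous sub-array with a target sum” problem looks like, here is a minimal hand-written sketch. It is not the program from Figure 7 (that code is not reproduced in this lecture); the function name `find_subarray_with_sum` and the prefix-sum approach are illustrative assumptions.

```python
from typing import Optional


def find_subarray_with_sum(nums: list[int], target: int) -> Optional[tuple[int, int]]:
    """Return inclusive (start, end) indices of a contiguous run summing to target, else None.

    Hypothetical example, not the program AZR proposed in Figure 7.
    """
    first_seen = {0: -1}  # prefix sum -> earliest index at which it was reached
    running = 0
    for i, value in enumerate(nums):
        running += value
        # A run ending at i sums to target if some earlier prefix equals running - target.
        if running - target in first_seen:
            return first_seen[running - target] + 1, i
        first_seen.setdefault(running, i)
    return None


print(find_subarray_with_sum([1, 2, 3, 7, 5], 12))   # (1, 3), since 2 + 3 + 7 == 12
print(find_subarray_with_sum([1, 2, 3, 7, 5], 100))  # None
```

In an abduction-style task, the solver would be shown a program like this together with an output such as `(1, 3)` and asked to find an input that produces it, which is where trial and error comes in naturally.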

“Overall, the models trained exhibit distinct reasoning patterns depending on the task type.” Overall, the trained models show distinct reasoning patterns depending on the task type.

“For example, when solving abduction tasks, it repeatedly tests different input patterns, self-correcting until the reasoned output matches the given input.” (Abduction: it keeps trying different input patterns and corrects itself until the output it reasons out matches the given target.)
“When predicting outputs, it steps through the code and records structured intermediate results (such as dynamic programming arrays) until the final output is reached.” (Deduction: it steps through the code, recording structured intermediate results such as DP arrays, until it reaches the final output.)
“When inducting programs from given inputs, outputs, and descriptions, the model systematically checks each test case to confirm that its program produces correct results.” (Induction: when writing a program from given inputs, outputs, and a description, it systematically checks every test case to confirm that the program produces the correct results.)
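The three task types above can be sketched as three different checks against a code executor. The following is a minimal illustration and not the authors' implementation: `program` is a made-up example, the `check_*` helpers are hypothetical names, and the equality-based acceptance rule is an assumption.

```python
from typing import Any, Callable, Iterable


def program(x: int) -> int:
    """A toy program standing in for a self-proposed task (hypothetical)."""
    return x * x + 1


def check_deduction(f: Callable[[Any], Any], given_input: Any, predicted_output: Any) -> bool:
    # Deduction: program and input are given; the solver predicts the output.
    return f(given_input) == predicted_output


def check_abduction(f: Callable[[Any], Any], proposed_input: Any, given_output: Any) -> bool:
    # Abduction: program and output are given; the solver proposes an input.
    # Assumption: any input that reproduces the output is accepted.
    return f(proposed_input) == given_output


def check_induction(candidate: Callable[[Any], Any], examples: Iterable[tuple[Any, Any]]) -> bool:
    # Induction: input/output pairs are given; the solver writes a program
    # that must reproduce every pair.
    return all(candidate(i) == o for i, o in examples)


print(check_deduction(program, 3, 10))                              # True
print(check_abduction(program, -3, 10))                             # True: -3 also maps to 10
print(check_induction(lambda x: x ** 2 + 1, [(0, 1), (2, 5)]))      # True
print(check_induction(lambda x: x + 1, [(0, 1), (2, 5)]))           # False: fails on (2, 5)
```

Note how the abduction check accepts `-3` as well as `3`: several inputs can explain the same output, which fits the trial-and-error behavior described above.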

“Intermediate Planning During Code Response. Another interesting pattern emerged in our AZR models during the code induction task: the final code outputs were often interleaved with comments that resembled immediate step-by-step plans, reminiscent of the ReAct prompting framework (Yao et al., 2023).” Intermediate planning during code responses: another interesting pattern appeared in the code induction task. The final code was often interleaved with comments that read like immediate step-by-step plans, reminiscent of the ReAct prompting framework. The model leaves its train of thought and its plan as comments inside the code. (See Figure 19, Appendix C.3.)

“Cognitive Behavior in Llama. Interestingly, we also observed some emergent cognitive patterns in Absolute Zero Reasoner-Llama3.1-8B, similar to those reported by Zeng et al. (2025b), and we include one example in Figure 26, where clear state-tracking behavior is demonstrated.” Cognitive behavior in Llama: interestingly, the AZR model trained from Llama3.1-8B also showed some emergent cognitive patterns similar to those reported by Zeng et al. One example with clear state-tracking behavior is included in Figure 26 (an appendix figure).

“Token Length Increase Depends on Task Type. Finally, we observed that token length increases over the course of training… The most significant increase occurs in the abduction task, where the model engages in trial-and-error reasoning…” Token length growth depends on the task type: finally, token length increases over the course of training, which is consistent with recent findings. Interestingly, the largest increase occurs in the abduction task, because there the model reasons by trial and error, repeatedly testing inputs until one matches the program's output. (See Figures 15-17, Appendix C.3.) The harder it has to think, the more it says.

“Research Question 6: Are all task types essential for good performance (Ablation)?” Research Question 6: are all task types essential for good performance? (This is the ablation study.)

“Due to resource constraints, we perform the ablation studies in this section and the next using only Absolute Zero Reasoner-Base-7B.” Because of resource constraints, the ablations in this section and the next are run only on AZR-Base-7B.

“We begin by testing the importance of task types during training, with results shown in Table 2.” The first test is how much the task types matter during training; the results are in Table 2.

(Table 2. Ablation Results. Description)

Let's look at Table 2. These are the ablation results for the AZR-Base-7B model.

Row 1 (Deduction only): the induction and abduction tasks are removed and only deduction is kept. The math average (Math Avg) drops sharply to 32.0. (Full AZR scores 38.4.)
Row 2 (w/o Induction): only the induction task is removed. The math average also drops, to 33.3.
Conclusion: “These findings highlight the complementary role of the three task types in improving general reasoning capability, with each contributing in a distinct and essential way.” The three task types play complementary roles in improving general reasoning, and each contributes in its own distinct and essential way. In short, all three reasoning modes (deduction, induction, abduction) matter!
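The size of those two drops is easy to compute from the math averages quoted above (38.4 for full AZR, 32.0 and 33.3 for the two ablations). A small sketch, with illustrative variable names:

```python
# Math averages quoted above for AZR-Base-7B and the two task-type ablations.
full_azr_math = 38.4
ablation_math = {"Deduction only": 32.0, "w/o Induction": 33.3}

# Drop in math average relative to the full method.
drops = {name: round(full_azr_math - score, 1) for name, score in ablation_math.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```

Removing induction alone costs about 5.1 points of math average, and removing abduction as well costs about 6.4. That ordering is consistent with the claim that each task type adds something of its own.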

“Research Question 7: How much do the designs of proposer contribute to the overall performance (Ablation)?” Research Question 7: how much does the design of the proposer contribute to overall performance? (Also an ablation.)

“Next, we ablate two components of the proposer role and present the results in Table 2.” Next, two components of the proposer role are ablated, with the results again in Table 2.

Row 3 (w/o Gen Reference): instead of dynamically conditioning on K historical reference triplets, the proposer uses a fixed prompt to propose abduction and deduction tasks. Math performance drops by 5 points and coding performance by 1 point. “This suggests that dynamically conditioning on reference programs helps improve performance, possibly by increasing diversity and achieving better coverage of the reasoning problem space.” Conditioning dynamically on reference programs helps, possibly because it increases diversity and covers more of the reasoning problem space. So looking at past examples in order to produce new and varied problems matters.
Row 4 (Train Solver Only): the proposer is not trained at all; the current learner is only prompted to propose tasks, and only the solver is trained. Overall performance drops slightly (-1.4), “suggesting that while proposer training is beneficial, it may not be the most critical factor for now in the AZR framework.” Proposer training helps, but for now it may not be the most critical factor in the AZR framework. It contributes, just not as decisively as the other components.

“Overall, all components are essential for general reasoning.” Overall, every component is essential for general reasoning.

Lecture 8

Today we use “Related Work” to see where this paper sits academically, and then “Conclusion and Discussion” to pin down what the authors wanted to say with this research and where things should go next.

5. Related Work

This section sorts the prior work related to Absolute Zero into three broad strands.

Reasoning with RL

“Using RL to enhance reasoning capabilities has recently emerged as an important step in the post-training process of strong reasoning-focused large language models (Lambert et al., 2024).” Using RL to improve reasoning has recently become an important step in the post-training of strong reasoning-focused LLMs.

“One of the first works to explore a self-bootstrapping approach to improving LLM reasoning is STaR, which employs expert iteration and rejection sampling of outcome-verified responses to iteratively improve the model’s CoT.” One of the earliest works to explore self-bootstrapping for LLM reasoning is STaR. It uses expert iteration and rejection sampling over outcome-verified responses to improve the model's CoT (chain of thought) iteratively.

“A monumental work, o1 (Jaech et al., 2024), was among the first to deploy this idea on a scale, achieving state-of-the-art results in reasoning tasks at the time of release.” The landmark o1 (OpenAI's model) was among the first to apply this idea at scale, reaching SOTA on reasoning tasks when it was released.

“More recently, the R1 model (DeepSeek-AI et al., 2025) became the first open-weight model to match or even surpass the performance of o1.” More recently, the R1 model (DeepSeek-AI) became the first open-weight model to match or even surpass o1.

“Most notably, the zero setting was introduced, in which reinforcement learning is applied directly on top of the base LLM.” Most notably, this introduced the “zero setting”, where reinforcement learning is applied directly on top of the base LLM.

“This inspired followup work, which are open source attempts to replicate the R1 process or to improve the underlying reinforcement learning algorithm…” That inspired follow-up work: open-source attempts to replicate the R1 process or to improve the underlying RL algorithm.

“We extend the zero setting to a new absolute zero setting, where not only is the RLVR process initialized from a base LLM without SFT, but no external prompt data or answers are provided to the learner. All data used to improve reasoning were self-proposed, and refined entirely through RLVR.” This paper extends the zero setting to a new “absolute zero setting”. Not only does RLVR start from a base LLM without SFT, but the learner is also given no external prompt data or answers. All the data used to improve reasoning is self-proposed and refined entirely through RLVR. This is the key point that sets the paper apart!

“Moreover, our goal is not to only match zero-setting models, but to surpass them in the long run.” What's more, the goal is not just to match zero-setting models but to surpass them in the long run. An ambitious goal!

Self-play

“The self-play paradigm can be traced back to early 2000s, where Schmidhuber (2003; 2011) (of course) explored a two-agent setup in which a proposal agent invents questions for a prediction agent to answer.” The self-play paradigm goes back to the early 2000s, when Schmidhuber (yes, of course it's him!) explored a two-agent setup in which a proposal agent invents questions for a prediction agent to answer.

“AlphaGo and AlphaZero (Silver et al., 2016; 2017) extend the self-play paradigm to the two-player zero-sum game of Go…” AlphaGo and AlphaZero extended self-play to two-player zero-sum games such as Go and reached superhuman performance.

“Moreover, methods such as asymmetric self-play, unsupervised environment design, unsupervised reinforcement learning, and automatic goal generation all center around inventing new tasks for an agent to learn from—typically without supervision.” Methods such as asymmetric self-play, unsupervised environment design, unsupervised reinforcement learning, and automatic goal generation all revolve around inventing new tasks for an agent to learn from, typically without supervision.

“Most recently, SPIN and Self-Rewarding Language Models… use the same instance of the language models themselves as the reward model to progressively improve the generative and discriminative abilities of the same LLM for alignment.” Most recently, work such as SPIN and Self-Rewarding Language Models uses the language model itself as the reward model, progressively improving the same LLM's generative and discriminative abilities for alignment.

“Our work builds upon the self-play paradigm, but it is the first to use it to elicit long CoT for improved reasoning, and the first to frame the problem space as a Python input/output/function abduction/deduction/induction tasks, grounding it in an operationalizable environment to facilitate RLVR.” This work builds on self-play, but it is the first to use it to elicit long CoT for better reasoning, and the first to frame the problem space as abduction, deduction, and induction tasks over Python inputs, outputs, and functions, grounded in an executable environment so that RLVR becomes practical. The originality lies in connecting self-play to reasoning improvement and to concrete coding tasks.

Weak-to-Strong Supervision

“The concept of weak-to-strong supervision has been studied in prior work, where a teacher—despite being weaker than the learner—still provides useful guidance…” Prior work has studied weak-to-strong supervision, the idea that a teacher weaker than the learner can still provide useful guidance.

“We consider a similar setting in which the learner may possess superhuman capabilities. However, rather than relying on supervision from a weaker teacher, we propose an alternative approach: guiding the learner’s improvement through verifiable rewards, which potentially offer a more reliable and scalable learning signal.” The paper considers a similar setting in which the learner may have superhuman capabilities. Instead of relying on a weaker teacher, though, it proposes an alternative: guide the learner's improvement with verifiable rewards, which may offer a more reliable and more scalable learning signal. Once an AI is smarter than people, learning from goals it can verify for itself may beat being taught by people.

“Furthermore, in our proposed method, the learning task and goal distribution is not predefined by any external supervisor—they are entirely self-generated by the learner, enabling it to maximize its learning potential through autonomous self-practice.” Furthermore, in the proposed method the learning tasks and the goal distribution are not predefined by any external supervisor. They are generated entirely by the learner, which lets it maximize its learning potential through autonomous self-practice.

6. Conclusion and Discussion

All right, we have finally reached the conclusion and discussion! Let's sum up what the authors wanted to say with this work and where they think it should go next.

Conclusion

“In this work, we proposed the Absolute Zero paradigm, an novel setting that addresses the data limitations of existing RLVR frameworks.” This work proposes the Absolute Zero paradigm, a new setting that addresses the data limitations of existing RLVR frameworks.

“In this paradigm, reasoning agents are tasked with generating their own learning task distributions and improving their reasoning abilities with environmental guidance.” In this paradigm, reasoning agents have to generate their own learning task distributions and improve their reasoning with guidance from the environment.

“We then presented our own instantiation, the Absolute Zero Reasoner (AZR), which is trained by having them propose and solve code-related reasoning tasks grounded by code executor.” The authors then present their own instantiation, the Absolute Zero Reasoner (AZR), which is trained by proposing and solving code-related reasoning tasks grounded by a code executor.

“Remarkably, even though our models were not directly trained on these tasks and lacked human expert-curated datasets, our reasoning agents achieved exceptional performance, surpassing the state-of-the-art in combined general reasoning scores and in coding.” Remarkably, the models were not trained directly on these tasks and had no human expert-curated datasets, yet the reasoning agents performed exceptionally well, surpassing the state of the art in combined general reasoning scores and in coding.

“This demonstrates the potential of the absolute zero paradigm to drive superior reasoning capabilities without the need for extensive domain-specific training data.” This shows that the absolute zero paradigm has the potential to drive superior reasoning without extensive domain-specific training data.

“Furthermore, we showed that AZR scales efficiently, offering strong performance across varying model sizes, and can enhance the capabilities of other model classes as well.” The authors also showed that AZR scales efficiently, performs strongly across model sizes, and can improve other model classes as well.

“To foster further exploration and advancement of this emerging paradigm, we are releasing the code, models, and logs as open-source…” To encourage further exploration of this emerging paradigm, they are releasing the code, models, and logs as open source. That will be a big help to the research community!

Discussion

“We believe there remains much to explore, such as altering the environment from which the reasoner receives verifiable feedback, including sources like the world wide web, formal math languages, world simulators, or even the real world.” The authors believe much remains to be explored, for example changing the environment that supplies verifiable feedback to sources such as the world wide web, formal math languages, world simulators, or even the real world. AZR's environment can be extended beyond the code executor.

“Furthermore, AZ’s generality could possibly be extended to domains such as embodied AI.” Beyond that, the generality of Absolute Zero could possibly be extended to domains such as embodied AI (robots and the like).

“Additionally, more complex agentic tasks or scientific experiments, present exciting opportunities to further advance the absolute zero setting to different application domains.” More complex agentic tasks or scientific experiments also offer exciting opportunities to push the absolute zero setting into other application domains.

“Beyond that, future directions could include exploring multimodal reasoning models, modifying the distribution p(z) to incorporate privileged information, defining or even let the model dynamically learn how to define f (Equation (3)), or designing exploration/diversity rewards for both the propose and solver roles.” Other future directions include exploring multimodal reasoning models, modifying the distribution p(z) to incorporate privileged information, defining the function f in Equation (3) or even letting the model learn dynamically how to define it, and designing exploration or diversity rewards for both the proposer and the solver roles. There is still plenty left to do!

“While underappreciated in current reasoning literature, the exploration component of RL has long been recognized as a critical driver for emergent behavior in traditional RL…” Although it is underappreciated in the current reasoning literature, the exploration component of RL has long been recognized as a critical driver of emergent behavior in traditional RL.

“Taking this a step further, our framework investigates an even more meta-level exploration problem: exploration within the learning task space—where the agent learns not just how to solve tasks, but what tasks to learn from and how to find them.” Going one step further, this framework investigates an even more meta-level exploration problem: exploration within the space of learning tasks. The agent learns not only how to solve tasks but also which tasks to learn from and how to find them. It goes beyond finding answers and decides for itself which problems are worth solving.

“This shift opens a powerful new frontier—where agents explore not only solution spaces but also expand the boundaries of problem spaces. We believe this is a promising and important direction for future research.” This shift opens a powerful new frontier, where agents explore solution spaces and also expand the boundaries of problem spaces. The authors believe this is a promising and important direction for future research.

“One limitation of our work is that we did not address how to safely manage a system composed of such self-improving components.” One limitation of this work is that it does not address how to safely manage a system built from such self-improving components.

“To our surprise, we observed several instances of safety-concerning CoT from the Llama-3.1-8B model, which we term the “uh-oh moment”.” To the authors' surprise, the Llama-3.1-8B model produced several instances of safety-concerning CoT (chain of thought), which they call the “uh-oh moment”. (See Figure 32.) An AI that learns on its own can end up with unexpected and risky lines of thought.

“These findings suggest that the proposed absolute zero paradigm, while reducing the need for human intervention for curating tasks, still necessitates oversight due to lingering safety concerns and is a critical direction for future research.” These findings suggest that the absolute zero paradigm reduces the need for human intervention in curating tasks but still requires oversight because of lingering safety concerns, and that this is a critical direction for future research. However smart the model gets, safety still matters.

“As a final note, we explored reasoning models that possess experience—models that not only solve given tasks, but also define and evolve their own learning task distributions with the help of an environment.” As a final note, the authors explored reasoning models that “possess experience”: models that solve given tasks and also define and evolve their own learning task distributions with the help of an environment.

“Our results with AZR show that this shift enables strong performance across diverse reasoning tasks, even with significantly fewer privileged resources, such as curated human data.” The AZR results show that this shift enables strong performance across diverse reasoning tasks even with far fewer privileged resources, such as curated human data.

“We believe this could finally free reasoning models from the constraints of human-curated data (Morris, 2025) and mark the beginning of a new chapter for reasoning models: “welcome to the era of experience” (Silver & Sutton, 2025; Zhao et al., 2024).” The authors believe this could finally free reasoning models from the constraints of human-curated data (Morris, 2025) and mark the start of a new chapter for reasoning models, the “era of experience” (citing Silver & Sutton and Zhao et al.). What a finish!