The term “Reasoning Models” has recently become a buzzword in the AI community, especially with the emergence of models like DeepSeek-R1. If you’re still unclear about what these models are and what they are good for, Sebastian Raschka, author of “Build a Large Language Model From Scratch,” has written an insightful blog post that demystifies them.
This paper from CLOXLABS delves into what reasoning models are, when to use them, the training pipeline behind DeepSeek-R1, four primary methods for building and improving these models, and tips for developing reasoning models on a limited budget.
NOTE: This is an official Research Paper by “CLOXLABS”
Defining Reasoning Models
Reasoning in AI refers to the capability of models to solve complex, multi-step problems that require logical deduction and inference. Unlike straightforward queries—such as asking for the capital of France—reasoning models tackle intricate tasks like solving advanced physics or mathematics problems, where each step builds upon the previous one to arrive at a solution.
When to Avoid Using Reasoning Models
While powerful, reasoning models aren’t always the optimal choice. Raschka highlights three scenarios where their use might be counterproductive:
- Need for Speed and Low Cost: Reasoning models often demand significant computational resources, leading to increased costs and slower response times.
- Knowledge-Based Questions: For factual queries, such as “What is the capital of France?”, simpler models suffice. Employing reasoning models in these cases can lead to unnecessary complexity and potential inaccuracies.
- Simple Questions: For straightforward problems, reasoning models might overcomplicate the solution, akin to overthinking a simple task.

The DeepSeek-R1 Pipeline
DeepSeek-R1 stands out as a pioneering reasoning model, and understanding its development pipeline offers valuable insights into modern AI training methodologies. The process can be summarized as follows:
- DeepSeek-R1-Zero: Starting from the DeepSeek-V3 base model released in December 2024, this phase applied Reinforcement Learning (RL) without a preceding Supervised Fine-Tuning (SFT) stage; the setup resembles the RL step of RLHF (Reinforcement Learning from Human Feedback), but uses rule-based rewards instead of a learned preference model. Two reward signals were used: accuracy and format adherence. For instance, accuracy rewards came from verifiable tasks such as LeetCode-style coding problems, where executing the generated code directly validates correctness, while format rewards ensured responses adhered to a specific template (a minimal sketch of such rule-based rewards appears after this list).
- DeepSeek-R1: Building upon R1-Zero, this phase incorporated a substantial SFT dataset (the initial “cold-start” portion was generated with the help of R1-Zero). The DeepSeek-V3 base model was fine-tuned on this data and then trained further with RL, where an additional “consistency” reward was introduced to penalize language mixing and keep each response in a single language. The team also built two larger SFT datasets, one with 600,000 chain-of-thought examples and another with 200,000 knowledge-based entries, and ran a final round of RL focused on code and mathematics tasks, culminating in the DeepSeek-R1 model.
- Distillation: The final phase utilized the previously generated SFT datasets to train smaller models through a process termed “distillation.” Here, a robust model generates data to fine-tune a smaller model (ranging from 1.5 to 70 billion parameters), effectively transferring knowledge and capabilities.
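To make the reward design above more concrete, here is a minimal, illustrative sketch of rule-based rewards in the spirit of what is described for R1-Zero and R1: an accuracy check against a verifiable reference answer, a format check for a reasoning template, and a crude language-consistency check. The tag names, the equal weighting, and the ASCII heuristic are assumptions made for illustration; DeepSeek’s actual reward code is not public.

```python
import re

def format_reward(response: str) -> float:
    # Reward 1.0 if the response wraps its reasoning and answer in tags.
    # The exact template DeepSeek used is only partially documented,
    # so this pattern is illustrative.
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    # Reward 1.0 if the extracted final answer matches a verifiable reference
    # (e.g. a math result). Real pipelines also execute generated code against
    # test cases, as on LeetCode-style problems.
    match = re.search(r"<answer>(.+?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def language_consistency_reward(response: str) -> float:
    # Crude stand-in for the "consistency" reward added in the R1 phase:
    # penalize mixed-script responses (here, any non-ASCII text in an
    # English-only setting). The production heuristic is not public.
    return 1.0 if response.isascii() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # Unweighted sum purely for illustration; the real weighting is unknown.
    return (format_reward(response)
            + accuracy_reward(response, reference_answer)
            + language_consistency_reward(response))
```

In an RL algorithm such as GRPO or PPO, a scalar like `total_reward` would be computed for each sampled response and used to update the policy model.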
Four Key Approaches to Building Reasoning Models
Raschka outlines four primary strategies for creating and refining reasoning models:
- Inference-Time Scaling: This approach improves performance at inference time without changing the model’s weights, using techniques such as chain-of-thought prompting, sampling multiple candidate answers and voting, or searching over reasoning paths (see the self-consistency sketch after this list). While effective, it demands more computation per query, which raises costs and response times.
- Pure Reinforcement Learning (RL): Models like DeepSeek-R1-Zero are trained exclusively using RL without any supervised fine-tuning. This method fosters the emergence of complex reasoning behaviors, including extended chain-of-thought processes, reflection, and self-correction.
- Supervised Fine-Tuning (SFT) Combined with RL: This hybrid approach, exemplified by DeepSeek-R1, begins with supervised fine-tuning on a curated dataset, followed by reinforcement learning. This combination enhances both the model’s performance and the coherence of its outputs.
- Pure SFT with Distillation: In this method, a large, well-trained model generates data that is used to fine-tune smaller models, a process known as distillation (a minimal data-generation sketch appears after the next paragraph). Models like DeepSeek-R1-Distill-Qwen are products of this approach, offering solid performance with reduced computational requirements.
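As a concrete example of inference-time scaling, the sketch below implements self-consistency, i.e. sampling several reasoning chains and taking a majority vote over the final answers. The `generate` function is a placeholder for whatever model or API is being called, and the answer-extraction logic is deliberately naive; both are assumptions, not part of any specific product.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder for a call to a language model; plug in your own client here.
    raise NotImplementedError

def extract_final_answer(completion: str) -> str:
    # Take the last non-empty line as the answer; real parsers are task-specific.
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def self_consistency(prompt: str, num_samples: int = 8) -> str:
    # Sample several chains of thought and return the most common final answer.
    # More samples generally improve accuracy but multiply inference cost,
    # which is exactly the trade-off described above.
    answers = [extract_final_answer(generate(prompt)) for _ in range(num_samples)]
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer
```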
Each method has its merits, but the SFT combined with RL approach often yields the most robust results. However, exploring other strategies can provide unique insights, such as the significant impact of pure RL on enhancing a model’s reasoning capabilities.
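The data-generation half of distillation can be sketched in a few lines: a strong teacher model answers a pool of prompts, and the resulting (prompt, response) pairs are written out as an SFT dataset for a smaller student. `teacher_generate`, the prompt list, and the JSONL format are illustrative assumptions, not the exact pipeline DeepSeek used.

```python
import json

def teacher_generate(prompt: str) -> str:
    # Placeholder for querying the strong teacher model (e.g. through an API).
    raise NotImplementedError

def build_distillation_dataset(prompts: list[str], out_path: str) -> None:
    # Write (prompt, teacher response) pairs as JSONL, a common input format
    # for supervised fine-tuning of a smaller student model.
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "response": teacher_generate(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The student is then fine-tuned on this file with an ordinary supervised objective, as sketched in the budget section further below.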
Insights into OpenAI’s O1 and O1-Mini Models
The blog also offers educated conjectures about the training methodologies behind OpenAI’s O1 and O1-Mini models. It posits that O1 combines the third approach (SFT + RL) with the first (Inference-Time Scaling), while O1-Mini likely relies on the fourth (Pure SFT with Distillation).
DeepSeek-R1 vs. OpenAI’s O1
Finally, Raschka compares DeepSeek-R1 with OpenAI’s O1, and several key points emerge:
- Both models are similar in design, but R1 is more optimized and cost-effective.
- The reason? DeepSeek spent more time refining the training process, whereas O1 seems to rely more on inference-time techniques.
- However, since O1’s model size is unknown, a fair comparison is difficult.
Cost Considerations: How Expensive Is DeepSeek-R1?
- A widely cited figure suggests that DeepSeek-R1 cost $6 million to train.
- However, that figure refers to the training of its predecessor, the DeepSeek-V3 base model (released in December 2024), and is often conflated with R1’s cost.
- The exact cost of R1 alone remains undisclosed.
What If You Have a Small Budget?
For those with limited financial resources, Raschka suggests focusing on distillation techniques. He references the recent Sky-T1 model, which was fine-tuned for roughly $450 worth of compute on just 17,000 SFT samples, yet performed similarly to O1 on some tasks.
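For a sense of what such budget fine-tuning looks like in code, here is a minimal supervised fine-tuning loop over teacher-generated data using Hugging Face Transformers. The model name, file path, batch size of one, and hyperparameters are placeholders chosen for brevity; they are not Sky-T1’s actual recipe.

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # small base model; an illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load (prompt, response) pairs produced by a stronger teacher model.
with open("distillation_data.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

def encode(record: dict) -> dict:
    # Concatenate prompt and response into one training sequence.
    text = record["prompt"] + "\n" + record["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for record in records:  # a single pass, one example at a time, for brevity
    batch = encode(record)
    # Standard causal-LM objective: the labels are the input ids themselves.
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```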
Additionally, the blog touches on Journey Learning, a promising but underexplored concept in reasoning models—though this discussion extends beyond the current post.
Summary
Reasoning models like DeepSeek-R1 are revolutionizing AI by enabling more complex, multi-step problem-solving. However, they aren’t suitable for every task, especially when speed, cost, or factual accuracy is the priority.
Raschka’s blog provides valuable insights into how these models are built, the techniques behind their success, and strategies for developing reasoning models even on a low budget. Whether you’re an AI researcher, developer, or just an enthusiast, understanding these nuances is crucial as AI continues to advance.