Hey friends, I’m Akash!
Software Synthesis analyses the intersection of AI, software and GTM strategy. Join thousands of founders, operators and investors from leading companies for weekly insights.
You can always reach me at akash@earlybird.com to exchange notes!
We tend to think of AI developments falling into two distinct eras: pre and post ChatGPT's release in November 2022.
January 2025 may be remembered as the month the economics of AI fundamentally changed.
Constraints breed innovation
Last week, January 20th, Chinese AI lab DeepSeek released their first reasoning model, R1. This followed the release of V3 on December 26th, a 671B sparse Mixture of Experts model that activates 37B parameters at inference.
V3 became the best open weights model whilst R1 nearly matches the performance of OpenAI’s o1 at a fraction of the cost.
Export controls on semiconductors have forced Chinese labs like DeepSeek (and several others like ModelBest, Zhipu, MiniMax, Moonshot, Baichuan, 01.AI, and Stepfun) to innovate on efficiency.
Is it in fact the case that, if you impose sanctions on China so that they can’t get as much compute, then all you do is give them this constraint to optimize against, which says, “How can we squeeze every little bit of IQ out of every FLOP that we’ve got?”, and they just find clever ways of doing a lot more with a lot less.
Nat Friedman
The cost of training these models naturally exceeds the touted $5.6m (c. 2.8m GPU hours) to train V3 when you factor in the cost of talent and research/experimentation.
Even so, these results were achieved with remarkable compute efficiency relative to the big AI labs (both open and closed-source).
It’s important to internalise the sequence in which these models were trained to begin contemplating the second-order effects.
Back in May 2024, DeepSeek released V2, a model that was cheaper and better than Llama 3 70B.
DeepSeek’s V3 was released in December, and R1 was released last week. R1 was in fact used for V3’s post-training, and it’s R1’s training recipe that’s most interesting.
R1 relied on synthetic data generated from R1-Zero, a separate reasoning model that was trained purely using Reinforcement Learning, with no supervised finetuning.
R1-Zero was given:
Problems to solve (math, coding, logic)
Two simple rewards: "Did you get the right answer?" and "Did you show your work in the right format?"
Through pure trial and error, R1-Zero developed emergent behaviours like learning to take more time to think ("aha moments"), self-correction ("Wait, let me check that again") and multi-step reasoning.
However, the reasoning process wasn’t legible to humans - the model was mixing languages (Chinese and English) and the formatting wasn’t easy to follow.
Now comes R1.
Cold Start SFT: R1 is fed R1-Zero’s synthetic reasoning data and more synthetic data from other undisclosed models
Large-Scale RL: Reward correctness and structure (e.g., enforcing
<think>
/</think>
tags) + language consistency (stick to one language).Rejection Sampling: Filter R1’s outputs to blend reasoning with general skills (e.g., writing, trivia).
Final RL: Combine reasoning rewards with human preferences (helpfulness, safety).
When V3 was released, many found it amusing to elicit this response:
The DeepSeek models are open weights and therefore we don’t have visibility into the training data. It’s plausible (and more than likely) that some of the synthetic data needed to train V3 and/or R1 came from OpenAI’s models.
This is inevitable when the activation energy needed to distil models is low.
But then there’s another theory which is actually anything can be competed away and what matters is how much attention your industry gets and if you’re just on the front page of every news publication every single day, any obvious parts of margin are going to get sucked out of the supply chain if the activation energy and joining in getting some of that money is low, and in software it’s just devilishly low to the point where a random company in China is just able to go for gold, I think it would’ve been far different if you had to, I don’t know, physically start assembling chemicals to get going in AI.
Daniel Gross
Yeah, you end up paying this huge first mover penalty for pushing the frontier out, because you have to build these expensive models that are easy to distill. There’s an analogy to humans though, which is that a very smart person can grow up and learn how the world works, and they are distilling the knowledge of all the humans that came before them, which was this gigantic model of humans and 1,000 years of human research and progress, and we all kind of do that essentially. I mean, the models are doing that from the Internet data as well.
Nat Friedman
DeepSeek then released 6 distilled versions of Alibaba’s Qwen 2.5 (14B, 32B, Math 1.5B and Math 7B) and Llama 3 (Llama-3.1 8B and Llama 3.3 70B Instruct) under the MIT license (which is incompatible with the underlying Llama license, but that’s another topic).
These models were fine-tuned using 800K high-quality examples generated via R1, teaching the smaller models to mimic its reasoning.
These tiny (7B) distilled model end up beating GPT-4o, lowering the cost of intelligence dramatically (the very goal OpenAI has been pursuing).
This is why these results are so profound.
For large AI labs, capital and scale were moats. It took literally billons of dollars of compute and data to pre-train a state-of-the-art model. Let alone all the dollars it took to pay the researchers (if you could even hire them)! There were only a small handful of companies that had access to this kind of capital and research talent. And when it takes that much money to create something, you'll want to charge for it
However, model distillation is SUPER important. And reasoning models are just as easy (if not easier) to distill. So what does this mean? It means anyone can take that super complex, SOTA model (that someone else spent billions on), spend a FRACTION of the cost and time distilling it, and end up with their own model that's nearly just as good. What does this mean?
It means the large AI labs (in the most pessimistic view) are providing free outsourced R&D and CapEx to the rest of the world. OpenAI o1 charges $15 / million input tokens and $60 / million output tokens. The same cost for the DeepSeek R1 model? $0.14 / million input and $2.19 / million output. Orders of magnitude difference
Jamin Ball
We don’t know the mix of synthetic data used to train R1, so one can only guess the amount of o1 tokens that were used to train R1. Directionally, though, this has lots of implications.
Between the research produced by Moonshot and DeepSeek, some other important findings include:
Training vs Inference Search: Instead of inference time search (as o1 purportedly reasons), R1 and k1.5 converge on more efficient RL methods for pre-training. Models can learn implicit search strategies during training that do not require complex search during inference.
Process Reward Models (PRMs): Instead of relying on complex, granular feedback at each step of the reasoning process, these models lean more heavily on outcome-based rewards. They are still getting feedback on individual steps, but they are not explicitly training a separate reward model.
Long-CoT to Short-CoT Distillation: Kimi introduces "long2short" methods to transfer reasoning abilities learned in long-context models to more efficient short-context models. This addresses a practical concern – long-context models can be expensive to run, so distilling their knowledge into smaller, faster models is valuable. This is how R1 was used to distil the Qwen and Llama Models
ModelBest, another Chinese company focused on Edge AI, has also produced some key findings. Cofounder Zhiyuan Liu proposed the Densing Law of Large Models, which suggests that model capability density increases exponentially over time.
"Capability Density" is the ratio of the effective parameter size of a given LLM to the actual parameter size. For example, if a 3B model can achieve the performance of a 6B reference model, then the capability density of this 3B model is 2 (6B/3B).
According to the Densing Law:
Every 3.3 months (about 100 days) the parameters a model needs to achieve the same capability decreases by half;
the model inference cost decreases exponentially over time;
the model training cost decreases rapidly over time;
the capacity density of large models is accelerating;
the miniaturization of models reveals the huge potential of edge intelligence;
the model capacity density cannot be enhanced by model compression;
the period of time it takes for density to double determines the "validity period" of the model.
As an example, a 2.4B parameter model released in February 2024 had the same capabilities as GPT-3 (released in 2020, 175B paremeters).
The research coming out of these labs has far-reaching consequences.
Financing Frontier Models
The entire premise of distillations rests on a teacher model teaching a student model.
The Stargate announcement, together with future capex estimates from the hyperscalers, speak to the degree of infrastructure investment going into frontier model training and inference.
The $600B question very much relies on revenue generation.
It's going to be very very difficult for any Frontier lab to invest at the level that's going to be required and not have a robust business model behind it.
What I would tell you about OpenAI revenues, they ended the year I think it was rumoured four to five billion run rate growing very robustly um you know you would expect a company at at this at this stage to be growing at least triple digits and so you start thinking about 10 billion plus in revenue for this company.
Brad Gerstner
But if model defensibility collapses due to distillation and the depreciation schedules accelerate, how will the AI labs be able to raise the capital needed to sustain investment?
It’s why the frontier labs are becoming product companies, as we’ve discussed before.
If DeepSeek’s rise up the App Store is any evidence, switching costs are even lower than we’d imagined, which places even more of an onus on the model companies to either become consumer tech companies (as OpenAI already clearly is) or enterprise infrastructure (as Anthropic will undoubtedly lean on Amazon for).
Distilled Domain-Specific Models
Distillation give a blueprint for training domain-specific models. If R1-Zero’s combination of pure RL + synthetic data from frontier labs can scale, the path to cheap reasoning models is clear.
An interesting direction to explore is combining these performant, cheap models with high quality human annotations from subject matter experts across different domains.
Companies like Mercor, Labelbox and many others have been serving labs like OpenAI and Meta with expert annotators. Mercor has 300k experts in their talent pool.
In addition to application vendors developing domain-specific models, providers like MosaicML (Databricks) will likely see renewed interest in model training from enterprises sitting on proprietary data.
Edge AI and Application Margins
DeepSeek R1 quantised to 4-bit requires 450GB in aggregate RAM, which can be achieved by a cluster of three 192 GB M2 Ultras ($16,797 will buy you three 192GB Apple M2 Ultra Mac Studios at $5,599 each).
If Densig’s Law holds, the promise of local inference will be realised. SaaS vendors will be able to charge subscriptions whilst keeping margins high by running inference on the end customer’s device.
Where it gets even cooler is right now no app developers can really offer free applications using AI because as a developer you have to you have to think about the cumulative aggregate cost of inference on cloud providers and then estimate how much that's going to be and then figure out some way to cover that cost. There's no way to defray that cost to the end consumer and so this is why we have these like kind of terribly inelegant models of monetization like subscriptions where it's like okay well I guess that's like a okay solve but what's even better is if all the inference can happen locally.
Amazing. Incredible. If Apple gets to a point where they're they release the Apple intelligence SDK and allow developers to use to to basically run inference locally on device all of a sudden as an app developer you can offer a free app. App developers literally can't offer free apps right now because they have to do the mental calculation of like okay like how much is this going to cost.
Chris Paik
Model distillation + routing is already seeing gross margins for application AI companies surpass typical software margins.
If you're sitting here today as an application developer versus two years ago, the cost of inference of these models is down a hundred X, 200 X. It's frankly outrageous. You've never seen cost curves that look this steep, that fast.
And this is coming off of 15 years of cloud cost curves, which were amazing and mind-blowing by themselves. The cost curves on AI are just a completely different level. We were looking at cost curves in the first wave of application companies that we funded in 2022. You look at the inference costs and it would be like $15 to $20 per million tokens on the latest frontier models.
And today, most companies don't even think about inference costs because it's just like, well, we've broken this task up and then we're using these small models for these tasks that are pretty basic, and then the stuff we're hitting with the most frontier models are these very few prompts. And the rest of the stuff we've just created this intelligent routing system. And so our cost of inference is essentially zero. And our gross margin for this task is 95%.
Chetan Puttagunta
Amazon’s Nova models show that they’re all in on model commoditisation, whilst Apple can finally catch up with distilled models.
In the coming months, we’ll likely see many followers emerge inspired by DeepSeek.
I’d love to hear your feedback - DM me here or at akash@earlybird.com to compare notes.
Thank you for reading. If you liked this piece, share it with your friends.