Replies: 17 comments 13 replies
-
You mentioned the D:N ratio came out to ~8 vs Chinchilla's ~20, and speculated it might be Muon or small-model artifacts. Is there a simple way to think about why an optimizer would change this ratio? My intuition is that a more efficient optimizer "fills up" model capacity faster, so you'd shift compute toward bigger models rather than longer training. Is that directionally right, or is something else going on? Separate question on depth as the dial... you fixed aspect ratio and swept depth. Is there prior work (or intuition) suggesting depth is the more natural axis to vary vs width? Or was this mainly a practical choice to keep the sweep clean?
-
This resonates with your “dial / family of models” framing. One question I’ve been thinking about: we treat compute / params / tokens as the primary dials, but there seems to be an upstream dial that usually stays implicit — the human-side pre-filtering and curation regime that defines what even enters the training pipeline. Have you ever tried holding model + compute fixed, while systematically varying how data is curated or labeled (e.g. exploratory vs delivery-driven filtering), and then measuring downstream calibration or OOD behavior? My hunch is that some fragility attributed to scaling limits may actually come from an unablated observer interface earlier in the pipeline.
-
Title: [2025++]
-
I might be missing something or perhaps I misunderstood the concept, but I have a question regarding the individual metrics. Since users are often interested in performance on specific tasks, do the individual benchmarks that make up the CORE score also show the same smooth scaling laws?
-
Wonderful stuff. How does FP precision affect parameter count? Is 2B 8-bit equal to 1B 16-bit? Could this be related to the discrepancies in the optimal ratios between Chinchilla and this: we are training at lower precision (are we?) so we need less information to "fill up" each parameter?
-
This is so good, thank you! 🙏
-
Thanks @karpathy - did you include the scripts for the
-
very interesting, how big was the improvement in metrics by making this change? |
-
You are a goddamn hero. I lost it when you trained a model to estimate GPT-3 scores. Genius.
-
Hi Andrej, thank you for sharing your detailed and granular empirical analysis. Building on an earlier piece of research, A Resource Based Model For Neural Scaling Laws:
From my theoretical derivation with width-only scaling (fixed depth), N_p ∝ N², which implies that N ∝ N_p^(1/2). Since FLOPs (C) grow proportionally to the resources (neurons) N, this means that N ∝ C^0.5, which is pretty close to your 0.49 for the optimal model size exponent. This seems to suggest that width-only scaling beyond critical depth is not just theoretically sound but also matches (at least in your setting) the optimal scaling behavior you've observed in practice. Additionally, here is an illustration (from my preprint) of the projection in relation to the Chinchilla paper:
This alignment between theory and empirical evidence reinforces the value of your miniseries approach -- it's helping to bridge the gap between theoretical understanding and practical implementation of scaling laws. Best regards,
-
Amazingly similar to the Cobb–Douglas model (economics). Sub-linear returns if scaling occurs in only one input factor, but linear returns when you scale factors together.
-
Amazing post! I have just one quick question: did you consider using muP-style hyperparameter transfer for these scaling experiments? If so, what tradeoffs led you to stick with empirical tuning instead?
-
Update Jan 11: pushed a number of upgrades and re-ran the miniseries.
-
This is a great repository and a good write-up. Many thanks. However, I am afraid that the scaling laws are more artifact than fact. They are largely a consequence of architectural limitations (a.k.a. transformers). In general, compute is not proportional to the number of learnable parameters; CNNs are one of the simplest counterexamples. The idea that the affine forward/backward weight operations are the bulk of neural computing is a similar fallacy and a consequence of the same limitations. Full connectivity is the worst possible synaptic map and the other culprit. Although distributed semantics is arguably one of the greatest discoveries in (computational) linguistics in the 20th century, embeddings are not an effective way to implement and attempt to learn them. All of those things are ephemeral and they will not survive much longer. The scaling laws depend on them and will likewise collapse. They will certainly not survive my own research.
-
What should I do
-
Update Feb 7. I re-ran the scaling laws overnight because I was curious where the repo is at, and it's been exactly 1 month since Jan 7. First, the val_bpb scaling laws look nice and clean:
The CORE score, sadly, is a bit noisier than I'd like at this small scale. I did a lot of experimentation on how to smooth it out and succeeded in making a "SmoothCORE", but decided to abandon it out of fear of confusing people and introducing another metric. So e.g. I think the "regression" at d16 is just noise. d30 reaches a CORE of almost exactly 30.0:
That said, the fit to the (slightly noisy) CORE scores implies the following costs for reproducing various models in the GPT-2 and GPT-3 miniseries:
So GPT-3 is now predicted to cost $32.3K and take ~56.1 days to train with the nanochat d75 model. Recall that on Jan 7 this estimate was all the way up around $1,076.5K, so that's a pretty solid improvement. But again, this is all a little bit weird and noisy and not to be trusted exactly. (Ideally we'd have targets in val_bpb, which is a clean metric, but it's not comparable here due to training data distribution shift.) Still, that's the rough napkin math at this point.
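As a quick back-of-the-envelope check on that number, assuming the same 8XH100 at $3/GPU/hour = $24/hour rate used in the main post below:

```python
# Rough check of the GPT-3 reproduction estimate above: 8XH100 at $3/GPU/hr = $24/hr.
days = 56.1
cost = days * 24 * 24   # hours per day * dollars per hour
print(f"${cost:,.0f}")  # -> $32,314, i.e. the ~$32.3K quoted above
```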












Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Why miniseries. The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will work and your money will be well spent. For the first public release of nanochat my focus was on an end-to-end pipeline that runs the whole LLM workflow with all of its stages. Now, I'm coming back around to flesh out some of the parts that I sped through, starting of course with pretraining, which is both computationally heavy and critical as the foundation of intelligence and knowledge in these models.
Miniseries v1. In nanochat, that single dial is the depth of the model. For example, d12 is my favorite model (it's the size of GPT-1!) - it has 12 layers and currently trains in ~6 minutes. The setting of depth determines the number of channels in the Transformer (via the constant "aspect ratio"), and in turn the number of parameters and flops per token of the Transformer, the optimization hyperparameters (the learning rate in particular), and finally, via scaling laws analysis, the horizon of training to obtain a "compute optimal" model (more on this below). As of the latest commit, the script miniseries.sh sweeps out the family of nanochat models from d10 to d20. All of these fit into a single 8XH100 node at the training batch size of 2**19 = 524,288 tokens without having to reach for micro batches and gradient accumulation. nanochat already supports gradient accumulation and I've trained much larger models (e.g. d34 recently), but I wanted to focus on this simplest setting first. The wandb plots look like this; the x-axis is flops and the y-axis is validation bpb (bits per byte, i.e. loss):
What you're seeing here are models d10...d20. These 11 models took ~4 hours back to back on my trusty 8XH100 node to train, for ~$100 of total cost. If your code, architecture and optimization are properly arranged and you did your scaling laws right, these curves should not intersect. Each one represents the unique, compute optimal way to reach a target validation loss.
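As a rough illustration of the dial (a hedged sketch, not the actual nanochat code; the aspect ratio of 64 channels per layer and the head size of 128 are assumptions made here for concreteness), everything downstream can be derived from depth alone:

```python
# Hedged sketch: deriving a model config from the single `depth` dial.
# The aspect ratio (64 channels per layer) and head size (128) are illustrative assumptions.
def config_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    n_embd = depth * aspect_ratio          # channel count follows depth via the fixed aspect ratio
    n_head = max(1, n_embd // head_dim)    # head count follows the channel count
    return {"n_layer": depth, "n_embd": n_embd, "n_head": n_head}

print(config_from_depth(12))  # -> {'n_layer': 12, 'n_embd': 768, 'n_head': 6} under these assumptions
```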
Comparison to GPT-2/GPT-3 miniseries. I did not want to use the validation loss to compare models because, while it is simple, it can be subtle and deceiving. For example, modded nanogpt (which I otherwise love) merged a few changes that I thought were mildly gaming the metric (e.g. using very long sequences and batch size 1). It's a bit subtle, but stretching out your validation batches into one long row (i.e. batch size B=1) just means there are fewer tokens with cropped contexts at the first few columns of your (B, T) batches than when B>>1. This changes the validation loss by increasing the amount of context for many of these tokens, so it's not apples to apples, and the resulting "improvement" is not real. In addition, validation loss is a bad way to compare against the GPT-2 and GPT-3 models because they were pretrained on a very different and unknown data mix distribution, so comparing FineWeb loss is not fair or informative. Only actual metrics are real and comparable. Earlier in the year I stumbled upon the DCLM paper, where they presented a nice ensemble metric over a lot of different datasets. The metric is called CORE and it incorporates performance across 22 nice, high-quality datasets. The DCLM code had a complicated and bloated way of calculating it, so I stripped it all the way down to a single, simple, dependency-free file that evaluates the CORE metric given a model. Then we can chart a nice, valid comparison of our miniseries v1 models to GPT-2 and GPT-3 (more on how I calculated/estimated their CORE scores below), where the x-axis is resource spend (FLOPs, time) and the y-axis is CORE score. To get $ as the x-axis, simply take the time (hours) and multiply by $24 (as the cost of 8XH100 is $3/GPU/hour X 8 GPUs = $24/hour).
The goal for miniseries v2 is now simple: to further optimize the pretraining code and to lift up (and ideally tilt!) this line; to get more bang for the buck.
Details
Scaling laws. One of the important and trickier aspects of getting this to work is doing a good job with your scaling laws (see Kaplan et al. and Chinchilla/Hoffmann et al.). The problem essentially is as follows. Suppose I want to train a d12 model. How many iterations should I train it for? Remember that at this small scale we are in the infinite data regime so there are no concerns of overfitting and therefore it makes no sense to, for example, train until your validation loss starts climbing. In the infinite data regime, the validation loss keeps going down indefinitely as you train longer (it just starts to level off slowly), moreover your train loss is basically equal to your validation loss - no overfitting. The answer to the problem is that the question is not quite right - you don't really want to train a d12. Instead, you have a certain compute budget of FLOPs (e.g. I want to run my cluster for exactly one day) and you want the lowest achievable loss. The real question then is: should you train a small model for many iterations or should you train a bigger model for fewer iterations? Scaling laws are a way of determining how to map from the single variable you have control over (the total number of flops) to the optimal setting of N (the number of parameters of your model) and D (the number of tokens you will train for, which is trivially related to the number of iterations or the length of time given a fixed batch size per step of the optimization).
Given a nanochat model of a certain depth, the way to calculate its flops is as follows:
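The exact helper isn't reproduced here; as a hedged sketch, the standard dense-Transformer approximation (6 FLOPs per parameter per token for the matmuls, plus an attention term) looks like this:

```python
# Sketch of the standard approximation, not necessarily the exact nanochat code:
# FLOPs per token ~= 6*N (fwd+bwd matmuls) + 12 * n_layer * d_model * seq_len (attention scores/values).
def flops_per_token(n_params: int, depth: int, d_model: int, seq_len: int = 2048) -> float:
    return 6 * n_params + 12 * depth * d_model * seq_len

# Total training FLOPs are then roughly flops_per_token(...) * total_training_tokens.
```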
And the way to calculate its parameters is simply as:
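And a correspondingly hedged sketch for the parameter count (assuming a 64-channels-per-layer aspect ratio, a 65,536-token vocab, untied embeddings, a 4x MLP, and ignoring norms and biases; the real script may differ):

```python
# Sketch: approximate parameter count for a depth-d model under the assumptions above.
def num_params(depth: int, aspect_ratio: int = 64, vocab_size: int = 65536) -> int:
    d_model = depth * aspect_ratio
    per_layer = 12 * d_model * d_model           # 4*d^2 attention projections + 8*d^2 for the 4x MLP
    embeddings = 2 * vocab_size * d_model        # untied input/output embeddings assumed
    return depth * per_layer + embeddings

print(num_params(20))  # ~5.6e8 under these assumptions
```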
Next we use the --target_flops flag of base_train.py to fix the flops to a specific target (e.g. 3e18) and run models of a few depths. The code will automatically scale the number of iterations so that it exactly gets to your desired target flops (a rough sketch of this bookkeeping follows below). Small depths train long, large depths train short; however, all of these models of different depths will end up costing exactly 3e18 FLOPs.
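A hedged sketch of that iteration scaling, reusing the approximate flops_per_token and num_params helpers from above (the real script's bookkeeping may differ):

```python
# Sketch: given a FLOPs budget, how many optimizer steps does a model of depth d get?
def iters_for_budget(depth: int, target_flops: float,
                     batch_tokens: int = 2**19, seq_len: int = 2048) -> int:
    d_model = depth * 64                                       # assumed aspect ratio of 64
    fpt = flops_per_token(num_params(depth), depth, d_model, seq_len)
    total_tokens = target_flops / fpt                          # tokens we can afford at this budget
    return round(total_tokens / batch_tokens)                  # steps at ~0.5M tokens per step

# e.g. at a 3e18 budget, d8 gets many more steps than d20:
print(iters_for_budget(8, 3e18), iters_for_budget(20, 3e18))
```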
The individual runs will look like this:
You see how the big models (e.g. d20, brown) ran for very few iterations and small models (e.g. d8, magenta) ran for many iterations. All of these models cost the exact same amount of FLOPs, but clearly somewhere in between, one of them (d16 here) struck the correct balance and reached the lowest loss. That specific combination of model size and training length is compute optimal. When you repeat this process for a few FLOP budgets, you get surprisingly nice U shapes where for each one there is a concrete setting of the model size that is compute optimal:
When you look at the optimal points (stars) after a quadratic fit you get:
For comparison, here is the same plot from the Chinchilla paper except with more compute:
Now there are a few important things to note. First, notice how (exactly like Chinchilla) the optimal number of parameters and tokens to train for is proportional to compute C to the power of ~0.5. As a sanity check, the two exponents add up to ~1 because C ~= 6ND. But more importantly, they are equal to each other. This is a remarkable result that we reproduce: it means that parameters and tokens are on equal footing w.r.t. compute optimal models. For example, if you double (2X) the compute budget, this is saying you should 1.41X your parameters and 1.41X your number of tokens (1.41 ~= sqrt(2)). In particular and most importantly, the fact that they are equal means that the optimal ratio of D:N is constant, regardless of the compute budget C. This is because if we model the number of parameters as N = k₁ · C^a and tokens D = k₂ · C^b, then D/N = (k₂/k₁) · C^(b-a), and so if a = b = 0.5, then C^(0.5 - 0.5) = C^0 = 1 = constant! So the optimal ratio D/N = k₂/k₁ is constant regardless of the compute level of interest C. Note that this really didn't have to be the case; it could have been something else, complicated, and a function of the model size (the original scaling laws paper of Kaplan et al. found this to be the case, incorrectly, due to a major bug in learning rate decay), but nature decided that compute optimal Transformers fall exactly on a straight line (what???). Chinchilla pointed it out and nanochat reproduces this surprising finding. In any case, practically speaking this is huge because it means that we have a single constant telling us the optimal ratio between D and N, and therefore we can simply have a target_param_data_ratio in the nanochat base_train script, which calculates the optimal number of tokens to train for regardless of the depth of the model. In Chinchilla, they empirically measure k₂/k₁ to be 20. In nanochat, when you do the fits you actually get something much lower: 8. It's possible that some specifics of nanochat (the Muon optimizer, or...?) make it so that nanochat prefers bigger models trained shorter. Or it could be an artifact of the smaller model sizes we're looking at here. In any case, we now know how long we should train any given model. You take the number of its parameters N, you multiply by 8 to get the number of target tokens D, then divide D by the batch size (~0.5M) to get your number of iterations, done. We're now only training compute optimal models.
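As a worked example of that recipe (a back-of-the-envelope sketch reusing the approximate num_params from above; the real script's accounting of N may differ, e.g. whether embeddings are counted):

```python
# Sketch: turning the target D:N ratio of 8 into an iteration count for a d12 model.
N = num_params(12)                  # approximate parameter count (sketch from earlier)
D = 8 * N                           # target training tokens, D = 8 * N
num_iterations = round(D / 2**19)   # batch size of 2**19 = 524,288 tokens per step
```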
Hyperparameter sweeps. I ran a few more tuning sweeps that I won't spend a lot of time on. The learning rates are close to optimal after a small nudge to the embedding learning rate. The warmdown ratio was the biggest surprise and I nudged it 0.2 -> 0.4. A sequence length of 2048 turns out to be quite good, balancing context length and document diversity in our batch of 0.5M tokens. The batch size of 0.5M is a little bit on the larger side and purely flops-wise should be a little bit smaller (~half), but wallclock-wise it is good as is. All this to say that I did some basic tuning for miniseries v1, but it was by no means exhaustive and there are still many ideas to try.
GPT-2 / GPT-3 CORE scores. Another challenge was calculating CORE scores for the GPT-2 and GPT-3 miniseries. The GPT-2 miniseries was easy because the models were released and are available, so you can just download them and run the eval. For the GPT-3 miniseries I had to get more creative because the models were never released, but we do have the paper with their evaluation results in the tables. I posted the full approach to a jupyter notebook, but basically I found 6 tasks that are both in the CORE metric and reported in the GPT-3 paper in a very similar evaluation setting. They are ['HellaSwag 0-shot', 'LAMBADA', 'HellaSwag 10-shot', 'PIQA', 'ARC Easy', 'ARC Challenge']. I then use the GPT-2 models for calibration, meaning that I trained a simple model that takes the performance on these 6 tasks and estimates the CORE score (over 22 tasks), using 3 different approaches. The fact that these 6 are solid evals and that the points lined up very nicely gave me confidence that these CORE scores are not far off.
After all this work we get our targets:
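The notebook itself isn't reproduced here; one simple way to do this kind of calibration (a hedged sketch of the idea, not necessarily one of the 3 approaches actually used, and with placeholder numbers rather than the real scores) is an ordinary least-squares fit on the GPT-2 models, then applying it to the GPT-3 paper numbers:

```python
import numpy as np

# Hedged sketch: calibrate 6-task scores -> CORE on the GPT-2 models, then apply to GPT-3 rows.
rng = np.random.default_rng(0)
gpt2_six_task = rng.uniform(0.2, 0.7, size=(4, 6))   # placeholder: 4 GPT-2 models x 6 shared task scores
gpt2_core     = rng.uniform(0.1, 0.5, size=4)        # placeholder: their measured 22-task CORE scores

X = np.hstack([gpt2_six_task, np.ones((4, 1))])      # add a bias column
w, *_ = np.linalg.lstsq(X, gpt2_core, rcond=None)    # least-squares calibration fit

def estimate_core(six_task_scores: np.ndarray) -> float:
    """Estimate CORE from the 6 shared task scores (e.g. read off the GPT-3 paper tables)."""
    return float(np.append(six_task_scores, 1.0) @ w)
```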
(Note 1: You'll notice that GPT-3 at the same parameter counts as GPT-2 is a slightly better model with stronger performance, due to various improvements to data, architecture and optimization, and the GPT-3 models are also trained on a lot more tokens (300B) compared to GPT-2's estimated token budget of somewhere around ~100B tokens. Note 2: All of these are CORE scores v1, not v2 (iykyk, otherwise nvm).)
Miniseries v1 CORE scores. Second, here are the nanochat miniseries v1 models in their full detail:
NOTE: Do not confuse the v1 miniseries with the previous nanochat models I have trained so far in the rest of the discussions. Those models were trained with a D:N ratio of 20 (Chinchilla); these models use 8. So they are trained for less (at each depth), but are compute optimal for their respective validation loss.
We can also take the models d12+ (discarding some of the smaller models for fear of outliers at that tiny scale) and do a fit of the asymptotic form predicting (total) parameters -> CORE, which gives CORE = 1.0000 - 3.7555 * FLOPs^(-0.0344). With this, we can extrapolate to see what we need to reach all of the GPT-2 and GPT-3 models:
Depth to match GPT-2/3 CORE (d>=12 fit, 12:1 D:N ratio, 8xH100 @ $3/GPU/hr):
Now, I wouldn't really read into this too much because we're doing a lot of extrapolation over many orders of magnitude of flops based on very few datapoints. And to estimate the time and cost we're optimistically assuming the same utilization at scale as that of the d20 run, which fully utilizes the 8XH100 box. But still, it's encouraging that, as a sanity check, the predicted FLOPs needed to get to GPT-3 level is 5.7e23 (the real amount needed for the GPT-3 run was ~3e23). If these numbers are to be trusted, then we're in a pretty decent spot, but still with quite a bit of room for improvement.
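The extrapolation amounts to inverting that fit. A hedged sketch (the 0.35 target CORE and the ~5e15 FLOP/s node throughput below are placeholders, not values from the post):

```python
# Hedged sketch: invert CORE = 1.0 - 3.7555 * FLOPs^(-0.0344) to get the compute for a target CORE.
def flops_for_core(target_core: float) -> float:
    return (3.7555 / (1.0 - target_core)) ** (1.0 / 0.0344)

flops = flops_for_core(0.35)        # placeholder target, not an actual GPT-2/3 CORE value
days = flops / (5e15 * 86400)       # placeholder: ~5e15 usable FLOP/s for the 8XH100 node
cost = days * 24 * 24               # $24/hour for the node
print(f"{flops:.2e} FLOPs, {days:.1f} days, ${cost:,.0f}")
```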
For any Q&A please feel free to use the discussion below; alternatively, find me on the Discord channel #nanochat.