
Exploring Mechanistic Interpretability on Weight Initialization Methods
for Small Transformer Language Models

The Optimization Bias
Weight initialization is traditionally evaluated almost exclusively through the lens of optimization dynamics.

Similar Loss, Different Minds?
Two models may reach identical validation loss while relying on entirely different internal mechanisms.
Our Hypothesis
Initialization influences the formation of internal mechanisms, even when optimization metrics hide these differences.

What Are We Asking?
Performance & Emergence
How do different initialization schemes affect training dynamics and the emergence of useful language-modeling behavior?
Mechanistic Differences
Do models trained with different initializations develop different internal mechanisms, even when their final performance is similar?
Internal Geometry
Do different initialization schemes produce different activation and weight geometry (norms, entropy, singular values) during training?

What We Might Find
Learning Trajectories
Mechanism Localization
Hidden Differences
Explanatory Geometry

Standing on Shoulders
Weight Initialization
Classical methods designed to stabilize signal propagation and optimization variance.
He et al. (2015). Do they also affect internal mechanisms?
Mechanistic Interpretability
Tools to inspect computations: attention visualization, ablation, activation patching.
Elhage et al. (Mathematical Framework)
In-Context Learning
Induction behavior as a clear diagnostic of whether small Transformer LMs use their previous context.
Singh et al. (2024) (What goes right?)

The Experimental Pipeline
Init Setup
Apply 6 distinct initialization schemes to identical architectures.
Training
Train small Transformers on synthetic diagnostic data (5 seeds each).
Behavioral Eval
Measure loss, task accuracy, and emergence step thresholds.
MI Analysis
Apply 6 mechanistic interpretability techniques across checkpoints.

Model & Data Setup
To isolate mechanistic variables while managing compute costs, we use a controlled synthetic environment rather than pretraining a large LLM.
- Vocab Size: 64 or 128
- Context Length: 64
- Patterns: Repeated tokens, copy logic
Input: [12, 45, 8, 99, 12]
Target: [45] (token 12 repeats, so the model should predict 45, the token that followed its first occurrence)
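As a concrete sketch of how such diagnostic sequences might be generated (the function name and sampling details are illustrative assumptions, not the project's actual data pipeline):

```python
import numpy as np

def make_induction_sequence(length=64, vocab_size=64, rng=None):
    """Sample a random token sequence whose final token repeats an
    earlier token A; the target is the token B that followed A."""
    rng = rng or np.random.default_rng(0)
    seq = rng.integers(0, vocab_size, size=length - 1).tolist()
    a_pos = int(rng.integers(0, length - 2))  # position of A (B must follow it)
    seq.append(seq[a_pos])                    # repeat A as the final token
    target = seq[a_pos + 1]                   # B, the token right after A
    return seq, target
```

A model that has learned the induction mechanism should predict `target` from the repeated final token.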

Induction Behavior
To rigorously compare mechanisms, we need a controlled diagnostic task. We use repeated-token sequence patterns (induction-like behavior): given a sequence [A, B, ..., A], the model must attend back to the earlier occurrence of A and predict the B token that follows.
Initialization Schemes
Normal (0, 0.02)
Standard Transformer-style baseline.
Xavier / Glorot
Classical variance-preserving baseline.
Kaiming / He
Common initialization for ReLU/GELU-like networks.
Orthogonal
Geometry and norm-preserving initialization.
Scaled Residual
Transformer-specific stability baseline.
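The listed schemes can be sketched roughly as follows for a 2-D weight matrix; the exact gains, fan conventions, and the assumed depth in the residual rescaling are illustrative assumptions, not the experiment's precise settings:

```python
import numpy as np

def init_weight(shape, scheme, rng=None):
    """Illustrative versions of the compared schemes for a
    weight of shape (fan_out, fan_in)."""
    rng = rng or np.random.default_rng(0)
    fan_out, fan_in = shape
    if scheme == "normal":           # GPT-2-style N(0, 0.02)
        return rng.normal(0.0, 0.02, shape)
    if scheme == "xavier":           # variance-preserving across layers
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return rng.normal(0.0, std, shape)
    if scheme == "kaiming":          # accounts for ReLU halving variance
        std = np.sqrt(2.0 / fan_in)
        return rng.normal(0.0, std, shape)
    if scheme == "orthogonal":       # norm-preserving basis (square case)
        q, _ = np.linalg.qr(rng.normal(size=shape))
        return q
    if scheme == "scaled_residual":  # GPT-2-style residual-projection rescaling
        n_layers = 12                # assumed depth, for illustration only
        return rng.normal(0.0, 0.02 / np.sqrt(2 * n_layers), shape)
    raise ValueError(f"unknown scheme: {scheme}")
```

All schemes share the same architecture; only the sampling of initial weights differs.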

Did it Learn?
Before inspecting the mechanisms, we establish the macroscopic training outcomes across initializations.
- Training & Validation Loss: standard optimization health.
- Diagnostic Accuracy: task-specific success rate.
- Logit Difference: logit(correct) - logit(distractor).
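A minimal sketch of how diagnostic accuracy and mean logit difference could be computed from a batch of final-position logits (the function name and array shapes are assumptions for illustration):

```python
import numpy as np

def behavioral_eval(logits, targets, distractors):
    """Batch behavioral metrics.
    logits: (batch, vocab) final-position logits;
    targets, distractors: (batch,) token ids."""
    preds = logits.argmax(axis=-1)
    accuracy = float((preds == targets).mean())
    idx = np.arange(len(targets))
    # margin between the correct continuation and a distractor token
    logit_diff = float((logits[idx, targets] - logits[idx, distractors]).mean())
    return accuracy, logit_diff
```

A positive logit difference means the model prefers the correct token even when raw accuracy has not yet emerged.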

Observing the Network
What tokens does each self-attention head attend to, and do they exhibit induction-like properties?
1. Attention Patterns
Previous-token score, Same-token score, Induction-position score, Entropy.
2. Induction Score
Model-level and Head-level scoring on controlled diagnostic prompts.
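A head-level induction score can be sketched as follows, assuming we have a head's (seq, seq) attention pattern; the scoring convention (attend to the position right after a token's previous occurrence) is one common choice, not necessarily the exact metric used:

```python
import numpy as np

def induction_score(attn, tokens):
    """Average attention mass that each repeated token places on the
    position right after its previous occurrence (the 'induction slot').
    attn: (seq, seq) attention pattern for one head; tokens: list of ids."""
    scores = []
    for q in range(1, len(tokens)):
        # earlier occurrences of the same token as the current query
        prev = [k for k in range(q) if tokens[k] == tokens[q]]
        if prev:
            # attention paid to the token *after* the most recent occurrence
            scores.append(attn[q, prev[-1] + 1])
    return float(np.mean(scores)) if scores else 0.0
```

A perfect induction head scores near 1.0 on repeated-token prompts; a head that never looks at the induction slot scores near 0.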

Testing Necessity
Moving beyond observation to causal proof: what happens if we break or replace specific internal components?
3. Head Ablation
Zero out specific heads. If the model loses the capability, the mechanism is localized to those heads.
4. Activation Patching
Swap clean vs corrupted run activations. Isolates where task-relevant information resides.
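Head ablation can be sketched with a toy multi-head combine step; zeroing is the simplest intervention (mean-ablation, or patching in activations from a corrupted run, are the natural variants). The function and shapes below are illustrative assumptions:

```python
import numpy as np

def multihead_output(head_outputs, ablate=None):
    """Toy multi-head combine: sum per-head output contributions,
    optionally zeroing one head to test whether it is necessary.
    head_outputs: (n_heads, seq, d_model)."""
    if ablate is not None:
        head_outputs = head_outputs.copy()   # leave the clean run intact
        head_outputs[ablate] = 0.0           # zero-ablate the chosen head
    return head_outputs.sum(axis=0)
```

Activation patching follows the same pattern, but instead of zeroing, `head_outputs[ablate]` would be overwritten with the corresponding activation from a second (clean or corrupted) forward pass.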

Information Emergence
Analyzing the intermediate states of the residual stream and the mathematical properties of the weights.
5. Logit Lens
Projecting intermediate residual streams into vocabulary space. At what layer does the correct token emerge?
6. Geometry Analysis
Residual stream norms, attention entropy, singular value spectra of weight matrices.
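The logit-lens projection above can be sketched as follows, assuming a pre-LN Transformer whose final LayerNorm scale and unembedding matrix are available (names and the simplified LayerNorm are illustrative):

```python
import numpy as np

def logit_lens(resid, ln_gamma, W_U):
    """Project an intermediate residual-stream vector through the
    final LayerNorm scale and unembedding to vocabulary logits.
    resid: (d_model,); ln_gamma: (d_model,); W_U: (d_model, vocab)."""
    x = (resid - resid.mean()) / (resid.std() + 1e-5)  # simplified LayerNorm
    return (x * ln_gamma) @ W_U                        # (vocab,) logits
```

Applying this at every layer's residual stream shows at which depth the correct token first dominates the logits.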

Thank You!
Any Questions?

Questions for the Audience
Is the initialization comparison sufficiently interesting and scoped?
Should the train-time stopping criterion be step-matched, loss-matched, or both?
Are the six MI analyses appropriate, or are there any other interesting perspectives to include?