Project Proposal
Advanced Topics in NLP

Exploring Mechanistic Interpretability of Weight Initialization Methods

for Small Transformer Language Models

Instructor
Liangming Pan
Research Team
Konstantin Garbers, Nicholas Oh
Motivation: The Standard View
Background

The Optimization Bias

Weight initialization is traditionally evaluated almost exclusively through the lens of optimization dynamics:

Training loss convergence speed
Validation loss plateau
Gradient norm stability (vanishing/exploding)
[Figure: Typical loss curves. Loss \mathcal{L} vs. training steps for Init A (e.g., Xavier) and Init B (e.g., Kaiming).]
Motivation: The Mechanistic Blindspot
The Core Problem

Similar Loss, Different Minds?

Two models may reach identical validation loss while relying on entirely different internal mechanisms.

Our Hypothesis

Initialization influences the formation of internal mechanisms, even when optimization metrics hide these differences.

[Diagram: Model A (Init X), \mathcal{L} = 2.14, with a localized circuit (1 head) vs. Model B (Init Y), \mathcal{L} = 2.14, with the same behavior distributed across many heads.]
Research Questions
Core Inquiry

What Are We Asking?

RQ1

Performance & Emergence

How do different initialization schemes affect training dynamics and the emergence of useful language-modeling behavior?

RQ2

Mechanistic Differences

Do models trained with different initializations develop different internal mechanisms, even when their final performance is similar?

RQ3

Internal Geometry

Do different initialization schemes produce different activation and weight geometry (norms, entropy, singular values) during training?

Exploratory Hypotheses
Expectations

What We Might Find

Learning Trajectories

Some schemes (e.g., Scaled Residual) may lead to faster, more stable learning of the diagnostic task than classical methods.

Mechanism Localization

Some trained models may rely heavily on one or two attention heads, while others distribute the same behavior across several components.

Hidden Differences

Models with matched validation loss may exhibit fundamentally different ablation, patching, or logit lens behavior.

Explanatory Geometry

Differences in residual norms, attention entropy, or head-output similarity may correlate with the stability differences we observe.
Literature Context
Related Work

Standing on Shoulders

Weight Initialization

Classical methods designed to stabilize signal propagation and optimization variance.

Glorot & Bengio (2010), He et al. (2015).
Do they also affect internal mechanisms?

Mechanistic Interpretability

Tools to inspect computations: attention visualization, ablation, activation patching.

Nanda & Bloom (TransformerLens),
Elhage et al. (Mathematical Framework)

In-Context Learning

Induction behavior as a clear diagnostic of how small Transformer LMs use previous context.

Olsson et al. (2022) (Induction Heads),
Singh et al. (2024) (What goes right?)
Proposed Method
Architecture

The Experimental Pipeline

Init Setup

Apply 6 distinct initialization schemes to identical architectures.

Training

Train small Transformers on synthetic diagnostic data (5 seeds each).

Behavioral Eval

Measure loss, task accuracy, and emergence step thresholds.

MI Analysis

Apply 6 mechanistic interpretability techniques across checkpoints.
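
A minimal sketch of how the resulting scheme x seed sweep could be organized (the scheme keys and the RunConfig structure are our own placeholders, not a fixed design):

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    scheme: str   # initialization scheme key
    seed: int     # RNG seed for weights and data order

# Keys mirror the schemes enumerated later in this deck.
SCHEMES = ["normal_002", "xavier", "kaiming", "orthogonal", "scaled_residual"]
RUNS = [RunConfig(s, seed) for s, seed in product(SCHEMES, range(5))]  # 5 seeds each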

The 'Lab' Environment
Scope Control

Model & Data Setup

To isolate mechanistic variables while managing compute costs, we use a controlled synthetic environment rather than pretraining a large LLM.

Model Architecture
  • Type: Decoder-only
  • Layers: 2
  • Heads: 4 per layer
  • d_model: 128
Synthetic Dataset
  • Vocab Size: 64 or 128
  • Context Length: 64
  • Patterns: Repeated tokens, copy logic
// Example sequence (token ids)
Input:  [12, 45, 8, 99, 12]
Target: [45]   // 12 was followed by 45 earlier, so 45 is predicted again
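
A minimal sketch of a generator for such sequences (the function name and the half-context repetition scheme are assumptions; any repeated-token pattern with the same copy property would do):

import torch

def make_induction_batch(batch_size: int, vocab_size: int = 64, ctx_len: int = 64):
    """Sequences whose second half repeats the first, so every second-half
    token is predictable by copying from the earlier context."""
    half = ctx_len // 2
    first_half = torch.randint(0, vocab_size, (batch_size, half))
    seqs = torch.cat([first_half, first_half], dim=1)   # [batch, ctx_len]
    return seqs[:, :-1], seqs[:, 1:]                    # next-token inputs / targets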
A Concrete Diagnostic Task
Controlled Setting

Induction Behavior

To rigorously compare mechanisms, we need a controlled diagnostic task. We use repeated-token sequence patterns (induction-like behavior).

Induction Head Mechanism
A B C D A ? → B   (attention to preceding context)
Logic: The model must look back to the previous instance of A and predict the B token that follows.
The Independent Variables
Interventions

Initialization Schemes

Normal (0, 0.02)

W_{ij} \sim \mathcal{N}(0, 0.02^2)

Standard Transformer-style baseline.

Xavier / Glorot

W_{ij} \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}}\right]

Classical variance-preserving baseline.

Kaiming / He

W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)

Common initialization for ReLU/GELU-like networks.

Orthogonal

W^\top W = I \quad \text{or} \quad WW^\top = I

Geometry and norm-preserving initialization.

Scaled Residual

W_{\text{res}} \leftarrow \frac{1}{\sqrt{2L}}\,W_{\text{res}}

Transformer-specific stability baseline.
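
A minimal PyTorch sketch of these schemes (the scheme keys and the name filter used to find residual-writing matrices are assumptions that depend on the model implementation):

import math
import torch
import torch.nn as nn

def apply_init(model: nn.Module, scheme: str, n_layers: int) -> None:
    """Re-initialize every matrix-shaped weight according to `scheme`."""
    for name, p in model.named_parameters():
        if p.dim() < 2:                          # skip biases / LayerNorm params
            continue
        if scheme == "normal_002":
            nn.init.normal_(p, mean=0.0, std=0.02)
        elif scheme == "xavier":
            nn.init.xavier_uniform_(p)
        elif scheme == "kaiming":
            nn.init.kaiming_normal_(p, nonlinearity="relu")
        elif scheme == "orthogonal":
            nn.init.orthogonal_(p)
        elif scheme == "scaled_residual":
            nn.init.normal_(p, mean=0.0, std=0.02)
            # Down-scale matrices that write into the residual stream
            # (attention/MLP output projections; this name check is model-specific).
            if "W_O" in name or "W_out" in name:
                with torch.no_grad():
                    p.mul_(1.0 / math.sqrt(2 * n_layers))
        else:
            raise ValueError(f"unknown scheme: {scheme}")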

Analysis 1: Behavioral Metrics
Behavioral Eval

Did it Learn?

Before inspecting the mechanisms, we establish the macroscopic training outcomes across initializations.

  • Training & Validation Loss
    Standard optimization health.
  • Diagnostic Accuracy
    Task-specific success rate.
  • Logit Difference
    Logit(correct) - Logit(distractor).
Emergence Step Analysis
[Figure: Accuracy vs. training steps, with an accuracy threshold (e.g. 80%) and crossing points marked at 4k and 12k steps.]
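
For the logit-difference metric above, a small sketch (tensor shapes and the final-position convention are assumptions):

import torch

def logit_difference(logits: torch.Tensor,
                     correct: torch.Tensor,
                     distractor: torch.Tensor) -> torch.Tensor:
    """Logit(correct) - Logit(distractor) at the final prediction position.

    logits: [batch, seq, vocab]; correct / distractor: [batch] token ids."""
    final = logits[:, -1, :]                               # [batch, vocab]
    return (final.gather(1, correct[:, None])
            - final.gather(1, distractor[:, None])).squeeze(1)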
Analysis 2: Attention & Induction
MI Methods

Observing the Network

What tokens does each self-attention head attend to, and do they exhibit induction-like properties?

1. Attention Patterns

Previous-token score, Same-token score, Induction-position score, Entropy.

2. Induction Score

Model-level and Head-level scoring on controlled diagnostic prompts.

Attention Heatmap (Head 1.2)
[Figure: Target vs. source positions over the repeated sequence "[BOS] The cat sat ... The cat sat ... [EOS]".]
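
A sketch of the head-level induction score on period-repeated prompts, assuming TransformerLens-style attention patterns (e.g. cache["pattern", layer][0], shaped [head, dest, src]); the offset follows the induction logic described earlier:

import torch

def induction_stripe_score(pattern: torch.Tensor, period: int) -> torch.Tensor:
    """Mean attention each head pays from position t back to t - (period - 1),
    i.e. to the token right after the previous occurrence of the current token.

    pattern: [head, dest_pos, src_pos] for one period-repeated sequence."""
    n_pos = pattern.shape[-1]
    dest = torch.arange(period, n_pos)           # positions inside the repeat
    src = dest - (period - 1)                    # previous occurrence + 1
    return pattern[:, dest, src].mean(dim=-1)    # one score per head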
Analysis 3: Causal Interventions
MI Methods

Testing Necessity

Moving beyond observation to causal proof: what happens if we break or replace specific internal components?

3. Head Ablation

Zero out specific heads. Does the model lose its capability? Indicates localization.

4. Activation Patching

Swap clean vs corrupted run activations. Isolates where task-relevant information resides.

Ablation Effect Matrix

        H0      H1      H2      H3
L0    -0.01    0.02   -0.05   -0.85
L1    -0.02   -0.42   -0.01    0.05

Example: Strong localization in L0H3.
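
A minimal zero-ablation sketch using the TransformerLens hook API cited earlier (the function name and the loss-based effect measure are our assumptions; the matrix above would then hold each head's change relative to the unablated run):

from transformer_lens import utils

def head_ablation_loss(model, tokens, layer: int, head: int) -> float:
    """Loss with one attention head's output (hook_z) zeroed out."""
    def zero_head(z, hook):
        z[:, :, head, :] = 0.0          # z: [batch, pos, head_index, d_head]
        return z

    return model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", layer), zero_head)],
    ).item()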
Analysis 4: Lens & Geometry
MI Methods

Information Emergence

Analyzing the intermediate states of the residual stream and the mathematical properties of the weights.

5. Logit Lens

Projecting intermediate residual streams into vocabulary space. At what layer does the correct token emerge?

6. Geometry Analysis

Residual stream norms, attention entropy, singular value spectra of weight matrices.

Logit Lens Flow
[Diagram: Embed → Layer 0 → Layer 1; the correct token's rank improves from 45 to 1.]
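
A sketch covering both analyses, again assuming TransformerLens naming (resid_post, ln_final, unembed); the rank and entropy definitions are standard, but the function names are ours:

import torch

def logit_lens_ranks(model, cache, answer_id: int) -> list:
    """Mean rank of the correct token (0 = top-1) when each layer's residual
    stream is decoded directly through ln_final + unembed."""
    ranks = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer]                     # [batch, pos, d_model]
        logits = model.unembed(model.ln_final(resid))[:, -1]   # final position
        rank = (logits > logits[:, answer_id, None]).sum(-1)   # count of higher logits
        ranks.append(rank.float().mean().item())
    return ranks

def attention_entropy(pattern: torch.Tensor) -> torch.Tensor:
    """Mean row entropy of attention patterns ([..., dest, src]), per head."""
    return -(pattern * (pattern + 1e-9).log()).sum(-1).mean(-1)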

Thank You!

Any Questions?

Proposal Feedback
Open Discussion

Questions for the Audience

1

Is the initialization comparison sufficiently interesting and scoped?

2

Should the training stopping criterion be step-matched, loss-matched, or both?

3

Are the six MI analyses appropriate, or are there any other interesting perspectives to include?