Project Proposal
Advanced Topics in NLP

Exploring Mechanistic Interpretability of Weight Initialization Methods

for Small Transformer Language Models

Instructor
Liangming Pan
Research Team
Konstantin Garbers, Nicholas Oh
Motivation: The Standard View
Background

The Optimization Bias

Weight initialization is traditionally evaluated almost exclusively through the lens of optimization dynamics:

Training loss convergence speed
Validation loss plateau
Gradient norm stability (vanishing/exploding)
[Figure: Typical loss curves. Loss \mathcal{L} vs. training steps for Init A (e.g., Xavier) and Init B (e.g., Kaiming).]
Motivation: The Mechanistic Blindspot
The Core Problem

Similar Loss, Different Minds?

Two models may reach identical validation loss while relying on entirely different internal mechanisms.

Our Hypothesis

Initialization influences the formation of internal mechanisms, even when optimization metrics hide these differences.

[Diagram: Model A (Init X), \mathcal{L} = 2.14, with a localized circuit (1 head) vs. Model B (Init Y), \mathcal{L} = 2.14, with the same behavior distributed across many heads.]
Research Questions
Core Inquiry

What Are We Asking?

RQ1

Performance & Emergence

How do different initialization schemes affect training dynamics and the emergence of useful language-modeling behavior?

RQ2

Mechanistic Differences

Do models trained with different initializations develop different internal mechanisms, even when their final performance is similar?

RQ3

Internal Geometry

Do different initialization schemes produce different activation and weight geometry (norms, entropy, singular values) during training?

Exploratory Hypotheses
Expectations

What We Might Find

Learning Trajectories

Some schemes (e.g., Scaled Residual) may lead to faster, more stable learning of the diagnostic task than classical methods.

Mechanism Localization

Some trained models may rely heavily on one or two attention heads, while others distribute the same behavior across several components.

Hidden Differences

Models with matched validation loss may exhibit fundamentally different ablation, patching, or logit lens behavior.

Explanatory Geometry

Differences in residual norms, attention entropy, or head-output similarity may correlate with the stability differences we observe.
Literature Context
Related Work

Standing on Shoulders

Weight Initialization

Classical methods designed to stabilize signal propagation and optimization variance.

Glorot & Bengio (2010), He et al. (2015).
Do they also affect internal mechanisms?

Mechanistic Interpretability

Tools to inspect computations: attention visualization, ablation, activation patching.

Nanda & Bloom (TransformerLens),
Elhage et al. (Mathematical Framework)

In-Context Learning

Induction behavior as a clear diagnostic of how small Transformer LMs use previous context.

Olsson et al. (2022) (Induction Heads),
Singh et al. (2024) (What goes right?)
Proposed Method
Architecture

The Experimental Pipeline

Init Setup

Apply 6 distinct initialization schemes to identical architectures.

Training

Train small Transformers on synthetic diagnostic data (5 seeds each).

Behavioral Eval

Measure loss, task accuracy, and emergence step thresholds.

MI Analysis

Apply 6 mechanistic interpretability techniques across checkpoints.
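
A minimal sketch of how the resulting scheme x seed sweep could be organized (the scheme keys and the RunConfig structure are our own placeholders, not a fixed design):

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    scheme: str   # initialization scheme key
    seed: int     # RNG seed for weights and data order

# Keys mirror the schemes enumerated later in this deck.
SCHEMES = ["normal_002", "xavier", "kaiming", "orthogonal", "scaled_residual"]
RUNS = [RunConfig(s, seed) for s, seed in product(SCHEMES, range(5))]  # 5 seeds each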

The 'Lab' Environment
Scope Control

Model & Data Setup

To isolate mechanistic variables while managing compute costs, we use a controlled synthetic environment rather than pretraining a large LLM.

Model Architecture
  • Type: Decoder-only
  • Layers: 2
  • Heads: 4 per layer
  • d_model: 128
Synthetic Dataset
  • Vocab Size: 64 or 128
  • Context Length: 64
  • Patterns: Repeated tokens, copy logic
// Example sequence (token ids)
Input:  [12, 45, 8, 99, 12]
Target: [45]   // 12 was followed by 45 earlier, so 45 is predicted again
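
A minimal sketch of a generator for such sequences (the function name and the half-context repetition scheme are assumptions; any repeated-token pattern with the same copy property would do):

import torch

def make_induction_batch(batch_size: int, vocab_size: int = 64, ctx_len: int = 64):
    """Sequences whose second half repeats the first, so every second-half
    token is predictable by copying from the earlier context."""
    half = ctx_len // 2
    first_half = torch.randint(0, vocab_size, (batch_size, half))
    seqs = torch.cat([first_half, first_half], dim=1)   # [batch, ctx_len]
    return seqs[:, :-1], seqs[:, 1:]                    # next-token inputs / targets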
A Concrete Diagnostic Task
Controlled Setting

Induction Behavior

To rigorously compare mechanisms, we need a controlled diagnostic task. We use repeated-token sequence patterns (induction-like behavior).

Induction Head Mechanism
A B C D A ? → B   (attention to preceding context)
Logic: The model must look back to the previous instance of A and predict the B token that follows.
The Independent Variables
Interventions

Initialization Schemes

Normal (0, 0.02)

W_{ij} \sim \mathcal{N}(0, 0.02^2)

Standard Transformer-style baseline.

Xavier / Glorot

W_{ij} \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}}\right]

Classical variance-preserving baseline.

Kaiming / He

W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)

Common initialization for ReLU/GELU-like networks.

Orthogonal

W^\top W = I \quad \text{or} \quad WW^\top = I

Geometry and norm-preserving initialization.

Scaled Residual

W_{\text{res}} \leftarrow \frac{1}{\sqrt{2L}}\,W_{\text{res}}

Transformer-specific stability baseline.
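
A minimal PyTorch sketch of these schemes (the scheme keys and the name filter used to find residual-writing matrices are assumptions that depend on the model implementation):

import math
import torch
import torch.nn as nn

def apply_init(model: nn.Module, scheme: str, n_layers: int) -> None:
    """Re-initialize every matrix-shaped weight according to `scheme`."""
    for name, p in model.named_parameters():
        if p.dim() < 2:                          # skip biases / LayerNorm params
            continue
        if scheme == "normal_002":
            nn.init.normal_(p, mean=0.0, std=0.02)
        elif scheme == "xavier":
            nn.init.xavier_uniform_(p)
        elif scheme == "kaiming":
            nn.init.kaiming_normal_(p, nonlinearity="relu")
        elif scheme == "orthogonal":
            nn.init.orthogonal_(p)
        elif scheme == "scaled_residual":
            nn.init.normal_(p, mean=0.0, std=0.02)
            # Down-scale matrices that write into the residual stream
            # (attention/MLP output projections; this name check is model-specific).
            if "W_O" in name or "W_out" in name:
                with torch.no_grad():
                    p.mul_(1.0 / math.sqrt(2 * n_layers))
        else:
            raise ValueError(f"unknown scheme: {scheme}")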

Analysis 1: Behavioral Metrics
Behavioral Eval

Did it Learn?

Before inspecting the mechanisms, we establish the macroscopic training outcomes across initializations.

  • Training & Validation Loss
    Standard optimization health.
  • Diagnostic Accuracy
    Task-specific success rate.
  • Logit Difference
    Logit(correct) - Logit(distractor).
Emergence Step Analysis
[Figure: Accuracy vs. training steps, with an accuracy threshold (e.g. 80%) and crossing points marked at 4k and 12k steps.]
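
For the logit-difference metric above, a small sketch (tensor shapes and the final-position convention are assumptions):

import torch

def logit_difference(logits: torch.Tensor,
                     correct: torch.Tensor,
                     distractor: torch.Tensor) -> torch.Tensor:
    """Logit(correct) - Logit(distractor) at the final prediction position.

    logits: [batch, seq, vocab]; correct / distractor: [batch] token ids."""
    final = logits[:, -1, :]                               # [batch, vocab]
    return (final.gather(1, correct[:, None])
            - final.gather(1, distractor[:, None])).squeeze(1)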
Analysis 2: Attention & Induction
MI Methods

Observing the Network

What tokens does each self-attention head attend to, and do they exhibit induction-like properties?

1. Attention Patterns

Previous-token score, Same-token score, Induction-position score, Entropy.

2. Induction Score

Model-level and Head-level scoring on controlled diagnostic prompts.

Attention Heatmap (Head 1.2)
[Figure: Target vs. source positions over the repeated sequence "[BOS] The cat sat ... The cat sat ... [EOS]".]
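
A sketch of the head-level induction score on period-repeated prompts, assuming TransformerLens-style attention patterns (e.g. cache["pattern", layer][0], shaped [head, dest, src]); the offset follows the induction logic described earlier:

import torch

def induction_stripe_score(pattern: torch.Tensor, period: int) -> torch.Tensor:
    """Mean attention each head pays from position t back to t - (period - 1),
    i.e. to the token right after the previous occurrence of the current token.

    pattern: [head, dest_pos, src_pos] for one period-repeated sequence."""
    n_pos = pattern.shape[-1]
    dest = torch.arange(period, n_pos)           # positions inside the repeat
    src = dest - (period - 1)                    # previous occurrence + 1
    return pattern[:, dest, src].mean(dim=-1)    # one score per head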
Analysis 3: Causal Interventions
MI Methods

Testing Necessity

Moving beyond observation to causal proof: what happens if we break or replace specific internal components?

3. Head Ablation

Zero out specific heads. Does the model lose its capability? Indicates localization.

4. Activation Patching

Swap clean vs corrupted run activations. Isolates where task-relevant information resides.

Ablation Effect Matrix

        H0      H1      H2      H3
L0    -0.01    0.02   -0.05   -0.85
L1    -0.02   -0.42   -0.01    0.05

Example: Strong localization in L0H3.
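
A minimal zero-ablation sketch using the TransformerLens hook API cited earlier (the function name and the loss-based effect measure are our assumptions; the matrix above would then hold each head's change relative to the unablated run):

from transformer_lens import utils

def head_ablation_loss(model, tokens, layer: int, head: int) -> float:
    """Loss with one attention head's output (hook_z) zeroed out."""
    def zero_head(z, hook):
        z[:, :, head, :] = 0.0          # z: [batch, pos, head_index, d_head]
        return z

    return model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", layer), zero_head)],
    ).item()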
Analysis 4: Lens & Geometry
MI Methods

Information Emergence

Analyzing the intermediate states of the residual stream and the mathematical properties of the weights.

5. Logit Lens

Projecting intermediate residual streams into vocabulary space. At what layer does the correct token emerge?

6. Geometry Analysis

Residual stream norms, attention entropy, singular value spectra of weight matrices.

Logit Lens Flow
[Diagram: Embed → Layer 0 → Layer 1; the correct token's rank improves from 45 to 1.]
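
A sketch covering both analyses, again assuming TransformerLens naming (resid_post, ln_final, unembed); the rank and entropy definitions are standard, but the function names are ours:

import torch

def logit_lens_ranks(model, cache, answer_id: int) -> list:
    """Mean rank of the correct token (0 = top-1) when each layer's residual
    stream is decoded directly through ln_final + unembed."""
    ranks = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer]                     # [batch, pos, d_model]
        logits = model.unembed(model.ln_final(resid))[:, -1]   # final position
        rank = (logits > logits[:, answer_id, None]).sum(-1)   # count of higher logits
        ranks.append(rank.float().mean().item())
    return ranks

def attention_entropy(pattern: torch.Tensor) -> torch.Tensor:
    """Mean row entropy of attention patterns ([..., dest, src]), per head."""
    return -(pattern * (pattern + 1e-9).log()).sum(-1).mean(-1)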

Thank You!

Any Questions?

Proposal Feedback
Open Discussion

Questions for the Audience

1

Is the initialization comparison sufficiently interesting and scoped?

2

Should the training stopping criterion be step-matched, loss-matched, or both?

3

Are the six MI analyses appropriate, or are there any other interesting perspectives to include?