| Wed, Jan 7 |
Introductions, course overview (Lead: Byron)
Readings: The Mythos of Model Interpretability;
Interpretable Machine Learning — A Brief History, State-of-the-Art and Challenges
|
|
Methods in interpretability
|
| Mon, Jan 12 |
Basics: A brief review of Transformers; from activation probing to patching
interp 101 notebook
Readings: Probing Classifiers: Promises, Shortcomings, and Advances;
How to use and interpret activation patching
|
| Wed, Jan 14 |
Probing representations: linguistic structure + truth directions
Readings:
A Structural Probe for Finding Syntax in Word Representations;
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Bonus: The Trilemma of Truth in Large Language Models; Language Models (Mostly) Know What They Know
|
| Mon, Jan 19 |
No class — MLK Day |
| Wed, Jan 21 |
Logit Lens & friends
Reading(s):
Eliciting Latent Predictions from Transformers with the Tuned Lens
Future Lens
|
| Mon, Jan 26 |
Finding mechanisms
Reading(s):
Induction heads;
Transformer Feed-Forward Layers Are Key-Value Memories
Bonus: The Dual-Route Model of Induction
|
| Wed, Jan 28 |
Sparse Autoencoders
Reading(s):
Extracting interpretable features;
SAEs can interpret randomly initialized transformers
Bonus: A Survey on SAEs
|
| Mon, Feb 2 |
Circuits
Reading(s):
Hypothesis Testing the Circuit Hypothesis in LLMs;
LLM Circuit Analyses Are Consistent Across Training and Scale
Bonus: Intro to Circuits
|
| Wed, Feb 4 |
Chain-of-Thought (CoT) reasoning
Reading(s):
Chain of Thought is not Explainability;
A mechanistic understanding of CoT
Bonus: Chain of Thought Prompting Elicits Reasoning in Large Language Models
|
| Mon, Feb 9 |
Verbalizing activations
Reading(s):
Patchscopes;
LatentQA
Bonus: Do Natural Language Descriptions of Model Activations Convey Privileged Information?
|
| Wed, Feb 11 |
Complications
Reading(s):
An Interpretability Illusion for BERT;
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
|
| Mon, Feb 16 |
No class — Presidents Day |
| Wed, Feb 18 |
Where next for (mechanistic) interpretability?
Reading(s):
Open problems in mechanistic interpretability;
Interpretability Needs a New Paradigm
|
|
Humans x Interpretability
|
| Mon, Feb 23 |
Some cautionary results
Reading(s): On Human Predictions with Explanations and Predictions of Machine Learning Models;
Sanity Checks for Saliency Maps
|
| Wed, Feb 25 |
And some more optimistic findings
Reading(s): Explanations Can Reduce Overreliance on AI Systems During Decision-Making;
Automated rationale generation: a technique for explainable AI and its effects on human perceptions
|
| Mon, Mar 2 |
No class — Spring Break |
| Wed, Mar 4 |
No class — Spring Break |
| Mon, Mar 9 |
Final project proposal presentations 1 |
| Wed, Mar 11 |
Final project proposal presentations 2 |
| Mon, Mar 16 |
Evaluation considerations
Reading(s): Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems;
Manipulating and Measuring Model Interpretability
|
|
Emerging applications and directions
|
| Wed, Mar 18 |
Guest speaker: Hiba Ahsan on (mechanistic) interpretability in healthcare
Reading(s): Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare;
Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
|
| Mon, Mar 23 |
Communicating with AI systems
Reading(s): We Can't Understand AI Using our Existing Vocabulary;
(Mis)Communicating with our AI Systems
|
| Wed, Mar 25 |
Rethinking assumptions
Reading(s): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead;
Backpack LMs
|
|
Controlling and editing LLMs with interpretability methods
|
| Mon, Mar 30 |
No class! Work on COLM papers and/or projects :)
(Papers and assignments moved to Monday, 4/6.) |
| Wed, Apr 1 |
Activation steering and decoding control
Reading(s):
Language Model Steering with Activation Engineering;
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Bonus:
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
|
| Mon, Apr 6 |
Editing weights
Reading(s):
Locating and Editing Factual Associations in GPT;
Should We Really Edit Language Models? On the Evaluation of Edited Language Models
Bonus: Mass-Editing Memory in a Transformer
|
| Wed, Apr 8 |
Dedicated project collaboration time (in class) |
| Mon, Apr 13 |
Final project presentations 1 (see here) |
| Wed, Apr 15 |
Final project presentations 2 (see here) |
| Mon, Apr 20 |
No class — Patriots' Day (Boston) |
| Wed, Apr 22 |
Final project write-ups due! |