Actionable Interpretability

About

This course focuses on interpretability techniques for generative AI (mainly LLMs), emphasizing research that turns interpretability insights into actionable knowledge, both for model builders and for domain experts who use AI systems. In short, this seminar explores, through readings and discussion, one question: how can we make interpretability methods useful in practice? The first part of the course will focus (mostly) on "mechanistic" interpretability methods. We will then transition to human-centered approaches to interpretability. The course will conclude with small-group research projects related to these themes.

Logistics and whatnot

Meeting time. Monday & Wednesday, 2:50–4:30 pm.

Location. SL 007

Format. Presentations and structured research paper discussions. Research projects later in the term.

Office hours. Immediately following class or by appointment (just email!).

Role assignment schedule. Here (subject to change!)

Role descriptions and review guidance. Here

Project details. Guidance and info on projects are available here.

Piazza. Find the course Piazza here

Grading

Presentation & Discussion Leadership / Critiques. You will be assigned one of three roles (Discussion Lead, Reviewer 1, Reviewer 2) for select sessions; see here. The Discussion Lead is responsible for summarizing the paper(s). Reviewer 1 (R1) is responsible for highlighting the strengths of the paper(s). Reviewer 2 (R2) is responsible for offering critiques of the work. Details (including a rubric!) are available here. (40%)

Participation. This course is highly participatory. You are expected to engage actively with the readings and contribute to discussion; you will be evaluated accordingly. In addition, you must post at least one question on Piazza per reading before each class. See participation expectations. (10%)

Project. Small-group research project applying or operationalizing interpretability methods: proposal (with presentation) → milestone → final report & presentation. Project details available here. (50%)

Schedule (Mon/Wed meetings)

For session role assignments, please see here.

Date	Plan
Wed, Jan 7	Introductions, course overview (Lead: Byron) Readings: The Mythos of Model Interpretability; Interpretable Machine Learning — A Brief History, State-of-the-Art and Challenges
Methods in interpretability
Mon, Jan 12	Basics: A brief review of Transformers; from activation probing to patching (interp 101 notebook) Readings: Probing Classifiers: Promises, Shortcomings, and Advances; How to use and interpret activation patching
Wed, Jan 14	Probing representations: linguistic structure + truth directions Readings: A Structural Probe for Finding Syntax in Word Representations; The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets Bonus: The Trilemma of Truth in Large Language Models; Language Models (Mostly) Know What They Know
Mon, Jan 19	No class — MLK Day
Wed, Jan 21	Logit Lens & friends Reading(s): Eliciting Latent Predictions from Transformers with the Tuned Lens; Future Lens
Mon, Jan 26	Finding mechanisms Reading(s): Induction heads; Transformer Feed-Forward Layers Are Key-Value Memories Bonus: The Dual-Route Model of Induction
Wed, Jan 28	Sparse Autoencoders Reading(s): Extracting interpretable features; SAEs can interpret randomly initialized transformers Bonus: A Survey on SAEs
Mon, Feb 2	Circuits Reading(s): Hypothesis Testing the Circuit Hypothesis in LLMs; LLM Circuit Analyses Are Consistent Across Training and Scale Bonus: Intro to Circuits
Wed, Feb 4	Chain-of-Thought (CoT) reasoning Reading(s): Chain of Thought is not Explainability; A mechanistic understanding of CoT Bonus: Chain of Thought Prompting Elicits Reasoning in Large Language Models
Mon, Feb 9	Verbalizing activations Reading(s): Patchscopes; LatentQA Bonus: Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Wed, Feb 11	Complications Reading(s): An Interpretability Illusion for BERT; Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
Mon, Feb 16	No class — Presidents Day
Wed, Feb 18	Where next for (mechanistic) interpretability? Reading(s): Open problems in mechanistic interpretability; Interpretability Needs a New Paradigm
Humans x Interpretability
Mon, Feb 23	Some cautionary results Reading(s): On Human Predictions with Explanations and Predictions of Machine Learning Models; Sanity Checks for Saliency Maps
Wed, Feb 25	And some more optimistic findings Reading(s): Explanations Can Reduce Overreliance on AI Systems During Decision-Making; Automated rationale generation: a technique for explainable AI and its effects on human perceptions
Mon, Mar 2	No class — Spring Break
Wed, Mar 4	No class — Spring Break
Mon, Mar 9	Final project proposal presentations 1
Wed, Mar 11	Final project proposal presentations 2
Mon, Mar 16	Evaluation considerations Reading(s): Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems; Manipulating and Measuring Model Interpretability
Emerging applications and directions
Wed, Mar 18	Guest speaker: Hiba Ahsan on (mechanistic) interpretability in healthcare Reading(s): Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare; Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Mon, Mar 23	Communicating with AI systems Reading(s): We Can't Understand AI Using our Existing Vocabulary; (Mis)Communicating with our AI Systems
Wed, Mar 25	Rethinking assumptions Reading(s): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead; Backpack LMs
Controlling and editing LLMs with interpretability methods
Mon, Mar 30	No class! Work on COLM papers and/or projects :) (Papers and assignments moved to Monday, 4/6.)
Wed, Apr 1	Activation steering and decoding control Reading(s): Language Model Steering with Activation Engineering; DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts Bonus: Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Mon, Apr 6	Editing weights Reading(s): Locating and Editing Factual Associations in GPT; Should We Really Edit Language Models? On the Evaluation of Edited Language Models Bonus: Mass-Editing Memory in a Transformer
Wed, Apr 8	Dedicated project collaboration time (in class)
Mon, Apr 13	Final project presentations 1 (see here)
Wed, Apr 15	Final project presentations 2 (see here)
Mon, Apr 20	No class — Patriots Day (Boston)
Wed, Apr 22	Final project write-ups due!