Actionable Interpretability

Special Topics in Artificial Intelligence · Spring 2026 · Mondays & Wednesdays, 2:50–4:30 pm (ET)
Northeastern University — Khoury College of Computer Sciences

About

This course focuses on interpretability techniques for generative AI (mainly LLMs), emphasizing research that turns interpretability insights into actionable knowledge — for model builders and for domain experts who use AI systems. In short, this seminar explores, through readings and discussion, a central question: how can we make interpretability methods useful in practice? The first part of the course will focus mostly on "mechanistic" interpretability methods. We will then transition to human-centered approaches to interpretability. The course will conclude with small-group research projects related to these themes.

See also: this relevant workshop.

Logistics and whatnot

Meeting time. Monday & Wednesday, 2:50–4:30 pm.

Location. SL 007

Format. Presentations and structured research paper discussions. Research projects later in the term.

Office hours. Immediately following class or by appointment (just email!).

Role assignment schedule. Here (subject to change!)

Role descriptions and review guidance. Here

Project details. Guidance and info on projects are available here.

Piazza. Find the course Piazza here.

Grading

Presentation & Discussion Leadership / Critiques. You will be assigned one of three roles (Discussion Lead, Reviewer 1, Reviewer 2) for select sessions: See here. The Lead is responsible for summarizing the paper(s). Reviewer 1 (R1) highlights strengths of the paper(s). Reviewer 2 (R2) offers critiques of the work. Details (including a rubric!) available here. (40%)

Participation. This course is highly participatory. You are expected to engage actively with the readings and contribute to discussion; you will be evaluated accordingly. In addition, you must post at least one question on Piazza per reading before each class. See participation expectations. (10%)

Project. Small-group research project applying or operationalizing interpretability methods: proposal (with presentation) → milestone → final report & presentation. Project details available here. (50%)

Schedule (Mon/Wed meetings)

For session role assignments, please see here.
Wed, Jan 7 Introductions, course overview (Lead: Byron)
Readings: The Mythos of Model Interpretability; Interpretable Machine Learning — A Brief History, State-of-the-Art and Challenges
Methods in interpretability
Mon, Jan 12 Basics: A brief review of Transformers; from activation probing to patching
interp 101 notebook
Readings: Probing Classifiers: Promises, Shortcomings, and Advances; How to use and interpret activation patching
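For those who want a concrete picture before the session: the two ideas named above — probing and activation patching — can be illustrated on a tiny toy model. This is a hypothetical NumPy sketch (not from the readings or the interp 101 notebook); all names (toy_model, the synthetic "concept," etc.) are invented for illustration.

```python
# Hypothetical toy sketch of probing and activation patching.
# A 2-layer NumPy "model" stands in for a Transformer layer; nothing here
# is from the assigned readings -- it just illustrates the two operations.
import numpy as np

rng = np.random.default_rng(0)

# Toy model: input -> hidden activation -> scalar logit.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8,))

def toy_model(x, patch_hidden=None):
    """Run the model; optionally overwrite the hidden activation (patching)."""
    h = np.tanh(x @ W1)           # hidden activations we probe / patch
    if patch_hidden is not None:
        h = patch_hidden          # activation patching: swap in another run's h
    return h @ W2, h              # (logit, hidden state)

# Probing: fit a linear probe to read a synthetic "concept" (here, whether
# the first input feature is positive) out of the hidden activations.
X = rng.normal(size=(200, 4))
labels = (X[:, 0] > 0).astype(float)
H = np.tanh(X @ W1)               # collected hidden activations
w, *_ = np.linalg.lstsq(H, labels - 0.5, rcond=None)  # least-squares probe
acc = np.mean(((H @ w) > 0) == (labels > 0.5))

# Patching: run a "corrupted" input but splice in the clean hidden state;
# everything downstream of h now behaves as in the clean run.
clean_x, corrupt_x = X[0], X[1]
clean_logit, clean_h = toy_model(clean_x)
corrupt_logit, _ = toy_model(corrupt_x)
patched_logit, _ = toy_model(corrupt_x, patch_hidden=clean_h)
print(f"probe accuracy: {acc:.2f}")
print(f"patched output matches clean run: {np.isclose(patched_logit, clean_logit)}")
```

In this toy case the patched logit matches the clean run exactly, since the only computation downstream of the patched site is the output projection; in a real Transformer, patching a site tells you how much of the behavioral difference that site mediates.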
Wed, Jan 14 Probing representations: linguistic structure + truth directions
Readings: A Structural Probe for Finding Syntax in Word Representations; The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Bonus: The Trilemma of Truth in Large Language Models; Language Models (Mostly) Know What They Know
Mon, Jan 19 No class — MLK Day
Wed, Jan 21 Logit Lens & friends
Reading(s): Eliciting Latent Predictions from Transformers with the Tuned Lens; Future Lens
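The core move behind the logit-lens family can be shown in a few lines: decode an intermediate hidden state through the unembedding matrix to see the model's "current guess" at each layer. A hypothetical NumPy sketch (toy dimensions and random weights, not the tuned-lens implementation from the reading):

```python
# Hypothetical logit-lens sketch: project intermediate residual-stream
# states through a (toy, random) unembedding to get per-layer token guesses.
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab = 8, 5
W_unembed = rng.normal(size=(d_model, vocab))

# Pretend these are residual-stream states after successive layers.
layer_states = [rng.normal(size=d_model) for _ in range(3)]

for i, h in enumerate(layer_states):
    logits = h @ W_unembed                     # logit lens: early decode
    print(f"layer {i}: argmax token = {int(np.argmax(logits))}")
```

The tuned lens from the reading refines this by learning a small per-layer affine map before the unembedding, rather than applying the unembedding directly.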
Mon, Jan 26 Finding mechanisms
Reading(s): Induction heads; Transformer Feed-Forward Layers Are Key-Value Memories
Bonus: The Dual-Route Model of Induction
Wed, Jan 28 Sparse Autoencoders
Reading(s): Extracting interpretable features; SAEs can interpret randomly initialized transformers
Bonus: A Survey on SAEs
Mon, Feb 2 Circuits
Reading(s): Hypothesis Testing the Circuit Hypothesis in LLMs; LLM Circuit Analyses Are Consistent Across Training and Scale
Bonus: Intro to Circuits
Wed, Feb 4 Chain-of-Thought (CoT) reasoning
Reading(s): Chain of Thought is not Explainability; A mechanistic understanding of CoT
Bonus: Chain of Thought Prompting Elicits Reasoning in Large Language Models
Mon, Feb 9 Verbalizing activations
Reading(s): Patchscopes; LatentQA
Bonus: Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Wed, Feb 11 Complications
Reading(s): An Interpretability Illusion for BERT; Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
Mon, Feb 16 No class — Presidents Day
Wed, Feb 18 Where next for (mechanistic) interpretability?
Reading(s): Open problems in mechanistic interpretability; Interpretability Needs a New Paradigm
Humans x Interpretability
Mon, Feb 23 Some cautionary results
Reading(s): On Human Predictions with Explanations and Predictions of Machine Learning Models; Sanity Checks for Saliency Maps
Wed, Feb 25 And some more optimistic findings
Reading(s): Explanations Can Reduce Overreliance on AI Systems During Decision-Making; Automated rationale generation: a technique for explainable AI and its effects on human perceptions
Mon, Mar 2 No class — Spring Break
Wed, Mar 4 No class — Spring Break
Mon, Mar 9 Final project proposal presentations 1
Wed, Mar 11 Final project proposal presentations 2
Mon, Mar 16 Evaluation considerations
Reading(s): Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems; Manipulating and Measuring Model Interpretability
Emerging applications and directions
Wed, Mar 18 Guest speaker: Hiba Ahsan on (mechanistic) interpretability in healthcare
Reading(s): Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare; Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Mon, Mar 23 Communicating with AI systems
Reading(s): We Can't Understand AI Using our Existing Vocabulary; (Mis)Communicating with our AI Systems
Wed, Mar 25 Rethinking assumptions
Reading(s): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead; Backpack LMs
Controlling and editing LLMs with interpretability methods
Mon, Mar 30 No class! Work on COLM papers and/or projects :)
(Papers and assignments moved to Monday, 4/6.)
Wed, Apr 1 Activation steering and decoding control
Reading(s): Language Model Steering with Activation Engineering; DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Bonus: Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
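The basic mechanic behind activation steering — add a direction vector to hidden activations at inference time — is simple enough to sketch. This hypothetical NumPy toy (not the ActAdd or DExperts implementations from the readings; all names invented) builds a steering vector as the mean activation difference between contrastive example sets:

```python
# Hypothetical activation-steering sketch on a tiny NumPy model.
# The steering vector is a contrastive mean-difference of hidden activations.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8,))

def forward(x, steer=None, alpha=0.0):
    """Toy forward pass; optionally inject a scaled steering vector."""
    h = np.tanh(x @ W1)
    if steer is not None:
        h = h + alpha * steer     # the steering intervention
    return h @ W2

# Contrastive "prompts": examples exhibiting vs. lacking a target behavior
# (here just two synthetic clusters). The steering vector is the difference
# of their mean hidden activations.
pos = rng.normal(loc=1.0, size=(50, 4))
neg = rng.normal(loc=-1.0, size=(50, 4))
steer = np.tanh(pos @ W1).mean(axis=0) - np.tanh(neg @ W1).mean(axis=0)

x = rng.normal(size=4)
base = forward(x)
steered = forward(x, steer=steer, alpha=2.0)
print(f"logit without steering: {base:.3f}, with steering: {steered:.3f}")
```

In real LLMs the same addition happens at a chosen layer of the residual stream, and the scale alpha trades off steering strength against fluency.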
Mon, Apr 6 Editing weights
Reading(s): Locating and Editing Factual Associations in GPT; Should We Really Edit Language Models? On the Evaluation of Edited Language Models
Bonus: Mass-Editing Memory in a Transformer
Wed, Apr 8 Dedicated project collaboration time (in class)
Mon, Apr 13 Final project presentations 1 (see here)
Wed, Apr 15 Final project presentations 2 (see here)
Mon, Apr 20 No class — Patriots Day (Boston)
Wed, Apr 22 Final project write-ups due!