CY 4100: AI Security and Privacy

Fall 2025

 

Instructor: Alina Oprea (alinao)

 

Class Schedule: 

Monday and Thursday, 11:45am-1:25pm ET

Location: Hurtig Hall 310

 

Office Hours: 

Thursday 2:30-3:30pm ET on Zoom and by appointment

 

Class forum: Canvas, with links to Piazza and Gradescope

 

Class policies: The academic integrity policy is strictly enforced.

 

Class Description:

AI is now deployed in critical domains such as medicine, biology, finance, and cyber security. Foundation models such as large language models (LLMs) are trained on massive datasets crawled from the web and subsequently fine-tuned for new tasks, including summarization, translation, code generation, and conversational agents. This trend raises serious concerns about the security of AI models in critical applications, as well as the privacy of the data used to train them.

In this course, we study a variety of adversarial attacks on discriminative and generative AI models that impact the security and privacy of these systems. We will discuss mitigations against AI security and privacy vulnerabilities, and the challenges in making AI trustworthy. We will read and debate papers published in top-tier conferences in machine learning and cyber security. Students will have an opportunity to work on a semester-long project in trustworthy AI.

 

Disclaimer: This course is not meant to be a student's first course in ML/AI. It focuses on recent research on the security and privacy of ML and AI, and prior knowledge of machine learning is essential for following the material. If you have any questions about the course content, please email the instructor.

 

Pre-requisites:

 

o   Calculus and linear algebra

o   Basic knowledge of machine learning 

 

Grading

The grade will be based on:

 

o   Assignments – 20%

o   Quizzes – 10%

o   Final project report – 40%

o   Final project presentation – 10%

o   Paper presentation – 15%

o   Class participation – 5%

     

Calendar (Tentative)

Week 1
o   Thu 09/04: Course outline (syllabus, grading, policies); introduction to trustworthy AI
    Reading: Keshav. How to Read a Paper.

Week 2
o   Mon 09/08: Review of deep learning
o   Thu 09/11: Review of LLMs; taxonomy of adversarial attacks on predictive and generative AI
    Reference: Chapters 1 and 2.1 of the NIST report on Adversarial ML

Week 3
o   Mon 09/15: Evasion attacks against ML: optimization-based and gradient-free attacks
    Required reading: Carlini and Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P 2017.
o   Thu 09/18: Poisoning attacks against ML: backdoor attacks, targeted attacks, subpopulation attacks
    Gu et al. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv 2017.
    Jagielski et al. Subpopulation Data Poisoning Attacks. ACM CCS 2021.

Week 4
o   Mon 09/22: Privacy risks in ML; membership inference attacks
    Carlini et al. Membership Inference Attacks From First Principles. IEEE S&P 2022.
o   Thu 09/25: LLM privacy: membership inference and data extraction
    Duan et al. Do Membership Inference Attacks Work on Large Language Models? COLM 2024.
    Required reading: Carlini et al. Extracting Training Data from Large Language Models. USENIX Security 2021.

Week 5
o   Mon 09/29: LLM prompt injection and jailbreaking
    Wei et al. Jailbroken: How Does LLM Safety Training Fail? arXiv 2023.
    Greshake et al. Compromising real-world LLM-integrated applications with indirect prompt injection. AISec 2023.
    Russinovich et al. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. USENIX Security 2025.
o   Thu 10/02: LLM safety alignment
    Ouyang et al. Training language models to follow instructions with human feedback. arXiv 2022.
    Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022.

Week 6
o   Mon 10/06: Paper presentation, first session
    Papers and presenters TBD
o   Thu 10/09: Class canceled

Week 7
o   Mon 10/13: University holiday; no class
o   Thu 10/16: LLM jailbreaking
    Required reading: Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv 2023.
    Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv 2023.

Week 8
o   Mon 10/20: Defenses to prompt injection and jailbreaking
    Required reading: Wallace et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv 2024.
    Li et al. ACE: A Security Architecture for LLM-Integrated App Systems. arXiv 2025. Presented by Evan Rose.
o   Thu 10/23: LLM poisoning
    Required reading: Hubinger et al. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv 2024.
    Chaudhari et al. Cascading Adversarial Bias from Injection to Distillation in Language Models. arXiv 2025. Presented by Harsh Chaudhari.

Week 9
o   Mon 10/27: Paper presentation, second session
    Papers and presenters TBD
o   Thu 10/30: LLM agent security
    Shavit et al. Practices for Governing Agentic AI Systems. OpenAI website, 2023.
    Triedman et al. Multi-Agent Systems Execute Arbitrary Malicious Code. arXiv 2025.
    Syros et al. SAGA: A Security Architecture for Governing AI Agentic Systems. arXiv 2025.

Week 10
o   Mon 11/03: LLM agent privacy
    Required reading: Bagdasaryan et al. Air Gap: Privacy-Conscious Conversational Agents. ACM CCS 2024.
    Das et al. Disclosure Audits for LLM Agents. arXiv 2025.
o   Thu 11/06: Security of reasoning models
    Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
    Kumar et al. OVERTHINK: Slowdown Attacks on Reasoning LLMs. arXiv 2025.
    Zaremba et al. Trading Inference-Time Compute for Adversarial Robustness. arXiv 2025.

Week 11
o   Mon 11/10: Paper presentation, third session
    Papers and presenters TBD
o   Thu 11/13: LLM fine-tuning risks
    Chen et al. The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks. arXiv 2023.
    Kandpal et al. User Inference Attacks on Large Language Models. EMNLP 2024.
    Labunets et al. Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface. IEEE S&P 2025.

Week 12
o   Mon 11/17: Watermarking LLMs
    Kirchenbauer et al. A Watermark for Large Language Models. arXiv 2023.
    Jovanovic et al. Watermark Stealing in Large Language Models. ICML 2024.
o   Thu 11/20: Reinforcement learning security
    Rathbun et al. SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents. arXiv 2024.

Week 13
o   Mon 11/24: Review
o   Thu 11/27: No class; university holiday (Thanksgiving)

o   Mon 12/01: Project presentations
o   Thu 12/04: Project presentations
o   Mon 12/08: Project reports due

 

 

Review materials

o   Probability review notes from Stanford's machine learning class

o   Sam Roweis's probability review

o   Linear algebra review notes from Stanford's machine learning class 

 

 

Other resources

 

Books:

o   Trevor Hastie, Rob Tibshirani, and Jerry Friedman. The Elements of Statistical Learning. Second Edition, Springer, 2009.

o   Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

o   A. Zhang, Z. Lipton, M. Li, and A. Smola. Dive into Deep Learning.

o   C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy.

o   Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms.