Teaching and Advising Experience
CS 6240: Large-Scale Parallel Data Processing
- Interested, but you do not meet the pre-reqs?
Please read this FAQ.
- Covers big-data analysis techniques that scale out with
an increasing number of compute nodes, e.g., for cloud computing. Focuses on
approaches for problem and data partitioning that distribute work
effectively while keeping total cost for computation and data transfer low.
Deterministic and random algorithms from a variety of domains, including
graphs, data mining, linear algebra, and information retrieval, are studied
and analyzed in terms of their cost, scalability, and robustness against
skew. Coursework emphasizes hands-on programming experience with
state-of-the-art big-data processing technology. Students who do not meet
course prerequisites may seek permission of instructor.
CS 7290: Special Topics in Data Science: Foundations in
Scalable Data Management
- This course explores research topics in analysis and
management of large data, with a focus on distributed and parallel
approaches, join processing, and imprecise data/approximation. We will
discuss and analyze papers covering applications, algorithms, systems, and
theory, with an emphasis on the most recent developments. This course is
designed for PhD students, as well as advanced Master's students with a solid
background in algorithms and one or more data-oriented areas of computer
science, including machine learning, AI, logic, information retrieval, and
security. A desired outcome of the course project is the creation of
research results that are publishable in a peer-reviewed conference.
CS 6240: Parallel Data Processing in MapReduce
- Graduate course. This course covers techniques for
analyzing very large data sets. We introduce the MapReduce programming model
and the core technologies it relies on in practice, such as a distributed
file system. Related approaches and technologies from distributed databases
and cloud computing will also be introduced. Particular emphasis is placed
on practical examples and hands-on programming experience. Both plain
MapReduce and database-inspired advanced programming models running on top
of a MapReduce infrastructure will be used.
CS 6220: Data Mining Techniques
- Graduate course. This course covers various aspects of data mining including data
preprocessing, classification, ensemble methods, association rule mining, sequence
mining, and cluster analysis. The class project involves hands-on practice
of mining useful knowledge from a large database.
CS 3200: Database Design
- Upper-division undergraduate course. This course studies the design of relational databases, including the
entity-relationship model, normalization, relational algebra, SQL, triggers,
stored procedures, indexing, elementary query optimization, and fundamentals
of concurrency and recovery. The class project involves working with a
commercial relational database management system and accessing it from an
CSG 339: Scalable Techniques for Massive Data
- Graduate course. We discuss influential and cutting-edge research papers from academia
and industry research groups. The course also has a project requirement
where students can choose a research project related to large-scale data
PhD Student Advising
- Bahar Qarabaqi
- Alper Okcan
Previous Students (co-advised with faculty at Cornell)
- Biswanath Panda (Ph.D. 2009, first employment: Google)
- Mingsheng Hong (Ph.D. 2008, first employment: Vertica)
- Daria Sorokina (Ph.D. 2008, first employment: PostDoc at
- Abhinandan Das (Ph.D. 2005, first employment: Google)