Large-Scale Storage and Retrieval

Time: Tue/Fri 3:25-5:05 ET/Boston

Location: Online (See Canvas for connection info.)

John Rachlin

Associate Teaching Professor, Northeastern University


E-mail: j.rachlin@northeastern.edu
Web: https://www.khoury.northeastern.edu/people/john-rachlin/
Office Hours: Thu 11a-1p, 2p-5p on Zoom.
By appointment only.
Email me to arrange other times.

Teaching Assistants

See Piazza for hours and Zoom locations.

Veronica Aguiar

Aditya Bhanwadiya

Dhruv Doshi

Ajay Inavolu

Amey Parab

Graziano Cezario

Vivek Rachakonda

Andres Rivera

Tee Tesharojanasup

Course Information

Course Description (Unofficial)

Presents the fundamentals of data engineering using polyglot persistence. We will go beyond the relational database that has dominated industry since the 1970s and explore the world of NoSQL (non-relational) data models including Document, Key-Value, Graph, and Columnar databases. We’ll study their capabilities, limitations, and use cases. On the “retrieval” side of things, we will cover such diverse topics as Apache Spark using Scala and stream/event processing using Kafka and Spark Streaming. Most of these technologies are relatively new, having been available to developers for less than 10 years, and they continue to evolve rapidly as part of the effort to address today’s big-data challenges. This makes their study both fascinating and challenging. There will be some programming assignments, one small research assignment where you will investigate a NoSQL database of your own choosing, and a class project. In our study of Apache Spark, we will learn Spark SQL and the rudiments of Scala functional programming. Other programming assignments may be implemented in your choice of language.

4.000 Credit hours

Readings

There is no single textbook that covers everything we will be exploring in this class – although some authors have tried! What few books are available tend to be very high-level and of little practical value. I will be assigning chapters, papers, and videos as we go along. To the extent possible, I have chosen books that are available to you for free as a Northeastern student through O'Reilly E-Books. Two very popular books that we will be using are Kleppmann, 2017, Designing Data-Intensive Applications, O'Reilly, and Reis & Housley, 2022, Fundamentals of Data Engineering. Both of these books are freely available through your student O'Reilly E-Books account. I also created a DS4300 playlist for your convenience.

Recordings

Most classes will be recorded. Recordings are available through Canvas > Zoom Meetings > Cloud Recordings. Recordings should not be used as a substitute for coming to class. In fact, I do take attendance, and attendance counts towards your final grade.

Advice for taking an online class

I understand that taking a class online can be a challenge for some students. I will do everything in my power to make the class as personalized and engaging as possible. This course was not designed for 100,000 anonymous strangers. I will tailor the material to your questions and feedback and your assignments will be individually evaluated by the instruction team. You can make this class more interesting and fun by asking questions, expressing opinions, and having a voice. Turning on your camera, while not required, will help me to get to know you and to know how you are reacting to the material. You may also find it helpful to have a second small monitor set up so you can follow along when I give technical demonstrations.

Homework

There will be one "mini-project" programming assignment about every two weeks. The detailed dates are listed on the schedule below. Homework should be completed largely on your own. Informal discussions with fellow students, or seeking general help from them, are OK so long as you cite your sources. Do not simply copy another student's submission! In addition, as part of your homework, there may be occasional take-home quizzes to verify your understanding of the reading material.

Strict Homework Late Policy:
  • Up to 48 hours late: 10% penalty
  • After 48 hours: Not accepted.

Class Project

There will be a group project involving either a NoSQL database or the use of Apache Spark to be presented at the end of the semester.

Academic Misconduct

Programming is a creative process. Individuals or pair groups (when allowed) must reach their own understanding of problems and discover paths to their solutions. During this time, discussions with friends and colleagues are encouraged: you will do much better in the course, and at Northeastern, if you find people with whom you regularly discuss problems. But those discussions should take place verbally. If you simply copy large blocks of code from another student, you are breaking the rules. Each program/application must be largely the product of your own mind. To the extent that you use internet resources such as Stack Overflow or ChatGPT, I expect that you will properly cite your sources within your code. The university's academic integrity policy discusses actions regarded as violations and consequences for students: http://www.northeastern.edu/osccr/academic-integrity

Evaluation

The final grade for this course will be weighted as follows:

  • Homework: 70%
  • Group Project: 25%
  • Participation (attendance, Piazza, etc.): 5%

Final grades will be assigned based on the following scale. Computed grades are NOT rounded.

Letter  Range
A       94 - 100
A-      90 - 94
B+      87 - 89
B       83 - 86
B-      80 - 82
C+      77 - 79
C       73 - 76
C-      70 - 72
D+      67 - 69
D       63 - 66
D-      60 - 62
F       < 60
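
If you want to estimate where you stand, the weighting and scale above can be sketched in a few lines of Python. This is only an illustration, not an official grade calculator: the function names are made up for this example, component scores are assumed to be percentages from 0 to 100, and each letter is assumed to start at the lower bound of its published range (so an unrounded 89.9 would map to B+).

```python
# Weights (in percent) taken from the Evaluation list above.
WEIGHTS = {"homework": 70, "project": 25, "participation": 5}

# Assumption: a letter applies from the lower bound of its listed range.
CUTOFFS = [
    (94, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def weighted_score(scores):
    """Combine component percentages (0-100) into a final percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map an unrounded final percentage to a letter."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


example = {"homework": 90, "project": 92, "participation": 100}
final = weighted_score(example)
print(final, letter_grade(final))  # prints: 91.0 A-
```

Because computed grades are not rounded, the comparison is made on the raw score.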

Schedule

Note: This schedule is subject to change and will be adjusted as needed throughout the semester.

Week | Date | Topic | Reading | HW Due
1 | Jan 9/12 | The limits of the relational model | Kleppmann 1; Harrison 1 |
2 | Jan 16/19 | NoSQL models. Consistency and Availability | Kleppmann 2; Vogels2008 |
3 | Jan 23/26 | Redis, Key-Value Stores | Kleppmann 3; Harrison 3 | HW1
4 | Jan 30/Feb 2 | Indexing & Storage | TBD |
5 | Feb 6/9 | Mongo: Document Stores | | HW2
6 | Feb 13/16 | Replication and the CAP Theorem | Kleppmann 5 |
7 | Feb 20/23 | Partitioning, More Mongo and PyMongo | Kleppmann 6 | HW3
8 | Feb 27/Mar 2 | Functional programming with Scala | Swartz: Learning Scala Ch1-5 |
9 | Mar 5/8 | Spring Break - No Class | |
10 | Mar 12/15 | Apache Spark and SparkSQL | Sarkar: Learning SparkSQL (Ch1-3) or Zaharia: Learning Spark (Ch1,2,9) | HW4
11 | Mar 19/22 | Event processing with Kafka / Spark Streaming | TBD |
12 | Mar 26/29 | Graph Database | TBD | HW5
13 | Apr 2/5 | Graph databases II | TBD |
14 | Apr 9/12 | Project Science Fair | |
15 | Apr 16 | Wrap up and the future of databases | | PROJECT DUE Apr 17 @ 11:59pm ET/Boston

Inclusive Class

Northeastern University values the diversity of our students, staff, and faculty, recognizing the important contribution each makes to our unique community.

Respect is demanded at all times throughout this course. In the classroom, not only is participation required, it is expected that everyone is treated with dignity and respect. We realize everyone comes from a different background with different experiences and abilities. Our knowledge will always be used to better everyone in the class.

We strive to create a learning environment that is welcoming to students of all backgrounds. If you feel unwelcome for any reason, please let us know so we can work to make things better. You can let us know by talking to anyone on the teaching staff. If you feel uncomfortable talking to members of the teaching staff, please consider reaching out to your academic advisor.

Northeastern is committed to providing equal access and support to all qualified students through the provision of reasonable accommodations so that each student may fully participate in the learning experience. If you have a disability that requires accommodations, please contact the Disability Resource Center (http://www.northeastern.edu/drc/, DRC@northeastern.edu, 617-353-2675). Accommodations cannot be made retroactively. To receive an accommodation, a letter from the DRC or LDP is required.