Course Information
Course Description (Unofficial)
Presents the fundamentals of data engineering using polyglot persistence. We will go beyond
the relational database that has dominated industry since the 1970s and explore the world of
NoSQL (non-relational) data models including Document, Key-Value, Graph, and Columnar
databases. We’ll study their capabilities, limitations, and use cases. On the “retrieval” side of
things, we will cover such diverse topics as Apache Spark using Scala and
Stream / Event processing using Kafka and Spark Streaming. Most of these technologies are relatively new,
having been available to developers for less than 10 years, and they continue to evolve rapidly
as part of the effort to address today’s big-data challenges. This makes their study both
fascinating and challenging. There will be some programming assignments, one small research
assignment where you will investigate a NoSQL database of your own choosing, and a class
project. In our study of Apache Spark, we will learn Spark SQL and the rudiments of Scala
functional programming. Other programming assignments may be implemented in your choice
of language.
Credit hours: 4
Readings
There is no single textbook that covers everything we will be exploring in this
class – although some authors have tried! What few books are available tend to be very high-level and of little practical value. I will be assigning chapters, papers, and videos as we go
along. To the extent possible I have chosen books that are available to you for free as a
Northeastern student through O'Reilly E-Books. Two very popular books that we will be using are:
- Kleppmann, 2017. Designing Data-Intensive Applications, O'Reilly.
- Reis & Housley, 2022. Fundamentals of Data Engineering.
Both of these books are freely available through your student O'Reilly E-Books account. I also created a DS4300
playlist for your convenience.
Recordings
Most classes will be recorded. Recordings are available through Canvas > Zoom Meetings > Cloud Recordings.
Recordings should not be used as a substitute for coming to class. In fact, I do take attendance, and attendance counts towards your final grade.
Advice for taking an online class
I understand that taking a class online can be a challenge for some students. I will do everything in my power to make the class as personalized
and engaging as possible. This course was not designed for 100,000 anonymous strangers. I will tailor the material to your questions and feedback and
your assignments will be individually evaluated by the instruction team. You can make this class more interesting and fun by asking questions,
expressing opinions, and having a voice. Turning on your camera, while not required, will help me to get to know you and to know how you
are reacting to the material. You may also find it helpful to have a second small monitor set up so you can follow along when I give technical demonstrations.
Homework
There will be one "mini-project" programming assignment about every two weeks.
The detailed dates are listed on the schedule below. Homework should be completed largely on your own. Informal discussions with fellow students, or seeking general help from them, are OK so long as you
cite your sources. Do not simply copy another student's submission! In addition, as part of your homework, there may be occasional take-home quizzes to verify your
understanding of the reading material.
Strict Homework Late Policy:
- Up to 48 hours late: 10% penalty
- After 48 hours: Not accepted.
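As a rough illustration of how the late policy could be applied to a score, here is a minimal Python sketch. It rests on assumptions that the policy above does not state: that the 10% penalty is deducted from the earned score, and that a submission exactly 48 hours late still falls in the penalty window. The function name apply_late_policy is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_WINDOW = timedelta(hours=48)  # "up to 48 hours late"
LATE_PENALTY = 0.10                 # assumption: 10% of the earned score is deducted


def apply_late_policy(score: float, due: datetime, submitted: datetime) -> Optional[float]:
    """Return the adjusted score, or None if the submission is not accepted."""
    lateness = submitted - due
    if lateness <= timedelta(0):
        return score                       # on time: no penalty
    if lateness <= GRACE_WINDOW:
        return score * (1 - LATE_PENALTY)  # late, but within 48 hours
    return None                            # more than 48 hours late


due = datetime(2024, 1, 26, 23, 59)
on_time = apply_late_policy(90.0, due, due - timedelta(minutes=5))
one_day_late = apply_late_policy(90.0, due, due + timedelta(hours=24))
too_late = apply_late_policy(90.0, due, due + timedelta(hours=49))
```

The due date in the usage lines is made up for the example; the actual due dates are the ones listed on the schedule.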
Class Project
There will be a group project involving either a NoSQL database or the
use of Apache Spark to be presented at the end of the semester.
Academic Misconduct
Programming is a creative process. Individuals or pair groups (when allowed) must reach their own understanding of problems
and discover paths to their solutions. During this time, discussions with friends and colleagues
are encouraged—you will do much better in the course, and at Northeastern, if you find people
with whom you regularly discuss problems. But those discussions should take place verbally.
If you simply copy large blocks of code from another student, you are breaking the rules.
Each program/application must be largely the product of your own mind. To the extent that you use
internet resources such as Stack Overflow or ChatGPT, I expect that you will properly cite your sources
within your code.
The university's academic integrity policy discusses actions regarded as violations
and consequences for students:
http://www.northeastern.edu/osccr/academic-integrity
Evaluation
The final grade for this course will be weighted as follows:
- Homework: 70%
- Group Project: 25%
- Participation (attendance, Piazza, etc.): 5%
Final grades will be assigned based on the following scale.
Computed grades are NOT rounded.
Letter | Range |
A | 94 - 100 |
A- | 90 - 93 |
B+ | 87 - 89 |
B | 83 - 86 |
B- | 80 - 82 |
C+ | 77 - 79 |
C | 73 - 76 |
C- | 70 - 72 |
D+ | 67 - 69 |
D | 63 - 66 |
D- | 60 - 62 |
F | <60 |
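To make the arithmetic concrete, the weighting and the letter scale can be sketched in a few lines of Python. This is only an illustration, not the official grade calculation. It assumes each component is scored on a 0-100 scale and that, because grades are not rounded, each letter applies from its lower bound upward (for example, 89.9 would still be a B+). The names final_score and letter_grade are hypothetical.

```python
# Component weights from the Evaluation section.
WEIGHTS = {"homework": 0.70, "project": 0.25, "participation": 0.05}

# (lower bound, letter), highest first.
# Assumption: a letter is earned once the unrounded score reaches its lower bound.
LETTER_CUTOFFS = [
    (94, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def final_score(components: dict) -> float:
    """Weighted average of the component scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def letter_grade(score: float) -> str:
    """Map an unrounded score to a letter using the cutoffs above."""
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


example = {"homework": 92, "project": 88, "participation": 100}
score = final_score(example)   # 0.70*92 + 0.25*88 + 0.05*100, about 91.4
grade = letter_grade(score)
```

With these made-up component scores, the weighted total comes to about 91.4, which falls in the A- band.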
Schedule
Note: This schedule is subject to change and will be adjusted as needed throughout the semester.
Week | Date | Topic | Reading | HW Due |
1 | Jan 9/12 | The limits of the relational model | Kleppmann 1, Harrison 1 | |
2 | Jan 16/19 | NoSQL models. Consistency and Availability | Kleppmann 2, Vogels2008 | |
3 | Jan 23/26 | Redis, Key-Value Stores | Kleppmann 3, Harrison 3 | HW1 |
4 | Jan 30/Feb 2 | Indexing & Storage | TBD | |
5 | Feb 6/9 | Mongo: Document Stores | | HW2 |
6 | Feb 13/16 | Replication and the CAP Theorem | Kleppmann 5 | |
7 | Feb 20/23 | Partitioning, More Mongo and PyMongo | Kleppmann 6 | HW3 |
8 | Feb 27/Mar 2 | Functional programming with Scala | Swartz: Learning Scala Ch1-5 | |
9 | Mar 5/8 | Spring Break - No Class | | |
10 | Mar 12/15 | Apache Spark and SparkSQL | Sarkar: Learning SparkSQL (Ch1-3) or Zaharia: Learning Spark (Ch1,2,9) | HW4 |
11 | Mar 19/22 | Event processing with Kafka / Spark Streaming | TBD | |
12 | Mar 26/29 | Graph databases | TBD | HW5 |
13 | Apr 2/5 | Graph databases II | TBD | |
14 | Apr 9/12 | Project Science Fair | | |
15 | Apr 16 | Wrap up and the future of databases | | PROJECT DUE Apr 17 @ 11:59pm ET/Boston |
Inclusive Class
Northeastern University values the diversity of our students, staff, and faculty, recognizing the important contribution each makes to our unique community.
Respect is demanded at all times throughout this course. In the classroom, not only is participation required, it is expected that everyone is treated with dignity and respect. We realize everyone comes from a different background with different experiences and abilities. Our knowledge will always be used to better everyone in the class.
We strive to create a learning environment that is welcoming to students of all backgrounds. If you feel unwelcome for any reason, please let us know so we can work to make things better. You can let us know by talking to anyone on the teaching staff. If you feel uncomfortable talking to members of the teaching staff, please consider reaching out to your academic advisor.
Northeastern is committed to providing equal access and support to all qualified students through the provision of reasonable accommodations so that each student may fully participate in the learning experience. If you have a disability that requires accommodations, please contact the Disability Resource Center: http://www.northeastern.edu/drc/, DRC@northeastern.edu, 617-353-2675. Accommodations cannot be made retroactively. To receive an accommodation, a letter from the DRC or LDP is required.