Prereqs
- Required: W4111 Intro to Databases
- Preferred: W4112 DB Implementation
- Preferred: E6111 Advanced DB Systems
- Ugrads welcome; talk to instructor
Grading
Confirmed Speakers
The course will feature invited speakers from industry and academia, who will compare and contrast research and practice and discuss transitioning between the two.
- Juliana Freire: Professor, CSE at NYU
- Frank Wang: Founder, Cybersecurity Foundry
- Stavros Papadopoulos: CEO, TileDB
- Evan Jones: MIT PhD, BlueCore
- Anant Bhardwaj: CEO, Instabase
- Edward Benson: ex-CEO of cloudstitch, now at Instabase
- Adam Marcus: CTO, B12
- Richard Hipp: Creator of SQLite
- Molham Aref: CEO, LogicBlox (bought)
- Todd Mostak: CEO, MapD
- Ben Darnell: CTO, Co-founder, Cockroach Labs
Overview
Data is eating the world, and developing next-generation data-driven applications and systems for working with data is more important than ever. In addition, the lines between research, applications, and industry are increasingly blurred.
This course will survey modern research in data management, from large-scale data processing and modern database engines to data cleaning, visualization, and secure data management. To ground the discussion, we will host invited speakers who have transitioned (or are transitioning) their research work from academia to industry. Depending on timing and interest, select students may be invited to join the speakers for more in-depth discussions over dinner after class.
Students are expected to actively participate in discussions.
The course is capped at 25 students. If the waitlist is long, a small assignment will be used to choose participants.
Recent Announcements
- Added questions for first reading below
- Updated extra credit part of syllabus
Schedule
Lecture schedule
- 4-5PM: lecture/discussion
- 5-6PM: guest lecture
1/17: Introduction
1/24: Human-Assisted AI at B12
- Presenter: Adam Marcus
- Reading: Crowdsourced Data Management: Industry and Academic Perspectives
- Skim chapter 3, stopping after completing section 3.3
- Skim chapter 4, stopping after completing section 4.2
- Read the executive summary of chapter 5
- Reading Questions: Since the reading is not a paper, answer the following questions
- What type of problem are crowd-powered algorithms and systems uniquely meant to solve? Don’t use any jargon.
- Imagine and describe the craziest/most impactful application for which crowdsourcing would be a necessary component.
- What does the application do?
- Who cares?
- Why is crowdsourcing needed?
- Assignments
1/31: Secure Databases
- Presenter: Frank Wang
- Readings
- Reading Questions: Since the readings are summaries, answer the following questions
- Describe concrete scenarios of problems that each of the first two readings seeks to address.
- Are they realistic in practice?
- Or do they limit functionality or not fully protect what they promise or both?
- The NYMag article is a chilling read. Do the readings help? What could one do?
- Frank has experience in cybersecurity startups and research. Come up with 1+ questions for him.
- Assignments
2/07: GPU Databases
2/14: The future of data interaction
- Presenter: Eugene Wu
- Readings (optional)
2/21: TileDB
2/28: Instabase: Turning your research into a startup
- Presenter: Anant Bhardwaj and Edward Benson
- Readings
- Reading Questions:
- For both papers:
- What concrete problem did the datahub paper set out to solve?
- What technical idea seems promising (irrespective of the problem statement described in the papers)?
- Do you think the problem is real? If so, describe an impactful scenario for it. If not, describe why not and what aspects of the paper could be impactful. (Be clear about what “impactful” means.)
- One or more questions for the speakers
3/07: Cockroach Labs: Raft Made Complex (cancelled)
- Presenter: Ben Darnell
- Readings
- Reading Questions:
- Provide a simple example of how Raft recovers from a leader failure. A simple way is to list the state at each node at every time step, and use English to describe what happens between consecutive time steps.
- Suppose the clocks on all nodes are perfectly synchronized (e.g., node i’s clock is exactly the same as node j’s clock). Which part of Raft is still necessary?
- Raft argues that it is more understandable. What evidence would be sufficient to show that this is the case? How does it compare with what the authors show?
- Why did the authors develop Raft in the first place?
- Include 1+ question to ask the speaker.
3/14: Spring Recess
3/21: Snow Day, class cancelled
3/28: The VisTrails Project + Midterm
4/04: 3 Timeless ideas in SQLite
- Presenter: Richard Hipp
- Readings and Questions
- No Silver Bullet - Essence and Accident in Software Engineering
- What accidental complexities in database systems have been mitigated by the SQL language?
- What additional accidental complexities have not been addressed by SQL?
- Can the query planner in an SQL database engine be considered an example of an AI or Expert System or as an implementation of an Automatic Programming system?
- About SQLite – it’s pretty amazing!
- Appropriate Uses for SQLite
- Some applications transfer information between client and server by stuffing the content into an SQLite database and sending the database file over the wire. What are some advantages and disadvantages to this approach compared to sending the content as JSON, a ZIP Archive, or a bespoke binary format?
- Git, the most widely used distributed version control system today, uses a bespoke key/value database format called a “pack-file” both to store content and to transfer content over the wire when syncing or cloning. The value for each entry is either the content of a file being versioned, or a binary object describing relationships between files. The key is a SHA1 hash of the content. What if Git had been designed to use SQLite, or some other embedded relational database, instead of a key/value store? Would Git be a better or a poorer product if it used an SQLite database in place of pack-files? What impact would using SQLite versus pack-files have on performance, reliability, storage efficiency, and extensibility? If you were designing a replacement or follow-on to Git, what kind of database would you use for local storage, server-side storage, and for transfer?
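To make the Git question concrete, here is a minimal sketch of a content-addressed key/value store kept in an SQLite table. It is only an illustration, not how Git or SQLite-based tools actually work: the `objects` table and the `open_store`, `put`, and `get` helpers are hypothetical names, and the key is the SHA1 of the raw bytes, which is a simplification (real Git also hashes a small type/size header along with the content).

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical content-addressed object store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS objects (
            sha1    TEXT PRIMARY KEY,
            content BLOB NOT NULL
        )
        """
    )
    return conn


def put(conn, content: bytes) -> str:
    """Store content under the hex SHA1 of its bytes and return that key."""
    key = hashlib.sha1(content).hexdigest()
    # Identical content hashes to the same key, so a repeat insert is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO objects (sha1, content) VALUES (?, ?)",
        (key, content),
    )
    conn.commit()
    return key


def get(conn, key: str):
    """Return the stored bytes for key, or None if the key is unknown."""
    row = conn.execute(
        "SELECT content FROM objects WHERE sha1 = ?", (key,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    store = open_store()
    key = put(store, b"hello world\n")
    print(key, get(store, key))
```

Because the key is derived from the content, storing the same bytes twice leaves a single row. When answering the question, consider what a relational schema adds beyond this (indexes, transactions, ad hoc queries) and what it might cost relative to pack-files.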
4/11: LogicBlox
4/18: TBA
- Presenter: Evan Jones
- Reading:
- Questions
- The paper proposes a “Dataflow Model”. What exactly are the fundamental parts of the dataflow model? (not very clear in the paper)
- The paper claims Dataflow “balances correctness”.
- What do they mean by the word “correctness”?
- What are the advantages and disadvantages of being “correct”?
4/25: Project presentations
5/02: (Extra class) Cockroach Labs: Raft Made Complex