Information

Eugene Wu (Instructor)
OH: Tues 5-6PM in 421 Mudd
Fotis Psallidas (IA)
OH: Thurs 12-1PM in 425 Mudd
Class: Weds 4-6PM in 227 Mudd
Syllabus
Piazza
Project
Provide Feedback

Prereqs

Required: W4111 Intro to Databases
Preferred: W4112 DB Implementation
Preferred: E6111 Advanced DB Systems
Ugrads welcome; talk to instructor

Grading

Project 70%
Midterm 15%
Paper Reviews 15%
Extra Credit 0-10%
Participation mandatory

Confirmed Speakers

The course will feature invited speakers from industry and academia, who will compare and contrast research and practice and discuss how to transition between the two.

Juliana Freire: Professor, CSE at NYU
Frank Wang: Founder, Cybersecurity Foundry
Stavros Papadopoulos: CEO, TileDB
Evan Jones: MIT PhD, BlueCore
Anant Bhardwaj: CEO, Instabase
Edward Benson: ex-CEO of cloudstitch, now at Instabase
Adam Marcus: CTO, B12
Richard Hipp: Creator of SQLite
Molham Aref: CEO, LogicBlox (acquired)
Todd Mostak: CEO, MapD
Ben Darnell: CTO, Co-founder, Cockroach Labs

Overview

Data is eating the world, and developing next-generation data-driven applications and systems for working with data is more important than ever. In addition, the lines between research, applications, and industry are increasingly blurred.

This course will survey modern research in data management: from large-scale data processing and modern database engines, to data cleaning and visualization, to secure data management. To ground the discussion, we will host invited speakers who have transitioned (or are transitioning) their research from academia to industry. Depending on timing and interest, select students may be invited to join the speakers for more in-depth discussions over dinner after class.

Students are expected to actively participate in discussions.
Course capped at 25. If waitlist is huge, a small assignment will be used to choose participants.

Recent Announcements

Added questions for first reading below
Updated extra credit part of syllabus

Schedule

Lecture schedule

4-5PM: lecture/discussion
5-6PM: guest lecture

1/17: Introduction

Presenter: Eugene

1/24: Human-Assisted AI at B12

Presenter: Adam Marcus
Reading: Crowdsourced Data Management: Industry and Academic Perspectives
- Skim chapter 3, stopping after completing section 3.3
- Skim chapter 4, stopping after completing section 4.2
- Read the executive summary of chapter 5
Reading Questions: Since the reading is not a paper, answer the following questions
- What is the type of problem that crowd-powered algorithms and systems are uniquely meant to solve? Don’t use any jargon
- Imagine and describe the craziest/most impactful application for which crowdsourcing would be a necessary component.
  - What does the application do?
  - Who cares?
  - Why is crowdsourcing needed?
Assignments
- Project Presentations batch 1. Sign up here

1/31: Secure Databases

Presenter: Frank Wang
Readings
Reading Questions: Since the readings are summaries, answer the following questions
- Describe concrete scenarios of problems that each of the first two readings seeks to address.
  - Are they realistic in practice?
  - Or do they limit functionality or not fully protect what they promise or both?
- The NYMag article is a chilling read. Do the readings help? What could one do?
- Frank has experience in cybersecurity startups and research. Come up with 1+ questions for him.
Assignments
- Project Presentations batch 2. Sign up here

2/07: GPU Databases

Presenter: Todd Mostak
Readings
- Required: Accelerating SQL Database Operations on a GPU with CUDA
- Required: Hashing out Perfect Group-Bys in a GPU-accelerated database
- Optional: Efficient Query Processing in Co-Processor-accelerated Databases
- Optional: Massive Throughput Database Queries with LLVM on GPUs
- Optional: End-to-end Computation on the GPU with the GPU Data Frame (GDF)
- Optional: Vega makes visualizing BIG data easy
See Syllabus for Reading Questions
Assignments
- Project Presentations batch 3. Sign up here
- Prospectus due 2/11

2/14: The future of data interaction

Presenter: Eugene Wu
Readings (optional)

2/21: TileDB

Presenter: Stavros Papadopoulos
TileDB codebase
Reading: The TileDB Array Data Storage Manager
Optional Reading:
- Main discussion: Dynamic Prefetching of Data Tiles for Interactive Visualization
- Related: Distributed and Interactive Cube Exploration

2/28: Instabase: Turning your research into a startup

Presenters: Anant Bhardwaj and Edward Benson
Readings
- The DataHub paper that started Instabase
- Quilt
- (Optional) Ajax-based Report Pages as Incrementally Rendered Views
- (Optional) Collaborative Data Analytics with DataHub (Demo)
Reading Questions:
- For both papers:
  - What concrete problem did the datahub paper set out to solve?
  - What technical idea seems promising (irrespective of the problem statement described in the papers)?
  - Do you think the problem is real? If so, describe an impactful scenario for it. If not, describe why not and what aspects of the paper could be impactful. (Be sure to be clear about what impactful means)
- One or more questions for the speakers

3/07: Cockroach Labs: Raft Made Complex (cancelled)

Presenter: Ben Darnell
Readings
- The original Raft paper
- (Optional) Viewstamped Replication Revisited
- (Optional) Living Without Atomic Clocks blog post
- (Optional) Raft Refloated
- (Optional) Google’s Paxos Made Live
- (Optional) Anna: A Crazy Fast, Super-Scalable, Flexibly Consistent KVS
Reading Questions:
- Provide a simple example of how Raft recovers from a leader failure. A simple way is to list the state at each node at every time step, and use English to describe what happens between each time step.
- Suppose the clocks on all nodes are perfectly synchronized (e.g., node i’s clock is exactly the same as node j’s clock). Which part of Raft is still necessary?
- Raft argues that it is more understandable. What evidence would be sufficient to show that this is the case? How does it compare with what the authors show?
- Why did the authors develop Raft in the first place?
- Include 1+ question to ask the speaker.

3/14: Spring Recess

3/21: Snow Day, class cancelled

3/28: The VisTrails Project + Midterm

Presenter: Juliana Freire
Readings and Questions
- (skim) VisTrails overview
  - What was the unmet need or opportunity?
  - What were existing approaches and why were they not used? Or could they have been used?
  - What was the key technical idea?
- Querying and Creating Visualizations by Analogy
  - What is the primary problem, and the main solution idea, in this paper?
  - Describe a concrete example of this issue outside of the domains of VisTrails and scientific visualization.
  - What aspects of the solution apply to your example? What extensions are needed?
- (Optional) Using VisTrails and Provenance for Teaching Scientific Visualization

4/04: 3 Timeless ideas in SQLite

Presenter: Richard Hipp
Readings and Questions
- No Silver Bullet - Essence and Accident in Software Engineering
  - What accidental complexities in database systems have been mitigated by the SQL language?
  - What additional accidental complexities have not been addressed by SQL?
  - Can the query planner in an SQL database engine be considered an example of an AI or Expert System or as an implementation of an Automatic Programming system?
- About SQLite – it’s pretty amazing!
- Appropriate Uses for SQLite
  - Some applications transfer information between client and server by stuffing the content into an SQLite database and sending the database file over the wire. What are some advantages and disadvantages to this approach compared to sending the content as JSON, a ZIP Archive, or a bespoke binary format?
    - Git, the most widely used distributed version control system today, uses a bespoke key/value database format called a “pack-file” both to store content and to transfer content over the wire when syncing or cloning. The value for each entry is either the content of a file being versioned, or a binary object describing relationships between files. The key is a SHA1 hash of the content. What if Git had been designed to use SQLite, or some other embedded relational database, instead of a key/value store? Would Git be a better or a poorer product if it used an SQLite database in place of pack-files? What impact would using SQLite versus pack-files have on performance, reliability, storage efficiency, and extensibility? If you were designing a replacement or follow-on to Git, what kind of database would you use for local storage, server-side storage, and for transfer?

4/11: LogicBlox

Presenter: Molham Aref
Reading:
- (required) Design and Implementation of the LogicBlox System
- (optional) Datalog Primer
- (optional) Optimal Joins
Click here for the Questions

4/18: TBA

Presenter: Evan Jones
Reading:
- (required) The Dataflow Model
Questions
- The paper proposes a “Dataflow Model”. What exactly are the fundamental parts of the dataflow model? (not very clear in the paper)
- The paper claims Dataflow “balances correctness”.
  - What do they mean by the word “correctness”?
  - What are the advantages and disadvantages of being “correct”?

4/25: Project presentations

5/2: (Extra class) Cockroach Labs: Raft Made Complex

Presenter: Ben Darnell