Revision as of 19:27, 19 March 2016

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016

Instructors:
- Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Dr. Erin C Carson
- Dr. Nicholas Knight

TAs:
- Yuan Feng
- Kevin Ye

Lecture: Mondays, 4:55pm-7:35pm at Silver 207

Some classes will include a lab session; please always bring your laptop.

News

1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup
1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See NYU HPC Access Instructions

Week 1 - Jan 25: Course Overview

Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf
Lab: Computing infrastructure for the course
Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form

Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases and Relational Model

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-management-evolution.pdf
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-to-db.pdf
Lab: getting started with MySQL
Required Reading:
- Chapter 1 of Mining of Massive Datasets
Suggested Reading:
- Greenspun's SQL for Web Nerds Intro
- SQL/Nerds Modeling (parts)
- History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla, by C. Mohan, EDBT 2013

Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/relational-algebra.pdf
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/sql.pdf
Lab: SQL
Programming assignment: Using SQL for data analysis and cleaning (check NYU Classes)

Week 4 - Feb 15: Holiday

Transparency and Reproducibility (1 week)

Week 5 - Feb 22: Data Exploration and Reproducibility

Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/reproducibility-provenance.pdf
Lab: Hands-on git and github (see NYU Classes). You will need to submit your work for this lab!

Big Data Foundations and Infrastructure (3 weeks)

Week 6 - Feb 29: Introduction to Map Reduce

Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf
Lab: Hands-on Hadoop (local and AWS)
Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services (juliana_freire: Ch. 2: Map-Reduce)
- Quiz is due on 2016-03-14 12:00 PM EST

Week 7 - March 7: MapReduce Algorithm Design Patterns

Lecture notes:
Lab: Hands-on Hadoop (HPC)
Programming assignment: Map Reduce (check NYU Classes)
Readings:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)

Week 8 - March 14th: Spring Break

Week 9 - March 21st: Parallel Databases vs MapReduce; Storage Solutions; Introduction to SPARK

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/hive-pig.pdf
Lab: NoSQL
Readings:
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext

Additional Suggested reading:
- Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
- Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
- BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
- Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 10 - March 28th: Finding similar items

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/similarity.pdf

Reading: Chapter 3 of Mining of Massive Datasets

Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.

Week 11 - April 4th: Association Rules

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/association-rules.pdf

Reading: Chapter 6 of Mining of Massive Datasets

Suggested additional reading:
- Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
- Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
- Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html

Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.

Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)

Lecture notes:
- http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/SpatialQP.pdf

Lab: Using Amazon AWS to analyze and visualize taxi data
- https://github.com/ViDA-NYU/aws_taxi

Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&T Research

Week 14 - April 25th: Graph Analysis

Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf

Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms

Week 15 - May 2: TBD

Week 16 - May 9: Final Exam

Week 17 - May 16: Project Presentations
