Difference between revisions of "Course: Big Data 2016"
Revision as of 22:05, 23 January 2016
DS-GA 1004 - Big Data: Tentative Schedule (subject to change)
- Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2016
- Instructors:
- Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Dr. Erin C Carson
- Dr. Nicholas Knight
- TAs:
- Yuan Feng
- Kevin Ye
- Lecture: Mondays, 4:55pm-7:35pm at 19 University Pl., room 102.
- Some classes will include a lab session; please always bring your laptop.
News
- 1/25/2016: Amazon has kindly donated time on AWS for all the students in this class. You must sign up for the AWS Educate program; see http://www.vistrails.org/index.php/AWS_Setup
- 1/25/2016: An HPC account has been created for you. You will need this account for in-class exercises and homework assignments. See NYU HPC Access Instructions
Week 1 - Jan 25: Course Overview
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/course-overview.pdf
- Lab: Computing infrastructure for the course
- Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)
- Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
Week 2 - Feb 1: The evolution of Data Management and introduction to Big Data; Introduction to Databases, Relational Model and SQL
- Lecture notes:
- Lab: in-class assignment on relational algebra
- Readings:
Week 3 - Feb 8: Introduction to Databases, Relational Model and SQL (cont.)
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/intro-db.pdf
- Lab: SQL
- Programming assignment: Using SQL for data analysis and cleaning (check NYU Classes)
Week 4 - Feb 15: Holiday
Big Data Foundations and Infrastructure (3 weeks)
Week 5 - Feb 22: Introduction to Map Reduce
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-intro.pdf
- Lab: Hands-on Hadoop (local and AWS)
- Quiz 1 (Map Reduce) -- check http://www.newgradiance.com/services
Week 6 - Feb 29: MapReduce Algorithm Design Patterns
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/mapreduce-algo-design.pdf
- Lab: Hands-on Hadoop (HPC)
- Programming assignment: Map Reduce (check NYU Classes)
- Readings:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)
Week 7 - March 7: Parallel Databases vs MapReduce; Storage Solutions; Introduction to Spark
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/paralleldb-vs-hadoop.pdf
- Lab: NoSQL
- Programming assignment: Pig and Spark
- Readings:
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Suggested additional reading:
- Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
- Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
- BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
- Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
Week 8 - March 14th: Spring Break
Transparency and Reproducibility (1 week)
Week 9 - March 21: Data Exploration and Reproducibility
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2016/Lectures/data-science-reproducibility.pdf
- Lab: Hands-on reproducibility.
- Programming assignment: Exploring urban data (see NYU Classes)
Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)
Week 10 - March 28th: Finding similar items
- Reading: Chapter 3 of Mining of Massive Datasets
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 11 - April 4th: Association Rules
- Reading: Chapter 6 of Mining of Massive Datasets
- Suggested additional reading:
- Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
- Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
- Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
- Homework Assignment
- See quizzes on Gradiance -- Distance measures and document similarity.
Week 12 - April 11th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CDS)
- Lab: Using Amazon AWS to analyze and visualize taxi data
Week 13 - April 18th: Data Cleaning - Invited lecture by Dr. Divesh Srivastava, AT&T Research
Week 14 - April 25th: Graph Analysis
- Required Reading: Data-Intensive Text Processing with MapReduce. Chapter 5 -- Graph Algorithms