Difference between revisions of "Course: Big Data 2015"
Latest revision as of 01:13, 29 April 2015
DS-GA 1004 - Big Data: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/BigData2015
- Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana)
- Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208.
- Some classes will include a lab session; please always bring your laptop.
News
- 04/05/2015: New quizzes are available at http://www.newgradiance.com
- Big Data 2015: Final Project
- 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup
- 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: Cloudera VM Setup
- There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.
Background (2 weeks)
Week 1 - Feb 2: Course Overview; The evolution of Data Management and introduction to Big Data
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/course-overview.pdf
- Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)
- Course survey: https://docs.google.com/forms/d/1LTiJwkDVvp0cF62Fw_d9Y86US5LCkorRUIQtV2T8KWE/viewform?usp=send_form
Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL
- Lecture notes:
- Lab:
  - SQL hands on: Big Data 2015 - SQL Lab
- Other useful reading:
- Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)
Feb 16: Holiday
Big Data Foundations and Infrastructure (3 weeks)
Week 3 - Feb 23: Introduction to Map Reduce
- Lab: (continuation)
  - SQL hands on: Big Data 2015 - SQL Lab
- Lecture notes:
- Required Reading:
  - Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
  - Mining of Massive Datasets (v 2.1), Chapter 2: Sections 2.1, 2.2, and 2.3
- Other useful reading:
  - Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
- Quiz 1 (Map Reduce) assigned -- check http://www.newgradiance.com/services
Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations
- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-relations.pdf
- Lab: Hands-on Hadoop (local)
- Required reading:
  - Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
  - Mining of Massive Datasets (2nd Edition), Chapter 2
- Programming assignment: Map Reduce (check NYU Classes)
Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce
- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/mapreduce-algo-design-patterns.pdf
- Lab: Hands-on Hadoop on AWS
  - Lab materials: http://bigdata.poly.edu/~tuananh/files/awscli-examples.zip
  - Install the AWS command-line interface: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
- Some links to AWS CLI documentation:
  - http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
  - http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html
  - http://www.linux.com/learn/tutorials/761430-an-introduction-to-the-aws-command-line-tool
  - EMR through the command line: https://www.safaribooksonline.com/library/view/programming-elastic-mapreduce/9781449364038/ch04.html
  - Importing a key: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws
  - EMR Job Flow: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/EMR_CreateJobFlow.html
- Required reading:
  - Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
  - Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data. This chapter appears in the 2013 version of the textbook: http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf
- Programming assignment: check NYU Classes on March 10th
March 16th: Spring Break
Transparency and Reproducibility (1 week)
Week 6 - March 23: Data Exploration and Reproducibility
- Lecture notes: http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/data-science-reproducibility.pdf
- Lab: Hands-on reproducibility. Before class, please:
  - Download VisTrails 2.1.5 from here: http://www.vistrails.org/index.php/Downloads
  - Download the mta-analysis example: http://bigdata.poly.edu/~fchirigati/mda-class/mta-analysis.vt
  - Download the links for the input data: http://bigdata.poly.edu/~fchirigati/mda-class/mta-links.txt
  - http://bigdata.poly.edu/~fchirigati/mda-class/hands-on.pdf
  - Questions? Email Fernando at fchirigati@nyu.edu
- Programming assignment 4: Exploring urban data (see NYU Classes)
Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)
Week 7 - March 30th: Finding similar items
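- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf
- Reading: Chapter 3 of Mining of Massive Datasets (http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf)
- Homework Assignment
  - See quizzes on Gradiance (http://www.newgradiance.com) -- Distance measures and document similarity.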
Week 8 - April 6th: Association Rules
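- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/association-rules.pdf
- Reading: Chapter 6 of Mining of Massive Datasets (http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf)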
- Suggested additional reading:
  - Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
  - Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
  - Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html
- Homework Assignment
  - See quizzes on Gradiance (http://www.newgradiance.com) -- Distance measures and document similarity.
Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Harish Doraiswamy (NYU CUSP)
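- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/SpatialQP.pdf
- Lab: Using Amazon AWS to analyze and visualize taxi data
  - https://github.com/ViDA-NYU/aws_taxi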
Week 10 - April 20th: Parallel Databases
- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/paralleldb-vs-hadoop-2015.pdf
- Required reading:
  - Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
  - MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Suggested reading:
  - Hive - A Warehousing Solution Over a Map-Reduce Framework. http://dl.acm.org/citation.cfm?id=1687609
  - Pig Latin: A Not-So-Foreign Language for Data Processing. http://dl.acm.org/citation.cfm?id=1376726
  - BigTable: http://fcoffice.googlecode.com/svn/%E4%B9%A6%E7%B1%8D/bigtable-osdi06.pdf
  - Spark: Cluster Computing with Working Sets. http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf
Week 11 - April 27th: Graph Analysis
- Lecture notes:
  - http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/graph-algos.pdf
- Required Reading: Data-Intensive Text Processing with MapReduce, Chapter 5 -- Graph Algorithms
Week 12 - May 4: Final Exam
Week 13 - May 11: Project Presentations
Week 14 - May 18: Project Presentations