Difference between revisions of "Course: Massive Data Analysis 2014"
Revision as of 13:18, 22 September 2014
CS-GY 6333 Massive Data Analysis: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/
- Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana/)
- Lecture: Mondays, 1:00pm-3:25pm at 2MTC, room 9.011.
News
- Welcome!
Background (4 weeks)
Week 1 -- Sept 8: Course Overview; the evolution of Data Management
- Lecture notes: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview.pdf (http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview-6p.pdf)
- Reading: Chapter 1 of Mining of Massive Datasets (version 1.1)
- Course survey: https://docs.google.com/spreadsheet/embeddedform?formkey=dFpwTjROVzhLUWY2NVNXb0xvNTVLMnc6MA
Week 2 -- Sept 15: Provenance and Reproducibility
- Lecture notes: http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
- The class will have a lab component. Please bring your laptops.
- Before class, follow the instructions below to install and set up VisTrails and GitHub.
- VisTrails setup:
- Download VisTrails 2.1.4 from http://www.vistrails.org/index.php/Downloads and follow the installation instructions. Start the system and then quit.
- Download the following packages:
- After you extract the content of the zip files, place them under $HOME/.vistrails/userpackages
- GitHub setup:
- Create a GitHub account (https://github.com/join)
- Learn how to set up Git and create a public repository.
- During class, you will add the trail of your analysis to GitHub and submit the link to your public GitHub repository using this form: https://docs.google.com/forms/d/17OScN8Ea-El20AC4mHIb32S3e62mAbGEiU-BET0PyX8/viewform?usp=send_form
Week 3 -- Sept 22: Introduction to Databases; Relational Model and SQL
- Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/intro-to-db.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/relational-algebra.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/sql-intro.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/sql-more.pdf
- Other useful reading:
Week 4 -- Sept 29: Overview: Advanced SQL and Query Optimization
- Lecture notes:
Big Data Foundations and Infrastructure (3 weeks)
Week 5 -- Oct 6: Cloud computing, Map Reduce and Hadoop
- Lecture notes:
- Getting started with Hadoop: You will use two different Hadoop systems
- NYU HPC will provide accounts so that you can use a local Hadoop cluster. Please submit a request to have an account created for you *ASAP* at: https://wikis.nyu.edu/display/NYUHPC/Request+or+Renew
You can find instructions on how to log in to and use the NYU Hadoop cluster at: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme-nyu-hadoop.txt
- Amazon AWS: Each student will receive a token with $100 credit towards computing time at AWS. See http://www.vistrails.org/index.php/AWS_Setup for instructions on how to set up AWS.
Always remember to terminate your instances! If you don't, you will be charged and will be responsible for any charges beyond your credit.
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2, Sections 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
- Other useful reading:
- Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
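As a warm-up for this week's topic, the map, shuffle, and reduce steps can be sketched in plain Python. This is only an illustrative single-machine sketch, not the Hadoop API and not taken from the lecture notes; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, mimicking what a framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

On a real cluster the map and reduce functions run in parallel on many machines and the framework performs the grouping; here everything runs in one process so the data flow is easy to follow.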
Week 6 -- Oct 13: Fall Break
Week 7 -- Oct 20: Algorithm Design for MapReduce
- Lecture notes:
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2.
Week 8 -- Oct 27: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages
- Lecture notes:
- Required reading:
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook; I have placed this version at http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Additional reading:
- Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
- Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf
Big Data Algorithms and Techniques (3 weeks)
Week 9 -- Nov 3: Association Rules
- Lecture notes:
- Reading: Chapter 6 of Mining of Massive Datasets
Week 10 -- Nov 10: Finding similar items
- Reading: Chapter 3 of Mining of Massive Datasets
Week 11 -- Nov 17: Graph Analysis
- Lecture notes:
Week 12 -- Nov 25: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research)
- Lecture notes:
- Reading:
- The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
- Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
- Nanocubes paper: http://nanocubes.net http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf