Course: Massive Data Analysis 2014
Revision as of 17:09, 3 October 2014
CS-GY 6333 Massive Data Analysis: Tentative Schedule -- subject to change
- Course Web page: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/
- Instructor: Professor Juliana Freire (http://vgc.poly.edu/~juliana/)
- Lecture: Mondays, 1:00pm-3:25pm at 2MTC, room 9.011.
News
- On Sept 22nd, I distributed AWS tokens that will be needed for your assignments. If you have not received your token, let me know.
- Your first assignment has been posted -- see details below and in NYU Classes.
Background (4 weeks)
Week 1 -- Sept 8: Course Overview; the evolution of Data Management
- Lecture notes: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview.pdf (http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/course-overview-6p.pdf)
- Reading: Chapter 1 of Mining of Massive Data Sets (version 1.1)
- Course survey: https://docs.google.com/spreadsheet/embeddedform?formkey=dFpwTjROVzhLUWY2NVNXb0xvNTVLMnc6MA
Week 2 -- Sept 15: Provenance and Reproducibility
- Lecture notes: http://vgc.poly.edu/~fchirigati/mda-class/provenance-reproducibility.pdf
- The class will have a lab component. Please bring your laptops.
- Before class, follow the instructions below to install and set up VisTrails and GitHub.
- VisTrails setup:
- Download VisTrails 2.1.4 from http://www.vistrails.org/index.php/Downloads and follow the installation instructions. Start the system and then quit.
- Download the following packages:
- After you extract the content of the zip files, place them under $HOME/.vistrails/userpackages
- GitHub setup:
- Create a GitHub account (https://github.com/join)
- Learn how to set up git and create a public repository.
- During class, you will add the trail of your analysis to GitHub and submit the link to your public GitHub repo using this form: https://docs.google.com/forms/d/17OScN8Ea-El20AC4mHIb32S3e62mAbGEiU-BET0PyX8/viewform?usp=send_form
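The VisTrails package step above (extract the zip files and place their contents under $HOME/.vistrails/userpackages) can be sketched in Python as follows. This is only an illustration: the function name `install_packages` and the `download_dir` argument are hypothetical, and the only detail taken from the instructions is the target directory.

```python
import zipfile
from pathlib import Path


def install_packages(download_dir, target_dir=None):
    """Extract every .zip found in download_dir into target_dir.

    target_dir defaults to $HOME/.vistrails/userpackages, the location
    named in the setup instructions. Returns the archive names processed.
    """
    if target_dir is None:
        target_dir = Path.home() / ".vistrails" / "userpackages"
    target = Path(target_dir)
    # Create the directory if VisTrails has not created it yet.
    target.mkdir(parents=True, exist_ok=True)

    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted
```

For example, `install_packages(Path.home() / "Downloads")` would unpack every zip file in a Downloads folder; adjust the path to wherever you saved the packages.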
Week 3 -- Sept 22: Introduction to Databases; Relational Model and SQL
- Lecture notes:
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/intro-to-db.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/relational-algebra.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/sql-intro.pdf
- http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/sql-more.pdf
- Other useful reading:
Week 4 -- Sept 29: Overview: Advanced SQL and Query Optimization
- Lecture notes:
- In-class exercise: http://vistrails.org/index.php/Big_Data_Lab_SQL
Big Data Foundations and Infrastructure (3 weeks)
Week 5 -- Oct 6: Cloud computing, Map Reduce and Hadoop
- Lecture notes: http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/mapreduce-intro.pdf
- Lab: after the lecture, you will work on an in-class exercise. For this you need to install Hadoop on your laptop and have your account set up on AWS. See instructions below.
- Getting started with Hadoop: You will use three different Hadoop configurations:
- Local (on your laptop): see installation instructions in http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/HadoopExerciseInstructions.pdf
- NYU HPC: NYU HPC will provide accounts so that you can use a local Hadoop cluster. Please submit a request for an account *ASAP*, following the instructions at https://wikis.nyu.edu/display/NYUHPC/HPC+at+NYU+-+Access. You can find instructions on how to log in and use the NYU Hadoop cluster at: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme-nyu-hadoop.txt
- Amazon AWS: Each student will receive a token with $100 credit towards computing time at AWS. See http://www.vistrails.org/index.php/AWS_Setup for instructions on how to set up AWS. Always remember to terminate your instances! If you don't, you will be charged, and you are responsible for any charges beyond your credit.
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2, Sections 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).
- Other useful reading:
- Hadoop: The Definitive Guide. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
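As a warm-up for the lab, the MapReduce programming model covered this week can be sketched in plain Python with the classic word-count example. This is not Hadoop code and is not taken from the lecture notes or the exercise instructions; all function names are illustrative, and the `shuffle` step only imitates, in memory, the grouping a real framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts "a" and "b" twice each and "c" once. In Hadoop, the map and reduce functions run on different machines and the framework handles the grouping.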
Week 6 -- Oct 13: Fall Break
Week 7 -- Oct 20: Algorithm Design for MapReduce
- Lecture notes:
- Required reading:
- Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
- Mining of Massive Datasets (2nd Edition), Chapter 2.
Week 8 -- Oct 27: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages
- Lecture notes:
- Required reading:
- Data-Intensive Text Processing with MapReduce (Jan 27, 2013), Chapter 6 -- Processing Relational Data (this chapter appears in the 2013 version of the textbook -- I have placed this version in http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf)
- Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
- MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Additional reading:
- Pig Latin: A Not-So-Foreign Language for Data Processing: http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf
- Hive - A Warehousing Solution Over a Map-Reduce Framework: http://www.vldb.org/pvldb/2/vldb09-938.pdf
Big Data Algorithms and Techniques (3 weeks)
Week 9 -- Nov 3: Association Rules
- Lecture notes:
- Reading: Chapter 6 of Mining of Massive Datasets
Week 10 -- Nov 10: Finding similar items
- Reading: Chapter 3 of Mining of Massive Datasets
Week 11 -- Nov 17: Graph Analysis
- Lecture notes:
Week 12 -- Nov 25: Large-Scale Visualization -- Invited lecture by Dr. Lauro Lins (AT&T Research)
- Lecture notes:
- Reading:
- The Value of Visualization, Jarke Van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
- Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
- Nanocubes paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf