Difference between revisions of "Course: Massive Data Analysis 2014"

From VistrailsWiki
 
(9 intermediate revisions by the same user not shown)
Line 14: Line 14:
* On Sept 22nd, I distributed AWS tokens that will be needed for your assignments. If you have not received your token, let me know.
* Your first assignment has been posted -- see details below and in NYU Classes.
* Instructions on how to set up your AWS account: http://www.vistrails.org/index.php/AWS_Setup
* You should get an NYU HPC account so that you can use the NYU Hadoop cluster. To request an account, follow the instructions at: https://wikis.nyu.edu/display/NYUHPC/HPC+at+NYU+-+Access. You can find instructions on how to log in to and use the NYU Hadoop cluster at: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme-nyu-hadoop.txt


= Background (4 weeks) =
Line 151: Line 153:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/association-rules.pdf


* Assignment: Check http://www.newgradiance.com/services
* Assignment on frequent items and association rule mining. ''Due on Dec 7th.''  Check http://www.newgradiance.com/services


* Reading: Chapter 6 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
Line 160: Line 162:
**Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html


== Week 12 -- Dec 1:  Finding similar items ==
== Week 12 -- Dec 1:  Project Updates ==


* Lecture notes:
Line 167: Line 169:
* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]


== Week 13 -- Dec 8: Graph Analysis ==
* Quizzes on Distance Measures and Document Similarity. ''These quizzes are optional and will count as extra credit. Due on Dec 14th.''  Check http://www.newgradiance.com/services
 
== Week 13 -- Dec 8: Finding Similar Items and Link Analysis ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/similarity.pdf
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/graph-algos.pdf


* Readings:
**Chapter 3 (pages 55-79) [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
**Chapter 5 (pages 87-106) [http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Textbooks/MapReduce-algorithms-Jan2013-draft.pdf Data-Intensive Text Processing with MapReduce]
== Week 13 -- Dec 10: Project Discussion ==
* Meeting with individual groups at 2 MTC, 10.097


== Week 14 -- Dec 15: Project Presentations  ==

Latest revision as of 20:58, 8 December 2014

CS-GY 6333 Massive Data Analysis: Tentative Schedule -- subject to change

  • Lecture: Mondays, 1:00pm-3:25pm at 2MTC, room 9.011.

News

Background (4 weeks)

Week 1 -- Sept 8: Course Overview; the evolution of Data Management

Week 2 -- Sept 15: Provenance and Reproducibility

  • GitHub setup:

Week 3 -- Sept 22: Introduction to Databases; Relational Model and SQL

Week 4 -- Sept 29: Overview: Advanced SQL and Query Optimization

Big Data Foundations and Infrastructure (3 weeks)

Week 5 -- Oct 6: Cloud computing, MapReduce and Hadoop

  • Lab: After the lecture, you will work on an in-class exercise. For this, you need to install Hadoop on your laptop and have your AWS account set up. See the instructions below.
  • You will use two different Hadoop configurations:
    • Local (on your laptop)
    • Amazon AWS: Each student should have received a token with $100 credit towards computing time at AWS. If you have not received the token yet, contact us immediately! When using AWS, always remember to terminate your instances! If you don't, you will be charged and you are responsible for the charges beyond your credit.
    • See the instructions for installing Hadoop on your local machine and setting up your AWS account at http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/HadoopExerciseInstructions.pdf
    • Warning: Install Hadoop on your machine and set up your AWS account before class starts. There will be no time for installing software during our in-class exercise.


  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2, Sections 2.1 and 2.2 (Large-Scale File Systems and Map-Reduce).

Week 6 -- Oct 13: Fall Break

Week 7 -- Oct 20: Big Data Analysis with Myria

Week 7 -- Oct 27: Algorithm Design for MapReduce

  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.

Week 8 -- Nov 3: Parallel Databases vs MapReduce, Query Processing on MapReduce and High-level Languages

  • Discussion about project
  • Assignment: check Gradiance!


Big Data Algorithms, Techniques, and Visualization (3 weeks)

Week 9 -- Nov 10: Visualization and Big Data -- Invited lecture by Dr. Huy Vo (NYU CUSP)


Week 10 -- Nov 17: Visualization Techniques -- Invited lecture by Dr. Lauro Lins (AT&T Research)

  • Project status report due!

Week 11 -- Nov 25: Association Rules

  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html

Week 12 -- Dec 1: Project Updates

Week 13 -- Dec 8: Finding Similar Items and Link Analysis

Week 13 -- Dec 10: Project Discussion

  • Meeting with individual groups at 2 MTC, 10.097

Week 14 -- Dec 15: Project Presentations