Difference between revisions of "Course: Big Data 2015"

From VistrailsWiki
Line 111: Line 111:
= Big Data Algorithms, Mining Techniques, and Visualization (6 weeks) =


== Week 8: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP) ==
== Week 7 - March 30th: Finding similar items  ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/vis_and_big_data_resized.pdf
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]


== Week 9: Parallel Databases ==
* Homework Assignment
 
** See quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity.
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf


** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
== Week 8 - April 6th: Association Rules  ==
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
 
== Week 9: Association Rules  ==


* Lecture notes:
Line 139: Line 136:




== Week 10: Finding similar items  ==
== Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP) ==


* Lecture notes:
** http://vgc.poly.edu/~juliana/courses/BigData2015/Lectures/similarity.pdf
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/vis_and_big_data_resized.pdf
 
 
== Week 10: Parallel Databases ==
 
** http://vgc.poly.edu/~juliana/courses/MassiveDataAnalysis2014/Lectures/paralleldb-vs-hadoop-2014.pdf


* Reading: Chapter 3 [http://vgc.poly.edu/~juliana/courses/BigData2015/Textbooks/ullman-book-v1.1-mining-massive-data.pdf Mining of Massive Datasets]
** Benchmark DBMS vs MapReduce (2009): http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
** MapReduce: A Flexible Data Processing Tool: http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext


* Homework Assignment
** There are two new quizzes on [http://www.newgradiance.com Gradiance] -- Distance measures and document similarity. They are ''due on May 5th.''
** Your final assignment is available at http://www.vistrails.org/index.php/Assignment_4_-_Querying_with_Pig_and_Mapreduce. This is an optional assignment and will count towards extra credit.


== Week 11: Graph Analysis ==
Line 157: Line 157:
== Week 12: TBD ==


== Week 13: TBD ==
== Week 13: Final Exam ==


== Week 14: Final Exam ==
== Week 14:  Project Presentations ==


== Week 15: Project Presentations ==

Revision as of 14:10, 23 March 2015

DS-GA 1004 - Big Data: Tentative Schedule -- subject to change

  • Lecture: Mondays, 4:55pm-7:35pm at Silver, room 208.
  • Some classes will include a lab session; please always bring your laptop.

News

  • 2/26/2015: An Amazon AWS token was emailed to each student. Please create your Amazon AWS account. You can find instructions at: http://www.vistrails.org/index.php/AWS_Setup
  • 2/26/2015: You should install the Cloudera VM on your laptop. We will need that for the lab on March 9th. Here are the instructions: Cloudera VM Setup
  • There is a new version of the textbook Mining of Massive Datasets; we will use the latest version, 2.1.

Background (2 weeks)

Week 1 - Feb 2: Course Overview; The evolution of Data Management and introduction to Big Data

Week 2 - Feb 9: Introduction to Databases, Relational Model and SQL

  • Programming assignment: Using SQL for data analysis and cleaning (see NYU Classes)

Feb 16: Holiday

Big Data Foundations and Infrastructure (3 weeks)

Week 3 - Feb 23: Introduction to Map Reduce


Week 4 - March 2: Algorithm Design for MapReduce: Relational Operations

  • Lab: Hands-on Hadoop (local)
  • Required reading:
    • Data-Intensive Text Processing with MapReduce, Chapters 1 and 2
    • Mining of Massive Datasets (2nd Edition), Chapter 2.
  • Programming assignment: Map Reduce (check NYU Classes)

Week 5 - March 9: MapReduce Algorithm Design Patterns; Parallel Databases vs MapReduce


  • Programming assignment: check NYU Classes on March 10th

March 16th: Spring Break

Transparency and Reproducibility (1 week)

Week 6 - March 23: Data Exploration and Reproducibility

  • Programming assignment 4: Exploring urban data (see NYU Classes)

Big Data Algorithms, Mining Techniques, and Visualization (6 weeks)

Week 7 - March 30th: Finding similar items

  • Homework Assignment
    • See quizzes on Gradiance -- Distance measures and document similarity.

Week 8 - April 6th: Association Rules

  • Suggested additional reading:
    • Fast algorithms for mining association rules, Agrawal and Srikant, VLDB 1994.
    • Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann
    • Dynamic Itemset Counting and Implication Rules for Market Basket Data. Brin et al., SIGMOD 1997. http://www-db.stanford.edu/~sergey/dic.html


Week 9 - April 13th: Visualization and Spatio-Temporal Data -- Invited lecture by Dr. Huy Vo (NYU CUSP)


Week 10: Parallel Databases


Week 11: Graph Analysis

Week 12: TBD

Week 13: Final Exam

Week 14: Project Presentations

Week 15: Project Presentations