Course: Big Data Analysis
Revision as of 14:27, 20 September 2012
This schedule is tentative and subject to change
Make sure to check my.poly.edu for course announcements
Week 1: Monday Sept. 10th - Course Overview
- Course overview and introduction to Big Data Analysis
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf
- Student survey -- to be filled out today!
Required Reading
Additional References
- Dilbert's BigData
- New York Times' "How BigData Became so Big"
- World Economic Forum: Big Data, Big Impact
- The Analytics Journey
- BigData Analytics Usecases
Week 2: Monday Sept. 17th - Map-Reduce
- Introduction to Map-Reduce
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/Hadoop.pdf
- Introduction to [1]
- The Map-Reduce ecosystem: Pig, Hive, Jaql, Mahout, BigInsights
Required Reading
- Mining of Massive Datasets, Chapter 2
- Data-Intensive Text Processing with MapReduce, Chapter 2, Chapter 3
- Original Google Map-Reduce paper
Additional References
- Pig Latin: A Not-So-Foreign Language for Data Processing
- Jaql: A Scripting Language for Large Scale Semistructured Data Analysis
- Hive - A Warehousing Solution Over a Map-Reduce Framework
Week 3: Monday Sept. 24th - Databases and Big Data
- Databases and Big Data: Persistence, Querying, Indexing, Transactions
- BigTables and NoSQL stores. Tuple stores vs. column stores: HBase, MongoDB, Cassandra
- Transactions in NoSQL stores. Google's Percolator.
- "NewSQL" stores: more on Hive, VoltDB, HadoopDB
- Beyond MapReduce: Berkeley's Spark, UC Irvine's Asterix, Google's Dremel
Required Reading
- PDBMS vs. MapReduce
- http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011
- Benchmark DBMS vs MapReduce (2009)
Additional Readings
- http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_
- Bigtable: A Distributed Storage System for Structured Data
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- Low Overhead Concurrency Control for Partitioned Main Memory Databases
- ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.
- Dremel: Interactive Analysis of Web-Scale Datasets
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
Week 4: Monday Oct. 1st - Statistics is easy - Invited Speaker: Dennis Shasha
- Guest lecture by Dennis Shasha: Statistics and Big Data
- Provenance and data exploration
Required Reading
- http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science & Engineering, 2008.
Week 5: Monday Oct. 8th - Finding Similar Items
- Overview of information integration
Readings
Week 6: Monday Oct. 15th - Invited Speaker: Torsten Suel
- Reading: inverted index and crawling (Lin chapter 4)
- Ask Torsten (tentative, ask him for reading material)
Readings
- 1998 PageRank Paper
- Mining of Massive Datasets, Chapter 5
- Data-Intensive Text Processing with MapReduce, Chapter 5
Week 7: Monday Oct. 22nd - Invited Speakers: Claudio Silva and Lauro Lins
- Introduction to Visualization; Data stewardship and provenance
- Guest lecture by Claudio Silva and Lauro Lins
Readings
- Hellerstein (ask Claudio for additional references)
- ADD: provenance and reproducibility
Week 8: Monday Oct. 29th - Graph Analysis
- Graph algorithms, link analysis, social networks
Readings
- Data-Intensive Text Processing with MapReduce, Chapter 4
Week 9: Monday Nov. 5th - Frequent Itemsets
Readings
- Mining of Massive Datasets, Chapter 6
Week 10: Monday Nov. 12th - Mining Data Streams
Readings
- Mining of Massive Datasets, Chapter 4
Week 11: Monday Nov. 19th - Clustering
Readings
- Mining of Massive Datasets, Chapter 7
Week 12: Monday Nov. 26th - Recommendation Systems
Readings
- Mining of Massive Datasets, Chapter 9
Week 13: Monday Dec. 3rd - EM algorithms for text processing
- Data-Intensive Text Processing with MapReduce, Chapter 6