Course: Big Data Analysis
Revision as of 19:51, 8 September 2013
Fall 2013
This schedule is tentative and subject to change
Make sure to check my.poly.edu for course announcements
Week 1: Monday Sept. 9th - Course Overview
- Course overview and introduction to Big Data Analysis
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf
- Student survey -- to be filled out today!
Required Reading
Additional References
- Dilbert's BigData
- New York Times' "How BigData Became so Big"
- World Economic Forum: Big Data, Big Impact
- The Analytics Journey
- BigData Analytics Usecases
Week 2: Monday Sept. 16th - Map-Reduce/Hadoop
- Introduction to Map-Reduce and high-level data processing languages
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/Hadoop.pdf
- Hand out AWS tokens. Notes on using AWS.
- First assignment, due on Sun Sept 29th.
- Introduction to Hadoop
- The Map-Reduce ecosystem: Pig, Hive, Jaql, Mahout, BigInsights
Required Reading
- Mining of Massive Datasets, Chapter 2
- Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3
- Original Google MapReduce paper
Additional References
- Pig Latin: A Not-So-Foreign Language for Data Processing
- Jaql: A Scripting Language for Large Scale Semistructured Data Analysis
- Hive - A Warehousing Solution Over a Map-Reduce Framework
Week 3: Monday Sept. 23rd - Data Management for Big Data
- Databases and Big Data: Persistence, Querying, Indexing, Transactions
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf
Related Topics
- Bigtable and NoSQL stores. Tuple stores vs. column stores: HBase, MongoDB, Cassandra
- HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
- HBase book, Chapter 8 (Architecture): covers transactional processing, notably the write-ahead log, and how consistency is maintained.
- Transactions in NoSQL stores: Google's Percolator, http://research.google.com/pubs/pub36726.html
- "NewSQL" stores: more on Hive, VoltDB, HadoopDB
- Beyond MapReduce: Berkeley's Spark, UC Irvine's Asterix, Google's Dremel
Required Reading
- PDBMS vs. MapReduce
- http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011
- Benchmark DBMS vs MapReduce (2009)
Additional References
- http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_
- Bigtable: A Distributed Storage System for Structured Data
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- Low Overhead Concurrency Control for Partitioned Main Memory Databases
- ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.
- Dremel: Interactive Analysis of Web-Scale Datasets
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
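Week 4: Monday Sept. 30th - Query Processing on MapReduce and High-level Languages
- Pig Latin and Query Processing:
- Relational query processing (review): http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_relational
- Query Processing in Pig: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_pig_mapreduce.ppt.pdf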
- In-class assignment
Required Reading
- http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
- Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008
Week 5: Monday Oct. 7th - Invited Speaker: Torsten Suel
- Big Data and Information Retrieval. Invited lecture by Torsten Suel.
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf
Week 6: Mon Oct. 14th - Fall Break - No class
Week 7: Monday Oct. 21st - Graph Algorithms
TODO
Readings
- 1998 PageRank Paper
- Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)
- Mining of Massive Datasets, Chapter 5 (Link Analysis)
- Pregel: A System for Large-Scale Graph Processing. Google. http://kowshik.github.com/JPregel/pregel_paper.pdf
Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha
- Guest lecture by Dennis Shasha: Statistics is Easy
Week 9: Monday Nov. 4th - EM and Text Processing
TODO
Readings
- Data-Intensive Text Processing with MapReduce, Chapter 6
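Week 10: Monday Nov. 11th - Finding Similar Items and Information Integration
- Similarity: Applications, Measures and Efficiency Considerations
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
- Similarity application: Information integration on the Web
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf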
- Homework presentation and demo
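Required Reading
- Mining of Massive Datasets, Chapter 3; information integration; entity resolution: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf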
Homework Assignment
Due November 17th
Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.
Week 11: Monday Nov. 18th - Frequent Itemsets
Required Reading
- Mining of Massive Datasets, Chapter 4
Homework Assignment
Due November 24th
Additional Reading
- Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&acc=ACTIVE%20SERVICE&CFID=198467341&CFTOKEN=23537886&__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb
- Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
- An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813
Week 12: Monday Nov. 25th - Clustering
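- Lecture notes:
- Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf
- Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf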
Homework Assignment
Due Dec 1st
Readings
- Mining of Massive Datasets, Chapter 7
- See readings for previous class
- Web Mining, by Bing Liu. http://www.cs.uic.edu/~liub/WebMiningBook.html
- Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf
Further Readings
- Anomaly Detection: A Survey. http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php
Week 13: Monday Dec. 2nd - Invited lecture by Enrico Bertini
- Introduction to Visual Analytics
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf
Readings
- The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138
- Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf
Week 14: Monday Dec. 9th - Recommendation Systems
Readings
- Mining of Massive Datasets (Ullman et al.), Chapter 9
Week 15: Monday Dec. 16th - Final Exam
Other topics
Provenance
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science & Engineering, 2008.