Difference between revisions of "Course: Big Data Analysis"

From VistrailsWiki
 
(57 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Fall 2013 ==
'''''The deadline for the Pagerank assignment has been extended. I have sent a notification to all students, but for some of you, the email bounced. Make sure your nyu.edu email is working.'''''
'''''This schedule is tentative and subject to change'''''


Line 4: Line 8:


== News ==
[http://www.vistrails.org/index.php/Course_Project:_Wikipedia_Analysis Project description]
* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu
 
* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC
The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.
 
* Dr. C Mohan's presentation is now available at http://bit.ly/CMnMDS
 
For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].


== Week 1: Monday Sept. 10th - Course Overview ==
== Week 1: Monday Sept. 9th - Course Overview ==


* Course overview and introduction to Big Data Analysis
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf  
* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!
Line 24: Line 35:
* [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]


== Week 2:  Monday Sept. 17th - Map-Reduce ==
== Week 2:  Monday Sept. 16th - Map-Reduce/Hadoop ==
 
* Introduction to Map-Reduce and high-level data processing languages
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf
* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].
* Apache [http://hadoop.apache.org/ Hadoop]
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]


* Introduction to Map-Reduce
=== Assignment ===
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/Hadoop.pdf
 
* Introduction to [http://hadoop.apache.org/Hadoop]
* [[cs9223 Mapreduce Assignment]]
* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://code.google.com/p/jaql/ Jaql], [http://mahout.apache.org/ Mahout], BigInsights
* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.
For more details see http://cis.poly.edu/policies.
* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''
* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018


=== Required Reading ===
* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2, Chapter 3]
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]
* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]


=== Additional References ===
== Week 3: Monday Sept. 23rd - Data Management for Big Data ==
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]
 
== Week 3: Monday Sept. 24th - Databases and Big Data ==


* Databases and Big Data: Persistence, Querying, Indexing, Transactions
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf
* In-class exercise (to be distributed in class)


=== Related Topics ===
* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
* HBase book, Chapter 8 (Architecture): covers transactional processing, notably the Write-Ahead Log, and how consistency is maintained.
* Transactions in NoSQL stores. Google's Percolator, [http://research.google.com/pubs/pub36726.html].
* "NewSQL" stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB],
Line 59: Line 75:
* [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]


=== Additional Readings ===
=== Additional References ===
* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_
* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]
Line 68: Line 84:
* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]


== Week 4:  Monday Oct. 1st - Statistics is easy - Invited Speaker: Dennis Shasha ==
== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==
 
* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''
 
* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS
 
* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.
 
* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.
 
* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan
 
== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==
 
* Pig Latin and Query Processing:
** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]
** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]
* In-class assignment
 
=== Required Reading ===
 
* [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]
 
=== Additional References ===
 
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]
* [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]
 
== Week 6:  Mon Oct. 14th - Fall Break - No class ==
 
== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==
* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf
* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf
 
 
== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==
 
* Inside MongoDB
 
== Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha ==


* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]
* Pig Latin and Query Processing:
* Introduction to Provenance
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_relational  Relational query processing: Review]
** [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/query_processing_pig_mapreduce.ppt.pdf  Query Processing in Pig]


=== Required Reading ===
Line 79: Line 132:
* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008


=== Homework Assignment ===
'''Due October 9th'''
[[BigDataHW1]]


== Week 5: Monday Oct. 8th - Finding Similar Items ==
* We will cover the material planned for "Week 10: Monday Nov. 11th": Finding Similar Items
 
== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==
* Similarity: Applications, Measures and Efficiency considerations
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
Line 94: Line 146:


=== Homework Assignment ===
'''Due October 15th at noon'''
'''Due Nov 15th, 2013'''
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.


== Week 6: Wednesday Oct. 17th - Invited Speaker: Torsten Suel ==
== Week 10: Monday Nov. 11th - MapReduce Algorithm Design ==
'''Note this class will be held on Wednesday!'''


* Big Data and Information Retrieval. Invited lecture by Torsten Suel.
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/search-data.pdf


=== Readings ===
=== Required Reading ===
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]


== Week 7: Monday Oct. 22nd - Invited lecture by and Lauro Lins ==
* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer
* Introduction to Visualization
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro-to-visualization.pdf


=== Readings ===
=== Homework Assignment ===
The Value of Visualization. IEEE Visualization 2005. Jarke J. van Wijk. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.1138
'''Due Nov 15th, 2013'''
Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.


Visualization Analysis and Design: Principles, Methods, and Practice. Tamara Munzner (Book Draft 2 from Sep. 2012). http://www.cs.ubc.ca/~tmm/courses/533-11/book/vispmp-draft.pdf
== Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing ==
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf


== Week 8: Monday Oct 29th- Class canceled due to storm ==  
=== Homework Assignment ===
Your MapReduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.




== Week 9: Monday Nov 5th- Data infrastructure and information integration ==  
=== Required Reading ===
* Big Table, HadoopDB.
* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]
* Similarity application: Information integration on the Web:
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf


=== Readings ===
=== Additional Reading ===
* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]
* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.
* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]


== Week 10: Monday Nov. 12th  - Frequent Itemsets ==
== Week 12: Monday Nov. 25th - Large-Scale Visualization ==
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/assoc-rules1.pdf


=== Readings ===
* Invited lectures by:
* Mining of Massive Datasets, Chapter 4
** Dr. Lauro Lins (AT&T Research)
** Dr. Huy Vo (NYU Center for Urban Science and Progress)


=== Additional Reading ===
* Lecture notes:
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://delivery.acm.org/10.1145/180000/170072/p207-agrawal.pdf?ip=128.238.251.32&acc=ACTIVE%20SERVICE&CFID=198467341&CFTOKEN=23537886&__acm__=1352747519_b80a516e0f5e294b36dc021f13f55bbb
** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://dl.acm.org/citation.cfm?id=223813


== Week 11: Monday Nov 19th- Algorithms on MapReduce: text processing  ==


* Algorithms, link analysis, social networks
=== Required Reading ===
** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
The Value of Visualization, Jarke Van Wijk
* Discussion on the project
http://www.win.tue.nl/~vanwijk/vov.pdf


=== Readings ===
Tamara Munzner's Book draft 2 available online
* Data-Intensive Text Processing with MapReduce, Chapter 4
http://www.cs.ubc.ca/~tmm/courses/533/book/


== Week 12: Monday Nov. 26th - Graph Algorithms and Phase-I project presentations ==
Nanocubes Paper
http://nanocubes.net
http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf


** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
=== Additional Reading ===
imMens Paper (to contrast with nanocubes)
http://vis.stanford.edu/papers/immens


=== Readings ===
* Data-Intensive Text Processing with MapReduce, Chapter 4
* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]


== Week 13: Monday Dec. 3rd - Clustering ==
== Week 13: Monday Dec. 2nd - Frequent Itemsets ==
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf


* Lecture notes:
=== Additional Reading ===
** Graph algorithms: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-graph.pdf
* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf
**Clustering: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/clustering.pdf, http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/form-clustering-icde2007.pdf
* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf


=== Optional Quiz ===
'''Due Dec 9th'''


=== Readings ===
== Week 14: Monday Dec. 9th - EM and exam review ==
* Mining of Massive Datasets, Chapter 7
* See readings for previous class
* Web Mining, by Bing Liu. http://www.cs.uic.edu/~liub/WebMiningBook.html
* Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf


== Week 14: Monday Dec. 10th - EM algorithms for text processing ==
* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf


=== Readings ===


* Data-Intensive Text Processing with MapReduce, Chapter 6
Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)
 
 
== Week 15  Monday Dec. 17 -  Phase-II Project presentation  ==
 
 
== Further Readings ==
* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]
 
== Other topics ==
===Provenance ===
* [http://vgc.poly.edu/~juliana/pub/vistrails-reproducibility2012.pdf Making Computations and Publications Reproducible with VisTrails]
Juliana Freire and Claudio Silva. In Computing in Science and Engineering 14(4): 18-25, 2012.


* [http://vgc.poly.edu/~juliana/pub/freire-cise2008.pdf Provenance for Computational Tasks: A Survey]
== Week 15: Monday Dec. 16th - Final Exam ==
Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science & Engineering, 2008.

Latest revision as of 15:23, 16 December 2013

Fall 2013

The deadline for the Pagerank assignment has been extended. I have sent a notification to all students, but for some of you, the email bounced. Make sure your nyu.edu email is working.

This schedule is tentative and subject to change

Make sure to check my.poly.edu for course announcements

News

The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.

For frequently asked questions about the course and homework assignments, please check our BigDataAnalysisFAQ.

Week 1: Monday Sept. 9th - Course Overview

Required Reading

Additional References

Week 2: Monday Sept. 16th - Map-Reduce/Hadoop

Assignment

For more details see http://cis.poly.edu/policies.

  • Your assignment is due on Sun Sept 29th. Make sure you can log in and access my.poly.edu!
  • If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018

Required Reading

Week 3: Monday Sept. 23rd - Data Management for Big Data

Related Topics

Required Reading

Additional References

Week 4: Monday Sept 30th - Invited lecture by Dr. C. Mohan (IBM)

  • Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor
  • Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.
  • Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.
  • Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan

Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages

Required Reading

Additional References

Week 6: Mon Oct. 14th - Fall Break - No class

Week 6: Wed Oct. 16th - Fall Break - Make-up class


Week 7: Monday Oct. 21st - Invited Speaker: Alberto Lerner

  • Inside MongoDB

Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha

Required Reading


  • We will cover the material planned for "Week 10: Monday Nov. 11th": Finding Similar Items

Week 9: Monday Nov. 4th - Finding Similar Items, Information Integration

Required Reading

Homework Assignment

Due Nov 15th, 2013. Your assignment is in http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.

Week 10: Monday Nov. 11th - MapReduce Algorithm Design

Required Reading

  • Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer

Homework Assignment

Due Nov 15th, 2013. Your assignment is in http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.

Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing

Homework Assignment

Your MapReduce/Pig assignment is available from Blackboard. It is due December 1st.


Required Reading

Additional Reading

Week 12: Monday Nov. 25th - Large-Scale Visualization

  • Invited lectures by:
    • Dr. Lauro Lins (AT&T Research)
    • Dr. Huy Vo (NYU Center for Urban Science and Progress)


Required Reading

The Value of Visualization, Jarke Van Wijk http://www.win.tue.nl/~vanwijk/vov.pdf

Tamara Munzner's Book draft 2 available online http://www.cs.ubc.ca/~tmm/courses/533/book/

Nanocubes Paper http://nanocubes.net http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf

Additional Reading

imMens Paper (to contrast with nanocubes) http://vis.stanford.edu/papers/immens


Week 13: Monday Dec. 2nd - Frequent Itemsets

Additional Reading

Optional Quiz

Due Dec 9th

Week 14: Monday Dec. 9th - EM and exam review

Readings

Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)

Week 15: Monday Dec. 16th - Final Exam