Difference between revisions of "Course: Big Data Analysis"

Latest revision as of 15:23, 16 December 2013

Fall 2013

The deadline for the Pagerank assignment has been extended. I have sent a notification to all students, but for some of you, the email bounced. Make sure your nyu.edu email is working.

This schedule is tentative and subject to change

Make sure to check my.poly.edu for course announcements

News

Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu

Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC

The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.

Dr. C Mohan's presentation is now available at http://bit.ly/CMnMDS

For frequently asked questions about the course and homework assignments, please check our BigDataAnalysisFAQ.

Week 1: Monday Sept. 9th - Course Overview

Course overview and introduction to Big Data Analysis
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf
Student survey -- to be filled out today!

Required Reading

Additional References

Week 2: Monday Sept. 16th - Map-Reduce/Hadoop

Introduction to Map-Reduce and high-level data processing languages
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf
Hand out AWS tokens. Notes on using AWS.
Apache Hadoop
The Map-Reduce ecosystem: Pig, Hive, Mahout

Assignment

cs9223 Mapreduce Assignment
This is an individual assignment. You may not collude with any other individual, or plagiarise their work.

For more details see http://cis.poly.edu/policies.

Your assignment is due on Sun Sept 29th. Make sure you can log in and access my.poly.edu!
If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018

Required Reading

Week 3: Monday Sept. 23rd - Data Management for Big Data

Databases and Big Data: Persistence, Querying, Indexing, Transactions
Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf

Required Reading

Additional References

Week 4: Monday Sept 30th - Invited lecture by Dr. C. Mohan (IBM)

Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor

Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS

Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.

Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.

Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan

Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages

Pig Latin and Query Processing:
- Relational processing over MapReduce
- Queries over MapReduce
In-class assignment

Required Reading

Pig Latin: A Not-So-Foreign Language for Data Processing

Additional References

Week 6: Mon Oct. 14th - Fall Break - No class

Week 6: Wed Oct. 16th - Fall Break - Make-up class

Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf
Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf

Week 7: Monday Oct. 21st - Invited Speaker: Alberto Lerner

Inside MongoDB

Week 8: Monday Oct 28th - Statistics is easy - Invited Speaker: Dennis Shasha

Guest lecture by Dennis Shasha: Statistics is Easy
Introduction to Provenance

Required Reading

http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008

We will cover the material planned for "Week 10: Monday Nov. 11th": Finding Similar Items

Week 9: Monday Nov. 4th - Finding Similar Items, Information Integration

Similarity: Applications, Measures and Efficiency considerations
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
Similarity application: Information integration on the Web:
- Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
Homework presentation and demo

Required Reading

Mining of Massive Datasets, chapter 3; information integration; entity resolution

Homework Assignment

Due Nov 15th, 2013. Your assignment is available at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.

Week 10: Monday Nov. 11th - MapReduce Algorithm Design

Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf

Required Reading

Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer

Homework Assignment

Due Nov 15th, 2013. Your assignment is available at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.

Week 11: Monday Nov 18th - MapReduce Algorithm Design and Graph Processing

Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf

Homework Assignment

Your Mapreduce/Pig assignment is available from Blackboard. It is due December 1st.

Required Reading

Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)

Additional Reading

1998 PageRank Paper
Mining of Massive Datasets, Chapter 5 (Link Analysis)
Pregel: A System for Large-Scale Graph Processing. Google. http://kowshik.github.com/JPregel/pregel_paper.pdf

Week 12: Monday Nov. 25th - Large-Scale Visualization

Invited lectures by:
- Dr. Lauro Lins (AT&T Research)
- Dr. Huy Vo (NYU Center for Urban Science and Progress)

Lecture notes:
- https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf
- https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf

Required Reading

The Value of Visualization, Jarke Van Wijk http://www.win.tue.nl/~vanwijk/vov.pdf

Tamara Munzner's Book draft 2 available online http://www.cs.ubc.ca/~tmm/courses/533/book/

Nanocubes Paper http://nanocubes.net http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf

Additional Reading

imMens Paper (to contrast with nanocubes) http://vis.stanford.edu/papers/immens

Week 13: Monday Dec. 2nd - Frequent Itemsets

Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf

Additional Reading

Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf
Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf

Optional Quiz

Due Dec 9th

Week 14: Monday Dec. 9th - EM and exam review

Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf

Readings

Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)

@@ Line 1: / Line 1: @@
+== Fall 2013 ==
+'''''The deadline for the Pagerank assignment has been extended. I have sent a notification to all students, but for some of you, the email bounced. Make sure your nyu.edu email is working.'''''
+'''''This schedule is tentative and subject to change'''''
 '''''Make sure to check my.poly.edu for course announcements'''''
-== Week 1: Monday Sept. 10th - Course Overview ==
+== News ==
+* Assignment on Mapreduce and Pig, due on Dec 1st. Please see http://my.poly.edu
-* Course overview  (First day of classes!)
+* Nov 7th: New quizzes have been assigned. Please see http://www.newgradiance.com/services/servlet/COTC
-* Student survey
+The deadline is Nov 15th. Please make sure that you have your correct name and Poly ID in your Gradiance account.
-* Introduction to Big Data
-=== Readings ===
+* Dr. C Mohan's presentation is now available at http://bit.ly/CMnMDS
+For frequently asked questions about the course and homework assignments, please check our [[BigDataAnalysisFAQ]].
+== Week 1: Monday Sept. 9th - Course Overview ==
+* Course overview and introduction to Big Data Analysis
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/intro.pdf
+* [https://docs.google.com/spreadsheet/viewform?fromEmail=true&formkey=dFdHT3BST2l1TW9KeHYzYjBDaTU0V1E6MQ Student survey] -- to be filled out today!
+=== Required Reading ===
+* [http://i.stanford.edu/~ullman/mmds/book.pdf Mining of Massive Datasets, Chapter 1]
+* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter1]
+=== Additional References ===
 * [http://dilbert.com/strips/comic/2012-07-29/ Dilbert's BigData]
 * [http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?ref=stevelohr New York Time's "How BigData Became so Big"]
@@ Line 14: / Line 34: @@
 * [http://www.analytics-magazine.org/november-december-2010/54-the-analytics-journey.html The Analytics Journey]
 * [http://practicalanalytics.wordpress.com/2011/12/12/big-data-analytics-use-cases/ BigData Analytics Usecases]
-* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter1]
+== Week 2:   Monday Sept. 16th - Map-Reduce/Hadoop ==
+* Introduction to Map-Reduce and high-level data processing languages
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hadoop.pdf
+* Hand out AWS tokens. [http://www.vistrails.org/index.php/AWS_Setup Notes on using AWS].
+* Apache [http://hadoop.apache.org/ Hadoop]
+* The Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://mahout.apache.org/ Mahout]
+=== Assignment ===
+* [[cs9223 Mapreduce Assignment]]
+* This is an individual assignment. You may not collude with any other individual, or plagiarise their work.
+For more details see http://cis.poly.edu/policies.
+* Your assignment is ''due on Sun Sept 29th''. '''Make sure you can log in and access my.poly.edu!'''
+* If you have questions about the assignment, we will hold office hours on Sept 23, 2013 from 2:30-3:30pm at 2 Metrotech, room 10.018
+=== Required Reading ===
+* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]
+* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3]
+* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]
+== Week 3: Monday Sept. 23rd - Data Management for Big Data ==
+* Databases and Big Data: Persistence, Querying, Indexing, Transactions
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/paralleldb-vs-hadoop.pdf
+=== Related Topics ===
+* BigTables and NoSQL stores. Tuple store vs. column stores: [http://hbase.apache.org/ HBase], [http://www.mongodb.org/ MongoDB], [http://cassandra.apache.org/ Cassandra]
+* HBase book HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
+* HBase book. Chapter 8 Architecture for information about transactional processing, WriteAhead Log notably, and how consistency is being maintained.
+* Transactions in NoSQL stores. Google's percolator, [http://research.google.com/pubs/pub36726.html].
+* "NewSQL" stores: more on [http://hive.apache.org/ Hive], [http://voltdb.com/ VoltDB], [http://db.cs.yale.edu/hadoopdb/hadoopdb.html HadoopDB],
+* Beyond MapReduce: [http://spark-project.org/ Berkeley's Spark], [http://asterix.ics.uci.edu/ UC Irvine's Asterix], Google's [http://code.google.com/p/dremel/ Dremel]
+=== Required Reading ===
 * [http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext PDMBS vs. MapReduce]
+* http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
+* [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011]
 * [http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Benchmark DBMS vs MapReduce (2009)]
-== Week 2:   Monday Sept. 17th - Map-Reduce ==
+=== Additional References ===
+* http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_
+* [http://research.google.com/archive/bigtable.html Bigtable: A Distributed Storage System for Structured Data]
+* [http://cs-www.cs.yale.edu/homes/dna/papers/hadoopdb.pdf HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads]
+* [http://cs-www.cs.yale.edu/homes/dna/papers/hstore-cc.pdf Low Overhead Concurrency Control for Partitioned Main Memory Databases]
+* [http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.]
+* [http://research.google.com/pubs/pub36632.html Dremel: Interactive Analysis of Web-Scale Datasets]
+* [http://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]
+== Week 4:  Monday Sept 30th - ''Invited lecture by Dr. C. Mohan (IBM)'' ==
+* '''Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor'''
-* Introduction to map-reduce
+* Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS
-* Introduction to [http://hadoop.apache.org/ Hadoop]
-* Map-Reduce ecosystem: [http://pig.apache.org/ Pig], [http://hive.apache.org/ Hive], [http://code.google.com/p/jaql/ Jaql], [http://mahout.apache.org/ Mahout], BigInsights
+* Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.
+* Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.
+* Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan
+== Week 5: Monday Oct. 7th - Query Processing on Mapreduce and High-level Languages ==
+* Pig Latin and Query Processing:
+** [http://www.vistrails.org/images/1-RelationalOnMapReduce.pdf Relational processing over MapReduce]
+** [http://www.vistrails.org/images/2-PigOnMapReduce.pdf Queries over MapReduce]
+* In-class assignment
+=== Required Reading ===
-=== Readings ===
-* [http://research.google.com/archive/mapreduce.html original google map-reduce paper]
-* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Mining of Massive Datasets, Chapter 2]
-* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 2, Chapter 3]
 * [http://pages.cs.brandeis.edu/~olga/cs228/Reading%20List_files/piglatin.pdf Pig Latin: A Not-So-Foreign Language for Data Processing]
+=== Additional References ===
 * [http://www.mpi-inf.mpg.de/~rgemulla/publications/beyer11jaql.pdf Jaql: A Scripting Language for Large Scale Semistructured Data Analysis]
 * [http://www.vldb.org/pvldb/2/vldb09-938.pdf Hive - A Warehousing Solution Over a Map-Reduce Framework]
-== Week 3: Monday Sept. 24th - Statistics is easy ==
+== Week 6:  Mon Oct. 14th - Fall Break - No class ==
+== Week 6:  Wed Oct. 16th - Fall Break - Make-up class ==
+* Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf
+* Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf
-* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]
+== Week 7:  Monday Oct. 21st - Invited Speaker: Alberto Lerner ==
-* Statistics and Big Data
-=== Readings ===
+* Inside MongoDB
+== Week 8: Monday Oct 28th- Statistics is easy - Invited Speaker: Dennis Shasha ==
+* Guest lecture by [http://cs.nyu.edu/shasha/ Dennis Shasha]:  [http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/stateasy.pdf Statistics is Easy]
+* Introduction to Provenance
+=== Required Reading ===
 * http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
-* JF: add references for issues related to stats and big data
+* Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008
-== Week 4:  Monday Oct. 1st - Databases and Big Data ==
+* We will cover the material planned for "Week 10: Monday Nov. 11th": Finding Similar Items
-* Databases and Big Data
+== Week 9: Monday Nov. 4th  - Finding Similar Items, Information Integration ==
+* Similarity: Applications, Measures and Efficiency considerations
+** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/similarity.pdf
+* Similarity application: Information integration on the Web:
+** Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-info-integration.pdf
+* Homework presentation and demo
-=== Readings ===
+=== Required Reading ===
-*  JF: ADD: NoSQL databases (reading papers from literature)
+* [http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Mining of Massive Datasets, chapter 3; information integration; entity resolution]
-Column store vs. tuple store. HBase, MongoDB, VaultDB, Cassandra, HadoopDB (Facebook)
-Overview of different architectures, distributed databases vs. hadoop, transaction support...
-== Week 5: Monday Oct. 8st - Finding Similar Items ==
+=== Homework Assignment ===
-* Overview of information integration
+'''Due Nov 15th, 2013'''
+Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.
-=== Readings ===
+== Week 10: Monday Nov. 11th  - MapReduce Algorithm Design ==
-* Mining of Massive Datasets, chapter 3; information integration; entity resolution
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
-== Week 6:  Monday Oct. 15st - Graph Analysis ==
+=== Required Reading ===
-* Graph algorithms, link analysis, social networks
+* Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer
-=== Readings ===
+=== Homework Assignment ===
-* Mining of Massive Datasets, Chapter 5
+'''Due Nov 15th, 2013'''
-* Data-Intensive Text Processing with MapReduce, Chapter 5
+Your assignment is in  http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.
+== Week 11: Monday Nov 18th- MapReduce Algorithm Design and Graph Processing ==
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/mapreduce-indexing-graph.pdf
-== Week 7:  Monday Oct. 22st - Introduction to Visualization; Data stewardship and provenance ==
+=== Homework Assignment ===
-* Guest lecture by Claudio Silva and Lauro Lins
+Your Mapreduce/Pig assignment is available from Blackboard. '''It is due December 1st'''.
-=== Readings ===
-* Hellerstein (ask Claudio for additional references)
-* ADD: provenance and reproducibility
+=== Required Reading ===
+* [http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf Data-Intensive Text Processing with MapReduce, Chapter 4 (Inverted Indexing for Text Retrieval) and Chapter 5 (Graph Algorithms)]
-== Week 8: Monday Oct. 29th - TBD swap oct 15==
+=== Additional Reading ===
-* Reading: inverted index and crawling (Lin chapter 4)
+* [http://infolab.stanford.edu/pub/papers/google.pdf 1998 PageRank Paper]
-* Ask Torsten (tentative, ask him for reading material)
+* [http://infolab.stanford.edu/~ullman/mmds/ch5.pdf Mining of Massive Datasets, Chapter 5 (Link Analysis)]
+* Pregel: A System for Large-Scale Graph Processing. Google. [http://kowshik.github.com/JPregel/pregel_paper.pdf]
-=== Readings ===
+== Week 12: Monday Nov. 25th - Large-Scale Visualization ==
-* Data-Intensive Text Processing with MapReduce, Chapter 4
+* Invited lectures by:
+** Dr. Lauro Lins (AT&T Research)
+** Dr. Huy Vo (NYU Center for Urban Science and Progress)
-== Week 9: Monday Nov. 12th - Frequent Itemsets ==
+* Lecture notes:
+** https://www.dropbox.com/s/7t2vqryj5zgs44n/intro-to-visualization.pdf
+** https://www.dropbox.com/s/btb3ocupkmpgefi/nanocubes.pdf
-=== Reading ===
-* Mining of Massive Datasets, Chapter 6
+=== Required Reading ===
+The Value of Visualization, Jarke Van Wijk
+http://www.win.tue.nl/~vanwijk/vov.pdf
-== Week 10: Monday Nov. 5th - Mining Data Streams ===
+Tamara Munzner's Book draft 2 available online
+http://www.cs.ubc.ca/~tmm/courses/533/book/
-=== Readings ===
+Nanocubes Paper
-* Mining of Massive Datasets, Chapter 4
+http://nanocubes.net
+http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf
+=== Additional Reading ===
+imMens Paper (to contrast with nanocubes)
+http://vis.stanford.edu/papers/immens
-== Week 11: Monday Nov. 19th - Clustering ==
-=== Readings ===
+== Week 13: Monday Dec. 2nd - Frequent Itemsets ==
-* Mining of Massive Datasets, Chapter 7
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/association-rules.pdf
-== Week 12: Monday Nov. 26th - Recommendation Systems ==
+=== Additional Reading ===
+* Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf
+* Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
+* An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf
-=== Readings ===
+=== Optional Quiz ===
-* Mining of Massive Datasets, Chapter 9
+'''Due Dec 9th'''
-== Week 13  Monday Dec. 3rd -  EM algorithms for text processing==
+== Week 14: Monday Dec. 9th - EM and exam review ==
-* Data-Intensive Text Processing with MapReduce, Chapter 6
+* Lecture notes: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/hmm-em-mapreduce.pdf
-== Week 14: Monday Dec. 10th - Project presentation ==
+=== Readings ===
+Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)
-== Other Readings ==
+== Week 15  Monday Dec. 16th -  Final Exam ==
-* [http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php Anomaly Detection: A Survey]
