Difference between revisions of "Course: Advanced Databases"
Revision as of 15:53, 11 February 2014
NYU School of Engineering. CS6093: Spring 2014
Advanced Database Systems (CS6093). Syllabus for this semester: Syllabus (pdf)
This schedule is tentative and subject to change
Make sure to check my.poly.edu for course announcements
News
February 10th, 2014:
- Wiki is now up-to-date
- Added research papers for reading assignment
- Added slides for lectures 1 and 2
Reading Assignment
Here is the list of selected papers for the reading assignment:
- Nested loops revisited. D. J. DeWitt, J. F. Naughton, and J. Burger. 1993, January. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, (pp. 230-242).
- Exploiting Uniqueness in Query Optimization. G. N. Paulley and Per-Åke Larson. 1994. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 68-79.
- Accelerating XPath location steps. Torsten Grust. Proceedings of the 2002 ACM SIGMOD international conference on Management of data.
- AQuery: query language for ordered data, optimization techniques, and experiments. A. Lerner and D. Shasha. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 345–356, 2003.
- Algorithms for deferred view maintenance. Latha Colby, Timothy Griffin, Leonid Libkin, Inderpal Mumick and Howard Trickey. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 469-480.
- Optimizing Queries with Materialized Views. Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Data Engineering 11 (1995): 190.
- Translating web data. L. Popa, Y. Velegrakis, M. A. Hernández, R. J. Miller, and R. Fagin. In Proceedings of the 28th International Conference on Very Large Data Bases (pp. 598-609). VLDB Endowment. August 2002.
- Optimizing Queries across Diverse Data Sources. Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. Proceedings of the International Conference on Very Large Data Bases. Vol. 23. Morgan Kaufmann Pub, 1997.
- WebOQL: Restructuring documents, databases and webs. Gustavo O. Arocena, and Alberto O. Mendelzon. 14th International Conference on Data Engineering. IEEE, 1998.
- A Data Transformation System for Biological Data Sources. Peter Buneman, Susan B. Davidson, Kyle Hart, G. Christian Overton, and Limsoon Wong. 1995. In Proceedings of the 21th International Conference on Very Large Data Bases (VLDB '95)
- WebViews: accessing personalized web content and services. Juliana Freire, Bharat Kumar, and Daniel Lieuwen. Proceedings of the 10th international conference on World Wide Web. ACM, 2001.
- Using schema matching to simplify heterogeneous data translation. Tova Milo and Sagit Zohar. VLDB. Vol. 98. 1998.
Week 1: Tuesday February 4th - Course Overview
- Course overview and introduction
- Lecture notes: http://www.vistrails.org/images/ADB-Intro-Class1.pdf
- Student survey -- to be filled out today!
Textbooks
- Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke
- Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom
- Guido Moerkotte's free book on query compilation and optimization: Query Compilers
- Principles of Data Integration, by AnHai Doan, Alon Halevy, and Zachary Ives
Additional References
- Graefe, Goetz. "Query evaluation techniques for large databases." ACM Computing Surveys (CSUR) 25.2 (1993): 73-169. A classic database survey, and a must-read for anyone serious about data processing.
- Data Integration (Wikipedia)
- Enterprise Information Integration (Wikipedia)
Week 2: Tuesday February 11th - Query Compilation 1
- Query Compilation 1. Indexing and Storage
- Lecture notes:
Required Reading
- Mining of Massive Datasets, Chapter 2
- Data-Intensive Text Processing with MapReduce, Chapter 2 and Chapter 3
- Original Google MapReduce paper
Week 3: Tuesday February 18th - Query Compilation 2
- Query Compilation and Rewriting
Related Topics
- Bigtable and NoSQL stores. Tuple stores vs. column stores: HBase, MongoDB, Cassandra
- HBase book: HBase: The Definitive Guide. Random Access to Your Planet-Size Data: http://shop.oreilly.com/product/0636920014348.do
- HBase book, Chapter 8 (Architecture): information about transactional processing, notably the write-ahead log, and how consistency is maintained.
- Transactions in NoSQL stores. Google's Percolator [1].
- "NewSQL" stores: more on Hive, VoltDB, HadoopDB
- Beyond MapReduce: Berkeley's Spark, UC Irvine's Asterix, Google's Dremel
Required Reading
- PDBMS vs. MapReduce
- http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
- Parallel data processing with MapReduce: a survey. Lee et al, SIGMOD Record 2011
- Benchmark DBMS vs MapReduce (2009)
Additional References
- http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_
- Bigtable: A Distributed Storage System for Structured Data
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- Low Overhead Concurrency Control for Partitioned Main Memory Databases
- ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models.
- Dremel: Interactive Analysis of Web-Scale Datasets
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
Week 4: Monday Sept 30th - Invited lecture by Dr. C. Mohan (IBM)
- Note that we will meet at a different location: NYU CUSP, 1 Metrotech Center, 19th floor
- Tutorial: An In-Depth Look at Modern Database Systems: http://bit.ly/CMnMDS
- Abstract: This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source and commercial systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented. An outline of problems for which no adequate solutions exist will be included. Such problems could be fertile grounds for new research work.
- Presenter: Dr. C. Mohan, IBM Fellow, IBM Almaden Research Center, San Jose, CA 95120, USA.
- Bio: Dr. C. Mohan has been an IBM researcher for 31 years in the information management area, impacting numerous IBM and non-IBM products, the research community and standards, especially with his invention of the ARIES family of locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM, ACM and IEEE Fellow has also served as the IBM India Chief Scientist. In addition to receiving the ACM SIGMOD Innovation Award, the VLDB 10 Year Best Paper Award and numerous IBM awards, he has been elected to the US and Indian National Academies of Engineering, and has been named an IBM Master Inventor. This distinguished alumnus of IIT Madras received his PhD at the University of Texas at Austin. He is an inventor of 38 patents. He serves on the advisory board of IEEE Spectrum and on the IBM Software Group Architecture Board’s Council. More information can be found in his home page at http://bit.ly/CMohan
Week 5: Monday Oct. 7th - Query Processing on MapReduce and High-level Languages
- Pig Latin and Query Processing:
- In-class assignment
Required Reading
Additional References
- Jaql: A Scripting Language for Large Scale Semistructured Data Analysis
- Hive - A Warehousing Solution Over a Map-Reduce Framework
Week 6: Mon Oct. 14th - Fall Break - No class
Week 6: Wed Oct. 16th - Fall Break - Make-up class
- Reproducibility and Data Exploration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/reproducibility.pdf
- Large-scale information integration: http://vgc.poly.edu/~juliana/courses/cs9223/Lectures/web-information-integration.pdf
Week 7: Monday Oct. 21st - Invited Speaker: Alberto Lerner
- Inside MongoDB
Week 8: Monday Oct. 28th - Statistics is Easy - Invited Speaker: Dennis Shasha
- Guest lecture by Dennis Shasha: Statistics is Easy
- Introduction to Provenance
Required Reading
- http://www.morganclaypool.com/doi/abs/10.2200/S00142ED1V01Y200807MAS001 -- book is available for free for NYU students
- Second edition of the book: http://www.morganclaypool.com/doi/pdf/10.2200/S00295ED1V01Y201009MAS008
- We will cover the material planned for "Week 10: Monday Nov. 11th": Finding Similar Items
Week 9: Monday Nov. 4th - Finding Similar Items, Information Integration
- Similarity: Applications, Measures and Efficiency considerations
- Similarity application: Information integration on the Web:
- Homework presentation and demo
Required Reading
Homework Assignment
Due Nov 15th, 2013. Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.
Week 10: Monday Nov. 11th - MapReduce Algorithm Design
Required Reading
- Chapters 3 and 4 in textbook: Data-Intensive Text Processing with MapReduce, by Lin and Dyer
Homework Assignment
Due Nov 15th, 2013. Your assignment is at http://www.newgradiance.com/services. Please see http://vgc.poly.edu/~juliana/courses/cs9223 for instructions on how to access this service.
Week 11: Monday Nov. 18th - MapReduce Algorithm Design and Graph Processing
Homework Assignment
Your MapReduce/Pig assignment is available on Blackboard. It is due December 1st.
Required Reading
Additional Reading
- 1998 PageRank Paper
- Mining of Massive Datasets, Chapter 5 (Link Analysis)
- Pregel: A System for Large-Scale Graph Processing. Google. [2]
Week 12: Monday Nov. 25th - Large-Scale Visualization
- Invited lectures by:
- Dr. Lauro Lins (AT&T Research)
- Dr. Huy Vo (NYU Center for Urban Science and Progress)
- Lecture notes:
Required Reading
- The Value of Visualization, Jarke van Wijk: http://www.win.tue.nl/~vanwijk/vov.pdf
- Tamara Munzner's book, draft 2, available online: http://www.cs.ubc.ca/~tmm/courses/533/book/
- Nanocubes paper: http://nanocubes.net and http://nanocubes.net/assets/pdf/nanocubes_paper_preprint.pdf
Additional Reading
- imMens paper (to contrast with nanocubes): http://vis.stanford.edu/papers/immens
Week 13: Monday Dec. 2nd - Frequent Itemsets
Additional Reading
- Mining association rules between sets of items in large databases. Agrawal et al., SIGMOD 1993. http://www.almaden.ibm.com/cs/quest/papers/sigmod93.pdf
- Fast algorithms for mining association rules. Agrawal and Srikant, VLDB 1994. https://www.seas.upenn.edu/~jstoy/cis650/papers/Apriori.pdf
- An effective hash-based algorithm for mining association rules. Park et al., SIGMOD 1995. http://www.dmi.unict.it/~apulvirenti/agd/PCY95.pdf
Optional Quiz
Due Dec 9th
Week 14: Monday Dec. 9th - EM and exam review
Readings
- Data-Intensive Text Processing with MapReduce, Chapter 6 (EM Algorithms for Text Processing)