Difference between revisions of "Course: Massive Data Analysis 2014/Hadoop Exercise"

Revision as of 20:10, 3 October 2014

You must have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
What to submit for these exercises:
- Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
- Results: put the results in your S3 bucket (don't forget to make it public)
- Complete this form to submit the links to your GitHub repository and S3 bucket. Deadline: 11:59 PM on the day of class (Oct 6, 2014)

Exercise 0: WordCount
- Run the basic WordCount example on your local machine and AWS
- Follow the instructions to create your Amazon Elastic MapReduce (EMR) cluster
- Instructions to run WordCount on your local machine and EMR cluster will be given in class
- Note: You don't have to submit code and results for this exercise.

Exercise 1: Fixed-Length WordCount
- For this exercise, count only words that are exactly 5 characters long
- Output: Key is the word, and value is the number of times the word appears in the input.
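
The filtering logic for Exercise 1 can be sketched in plain Python, outside Hadoop. This is only an illustration of the map and reduce steps, not the Java code from the WordCount package: the function names are hypothetical, and splitting words on whitespace is an assumption (the exercise does not say how punctuation should be handled).

```python
from collections import defaultdict


def map_fixed_length(line, length=5):
    """Map step: emit (word, 1) for each token with exactly `length` characters.

    Assumption: words are whitespace-separated tokens, punctuation included.
    """
    for word in line.split():
        if len(word) == length:
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def fixed_length_word_count(lines, length=5):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_fixed_length(line, length))
    return reduce_counts(pairs)
```

In a Hadoop job, the length check would likely sit in the mapper, so that words of other lengths are never sent to the reducers.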

Exercise 2: InitialCount
- Count the number of words based on their initial (first character), i.e., count the number of words per initial
- Letter case should not be taken into account. For example, Apple and apple will both be counted for initial A
- Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
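
As a rough sketch of the Exercise 2 logic in plain Python (not Hadoop code), the map step can emit the uppercased first character as the key. The function names are hypothetical. Two assumptions are made here that the exercise does not state: words are whitespace-separated tokens, and tokens whose first character is not an ASCII letter (for example digits or quotes) are skipped, since the output keys are limited to A through Z.

```python
import string
from collections import defaultdict


def map_initial(line):
    """Map step: emit (INITIAL, 1) for each token that starts with a letter.

    Assumption: tokens starting with a non-letter are ignored.
    """
    for word in line.split():
        first = word[0]
        if first in string.ascii_letters:
            yield first.upper(), 1


def initial_count(lines):
    """Reduce step: sum the emitted ones per uppercase initial."""
    totals = defaultdict(int)
    for line in lines:
        for initial, one in map_initial(line):
            totals[initial] += one
    return dict(totals)
```

Normalizing the case in the map step means Apple and apple produce the same key, so the reduce step stays a plain sum.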

Exercise 3: Top-K WordCount
- Output the top 100 most frequent 7-character words, in descending order of frequency
- Output: Key is the word, and value is the number of times the word appears in the input.
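
The selection step for Exercise 3 might look like the following plain-Python sketch. It is not a Hadoop implementation, where the ranking would typically need a single reducer or a second job; the function name is hypothetical. Whitespace tokenization and the alphabetical tie-break between words with equal counts are assumptions, since the exercise does not specify either.

```python
from collections import Counter


def top_k_words(lines, k=100, length=7):
    """Return up to k (word, count) pairs for words of the given length,
    ordered by descending count.

    Assumption: ties are broken alphabetically to keep the output deterministic.
    """
    counts = Counter(
        word
        for line in lines
        for word in line.split()
        if len(word) == length
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

For example, `top_k_words(lines)` uses the values from the exercise (100 words, 7 characters), while smaller values of `k` are convenient for testing on a short input.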

@@ Line 10: / Line 10: @@
 * Exercise 0: WordCount
 ** Run the basic WordCount example on your local machine and AWS
-** Follow the instruction here to create your Amazon Elastic MapReduce (EMR): http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf
+** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster
 ** Instructions to run WordCount on your local machine and EMR cluster will be given in class
 ** '''Note: You don't have to submit code and results for this exercise.'''
@@ Line 19: / Line 19: @@
 * Exercise 2: InitialCount
-** Count the number of words based on theirs initial (first character), i.e., count the number of words per initial
+** Count the number of words based on their initial (first character), i.e., count the number of words per initial
 ** The letter case should not be taken into account. For example, '''Apple''' and '''apple''' will be both counted for initial '''A'''
 ** Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
 * Exercise 3: Top-K WordCount
-** Output the top 100 most frequent 7-character words, in descending order of frequency
+** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency
 ** Output: Key is the word, and value is the number of times the word appears in the input.