Course: Massive Data Analysis 2014/Hadoop Exercise
Revision as of 20:10, 3 October 2014
Before you start
- You must have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
- Download the following package: http://vgc.poly.edu/~fchirigati/mda-class/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
- What to submit for these exercises:
  - Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
  - Results: put the results in your S3 bucket (don't forget to make it public)
- Complete this form to submit the links to your GitHub repository and S3 bucket. Deadline: 11:59 PM on the day of class (Oct 6, 2014)
Hands-on exercises
- Exercise 0: WordCount
  - Run the basic WordCount example on your local machine and on AWS
  - Follow the instructions (http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf) to create your Amazon Elastic MapReduce (EMR) cluster
  - Instructions to run WordCount on your local machine and EMR cluster will be given in class
  - Note: You don't have to submit code or results for this exercise.
- Exercise 1: Fixed-Length WordCount
  - For this exercise, count only words with exactly 5 characters
  - Output: Key is the word, and value is the number of times the word appears in the input.
- Exercise 2: InitialCount
  - Count the number of words based on their initial (first character), i.e., count the number of words per initial
  - Letter case should not be taken into account. For example, Apple and apple are both counted under initial A
  - Output: Key is the initial (A to Z in UPPERCASE), and value is the number of words having that initial (in either uppercase or lowercase).
- Exercise 3: Top-K WordCount
  - Output the top 100 most frequent 7-character words, in descending order of frequency
  - Output: Key is the word, and value is the number of times the word appears in the input.
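
To make the expected keys and values for exercises 1 to 3 concrete, here is a minimal sketch in plain Python that simulates the map, group and reduce steps in memory. It is not Hadoop code and not the reference solution. The names run_job, sum_reducer, fixed_length_mapper, initial_mapper and top_k are hypothetical helpers made up for illustration. It also assumes that a "word" is a maximal run of ASCII letters, that Exercise 3 does not fold case, and that ties in Top-K are broken alphabetically; the exercise text does not specify any of these, and the WordCount example in the package may tokenize differently.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a maximal run of ASCII letters. The exercise text
# does not define tokenization; the provided WordCount may split differently.
WORD_RE = re.compile(r"[A-Za-z]+")


def run_job(lines, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Visit keys in sorted order, loosely mirroring the shuffle/sort phase.
    return [(key, reducer(key, groups[key])) for key in sorted(groups)]


def sum_reducer(key, values):
    """Reduce step shared by all three exercises: add up the 1s."""
    return sum(values)


def fixed_length_mapper(length):
    """Build a mapper that emits (word, 1) only for words of the given length."""
    def mapper(line):
        for word in WORD_RE.findall(line):
            if len(word) == length:
                yield word, 1
    return mapper


def initial_mapper(line):
    """Emit (INITIAL, 1), folding the first letter to uppercase (Exercise 2)."""
    for word in WORD_RE.findall(line):
        yield word[0].upper(), 1


def top_k(counts, k):
    """Return the k (word, count) pairs with the highest counts, descending.

    Assumption: ties are broken alphabetically so the output is deterministic.
    """
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    sample = ["Apple apple mango", "Mango melon bananas apple", "Bananas bananas"]

    # Exercise 1: only 5-character words are counted.
    print(run_job(sample, fixed_length_mapper(5), sum_reducer))

    # Exercise 2: one key per uppercase initial.
    print(run_job(sample, initial_mapper, sum_reducer))

    # Exercise 3: count 7-character words, then keep the top k (100 in the exercise).
    print(top_k(run_job(sample, fixed_length_mapper(7), sum_reducer), 100))
```

In a real Hadoop job, the final sort for Exercise 3 would need its own step (for example a second job or a single reducer), because each reducer only sees part of the keys. The sketch sidesteps that by sorting the full result in memory.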