Course: Massive Data Analysis 2014/Hadoop Exercise

From VistrailsWiki
 
Latest revision as of 20:46, 8 October 2014

== Before you start ==
* You '''must''' have Hadoop installed and working on your local machine. You also need to set up your Amazon AWS account. Refer to the instructions on the course page.
* Download the following package: http://bigdata.poly.edu/~tuananh/files/hadoop-exercise.zip. This package contains the basic WordCount example to help you get started.
* What to submit for these exercises:
** Code: place your code for exercises 1, 2 and 3 in a public GitHub repository
** Results: put the results in your S3 bucket (don't forget to make it public) [http://bigdata.poly.edu/~tuananh/files/S3MakePublicInstruction.pdf instructions]
** Complete this [http://bit.ly/1vAxovu form] to submit the links to your GitHub repository and S3 bucket. '''Deadline: 11:59 PM on Oct 8, 2014'''
* Office hours: Oct 7 (Tue) from 3pm to 5pm, at 2 MetroTech Center, 10th floor, room 10.053A

== Hands-on exercises ==
* '''Note''': Input for the exercises: s3://mda2014/input/wikipedia.txt
* Exercise 0: WordCount
** Run the basic WordCount example on your local machine and on AWS
** Follow the [http://vgc.poly.edu/~fchirigati/mda-class/RunHadoopAWS.pdf instructions] to create your Amazon Elastic MapReduce (EMR) cluster
** Instructions for running WordCount on your local machine and on the EMR cluster will be given in class
** '''Note: You don't have to submit code or results for this exercise.'''
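The map/reduce pattern behind WordCount underlies all of the exercises below. The course package implements it in Java; the sketch here is an illustrative Python version of the same logic (all function names are mine, not from the package), with the map step emitting a (word, 1) pair per token and the reduce step summing counts per word:

```python
from collections import defaultdict

def wordcount_map(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def wordcount_reduce(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Simulate the job on two lines of input: map each line, then reduce.
pairs = []
for line in ["big data big", "data analysis"]:
    pairs.extend(wordcount_map(line))
print(wordcount_reduce(pairs))  # {'big': 2, 'data': 2, 'analysis': 1}
```

In a real Hadoop job the framework shuffles the (word, 1) pairs so that each reducer sees all pairs for a given key; the loop above only simulates that grouping in memory.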
* Exercise 1: Fixed-Length WordCount
** For this exercise, count only the words that are exactly '''5 characters''' long
** Output: Key is the word, and value is the number of times the word appears in the input.
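One way to adapt WordCount for this exercise is to filter tokens by length in the map step, so only 5-character words are ever emitted; the reducer stays the standard WordCount sum. A hedged Python sketch (names are illustrative, not from the course package):

```python
from collections import Counter

def fixed_length_map(line, length=5):
    """Emit a (word, 1) pair only for tokens of exactly `length` characters."""
    return [(w, 1) for w in line.split() if len(w) == length]

# Counter stands in for the standard summing reducer here.
pairs = fixed_length_map("three go maple words")
print(Counter(w for w, _ in pairs))
```

Filtering in the mapper is preferable to filtering in the reducer: it shrinks the intermediate data before the shuffle instead of after.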
* Exercise 2: InitialCount
** Count the number of words based on their initial (first character), i.e., count the number of words per initial
** Letter case should not be taken into account. For example, '''Apple''' and '''apple''' are both counted under the initial '''A'''
** Output: Key is the initial (A to Z, in uppercase), and value is the number of words having that initial (in either uppercase or lowercase).
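Here the key emitted by the map step is the word's first character, uppercased so that '''Apple''' and '''apple''' collapse onto the same key. A Python sketch under the assumption that non-alphabetic tokens are simply skipped (the exercise only asks for keys A to Z; names are mine, not from the package):

```python
from collections import defaultdict

def initial_map(line):
    """Emit (INITIAL, 1) per word, uppercasing the first character.
    Tokens whose first character is not a letter are skipped (assumption)."""
    return [(w[0].upper(), 1) for w in line.split() if w[:1].isalpha()]

def initial_reduce(pairs):
    """Sum the counts for each initial."""
    counts = defaultdict(int)
    for initial, n in pairs:
        counts[initial] += n
    return dict(counts)

print(initial_reduce(initial_map("Apple apple banana")))  # {'A': 2, 'B': 1}
```

Note that normalizing the case in the mapper (not the reducer) is what makes Hadoop group both spellings under one key during the shuffle.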
* Exercise 3: Top-K WordCount
** Output the top '''100''' most frequent '''7-character''' words, in descending order of frequency
** Output: Key is the word, and value is the number of times the word appears in the input.
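A common pattern for Top-K on top of WordCount is count first, then sort by frequency and keep the first K. A compact Python sketch of that logic (function name and sample data are mine; K and the word length follow the exercise):

```python
from collections import Counter

def top_k_words(lines, k=100, length=7):
    """Count words of exactly `length` characters, then return the k most
    frequent as (word, count) pairs in descending order of frequency."""
    counts = Counter(w for line in lines
                       for w in line.split() if len(w) == length)
    return counts.most_common(k)

lines = ["hadoops hadoops mapping", "hadoops mapping reduces"]
print(top_k_words(lines, k=2))  # [('hadoops', 3), ('mapping', 2)]
```

In an actual Hadoop job this is usually done in two stages, since a single global sort is expensive: each reducer keeps only its local top K, and one final reducer merges those small lists into the global top 100.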