Assignment 3 - FAQ

From VistrailsWiki

Frequently Asked Questions

Hadoop Streaming

How do I specify which subset of a key is used by the partitioner?

  • Hadoop Streaming provides an option for you to modify the partitioning strategy
  • Here's an example:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.16.jar \
    -D mapred.reduce.tasks=2 \
    -D stream.num.map.output.key.fields=2 \
    -D num.key.fields.for.partition=2 \
    -file wordMatrix_mapperPairs.py -mapper wordMatrix_mapperPairs.py \
    -file wordMatrix_reducerPairs.py -reducer wordMatrix_reducerPairs.py \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /user/juliana/input -output /user/juliana/output2

    • stream.num.map.output.key.fields=2 informs Hadoop that the first 2 fields of the mapper output form the key -- in this case (word1,word2), and the third field corresponds to the value.
    • num.key.fields.for.partition=2 specifies that both fields are to be used by the partitioner.
    • Note that we also need to specify the partitioner: -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
  • Here's another example, in which the key still has two fields but only the first one is used for partitioning:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.16.jar \
    -D mapred.reduce.tasks=2 \
    -D stream.num.map.output.key.fields=2 \
    -D num.key.fields.for.partition=1 \
    -file wordMatrix_mapperPairs.py -mapper wordMatrix_mapperPairs.py \
    -file wordMatrix_reducerPairs.py -reducer wordMatrix_reducerPairs.py \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /user/juliana/input -output /user/juliana/output2

    • num.key.fields.for.partition=1 specifies that only the first field (word1) is to be used by the partitioner, so all pairs that share the same word1 go to the same reducer.
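
To see what the two settings above change, here is a minimal Python sketch of field-based partitioning. It is an illustration only, not Hadoop's implementation: the function name partition, the sample lines, and the use of zlib.crc32 as the hash are all assumptions made for this example (KeyFieldBasedPartitioner uses its own hash function).

```python
import zlib

def partition(line, num_key_fields, num_partition_fields, num_reducers, sep="\t"):
    """Return the reducer index that one mapper output line would be sent to."""
    fields = line.rstrip("\n").split(sep)
    # Plays the role of stream.num.map.output.key.fields: the leading fields form the key.
    key_fields = fields[:num_key_fields]
    # Plays the role of num.key.fields.for.partition: only these fields are hashed.
    partition_fields = key_fields[:num_partition_fields]
    # ASSUMPTION: crc32 stands in for Hadoop's real hash function.
    digest = zlib.crc32(sep.join(partition_fields).encode("utf-8"))
    return digest % num_reducers

# Hypothetical mapper output: word1 <tab> word2 <tab> count
lines = ["big\tdata\t1", "big\tcluster\t1", "small\tdata\t1"]

for n in (2, 1):
    print("fields used for partitioning:", n,
          "-> reducers:", [partition(l, 2, n, 2) for l in lines])
```

With one partition field, the two lines starting with "big" always receive the same reducer index. With two partition fields they may or may not, because the hash then depends on word2 as well.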

How do I specify which subset of a key is used by the partitioner on AWS?

  • You can do this when you configure a step, by adding the appropriate directives and parameters in the Arguments box:

[Screenshot: Emr-partitioner.png]

How do I run a mapreduce job on a single-node installation?

  • I have created an example and detailed instructions on how to run a mapreduce job on a single-node installation.

The instructions are in: http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/readme.txt

  • And you can download all the necessary files from

http://vgc.poly.edu/~juliana/courses/BigData2014/Lectures/MapReduceExample/mr-example.tgz

How do I run a mapreduce job on the NYU Hadoop Cluster?

  • Note: the first version of this file had an extraneous space in alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'. There must be no spaces around "=".


How do I test mapreduce code on my local machine?

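  • You can use Unix pipes, e.g., cat samplefilename | ./mapper.py | sort | ./reducer.py
  • The sort step stands in for Hadoop's shuffle phase: the reducer expects all lines with the same key to arrive together. The scripts must be executable; otherwise run them as python mapper.py and python reducer.py.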

See details in http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
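
The same pipeline can also be simulated inside a single Python process, which is handy for quick checks. The sketch below assumes a word-count job. The mapper, reducer, and run_local functions are hypothetical stand-ins for your own mapper.py and reducer.py, not the assignment's code.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_local(lines):
    """Equivalent of: cat file | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))

sample = ["big data", "big cluster"]
for out in run_local(sample):
    print(out)
```

Dropping the sorted() call shows why the sort step matters: groupby only merges adjacent lines, so "big" would be reported twice with a count of 1 each.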

When I try to run the actual MapReduce job on the NYU cluster I get an error: Exception in thread "main" java.io.IOException: Error opening job jar: /

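  • The error "Error opening job jar: /" indicates that hadoop is looking for the jar file in the root directory "/". This is what the alias hjs='/usr/bin/hadoop jar $HAS/$HSJ' produces when the variables HAS and HSJ are empty in your shell: $HAS/$HSJ then expands to just "/".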
  • Try issuing the *full* command without using the alias:

$ /usr/bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.16.jar \
    -file pmap.py -mapper pmap.py \
    -file pred.py -reducer pred.py \
    -input /user/juliana/wikipedia.txt -output /user/juliana/wikipedia.output
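
To illustrate how a bare "/" can end up as the jar path, the Python sketch below mimics the shell's variable expansion for the alias body quoted earlier in this FAQ. The helper shell_expand is hypothetical, and the values given for HAS and HSJ are assumptions inferred from the full command above, not taken from the course setup file.

```python
import re

def shell_expand(text, env):
    """Replace $NAME with env[NAME]; unset names become empty, as in the shell."""
    return re.sub(r"\$([A-Za-z_][A-Za-z0-9_]*)",
                  lambda m: env.get(m.group(1), ""),
                  text)

alias_body = "/usr/bin/hadoop jar $HAS/$HSJ"

# ASSUMED values, inferred from the full command shown above.
env = {
    "HAS": "/usr/lib/hadoop/contrib/streaming",
    "HSJ": "hadoop-streaming-1.0.3.16.jar",
}

print(shell_expand(alias_body, {}))   # variables unset: the jar path collapses to "/"
print(shell_expand(alias_body, env))  # variables set: the full jar path
```

If the first line matches what you see, check that the variables are defined in the shell where you run the alias, for example with echo $HAS $HSJ.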