NYU HPC Access Instructions
Accessing the NYU HPC Cluster
1. Log into the main HPC node:
ssh <netid>@hpc.nyu.edu
2. From the HPC node, log into the Hadoop cluster:
ssh dumbo
You will be using a set of commands, and it will save you some time to first create aliases for them. Once on "dumbo", run the following commands on your terminal:
bash
alias hfs='/usr/bin/hadoop fs '
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
To reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:
alias hfs='/usr/bin/hadoop fs '
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
%% Note: you should not have any spaces around "="!
If you have bash as your default shell, do
source .bashrc
This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.
Here are some common commands:
hfs %% See available commands.
hfs -help %% more command details.
hfs -ls [<path>] %% List files
hfs -cp <src> <dst> %% Copy files from source to destination
hfs -mkdir <path> %% Create path
hfs -rm <path> %% remove a file
hfs -chmod <mode> <path> %% Modify permissions.
hfs -chown <owner> <path> %% Modify owner.
Some remote access commands:
hfs -cat <src> %% Cat contents to stdout.
hfs -copyFromLocal <localsrc> <dst> %% Copy a local file to HDFS
hfs -copyToLocal <src> <localdst> %% Copy a file from HDFS to the local file system
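If you would rather drive these commands from a script than type them, the following is a minimal sketch of a Python wrapper. The helper names hfs_argv and run_hfs are hypothetical; the only thing taken from these instructions is the /usr/bin/hadoop binary that the hfs alias points to.

```python
import subprocess

HADOOP_BIN = "/usr/bin/hadoop"  # same binary the hfs alias points to


def hfs_argv(*args):
    """Return the argument list equivalent to 'hfs <args>'."""
    return [HADOOP_BIN, "fs", *args]


def run_hfs(*args):
    """Run the command and return its standard output as text.

    Raises subprocess.CalledProcessError if the command fails.
    """
    result = subprocess.run(hfs_argv(*args), check=True,
                            capture_output=True, text=True)
    return result.stdout
```

For example, run_hfs("-ls") would behave like typing "hfs -ls" on dumbo. It only works on a machine where Hadoop is installed at that path.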
Using Hadoop Streaming
- Hadoop Streaming allows you to use a program written in any language for MapReduce operations.
- You can use the "hjs" alias you created to run Hadoop Streaming.
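To make the idea concrete, here is a minimal sketch of what a streaming mapper and reducer might do for a word count. It is not the content of pmap.py or pred.py, which is not shown in these instructions; the function names and the word-count logic are assumptions chosen for illustration. The sketch relies on the Hadoop Streaming convention that the mapper and reducer exchange tab-separated key/value lines and that the reducer receives its input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_pairs(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Assumes the input is already sorted by key, as Hadoop guarantees
    for reducer input.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate_streaming(lines):
    """Run map, sort, and reduce locally, mimicking the streaming pipeline."""
    return list(reduce_pairs(sorted(map_lines(lines))))
```

In a real streaming job the mapper and reducer would be separate scripts that read sys.stdin and print to standard output; simulate_streaming only imitates the sort step that Hadoop performs between them, which is handy for checking the logic on a small sample before submitting.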
To run the example I provided, do the following:
1) Copy the directory containing the Python files and input data to dumbo. You will first need to "scp" from your machine to the hpc node, and then from the hpc node to dumbo. Assuming the directory is called /Users/julianafreire/MRExample:
scp -r /Users/julianafreire/MRExample your_netid@hpc.nyu.edu:
Then, from the hpc node:
scp -r MRExample dumbo:
- Remember to replace your_netid with your actual netid!
2) From dumbo, you will now copy the data file to HDFS
hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt
3) Check if the file is on HDFS
hfs -ls
4) Now, to run the job, make sure you are in the right directory:
cd /home/your_netid/MRExample
hjs -file pmap.py -mapper pmap.py -file pred.py -reducer pred.py -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output
5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:
hfs -ls /user/your_netid/wikipedia.output
You can also inspect the content of the files:
hfs -cat wikipedia.output/*
If you'd like to copy the files over to your local directory:
hfs -get /user/your_netid/wikipedia.output output
This will copy the outputs to the local directory "output" on dumbo.
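Once the outputs are in a local directory, they can be loaded for further inspection. The sketch below assumes that the reducer wrote one "key<TAB>value" line per record, which is the usual Hadoop Streaming convention but is not confirmed by these instructions for pred.py. The function name read_streaming_output is hypothetical. Hadoop writes job results as files named part-00000, part-00001, and so on, so only those files are read.

```python
import glob
import os


def read_streaming_output(output_dir):
    """Parse 'key<TAB>value' lines from every part-* file in output_dir."""
    results = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                key, _, value = line.partition("\t")
                results[key] = value
    return results
```

Calling read_streaming_output("output") after the hfs -get step would return a dictionary mapping each key to its value as a string. Marker files such as _SUCCESS are ignored because they do not match the part-* pattern.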
Using Spark
- Spark allows you to write and run applications quickly in Java, Scala, Python and R.
- You can use either the Spark interactive shell or the Spark submission tool.
To run the Spark interactive shell (Scala or Python):
1) Log in to dumbo
2) Execute one of the following:
spark-shell (to run applications in Scala)
pyspark (to run applications in Python)
If you want to access your files stored on HDFS, use the following URL as the file name in Spark:
hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/<your_files>
The hdfs:// URL must be exactly right; otherwise you will not be able to access the file on HDFS.
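Because a single typo in the URL makes the file unreachable, it can help to build the URL in one place. This is a minimal sketch; the function name hdfs_url is hypothetical, and the host and port defaults are simply the ones given above.

```python
def hdfs_url(net_id, relative_path,
             namenode="babar.es.its.nyu.edu", port=8020):
    """Build the full hdfs:// URL for a file under /user/<net_id>/."""
    if not net_id or "/" in net_id:
        raise ValueError("net_id must be a non-empty name without slashes")
    return f"hdfs://{namenode}:{port}/user/{net_id}/{relative_path.lstrip('/')}"
```

In the pyspark shell, where the SparkContext is available as sc, the result could be passed to sc.textFile, for example sc.textFile(hdfs_url("your_netid", "wikipedia.txt")).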
To submit a job to Spark:
1) Log in to dumbo
2) Execute:
spark-submit --num-executors <10-100> <your_python_script> <your_script_arguments>
The dumbo cluster has 100 executors. Feel free to choose any number of executors for your submission. More executors generally make a job run faster. However, if many people submit Spark jobs at the same time, performance will degrade.
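If you submit jobs from a script, a small helper can assemble the command and catch an out-of-range executor count before anything is sent to the cluster. This is only a sketch: the function names are hypothetical, and the 10 to 100 range is taken from the command template above rather than from any limit enforced by spark-submit itself.

```python
import shlex


def spark_submit_argv(script, script_args=(), num_executors=10):
    """Return the argument list for a spark-submit call on dumbo."""
    if not 10 <= num_executors <= 100:
        raise ValueError("num_executors should be between 10 and 100")
    return ["spark-submit", "--num-executors", str(num_executors),
            script, *script_args]


def as_command_line(argv):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The list returned by spark_submit_argv can be handed to subprocess.run on dumbo, or printed with as_command_line to check the command before running it by hand.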
Spark word count examples:
Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py
Spark Streaming is not the same as Hadoop Streaming. Unlike Hadoop, Spark can run Python, R, Java and Scala programs natively. Spark Streaming instead refers to processing live data streams.
Some references:
1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html
2) Data transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations