NYU HPC Access Instructions

== Accessing the NYU HPC Cluster ==

If you don't have an account, request one at https://wikis.nyu.edu/display/NYUHPC/Request+or+Renew

1. Log into the main HPC node:

      ssh <netid>@hpc.nyu.edu    

2. From the HPC node, log into the Hadoop cluster:

      ssh dumbo

You will be using a set of commands, and it will save you some time to first create aliases for them. Once on "dumbo", run the following commands on your terminal:

<code>
bash
alias hfs='/usr/bin/hadoop fs '
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
</code>


To be able to reuse these aliases every time you log in to dumbo, append the following lines to the end of your .bashrc file:
<code>
alias hfs='/usr/bin/hadoop fs '
export HAS=/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars
export HSJ=hadoop-streaming-2.6.0-cdh5.4.5.jar
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

%% Note: you should not have any spaces around "="!
</code>


If you have bash as your default shell, do

     source .bashrc

This will create the aliases. If you have tcsh as your default shell, just invoke bash -- it will automatically read the .bashrc file and create the aliases.


Here are some common commands:
<code>
hfs                        %% List the available commands.
hfs -help                  %% Show more details for each command.
hfs -ls [<path>]           %% List files.
hfs -cp <src> <dst>        %% Copy files within HDFS.
hfs -mkdir <path>          %% Create a directory.
hfs -rm <path>             %% Remove a file.
hfs -chmod <mode> <path>   %% Modify permissions.
hfs -chown <owner> <path>  %% Modify the owner.
</code>

Some remote access commands:
<code>
hfs -cat <src>                       %% Print the contents of a file to stdout.
hfs -copyFromLocal <localsrc> <dst>  %% Copy from the local filesystem to HDFS.
hfs -copyToLocal <src> <localdst>    %% Copy from HDFS to the local filesystem.
</code>


=== Using Hadoop Streaming ===

* Hadoop streaming allows the use of any program, written in any language, for MapReduce operations: the mapper and the reducer simply read from stdin and write to stdout (see the sketch below).
* You can use the "hjs" alias you created to run Hadoop Streaming.
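
For reference, here is a minimal, hypothetical sketch of what a streaming mapper and reducer can look like in Python (a simple word count). These are not the pmap.py/pred.py files used in the example below, whose contents may differ; they only illustrate the stdin/stdout contract that Hadoop Streaming relies on.
<code>
#!/usr/bin/env python
# wc_mapper.py (hypothetical): emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
</code>
<code>
#!/usr/bin/env python
# wc_reducer.py (hypothetical): sum the counts for each word.
# Hadoop sorts the mapper output by key, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
</code>
Scripts like these are passed to the hjs command via -mapper and -reducer, just as pmap.py and pred.py are in step 4) below.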

To run the example I provided, do the following:

1) Copy the directory containing the Python files and input data to dumbo. You will first need to "scp" from your machine to the HPC node, and then from the HPC node to dumbo. Assuming the directory on your machine is called /Users/julianafreire/MRExample

      scp -r /Users/julianafreire/MRExample  your_netid@hpc.nyu.edu: 

Then, from the hpc node:

      scp -r MRExample  dumbo:
* Remember to replace your_netid with your actual netid!

2) From dumbo, you will now copy the data file to HDFS

      hfs -copyFromLocal /home/your_netid/MRExample/wikipedia.txt wikipedia.txt

3) Check if the file is on HDFS

     hfs -ls

4) Now, to run the job, make sure you are in the right directory

    cd /home/your_netid/MRExample
    hjs -file pmap.py  -mapper pmap.py   -file pred.py -reducer pred.py   -input /user/your_netid/wikipedia.txt -output /user/your_netid/wikipedia.output

5) The outputs of this job are now in HDFS, in the directory /user/your_netid/wikipedia.output. To list the output files:

    hfs -ls /user/your_netid/wikipedia.output

You can also inspect the content of the files:

   hfs -cat wikipedia.output/*

If you'd like to copy the files over to your local directory:

   hfs -get /user/your_netid/wikipedia.output  output

This will copy the outputs to the local directory "output" on dumbo.
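
If you want to post-process the copied results on dumbo, a small, hypothetical Python sketch is shown below. It assumes the reducer emitted tab-separated "key<TAB>count" lines, as in the word-count sketch above; the actual output format of pred.py may differ.
<code>
# merge_output.py (hypothetical): merge all part-* files in the local "output"
# directory, assuming tab-separated "key<TAB>count" lines, and print the 10 largest counts.
import glob

totals = {}
for path in glob.glob("output/part-*"):
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            totals[key] = totals.get(key, 0) + int(value)

for key, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print("%s\t%d" % (key, value))
</code>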

=== Using Spark ===

* Spark allows you to write and run applications quickly in Java, Scala, Python, and R.
* You can use either the Spark interactive shell or the Spark submission tool (spark-submit).

To run the Spark interactive shell (Scala or Python):

1) Log in to dumbo

2) Execute one of the following:

       spark-shell (to run applications in Scala)

       pyspark (to run applications in Python)

If you want to access your files stored on HDFS, use the following URL as the filename in Spark: hdfs://babar.es.its.nyu.edu:8020/user/<your_net_id>/<your_files> (the hdfs:// URL must be exactly right; otherwise you won't be able to access the file from HDFS).
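
As a minimal, hypothetical sketch (assuming the wikipedia.txt file from the Hadoop Streaming section is already in your HDFS home directory), you could read it inside the pyspark shell, where the SparkContext sc is predefined, like this:
<code>
# Read a file from HDFS inside the pyspark shell (sc is already defined there).
# Replace your_netid with your actual netid; the path is only an example.
lines = sc.textFile("hdfs://babar.es.its.nyu.edu:8020/user/your_netid/wikipedia.txt")
print(lines.count())  # number of lines in the file
</code>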

To submit a job to Spark:

1) Log in to dumbo

2) Execute:

       spark-submit --num-executors <10-100> <your_python_script> <your_script_arguments>

The dumbo cluster has 100 executors, and you may choose any number of them for your submission. More executors generally means a faster job; however, if many people submit Spark jobs at the same time, performance will degrade.
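
For illustration only, here is a minimal, hypothetical PySpark script (the name word_count.py and its contents are assumptions, not files provided for the class) that could be submitted this way. It reproduces a simple word count over a file passed as the first argument; the official wordcount.py linked below is the canonical version of this example.
<code>
# word_count.py (hypothetical): count words in a text file and print the 10 most frequent.
# Example submission:
#   spark-submit --num-executors 10 word_count.py hdfs://babar.es.its.nyu.edu:8020/user/your_netid/wikipedia.txt
import sys
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")
counts = (sc.textFile(sys.argv[1])
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print("%s\t%d" % (word, count))
sc.stop()
</code>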

You can try some examples:
* Without streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
* With streaming: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py

Spark Streaming is not the same as Hadoop streaming: unlike Hadoop, Spark can run Python/R/Java/Scala scripts natively, so no streaming wrapper is needed for that. Spark Streaming instead refers to the processing of live data streams.
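
As a rough, hypothetical sketch of that difference (the HDFS directory stream_input is an assumption, not something set up on dumbo), a Spark Streaming job watches a source for new data and processes it in small batches:
<code>
# streaming_word_count.py (hypothetical): count words in new files that appear in
# an HDFS directory, processed in 10-second batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCountSketch")
ssc = StreamingContext(sc, 10)  # 10-second batch interval
lines = ssc.textFileStream("hdfs://babar.es.its.nyu.edu:8020/user/your_netid/stream_input")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print a sample of the counts for each batch
ssc.start()
ssc.awaitTermination()
</code>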

Some references:

1) Submitting applications to Spark: http://spark.apache.org/docs/latest/submitting-applications.html
2) Data transformations: http://spark.apache.org/docs/latest/programming-guide.html#transformations