Difference between revisions of "Hadoop Package"
Revision as of 15:30, 25 February 2014
This page describes how to use the hadoop package in VisTrails. This package works on Mac and Linux.
Installation
Mac
This binary version of vistrails has the hadoop package preinstalled:
http://vgc.poly.edu/files/tommy/vistrails-mac-10.6-master-2014-02-25.dmg
Linux
Install vistrails from source:
git clone http://vistrails.org/git/vistrails.git
git clone https://github.com/rexissimus/BatchQ-PBS -b remoteq
cp -r BatchQ-PBS/remoteq vistrails/
cd vistrails
python vistrails/run.py
The first time vistrails is started it will download and install all the dependencies.
Windows
The BatchQ library used by the RemoteQ package does not support Windows, but you should be able to run Linux in a virtual machine (http://www.virtualbox.org) and install vistrails there.
Modules used with the hadoop package
Dialogs.PasswordDialog
Used to specify the password for the remote machine.
Remote PBS.Machine
Represents a remote machine running SSH.
- server - the server url
- username - the remote server username, default is your local username
- password - your password, connect the PasswordDialog to here
- port - the remote ssh port, set to 0 if using an ssh tunnel
Connecting to the Poly cluster through vgchead
The hadoop job submitter runs on gray02.poly.edu. If you are outside the Poly network, you need to use an SSH tunnel to get through the firewall.
Add this to ~/.ssh/config:
Host vgctunnel
    HostName vgchead.poly.edu
    LocalForward 8101 gray02.poly.edu:22
Host gray02
    HostName localhost
    Port 8101
    ForwardX11 yes
Set up a tunnel to gray02 by running:
ssh vgctunnel
In vistrails, create a Machine module with server=gray02 and port=0. You now have a connection that can be used by the hadoop package.
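Before executing a workflow it can help to confirm that the tunnel is actually up. The following is a minimal sketch, not part of VisTrails or the hadoop package: the helper name tunnel_is_open is hypothetical, and the default port 8101 is the LocalForward port from the ~/.ssh/config entry above. It only checks that something accepts TCP connections on that port, not that the remote end is gray02.

```python
import socket


def tunnel_is_open(host="localhost", port=8101, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: port 8101 is the LocalForward port assumed
    from the ~/.ssh/config example; adjust it if yours differs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not open".
        return False


if __name__ == "__main__":
    if tunnel_is_open():
        print("tunnel open")
    else:
        print("tunnel closed; run: ssh vgctunnel")
```

If the check reports a closed tunnel, start it with ssh vgctunnel and keep that terminal open.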
HadoopStreaming
Runs a hadoop job on a remote cluster.
- CacheArchive - Jar files to upload
- CacheFiles - Other files to upload
- Combiner - combiner file to use after mapper. Can be same as reducer.
- Environment - Environment variables
- Identifier - A unique string identifying each new job. The job files on the server will be called ~/.vistrails-hadoop/.batchq.%Identifier%.*
- Input - The input file/directory to process
- Mapper - The mapper program (required)
- Output - The output directory name
- Reducer - The reducer program (optional)
- Workdir - The server workdir (Default is ~/.vistrails-hadoop)
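Hadoop streaming programs read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. As an illustration of what the Mapper, Reducer and Combiner ports expect, here is a minimal word-count sketch. It is not taken from the package's examples: the function names and the map/reduce command-line argument are assumptions made for this example. The string formatting is chosen to run under both Python 2 and Python 3.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


# Hypothetical command line: "python wordcount.py map" or "... reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    sys.stdout.writelines(out + "\n" for out in step(sys.stdin))
```

Because summing counts is associative, the same reduce step can also serve as the Combiner in this sketch, which is the case the Combiner port description mentions.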
HDFSEnsureNew
Deletes file/directory from remote HDFS storage
HDFSGet
Retrieve file/directory from remote HDFS storage. Used to get the results.
- Local File - Destination file/directory
- Remote Location - Source file/directory in HDFS storage
HDFSPut
Upload file/directory to remote HDFS storage. Used to upload mappers, reducers and data files.
- Local File - Source file/directory
- Remote Location - Destination file/directory in HDFS storage
PythonSourceToFile
PythonSource that is written to a file. Used to create mapper/reducer files.
URICreator
Creates links to locations in HDFS storage for input data and other files
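The uris values used on this page have the form location#name. In Hadoop's distributed cache convention, the part after # is the name of the link created in the task's working directory. The sketch below only illustrates how such a string splits into its two parts; it is not the URICreator implementation, the helper name split_cache_uri is hypothetical, and falling back to the file's base name when no # is present is an assumption of this example.

```python
def split_cache_uri(uri):
    """Split 'location#link_name' into (location, link_name).

    Hypothetical helper. If the URI has no '#fragment', the base name
    of the location is used as the link name (an assumption here).
    """
    location, _, link_name = uri.partition("#")
    if not link_name:
        link_name = location.rstrip("/").rsplit("/", 1)[-1]
    return location, link_name


if __name__ == "__main__":
    example = "hdfs:///user/tommy/wikitext-big-notitle.csv#wikitext-big-notitle.csv"
    print(split_cache_uri(example))
```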
Deleting a job
To make sure a job can be executed from the beginning:
- Clear the vistrails cache
- Delete the job in the job monitor by selecting it and pressing "Del"
Example
Let's use gray02.poly.edu to run a basic example with a mapper that returns info about the machine it was executed on.
You will need an account on vgchead.
In a terminal run:
ssh vgctunnel
Enter your password and keep the window open. Open vistrails-hadoop/example_nodeinfo.vt. It contains a working hadoop workflow.
Enter the machine info by going to Preferences->Module Packages, select RemoteQ and click "configure...". Enter this in the configuration:
server | gray02 |
username | <yourusername> |
port | 0 |
password | True |
defaultFS | hdfs://gray02.poly.edu:8020/user/<yourusername>/ |
uris | hdfs:///user/tommy/wikitext-big-notitle.csv#wikitext-big-notitle.csv |
Execute the workflow. The workflow will halt while waiting for the job to finish. Pressing cancel will detach the running job and add it to the Job Monitor. The status of the execution can be checked by right-clicking HadoopStreaming and (for hadoop) selecting "View Standard error". The job can be resumed by re-executing the workflow in vistrails. Once it completes, the spreadsheet will list info about the 20 lines processed by the mapper (usually the same for every line).
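The mapper source inside example_nodeinfo.vt is not reproduced on this page. As a rough sketch of what a mapper that reports the machine it ran on could look like, the following emits one record per input line containing the host name and platform. The function names and the exact output fields are assumptions for illustration, not the contents of the example workflow.

```python
import platform
import socket


def node_info():
    """Describe the machine this process runs on as 'hostname<TAB>os arch'."""
    return "%s\t%s %s" % (
        socket.gethostname(),
        platform.system(),
        platform.machine(),
    )


def map_lines(lines):
    """Emit one node-info record per input line, ignoring the line content."""
    info = node_info()
    for _ in lines:
        yield info


# To use this as a streaming mapper script, finish with:
#     import sys
#     for record in map_lines(sys.stdin):
#         print(record)
```

With such a mapper, lines processed on the same node produce identical records, so the output only varies when the input is split across different machines.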
Using Amazon AWS
First do the AWS_Setup.
AWS uses "*.pem" key files for access. Make sure you have one, then edit ~/.ssh/config and add
Host aws
    HostName ec2-54-201-233-14.us-west-2.compute.amazonaws.com
    IdentityFile ~/.ssh/<yourusername>.pem
after replacing the host name and path to your key file. Enter the machine info by going to Preferences->Module Packages, select RemoteQ, click "configure...", and enter this in the configuration:
server | aws |
username | hadoop |
port | 0 |
password | False |
defaultFS | s3n://<yourusername>/ |
uris | s3://cs9223/wikitext-big-notitle.csv#wikitext-big-notitle.csv |
Change defaultFS to point to your S3 bucket.
An example file is available at vistrails-hadoop/aws.vt. It contains a working hadoop workflow. Change the S3 bucket instances to point to your bucket and execute. When it finishes you should see the same result as in the Example above.