Hadoop Package

Revision as of 17:47, 10 January 2014

This page describes how to use the hadoop package in VisTrails. The package works on Mac and Linux.

Installation

Install VisTrails

Get VisTrails using git and check out a version that supports the hadoop package:

git clone http://vistrails.org/git/vistrails.git
cd vistrails
git checkout 976255974f2b206f030b2436a5f10286844645b0
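
To confirm that the working tree is really on the known working commit, the checked-out revision can be compared with the hash above. This is a minimal sketch; the function name verify_checkout is hypothetical and not part of VisTrails.

```shell
# Sketch: succeed only if the repository in $1 has commit $2 checked out.
# verify_checkout is a hypothetical helper name, not part of VisTrails.
verify_checkout() {
    repo="$1"       # path to the vistrails checkout
    expected="$2"   # full commit hash expected at HEAD
    actual="$(git -C "$repo" rev-parse HEAD)" || return 1
    [ "$actual" = "$expected" ]
}
```

For example, `verify_checkout vistrails 976255974f2b206f030b2436a5f10286844645b0 && echo "on the expected commit"` run from the directory that contains the clone.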

If you are using a binary distribution of VisTrails, replace the vistrails folder inside it with this checkout.

Install BatchQ-PBS and the RemotePBS package

This Python package is used for communication over SSH. Get it with:

git clone https://github.com/rexissimus/BatchQ-PBS

Copy BatchQ-PBS/batchq to the site-packages folder of the Python installation that VisTrails uses.

Copy BatchQ-PBS/batchq/contrib/vistrails/RemotePBS to ~/.vistrails/userpackages/
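
The two copy steps can be sketched as a small shell function. The function name install_batchq and its argument order are assumptions made for illustration. The location of site-packages depends on how VisTrails was installed, so it is passed in rather than guessed.

```shell
# Sketch: copy batchq and the RemotePBS package into place.
# install_batchq is a hypothetical helper name; all three paths are
# supplied by the caller because they differ between installations.
install_batchq() {
    src="$1"            # path to the BatchQ-PBS checkout
    site_packages="$2"  # site-packages folder of the Python used by VisTrails
    userpackages="$3"   # normally ~/.vistrails/userpackages

    mkdir -p "$site_packages" "$userpackages" || return 1
    # The batchq Python package goes into site-packages.
    cp -R "$src/batchq" "$site_packages/" || return 1
    # The RemotePBS VisTrails package goes into userpackages.
    cp -R "$src/batchq/contrib/vistrails/RemotePBS" "$userpackages/"
}
```

A call might look like `install_batchq ./BatchQ-PBS /path/to/site-packages ~/.vistrails/userpackages`, where /path/to/site-packages is a placeholder for your own installation.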

Install the hadoop package

git clone git://vgc.poly.edu:src/vistrails-hadoop.git ~/.vistrails/userpackages/hadoop

Modules used by the hadoop package

Dialogs/PasswordDialog

Used to enter the password for the remote machine.

Remote PBS/Machine

Represents a remote machine running SSH.

  • server - the server URL
  • username - the username on the remote server; defaults to your local username
  • password - your password; connect the PasswordDialog module here
  • port - the remote SSH port; set to 0 to use the default port

Example connecting to the Poly cluster through vgchead

The hadoop job submitter runs on gray02.poly.edu. If you are outside the Poly network, you need an SSH tunnel to get through the firewall.

Add this to ~/.ssh/config:

Host vgctunnel
    HostName vgchead.poly.edu
    LocalForward 8101 gray02.poly.edu:22

Host gray02
    HostName localhost
    Port 8101
    ForwardX11 yes
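
If you would rather script this step than edit the file by hand, the entries can be appended only when they are missing. This is a minimal sketch; the function name add_tunnel_config is hypothetical, and it assumes that an existing "Host vgctunnel" line means both entries are already present.

```shell
# Sketch: append the two Host entries to an ssh config file once.
# add_tunnel_config is a hypothetical helper name.
add_tunnel_config() {
    config="${1:-$HOME/.ssh/config}"
    # Assumption: if the vgctunnel entry exists, the file is already set up.
    if [ -f "$config" ] && grep -q '^Host vgctunnel$' "$config"; then
        return 0
    fi
    mkdir -p "$(dirname "$config")" || return 1
    cat >> "$config" <<'EOF'

Host vgctunnel
    HostName vgchead.poly.edu
    LocalForward 8101 gray02.poly.edu:22

Host gray02
    HostName localhost
    Port 8101
    ForwardX11 yes
EOF
}
```

Calling `add_tunnel_config` with no argument targets ~/.ssh/config. Running it a second time leaves the file unchanged.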

Set up a tunnel to gray02 by running:

ssh vgctunnel

In VisTrails, create a Machine module with server set to gray02 (the host alias defined above) and port=0. You now have a connection that can be used by the hadoop package.
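
Before opening VisTrails, it can help to check from a second terminal that the gray02 alias is reachable through the tunnel. This is a minimal sketch; the function name check_tunnel and the SSH override variable are hypothetical conveniences, not part of the hadoop package.

```shell
# Sketch: run a no-op command on the host alias to see if the tunnel works.
# check_tunnel and the SSH variable are hypothetical; SSH defaults to ssh
# and exists only so the client can be swapped out.
check_tunnel() {
    host="${1:-gray02}"
    # ConnectTimeout makes the check give up quickly if nothing is
    # listening on the forwarded port.
    ${SSH:-ssh} -o ConnectTimeout=5 "$host" true
}
```

With the tunnel from `ssh vgctunnel` still open, `check_tunnel gray02 && echo "tunnel is up"` reports success. The tunnel can also be kept in the background with `ssh -f -N vgctunnel`, which forwards the port without opening a remote shell.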

HDFSGet