JobSubmission
Introduction
This page describes and discusses the "job submission" effort in VisTrails, i.e. running jobs from VisTrails on remote servers, and getting a job's results asynchronously in a later session.
Long-running jobs
VisTrails supports long-running jobs through the ModuleSuspended mechanism. A module can suspend itself once a job is running (after submitting it on the first run, or after checking that it is not done yet on subsequent runs) by raising a ModuleSuspended exception, which carries information allowing the JobMonitor to automatically check the status of the job in the background.
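As a minimal sketch, a module using this mechanism might look like the following. This assumes the ModuleSuspended(module, message, handle=...) form, where the handle exposes a finished() method that the JobMonitor polls; the keyword name and the handle protocol vary across VisTrails versions, and find_existing_job/submit_job/check_job are hypothetical stand-ins for real batch-system calls.

 from vistrails.core.modules.vistrails_module import Module, ModuleSuspended

 def find_existing_job(module):
     """Hypothetical: return the id of a job already submitted for this
     subpipeline (VisTrails keys this on the subpipeline signature),
     or None if there is none."""
     raise NotImplementedError

 def submit_job(script):
     """Hypothetical: hand `script` to some batch system, return a job id."""
     raise NotImplementedError

 def check_job(job_id):
     """Hypothetical: return True once the batch system reports completion."""
     raise NotImplementedError

 class JobHandle(object):
     """Handle the JobMonitor polls in the background."""
     def __init__(self, job_id):
         self.job_id = job_id

     def finished(self):
         return check_job(self.job_id)

 class LongJob(Module):
     def compute(self):
         job_id = find_existing_job(self)
         if job_id is None:
             # First run: submit the job.
             job_id = submit_job(self.get_input('script'))
         handle = JobHandle(job_id)
         if not handle.finished():
             # Suspend: execution stops here; the JobMonitor keeps polling
             # handle.finished() in the background.
             raise ModuleSuspended(self, "job still running", handle=handle)
         # The job is done: produce outputs as a normal module would.
         self.set_output('job_id', job_id)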
The JobMonitor then notifies the user when a known running job has completed, so that the user can re-run the workflow, which should now be able to get past the suspended module.
In addition, the JobMonitor serializes this information so that VisTrails knows to check for these jobs if you restart it later (or on a different machine, if the information is written to the vistrail).
This is a deliberately abstract, high-level interface for background jobs, on which any job-submission mechanism can be built.
Remote job packages
Running jobs on a remote machine can be done with ad-hoc packages that use the ModuleSuspended mechanism.
RemoteQ
RemoteQ is currently the only such package.
It allows a user to run commands on a server through SSH, via modules such as Machine, CopyFile, RunCommand, and RunPBSScript; a sketch of the underlying submit-and-poll pattern appears after the comment below.
- [RR] The problem here is that filenames need to be explicit in the workflow. There is no job isolation. This means that there are side effects, which should NOT be permitted in dataflows, and ARE going to break things (especially if the vistrail gets shared). Files and jobs should be associated with the job "signature", so that running one version doesn't corrupt the results of another (NECESSARY for sharing job-submitting vistrails!) and so that running a pipeline gets the output of the correct job.
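The following sketch is not RemoteQ's actual implementation; it only illustrates the submit-and-poll pattern that a module like RunPBSScript builds on, using paramiko for the SSH transport (an assumption, not necessarily what RemoteQ uses); the host, user, and paths are placeholders.

 import paramiko

 def ssh_run(host, user, command):
     """Run one command on the remote machine, return its stdout as text."""
     client = paramiko.SSHClient()
     client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
     client.connect(host, username=user)
     try:
         stdin, stdout, stderr = client.exec_command(command)
         return stdout.read().decode()
     finally:
         client.close()

 # First run: submit the script to PBS and remember the job id.
 job_id = ssh_run('cluster.example.org', 'alice',
                  'qsub /home/alice/job.pbs').strip()

 # Subsequent runs: poll the queue.  A job id missing from qstat's output
 # is a rough signal that the job has left the queue, i.e. finished, at
 # which point the workflow can copy the output files back.
 status = ssh_run('cluster.example.org', 'alice', 'qstat %s' % job_id)
 done = job_id not in status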
Running jobs are associated with a subpipeline signature (previously in JSON, now in an annotation; FIXME: still in JSON in the XML!)
- [RR] JSON in XML is gross
- [RR] Should probably have one annotation per job
- [RR] What are these 'workflow' and 'id' UUIDs?
Jobs have to contain the subpipeline signature to be matched with subsequent invocations, but also the workflow/version so that the job can be checked or resumed from the JobMonitor. A job also has to serialize whatever information it needs to resume, e.g. the output filename if it gets that from a parameter. The objective is to resume without running the upstream (since re-running the upstream might yield different results and thus require a different job).
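For illustration only, such a job record could be serialized along these lines; the field names and values are assumptions, not VisTrails's actual schema (see the JSON-in-XML FIXME above).

 import json

 job_record = {
     # Matches the job with subsequent invocations of the same subpipeline.
     'signature': 'deadbeef0123',          # subpipeline signature (made up)
     # Locates the job in the vistrail so it can be checked or resumed
     # from the JobMonitor without re-running the upstream.
     'workflow': '7f9c0d2e-made-up-uuid',  # workflow UUID
     'version': 42,                        # version in the vistrail
     # Everything needed to resume, e.g. parameters consumed on the
     # first run such as an output filename.
     'resume': {'job_id': '12345.pbs', 'output_file': '/scratch/out.dat'},
 }
 print(json.dumps(job_record, indent=2))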