UsersGuideVisTrailsPackages

From VistrailsWiki
Revision as of 21:15, 2 March 2007 by Cscheid (talk | contribs) (Added new section)

Introduction

VisTrails provides infrastructure for incorporating user-defined functionality into the main program. Specifically, users can incorporate their own visualization and simulation codes into pipelines by defining custom modules. These modules are bundled in what we call packages. A VisTrails package is simply a collection of user-written Python classes that follow a certain convention; each class represents a new module. Here is a simplified example of a user-defined module:

class Divide(Module):
    def compute(self):
        arg1 = self.getInputFromPort("arg1")
        arg2 = self.getInputFromPort("arg2")
        if arg2 == 0.0:
            raise ModuleError(self, "Division by zero")
        self.setResult("result", arg1 / arg2)

registry.addModule(Divide)
registry.addInputPort(Divide, "arg1", (basic.Float, 'dividend'))
registry.addInputPort(Divide, "arg2", (basic.Float, 'divisor'))
registry.addOutputPort(Divide, "result", (basic.Float, 'quotient'))

New VisTrails modules must subclass Module, the base class that defines basic functionality. The only required override is the compute() method, which performs the actual module computation. Inputs and outputs are specified through ports, which currently have to be explicitly registered with VisTrails. This is straightforward, and is done through method calls to the module registry. A complete documented example of a (slightly) more complicated module is available here.
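To try out the compute() logic without a running VisTrails, a module can be exercised against a stand-in base class. The sketch below is only an illustration: the Module and ModuleError classes in it are hypothetical stubs that mimic the calls used above (getInputFromPort, setResult, and the ModuleError constructor), not the real VisTrails classes, and run_divide is a helper invented for this example. It is written for Python 3.

```python
class ModuleError(Exception):
    """Hypothetical stand-in for the VisTrails ModuleError."""

    def __init__(self, module, message):
        super().__init__(message)
        self.module = module


class Module:
    """Hypothetical stub: keeps port values in plain dictionaries."""

    def __init__(self, **inputs):
        self.inputs = dict(inputs)
        self.outputs = {}

    def getInputFromPort(self, name):
        return self.inputs[name]

    def setResult(self, name, value):
        self.outputs[name] = value


class Divide(Module):
    # Same body as the example module above.
    def compute(self):
        arg1 = self.getInputFromPort("arg1")
        arg2 = self.getInputFromPort("arg2")
        if arg2 == 0.0:
            raise ModuleError(self, "Division by zero")
        self.setResult("result", arg1 / arg2)


def run_divide(a, b):
    """Run Divide once on the stub and return its 'result' output."""
    module = Divide(arg1=a, arg2=b)
    module.compute()
    return module.outputs["result"]
```

Calling run_divide(6.0, 3.0) goes through the same read-inputs, compute, set-result path that the module follows inside a pipeline, while a zero divisor surfaces as a ModuleError.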

Dealing with command line tools and side effects

In an ideal world, each module would be referentially transparent. In other words, a module's outputs would be completely determined by its inputs. This is important for provenance purposes: if modules have implicit dependencies, there is no guarantee that re-executing the process will generate the same results.
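As a small illustration of the distinction, written in plain Python 3 and independent of the VisTrails API, the functions below (all names are made up for this example) contrast a computation that depends only on its arguments with one that depends on hidden state, and show how the hidden state can be turned into an explicit input.

```python
import random


def scale(values, factor):
    """Referentially transparent: the result depends only on the arguments."""
    return [v * factor for v in values]


def jitter(values):
    """Not referentially transparent: depends on the hidden global RNG state."""
    return [v + random.random() for v in values]


def jitter_seeded(values, seed):
    """Transparent again: the hidden state became an explicit input."""
    rng = random.Random(seed)
    return [v + rng.random() for v in values]
```

Two calls to jitter_seeded with the same values and seed return the same list, which is the property a provenance system needs in order to trust a re-execution.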

However, certain modules inherently have side effects (reading and writing files, accessing the network, etc.). For the common case of temporary files, VisTrails provides a convenience layer that removes part of the burden of managing them. As an illustrative example, consider one of the packages we make available for image conversion, using the ImageMagick suite:

class Convert(ImageMagick):
    """Convert is the base Module for VisTrails Modules in the ImageMagick
package that deal with operations on images. Convert is a bit of a misnomer since
the 'convert' tool does more than simply file format conversion. Each subclass
has a descriptive name of the operation it implements."""

    def create_output_file(self):
        """Creates a File with the output format given by the
outputFormat port."""
        if self.hasInputFromPort('outputFormat'):
            s = '.' + self.getInputFromPort('outputFormat')
            return self.interpreter.filePool.create_file(suffix=s)

    def geometry_description(self):
        """returns a string with the description of the geometry as
indicated by the appropriate ports (geometry or width and height)"""
        # if complete geometry is available, ignore rest
        if self.hasInputFromPort("geometry"):
            return self.getInputFromPort("geometry")
        elif self.hasInputFromPort("width"):
            w = self.getInputFromPort("width")
            h = self.getInputFromPort("height")
            return "'%sx%s'" % (w, h)
        else:
            raise ModuleError(self, "Needs geometry or width/height")

    def run(self, *args):
        """run(*args), runs ImageMagick's 'convert' on a shell, passing all
arguments to the program."""        
        cmdline = ("convert" + (" %s" * len(args))) % args
        if not self.__quiet:
            print cmdline
        r = os.system(cmdline)
        if r != 0:
            raise ModuleError(self, "system call failed: '%s'" % cmdline)

    def compute(self):
        o = self.create_output_file()
        i = self.input_file_description()
        self.run(i, o.name)
        self.setResult("output", o)

(...)

    reg.addModule(Convert)
    reg.addInputPort(Convert, "geometry", (basic.String, 'ImageMagick geometry'))
    reg.addInputPort(Convert, "width", (basic.String, 'width of the geometry for operation'))
    reg.addInputPort(Convert, "height", (basic.String, 'height of the geometry for operation'))
    reg.addOutputPort(Convert, "output", (basic.File, 'the output file'))

This example introduces several new VisTrails features. The last line of the snippet registers an output port that provides a file. A file output immediately presents several problems when a pipeline is to be shared among users in heterogeneous environments. For example, where should the file be written? For temporary files, VisTrails provides a file pool class that manages temporary files and their lifetimes automatically, so that users don't have to worry about deleting them after execution. To create a temporary file, a user calls, for example

fileObj = self.interpreter.filePool.create_file(suffix=".png")

fileObj will then contain a module that represents a file. The file pool simply creates a temporary file with write permissions, whose local name is available, in this case, as fileObj.name. The package developer is then free to use this file for any purpose.
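To make the idea of a file pool concrete, here is a minimal sketch of what such a pool might do, built on Python's standard tempfile module. This is an assumption-laden toy, not the VisTrails implementation: the class name ToyFilePool and its cleanup method are invented for this example, and it returns plain path strings rather than file modules. It is written for Python 3.

```python
import os
import tempfile


class ToyFilePool:
    """Hypothetical illustration of a temporary-file pool (not VisTrails code)."""

    def __init__(self):
        self._paths = []

    def create_file(self, suffix=""):
        # mkstemp creates the file securely and returns an open descriptor.
        fd, path = tempfile.mkstemp(suffix=suffix)
        os.close(fd)  # callers reopen the file by name when they need it
        self._paths.append(path)
        return path

    def cleanup(self):
        # Delete every file the pool handed out, skipping ones already gone.
        for path in self._paths:
            if os.path.exists(path):
                os.remove(path)
        self._paths = []
```

Because the pool remembers every path it hands out, a single cleanup() call at the end of an execution removes all temporary files, and module code never deletes files itself.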

Another feature of this example is the use of command line tools. Python provides a very convenient way to execute commands through a shell. In this case, we call os.system on a command line that runs the appropriate program.
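Building a command line by string formatting means that file names containing spaces or shell metacharacters need careful quoting. As a variation on the approach above (not what the ImageMagick package does), the sketch below passes an argument list to Python's standard subprocess module so that no shell is involved. The helper names run_tool and demo are invented for this example, the Python interpreter stands in for 'convert' so that the snippet runs anywhere, and inside a real module the RuntimeError would presumably be a ModuleError. It is written for Python 3.

```python
import subprocess
import sys


def run_tool(program, *args):
    """Run a program with an argument list (no shell) and return its stdout."""
    cmd = [program] + [str(a) for a in args]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        raise RuntimeError("call failed (%d): %r" % (proc.returncode, cmd))
    return proc.stdout.decode()


def demo():
    # Each argument is passed through verbatim, so embedded spaces and
    # quotes need no extra escaping.
    return run_tool(sys.executable, "-c", "print('file name with spaces.png')")
```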

Interaction with Caching

VisTrails provides a caching mechanism, in which portions of pipelines that are common across different executions are automatically shared. However, some modules intrinsically have side effects (writing a report to stdout, writing a file to disk, or creating a user interface widget) and should not be shared. Caching control is therefore up to the package developer. Caching is enabled by default, so a developer who doesn't want caching to apply must make small changes to the module. A convenient way to disable caching entirely is to use multiple inheritance and derive from a mixin class provided by VisTrails. For example, look at the StandardOutput module:

from core.modules.vistrails_module import Module, newModule, \
    NotCacheable, ModuleError
(...)
class StandardOutput(NotCacheable, Module):
    """StandardOutput is a VisTrails Module that simply prints the
    value connected on its port to standard output. It is intended
    mostly as a debugging device."""
    
    def compute(self):
        v = self.getInputFromPort("value")
        print v

Because this module subclasses NotCacheable as well as Module (or one of its subclasses), VisTrails will not cache it, or anything downstream from it.

VisTrails also allows a more sophisticated decision on whether to use caching. To do that, a user simply overrides the method is_cacheable to return the correct value. This allows context-dependent decisions. For example, in the teem package, there's a module that can generate a scalar field of random numbers. Such output is non-deterministic, so it shouldn't be cached. However, this module only generates non-deterministic values in special cases, depending on its input port values. To stay efficient when caching is possible, while still maintaining correctness, that module implements the following override:

class Unu1op(Unu):
(...)
    def is_cacheable(self):
        return self.getInputFromPort('op') not in ['rand', 'nrand']
(...)

Notice that the module explicitly uses inputs to decide whether it should be cached. This allows reasonably fine-grained control over the process.
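The effect of is_cacheable on a pipeline can be pictured with a toy executor. The sketch below is not how VisTrails implements caching. Step, can_cache and execute are names invented for this example, and the cache is keyed by step name only, whereas a real cache would also have to account for parameter values. It does show the behavior described above: a step that is not cacheable is re-run on every execution, and nothing downstream of it is cached either, while cacheable upstream steps are reused. It is written for Python 3.

```python
class Step:
    """Hypothetical pipeline step: a name, a function, and upstream steps."""

    def __init__(self, name, func, upstream=(), cacheable=True):
        self.name = name
        self.func = func
        self.upstream = list(upstream)
        self.cacheable = cacheable
        self.runs = 0  # how many times func was actually called

    def is_cacheable(self):
        return self.cacheable


def can_cache(step):
    # A step may be cached only if it and everything upstream of it agree.
    return step.is_cacheable() and all(can_cache(u) for u in step.upstream)


def execute(step, cache):
    """Compute a step's value, reusing cached results where allowed."""
    if can_cache(step) and step.name in cache:
        return cache[step.name]
    inputs = [execute(u, cache) for u in step.upstream]
    step.runs += 1
    value = step.func(*inputs)
    if can_cache(step):
        cache[step.name] = value
    return value
```

Executing a three-step chain twice with a shared cache, where only the last step is marked as not cacheable, runs the first two steps once and the last step twice.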

Interaction with Other Packages

When developing more complicated packages, it becomes natural to split code among different VisTrails packages, and have one logically depend on the other. For example, in one package (say, named 'package_base'), a user might define

class PackageBaseModule(Module):
    ...
def initialize():
    registry.addModule(PackageBaseModule)
    ...

And then, in another package (say, 'package_derived'),

class DerivedModule(PackageBaseModule):
    ...
def initialize():
    registry.addModule(DerivedModule)
    ...

Because of the way packages are loaded, package_derived cannot be initialized before package_base. VisTrails provides a mechanism for specifying interpackage dependencies. Every VisTrails package can provide a list of necessary installed packages. This is done by providing a callable in the package under the name package_dependencies. For example, here's how the VTK VisTrails package declares dependencies:

def package_dependencies():
    import core.packagemanager
    manager = core.packagemanager.get_package_manager()
    if manager.has_package('spreadsheet'):
        return ['spreadsheet']
    else:
        return []

The callable must return a list of strings, giving the names of the packages it depends on. We also use this example to introduce the package manager API, which is useful here for inspecting the packages present in the system. Notice that the dependencies are not static: vtk depends on spreadsheet if and only if spreadsheet is present in the system. Otherwise, it has no dependencies.

Note: Circular dependencies are not allowed. They will be detected by VisTrails and an error will be signalled.
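One common way to turn declared dependencies into an initialization order, and to detect cycles along the way, is a depth-first topological sort. The sketch below illustrates that general technique only. It is not taken from the VisTrails package manager, and load_order is a name invented for this example. It is written for Python 3.

```python
def load_order(dependencies):
    """Return package names so that each appears after its dependencies.

    `dependencies` maps a package name to the list of names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    order = []
    state = {}  # name -> "visiting" while on the current path, then "done"

    def visit(name, path):
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            raise ValueError("circular dependency: " + " -> ".join(path + [name]))
        state[name] = "visiting"
        for dep in dependencies.get(name, []):
            visit(dep, path + [name])
        state[name] = "done"
        order.append(name)

    # Sorting makes the result deterministic for a given input.
    for name in sorted(dependencies):
        visit(name, [])
    return order
```

For the two packages above, load_order({"package_derived": ["package_base"], "package_base": []}) places package_base first, so it can be initialized before package_derived.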

Note: Currently, package names are reasonably brittle, in the sense that conflicts in package naming might become an issue. We are in the process of designing an API that will allow more robust naming schemes.

User-defined module shapes and colors

Help! This documentation wasn't good enough!

Sorry, it's our fault! If you need help, email cscheid@sci.utah.edu, or, preferably, join the vistrails-users list.