Multithreaded-interpreter
The multithreaded-interpreter is a development branch adding support for parallel processing to VisTrails modules. It allows parts of a pipeline to be executed in parallel, or even on remote machines.
Currently, support for threads, multiprocessing and IPython has been added.
Introduction
In 2.0, modules execute serially; each module calls its upstream modules' update() method recursively to execute the full graph.
Because of the way VisTrails behaves and limitations of the libraries it uses (PyQt, VTK), this should be opt-in: by default, modules should still run sequentially on a single thread.
Modules that do support parallel execution would be run in a different thread, in parallel with each other and with classic modules; a mechanism is therefore needed to let the interpreter thread continue while a module runs in the background.
Execution logic
Currently, the execution logic for the pipeline is entirely contained in the Module class. In the update() method, a module calls update_upstream() then compute(); this first method simply calls update() on each upstream module recursively and in no particular order.
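The recursive scheme described above can be sketched as follows. This is a minimal illustration, not the actual VisTrails Module class; the `order` list is only there to make the execution order visible.

```python
order = []  # records the order in which modules compute

class Module(object):
    """Minimal sketch of the 2.0-style recursive execution."""
    def __init__(self, name, upstream=()):
        self.name = name
        self.upstream = list(upstream)
        self.computed = False

    def update_upstream(self):
        # Recursively update every upstream module, in no particular order
        for dep in self.upstream:
            dep.update()

    def update(self):
        if self.computed:
            return
        self.update_upstream()
        self.compute()

    def compute(self):
        self.computed = True
        order.append(self.name)

a = Module("A")
b = Module("B", [a])
c = Module("C", [a, b])
c.update()
print(order)  # ['A', 'B', 'C']
```

Note that `update()` blocks until the whole upstream subgraph has finished, which is exactly why this structure cannot overlap work from independent branches.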
These recursive calls are incompatible with backgrounding tasks, as there is no way for a module to do anything while these calls are not complete. Moreover, some modules override the update_upstream() logic, for instance the ControlFlow package (If/Default/ExecuteInOrder modules), or don’t actually use it, for instance ControlFlow’s or ParallelFlow’s Map modules.
Task system
Thus, I replaced the recursive calls with a task system (vistrails.core.task_system). This system provides a way to add tasks to be run later, and for a callback to be run when done; it also handles priorities. The system only runs tasks sequentially, but provides a way to register asynchronous tasks to support backgrounding modules.
To make sure that background tasks are started as soon as possible (before foreground tasks), the base priorities are as follows:
- 10 (UPDATE_UPSTREAM_PRIORITY) for “update upstream” tasks, so that every module has a chance to register its tasks before anything starts computing
- 100 (COMPUTE_PRIORITY) for “compute” tasks
- 50 (COMPUTE_BACKGROUND_PRIORITY) for “start thread/process” tasks, so that they are started before compute tasks. This allows them to run in parallel with regular compute tasks, in addition to running in parallel between themselves
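A toy priority-ordered task queue illustrates the idea; the `TaskSystem` class and its `add_task`/`run` methods here are illustrative stand-ins, not the actual vistrails.core.task_system API. Only the three priority constants come from the description above.

```python
import heapq
import itertools

UPDATE_UPSTREAM_PRIORITY = 10
COMPUTE_BACKGROUND_PRIORITY = 50
COMPUTE_PRIORITY = 100

class TaskSystem(object):
    """Runs queued tasks sequentially, lowest priority value first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add_task(self, priority, func):
        heapq.heappush(self._heap, (priority, next(self._counter), func))

    def run(self):
        while self._heap:
            _priority, _, func = heapq.heappop(self._heap)
            func()  # a real system would let asynchronous tasks keep running

ran = []
ts = TaskSystem()
ts.add_task(COMPUTE_PRIORITY, lambda: ran.append("compute"))
ts.add_task(UPDATE_UPSTREAM_PRIORITY, lambda: ran.append("upstream"))
ts.add_task(COMPUTE_BACKGROUND_PRIORITY, lambda: ran.append("start-background"))
ts.run()
print(ran)  # ['upstream', 'start-background', 'compute']
```

With these priorities, all upstream registrations happen first, then background tasks are launched, and only then do foreground compute tasks run, so backgrounded modules overlap with the sequential ones.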
Parallelization layer
On top of this task system, a simple interface has been added that lets standard modules use parallelization without much effort. It consists of two parts:
- A @parallelizable class decorator for Module subclasses, used to declare which targets a module can be executed on. A Module will only be executed on targets it supports.
- A ParallelizationScheme base class, that provides a specific way to execute a module. Currently implemented are:
- ThreadScheme, that uses concurrent.futures.ThreadPoolExecutor
- ProcessScheme, that uses concurrent.futures.ProcessPoolExecutor
- IPythonScheme, that uses IPython.parallel to dispatch modules to a cluster
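The interaction between the decorator and a scheme could look roughly like this. This is a hedged sketch: the `parallelizable` decorator, `parallelizable_targets` attribute and `execute` helper are illustrative names, not the real VisTrails API; only `concurrent.futures.ThreadPoolExecutor` is the actual library class a ThreadScheme would rely on.

```python
from concurrent.futures import ThreadPoolExecutor

def parallelizable(*targets):
    """Class decorator declaring which targets a module supports."""
    def wrap(cls):
        cls.parallelizable_targets = set(targets)
        return cls
    return wrap

class Module(object):
    parallelizable_targets = set()  # plain modules support no targets
    def compute(self):
        return "sequential result"

@parallelizable('thread')
class ThreadedModule(Module):
    def compute(self):
        return "threaded result"

def execute(module, pool):
    # Dispatch to a worker thread only if the module declares support;
    # unsupported modules keep running on the interpreter thread.
    if 'thread' in module.parallelizable_targets:
        return pool.submit(module.compute).result()
    return module.compute()

with ThreadPoolExecutor(max_workers=2) as pool:
    r1 = execute(Module(), pool)          # runs inline
    r2 = execute(ThreadedModule(), pool)  # runs in the pool
print(r1, r2)
```

Keeping the target declaration on the class makes the feature opt-in, matching the requirement that undecorated modules still run sequentially on a single thread.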
Issues
- When there are non-parallel modules or tasks, the task system may not run them in the most efficient order, because the actual duration of each task is not known. An effort has been made to reach the backgroundable tasks first, but since their upstream modules need to be run as well, finding the most efficient order is difficult. Moreover, using complex strategies to choose the next task might make execution slower overall, even though it exploits parallelism.
- Upstream modules are not necessarily run consecutively, which causes issues with packages that wrongly rely on that assumption to keep global state, like matplotlib, which creates a Figure and then expects only its own upstream modules to run before it.
- Executing a Group module remotely should be possible if all the contained modules support it.