SDM Provenance
Requirements
The provenance data model is based on a three-layered architecture that collects information about:
- workflow evolution,
- workflow meta-information, and
- workflow execution
We model the scientific workflows we are considering as DAGs. In addition, the execution information will not store the data products produced at each stage of the workflow; rather, the user can specify arbitrary data product storage in their workflow.
Workflow evolution
We propose a general add/delete structure for storing workflow evolution. Any modification of a workflow can be decomposed into a sequence of actions that add or delete features of the workflow. These actions are organized into a tree in which each node represents the workflow generated by composing the actions along the path from the root of the tree to that node (a relational sketch follows below). Note that this structure is independent of the actual workflow representation.
[Tables: ACTION, OPERATION]
actionId is a foreign key into the Action table.
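A minimal relational sketch of this structure, assuming all column names other than the actionId foreign key (only the ACTION and OPERATION table names appear in this proposal); the recursive query shows how the workflow at a given node is materialized by composing actions from the root:

  -- Sketch only: the ACTION/OPERATION names and the actionId foreign key come
  -- from the proposal; every other column is an assumption.
  CREATE TABLE action (
      id        BIGINT PRIMARY KEY,
      parent_id BIGINT REFERENCES action(id),  -- tree edge; NULL at the root
      at_time   TIMESTAMP
  );

  CREATE TABLE operation (
      id        BIGINT PRIMARY KEY,
      action_id BIGINT NOT NULL REFERENCES action(id),  -- "actionId"
      op        VARCHAR(6) NOT NULL CHECK (op IN ('add', 'delete')),
      object_id BIGINT  -- the workflow feature added or deleted
  );

  -- Materializing the workflow at node 42 (a hypothetical id): walk up the
  -- tree to the root, then replay the operations of every action on the path.
  WITH RECURSIVE path AS (
      SELECT id, parent_id FROM action WHERE id = 42
      UNION ALL
      SELECT a.id, a.parent_id FROM action a JOIN path p ON a.id = p.parent_id
  )
  SELECT o.* FROM operation o JOIN path ON o.action_id = path.id;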
Workflow Meta-Information
We take an abstract view of the workflow for this proposal, although we expect more requirements and details to be added. A workflow can be modeled as a DAG where each node represents a computational module and each edge represents a flow of data from one module to another--a connection. A module may also simply perform some I/O operation (reading a file, writing to a database). Because each node may have multiple in- and out-edges, we also identify ports for each module that specify the role and format of the incoming or outgoing data. Finally, each module contains some static parameters (or properties) that are set during workflow design. We should identify modules and properties based on some naming scheme (hierarchical pattern). Also, we need to support user-specified annotations as key-value pairs; a schema sketch for these elements follows the list below. With these five elements: modules, connections, ports, parameters, and annotations, we can define a workflow as:
- a collection of modules where each module contains
- a collection of input ports
- a collection of output ports
- a collection of parameters
- a collection of connections where each connection contains
- a reference to the output port of the source module
- a reference to the input port of the target module
- a collection of annotations
[Tables: MODULE, PORT, CONNECTION, PARAMETER, ANNOTATION]
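A minimal SQL sketch of these five elements; every table layout below is an assumption, since only the element names and their relationships are described above:

  -- Sketch only: table names follow the five elements; all columns assumed.
  CREATE TABLE module (
      id   BIGINT PRIMARY KEY,
      name VARCHAR(255)  -- hierarchical naming scheme, e.g. 'stats.filter.median'
  );

  CREATE TABLE port (
      id        BIGINT PRIMARY KEY,
      module_id BIGINT REFERENCES module(id),
      direction VARCHAR(3) CHECK (direction IN ('in', 'out')),
      data_type VARCHAR(50)  -- role/format of the incoming or outgoing data
  );

  CREATE TABLE connection (
      id          BIGINT PRIMARY KEY,
      out_port_id BIGINT REFERENCES port(id),  -- output port of the source module
      in_port_id  BIGINT REFERENCES port(id)   -- input port of the target module
  );

  CREATE TABLE parameter (
      id        BIGINT PRIMARY KEY,
      module_id BIGINT REFERENCES module(id),  -- see renaming discussion in the comments
      name      VARCHAR(255),
      value     VARCHAR(1024)  -- static, set during workflow design
  );

  CREATE TABLE annotation (
      id    BIGINT PRIMARY KEY,
      key   VARCHAR(255),   -- user-specified key-value pair
      value VARCHAR(1024)   -- attachment point left open (see the comments below)
  );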
Workflow Execution
Again, we emphasize that for large-scale applications, we cannot arbitrarily store all intermediate data. However, we note that it may make sense for the user to specify that some information is stored during run-time execution. We propose that execution information include:
- a link to the representation of the workflow that was executed (the representation itself is stored at the representation layer, so it need not be duplicated in this layer)
- the name of the user executing the workflow
- the name of the organization the user belongs to
- the time of execution
- system identifiers (operating system, library versions, etc.)
- tracking of the actual executed modules
- runtime specific information/annotations
The first five items are simply attributes, but the last two are more involved. Note that knowing that a workflow execution failed is fairly meaningless unless you know during which step it failed; by tracking each module execution, you can determine the point of failure. We also note that storing runtime-specific information is useful for cases where executing the same workflow generates different results. These might be stored as key-value pairs or in some more specific form.
[Tables: WF_EXEC, MODULE_EXEC, ANNOTATION_EXEC]
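A minimal sketch of the execution tables; the names wf_exec, module_exec, and annotation_exec come from the comments below, the time and version columns from a recovered table fragment, and the remaining columns are assumptions:

  -- Sketch only: wf_exec/module_exec/annotation_exec are named in the comments;
  -- time numeric(10,2) and version long come from a recovered fragment.
  CREATE TABLE wf_exec (
      id          BIGINT PRIMARY KEY,
      wf_id       BIGINT,           -- link to the workflow representation executed
      user_name   VARCHAR(255),     -- user executing the workflow
      org_name    VARCHAR(255),     -- organization the user belongs to
      time        NUMERIC(10,2),    -- time of execution
      version     BIGINT,
      system_info VARCHAR(1024)     -- operating system, library versions, etc.
  );

  CREATE TABLE module_exec (
      id         BIGINT PRIMARY KEY,
      wf_exec_id BIGINT REFERENCES wf_exec(id),
      module_id  BIGINT             -- tracks which module actually executed
  );

  CREATE TABLE annotation_exec (
      id             BIGINT PRIMARY KEY,
      wf_exec_id     BIGINT REFERENCES wf_exec(id),
      module_exec_id BIGINT REFERENCES module_exec(id),
      key            VARCHAR(255),  -- runtime-specific key-value pairs
      value          VARCHAR(1024)
  );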
Comments
Daniel Crawl
- PORT needs a data type (e.g. string, int) column; the current type column is direction.
- Kepler has directors and composite actors.
- Director not the same as a module.
- Have parameters, but not ports. Either rename module_id column in PARAMETER table to something more general, or create additional PARAMETER tables for directors and workspaces (see below).
- Composite actors are subworkflows (hierarchical modeling).
- Can have parameters.
- Can contain a different director.
- Can contain composite actors (nesting).
- Add contained_in_id column to ACTOR, DIRECTOR
- foreign key to COMPOSITE(id)
- Kepler uses more general relationships (one output, multiple inputs) than connections (one output, one input) between modules.
- Wiki says model workflow as DAG? Existing schema does not require this, and Kepler can have cycles.
- Need to identify the version of each module (and director) used in a workflow. Is the version incorporated into the id column?
- Questions about workflow evolution:
- Shouldn't the foreign key be from ACTION to OPERATION? (The same operation can be performed in multiple actions?)
- Why is object type and parent information in OPERATION?
- Modifications to schema:
- ACTOR: id (long), contained_in_id (long), ...
- DIRECTOR: id (long), contained_in_id (long), ...
- COMPOSITE: id (long), contained_in_id (long), ...
- PORT: ..., direction ('in', 'out'), actor_id (long), ...
- RELATION: ...
- (table name not preserved): id (long), actor_id (long), port_id (long), ...
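A sketch of these modifications as SQL; the LINK table name and its relation_id column are hypothetical additions to model Kepler's one-output/many-inputs relations:

  -- Sketch only: contained_in_id columns and the COMPOSITE table follow the
  -- notes above; the LINK name and relation_id column are hypothetical.
  CREATE TABLE composite (
      id              BIGINT PRIMARY KEY,
      contained_in_id BIGINT REFERENCES composite(id)  -- composites can nest
  );

  CREATE TABLE actor (
      id              BIGINT PRIMARY KEY,
      contained_in_id BIGINT REFERENCES composite(id)  -- NULL = top-level workflow
  );

  CREATE TABLE director (
      id              BIGINT PRIMARY KEY,
      contained_in_id BIGINT REFERENCES composite(id)  -- each composite may have its own
  );

  CREATE TABLE port (
      id        BIGINT PRIMARY KEY,
      actor_id  BIGINT REFERENCES actor(id),
      direction VARCHAR(3) CHECK (direction IN ('in', 'out'))
  );

  -- A relation generalizes a connection: one output port can feed many inputs.
  CREATE TABLE relation (
      id BIGINT PRIMARY KEY
  );

  CREATE TABLE link (
      id          BIGINT PRIMARY KEY,
      relation_id BIGINT REFERENCES relation(id),  -- assumed column
      actor_id    BIGINT REFERENCES actor(id),
      port_id     BIGINT REFERENCES port(id)
  );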
George Chin
- Agree with David that we need a mechanism to represent subworkflows, although not crazy about the term "workspace".
- What can annotations be attached to? I can imagine one wanting to stick annotations on workflows, modules, connections, ports, and parameters, and annotation_exec on both workflow_exec and module_exec.
- Would be useful for wf_exec and module_exec to have a "status" field for capturing execution progress.
- module_exec should also have machine fields (machine_os, machine_proc, machine_ram) because different modules of the same workflow may be running on different computers.
- Other provenance systems such as MINDSWAP and MYGRID use ontologies to organize objects/actors. Should we consider using ontologies or semantic hierarchies to capture the structures and relationships of elements such as actors, parameters, ports, and data? This would facilitate more sophisticated querying and could provide the application developer/scientist more semantic information, which could aid in workflow design by better identifying which parameters and ports are compatible, which actors have similar I/O requirements, how to transform data between actors, etc. Would VisTrails' data model, with its use of generic modules that can be pretty much anything, constrain our ability to use ontologies?
Types of Provenance
There are 4 types of provenance that we can include:
1) “Data” Provenance: This is provenance of the data that is produced by the program. This provenance would keep track of all the transformations that the data went through for a particular run. Thus we need to keep track of the input and output for each run. The provenance for a particular output would include the run in which it was produced, the input parameters for that run, and the intermediate transformations, if any. Since the output data and the data produced in intermediate steps are going to be large, we can store them remotely and maintain links to these remote locations as part of the provenance data (see the sketch after this list).
2) Process Provenance: This is provenance about the process statistics for each run. This can include details like execution time, network speed, and the amount of data generated (or anything else that may be of importance). As this data is not going to be very large, it can be stored in the local workspace.
3) Workflow Provenance: This is provenance about the workflows themselves, keeping track of all changes. This could be something like a CVS repository where all versions of the workflow are kept and we can find the difference between each version. As this data is not going to be very large, it can be stored in the local repository.
4) System Provenance: This keeps track of the system profile. When a change is made to the system configuration, either a new copy or just the change can be saved. Also, we can either save a copy of the system itself (which is easy and robust but very large) or save details about the configuration so that we can reconstruct the exact environment (this requires only a small amount of space, but is complicated, and there is a risk that the exact environment cannot be reconstructed).
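For (1), a minimal sketch of how remote data product links might be stored; the table and all column names are hypothetical:

  -- Hypothetical table for (1): store only links to remotely kept data.
  CREATE TABLE data_product (
      id             BIGINT PRIMARY KEY,
      wf_exec_id     BIGINT,          -- the run that produced the data
      module_exec_id BIGINT,          -- the intermediate step, if any
      remote_url     VARCHAR(1024)    -- the data itself stays at the remote location
  );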
General Notes: (3) is already represented as the VisTrails Version Tree. A similar format can be used for (1) too: every input change creates a new branch in the tree. (2) and (4) are not provenance in the original sense; the tracking just goes back one step, to the data and the particular configuration that produced that data. Also, provenance can be a director in the workflow.