SDM Provenance

Revision as of 21:24, 3 April 2007

Requirements

The provenance data model is based on a three-layered architecture that collects information about:

  • workflow evolution,
  • workflow meta-information, and
  • workflow execution

For the scientific workflows we are considering, we model each workflow as a directed acyclic graph (DAG). In addition, the execution layer does not store the data products produced at each stage of the workflow; instead, users can specify arbitrary data product storage within their workflows.

Workflow evolution

We propose a general add/delete structure for storing workflow evolution. Any modification of a workflow can be decomposed into a sequence of actions that add or delete features of the workflow. These actions are organized into a tree such that each node represents a workflow that is generated by composing the actions along the path from the root of the tree to the given node. Note that this structure is independent of the actual workflow representation.

ACTION
  Column     Type
  id         long
  parent_id  long
  user       varchar
  time       datetime

OPERATION
  Column           Type
  type             ('add', 'delete')
  obj_id           long
  obj_type         varchar
  parent_obj_id    long
  parent_obj_type  varchar
  action_id        long

action_id is a foreign key into the ACTION table.
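
As a concrete illustration, here is a minimal Python sketch (hypothetical; the class names and the materialize function are ours, not part of the proposal) of how a workflow version is recovered by composing the add/delete operations along the path from the root of the action tree to a given node:

  from dataclasses import dataclass, field

  @dataclass
  class Operation:
      type: str                 # 'add' or 'delete'
      obj_id: int
      obj_type: str             # e.g. 'module', 'connection', 'port'

  @dataclass
  class Action:
      id: int
      parent_id: int | None     # None marks the root of the tree
      operations: list[Operation] = field(default_factory=list)

  def materialize(actions: dict[int, Action], node_id: int) -> set[tuple[str, int]]:
      """Compose the actions on the path from the root to node_id and
      return the (obj_type, obj_id) pairs present in that workflow version."""
      # Follow parent links up to the root, then reverse to get root -> node.
      path, current = [], node_id
      while current is not None:
          path.append(actions[current])
          current = actions[current].parent_id
      path.reverse()
      # Apply each add/delete operation in order.
      workflow: set[tuple[str, int]] = set()
      for action in path:
          for op in action.operations:
              if op.type == 'add':
                  workflow.add((op.obj_type, op.obj_id))
              else:  # 'delete'
                  workflow.discard((op.obj_type, op.obj_id))
      return workflow

Because only deltas are stored, any version in the tree can be reconstructed this way without duplicating the unchanged parts of the workflow.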

Workflow Meta-Information

We take an abstract view of the workflow for this proposal, although we expect more requirements and details to be added. A workflow can be modeled as a DAG where each node represents a computational module and each edge represents a flow of data from one module to another (a connection). A module may also simply perform some I/O operation (reading a file, writing to a database). Because each node may have multiple incoming and outgoing edges, we also identify ports for each module that specify the role and format of the incoming or outgoing data. Finally, each module contains some static parameters (or properties) that are set during workflow design. We should identify modules and properties based on some naming scheme (e.g., a hierarchical pattern). We also need to support user-specified annotations as key-value pairs. With these five elements (modules, connections, ports, parameters, and annotations), we can define a workflow as:

  • a collection of modules where each module contains
    • a collection of input ports
    • a collection of output ports
    • a collection of parameters
  • a collection of connections where each connection contains
    • a reference to the output port of the source module
    • a reference to the input port of the destination module
  • a collection of annotations
MODULE
  Column  Type
  id      long
  ...     (other identifying information)

CONNECTION
  Column       Type
  id           long
  in_port_id   long
  out_port_id  long
  ...          (other identifying information)

PORT
  Column     Type
  id         long
  type       ('in', 'out')
  module_id  long
  ...        (other identifying information)

PARAMETER
  Column     Type
  id         long
  module_id  long
  ...        (other identifying information)

ANNOTATION
  Column  Type
  key     varchar
  value   varchar
  ...     (other identifying information)
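
To make the five elements concrete, the following minimal sketch (hypothetical Python dataclasses mirroring the tables above; the reader/filter modules are invented for illustration) builds a two-module workflow in which one module's output port feeds another module's input port:

  from dataclasses import dataclass

  @dataclass
  class Module:
      id: int

  @dataclass
  class Port:
      id: int
      type: str         # 'in' or 'out'
      module_id: int

  @dataclass
  class Connection:
      id: int
      in_port_id: int   # input port of the destination module
      out_port_id: int  # output port of the source module

  # A reader module (id 1) with one output port feeding a filter
  # module (id 2) with one input port, joined by a single connection.
  reader, fltr = Module(id=1), Module(id=2)
  reader_out = Port(id=10, type='out', module_id=reader.id)
  filter_in = Port(id=11, type='in', module_id=fltr.id)
  conn = Connection(id=100, in_port_id=filter_in.id, out_port_id=reader_out.id)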

Workflow Execution

Again, we emphasize that for large-scale applications we cannot afford to store all intermediate data. However, it may make sense for the user to specify that some information be stored during execution. We propose that execution information include:

  • a link to the representation of the workflow that was executed (because each workflow version is already stored at the representation layer, only a reference is needed here)
  • the name of the user executing the workflow
  • the name of the organization the user belongs to
  • the time of execution
  • system identification (operating system, library versions, etc.)
  • tracking of the actual executed modules
  • runtime-specific information/annotations

The first five items are simple attributes, but the last two are more involved. Note that knowing a workflow execution failed is fairly meaningless unless you know during which step it failed; by tracking each module execution, you can determine the point of failure. We also note that storing runtime-specific information is useful for cases where executing the same workflow generates different results. This information might be stored as key-value pairs or in some more specific form.

WF_EXEC
  Column        Type
  wf_id         long
  user          varchar
  org           varchar
  time          datetime
  machine_os    varchar
  machine_proc  varchar
  machine_ram   long
  ...           (other identifying information)

MODULE_EXEC
  Column     Type
  exec_id    long
  module_id  long
  error      bool
  time       numeric(10,2)
  ...        (other information)

ANNOTATIONS_EXEC
  Column  Type
  key     varchar
  value   varchar
  ...     (other information)
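
As an example of why per-module tracking matters, this hypothetical sketch (the ModuleExec class and point_of_failure function are ours, and we assume records come back in execution order) recovers the point of failure of a run from MODULE_EXEC records:

  from dataclasses import dataclass

  @dataclass
  class ModuleExec:
      exec_id: int      # foreign key into WF_EXEC
      module_id: int
      error: bool       # did this module execution fail?
      time: float       # execution time in seconds

  def point_of_failure(module_execs: list[ModuleExec], exec_id: int) -> int | None:
      """Return the id of the first failed module in the given workflow
      execution, or None if every module succeeded."""
      for me in module_execs:           # assumes execution order
          if me.exec_id == exec_id and me.error:
              return me.module_id
      return None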

Comments

Daniel Crawl

  • PORT needs a data type (e.g. string, int) column; the current type column is really a direction.
  • Kepler has directors and composite actors.
    • A director is not the same as a module.
      • Directors have parameters but not ports. Either rename the module_id column in the PARAMETER table to something more general, or create a second PARAMETER table for directors.
    • Composite actors (or workspaces) are subworkflows, and each can contain a (possibly different) director.
      • Add a workspace_id column to MODULE.
  • Kepler uses more general relationships (one output, multiple inputs) than connections (one output, one input) between modules; see the sketch after the tables below.
  • The wiki says to model the workflow as a DAG, but the existing schema does not require this, and Kepler workflows can have cycles.
  • Modifications to the schema:
MODULE
  Column        Type
  id            long
  workspace_id  long
  ...           (other identifying information)

DIRECTOR
  Column        Type
  id            long
  workspace_id  long
  ...           (other identifying information)

WORKSPACE
  Column  Type
  id      long
  ...     (other identifying information)

PORT
  Column     Type
  id         long
  type       varchar
  direction  ('in', 'out')
  module_id  long
  ...        (other identifying information)

RELATIONSHIP
  Column  Type
  id      long
  ...     (other identifying information)

LINK
  Column       Type
  id           long
  module_id    long
  port_id      long
  relation_id  long
  ...          (other identifying information)
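
To illustrate the generalization, here is a hypothetical sketch (based on the LINK table above; the ids are invented) of a single relationship fanning one module's output out to two consumers, something a one-to-one CONNECTION cannot express:

  from dataclasses import dataclass

  @dataclass
  class Link:
      id: int
      module_id: int
      port_id: int
      relation_id: int

  # Relationship 5 carries data from the output port of module 1
  # to the input ports of modules 2 and 3.
  links = [
      Link(id=1, module_id=1, port_id=10, relation_id=5),  # producer
      Link(id=2, module_id=2, port_id=20, relation_id=5),  # first consumer
      Link(id=3, module_id=3, port_id=30, relation_id=5),  # second consumer
  ]

  # All endpoints of relationship 5:
  endpoints = [lk for lk in links if lk.relation_id == 5]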