Second provenance challenge design overview

''Latest revision as of 09:49, 18 June 2007.''

This page describes our approach to answering the queries of the [http://twiki.ipaw.info/bin/view/Challenge/ Second Provenance Challenge].
 
We have developed a new, general API for querying provenance models used by different scientific workflow systems.
 
== Scientific Workflow Provenance Data Model (SWPDM): Overview ==
 
The SWPDM aims to capture entities and relationships that are relevant to both the definition
and execution of workflows. The goal is to define a "general" model that is able to represent
provenance information obtained from different workflow systems.
 
 
<b>Module definition</b> is a description of a processor that takes inputs and generates outputs.
 
<b>Workflow definition</b> is a description of a simple workflow that contains modules and connections between them through ports. It also contains the evolution of the workflow through a parent relation as used by [[Main_Page|VisTrails]].
 
<b>Execution log</b> is the information about a workflow execution. It contains information about the processors that were executed and the data items that were created.
 
[[Image:model.png]]
 
== Model Entities ==
 
<table border="1">
<tr><th>name</th><th>description</th></tr>
<tr><td>dataitem</td><td>
A data item that is input/output to executions.
This could be a local file, a file on the internet with a URL, or some internal representation of the file in the workflow system.
</td></tr><tr><td>procedure</td><td>
The procedure/service that is to be executed. This could be a local program, a web service, a sub-workflow, or some other kind of processing entity.
</td></tr>
<tr><td>module</td><td>
The instance of a procedure in the workflow. A module is connected with other modules through ports.
</td></tr>
<tr><td>execution</td><td>
The execution of a module, as recorded internally by the workflow system or externally in some other way.
</td></tr>
<tr><td>workflow</td><td>
A description of the process, containing the module instances and connections.
</td></tr>
<tr><td>run</td><td>
Executions are grouped into a run that is associated with a workflow.
</td></tr>
<tr><td>input</td><td>
Represents an input/parameter slot of a module.
</td></tr>
<tr><td>output</td><td>
Represents an output/result slot of a module.
</td></tr>
<tr><td>inPort</td><td>
Represents a connected input port in the workflow.
</td></tr>
<tr><td>outPort</td><td>
Represents a connected output port in the workflow.
</td></tr>
</table>
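The entities above can be sketched as a set of Python dataclasses. This is an illustration only, not the actual implementation; all class and field names here are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical in-memory representation of the SWPDM entities.
@dataclass
class DataItem:
    id: str                 # a local path, a URL, or an internal reference

@dataclass
class Procedure:
    id: str                 # the program, web service, or sub-workflow

@dataclass
class Module:
    id: str
    procedure: Procedure    # the procedure this module instantiates

@dataclass
class Execution:
    id: str
    module: Module          # the module this execution executes

@dataclass
class Workflow:
    id: str
    modules: List[Module] = field(default_factory=list)
    parent: Optional["Workflow"] = None   # workflow-evolution relation

@dataclass
class Run:
    id: str
    workflow: Workflow
    executions: List[Execution] = field(default_factory=list)

# A run groups executions and points back at the workflow it ran.
p = Procedure("align_warp")
m = Module("m1", p)
w = Workflow("wf1", modules=[m])
r = Run("r1", w, [Execution("e1", m)])
```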
 
== API Functions ==
 
The relations among the model entities are shown in the diagram above.
Note that few of the current provenance systems capture all of this information. For example,
some lack information about data products, and some do not explicitly
store the workflow definition. Thus, in designing API bindings for the different systems,
the goal is to extract (and map) as much information as possible from each source.
 
The model ER-diagram above suggests the following API functions.
 
Execution:
<ul>
<li>getDataFromOutPort - Get all data items created from this port</li>
<li>getOutPortFromData - Get the port that created this data item</li>
<li>getDataFromInPort - Get all data items used as input to this type of port</li>
<li>getInPortFromData - Get the input ports that have used this data item</li>
<li>getExecutionFromInData - Get the execution that used this data item</li>
<li>getInDataFromExecution - Get the data used by this execution</li>
<li>getExecutionFromOutData - Get the execution that created this data item</li>
<li>getOutDataFromExecution - Get the data created by this execution</li>
<li>getModuleFromExecution - Get the module that this execution executes</li>
<li>getExecutionFromModule - Get the executions of this module</li>
<li>getRunFromExecution - Get the workflow run that this execution is associated with</li>
<li>getExecutionFromRun - Get the executions in this workflow run</li>
<li>getRunFromWorkflow - Get the runs of this workflow</li>
<li>getWorkflowFromRun - Get the workflow that this is a run of</li>
<li>getModuleFromWorkflow - Get all module instances in this workflow</li>
<li>getWorkflowFromModule - Get the workflow that this module is part of</li>
<li>getParentDataItem - Get the data items this is derived from</li>
<li>getChildDataItem - Get the data items derived from this one</li>
<li>getParentExecution - Get the execution that preceded this one</li>
<li>getChildExecution - Get the execution that came after this one</li>
</ul>
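To illustrate how the execution-level functions compose, here is a hypothetical in-memory store in which the derived getParentDataItem is built from getExecutionFromOutData and getInDataFromExecution. The dictionary layout and the data/execution names are made up for this sketch.

```python
# Hypothetical store exposing two of the API functions above as plain
# Python methods backed by dictionaries.
class ExecutionLog:
    def __init__(self, produced_by, exec_inputs):
        self._produced_by = produced_by   # data id -> execution id
        self._inputs = exec_inputs        # execution id -> [input data ids]

    def getExecutionFromOutData(self, data_id):
        # The execution that created this data item.
        return self._produced_by.get(data_id)

    def getInDataFromExecution(self, exec_id):
        # The data used by this execution.
        return self._inputs.get(exec_id, [])

    def getParentDataItem(self, data_id):
        # Derived relation: the inputs of the execution that produced data_id.
        ex = self.getExecutionFromOutData(data_id)
        return [] if ex is None else self.getInDataFromExecution(ex)

log = ExecutionLog({"atlas-x.gif": "e_slicer"},
                   {"e_slicer": ["atlas.img", "atlas.hdr"]})
log.getParentDataItem("atlas-x.gif")   # -> ["atlas.img", "atlas.hdr"]
```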
 
Workflow:
<ul>
<li>getProcedureFromModule - Get the procedure this module is an instance of</li>
<li>getModuleFromProcedure - Get the module instances of this procedure</li>
<li>getOutPortFromModule - Get the output ports of a module</li>
<li>getInPortFromModule - Get the input ports of a module</li>
<li>getModuleFromOutPort - Get the module that has this out port</li>
<li>getModuleFromInPort - Get the module that has this in port</li>
<li>getInPortFromOutPort - Get all InPorts connected to this OutPort</li>
<li>getOutPortFromInPort - Get all OutPorts connected to this InPort</li>
<li>getParentWorkflow - Get the workflow this is derived from</li>
<li>getChildWorkflow - Get the workflows that are derived from this one</li>
<li>getConnectionFromInPort - Get all connections using this port</li>
<li>getInPortFromConnection - Get the Input port it is connected to</li>
<li>getConnectionFromOutPort - Get all connections using this port</li>
<li>getOutPortFromConnection - Get the Output port it is connected to</li>
</ul>
Derived:
<ul>
<li>getParentModule - Get the previous modules it is connected to</li>
<li>getChildModule - Get the next modules it is connected to</li>
</ul>
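The derived functions can be composed from the port-level functions listed above: follow each connected input port back across its connection to the module owning the output port. A sketch, with hypothetical module and port names:

```python
# Hypothetical workflow store; the derived getParentModule is composed
# from three of the port-level API functions above.
class WorkflowStore:
    def __init__(self, in_ports, connections, port_owner):
        self._in_ports = in_ports        # module -> [inPort ids]
        self._connections = connections  # inPort id -> connected outPort id
        self._port_owner = port_owner    # outPort id -> owning module

    def getInPortFromModule(self, module):
        return self._in_ports.get(module, [])

    def getOutPortFromInPort(self, in_port):
        return self._connections.get(in_port)

    def getModuleFromOutPort(self, out_port):
        return self._port_owner[out_port]

    def getParentModule(self, module):
        # Derived: the modules feeding this module's input ports.
        parents = []
        for ip in self.getInPortFromModule(module):
            op = self.getOutPortFromInPort(ip)
            if op is not None:
                parents.append(self.getModuleFromOutPort(op))
        return parents

wf = WorkflowStore({"softmean": ["in1", "in2"]},
                   {"in1": "out_a", "in2": "out_b"},
                   {"out_a": "align_warp_1", "out_b": "align_warp_2"})
wf.getParentModule("softmean")   # -> ["align_warp_1", "align_warp_2"]
```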
 
Procedure:
<ul>
<li>getOutputFromProcedure - Get all output ports of a procedure</li>
<li>getInputFromProcedure - Get all input ports of a procedure</li>
<li>getProcedureFromOutput - Get all procedures having this output port</li>
<li>getProcedureFromInput - Get all procedures having this input port</li>
</ul>
 
Global functions:
<ul>
<li>getAll(type)</li>
<li>getAll(type, annotation restriction)</li>
<li>getAnnotation</li>
</ul>
 
=== Upstream ===
 
The most common use of provenance data will be to compute the transitive closure of some connected executions or data items, i.e. to track data dependencies backward and forward in time. We call tracking back in time <b>upstream</b> and tracking forward in time <b>downstream</b>.
 
We have identified 4 primitives to which upstream/downstream tracking is relevant:
 
<table border="1">
<tr><th>primitive</th><th>description</th></tr>
<tr><td>dataitem</td><td>tracking data dependencies</td></tr>
<tr><td>moduleExecution</td><td>tracking execution dependencies</td></tr>
<tr><td>moduleInstances</td><td>tracking module dependencies within a workflow</td></tr>
<tr><td>workflow</td><td>tracking workflow design history e.g. different workflow versions in the [[Main_Page|VisTrails]] action tree</td></tr>
</table>
 
For a general transitive function we propose:
<ul><li>traverse(start, end, limit, stop)</li></ul>
The arguments are:
<b>start</b> is the start point and <b>end</b> is the end point. <b>limit</b> is the maximum number of steps to search, where <b>0</b> means no limit. <b>stop</b> is an optional list of nodes that should not be explored further; 'stop' is needed by some queries in the challenge and also adds to the expressiveness of the queries.
 
The function should return all elements between start and end, up to a path length of 'limit' and excluding branches after 'stop'.
If start or end is omitted (e.g. '*'), the function should continue to traverse until no more results are found or the limit is reached.
 
Examples:
<ul>
<li>traverse(start, *, 0) (downstream)</li>
<li>traverse(*, end, 0) (upstream)</li>
<li>traverse(start, end, 0) (find all elements in between)</li>
<li>traverse(*, end, 0, softmean) (upstream excluding nodes after softmean)</li>
</ul>
 
There should also be a way of checking whether two nodes are related to each other:
<ul>
<li>related(start, end, limit) (returns true if a path exists between the nodes)</li>
</ul>
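A minimal sketch of traverse() and related() as a breadth-first search over an adjacency mapping, following the semantics defined above. This is an assumption about one possible implementation: it covers the downstream direction (for upstream queries the reversed mapping would be passed), and handling of '*' as the start point is omitted.

```python
from collections import deque

def traverse(neighbors, start, end="*", limit=0, stop=()):
    """Return all nodes reachable from `start`, at most `limit` steps away
    (0 = unlimited), not exploring past `end` or any node in `stop`."""
    stop = set(stop)
    seen, frontier, result = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if limit and depth >= limit:
            continue                      # reached the step limit on this path
        for nxt in neighbors.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            result.append(nxt)
            if nxt == end or nxt in stop:
                continue                  # do not explore past 'end'/'stop'
            frontier.append((nxt, depth + 1))
    return result

def related(neighbors, start, end, limit=0):
    # True if a path of at most `limit` steps (0 = unlimited) exists.
    return end in traverse(neighbors, start, end, limit)

edges = {"a": ["b"], "b": ["c", "d"], "d": ["e"]}
traverse(edges, "a")       # -> ["b", "c", "d", "e"] (downstream)
related(edges, "a", "e")   # -> True
```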
 
== Implementation ==
 
The implementation is done in Python. Currently the API has been partially implemented for [http://www.w3.org/XML/ XML] using [http://www.w3.org/TR/xpath XPath] and for [http://www.w3.org/RDF/ RDF] using [http://www.w3.org/TR/rdf-sparql-query/ SPARQL].
 
There are currently three ways to implement the transitive closure:
 
<OL>
<LI>Natively in the query processor (not implemented yet)
<LI>As an extra graph structure, as described below
<LI>In Python, using basic relational queries on the source (very slow)
</OL>
 
For small data stores any of these is feasible.
For large data stores, (2) may not be possible, and if the query engine does not support (1), the only remaining choice is (3), which can be very slow.
We expect that a mature provenance store would either support (1) natively or implement its own version of (2).
 
For query processors that are not capable of transitive closure (like the ones above), we have implemented a graph structure. During initialization it loads the transitive relations from the data store; all upstream/downstream queries are then directed to that graph structure for fast computation.
 
== System overview ==
 
The structure of the system can be summarized in the figure below.
 
[[Image:pc_system.png]]
 
<b>PQObject</b> represents a node in the provenance data by storing its id and namespace (its PId). It can call methods in <b>PQueryFactory</b> to traverse the data as a graph.
<b>PQueryFactory</b> contains the sources and forwards queries to the correct source by comparing namespaces.
 
All data sources inherit from <b>PStore</b>. It contains functions implementing data aliases and execution dependencies across sources, and routes queries to the right functions. It also provides access to <b>PGraph</b>, which implements a structure for storing and querying transitive relations. This structure can be used if the query engine does not support efficient querying of transitive closure; it works by loading the transitive data during initialization and then directing all transitive queries to it. It is thus not suited for very large data sets.
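The namespace dispatch described above can be sketched as follows. The class names mirror the figure, but the method names and the representation of a PId as a (namespace, local id) pair are assumptions made for this illustration.

```python
# Minimal sketch of PQueryFactory routing queries to sources by namespace.
class PStore:
    def __init__(self, namespace):
        self.namespace = namespace

    def query(self, pid):
        # Overridden by concrete stores (XML, RDF, ...).
        raise NotImplementedError

class PQueryFactory:
    def __init__(self, stores):
        self._stores = {s.namespace: s for s in stores}

    def query(self, pid):
        # A PId is assumed to be (namespace, local id); route to the
        # source whose namespace matches.
        namespace, _local = pid
        return self._stores[namespace].query(pid)

class EchoStore(PStore):
    # Toy concrete store that just labels the id with its namespace.
    def query(self, pid):
        return f"{self.namespace}:{pid[1]}"

factory = PQueryFactory([EchoStore("taverna"), EchoStore("southampton")])
factory.query(("taverna", "d42"))   # -> "taverna:d42"
```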
 
There are currently two classes implementing access to specific data formats:
<b>XMLStore</b> implements loading of an XML data file and provides access through XPath.
<b>RDFStore</b> implements access to an RDF server using SPARQL.
Other data sources, e.g. a relational database, can be implemented in a similar way.
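For illustration, an XMLStore-style lookup can be sketched with the XPath subset supported by Python's standard library. The XML schema below is invented for this example and does not match any of the challenge data files.

```python
import xml.etree.ElementTree as ET

# Toy execution log; element and attribute names are hypothetical.
doc = ET.fromstring("""
<log>
  <execution id="e1"><output data="atlas-x.gif"/></execution>
  <execution id="e2"><output data="atlas-y.gif"/></execution>
</log>""")

def getExecutionFromOutData(root, data_id):
    # Find the execution whose <output> element references data_id,
    # using an ElementTree attribute-predicate XPath expression.
    for ex in root.findall("execution"):
        if ex.find(f"output[@data='{data_id}']") is not None:
            return ex.get("id")
    return None

getExecutionFromOutData(doc, "atlas-y.gif")   # -> "e2"
```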
 
The figure shows <b>TavernaStore</b> and <b>TavernaRDFStore</b>, which both implement the API for the [http://twiki.ipaw.info/bin/view/Challenge/MyGrid2 MyGrid] team's data files. The MyGrid Taverna data is available in both XML and RDF, which makes it possible to process it using both XPath and SPARQL. Having two versions makes it easy to compare implementations without having to worry about different data formats.
 
Another partially implemented XML source is the [http://twiki.ipaw.info/bin/view/Challenge/Southampton2 Southampton] team's data.
 
 
 
--[[User:Tommy|Tommy]] 03:40, 13 April 2007 (MDT)--
