Towards answering the question: what will provenance data in myGrid look like?
Background
ProvenanceIF5 provides the context for this page, which is motivated by the ProvenanceData contribution to the InformationModel.
At the basic level, provenance involves recording information about the origin of a data value.
Assumption 1 - All data within the myGrid system comes from somewhere. It may be the result of a workflow enactment or an invocation of the DQP, or it may have been typed in by a user, copied from an external file system, etc.
A user may wish to select a piece of data within myGrid and ask for its provenance. Similarly, at several points a myGrid component may want to say: here is a new value, and this is its provenance. The current thinking (IntegrationFest5, September 2003) is that the mIR will provide the implementation for a provenance service, with a provenance storage interface and a provenance query interface.
Assumption 2 - Where data is the result of a program or service, it must be possible to record provenance data from both the perspective of the invoker and the perspective of the program or service itself.
Draft Provenance Schema
A first draft of the provenance types within the overall information model is in the CVS module infomodel, in the file myGridComplexTypes_Provenance.xsd.
ProvenanceDiagrams gives a top-level view of the key aspects.
ProvDatabaseDesign shows the design of an example provenance database from [1] (see http://www.pasoa.org/papers.html).
[1] Martin Szomszor and Luc Moreau. Recording and reasoning over data provenance in web and grid services. In International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer Science, pages 603-620, Catania, Sicily, Italy, November 2003.
Examples
Example 1
Suppose that I have a data item
XYZ in the mIR and I want to know its provenance.
One answer would be to say that it is the output of a workflow invocation, and to point to the workflow provenance record that has XYZ as the output named blah.
provenance from workflow service client
Another answer is that:
- the myGrid run_workflow wizard was invoked
- on date
some date
- by user
myGrid user
- and used the enactor service at endpoint
enactor id
- with inputs
- workflow description
LSID of workflow description
- input named X
LSID of input named X
- input named Y
LSID of input named Y
- ...
- and produced outputs
- output named R1
LSID of output named R1 , value XYZ
- ...
- in context
some context description
- (somewhere to put any why information that is available)
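The invoker-side answer listed above can be sketched as a simple record type. This is only an illustration of the shape of the data, not the draft schema: the class name and all field names are hypothetical, and the placeholder strings are the ones used in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InvocationProvenance:
    """Invoker-side provenance record (hypothetical class and field names)."""
    invoked: str            # what was invoked, e.g. the run_workflow wizard
    date: str               # when it was invoked
    user: str               # the myGrid user who invoked it
    service_endpoint: str   # the enactor service endpoint that was used
    inputs: Dict[str, str] = field(default_factory=dict)   # input name -> LSID
    outputs: Dict[str, str] = field(default_factory=dict)  # output name -> LSID
    context: Optional[str] = None   # context description, if any
    why: Optional[str] = None       # any "why" information that is available


# The placeholders from the list above, filled in as-is.
record = InvocationProvenance(
    invoked="run_workflow wizard",
    date="some date",
    user="myGrid user",
    service_endpoint="enactor id",
    inputs={
        "workflow description": "LSID of workflow description",
        "X": "LSID of input named X",
        "Y": "LSID of input named Y",
    },
    outputs={"R1": "LSID of output named R1"},
    context="some context description",
)
```

Inputs and outputs map names to LSIDs rather than to values, so the record refers to specific data items in the mIR.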
Note that it is important to identify the specific data item that was used for an input, not just some data item whose value was ABC. There may be several data items in the mIR with the same value but different metadata (they may have been produced by different activities). In addition, we want the provenance store to be able to answer queries such as "where has this data been used?".
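A minimal sketch of why identity matters for a "where has this data been used" query, assuming usage is logged per LSID. The log format, the function names and the LSID strings are hypothetical placeholders, not part of the mIR interface.

```python
from collections import defaultdict

# Hypothetical usage log: (activity, input name, LSID of the item used).
usage_log = [
    ("run_workflow #1", "X", "urn:lsid:example.org:data:1"),
    ("run_workflow #2", "X", "urn:lsid:example.org:data:2"),
    ("annotate #1", "item", "urn:lsid:example.org:data:1"),
]

# Two distinct data items that happen to hold the same value.
values = {
    "urn:lsid:example.org:data:1": "ABC",
    "urn:lsid:example.org:data:2": "ABC",
}


def build_usage_index(log):
    """Index the log by LSID so that usage can be looked up per data item."""
    index = defaultdict(list)
    for activity, input_name, lsid in log:
        index[lsid].append((activity, input_name))
    return index


def where_used(index, lsid):
    """Return the (activity, input name) pairs that used this exact item."""
    return list(index.get(lsid, []))


index = build_usage_index(usage_log)
first_item_usage = where_used(index, "urn:lsid:example.org:data:1")
second_item_usage = where_used(index, "urn:lsid:example.org:data:2")
```

Both items have the value ABC, but the two lookups return different usage histories. A query keyed on the value alone could not tell them apart.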
There needs to be a way of linking the provenance information from the perspective of the invoker (run workflow wizard above) and the provenance information from the perspective of the invoked service.
provenance from workflow service invoked
The workflow provenance record includes information about its invocation (the parameters that it was passed and the results it returned) and information about the services that it invoked.
For a detailed description of a workflow provenance record see
GDProvenanceExample.
The elements in the current workflow provenance XML are:
- workflowID
- the unique identifier for the workflow instance given by the workflow enactment engine
- workflowStatus
- COMPLETE or FAILED or RUNNING
- startTime
- the workflow instance start time
- endTime
- the workflow instance end time
- userID
- the user identifier given to the workflow enactment engine - part of the inputs to the workflow enactor
- xscuflDefinition
- the workflow definition in XScufl - part of the inputs to the workflow enactor
- workflowInput
- the workflow input data given to the enactment engine - part of the inputs to the workflow enactor
(described in detail in
GDProvenanceExample)
- workflowOutput
- the workflow output data (described in detail in GDProvenanceExample)
- processorList
- the set of service invocation provenance records - these describe the provenance of the invoked services from the perspective of the enacting workflow
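As a rough illustration of how these elements might fit together, the sketch below assembles a record with Python's standard xml.etree.ElementTree. The top-level element names come from the list above, with the user identifier written as userID on the assumption that the name matches its description. The root element name, the child elements data and processor, and the flat nesting are assumptions made for this sketch; GDProvenanceExample describes the real structure.

```python
import xml.etree.ElementTree as ET

VALID_STATUSES = ("COMPLETE", "FAILED", "RUNNING")


def build_workflow_provenance(workflow_id, status, start_time, end_time,
                              user_id, xscufl, inputs, outputs, processors):
    """Build a workflow provenance record as an ElementTree element."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected workflowStatus: {status}")

    root = ET.Element("workflowProvenance")  # hypothetical root element name
    simple_elements = [
        ("workflowID", workflow_id),
        ("workflowStatus", status),
        ("startTime", start_time),
        ("endTime", end_time),
        ("userID", user_id),
        ("xscuflDefinition", xscufl),
    ]
    for tag, text in simple_elements:
        ET.SubElement(root, tag).text = text

    # Inputs and outputs are recorded by name and LSID ("data" is hypothetical).
    for tag, items in (("workflowInput", inputs), ("workflowOutput", outputs)):
        parent = ET.SubElement(root, tag)
        for name, lsid in items.items():
            ET.SubElement(parent, "data", name=name).text = lsid

    # One entry per invoked service ("processor" is hypothetical).
    processor_list = ET.SubElement(root, "processorList")
    for processor_name in processors:
        ET.SubElement(processor_list, "processor", name=processor_name)
    return root


root = build_workflow_provenance(
    workflow_id="workflow instance id",
    status="COMPLETE",
    start_time="start time",
    end_time="end time",
    user_id="myGrid user",
    xscufl="XScufl definition",
    inputs={"X": "LSID of input named X"},
    outputs={"R1": "LSID of output named R1"},
    processors=["some service"],
)
xml_text = ET.tostring(root, encoding="unicode")
```

The status check mirrors the three values listed for workflowStatus. Everything else is passed through as plain text.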
Example 2
Provenance of user editing a value in the mIR (e.g. revising a workflow description)
TO BE DONE
Example 3
Provenance of user directly invoking a service
TO BE DONE
Example 4
Provenance of user annotating a value in the mIR
invoker perspective:
- the myGrid annotate wizard was invoked
- there may be several of these?
- on date
some date
- by user
myGrid user
- (Is automatic annotation done on behalf of a user? How do we identify whether it is automatic or manual?)
- with inputs
- the LSID of the data item that is being annotated
- and produced outputs
- output LSIDs of annotations (and local mIR names)
- in context
some context description
Example 5
Provenance of user adding a value to the mIR
invoker perspective:
- the myGrid upload wizard was invoked
- on date
some date
- by user
myGrid user
- with inputs
- and produced outputs
- output LSIDs of uploaded values (and local mIR names?)
- in context
some context description
In this situation it is unlikely that there will be useful information available from the upload wizard itself. Is the version of the upload wizard important? Does it use any other services itself?
Caveat
Of course, a particular myGrid setup might choose not to record all possible provenance information. However, it is important that any generic schemas for provenance data in myGrid are flexible enough to cope with records in which some of that information is missing.
Notes (do not forget)
- provenance in the event of failure
- choice and iteration in workflow provenance
- nested workflows
- links to provenance questions
Related Pages
ProvenanceIF5, ProvenanceData
--
MarkGreenwood - 01 Oct 2003