r4 - 12 Dec 2003 - 10:32:17 - MarkGreenwood
Towards answering the question: what will provenance data in myGrid look like?

Background

ProvenanceIF5 provides the context for this page, which is motivated by the ProvenanceData contribution to the InformationModel.

At the basic level, provenance involves recording information about the origin of a data value.

Assumption 1 - All data within the myGrid system comes from somewhere. It may be the result of a workflow enactment or an invocation of the DQP, or it may have been typed in by a user, copied from an external file system, etc.

A user may wish to select a piece of data within myGrid and ask for its provenance. Similarly, at several points a myGrid component may want to say: here is a new value, and this is its provenance. The current thinking (IntegrationFest5, September 2003) is that the mIR will provide the implementation of a provenance service, with a provenance storage interface and a provenance query interface.
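As a rough illustration of the storage/query split described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the method names `record` and `provenance_of`, the dict-shaped records and the example LSID are hypothetical, not the mIR API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProvenanceStore(ABC):
    """Hypothetical storage interface: 'here is a new value and this is its provenance'."""

    @abstractmethod
    def record(self, data_id: str, provenance: dict) -> None:
        ...


class ProvenanceQuery(ABC):
    """Hypothetical query interface: 'what is the provenance of this data item?'"""

    @abstractmethod
    def provenance_of(self, data_id: str) -> List[dict]:
        ...


class InMemoryProvenanceService(ProvenanceStore, ProvenanceQuery):
    """Toy stand-in for the service; a real one would sit on top of the mIR."""

    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = {}

    def record(self, data_id: str, provenance: dict) -> None:
        # Copy the record so that later changes by the caller do not alter the store.
        self._records.setdefault(data_id, []).append(dict(provenance))

    def provenance_of(self, data_id: str) -> List[dict]:
        # Unknown identifiers yield an empty list rather than an error.
        return list(self._records.get(data_id, []))


service = InMemoryProvenanceService()
service.record("urn:lsid:example.org:data:XYZ", {"origin": "workflow enactment"})
print(service.provenance_of("urn:lsid:example.org:data:XYZ"))
```

Keeping the two interfaces separate means a component that only reports provenance never needs to see the query side, and vice versa.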

Assumption 2 - Where data is the result of a program or service, it must be possible to record provenance data both from the perspective of the invoker and from the perspective of the program or service itself.

Draft Provenance Schema

A first draft of the provenance types within the overall information model is in the CVS module infomodel, file myGridComplexTypes_Provenance.xsd.

ProvenanceDiagrams gives a top-level view of the key aspects.

ProvDatabaseDesign shows the design of an example provenance database from [1] (see http://www.pasoa.org/papers.html).

[1] Martin Szomszor and Luc Moreau. Recording and reasoning over data provenance in web and grid services. In International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer Science, pages 603-620, Catania, Sicily, Italy, November 2003.

Examples

Example 1

Suppose that I have a data item XYZ in the mIR and I want to know its provenance.

One answer would be that it is the output of a workflow invocation, and here is the workflow provenance record that has XYZ as its output named blah.

provenance from workflow service client

Another answer is that:

  • the myGrid run_workflow wizard was invoked
  • on date some date
  • by user myGrid user
  • and used the enactor service at endpoint enactor id
  • with inputs
    • workflow description LSID of workflow description
    • input named X LSID of input named X
    • input named Y LSID of input named Y
    • ...
  • and produced outputs
    • output named R1 LSID of output named R1 , value XYZ
    • ...
  • in context some context description
    • (somewhere to put any why information that is available)

Note that it is important to identify the specific data item that was used for an input, not just some data item whose value was ABC. There may be several data items in the mIR with the same value but different metadata (they may have been produced by different activities). In addition, we want the provenance store to be able to answer queries such as: where has this data been used?
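The invoker-side record and the "where has this data been used" query can be sketched as follows. This is only an illustration in Python: the `InvocationRecord` class, its field names, the `where_used` helper and the example LSIDs are all hypothetical and are not taken from the draft schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InvocationRecord:
    """Hypothetical invoker-side record mirroring the bullet list above."""
    tool: str                     # e.g. the run_workflow wizard
    date: str
    user: str
    endpoint: str                 # the enactor service endpoint that was used
    inputs: Dict[str, str] = field(default_factory=dict)   # input name -> LSID
    outputs: Dict[str, str] = field(default_factory=dict)  # output name -> LSID
    context: str = ""             # any "why" information that is available


def where_used(records: List[InvocationRecord], lsid: str) -> List[InvocationRecord]:
    """Return the invocations that consumed the data item with this LSID."""
    return [r for r in records if lsid in r.inputs.values()]


# Two data items may hold the same value (say ABC) yet have different LSIDs,
# so usage is tracked by identifier and never by value.
first = InvocationRecord(
    tool="run_workflow wizard", date="some date", user="some user",
    endpoint="enactor id",
    inputs={"X": "urn:lsid:example.org:data:1"},
    outputs={"R1": "urn:lsid:example.org:data:3"},
)
second = InvocationRecord(
    tool="run_workflow wizard", date="some date", user="some user",
    endpoint="enactor id",
    inputs={"X": "urn:lsid:example.org:data:2"},
)

used_by = where_used([first, second], "urn:lsid:example.org:data:1")
print([r.outputs for r in used_by])
```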

There needs to be a way of linking the provenance information from the perspective of the invoker (run workflow wizard above) and the provenance information from the perspective of the invoked service.
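One possible way of making this link, sketched in Python purely as an assumption, is for the invoker-side record to keep the identifier that the invoked service assigned to the run (for a workflow, its workflowID) and to use that as a join key. The record shapes, the identifiers and the `both_perspectives` helper below are hypothetical; the page does not say how the link should be made.

```python
from typing import Dict, Optional, Tuple

# Invoker-side records (hypothetical shape), keyed by an arbitrary record id.
# Assumption: the enactor returns its workflowID to the invoker, which stores it.
invoker_records: Dict[str, dict] = {
    "inv-1": {"tool": "run_workflow wizard", "user": "some user", "workflowID": "wf-0001"},
}

# Invoked-side workflow provenance records, keyed by workflowID.
workflow_records: Dict[str, dict] = {
    "wf-0001": {"workflowStatus": "COMPLETE"},
}


def both_perspectives(invocation_id: str) -> Tuple[dict, Optional[dict]]:
    """Return the invoker record and the matching invoked-side record, if one was stored."""
    invoker = invoker_records[invocation_id]
    invoked = workflow_records.get(invoker["workflowID"])
    return invoker, invoked


invoker, invoked = both_perspectives("inv-1")
print(invoker["tool"], invoked)
```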

provenance from workflow service invoked

The workflow provenance record includes information about its invocation (the parameters that it was passed and the results it returned) and information about the services that it invoked.

For a detailed description of a workflow provenance record see GDProvenanceExample.

The elements in the current workflow provenance XML are:

  • workflowID: the unique identifier for the workflow instance, given by the workflow enactment engine
  • workflowStatus: COMPLETE, FAILED or RUNNING
  • startTime: the workflow instance start time
  • endTime: the workflow instance end time
  • usedID: the user identifier given to the workflow enactment engine - part of the inputs to the workflow enactor
  • xscuflDefinition: the workflow definition in XScufl - part of the inputs to the workflow enactor
  • workflowInput: the workflow input data given to the enactment engine - part of the inputs to the workflow enactor (described in detail in GDProvenanceExample)
  • workflowOutput: the workflow output data (described in detail in GDProvenanceExample)
  • processorList: the set of service invocation provenance records - these describe the provenance of the invoked services from the perspective of the enacting workflow
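To show how a client might read such a record, here is a small Python sketch using the standard library XML parser. The element names come from the list above, but the root element name `workflowProvenance`, the `processor` child elements, the flat nesting and all sample values are assumptions; GDProvenanceExample describes the real layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; only the element names listed above come from this page.
SAMPLE = """
<workflowProvenance>
  <workflowID>wf-0001</workflowID>
  <workflowStatus>COMPLETE</workflowStatus>
  <startTime>2003-10-01T10:00:00</startTime>
  <endTime>2003-10-01T10:05:00</endTime>
  <processorList>
    <processor>serviceA</processor>
    <processor>serviceB</processor>
  </processorList>
</workflowProvenance>
"""

VALID_STATUSES = {"COMPLETE", "FAILED", "RUNNING"}


def summarise(xml_text: str) -> dict:
    """Extract the top-level fields of a workflow provenance record."""
    root = ET.fromstring(xml_text.strip())
    status = root.findtext("workflowStatus")
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected workflowStatus: {status!r}")
    processors = root.find("processorList")
    return {
        "workflowID": root.findtext("workflowID"),
        "workflowStatus": status,
        "startTime": root.findtext("startTime"),
        "endTime": root.findtext("endTime"),
        # One child element per service invocation record (assumed layout).
        "processorCount": 0 if processors is None else len(processors),
    }


print(summarise(SAMPLE))
```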

Example 2

Provenance of user editing a value in the mIR (e.g. revising a workflow description)

TO BE DONE

Example 3

Provenance of user directly invoking a service

TO BE DONE

Example 4

Provenance of user annotating a value in the mIR

invoker perspective:

  • the myGrid annotate wizard was invoked
    • there may be several of these?
  • on date some date
  • by user myGrid user
    • (Is automatic annotation done on behalf of a user? How do we identify whether it is automatic or manual?)
  • with inputs
    • the LSID of the data item that is being annotated
  • and produced outputs
    • output LSIDs of annotations (and local mIR names)
  • in context some context description
    • (what might go here?)

Example 5

Provenance of user adding a value to the mIR

invoker perspective:

  • the myGrid upload wizard was invoked
  • on date some date
  • by user myGrid user
  • with inputs
    • the actual values???
  • and produced outputs
    • output LSIDs of uploaded values (and local mIR names?)
  • in context some context description

In this situation it is unlikely that there will be useful information available from the upload wizard itself. Is the version of the upload wizard important? Does it use any other services itself?

Caveat

Of course, a particular myGrid setup might choose not to record all possible provenance information. It is therefore important that any generic schemas for provenance data in myGrid are flexible enough to cope with incomplete records.

Notes (do not forget)

  • provenance in the event of failure
  • choice and iteration in workflow provenance
  • nested workflows
  • links to provenance questions

Related Pages

Related pages - ProvenanceIF5, ProvenanceData

-- MarkGreenwood - 01 Oct 2003
