Questions
1 Is there going to be provenance information about all data in the myGrid repository (personal repository)? For example: this piece of data was URI xxx, downloaded by user yyy on date zzz.
2 Are we going to have non-repudiation? The WP2 provenance position statement takes this as essential. In the thoughts below I have been thinking of provenance as any metadata that indicates how results have been produced.
3 What uses do we envisage for the provenance data?
Thoughts from a workflow viewpoint
This is a biased view of provenance within myGrid. It is based on looking at the provenance records generated by the workflow enactment engine for the pre-prototype (and 0.1). (For an example see the attachments below.)
Identifiers
I think the most significant issue is how we deal with identifiers. Provenance fundamentally depends on the ability to attach an identity to an object. (Indeed this applies generally to annotating data with metadata.)
At the moment the provenance records generated by the workflow enactment engine have identity (ID) fields, but they are always given dummy values. This is because we don't currently have a way of deciding appropriate identifiers.
workflow input identification
Take an example data input, the swissprot ID "P04637". This could be just a swissprot ID; that is, it could have an identifier org.swissprot.protein#P04637 which has the value "P04637". On the other hand, it could be the result of some previous experiment: it could have an identifier mybestexperiment.run1.result7 which has the value "P04637" and an annotation to say that it is a swissprot ID.
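To make the distinction concrete, here is a minimal Python sketch of the two cases. The class name IdentifiedValue, its fields and the "type" annotation key are my own assumptions for illustration; nothing like this exists in the enactment engine yet.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifiedValue:
    """A data item: an identifier, its value, and free-form annotations.

    Hypothetical structure for illustration only.
    """
    identifier: str
    value: str
    annotations: dict = field(default_factory=dict)


# Case 1: the input is "just" a swissprot ID.
plain = IdentifiedValue("org.swissprot.protein#P04637", "P04637")

# Case 2: the same value, but identified as the result of an earlier
# experiment and annotated to say that it is a swissprot ID.
derived = IdentifiedValue(
    "mybestexperiment.run1.result7",
    "P04637",
    annotations={"type": "swissprot ID"},
)


def same_value_different_identity(a: IdentifiedValue, b: IdentifiedValue) -> bool:
    """True when two items carry the same value under different identifiers."""
    return a.value == b.value and a.identifier != b.identifier
```

The point of the sketch is that provenance attaches to the identifier, not to the value: both items hold "P04637", but only the second can be traced back to an experiment run.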
workflow result identification
The workflow result data will also need an identifier (rather than the current default -1). Is this just an identifier for the personal repository which holds the result data?
There must be an annotation for this identifier identifying the workflow provenance record. If there is an intermediate result, then it should be possible to give it a corresponding identifier and provenance record (up to the provenance result). Perhaps this should also have the additional annotation to indicate the workflow of which it is an intermediate part.
service identification
The WSFL workflow is composed of activities which are mapped onto specific operations provided by a web service. The mapping is between a WSFL service provider, which can provide several activities, and a web service. This mapping is done either statically, where the WSFL service provider is mapped to a specific service WSDL, or dynamically, where the WSFL service provider indicates the UDDI(-M?) request and the selection policy. For the static case the web service operation is identified by the WSDL file, the service port and operation name (and the operation input message name if the port has overloaded operations).
This may need some revision depending on how myGrid chooses to identify services. It is possible for a service to have multiple WSDL files describing it. In addition, it is possible to mirror services. I expect that we should look at how the I3C deals with mirrors of a database such as swissprot.
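As a rough sketch of the static case described above, the operation identity can be modelled as a small immutable record. The class name, the string layout produced by as_string, and the example.org WSDL URL, port and operation names are all hypothetical; they are not taken from the enactment engine or from any real service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationId:
    """Identity of a statically bound web service operation (hypothetical)."""
    wsdl: str                            # location of the WSDL file
    port: str                            # service port name
    operation: str                       # operation name
    input_message: Optional[str] = None  # only needed for overloaded operations

    def as_string(self) -> str:
        """Render the identity as one string (layout is an assumption)."""
        text = f"{self.wsdl}#{self.port}.{self.operation}"
        if self.input_message is not None:
            text += f"({self.input_message})"
        return text


# Two overloads of the same operation on the same port are distinguished
# only by the input message name.
search = OperationId("http://example.org/blast.wsdl", "BlastPort", "search")
search_by_id = OperationId(
    "http://example.org/blast.wsdl", "BlastPort", "search", "searchByIdRequest"
)
```

Note that this identity is tied to one WSDL file, so it does not address the problems raised above: a service described by several WSDL files, or a mirrored service, would get several unrelated identities.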
workflow description identification
The workflow description itself is data. It should have an identity so that it can be annotated with metadata: who wrote it, when, whether it is based on an earlier workflow description, and so on. It could be very useful to have a provenance record for a workflow description: who edited it, when, etc.
Service Invocation provenance
The 0.1 portal provided the ability for a user to directly invoke a service. This could generate a service provenance record to link the input and result. If a user does several such direct invocations then we want to be able to traverse the graph of service provenance records and create a corresponding workflow.
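One possible way to do the traversal, sketched in Python under the assumption that every input and result already carries a usable identifier (which is exactly the open issue above). The record layout, the data identifiers and the service names are invented for illustration. Records are linked wherever the result of one invocation is the input of another, and the links are then ordered with the standard library's graphlib (Python 3.9 or later).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class InvocationRecord:
    """A service provenance record linking input and result identifiers."""
    record_id: str
    service: str
    inputs: tuple
    outputs: tuple


def derive_workflow(records):
    """Return (execution order, dependencies) implied by the records.

    An invocation depends on another when it consumes an identifier
    that the other invocation produced.
    """
    producer = {out: r.record_id for r in records for out in r.outputs}
    deps = {
        r.record_id: {producer[i] for i in r.inputs if i in producer}
        for r in records
    }
    order = list(TopologicalSorter(deps).static_order())
    return order, deps


# Three direct invocations, deliberately listed out of order.
records = [
    InvocationRecord("inv2", "blast", ("data:seq1",), ("data:hits1",)),
    InvocationRecord("inv1", "fetchSequence", ("data:id1",), ("data:seq1",)),
    InvocationRecord("inv3", "align", ("data:hits1",), ("data:aln1",)),
]
order, deps = derive_workflow(records)
```

Here order lists the invocations so that each one follows the invocation that produced its input, and deps gives the links that would become the data flow of the corresponding workflow. Inputs with no producer (data:id1 here) would be the workflow inputs.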
Concrete Provenance Examples
still to do
- look at TalismanRAD provenance
- look at LSID from I3C
- identification of Grid services
--
MarkGreenwood - 28 Nov 2002
Attachments