Chris Greenhalgh's Speculations on Provenance and Metadata
Following the AG meeting today (29 Nov 2002)...
Current state (29 Nov 2002)
We currently have two (tending to three) cases of 'activity-oriented provenance' recording:
- Workflow Enactment Engine's XML document, placed in the personal repository, identifying the services invoked (by e.g. URL) and the input and output data.
- Talisman's log XML document, recording events and actions executed by Talisman pages.
- In principle the gateway should do something like the WFE for direct invocations of individual services.
Note that in each case no additional provenance information is obtained from the end services used (e.g. blast, soaplab/emboss); therefore no additional end service state information can be recorded (e.g. database version).
We then have a few cases of 'data-oriented provenance':
- The personal repository maintains associations (INPUT, OUTPUT) between domain entities and the WFE instances with which they were used (limited to cases where all are in the same repository).
- The personal repository (version 2) also has some other meta-data for domain entities which may serve for provenance, e.g. owner, date created, plus limited other meta-data, e.g. concept type.
These facilities are not currently user-extensible.
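As a rough illustration of how the two kinds of record described above relate, here is a minimal sketch. The class names, field names and the example URL are all hypothetical; the actual WFE and Talisman logs are XML documents whose schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ActivityRecord:
    """Activity-oriented provenance: one invocation (hypothetical shape)."""
    activity_id: str
    service_url: str          # service identified by e.g. URL
    inputs: List[str]         # ids of input data items
    outputs: List[str]        # ids of output data items


@dataclass
class PersonalRepository:
    """Data-oriented provenance: associations plus simple metadata."""
    # (entity_id, role, activity_id); role is "INPUT" or "OUTPUT"
    associations: List[Tuple[str, str, str]] = field(default_factory=list)
    # entity_id -> {"owner": ..., "date_created": ..., "concept_type": ...}
    metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def record(self, activity: ActivityRecord) -> None:
        """Store INPUT/OUTPUT associations for one activity."""
        for entity in activity.inputs:
            self.associations.append((entity, "INPUT", activity.activity_id))
        for entity in activity.outputs:
            self.associations.append((entity, "OUTPUT", activity.activity_id))

    def activities_for(self, entity_id: str) -> List[Tuple[str, str]]:
        """Return (role, activity_id) pairs for a domain entity."""
        return [(role, act) for (ent, role, act) in self.associations
                if ent == entity_id]


repo = PersonalRepository()
repo.record(ActivityRecord("wfe-1", "http://example.org/blast",
                           inputs=["seq1"], outputs=["hits1"]))
repo.metadata["hits1"] = {"owner": "chris", "date_created": "2002-11-29"}
```

Note that nothing here comes from the end service itself, which mirrors the limitation noted above: state such as the database version has nowhere to come from.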
--
ChrisGreenhalgh - 29 Nov 2002
Next steps?
Possibilities:
- Use a service metadata interface (e.g. GridService) to expose aspects of 'objective' (client- and invocation-independent) internal state, such as database version. Extend e.g. WFE, Talisman, Gateway to query and record this information in the corresponding activity logs.
- Add additional getProvenanceInfo operation(s) to 'extended invocation' port types (including the WFE interface and the SoapLab/OpenBSA interface), that return invocation-specific provenance information. For the WFE this might be the current provenance XML file. For SoapLab this might include default argument values used, database and executable versions, etc. As above, have WFE, Talisman and Gateway query and record this information.
- Extend existing provenance log formats to have space for user annotations (e.g. 'why'). Expose these at interfaces for (optional) completion.
- Do some work on pretty output of existing and planned logs.
- Think more generally about how this activity record-style provenance can or should be made available downstream to the same or other users of the resulting data. Is the provenance just the (transitive closure of all) activity logs that contribute to that data? I doubt it...
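To make the 'transitive closure of activity logs' question in the last item concrete, here is a minimal sketch of what that closure would contain. It assumes each log reduces to a set of input and output data ids; the function name and the dictionary layout are hypothetical and are not the WFE or Talisman log format.

```python
from typing import Dict, List, Set


def contributing_activities(target: str,
                            logs: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Return ids of all activities that (transitively) contributed to target.

    logs maps activity_id -> {"inputs": [...], "outputs": [...]}.
    """
    # Index: data item -> activities that produced it.
    producers: Dict[str, List[str]] = {}
    for act_id, log in logs.items():
        for out in log["outputs"]:
            producers.setdefault(out, []).append(act_id)

    found: Set[str] = set()
    seen_data: Set[str] = {target}
    pending: List[str] = [target]
    while pending:
        item = pending.pop()
        for act_id in producers.get(item, []):
            if act_id in found:
                continue
            found.add(act_id)
            # Walk back through this activity's inputs.
            for inp in logs[act_id]["inputs"]:
                if inp not in seen_data:
                    seen_data.add(inp)
                    pending.append(inp)
    return found


logs = {
    "wf1": {"inputs": ["seq1"], "outputs": ["hits1"]},
    "wf2": {"inputs": ["hits1"], "outputs": ["align1"]},
    "wf3": {"inputs": ["other"], "outputs": ["unrelated"]},
}
closure = contributing_activities("align1", logs)
```

This yields only the set of logs. It says nothing about service state, user annotations, or logs that are inaccessible, which may be part of why the closure alone seems insufficient.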
--
ChrisGreenhalgh - 29 Nov 2002
(older) A few thoughts/speculations on provenance
I have attached the metadata inventory to the ProvenanceData page; this shows currently maintained information in the Personal Repository v2 and the Workflow enactment log as of IF-2. It is in more-or-less triple form, i.e. like RDF.
At the moment I don't make a strong distinction between provenance and general metadata. I assume there are things (o.k., call them 'resources'), about which there is various descriptive metadata, including various associations or relationships between things. Provenance information includes both, e.g. author and date of a single resource, and derivation relationships, e.g. through workflow execution or service invocation. This forms a large (in general, effectively infinite) graph.
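A minimal sketch of this view, with both descriptive metadata and derivation relationships held as (subject, property, object) triples in one graph. The property names (`owner`, `dateCreated`, `derivedFrom`) and the resource ids are assumptions for illustration, not the vocabulary of the attached inventory.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)

triples: List[Triple] = [
    # Descriptive metadata about single resources.
    ("hits1", "owner", "chris"),
    ("hits1", "dateCreated", "2002-11-27"),
    # Relationships between resources (derivation).
    ("hits1", "derivedFrom", "seq1"),
    ("align1", "derivedFrom", "hits1"),
]


def match(store: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


about_hits1 = match(triples, s="hits1")        # everything known about hits1
derivations = match(triples, p="derivedFrom")  # all derivation edges
```

Because both kinds of statement share one shape, the same query serves for general metadata and for provenance, which is the sense in which no strong distinction is drawn.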
I regard workflow definitions as stereotypes for subgraphs.
I wonder if the core of the workflow enactment engine could/should be used more flexibly to manage this graph of relationships (actual and potential) more dynamically.
I presume that the user will wish to browse, search and reason over this graph.
I presume that this graph will be distributed, and may not all be accessible.
I note that garbage collection (and determination) can be problematic.
I regard the 'lab book' presentation as a user-directed linearisation of a portion of this graph, with some imposed narrative, for the purposes of documentation, presentation, accountability management, and so on.
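One possible reading of 'linearisation', sketched below: the user selects some resources, the selection is ordered so that each item follows what it was derived from, and the user's notes supply the narrative. The use of a topological sort is an assumption made for this sketch (a user might well choose another order), and `lab_book`, `derived_from` and `notes` are hypothetical names. It uses `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def lab_book(selected: Set[str],
             derived_from: Dict[str, Set[str]],
             notes: Dict[str, str]) -> List[str]:
    """Linearise the selected portion of the graph into numbered entries."""
    # Restrict the derivation graph to the user's selection.
    sub = {n: {d for d in derived_from.get(n, set()) if d in selected}
           for n in selected}
    # Sources come before the things derived from them.
    order = list(TopologicalSorter(sub).static_order())
    return [f"{i}. {n}: {notes.get(n, '(no note)')}"
            for i, n in enumerate(order, 1)]


derived_from = {"hits1": {"seq1"}, "align1": {"hits1"}}
notes = {"seq1": "starting sequence", "align1": "final alignment"}
pages = lab_book({"seq1", "hits1", "align1"}, derived_from, notes)
```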
We expect that metadata will be distributed - first, second and third party. One place where compositing can occur is the gateway.
Trust, authentication and non-repudiation are hard.
--
ChrisGreenhalgh - 27 Nov 2002