The Current Approach
Currently, provenance data is generated only by the workflow
enactment engine, with the details for each enactment held in a
separate file within the myGrid repository (mIR).
As shown in the picture in
ProvenanceData, an element within this file
identifies the enactment by recording (amongst other things) the
relevant workflow definition, the time and date of execution, its
constituent steps, and the inputs to and outputs from both the overall
workflow and the constituent service executions.
Input and output data are identified by some unique identifier (UID)
and/or the actual value. In practice, since the repository does not
yet support any form of UID, the provenance record always contains
a verbatim copy of the data.
Problems and Opportunities
- As Mark Greenwood points out, the Gateway will support direct invocation of individual services as well as workflows, so we need to record provenance data for those invocations too. We could exploit this provenance to abstract a future workflow from past interactions.
- However, we could only do this if the output of one invocation can be correlated with the input to a later execution. Since these results would likely be held in the mIR, this requires that the mIR support some form of UID. The WP3 team are currently investigating the I3C's LSID for this purpose.
- This immediately raises issues of mutability of resources. We know that a query on some service for a given protein may give different answers at different times, as the curators of the service add new information, but if we store the returned data and mark it as the output of a certain query invocation, then that provenance is nugatory if the file can be overwritten or edited. This suggests that repository files should be immutable. However, they may still be delible: this could lead to "orphaned" provenance links. These still have some utility, but the deletion code should probably check for such references. To avoid the user having to invent lots of fresh names, the mIR could support and generate revision numbers (LSID supports these).
- Storing provenance data as files could make searching difficult (for example, when looking for implied workflows, or when checking for references before deleting an mIR file), although the mIR's XML capability could help here.
- It is a long-term objective of myGrid that we should support industry standards where appropriate. In the case of workflow, the aim is to support the use of commercial off-the-shelf (COTS) implementations of the workflow enactment engine. Since provenance generation is unlikely to be supported by COTS workflow engines, we need eventually to separate provenance generation from workflow enactment.
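The immutability, revision-number and orphaned-link points above can be sketched in a few lines. This is only a toy model, not the mIR's actual interface: the Repository class, its method names and the "name:revision" UID format are all assumptions made for illustration (in particular, the UID format is not real LSID syntax).

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy stand-in for the mIR: items are immutable, revisions are generated."""

    items: dict = field(default_factory=dict)       # uid -> stored bytes
    latest: dict = field(default_factory=dict)      # name -> latest revision number
    references: dict = field(default_factory=dict)  # uid -> ids of invocations citing it

    def store(self, name: str, data: bytes) -> str:
        """Store data as a fresh revision of `name`; nothing is ever overwritten."""
        revision = self.latest.get(name, 0) + 1
        self.latest[name] = revision
        uid = f"{name}:{revision}"  # hypothetical UID format, not LSID syntax
        self.items[uid] = data
        return uid

    def link(self, uid: str, invocation_id: str) -> None:
        """Record that a provenance record refers to the item `uid`."""
        if uid not in self.items:
            raise KeyError(uid)
        self.references.setdefault(uid, set()).add(invocation_id)

    def delete(self, uid: str, force: bool = False) -> None:
        """Delete an item, refusing by default if provenance still cites it."""
        referrers = self.references.get(uid, set())
        if referrers and not force:
            raise ValueError(f"{uid} is cited by {sorted(referrers)}")
        # With force=True the provenance links are left behind as orphans.
        del self.items[uid]
```

Storing the same name twice yields two distinct UIDs, so the output recorded for an earlier query invocation stays intact when the service later returns a different answer. A forced delete leaves the reference entry in place, which corresponds to the "orphaned" links described above.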
Possible Approaches
Representation and Storage
The WP4 (metadata) team have proposed that we exploit
RDF and RDF Schema to represent metadata
within myGrid. Since provenance is a form of metadata, I suggest
that we investigate using RDF, rather than instances of a specific
XML Schema or DTD, to represent provenance data (approach outlined
below). This would have the following benefits:
- The provenance data is no longer split into distinct per-invocation "islands", so it should be easier to extract proto-workflows or determine input_to/output_from references when manipulating mIR files.
- If the provenance data is held in the same repository as other myGrid metadata, it may be easier to use provenance and other kinds of metadata in a single query.
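To illustrate both benefits, here is a minimal sketch that holds provenance as subject/predicate/object triples in one in-memory set and queries across invocations. It uses plain Python tuples rather than a real RDF store, and the property names ("input", "output", "visibility") and identifiers are invented for the example, not taken from any myGrid vocabulary.

```python
def match(graph, s=None, p=None, o=None):
    """Return the triples in `graph` matching the pattern; None is a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def implied_links(graph):
    """Find (producer, item, consumer) where one invocation's output feeds another."""
    links = []
    for producer, _, item in match(graph, p="output"):
        for consumer, _, _ in match(graph, p="input", o=item):
            links.append((producer, item, consumer))
    return sorted(links)


# Hypothetical data: two invocations plus one non-provenance metadata statement.
graph = {
    ("inv1", "type", "ServiceInvocation"),
    ("inv1", "output", "data:seq1"),
    ("inv2", "type", "ServiceInvocation"),
    ("inv2", "input", "data:seq1"),
    ("inv2", "output", "data:hits1"),
    ("data:seq1", "visibility", "team"),
}

# Proto-workflow edges, then the same edges filtered by other metadata.
links = implied_links(graph)
team_links = [
    link for link in links
    if match(graph, s=link[1], p="visibility", o="team")
]
```

Because every statement lives in the same graph, the correlation of outputs with later inputs and the filter on visibility are a single query rather than a scan over per-invocation files.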
One possible problem is the visibility of the metadata. The mIR will
presumably support different kinds of visibility (owner, team,
organization, public) on contained items. Can we meaningfully extend
this kind of control to metadata, and if not, how much does it
matter?
Information Model
Chris Greenhalgh has suggested the beginnings of a myGrid invocation
model, with a hierarchy of types of entities for which we can
discover metadata (and do type-specific things). This hierarchy is
rooted in the type
Thing. I suggest we rename this root type as
Resource, which at once marks us out as Serious People and also
matches the RDF term for a described entity.
In basic RDF, Resources appear to be untyped. However,
RDF Schema allows us to define
vocabularies that constrain the domain and range of certain
properties to specific resource classes, and supports a subclass
relationship between resource types.
Thus Chris's Thing hierarchy can be mapped into an RDF Schema
resource class hierarchy.
This suggests a basic approach to converting the existing provenance
DTD to RDF Schema:
- Each DTD attribute becomes an RDF Property
- Each DTD element type becomes an RDF Resource class
In particular,
WorkflowInvocation and
ServiceInvocation become
resource types, and to support arbitrary workflow composition, we
might introduce a common Invocation superclass.
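The two mapping rules, together with the proposed Invocation superclass, could be sketched as follows. The rdf: and rdfs: terms are the standard RDF Schema vocabulary, but the attribute names passed in are hypothetical placeholders, not the attributes of the actual provenance DTD.

```python
def dtd_to_rdfs(elements, superclass=None):
    """Map {element type: [attribute names]} to a list of RDF Schema triples.

    Each element type becomes an rdfs:Class and each attribute an
    rdf:Property whose rdfs:domain is the element's class.
    """
    triples = []
    if superclass is not None:
        triples.append((superclass, "rdf:type", "rdfs:Class"))
    for element, attributes in elements.items():
        triples.append((element, "rdf:type", "rdfs:Class"))
        if superclass is not None:
            triples.append((element, "rdfs:subClassOf", superclass))
        for attribute in attributes:
            triples.append((attribute, "rdf:type", "rdf:Property"))
            triples.append((attribute, "rdfs:domain", element))
    return triples


# Attribute names below are illustrative assumptions only.
schema = dtd_to_rdfs(
    {
        "WorkflowInvocation": ["workflowDefinition"],
        "ServiceInvocation": ["serviceName"],
    },
    superclass="Invocation",
)
```

One caveat with this naive mapping: an attribute shared by several element types would receive several rdfs:domain statements, which RDF Schema reads as requiring membership of all those classes at once. Shared attributes such as a start time would be better declared once on the common superclass.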
Invocation instances are all provenance and no content; they could be
represented by anonymous resources, but do we want to identify
certain "interesting" executions, or just discover them by querying
on workflow, service and/or data item? If the former, we may find it
useful to view the existing mIR and the metadata repository as two
parts of a single service at some level.
Generating Provenance
As Mark noted, we want to generate provenance for both workflow and
service invocations. The e-Science Gateway intermediates between its
clients (scientists or applications) and the invoked services, so it
becomes a likely candidate for provenance generation.
This works for primitive services, but if the Gateway were used only
in this way, workflow executions would be opaque. We still want to
record the internal workings of workflows as at present. This
suggests that the Gateway should also act as an intermediary between
a workflow and the services it invokes, which could involve
transforming the WSFL so that service calls are routed through the
Gateway.
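As a rough illustration of the Gateway acting as intermediary at both levels, the sketch below records one provenance entry per workflow and one per service call made on its behalf. The Gateway class, its method names and the record fields are assumptions for this example, and a workflow is reduced to a simple sequence of service names rather than real WSFL.

```python
import time


class Gateway:
    """Toy intermediary that records provenance for every call it forwards."""

    def __init__(self, services):
        self.services = services  # service name -> callable taking one value
        self.records = []         # provenance records, in invocation order

    def _new_record(self, kind, name, parent, inputs):
        record = {
            "id": len(self.records) + 1,
            "kind": kind,
            "name": name,
            "parent": parent,     # id of the enclosing workflow invocation, if any
            "inputs": inputs,
            "output": None,
            "started": time.time(),
        }
        self.records.append(record)
        return record

    def invoke_service(self, name, value, parent=None):
        """Call a single service and record its input and output."""
        record = self._new_record("ServiceInvocation", name, parent, value)
        record["output"] = self.services[name](value)
        return record["output"]

    def invoke_workflow(self, name, steps, value):
        """Run `steps` in sequence, routing each call back through the Gateway."""
        record = self._new_record("WorkflowInvocation", name, None, value)
        for step in steps:
            value = self.invoke_service(step, value, parent=record["id"])
        record["output"] = value
        return value


gateway = Gateway({"upper": str.upper, "reverse": lambda s: s[::-1]})
result = gateway.invoke_workflow("demo", ["upper", "reverse"], "abc")
```

After this run, gateway.records holds three entries: the workflow invocation and its two constituent service invocations, each of the latter pointing at the workflow through "parent". The intermediate value passed from the first service to the second is captured as a side effect, because it crosses the Gateway.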
In WSFL and BPEL4WS, intermediate results can be held in what are
effectively "local variables". We would like to capture these
transient results in provenance too, along with literal values
supplied directly from the user interface or by applications. The
former might involve some tricky WSFL manipulation and/or effectively
replicating the XPath expressions that assemble inputs from
intermediate outputs.
Implementation
The WP4 team have recommended adopting
Jena as the interface
to metadata as RDF. Jena supports persisting RDF in both Berkeley DB
and relational repositories. This suggests that (as above) the mIR
and metadata repositories could share the same storage system.
--
NickSharman - 26 Nov 2002