r1 - 26 Nov 2002 - 13:35:59 - NickSharman

The Current Approach

Currently, provenance data is generated (only) by the workflow enactment engine, with the details for each enactment held in a separate file within the myGrid repository (mIR).

As shown in the picture in ProvenanceData, an element within this file describes the enactment by recording (amongst other things) the relevant workflow definition, the time and date of execution, its constituent steps, and the inputs to and outputs from both the overall workflow and the constituent service executions.

Input and output data are identified by some unique identifier (UID) and/or the actual value. In practice, since the repository does not yet support any form of UID, the provenance record always contains a verbatim copy of the data.

Problems and Opportunities

  • As Mark Greenwood points out, the Gateway will support direct invocation of individual services as well as workflows, so we need to record provenance data for those invocations too. We could exploit this provenance to abstract a future workflow from past interactions.

  • However, we could only do this if the output of one invocation can be correlated with the input to a later execution. Since these results would likely be held in the mIR, this requires that the mIR support some form of UID. The WP3 team are currently investigating the I3C's LSID for this purpose.

  • This immediately raises issues of mutability of resources. We know that a query on some service for a given protein may give different answers at different times, as the curators of the service add new information, but if we store the returned data and mark it as the output of a certain query invocation, then that provenance is nugatory if the file can be overwritten or edited. This suggests that repository files should be immutable. However, they may still be delible: this could lead to "orphaned" provenance links. These still have some utility, but the deletion code should probably check for such references. To avoid the user having to invent lots of fresh names, the mIR could support and generate revision numbers (LSID supports these).

  • Storing provenance data as files could make searching difficult (for example, when looking for implied workflows, or when checking for references before deleting an mIR file), although the mIR's XML capability could help here.

  • It is a long-term objective of myGrid that we should support industry standards where appropriate. In the case of workflow, the aim is to support the use of commercial off-the-shelf (COTS) implementations of the workflow enactment engine. Since provenance generation is unlikely to be supported by COTS workflow engines, we need eventually to separate provenance generation from workflow enactment.
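The correlation and deletion checks raised in the points above can be sketched as follows. This is a minimal illustration only: it assumes each invocation record lists the identifiers of its inputs and outputs, and the record layout, the field names and the LSID-style identifiers are all hypothetical rather than taken from the existing provenance DTD or the mIR.

```python
from dataclasses import dataclass, field


@dataclass
class Invocation:
    """One service invocation; all field names are illustrative assumptions."""
    name: str
    inputs: list = field(default_factory=list)   # UIDs consumed
    outputs: list = field(default_factory=list)  # UIDs produced


def implied_links(invocations):
    """Return (producer, consumer, uid) for every output reused as an input."""
    producer_of = {}
    for inv in invocations:
        for uid in inv.outputs:
            producer_of[uid] = inv.name
    links = []
    for inv in invocations:
        for uid in inv.inputs:
            if uid in producer_of and producer_of[uid] != inv.name:
                links.append((producer_of[uid], inv.name, uid))
    return links


def referenced_by(uid, invocations):
    """Names of invocations whose provenance mentions uid (check before deleting)."""
    return [inv.name for inv in invocations
            if uid in inv.inputs or uid in inv.outputs]


# Hypothetical history of three direct service invocations.
# The trailing ":1" plays the role of a revision number.
history = [
    Invocation("fetch_sequence", ["urn:lsid:example.org:query:1"],
               ["urn:lsid:example.org:seq:7:1"]),
    Invocation("blast", ["urn:lsid:example.org:seq:7:1"],
               ["urn:lsid:example.org:hits:3:1"]),
    Invocation("render_report", ["urn:lsid:example.org:hits:3:1"],
               ["urn:lsid:example.org:report:9:1"]),
]

links = implied_links(history)
for producer, consumer, uid in links:
    print(f"{producer} -> {consumer} via {uid}")
```

The links form the edges of a candidate workflow, and `referenced_by` is the kind of check a deletion routine might run to avoid orphaning provenance. Both rely on identifiers being stable, which is why immutability matters here.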

Possible Approaches

Representation and Storage

The WP4 (metadata) team have proposed that we exploit RDF and RDF Schema to represent metadata within myGrid. Since provenance is a form of metadata, I suggest that we investigate using RDF, rather than instances of a specific XML Schema or DTD, to represent provenance data (approach outlined below). This would have the following benefits:

  • The provenance data is no longer split into distinct per-invocation "islands", so it should be easier to extract proto-workflows or determine input_to/output_from references when manipulating mIR files.

  • If the provenance data is held in the same repository as other myGrid metadata, it may be easier to use provenance and other kinds of metadata in a single query.
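To illustrate the second benefit, here is a small sketch in which provenance statements and other metadata sit in one collection of triples and are combined in a single query. Plain Python tuples stand in for RDF statements; the `prov:` and `meta:` property names, the resource names and the helper functions are all invented for the example and do not correspond to any agreed myGrid vocabulary.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# One store holding both kinds of statement (all names hypothetical).
store = [
    # provenance
    ("inv:1", "prov:invoked", "svc:blast"),
    ("inv:1", "prov:output", "data:hits3"),
    ("inv:2", "prov:invoked", "svc:clustal"),
    ("inv:2", "prov:output", "data:align5"),
    # other metadata
    ("svc:blast", "meta:provider", "org:a"),
    ("svc:clustal", "meta:provider", "org:b"),
]


def outputs_from_provider(triples, provider):
    """Data items produced by any service that the given provider offers."""
    services = {s for s, _, _ in match(triples, p="meta:provider", o=provider)}
    result = []
    for inv, _, svc in match(triples, p="prov:invoked"):
        if svc in services:
            result.extend(o for _, _, o in match(triples, s=inv, p="prov:output"))
    return result


print(outputs_from_provider(store, "org:a"))
```

With per-invocation files, the same question would mean opening every provenance file and then consulting the metadata store separately.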

One possible problem is the visibility of the metadata. The mIR will presumably support different kinds of visibility (owner, team, organization, public) on contained items. Can we meaningfully extend this kind of control to metadata, and if not, how much does it matter?

Information Model

Chris Greenhalgh has suggested the beginnings of a myGrid invocation model, with a hierarchy of types of entities for which we can discover metadata (and do type-specific things). This hierarchy is rooted in the type Thing. I suggest we rename this root type as Resource, which at once marks us out as Serious People and also matches the RDF term for a described entity.

In basic RDF, Resources appear to be untyped. However, RDF Schema allows us to define vocabularies that constrain the domain and range of certain properties to specific resource classes, and supports a subclass relationship between resource types.

Thus Chris's Thing hierarchy can be mapped into an RDF Schema resource class hierarchy.

This suggests a basic approach to converting the existing provenance DTD to RDF Schema:

  • Each DTD attribute becomes an RDF Property
  • Each DTD element type becomes an RDF Resource class

In particular, WorkflowInvocation and ServiceInvocation become resource types, and to support arbitrary workflow composition, we might introduce a common Invocation superclass.
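The mapping rule above is mechanical enough to sketch. The following is an illustration only: WorkflowInvocation, ServiceInvocation and Invocation come from the text, but the attribute names (workflowDefinition, serviceName, startTime) are hypothetical, and the output is a list of plain tuples using the standard rdf: and rdfs: terms rather than a real RDF serialisation.

```python
def dtd_to_rdfs(element_attributes, superclass, shared_attributes=()):
    """Map DTD element types to classes and DTD attributes to properties.

    element_attributes: dict of element type name -> list of attribute names.
    shared_attributes: attributes attached to the common superclass instead,
    so that no property ends up with two different domains.
    """
    triples = [(superclass, "rdf:type", "rdfs:Class")]
    for attr in shared_attributes:
        triples.append((attr, "rdf:type", "rdf:Property"))
        triples.append((attr, "rdfs:domain", superclass))
    for element, attributes in element_attributes.items():
        triples.append((element, "rdf:type", "rdfs:Class"))
        triples.append((element, "rdfs:subClassOf", superclass))
        for attr in attributes:
            triples.append((attr, "rdf:type", "rdf:Property"))
            triples.append((attr, "rdfs:domain", element))
    return triples


# Attribute names below are assumptions made for the example.
schema = dtd_to_rdfs(
    {
        "WorkflowInvocation": ["workflowDefinition"],
        "ServiceInvocation": ["serviceName"],
    },
    superclass="Invocation",
    shared_attributes=["startTime"],
)

for s, p, o in schema:
    print(s, p, o)
```

Attributes common to both invocation types are placed on the superclass here; whether the real DTD has such attributes would need checking against the DTD itself.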

Invocation instances are all provenance and no content; they could be represented by anonymous resources, but do we want to identify certain "interesting" executions, or just discover them by querying on workflow, service and/or data item? If the former, we may find it useful to view the existing mIR and the metadata repository as two parts of a single service at some level.

Generating Provenance

As Mark noted, we want to generate provenance for both workflow and service invocations. The e-Science Gateway intermediates between its clients (scientists or applications) and the invoked services, so it becomes a likely candidate for provenance generation.

This works for primitive services but, used only in this way, would make workflow executions opaque. We still want to record the internal workings of workflows as at present. This suggests that the Gateway should also act as an intermediary between a workflow and its invoked services, which could involve transforming the WSFL so that each service call is routed through the Gateway.
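The idea of routing a workflow's internal calls back through the Gateway can be sketched in-process as below. This is an analogy, not a WSFL transformation: the `RecordingGateway` class, the `call` callback convention and the two toy services are all assumptions made for the example, and a real Gateway would mediate remote service calls.

```python
import itertools


class RecordingGateway:
    """Hypothetical gateway that records every invocation passing through it."""

    def __init__(self):
        self.records = []
        self._ids = itertools.count(1)

    def invoke(self, name, target, inputs, parent=None):
        invocation_id = next(self._ids)
        record = {"id": invocation_id, "name": name, "parent": parent,
                  "inputs": dict(inputs), "outputs": None}
        self.records.append(record)

        # Nested calls made by a workflow come back through the gateway,
        # so its internal steps are recorded with this invocation as parent.
        def call(child_name, child_target, child_inputs):
            return self.invoke(child_name, child_target, child_inputs,
                               parent=invocation_id)

        record["outputs"] = target(inputs, call)
        return record["outputs"]


# Two toy "primitive services"; they ignore the callback.
def reverse(inputs, call):
    return {"text": inputs["text"][::-1]}


def upper(inputs, call):
    return {"text": inputs["text"].upper()}


# A toy "workflow" whose steps are invoked via the gateway callback.
def reverse_then_upper(inputs, call):
    intermediate = call("reverse", reverse, inputs)
    return call("upper", upper, intermediate)


gateway = RecordingGateway()
result = gateway.invoke("reverse_then_upper", reverse_then_upper, {"text": "abc"})

for r in gateway.records:
    print(r["id"], r["name"], "parent:", r["parent"], r["inputs"], "->", r["outputs"])
```

Because the intermediate result passes through the gateway as the output of one step and the input of the next, it is captured without the workflow engine itself generating any provenance.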

In WSFL and BPEL4WS, intermediate results can be held in what are effectively "local variables". We would like to capture these transient results in provenance too, along with literal values supplied directly from the user interface or by applications. The former might involve some tricky WSFL manipulation and/or effectively replicating the XPath expressions that assemble inputs from intermediate outputs.

Implementation

The WP4 team have recommended adopting Jena as the interface to metadata as RDF. Jena supports persisting RDF in both Berkeley DB and relational repositories. This suggests that (as above) the mIR and metadata repositories could share the same storage system.

-- NickSharman - 26 Nov 2002
