A brief overview of provenance from the EBI's perspective. We take the viewpoint of why a scientist wants to store & use provenance, with some real use-cases to provide a framework for further elaboration.
Some example situations:
Scenario: An "aide-mémoire" for the individual scientist.
Description: A scientist has performed an 'in silico' experiment. Enough provenance information needs to be captured for him/her or a colleague to return to it at a later date & understand the process.
Notes: As well as automatically capturing the "what" of provenance, allow a scientist to add free-text annotations to the steps in their workflow to capture the "why", e.g. "I chose this restriction enzyme as it cut only three times within 200 base pairs of the SNP". A general issue is that applications & databases may have provenance which needs to be captured but is not provided explicitly, e.g. "Which version of SWISS-PROT did BLASTX use?", "What were the default parameters used by HMMER?". How should a myGrid-compatible service supply this provenance information?
Requirements: Capture user-provided input for applications, capture application provenance, capture free-text annotation associated with the process.
End result: Producing a human-readable & printable report of the provenance log for sticking in a paper lab-book would be satisfactory.
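As an illustration of the requirements above, here is a minimal sketch of a log that holds user input, application provenance & free-text notes per step, and prints a plain-text report. The class, field names & sample values are all hypothetical; nothing here is an existing myGrid interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One step of an in silico experiment (hypothetical structure)."""
    tool: str                        # the "what": which application ran
    user_inputs: Dict[str, str]      # input supplied by the scientist
    app_provenance: Dict[str, str]   # facts reported by the service itself
    notes: List[str] = field(default_factory=list)  # the "why", free text


def render_report(title: str, steps: List[Step]) -> str:
    """Format the provenance log as plain text suitable for printing."""
    lines = [title, "=" * len(title)]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.tool}")
        for key, value in sorted(step.user_inputs.items()):
            lines.append(f"   input  {key} = {value}")
        for key, value in sorted(step.app_provenance.items()):
            lines.append(f"   tool   {key} = {value}")
        for note in step.notes:
            lines.append(f"   note   {note}")
    return "\n".join(lines)


# Placeholder values only; a real service would have to report its own details.
step = Step(
    tool="BLASTX",
    user_inputs={"query": "my_sequence.fasta"},
    app_provenance={"database": "SWISS-PROT", "database_release": "unknown"},
)
step.notes.append("Placeholder: the scientist's reason for running this search.")
report = render_report("Experiment log", [step])
print(report)
```

The "unknown" release value marks exactly the gap raised in the notes: unless the service states which database version it used, the log cannot record it.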
Scenario: Evidence for results in public databases.
Description: Databases contain entries with both human-curated & automatic annotation, e.g. the feature table of a SWISS-PROT entry may report a POU domain in a peptide. Some users will want to see what evidence exists for a particular result - i.e. how it was produced.
Notes: This scenario has virtually identical requirements to the one above, but arises in a different context & requires proper handling of persistence. Sometimes scientists need to understand exactly how a particular result was derived, e.g. "What is the evidence that there is a POU domain in this protein?". A provenance log should provide a detailed breakdown of how a particular result was arrived at. This provenance will have been captured during the annotation process.
Requirements: [See above.]
End result: Electronic archive of provenance.
Scenario: Collaborative provenance.
Description: A collaborative provenance environment that allows scientists to add their own provenance to an existing process.
Notes: This follows on from the previous scenario. During a discussion about provenance & DAS, the concept of "third-party provenance" came up, i.e. the ability for processes & provenance logs themselves to be annotated with further provenance. The original records are retained & immutable, but other lab members may add their own provenance, experiences & observations on the process to the record. A real-world example of this might be built around a common laboratory protocol.
Requirements: Handling & presenting provenance from a number of sources.
End result: A "multi-layer collaborative lab-book"?
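One way to picture the "original stays immutable, others add layers" idea is sketched below. It assumes a simple in-memory model; the class names, fields & sample authors are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class OriginalRecord:
    """The original provenance record; frozen so it cannot be reassigned."""
    record_id: str
    details: Mapping[str, str]


@dataclass(frozen=True)
class Annotation:
    """Third-party provenance attached to a record, never merged into it."""
    record_id: str
    author: str
    text: str


class LayeredLog:
    """Keeps the original record apart from the layers added by others."""

    def __init__(self, original: OriginalRecord) -> None:
        self._original = original
        self._layers: List[Annotation] = []

    def annotate(self, author: str, text: str) -> None:
        # New observations are appended; the original is left untouched.
        self._layers.append(Annotation(self._original.record_id, author, text))

    def view(self) -> Tuple[OriginalRecord, Tuple[Annotation, ...]]:
        return self._original, tuple(self._layers)


# Hypothetical example built around a shared laboratory protocol.
record = OriginalRecord(
    "protocol-1",
    MappingProxyType({"name": "common laboratory protocol"}),  # read-only view
)
log = LayeredLog(record)
log.annotate("colleague_a", "Placeholder observation from one lab member.")
log.annotate("colleague_b", "Placeholder observation from another lab member.")
original, layers = log.view()
print(original.details["name"], len(layers))
```

Keeping annotations in separate objects, each carrying its author, is what lets a viewer present provenance from a number of sources without losing track of who said what.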
Scenario: A management overview & report generation tool.
Description: As a team of curators work on entries, details of their activities may be captured & used to auto-generate a report summarising progress.
Notes: There is active interest from the INTERPRO group (for whom Talisman was originally written) in the ability to capture information about which entries have been touched in the last week, by which annotators & what they did. (Side issue of employees & the Data Protection Act here.)
Requirements: Capture provenance about user, application, entry, etc. Side issue of access control & security in provenance repository.
End result: A document that summarises who did what.
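The "who did what in the last week" report could be derived from captured events roughly as follows. This is only a sketch: the event fields, annotator names, entry identifiers & dates are made up for illustration and do not describe Talisman or any existing tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    """One captured curation action (hypothetical fields)."""
    when: datetime
    annotator: str
    entry: str
    action: str


def weekly_summary(events: Iterable[Event], now: datetime) -> Dict[str, List[str]]:
    """Group the actions of the last seven days by annotator."""
    cutoff = now - timedelta(days=7)
    touched = defaultdict(set)
    for event in events:
        if cutoff <= event.when <= now:
            touched[event.annotator].add(f"{event.entry}: {event.action}")
    return {annotator: sorted(items) for annotator, items in touched.items()}


# Illustrative data only.
now = datetime(2002, 11, 28, 12, 0)
events = [
    Event(datetime(2002, 11, 25, 9, 30), "curator_a", "ENTRY_1", "edited abstract"),
    Event(datetime(2002, 11, 26, 14, 0), "curator_a", "ENTRY_2", "added reference"),
    Event(datetime(2002, 11, 10, 11, 0), "curator_b", "ENTRY_3", "created entry"),
]
summary = weekly_summary(events, now)
for annotator, items in sorted(summary.items()):
    print(annotator, items)
```

Because such a report names individuals, whatever produces it would need to respect the access-control requirement noted above.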
Scenario: Dealing with change.
Description: Scientists are notified when a result should be re-evaluated because something has changed relative to the provenance recorded for that result in a log book, e.g. a tool was updated, a database was updated or the starting data changed.
Notes: Services need to provide this information & the provenance log needs to store it in a form that allows these comparisons to be performed, e.g. a BLASTP service should provide information on which SWISS-PROT release it used. A user may wish to indicate which provenance information is important for comparisons. For example, if a new version of SWISS-PROT has been released, it may be prudent to re-run BLASTP - but if only the location of the BLASTP service has changed, it is not necessary to re-run the BLASTP searches.
Requirements: Services provide provenance information necessary to enable change management. The user should be able to indicate which provenance information is important as regards change.
End result: Compare recorded provenance information with notification messages to decide whether a result should be re-evaluated.
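The comparison described in this scenario can be sketched as a filter over the provenance keys the user has marked as important. The function name, key names & values below are assumptions for illustration; the notification format is not defined anywhere in these notes.

```python
from typing import Dict, Iterable, List


def changes_needing_rerun(
    recorded: Dict[str, str],
    notified: Dict[str, str],
    important_keys: Iterable[str],
) -> List[str]:
    """Return the important keys whose notified value differs from the log."""
    return sorted(
        key
        for key in important_keys
        if key in notified and notified[key] != recorded.get(key)
    )


# Provenance stored with a BLASTP result (placeholder values).
recorded = {
    "database_release": "release-A",
    "service_location": "http://host-a.example/blastp",
}
# The user only cares about the database release, not where the service runs.
important = ["database_release"]

moved_only = {"service_location": "http://host-b.example/blastp"}
new_release = {"database_release": "release-B"}

print(changes_needing_rerun(recorded, moved_only, important))   # no re-run needed
print(changes_needing_rerun(recorded, new_release, important))  # re-run advised
```

A key that is marked important but missing from the recorded provenance is reported as changed, which errs on the side of re-evaluating the result.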
Other issues:
- Who/what is responsible for gathering provenance & how?
- Provenance information - Centralised or distributed?
- Different, non-myGrid services may have their own models & concepts of how to supply provenance information - Merging provenance.
--
AlanRobinson - 28 Nov 2002