r2 - 24 Jan 2003 - 17:44:06 - AlanRobinson

Data, Metadata & Provenance in a Bioinformatics Setting

As a means of thinking through some issues surrounding data, metadata & provenance in a bioinformatics setting, I've been walking through a very simple scenario in which a user:

1) Creates & stores some data.

2) Chooses an application & service instance to use.

3) Stores the results.

The particular scenario I will choose is the perennial "I have a sequence & I want to find possible homologues", aka "Running a BLAST". For the moment, I'm skirting around the issue of workflows. I'm also ignoring the issue of adding annotation to data.

Storing the data:

From somewhere I have found a DNA sequence, e.g. it was published in a paper. An alternative is that the sequencing lab has e-mailed it to me in FASTA format as "mySeq.fasta", but then one has to deal with the provenance of the sequencing facility & trace files.

> [This is a free-text comment line.]
atcgatcgtacgtagctagctagctagctgatcgatgctacgtagctgactg
atgctagctgactactactgcatactgactagctactagctactgcagatca
cagttgacgacttgactgcatgcatgcatgcatcagtgcatgaccatgactg
atgctgca
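As a minimal sketch of what "storing" this record involves, the FASTA text can be split into its free-text header and the sequence proper. The function name `parse_fasta_record` and the crude DNA alphabet check are assumptions made for illustration, not part of myGrid.

```python
def parse_fasta_record(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' header line")
    header = lines[0][1:].strip()    # free text after the '>'
    sequence = "".join(lines[1:])    # sequence lines joined, no line breaks
    return header, sequence


record = """> [This is a free-text comment line.]
atcgatcgtacgtagctagctagctagctgatcgatgctacgtagctgactg
atgctagctgactactactgcatactgactagctactagctactgcagatca
cagttgacgacttgactgcatgcatgcatgcatcagtgcatgaccatgactg
atgctgca
"""

header, sequence = parse_fasta_record(record)
# Crude semantic check: only nucleotide letters (plus 'n') are present.
is_dna = set(sequence.lower()) <= set("acgtn")
```

Note that the header line is the only place where FASTA carries anything resembling metadata, and it is unstructured free text.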

I want to store this sequence in a manner that means someone could access it later & understand its nature & history.

All of the following lists are non-exhaustive!

Ownership metadata (use v-card?)

Owner: Alan J. Robinson. [Automatic]
Dept.: Services [Automatic]
Contact phone: 01223 494444 [Automatic]
E-mail: alan @ ebi.ac.uk [Automatic]

Submission metadata:

Date submitted: Thursday, January 23rd 2003 [Automatic]
Time submitted: 11:50am GMT [Automatic]
Submission host: medusa.ebi.ac.uk [Automatic]
Submission tool: myGrid portal [Automatic]

N.B. Have to consider when this entry is modified - How is this metadata stored?

General metadata:

ID (unique): DM12345 [User entered/Automatically assigned: LSID?]
Name: "my sequence" [User entered - Not necessarily unique]
Description: "A Dm sequence for Sox70D" [User entered]
Syntactic type: FASTA format [Semi-automatic?]
Semantic type: DNA sequence [Semi-automatic? Use controlled vocabulary?]
MIME type: text/plain [Automatic?]
Security: Group-readable & User-writable [User entered]
Source: Science 295:234-245 (2001) [Free-text? PMID? Dublin Core RDF?]
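One way to keep track of the bracketed annotations above ("Automatic", "User entered", "Semi-automatic") is to store the origin alongside each field. The following is a sketch only; the class names `Origin`, `MetadataField` and `MetadataRecord` are hypothetical and do not correspond to any existing myGrid or mIR API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Origin(Enum):
    """How a metadata value was obtained."""
    AUTOMATIC = "Automatic"
    SEMI_AUTOMATIC = "Semi-automatic"
    USER_ENTERED = "User entered"


@dataclass
class MetadataField:
    name: str
    value: str
    origin: Origin


@dataclass
class MetadataRecord:
    fields: list = field(default_factory=list)

    def add(self, name, value, origin):
        self.fields.append(MetadataField(name, value, origin))

    def names_by_origin(self, origin):
        """Names of all fields captured in the given way, in insertion order."""
        return [f.name for f in self.fields if f.origin is origin]


record = MetadataRecord()
record.add("Owner", "Alan J. Robinson", Origin.AUTOMATIC)
record.add("Submission host", "medusa.ebi.ac.uk", Origin.AUTOMATIC)
record.add("Name", "my sequence", Origin.USER_ENTERED)
record.add("Syntactic type", "FASTA format", Origin.SEMI_AUTOMATIC)
```

Recording the origin makes it possible to ask later which values a user could have mistyped and which were captured by the system.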

The "Source" tag causes me concern: I see it as fulfilling the requirement of establishing how this entity came into being. Ideally, this would be a cross-reference to another entity or process. If I had got the sequence from my local sequencing facility, rather than from a journal, then would the source be the v-card for the "Cambridge University, Dept. of Genetics, Sequencing Facility"? Would it be the barcode number for their LIMS system? Would it be the clone's ID? Is the "source" the reference to the logical thing that came previously in an audit trail & for which I might find provenance? Should "source" be part of the data?

Then there's the metadata that I want to attach which is domain specific & adds the detail necessary for interpretation in this particular case. Some of this metadata may be considered data, e.g. it could be encoded in an EMBL-formatted version of the data. Can we depend upon the mIR being smart enough to query the data itself, as well as its metadata? How extensible should metadata be? Should the myGrid metadata have a bag into which arbitrary metadata may be stuffed?

Domain specific metadata:

Organism: Drosophila melanogaster. [NCBI Taxonomy ID?]
Strain: Oregon-R. [Controlled vocabulary]
Sex: Unknown. [Controlled vocabulary]
Development stage: Syncytial blastoderm. [Controlled vocabulary]
Chromosome: 3 [Controlled vocabulary]
Position: 3-40.7 [User entered / Controlled]
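The "bag" idea could be sketched as a record with a fixed core plus a free-form dictionary, where only those bag entries that have a controlled vocabulary are validated. The vocabularies below are illustrative stand-ins rather than real ontologies, and the function name `invalid_terms` is hypothetical.

```python
# Illustrative vocabularies only; a real system would draw on a maintained ontology.
CONTROLLED_VOCABULARIES = {
    "Sex": {"Male", "Female", "Unknown"},
    "Chromosome": {"X", "Y", "2", "3", "4"},
}


def invalid_terms(bag):
    """Return the names of bag entries whose value is outside their vocabulary.

    Entries with no registered vocabulary are accepted as free text.
    """
    return [
        name
        for name, value in bag.items()
        if name in CONTROLLED_VOCABULARIES
        and value not in CONTROLLED_VOCABULARIES[name]
    ]


entry = {
    # Domain-independent core.
    "core": {"ID": "DM12345", "Semantic type": "DNA sequence"},
    # Domain-specific bag: arbitrary keys allowed.
    "bag": {
        "Organism": "Drosophila melanogaster",
        "Strain": "Oregon-R",
        "Sex": "Unknown",
        "Chromosome": "3",
        "Position": "3-40.7",
    },
}

problems = invalid_terms(entry["bag"])
```

The trade-off is visible even in this small form: the bag is fully extensible, but anything in it without a vocabulary cannot be checked or reliably queried.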

Choosing an application & service instance to use

My discovery & choice of suitable tools for a job would have been done at an earlier stage & would involve reading literature & consulting with colleagues - A service that can suggest homology search tools to consider is useful here. However, in a workflow situation with the data in hand, I suspect I would be looking for specific application(s), not types of application, i.e. I'd be looking for instances of BLASTX using SWISS-PROT, not a "homology search against a non-redundant protein sequence collection".

N.B. BLAST is a suite of programs - For an ASP & service registry, it's probably easier to consider each application separately: BLASTP, BLASTN, BLASTX & TBLASTN. The choice of the BLAST program is dependent upon the type of sequence in the query & the database. I have a DNA sequence & want to search against SWISS-PROT, thus I need to find a service instance of BLASTX using the SWISS-PROT collection.
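The dependence of the program choice on the query and database types can be written as a small lookup table covering the four programs named above. The function name `choose_blast_program` and the type labels "nucleotide" and "protein" are assumptions made for this sketch.

```python
# (query sequence type, database sequence type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",   # query is translated
    ("protein", "nucleotide"): "TBLASTN",  # database is translated
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program matching the query and database types."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "no BLAST program for query=%r, database=%r"
            % (query_type, database_type)
        )


# A DNA query against SWISS-PROT (a protein collection):
program = choose_blast_program("nucleotide", "protein")
```

A registry that records the semantic type of each service's input and database could apply this kind of rule to narrow the candidate service instances automatically.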

I'm going to try & separate the metadata about an application into that which describes any instance of a service, & that which describes a specific instance of a service (cf. class & instance variables in Java).

Service metadata:

Name: BLASTP
Version: WU-BLASTP v2
Classification: Homology search tool.
Description: ". . ."
Official URL: http://www.washu.edu/BLAST/
Author/Contact: Warren Gish (v-card?)
Owner: Washington University.

Service instance metadata:

Host: EMBL-EBI
Contact: support @ ebi.ac.uk (v-card?)
Information: http://www.ebi.ac.uk/WU-BLAST/
Cost: Free.
Endpoint: http://industry.ebi.ac.uk/soaplab/…./wublastp
WSDL: http://industry.ebi.ac.uk/soaplab/…./wublastp.wsdl
API: OMG BSA.
Input data types & options: [Syntactic & semantic] [Captured in OMG BSA DTD?]
Output data types & options: [Syntactic & semantic] [Captured in OMG BSA DTD?]
Estimated turnaround time: 10 minutes.
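The split between service metadata and service instance metadata might be sketched as two record types, with each instance holding a reference to the service it deploys. The class and field names here are hypothetical, and only a few of the fields listed above are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Metadata that holds for every deployment of an application."""
    name: str
    version: str
    classification: str
    owner: str


@dataclass(frozen=True)
class ServiceInstance:
    """Metadata that holds for one particular deployment."""
    service: Service
    host: str
    cost: str
    estimated_turnaround_minutes: int


wu_blastp = Service(
    name="BLASTP",
    version="WU-BLASTP v2",
    classification="Homology search tool",
    owner="Washington University",
)

ebi_instance = ServiceInstance(
    service=wu_blastp,
    host="EMBL-EBI",
    cost="Free",
    estimated_turnaround_minutes=10,
)
```

With this shape, a second host offering the same program would be another `ServiceInstance` sharing the same `Service` record, so the service-level metadata is stored once.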

I think I see the "Input/Output data types & options" as a place where I store semantically interesting things, e.g. it would record that an input type is a nucleotide sequence, but would ignore parameters dealing with specifying the frame. I would use the "Input data types & options" to discover that this service instance uses a recent version of SWISS-PROT. Another application may use them to identify services that have an input with a semantic type of "collections of nucleotide sequences" or an output of "multiple sequence alignment". I realise I need to think about this some more - especially if it's redundant with the WSDL.

When I use a service instance, either directly or as part of a workflow, I would like all of the above information to be captured & stored. As well as that information, I would be interested to capture the following run-time information: (i) Date & time submitted. (ii) Date & time finished. I presume any exceptions are handled appropriately.
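Capturing the two run-time timestamps, along with any exception, could look roughly like the wrapper below. This is a sketch under the assumption that a service call can be treated as an ordinary function; `run_with_timestamps` and `fake_blast` are hypothetical names and no real service is contacted.

```python
from datetime import datetime, timezone


def run_with_timestamps(service_call, *args, **kwargs):
    """Invoke service_call and record when it was submitted and finished."""
    record = {
        "submitted": datetime.now(timezone.utc),
        "finished": None,
        "result": None,
        "error": None,
    }
    try:
        record["result"] = service_call(*args, **kwargs)
    except Exception as exc:
        # Keep the failure as part of the run record instead of losing it.
        record["error"] = repr(exc)
    finally:
        record["finished"] = datetime.now(timezone.utc)
    return record


def fake_blast(sequence):
    """Stand-in for a real service invocation."""
    return "report for %d bases" % len(sequence)


run = run_with_timestamps(fake_blast, "atcgatcg")
failed_run = run_with_timestamps(lambda: 1 / 0)
```

Storing failed runs as well as successful ones means the record still shows that an attempt was made, and when.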

Which of the metadata described above is supplied by the service instance, & which by the service registry? What happens if I can get metadata from the service instance itself, & it conflicts with that given to me by the service registry?
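Detecting such a conflict is at least mechanically simple if both sources are reduced to field/value pairs. The sketch below only reports disagreements; which source should win is left open. The function name `find_conflicts` and the differing turnaround value are invented for illustration.

```python
def find_conflicts(registry_metadata, instance_metadata):
    """Return {field: (registry value, instance value)} for fields that disagree.

    Fields present in only one source are not treated as conflicts.
    """
    conflicts = {}
    for key in registry_metadata.keys() & instance_metadata.keys():
        if registry_metadata[key] != instance_metadata[key]:
            conflicts[key] = (registry_metadata[key], instance_metadata[key])
    return conflicts


from_registry = {
    "Host": "EMBL-EBI",
    "Cost": "Free",
    "Estimated turnaround time": "10 minutes",
}
from_instance = {
    "Host": "EMBL-EBI",
    "Estimated turnaround time": "30 minutes",  # hypothetical disagreement
}

conflicts = find_conflicts(from_registry, from_instance)
```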

All of the above is metadata that pertains to a single service instance. In myGrid, we use workflows - Thus we need to consider: (i) What is stored in the mIR to represent the workflow; (ii) What metadata is stored about the overall workflow, excluding the specifics of the service instances; (iii) How is the metadata about service instances associated with the data about the workflow, the metadata about the workflow & the output result of the service instance? (iv) If I want to add annotation about why I chose particular service instances or parameters, how does this relate to the metadata about the workflow & the service instances?

Storing the result

The storing of my result will be very similar to the first stage of storing my data:

Ownership information (use v-card?)

Owner: Alan J. Robinson. [Automatic]
Dept: Services [Automatic]
Contact phone: 01223 494444 [Automatic]
E-mail: alan @ ebi.ac.uk [Automatic]

Submission details:

Date submitted: Thursday, January 23rd 2003
Time submitted: 2:05pm GMT
Submission host: mygrid.ebi.ac.uk
Submission tool: The Gateway? The workflow enactor?

Data details:

ID (unique): BP12345XYZ
Name: WU-BLASTP
Description: WU-BLASTP report.
Syntax: BLAST result format
Semantics: BLAST report
MIME type: text/plain
Security: Group-readable / User-writable.
Source: [Reference to the (service instance in the) workflow.]

If the result was produced by a workflow, how does the data & metadata about the result associate with the metadata about the service instance that was used?

Conclusions:

  • Whether something is metadata or data depends upon your context.
  • Some bioinformatics formats merge data & metadata.
  • When we capture metadata for provenance - Does it include domain specific properties? e.g. the species from which this DNA sequence originated. Or should this be a part of the data itself?
  • Is the metadata that myGrid requires merely domain-independent stuff like ownership information, submission dates & details, data formats, etc.?
  • I suspect that this metadata may be captured in an XML format (including RDF).

I think there are relationships between:

  • The input data & its metadata.
  • The workflow & its metadata.
  • The workflow & the input data.
  • The workflow & the metadata about service instances used in the workflow.
  • The workflow & the result.
  • The input data & the metadata about the service instance that used it.
  • The result & the metadata about the service instance that created it.
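The relationships listed above could be recorded as simple subject/predicate/object triples, in the spirit of RDF. The sketch below uses plain tuples rather than an RDF library; the predicate names and the identifiers `workflow:WF1` and `instance-metadata:wublastp-ebi` are hypothetical, while `DM12345` and `BP12345XYZ` are the example IDs used earlier.

```python
INSTANCE = "instance-metadata:wublastp-ebi"

RELATIONS = [
    ("data:DM12345", "describedBy", "metadata:DM12345"),
    ("workflow:WF1", "describedBy", "metadata:WF1"),
    ("workflow:WF1", "consumed", "data:DM12345"),
    ("workflow:WF1", "used", INSTANCE),
    ("workflow:WF1", "produced", "data:BP12345XYZ"),
    ("data:DM12345", "usedBy", INSTANCE),
    ("data:BP12345XYZ", "createdBy", INSTANCE),
]


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given parts; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in RELATIONS
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# What did the workflow produce, and what refers to the service instance?
produced = match(subject="workflow:WF1", predicate="produced")
touching_instance = match(obj=INSTANCE)
```

Following the triples from a result back through the service instance to the input data gives a rudimentary audit trail.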

[How is this captured by the Gateway & mIR currently?]

-- AlanRobinson - 23 Jan 2003

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.