Simple Sequence Typing
Trying to get my head around tying stuff together in the workbench and with workflows and services...
Simple example: getting a sequence record based on its accession number using EMBOSS seqret.
In the Ontology
Accession number examples
- mygrid:EMBL_nucleotide_sequence_id
- has super mygrid:nucleotide_sequence_unique_identifier
- has super mygrid:biological_concept_unique_id
- has super mygrid:unique_identifier
- has super mygrid:identifier
- has super mygrid:physical_structure
- mygrid:SWISS-PROT_accession_number
- has super mygrid:record_id (but that's all in this context)
- has super mygrid:unique_identifier
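To make the subsumption question concrete, here is a minimal sketch of the two accession number hierarchies as a parent map, with a transitive "is a" test. It assumes that each "has super" line names the parent of the line directly before it; the ontology may actually nest differently. The names SUPERS, ancestors and is_a are made up for illustration.

```python
# Direct superclass links, transcribed from the two lists above.
# ASSUMPTION: each "has super" line is the parent of the line before it.
SUPERS = {
    "mygrid:EMBL_nucleotide_sequence_id": ["mygrid:nucleotide_sequence_unique_identifier"],
    "mygrid:nucleotide_sequence_unique_identifier": ["mygrid:biological_concept_unique_id"],
    "mygrid:biological_concept_unique_id": ["mygrid:unique_identifier"],
    "mygrid:unique_identifier": ["mygrid:identifier"],
    "mygrid:identifier": ["mygrid:physical_structure"],
    "mygrid:SWISS-PROT_accession_number": ["mygrid:record_id"],
    "mygrid:record_id": ["mygrid:unique_identifier"],
}


def ancestors(concept):
    """Return every concept reachable by following 'has super' links."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in SUPERS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(concept, candidate_super):
    """True if concept equals candidate_super or is subsumed by it."""
    return concept == candidate_super or candidate_super in ancestors(concept)
```

Under that reading, the two kinds of accession number only meet at mygrid:unique_identifier, so a service input typed as mygrid:nucleotide_sequence_unique_identifier would accept the EMBL ID but not the SWISS-PROT one.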
Sequence record examples
- mygrid:EMBL_record
- has super mygrid:bioinformatics_record
- has super mygrid:record
- has super mygrid:data_structure
- has super mygrid:informatics_data_or_data_structure
- has super mygrid:informatics_physical_structure
- and super mygrid:bioinformatics_data_structure
- has super mygrid:bioinformatics_data_or_data_structure
- has super mygrid:informatics_data_or_data_structure
- and super mygrid:data_structure
- mygrid:SWISS-PROT_record
- has super mygrid:record (but not mygrid:bioinformatics_record)
General Ontology/concept notes
- The ontology does not consider arity: a single ID is labeled 'mygrid:EMBL_nucleotide_sequence_id', and so is a list or array of IDs of that kind.
- The ontology does not consider format for sequence IDs: all of the following would be 'mygrid:EMBL_nucleotide_sequence_id': an identifier qualified with a DB name as in an EMBOSS USA (e.g. 'embl:AF072242'), unqualified with a DB name as used by some of the bespoke web services we have created (e.g. 'AF072242') or in an LSID format (not sure if it is defined yet for EMBL, but would be something like 'URN:LSID:embldomainname:EMBL:AF072242:1').
- I also worry that USAs like 'embl:AF072242' look like global IDs, but are strictly specific to a particular EMBOSS installation and the names used for the DBs that are installed with it.
- I also worry that USAs are jolly flexible things, and trying to tie down their type is rather hard, consider 'embl-acc:AF072242', 'embl-acc:AF072242[10:50:r]' and 'sw:opsd_*' as three (hopefully) valid USAs!
- Does the ontology consider record format? Both EMBL and SWISSPROT have 'well-known' ASCII flat-file formats; are these implied here? Not necessarily.
- If they are not implied, then how do you distinguish format?
- How do you deal with lossy formats (e.g. no features in a FASTA file)? Is it still a mygrid:EMBL_nucleotide_sequence_id, for example?
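The three ID format variants mentioned above can be told apart syntactically, even though the ontology gives them all the same concept. A rough sketch, assuming the LSID layout is urn:lsid:authority:namespace:object with an optional revision, and that a DB-qualified ID has exactly one colon. The function name classify_sequence_id is hypothetical.

```python
def classify_sequence_id(value):
    """Return (variant, accession) for a single identifier string.

    Only covers the three variants discussed in these notes; it does not
    attempt to handle the full EMBOSS USA syntax.
    """
    parts = value.split(":")
    if len(parts) >= 5 and parts[0].lower() == "urn" and parts[1].lower() == "lsid":
        # URN:LSID:<authority>:<namespace>:<object>[:<revision>]
        return ("lsid", parts[4])
    if len(parts) == 2 and parts[0] and parts[1]:
        # <dbname>:<accession>, e.g. a simple EMBOSS USA
        return ("db-qualified", parts[1])
    if len(parts) == 1 and value:
        return ("unqualified", value)
    raise ValueError("unrecognised identifier format: %r" % value)
```

This only recovers the syntax. It cannot tell whether 'embl' in 'embl:AF072242' means the same database on two different EMBOSS installations, which is the worry raised above.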
With MIME types
General notes
MIME types are jolly useful for some things we might work with:
- image/gif, image/jpeg, image/png, image/tiff, etc.
- audio/basic, etc.
- text/html
- model/vrml
But things get a bit more ambiguous with, for example:
- text/plain
- text/xml
- application/xml
- application/octet-stream
With web interfaces, organisations typically create their own custom MIME type (as an experimental x-* or x.*, personal pers.* or vendor vnd.* type) for each helper application to be triggered. This may correspond to all uses of a single format, or only to a single specific action (e.g. execute) on a file of a particular format.
In general MIME types have no explicit substitutability. There are two limited exceptions to this:
- Unknown text/* types can be treated as text/plain [RFC2046]
- Unknown */*+xml types can be assumed to be XML (e.g. text/xml) [RFC3023]
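The two exceptions can be written down as a small fallback rule. This is a sketch only: the function name fallback_type and the idea of a caller-supplied set of known types are assumptions, and returning application/xml for non-text +xml types is one reasonable choice rather than anything the RFCs mandate.

```python
def fallback_type(mime_type, known_types):
    """Return a type the receiver can handle, or None if there is no fallback.

    known_types is the set of MIME types the receiver understands directly.
    """
    # Drop any parameters such as "; charset=utf-8" and normalise case.
    mime_type = mime_type.split(";")[0].strip().lower()
    if mime_type in known_types:
        return mime_type
    major, _, subtype = mime_type.partition("/")
    if subtype.endswith("+xml"):
        # Unknown */*+xml can be processed as generic XML.
        return "text/xml" if major == "text" else "application/xml"
    if major == "text":
        # Unknown text/* can be treated as text/plain.
        return "text/plain"
    return None
```

For example, an unknown 'text/x-bsml' falls back to text/plain, while an unknown 'chemical/x-pdb' has no fallback at all.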
Accession number examples
All of them look like text/plain!
Sequence record examples
Most of them look like they would be text/plain. XML ones would be text/xml.
To support e.g. browser plugins some other MIME-types are used by some vendors and products, e.g. 'chemical/x-pdb' (PDB for RasMol?), 'text/x-bsml' (BSML).
General MIME type notes
- The above treatments would not distinguish arity, i.e. one id/sequence in a file versus multiple.
- This might be done with distinct singleton and multiple MIME types?!
- The above treatments - using relatively standard MIME types - would not distinguish the sequence ID format variants discussed.
- This would require distinct MIME types for each variant.
- This would be unlikely to address the possible local variation in EMBOSS USA DB names.
- Distinct MIME types would be required for each flat file sequence format (e.g. EMBL, SWISSPROT, etc.) to allow them to be distinguished by this mechanism.
- EMBOSS does exhaustive trial and error, attempting to load as different formats until one works.
- Any proliferation of MIME types raises the prospect of substitutability and subsumption. With MIME types the only support is the treatment of unknown text/* types as text/plain (plus the +xml convention noted above). This seems unlikely to be either precise or general enough.
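The trial-and-error approach to format detection mentioned above can be sketched as follows. This is not how EMBOSS implements it; it is a toy illustration with two deliberately crude parsers (parse_fasta and parse_embl_like, both made up here) that each raise ValueError when the input does not look like their format.

```python
def parse_fasta(text):
    """Toy FASTA reader: '>' header line followed by sequence lines."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA")
    header = lines[0][1:].split()
    if not header:
        raise ValueError("FASTA header has no ID")
    return {"id": header[0], "sequence": "".join(line.strip() for line in lines[1:])}


def parse_embl_like(text):
    """Toy check for an EMBL-style flat file: 'ID' first line, '//' last line."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].startswith("ID ") or lines[-1].strip() != "//":
        raise ValueError("not an EMBL-style flat file")
    tokens = lines[0].split()
    if len(tokens) < 2:
        raise ValueError("ID line has no entry name")
    return {"id": tokens[1].rstrip(";")}


# Order matters: the first parser that accepts the input wins.
PARSERS = [("fasta", parse_fasta), ("embl", parse_embl_like)]


def detect_format(text, parsers=PARSERS):
    """Try each parser in turn and return (format_name, parsed_record)."""
    for name, parser in parsers:
        try:
            return name, parser(text)
        except ValueError:
            continue
    raise ValueError("no parser accepted the input")
```

Note that SWISS-PROT flat files also start with an ID line and end with '//', so a check this crude would label them 'embl'. Guessing from content is fragile in exactly the way an explicit format type is meant to avoid.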
SOAP encoding types
The various web services expose message parts with particular XML schema-defined types. In the current myGrid services these are almost all auto-derived by AXIS from the Java RPC types.
In practice, the vast majority of arguments and result types are drawn from the Java types: java.lang.String (xsd:string), java.util.Vector, java.lang.String[], java.util.Hashtable, int, boolean or byte[].
In addition, the workflow enactor requires its inputs to be in an appropriate SOAP XML encoding when they are passed to it. They are passed in an explicit XML document (fragment?) generated by the client (and passed as a java.lang.String), rather than by AXIS (or whatever) performing marshalling.
Accession number examples
A single accession number looks like a String.
A list of accession numbers might look like:
- a String (e.g. with comma-separated values),
- an array of Strings,
- a vector of Strings,
- a String, itself containing the SOAP encoding of one of the above (in the case of the enactor).
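A consumer that wants to be tolerant of these shapes could normalise them to one list form. A minimal sketch in Python rather than Java, covering the first three cases only; the SOAP-encoded-string case would need an XML parse first and is left out. The name as_id_list is hypothetical.

```python
def as_id_list(value):
    """Normalise a single ID, a comma-separated string, or a sequence of IDs
    into a plain list of strings."""
    if isinstance(value, str):
        # A lone ID is just a comma-separated list with one entry.
        return [item.strip() for item in value.split(",") if item.strip()]
    # Arrays and vectors both arrive as some kind of iterable.
    return [str(item) for item in value]
```

The catch is that this throws away the singleton versus list-of-one distinction, which is precisely what the SOAP encoding forces a service to state.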
Sequence record examples
Most look like a String (assuming UTF-8 encoding).
Some might look like a byte[], either of binary data, or of a particular encoding of a document.
General SOAP encoding notes
- In many cases, the SOAP type is no more specific than a generic MIME type such as text/plain.
- In some cases - especially explicit arrays of identifiers (or records) - the SOAP encoding is unique in having to account for arity, in particular distinguishing a singleton from a list (even a list of one element).
- Note: the pre-prototype used an ugly fudge to overload concept type with one of three specific format options: raw (just a string), string (in a single XML element) and string[] (an XML array of elements, each containing a string). This is not present in the current version and is too ugly and limited to contemplate.
EMBOSS ACD type system
It has been suggested that we should use the ACD type system. However, it is not clear to me what this means in detail across the system as a whole...
Accession number examples
* These would be of type 'sequence'. Values are USAs.
* Values which were not USAs (e.g. 'AF072242') would have to be 'string'.
* EMBOSS may well extend the USA spec. to include LSIDs as valid USAs.
Sequence record examples
These would typically be of ACD types:
- sequence (and seqall, seqset)
- seqout (and seqoutall, seqoutset)
- features
- featout
ACD typically recognises a number of sequence record formats, including:
- embl, fasta, genbank, gff, pfam, swissprot, ...
For feature tables it recognises:
- embl, gff, swissprot and pir.
Features are identified by UFOs (Uniform Feature Objects), analogous to USAs.
General ACD notes
- As already noted, the sequence type (USAs) also supports direct data, various kinds of searches over databases, and the option to read from files of such USAs. These are only meaningful to EMBOSS applications. Other applications would not understand them.
- As already noted, different EMBOSS installations may have different databases, and might even have the same database available under different DB names.
- EMBOSS applications often output sequences without features by default; risk of losing important stuff!
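To show how much is packed into a USA, here is a rough sketch that splits the database-style examples used earlier ('embl-acc:AF072242', 'embl-acc:AF072242[10:50:r]', 'sw:opsd_*') into their parts. It covers only the dbname[-field]:entry[start:end[:r]] shape; filenames, list files and the other USA forms are ignored. The regular expression and the name split_db_usa are my own guesses at a workable pattern, not taken from EMBOSS.

```python
import re

# dbname, optional -searchfield, entry (may contain wildcards),
# optional [start:end] or [start:end:r] sub-range.
_USA_PATTERN = re.compile(
    r"(?P<db>[A-Za-z0-9_]+)"
    r"(?:-(?P<field>[A-Za-z]+))?"
    r":(?P<entry>[^\[\]]+)"
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?::(?P<reverse>r))?\])?"
)


def split_db_usa(usa):
    """Split a database-style USA into its components, or raise ValueError."""
    match = _USA_PATTERN.fullmatch(usa)
    if match is None:
        raise ValueError("not a database-style USA: %r" % usa)
    entry = match.group("entry")
    return {
        "db": match.group("db"),
        "field": match.group("field"),
        "entry": entry,
        "start": int(match.group("start")) if match.group("start") else None,
        "end": int(match.group("end")) if match.group("end") else None,
        "reverse": match.group("reverse") is not None,
        # A wildcard means the USA may name many sequences, not one.
        "wildcard": "*" in entry or "?" in entry,
    }
```

Even in this cut-down form, one string can mean a whole entry, a reversed sub-sequence, or a set of entries, so 'sequence' as an ACD type says little about arity or about what actually comes back.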
Other options
- Should there be other parts of the ontology (or some other taxonomic framework) articulating aspects of format, such as arity, or name qualification?
- Should terms from the ontology (or whatever) be used in preference to specialisation of MIME types?
- What should the standard properties be for a DataThing??
- So far, we only have the concept type, and a place holder for MIME type (new in MIR3 and not yet used in any system or demo).
Other issues
- If there are more standard properties to describe types, how are these factored into the service/workflow matching process?
- Is there any automatic coercion, e.g. between formats? If so, how, where and when?
--
ChrisGreenhalgh - 24 Apr 2003