r2 - 24 Apr 2003 - 07:41:41 - ChrisGreenhalgh

GravesDisease storyboard/scenario Details

Exactly what data, types, concepts, links, etc...

This is turning into a mess; I'll try to boil down some key issues with a simpler scenario/discussion: SimpleSequenceTyping

Affymetrix probe IDs as input to stage 1

Input to stage one is an ASCII file containing tens of Affymetrix probe IDs, presumably one per line. The chip ID is (for example) 'HG-U95Av2', the Affymetrix ID is (for example) '33905_at' (AlansRandomWalk, AlansTargetedQueries).

  • MIME type could be 'text/plain', but this would be no help in selecting any particular tools. Perhaps it should be (something like) 'text/x-sequence-ids-without-dbname' (but see the later discussion of output record IDs, formats and coercion).
  • Concept type not yet defined; presumably something like 'mygrid:Affymetrix_probe_id', a (new) sub-class of 'mygrid:unique_identifier' or 'mygrid:biological_sequence_identifier' (where 'mygrid:' is currently 'http://www.aboutmygrid.org/ontology#').
  • Not XML so no Schema or DTD per se.
  • Carried over SOAP as 'xsd:string' (or something like that).
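
As a rough sketch of the input handling described above, the following Python reads one probe ID per line and labels the result with the candidate MIME and concept types. The `TypedData` class, the `read_probe_ids` helper and the second probe ID are illustrative assumptions; the type strings are only the suggestions made above, not registered or agreed names.

```python
from dataclasses import dataclass
from typing import List

# Ontology namespace as quoted in the text above.
MYGRID = "http://www.aboutmygrid.org/ontology#"


@dataclass
class TypedData:
    """Hypothetical container pairing a payload with its candidate types."""
    mime_type: str
    concept: str
    values: List[str]


def read_probe_ids(text: str) -> TypedData:
    """Parse file content holding one Affymetrix probe ID per line (assumed layout)."""
    # Blank lines and surrounding whitespace are ignored.
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    return TypedData(
        mime_type="text/x-sequence-ids-without-dbname",  # suggested above, not registered
        concept=MYGRID + "Affymetrix_probe_id",          # concept not yet defined
        values=ids,
    )


# '33905_at' comes from the text; 'example_2_at' is a made-up placeholder.
probes = read_probe_ids("33905_at\n\n  example_2_at \n")
print(probes.concept, probes.values)
```

Note that the chip ID has no home in this structure, which is exactly the open question raised under the issues below.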

Issues:

  • How is the concept type (and MIME type for anything other than 'text/plain') allocated? The default is an import wizard in which the user selects from a list of choices.
    • Where do these choices come from?!
    • How can it be extended? How can it be (semi-)automated?
  • Where does the Chip ID go, versus the probe ID?
    • Does it look like an EMBOSS dbname?
    • Or a file header?
    • Or part of the concept?
    • Or is the concatenated pair actually the full Probe ID?
    • If it looks like an EMBOSS dbname, how do you stop the system trying to feed it to EMBOSS application (which won't know about it)?
  • What do you see when you view it? Default is the plain text file. With a different MIME type (and/or by keying off the concept type, assuming we could do this) a different standard view, such as a table, might be provided.
  • Could you select/split out a single ID from the file? Implies specialised document editor/structure support (requiring the previous point as a minimum) to expose the individual IDs as nodes.
  • Could you explode the file into one file per ID? Requires an additional custom action, either global or contextually enabled.
  • Can you do anything at all with the ID other than view the raw text of the ID or search for a matching WorkflowDefinition based on its conceptual type? Not at the moment!
  • Typically the probe ID will not be globally unique (like an LSID would be); so the concept type (e.g. 'mygrid:Affymetrix_probe_id') is used to infer its namespace and for matching to workflow inputs. However, 'mygrid:Affymetrix_probe_id' (or whatever) is a subclass of 'mygrid:unique_identifier', but when treated as such has lost the unique identification of its namespace, i.e. Affymetrix probes. So be careful...
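
One of the possibilities raised above is that the concatenated (chip ID, probe ID) pair is the real identifier. A minimal sketch of that reading, which also "explodes" a file into one record per ID, might look as follows. `QualifiedProbeId`, `explode` and the ':' separator are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class QualifiedProbeId(NamedTuple):
    """One possible answer: treat (chip ID, probe ID) as the full identifier."""
    chip_id: str
    probe_id: str

    def __str__(self) -> str:
        # The ':' separator imitates an EMBOSS-style dbname prefix. This is an
        # assumption; EMBOSS itself would not know about such a "database".
        return f"{self.chip_id}:{self.probe_id}"


def explode(chip_id: str, file_text: str) -> List[QualifiedProbeId]:
    """Split a one-ID-per-line file into one qualified record per ID."""
    return [
        QualifiedProbeId(chip_id, line.strip())
        for line in file_text.splitlines()
        if line.strip()
    ]


# 'example_2_at' is a made-up placeholder ID.
ids = explode("HG-U95Av2", "33905_at\nexample_2_at\n")
print([str(i) for i in ids])
```

Keeping the chip ID as a separate field means the namespace survives even when the value is handled as a plain 'mygrid:unique_identifier'.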

Set(s) of other record IDs as output of stage 1

Output from stage one is, for each input ID, "the AffyId, the appropriate EMBL accession number for each AffyId, The Swissprot ID if available for each AffyId, A list of Medline Ids, and the GO term if available for the protein in the cross referenced SwissProt ID." (GravesStoryBoard). In fact there may be multiple GO terms per input ID.

Options:

  1. Each kind of ID is a separate output part from the workflow, either a single ID as a string (e.g. Swissprot ID) or a list of IDs either as an XML-encoded array or as a string (=plain text file) with space, comma or newline separated IDs.
    • Concept types would be presumably
      1. 'mygrid:EMBL_nucleotide_sequence_id', which is a subclass of 'mygrid:nucleotide_sequence_unique_identifier' and hence of 'mygrid:biological_concept_unique_id', 'mygrid:unique_identifier' (but not of 'mygrid:biological_sequence_identifier', which is a subclass of 'mygrid:identifier').
      2. 'mygrid:SWISSPROT_accession_number', which is a subclass of 'mygrid:record_id' and hence of 'mygrid:unique_identifier' (but not of 'mygrid:biological_sequence_identifier', which is a subclass of 'mygrid:identifier').
      3. 'mygrid:MEDLINE_reference_id' which is a subclass of 'mygrid:record_id' and hence of 'mygrid:unique_identifier'. The current ontology does not normally encode arity, so an array of class X is classified as class X.
      4. 'mygrid:Gene_Ontology_term_id' which is a subclass of 'mygrid:biological_concept_unique_id' and hence of 'mygrid:unique_identifier'. Again, the current ontology does not normally encode arity, so an array of class X is classified as class X.
    • MIME type could again be 'text/plain', with the loss of any particular clues about content handling or ID separation in list/array.
    • SOAP transport types are again likely to be 'xsd:string' or similar.
  2. All of the output IDs for one input are parts of a single result value/file, e.g. a feature table containing database cross-reference features.
    • Concept type might be 'mygrid:EMBL_record' (?) which is a subclass of 'mygrid:bioinformatics_record' and hence of 'mygrid:bioinformatics_data_structure'. I can't find a specific feature table concept.
    • MIME type could be just 'text/plain', or would now make more sense to reflect the particular sequence/feature table file format, e.g. 'text/x-sequence-record-embl'.
    • SOAP transport type is likely to be 'xsd:string' or similar.
  3. All of the output IDs for all of the inputs are parts of a single result value/file, e.g. a multi-sequence file, each with its own feature table (as in the previous option).
    • Types would presumably be as for the previous option, with the sequence file format restricted to one that supports multiple sequences per file.
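
The subclass relationships listed under option 1 can be checked mechanically. The sketch below transcribes only the relationships stated above (with the 'mygrid:' prefix omitted) and assumes a single parent per concept, which the real ontology may not guarantee. `PARENT` and `is_subclass_of` are hypothetical names, not part of any myGrid API.

```python
from typing import Dict, Optional

# Parent links as read from the list above; single inheritance is an assumption.
PARENT: Dict[str, Optional[str]] = {
    "EMBL_nucleotide_sequence_id": "nucleotide_sequence_unique_identifier",
    "nucleotide_sequence_unique_identifier": "biological_concept_unique_id",
    "Gene_Ontology_term_id": "biological_concept_unique_id",
    "biological_concept_unique_id": "unique_identifier",
    "SWISSPROT_accession_number": "record_id",
    "MEDLINE_reference_id": "record_id",
    "record_id": "unique_identifier",
    "biological_sequence_identifier": "identifier",
    "unique_identifier": None,
    "identifier": None,
}


def is_subclass_of(concept: str, ancestor: str) -> bool:
    """Walk up the parent chain; a concept counts as a subclass of itself."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


print(is_subclass_of("EMBL_nucleotide_sequence_id", "unique_identifier"))
print(is_subclass_of("SWISSPROT_accession_number", "biological_sequence_identifier"))
```

Because arity is not encoded, a list of MEDLINE IDs and a single MEDLINE ID would both pass the same check here.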

Issues:

  • With option 1 the individual data elements are clearly labelled with concepts, as with the AffyId examples above. Consequently they can be used as inputs to other workflows directly (or after selection of or exploding to individual(s)).
  • With option 1 (or, indeed, any of the options) what does having e.g. the EMBL ID get you? At the basic level all you can do is use it as input to other workflows, e.g. to retrieve the associated record. But you might well want something more direct than that. E.g. can you fetch the record? If so, how exactly? Using a workflow taking that kind of ID (concept) and returning the record?!
  • With option 1 (or any of the others) are the IDs qualified in any way, e.g. with SRS/EMBOSS database IDs, or are they just the plain database-specific accession number/ID (e.g. 'EMBL:AF072242' or just 'AF072242')? How do applications (e.g. taking them as inputs) know whether they are or not?
  • Relating to this - and the similar point about the Affy IDs - many EMBOSS operations take EMBOSS USAs, say 'mygrid:emboss_sequence_usa', but a USA has to carry a dbname prefix such as 'embl:', and so is not directly compatible with a 'mygrid:EMBL_nucleotide_sequence_id'.
    • Does this kind of coercion have to happen? If so, how and where?
    • At the moment it's 'workflows all the way down', so you have to find and execute a workflow to get to the USA form (implicitly using the right DB name for the EMBOSS/whatever installation you are working with), so that you can then find and run a workflow to get the record. Can we provide specialised support for this, e.g. as in-process 'workflows' or workflow-like NetBeans actions?!
    • What relationships would be established between the resulting things? Should this exploit the MultiDataObject support across accession, USA and record (say)?
    • Should the coercion be auto-magical? If so, how general is it? How is it configured and extended? How does it tie in to semantic discovery?
    • Valid dbnames (such as 'embl:') depend on the databases configured in a particular installation of EMBOSS or SRS and the precise names given in that installation, which MAY vary! It's not exactly a global naming scheme! Do we just 'standardise' on some for myGrid/demos?
  • Is it really a format rather than a concept thing, e.g. 'text/x-sequence-id-without-dbname' vs 'text/x-sequence-id-with-dbname' vs 'text/x-sequence-lsid'?
    • If so, then again, how does the change happen?
    • Where/how is this MIME type used in service/workflow discovery and compatibility checking?
    • How does the system know that an ID in a with-dbname format can be passed as an input to a service taking 'mygrid:emboss_sequence_usa' whereas an ID in a without-dbname format cannot?
    • If it is a format thing, then are MIME types adequate, or is (e.g.) subsumption also required, so we should have a format ontology (fairly) orthogonal to the data_or_data_structure type/concept part of the ontology?
  • With option 1, how can they be viewed in a combined fashion within the NetBeans idiom, since they are essentially a large number of files related only by Associations (perhaps visible as file properties)? Perhaps a TopComponent which visualises the current (possibly multiple) selection (esp. if it is a selection of WorkflowInstances)?!
  • With option 2/3, what file format should we use by default (e.g. for editor support in NetBeans)? Or will (say) BioJava be a suitable parsing framework? For editing this would require loss-less round-tripping of sequence and feature files!
  • With option 2/3, you want to be able to get at the cross-reference IDs inside the file, which means exposing them as NetBeans Nodes based on a parsing of the file, and appropriate MIME & concept types (or other mapping to a suitable Node type).
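
To make the coercion question concrete, here is a minimal sketch of converting between a plain accession and a dbname-qualified form, using the 'AF072242' example from above. It handles only the simple 'dbname:entry' shape, and the mapping table is a hypothetical per-installation configuration, since dbnames vary between EMBOSS/SRS installations. None of these function names exist in myGrid or EMBOSS.

```python
from typing import Dict

# Hypothetical local configuration: which dbname this installation uses for
# each concept type. Another installation might use different names.
DBNAME_FOR_CONCEPT: Dict[str, str] = {
    "mygrid:EMBL_nucleotide_sequence_id": "embl",
}


def to_usa(concept: str, accession: str) -> str:
    """Qualify a plain accession with the locally configured dbname."""
    try:
        dbname = DBNAME_FOR_CONCEPT[concept]
    except KeyError:
        raise ValueError(f"no dbname configured for concept {concept!r}")
    return f"{dbname}:{accession}"


def from_usa(usa: str) -> str:
    """Strip the dbname prefix from a 'dbname:entry' string."""
    dbname, sep, accession = usa.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname-qualified ID: {usa!r}")
    return accession


usa = to_usa("mygrid:EMBL_nucleotide_sequence_id", "AF072242")
print(usa, from_usa(usa))
```

Even this small version shows why the coercion cannot be purely automatic: it needs both the concept type of the input and knowledge of the target installation's naming.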

Sorting the above by GO term

"The table is sorted by GO term in order to cluster AffyIds that map to similar GO terms together." (GravesStoryBoard)

Issues:

  • This is actually quite hard, since you have to account for the isA and isPartOf relations between GO terms (different genes may be very closely related via common ancestors, but not have exactly the same GO terms), and the three complementary groupings of GO terms, for process, function and location, each of which may be more or less significant in different contexts.
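
To illustrate why exact matching on GO terms is not enough, the sketch below scores two terms by the overlap of their ancestor sets (a Jaccard similarity). This is a stand-in technique chosen for illustration, not the storyboard's method; the term names and parent links are invented and are not real GO identifiers, and isA and isPartOf links are treated alike.

```python
from typing import Dict, Set

# Toy parent links for a single GO-like grouping; all names are invented.
PARENTS: Dict[str, Set[str]] = {
    "process": set(),
    "signalling": {"process"},
    "kinase_signalling": {"signalling"},
    "receptor_signalling": {"signalling"},
    "metabolism": {"process"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus everything reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two ancestor sets (1.0 means identical terms)."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)


print(similarity("kinase_signalling", "receptor_signalling"))
print(similarity("kinase_signalling", "metabolism"))
```

Two sibling terms score higher than two terms that share only the root, even though neither pair has an exactly matching term. A real treatment would also have to weigh the process, function and location groupings separately.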

(add more here)

-- ChrisGreenhalgh - 24 Apr 2003
