ClassificationWizardNotes (r5 - 06 May 2003 - 08:49:00 - ChrisWroe)
-- ChrisWroe - 06 May 2003

The semantic descriptions can get complicated very quickly. I am therefore going to list all the type-matching scenarios in which this information could be useful, and use that list to scope what information is actually required. See TypeMatchScenario.

-- ChrisWroe - 28 Apr 2003

Annotation to first stage of GD scenario

-- ChrisWroe - 28 Apr 2003

.....

Visualisation and interaction requirements for the GD scenario.

1st section of the scenario:

2) User needs to interact with labbook to upload a collection of AffyIds to the new project

  • Option 1: the user mounts the local filesystem in NetBeans. They copy the file, and paste it into the project folder in the MIR filesystem.
    • This creates a new file in the MIR filesystem and a corresponding DataThing in the MIR.
      • It may be that this file is initially editable (but cannot be used in workflows etc.), and only becomes immutable (and usable in workflows etc.) following an explicit commit action. This would be required to allow the user to edit things in or on the way into the MIR.
  • cont.: either as a result of this, or by an explicit context action on the new MIR file, the user opens a Classify data wizard. They use this to select the concept(s) (mygrid:Affymetrix_HG_U95_probe_id) and (perhaps implied from this) MIME-type (text/x-record-ids) (and/or other types) for the new file from (say) a drop-down list, (say) configured from a new NetBeans Option.
    • This adds additional 'isA' Annotations to the DataThing in the MIR, which are visible in the file's standard properties as 'concepts'.
  • Option 2: the user right clicks the Project folder and selects action myGrid->New->Imported Data. This opens the New Wizard for Imported Data, which allows the user to select the file to be imported, and then classify it (as above).
    • Other than not requiring support for pasting into the MIR filesystem the other internal requirements are as for option 1.
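
The lifecycle described in the two options (import, classify, explicit commit) can be sketched roughly as follows. This is only an illustration: the `DataThing` class here is a hypothetical stand-in, not the real MIR API, and the method names `edit`, `commit`, `classify` and `import_file` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataThing:
    """Hypothetical stand-in for a MIR DataThing; not the real myGrid API."""
    name: str
    content: str
    committed: bool = False
    concepts: list = field(default_factory=list)    # 'isA' annotations
    mime_types: list = field(default_factory=list)

    def edit(self, new_content):
        # Editing is only possible before the explicit commit action.
        if self.committed:
            raise PermissionError(f"{self.name} is committed and immutable")
        self.content = new_content

    def commit(self):
        # After commit the data is immutable and usable in workflows.
        self.committed = True

    def classify(self, concept, mime_type=None):
        # What the Classify data wizard would record: an 'isA' annotation
        # and, optionally, a MIME type implied by (or chosen with) it.
        if concept not in self.concepts:
            self.concepts.append(concept)
        if mime_type and mime_type not in self.mime_types:
            self.mime_types.append(mime_type)

    def usable_in_workflow(self):
        return self.committed


def import_file(name, content, concept, mime_type=None):
    """Option 2 in one step: import the file and classify it.

    The commit is left as a separate, explicit action.
    """
    thing = DataThing(name, content)
    thing.classify(concept, mime_type)
    return thing
```

Keeping `commit` separate from `import_file` mirrors the idea that the user may still want to edit things on the way into the MIR.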
.....

---++ Datatypes and Classification Wizard notes

  • The classify data wizard should collect at least three bits of information from the user.
    1. The concept which represents the meaning of the data, e.g. mygrid:Affymetrix_HG_U95_probe_id. It would be good if I didn't have to pre-enumerate concepts for all IDs within a given scope. A mygrid:Affymetrix_HG_U95_probe_id is a unique identifier for a probe with respect to a given probe set (Human Genome U95). Ideally the wizard would present a hierarchy of concepts in that area, some of which can't be picked because they are too abstract (mygrid:unique_identifier) and some of which, if picked, must be further qualified (mygrid:Affymetrix_probe_id). This would require the user to be able to compose a new concept from a template, something I am working on but which may not be available for IF4.
    2. Whether the data represents a singleton or a collection, and if a collection, what kind (bag, set, list, map). I'm not sure what depends on the collection type in this scenario, though. The more important piece of information is the collection format, i.e. what the delimiter between items is.
    3. The format that encodes the data. The choice of concept will have narrowed down the choice of formats that can encode that concept. For example, mygrid:Affymetrix_HG_U95_probe_id could imply MIME-type (text/x-record-ids).

    • The information gathered by the classify data wizard should also be available for all data sent to and received from a service. This gets more interesting for complex data structures such as an EMBL or SWISS-PROT record. I suggest I try to write down that information. The concept type for an EMBL record would specify that it is a database record from the EMBL database, that it is about a gene, and that it has components such as id, nucleotide sequence etc. This would have to go hand in hand with a schema of the record. (Given a multi-table database, I wonder if you could auto-generate the schema of the record from the schema of the database and the query?) I am not sure of the best way to represent the schema. I think I will start with XML Schema. I know the data is not in XML, but the only alternative seems to be something like an EBNF syntax, which is too much detail.
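
The three pieces of information the wizard collects (concept, collection kind with delimiter, and format) could be held in a small record like the one below. This is a minimal sketch under stated assumptions: the names `Classification`, `CollectionKind`, `FORMATS_FOR_CONCEPT` and `ABSTRACT_CONCEPTS` are hypothetical, and the only concept-to-format pairing taken from these notes is mygrid:Affymetrix_HG_U95_probe_id with text/x-record-ids.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CollectionKind(Enum):
    SINGLETON = "singleton"
    BAG = "bag"
    SET = "set"
    LIST = "list"
    MAP = "map"


# Hypothetical lookup: which formats can encode which concept.
FORMATS_FOR_CONCEPT = {
    "mygrid:Affymetrix_HG_U95_probe_id": ["text/x-record-ids"],
}

# Concepts too abstract to be picked directly in the wizard.
ABSTRACT_CONCEPTS = {"mygrid:unique_identifier"}


@dataclass(frozen=True)
class Classification:
    concept: str
    kind: CollectionKind
    mime_type: str
    delimiter: Optional[str] = None


def classify(concept, kind, delimiter=None, mime_type=None):
    """Validate the wizard's answers and return a Classification."""
    if concept in ABSTRACT_CONCEPTS:
        raise ValueError(f"{concept} is too abstract to pick")
    candidates = FORMATS_FOR_CONCEPT.get(concept, [])
    if mime_type is None:
        # The concept narrows the formats; use it if exactly one remains.
        if len(candidates) != 1:
            raise ValueError("no single format implied; the user must choose")
        mime_type = candidates[0]
    elif candidates and mime_type not in candidates:
        raise ValueError(f"{mime_type} cannot encode {concept}")
    if kind is not CollectionKind.SINGLETON and delimiter is None:
        raise ValueError("a collection needs an item delimiter")
    return Classification(concept, kind, mime_type, delimiter)


def split_items(text, classification):
    """Split raw data into items using the recorded delimiter."""
    if classification.kind is CollectionKind.SINGLETON:
        return [text]
    return [item for item in text.split(classification.delimiter) if item]
```

For a newline-delimited file of probe ids, `classify("mygrid:Affymetrix_HG_U95_probe_id", CollectionKind.LIST, delimiter="\n")` picks up the MIME type from the concept alone.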

---+++ Trial descriptions

Working off the documentation of the EMBL format, I'm going to try and come up with a semantic description. The aim is to relate the record and its components to concepts in the ontology. The aim is not to provide sufficient information to parse, render or view an EMBL entry, something that is best left to existing code. I envisage creating some way of mapping concept description to schema to implementations that can actually process the record.

The descriptions get complex. There is a question as to whether applications can actually get through the complexity to find out what they need. Suggestions for simpler shortcuts are welcome. I don't want to go down this road too far until someone comes up with a scenario in which these can be used. Should I only describe the important bits?

EMBL record format

Here's a description of the line types in an EMBL record, as taken from the EBI documentation:

Key Name Cardinality
ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
SV - new sequence identifier (>=1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RA - reference author(s) (>=1 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
FH - feature table header (0 or 2 per entry)
FT - feature table data (>=0 per entry)
CC - comments or notes (>=0 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
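
The cardinalities in the table above can be written down as data and checked against an entry. The following is a rough sketch, not a parser: it only counts two-character line codes. Two things are assumptions on my part: "many per entry" for XX is treated as "any number", and sequence data lines are recognised by two leading blanks.

```python
from collections import Counter

# (min, max) occurrences per entry, transcribed from the table above.
# None means unbounded. XX ("many") is assumed to mean any number.
CARDINALITY = {
    "ID": (1, 1), "AC": (1, None), "SV": (1, None), "DT": (2, 2),
    "DE": (1, None), "KW": (1, None), "OS": (1, None), "OC": (1, None),
    "OG": (0, 1), "RN": (1, None), "RC": (0, None), "RP": (1, None),
    "RX": (0, None), "RA": (1, None), "RT": (1, None), "RL": (1, None),
    "DR": (0, None), "FT": (0, None), "CC": (0, None), "XX": (0, None),
    "SQ": (1, 1), "  ": (1, None), "//": (1, 1),
}

# Codes whose count must be one of a fixed set ("0 or 2 per entry").
ALLOWED_COUNTS = {"FH": {0, 2}}


def check_entry(lines):
    """Return a list of cardinality problems for the lines of one entry."""
    counts = Counter(line[:2] for line in lines if line)
    problems = []
    for code, (low, high) in CARDINALITY.items():
        n = counts[code]
        if n < low or (high is not None and n > high):
            problems.append(f"{code!r}: found {n}, expected min={low} max={high}")
    for code, allowed in ALLOWED_COUNTS.items():
        n = counts[code]
        if n not in allowed:
            problems.append(f"{code!r}: found {n}, expected one of {sorted(allowed)}")
    known = set(CARDINALITY) | set(ALLOWED_COUNTS)
    for code in counts:
        if code not in known:
            problems.append(f"{code!r}: unknown line code")
    return problems
```

An empty list from `check_entry` means the entry satisfies every cardinality in the table; it says nothing about whether the line contents are well formed.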


Here's some textual DAML+OIL to describe the various components in terms of a common ontology.
"EMBL division" definedAs (structure which is_division_of (one-of "EMBL database")).
"EMBL nucleotide sequence record" definedAs (record which <is_part_of (one-of "EMBL database")
                  encodes "nucleotide sequence"
                  encodes (feature is_feature_of "nucleotide sequence")>).
"EMBL nucleotide sequence record" necessarily 

    <has_component ("EMBL ID")
     has_component ("EMBL Accession Number")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")
     has_component ("")>.

"EMBL ID" definedAs (record which is_component_of (one-of EMBL)).
"EMBL ID" necessarily 
   <has_component "EMBL entry name"
    has_component "EMBL entry class name"
    has_component "EMBL entry molecule type name"
    has_component "EMBL entry database division name"
    has_component "nucleotide sequence length in nucleotide bases">.

"EMBL entry name" definedAs (name which <is_identity_of "nucleotide sequence"
               is_encoded_by "EMBL nucleotide sequence record">).
"EMBL standard entry name" instanceOf "EMBL entry class name".
"EMBL standard entry name" has_string_value "standard".
"EMBL entry class name" definedAs (name which <is_identity_of (class which is_class_of "EMBL nucleotide sequence record")>).

"EMBL entry molecule type name" definedAs 
   (name which <is_identity_of 
   ((class which is_class_of biological_molecule) which has_sequence 
      (sequence is_encoded_by "EMBL nucleotide sequence record"))>).
"EMBL RNA molecular type name" instanceOf (name which <is_identity_of 
   ((class which is_class_of RNA) which has_sequence 
      (sequence is_encoded_by "EMBL nucleotide sequence record"))>).
"EMBL RNA molecular type name" has_string_value "RNA".
"EMBL DNA molecular type name" instanceOf (name which <is_identity_of 
   ((class which is_class_of DNA) which has_sequence 
      (sequence is_encoded_by "EMBL nucleotide sequence record"))>).
"EMBL DNA molecular type name" has_string_value "DNA".

"EMBL entry database division name" definedAs (name which is_identity_of
                   (structure which is_division_of 
                     (one-of "EMBL database"))).
"EMBL database EST division name" instanceOf (name which is_identity_of 
                  ("EMBL division" has_part someAndOnly
                    ("EMBL nucleotide sequence record" which encodes 
                     (sequence is_sequence_of "expressed sequence tag")))).
"EMBL database EST division name" has_string_value "EST".

                          Division                Code
                          ESTs                    EST
                          Bacteriophage           PHG
                          Fungi                   FUN
                          Genome survey           GSS
                          High Throughput cDNA    HTC
                          High Throughput Genome  HTG
                          Human                   HUM
                          Invertebrates           INV
                          Mus musculus            MUS
                          Organelles              ORG
                          Other Mammals           MAM
                          Other Vertebrates       VRT
                          Plants                  PLN
                          Prokaryotes             PRO
                          Rodents                 ROD
                          STSs                    STS
                          Synthetic               SYN
                          Unclassified            UNC
                          Viruses                 VRL
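
To show how an application might use these names without wading through the full descriptions, here is a minimal sketch that splits an ID line into the five components listed under "EMBL ID" and checks the division code against the table above. The ID line layout assumed here ("ID   name  class; molecule; division; length BP.") is my assumption about the flat-file format and is not stated in these notes; the entry name in the usage example is made up.

```python
import re

# Division codes and names, transcribed from the table above.
DIVISION_CODES = {
    "EST": "ESTs", "PHG": "Bacteriophage", "FUN": "Fungi",
    "GSS": "Genome survey", "HTC": "High Throughput cDNA",
    "HTG": "High Throughput Genome", "HUM": "Human",
    "INV": "Invertebrates", "MUS": "Mus musculus", "ORG": "Organelles",
    "MAM": "Other Mammals", "VRT": "Other Vertebrates", "PLN": "Plants",
    "PRO": "Prokaryotes", "ROD": "Rodents", "STS": "STSs",
    "SYN": "Synthetic", "UNC": "Unclassified", "VRL": "Viruses",
}

# Assumed ID line layout; see the note above.
ID_LINE = re.compile(
    r"^ID\s+(?P<name>\S+)\s+(?P<entry_class>[^;]+);\s*(?P<molecule>[^;]+);"
    r"\s*(?P<division>[A-Z]{3});\s*(?P<length>\d+)\s+BP\.\s*$"
)


def parse_id_line(line):
    """Map the parts of an ID line to the concept names used in the ontology."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError("not an ID line in the assumed layout")
    division = match.group("division")
    if division not in DIVISION_CODES:
        raise ValueError(f"unknown division code {division}")
    return {
        "EMBL entry name": match.group("name"),
        "EMBL entry class name": match.group("entry_class").strip(),
        "EMBL entry molecule type name": match.group("molecule").strip(),
        "EMBL entry database division name": division,
        "nucleotide sequence length in nucleotide bases": int(match.group("length")),
    }
```

For example, `parse_id_line("ID   EXAMPLE1   standard; RNA; EST; 120 BP.")` yields a dictionary keyed by the concept names, which a type-matching step could compare directly.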

