r1 - 07 Mar 2003 - 13:11:00 - AlanRobinson

Approach 2: Targeted queries against specific resources

In the previous demonstration, I was a user who had a handle on a gene from an Affymetrix chip but didn't know which resources to go to, so I wandered through the web & collected information. I made much use of other web sites that had either collected this information together or held pointers to it, e.g. GeneLynx. However, I know there were some sites that I missed, & I didn't make full use of SRS's capabilities.

What I (unintentionally) did during this search was find a number of primary identifiers that were useful for looking up information in other search tools, including, in order of usefulness:

  • HUGO: 6917; MBD2.
  • LocusLink: 8932.
  • TREMBL: Q9UBB5.
  • EMBL: AF072242.
  • UniGene: Hs.25674.

However, a cursory look at the data-fields for the Affymetrix U95Av2 chip illustrates a problem we're going to have:

It shows that the chip has ~63,000 gene probes. However, there are references to:

  • 14,788 different HUGO gene symbols.
  • 14,803 different LocusLink IDs.
  • 30,841 different SWISSPROT IDs.
  • 26,117 different full-length EMBL IDs.
  • 41,444 different UniGene IDs.
  • 7,662 different OMIM IDs.

I don't know the redundancy on the U95Av2 chip, but I suspect that fairly often we're not going to be able to get a HUGO or LocusLink ID and will have to resort to one of the less useful identifiers. Things will be most interesting if all we have is a UniGene ID: the bad news is that these have very little annotation; the good news is that it may mean we've discovered the involvement of a previously uncharacterised gene in a disease process.

So suppose that, for this second approach, I am a bit more savvy & know which sites to go to directly to get the most information the quickest. If I were to start again with the Affymetrix ID '33905_at', I would try to get identifiers for EMBL, SWISS-PROT, HUGO & LocusLink as quickly as possible, then use these to query other sites & download data. To be fair to Affymetrix, the primary identifiers were in their entry for '33905_at', including, in order of usefulness: HUGO:MBD2, LocusLink:8932, TREMBL:Q9UBB5, EMBL:AF072242 & UniGene:Hs.25674.
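The "order of usefulness" above amounts to a simple fallback: take the best identifier that a probe's entry actually has. A minimal Python sketch, assuming the entry has already been parsed into a dictionary keyed by databank name (the dictionary layout and the function name best_identifier are my own invention, not part of any Affymetrix or SRS interface):

```python
# Databanks in decreasing order of usefulness, as ranked in the text.
PREFERENCE = ["HUGO", "LocusLink", "TREMBL", "EMBL", "UniGene"]


def best_identifier(annotation):
    """Return (databank, identifier) for the most useful ID present, or None.

    `annotation` is a hypothetical dict such as {"EMBL": "AF072242"};
    missing or empty values are skipped.
    """
    for databank in PREFERENCE:
        identifier = annotation.get(databank)
        if identifier:
            return databank, identifier
    return None


# The identifiers listed for probe '33905_at' in the text.
probe_33905_at = {
    "HUGO": "MBD2",
    "LocusLink": "8932",
    "TREMBL": "Q9UBB5",
    "EMBL": "AF072242",
    "UniGene": "Hs.25674",
}

print(best_identifier(probe_33905_at))
print(best_identifier({"UniGene": "Hs.25674"}))
```

For a probe that only carries a UniGene ID, the function falls through the list and returns that ID, which is the "most interesting" case described earlier.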

Note that to get these IDs & entries, I can either parse a citing file & find the IDs, or use the linking feature of SRS. Although the Affymetrix site has a lot of information, my most likely actions would be:

Then I would use the identifiers from the Affymetrix & IPI entry to download information from what I consider the primary annotation databases. There are a number of alternative places & ways I can get this data using either direct look-up or linking in SRS. For example:

Then I would set about using these identifiers to scavenge information from as many other sites as possible. The sites I've selected all offer CGI-based querying. However, the results you get back may need to be parsed to extract the IDs of the actual entries. There are a number of resources I've not used, e.g. ENZYME.
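The "query, then parse" step might look something like the following Python sketch. The base URL, the parameter names db and id, and the sample response text are all made up for illustration, since every real site has its own CGI interface. The only thing taken from the text is the shape of a UniGene ID such as Hs.25674.

```python
import re
from urllib.parse import urlencode

# Hypothetical CGI endpoint; every real site needs its own URL and parameters.
BASE_URL = "https://example.org/cgi-bin/query"

# UniGene human cluster IDs look like 'Hs.' followed by digits (e.g. Hs.25674).
UNIGENE_ID = re.compile(r"\bHs\.\d+\b")


def build_query_url(databank, identifier):
    """Build a GET URL for the hypothetical CGI script."""
    return BASE_URL + "?" + urlencode({"db": databank, "id": identifier})


def extract_unigene_ids(page_text):
    """Return the distinct UniGene IDs found in a result page, in order."""
    seen = []
    for match in UNIGENE_ID.findall(page_text):
        if match not in seen:
            seen.append(match)
    return seen


url = build_query_url("LocusLink", "8932")
# Made-up response text standing in for what a site might return.
sample_page = "MBD2 (LocusLink 8932) maps to cluster Hs.25674; see also Hs.25674."
print(url)
print(extract_unigene_ids(sample_page))
```

Each additional site would need its own URL builder and its own pattern for pulling IDs out of the result page, which is where most of the effort goes.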

Next is information where getting the entries requires more effort. I have at least three choices to get these entries: (i) parse citing files for the IDs & look them up directly; (ii) work out the query CGI API for the web pages using my available identifiers; (iii) find somewhere that's indexed them in SRS!

Next is literature information from MEDLINE. There are a number of options here:

A. Search in MEDLINE for e.g. "MBD2 & human":

B. Use the linking feature of SRS for the entries found above, e.g.

C. Parse the various files retrieved above for MEDLINE citations.

  • Ouch!
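Option C need not be quite so painful if the files are regular. A rough Python sketch, assuming the retrieved files are EMBL- or SWISS-PROT-style flat files in which literature cross-references sit on lines starting with RX (other formats would need their own patterns). The sample entry and its citation numbers are invented for illustration.

```python
import re

# On an RX line, a cross-reference is 'MEDLINE' followed by ';' or '=' and digits.
MEDLINE_REF = re.compile(r"MEDLINE\s*[;=]\s*(\d+)", re.IGNORECASE)


def medline_ids(flat_file_text):
    """Collect distinct MEDLINE numbers from RX lines, preserving order."""
    found = []
    for line in flat_file_text.splitlines():
        if not line.startswith("RX"):
            continue  # only RX lines are treated as cross-references
        for number in MEDLINE_REF.findall(line):
            if number not in found:
                found.append(number)
    return found


# Illustrative entry fragment; the citation numbers are invented.
sample_entry = """ID   EXAMPLE
RX   MEDLINE; 12345678.
RT   "A title that mentions MEDLINE; 999 but is not a cross-reference."
RX   MEDLINE=87654321; PubMed=1111111;
RX   MEDLINE; 12345678.
//
"""

print(medline_ids(sample_entry))
```

Restricting the search to RX lines avoids picking up numbers that merely appear next to the word MEDLINE in titles or comments.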

Conclusions on Approach 2

I've hit a number of resources & collected much information. The search hasn't included all possible resources. Many resources provide a CGI-based query mechanism. With the exception of SRS-based systems, all of these have different APIs. Before I can use a site, I've got to first find it & then know how to use its CGI-based API.

As an alternative exercise, how far can I get using just SRS @ EBI starting from EMBL:AF072242? I've mentioned the linking feature of SRS a number of times. SRS can catalogue the cross-references between databanks to build up a graph of the resources. We can navigate this graph to extract information starting from a single point, e.g. the EMBL ID.

If I were starting from the UniGene ID:

Given that SRS has a graph of cross-referenced resources, one could imagine trying to do a traversal of this. However, that is going to be non-trivial with respect to ensuring that only relevant data is pulled out.
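To make the traversal idea concrete, here is a small Python sketch of a breadth-first walk with two crude relevance guards: a maximum number of hops, and a whitelist of databanks to follow links into. The graph below is hand-written for illustration using the identifiers from this page; the edges are invented and are not taken from SRS, and a real version would have to fetch links from SRS itself.

```python
from collections import deque

# Hypothetical cross-reference graph: each (databank, id) node lists the
# entries it links to. The edges are invented for illustration only.
XREFS = {
    ("EMBL", "AF072242"): [("TREMBL", "Q9UBB5"), ("UniGene", "Hs.25674")],
    ("TREMBL", "Q9UBB5"): [("EMBL", "AF072242"), ("HUGO", "MBD2")],
    ("UniGene", "Hs.25674"): [("EMBL", "AF072242"), ("LocusLink", "8932")],
    ("HUGO", "MBD2"): [("LocusLink", "8932")],
    ("LocusLink", "8932"): [("HUGO", "MBD2")],
}


def traverse(graph, start, max_depth, allowed_databanks):
    """Breadth-first walk from `start`, returning {node: depth}.

    Links are followed only into databanks in `allowed_databanks` and only
    up to `max_depth` hops from the starting entry.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand entries at the depth limit
        for neighbour in graph.get(node, []):
            if neighbour in depths or neighbour[0] not in allowed_databanks:
                continue
            depths[neighbour] = depths[node] + 1
            queue.append(neighbour)
    return depths


reached = traverse(XREFS, ("EMBL", "AF072242"), max_depth=2,
                   allowed_databanks={"TREMBL", "HUGO", "LocusLink"})
for (databank, identifier), depth in reached.items():
    print(depth, databank, identifier)
```

A depth limit and a whitelist are blunt instruments; they stop the walk from wandering, but they say nothing about whether a reached entry really concerns the same gene, which is the hard part.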

-- AlanRobinson - 07 Mar 2003
