Approach 2: Targeted queries against specific resources
In the previous demonstration, I was a user who had a handle on a gene from an Affymetrix chip but didn't know which resources to go to, so I wandered through the web & collected information. I made much use of other web sites that had either collected this information together or held pointers to it, e.g. GeneLynx. However, I know there were some sites that I missed & I didn't make full use of SRS's capabilities.
What I (unintentionally) did during this search was find a number of primary identifiers that were useful for looking up information in other search tools, including, in order of usefulness:
- HUGO: 6917; MBD2.
- LocusLink: 8932.
- TREMBL: Q9UBB5.
- EMBL: AF072242.
- UniGene: Hs.25674.
However, a cursory look at the data-fields for the Affymetrix U95Av2 chip illustrates a problem we're going to have:
It shows that the chip has ~63,000 gene probes. However, there are references to:
- 14,788 different HUGO gene symbols.
- 14,803 different LocusLink IDs.
- 30,841 different SWISSPROT IDs.
- 26,117 different full-length EMBL IDs.
- 41,444 different UniGene IDs.
- 7,662 different OMIM IDs.
I don't know the redundancy on the U95Av2 chip, but I suspect that fairly often we're not going to be able to get a HUGO or LocusLink ID and will have to resort to one of the less useful identifiers. Things will be most interesting if all we have is a UniGene ID: the bad news is that these have very little annotation; the good news is that it may mean we've discovered the involvement of a previously uncharacterised gene in a disease process.
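The fallback just described can be sketched in a few lines. This is only an illustration, assuming each probe's annotation has already been read into a dictionary keyed by databank name; the function name `best_identifier` and the dictionary layout are my own, not anything Affymetrix or SRS provides. The ranking is the order of usefulness given above.

```python
# Identifier types ranked from most to least useful, following the
# ordering given earlier on this page.
PREFERENCE = ["HUGO", "LocusLink", "TREMBL", "EMBL", "UniGene"]


def best_identifier(xrefs):
    """Return (databank, id) for the most useful identifier present, or None."""
    for databank in PREFERENCE:
        value = xrefs.get(databank)
        if value:  # skip missing or empty fields
            return databank, value
    return None


# The identifiers listed above for the gene from the previous demonstration.
full = {"HUGO": "MBD2", "LocusLink": "8932", "TREMBL": "Q9UBB5",
        "EMBL": "AF072242", "UniGene": "Hs.25674"}
# A hypothetical probe annotated with nothing but a UniGene cluster.
sparse = {"UniGene": "Hs.25674"}

print(best_identifier(full))
print(best_identifier(sparse))
```

A probe with no usable identifier at all gives `None`, which is the case to flag for manual follow-up.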
So suppose that for this second approach I am a bit more savvy and know which sites to go to directly to get the most information the quickest. If I were to start again with the Affymetrix ID '33905_at', I would try to get identifiers for EMBL, SWISS-PROT, HUGO & LocusLink as quickly as possible, then use these to query other sites & download data. To be fair to Affymetrix, the primary identifiers were in their entry for '33905_at', including in order of usefulness: HUGO:MBD2, LocusLink:8932, TREMBL:Q9UBB5, EMBL:AF072242 & UniGene:Hs.25674.
Note that to get these IDs & entries, I can either parse a citing file & find the IDs, or use the linking feature of SRS. Although the Affymetrix site has a lot of information, my most likely actions would be:
- Identify the representative EMBL/GenBank sequence ID from the Affymetrix entry = AF072242.
- Identify the SPTREMBL identifier in the EMBL:AF072242 entry = Q9UBB5.
- Identify the IPI entry for the SPTREMBL entry = IPI:IPI00022489.
- Identify the HUGO ID from the Affymetrix or LocusLink or SPTREMBL entry = MBD2.
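The chain of look-ups above can be sketched as a walk over a small cross-reference table. The table here is typed in by hand from the identifiers on this page and stands in for whatever parsing or SRS linking would really supply it; the names `XREFS` and `follow` are hypothetical.

```python
# Cross-references as read off the entries named in the steps above.
# Keys are (databank, id); values map a target databank to the ID found there.
# This table is a hand-typed stand-in, not the output of any real service.
XREFS = {
    ("Affymetrix", "33905_at"): {"EMBL": "AF072242", "HUGO": "MBD2",
                                 "LocusLink": "8932"},
    ("EMBL", "AF072242"): {"SPTREMBL": "Q9UBB5"},
    ("SPTREMBL", "Q9UBB5"): {"IPI": "IPI00022489", "HUGO": "MBD2"},
}


def follow(start, path):
    """Walk `path` (a list of databank names) from `start`, one hop at a time."""
    current = start
    trail = [current]
    for databank in path:
        links = XREFS.get(current, {})
        if databank not in links:
            raise KeyError(f"{current[0]}:{current[1]} has no link to {databank}")
        current = (databank, links[databank])
        trail.append(current)
    return trail


trail = follow(("Affymetrix", "33905_at"), ["EMBL", "SPTREMBL", "IPI"])
print(" -> ".join(f"{db}:{acc}" for db, acc in trail))
```

Raising an error on a missing hop, rather than returning a partial trail, makes it obvious which entry failed to supply the next identifier.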
Then I would use the identifiers from the Affymetrix & IPI entries to download information from what I consider the primary annotation databases. There are a number of alternative places & ways I can get this data using either direct look-up or linking in SRS. For example:
- EMBL (With EMBL ID taken from Affymetrix entry):
- SPTREMBL (With ID taken from EMBL entry or SRS link from EMBL):
- LocusLink (LocusLink ID taken from IPI or Affymetrix entry):
- InterPro (With ID taken from IPI entry or SRS link from IPI or EMBL or SPTREMBL):
- PfamA (ID taken from IPI or SPTREMBL entry or SRS link from IPI or SPTREMBL or EMBL):
- GOA (Use SPTREMBL ID):
- OMIM (ID taken from Affymetrix or LocusLink entry or SRS link from LocusLink):
Then I would set about using these identifiers to scavenge information from as many other sites as possible. The sites I've chosen all have CGI-based querying; however, the results you get back may need to be parsed to extract the IDs of the actual entries. There are a number of resources I've not used, e.g. ENZYME.
- Ensembl (Use EMBL ID):
- Ensembl Peptide (Ensembl Peptide ID taken from IPI):
- Ensembl Gene Report (Use HUGO ID):
- GeneCards (Use HUGO ID):
- GenAtlas (Use HUGO ID):
- euGenes (Use munged LocusLink ID):
- RefSeq (RefSeq IDs taken from LocusLink or Affymetrix or ...):
- dbSNP (Use LocusLink ID):
- HGVbase (Use HUGO ID):
- PubGene (Use HUGO ID):
- KEGG (Use LocusLink ID):
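Since each site in the list above takes one particular kind of identifier, the queries can be generated from a small table. The sketch below shows the idea only: the hosts, paths & parameter names are placeholders on example.org, NOT the real query interfaces of these sites, which would each have to be looked up.

```python
from urllib.parse import urlencode

# HYPOTHETICAL endpoints: every host, path and parameter name below is a
# placeholder. Each entry is site -> (identifier type, base URL, parameter).
QUERY_TEMPLATES = {
    "Ensembl": ("EMBL", "https://ensembl.example.org/cgi-bin/search", "acc"),
    "GeneCards": ("HUGO", "https://genecards.example.org/cgi-bin/lookup", "gene"),
    "dbSNP": ("LocusLink", "https://dbsnp.example.org/cgi-bin/query", "locus"),
}

# Identifiers collected for 33905_at, as listed on this page.
IDS = {"HUGO": "MBD2", "LocusLink": "8932", "EMBL": "AF072242"}


def build_urls(ids):
    """Return {site: query URL} for every site whose identifier type we hold."""
    urls = {}
    for site, (id_type, base, param) in QUERY_TEMPLATES.items():
        if id_type in ids:
            # urlencode escapes any characters that are unsafe in a query string
            urls[site] = f"{base}?{urlencode({param: ids[id_type]})}"
    return urls


for site, url in build_urls(IDS).items():
    print(site, url)
```

Sites whose identifier type is missing are simply skipped, so the same table works for probes with only partial annotation.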
Next is information where getting the entries requires more effort. I have at least three choices for getting these entries: (i) parse citing files for the IDs & look them up directly; (ii) work out the query CGI API for the web pages using my available identifiers; (iii) find somewhere that's indexed them in SRS!
- PDB/MSD (PDB IDs taken from PfamA):
- GDB (GDB ID taken from HUGO entry or HUGO ID):
- GeneLynx (GeneLynx ID taken from GeneCards entry or HUGO ID or LocusLink ID or ...):
- CGAP (CGAP ID taken from GeneLynx entry or HUGO ID or EMBL ID):
Next is literature information from MEDLINE. There are a number of options here:
A. Search in MEDLINE for e.g. "MBD2 & human":
B. Use the linking feature of SRS for the entries found above, e.g.
C. Parse the various files retrieved above for MEDLINE citations.
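Option C might look something like the sketch below. It assumes the retrieved files are EMBL-style flat files in which citations appear on lines of the form `RX   MEDLINE; <id>.`; other databanks lay out their citations differently and would need their own patterns. The two sample entries & their MEDLINE numbers are made-up placeholders, not real records.

```python
import re

# Matches e.g. "RX   MEDLINE; 00000001." at the start of a line
# (assumed EMBL-style layout; adjust per databank).
RX_LINE = re.compile(r"^RX\s+MEDLINE;\s*(\d+)\.?[ \t]*$", re.MULTILINE)


def medline_ids(*entries):
    """Collect MEDLINE IDs from flat-file texts, in first-seen order, no repeats."""
    seen = []
    for text in entries:
        for uid in RX_LINE.findall(text):
            if uid not in seen:
                seen.append(uid)
    return seen


# Placeholder entries: the IDs below are invented for illustration only.
SAMPLE_A = """\
ID   PLACEHOLDER1
RX   MEDLINE; 00000001.
RX   MEDLINE; 00000002.
//
"""
SAMPLE_B = """\
ID   PLACEHOLDER2
RX   MEDLINE; 00000002.
RX   MEDLINE; 00000003.
//
"""

print(medline_ids(SAMPLE_A, SAMPLE_B))
```

Removing repeats matters here because the same paper is often cited by several of the entries retrieved for one gene.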
Conclusions on Approach 2
I've hit a number of resources & collected much information. The search hasn't included all possible resources. Many resources provide a CGI-based query mechanism. With the exception of SRS-based systems, all of these have different APIs. Before I can use a site, I've got to first find it & then know how to use its CGI-based API.
As an alternative exercise, how far can I get using just SRS @ EBI starting from EMBL:AF072242? I've mentioned the linking feature of SRS a number of times. SRS can catalogue the cross-references between databanks to build up a graph of the resources. We can navigate this graph to extract information starting from a single point, e.g. the EMBL ID.
If I were starting from the UniGene ID:
Given that SRS has a graph of cross-referenced resources, one could imagine trying to do a traversal of this. However, that is going to be non-trivial with respect to ensuring that only relevant data is pulled out.
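One crude way to keep such a traversal relevant is to bound how many links it may follow & to restrict which databanks it may enter. The sketch below does a breadth-first walk under those two limits. The `LINKS` table is a hand-typed toy: its edges loosely follow the links described on this page and are not a dump of SRS's real link tables, and `traverse` is my own hypothetical helper, not an SRS call.

```python
from collections import deque

# Toy cross-reference graph (illustrative edges only).
LINKS = {
    "EMBL:AF072242": ["SPTREMBL:Q9UBB5"],
    "SPTREMBL:Q9UBB5": ["IPI:IPI00022489", "HUGO:MBD2", "EMBL:AF072242"],
    "IPI:IPI00022489": ["LocusLink:8932", "SPTREMBL:Q9UBB5"],
    "LocusLink:8932": ["HUGO:MBD2", "UniGene:Hs.25674"],
    "UniGene:Hs.25674": ["EMBL:AF072242"],
}


def traverse(start, allowed, max_depth):
    """Breadth-first walk; returns {entry: number of links from start}."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not follow links beyond the depth limit
        for neighbour in LINKS.get(node, []):
            databank = neighbour.split(":", 1)[0]
            if databank in allowed and neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


found = traverse("EMBL:AF072242",
                 allowed={"SPTREMBL", "IPI", "HUGO", "LocusLink"},
                 max_depth=2)
for entry, hops in found.items():
    print(hops, entry)
```

Depth & databank filters are only a first cut: they stop the walk wandering off, but they cannot tell whether an entry two links away is actually about the same gene.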
--
AlanRobinson - 07 Mar 2003