r2 - 07 Mar 2003 - 15:25:26 - AlanRobinson

Approach 1: A ~Random Walk Through Web Pages following Hyperlinks

In the first walk-through, we're going to do an ~exhaustive crawl through links in bioinformatics resources...

A gene on the Affymetrix U95Av2 chip with ID '33905_at' has been identified as being differentially expressed between a Graves' disease patient & an unaffected sample. This isn't a made-up example; it is one of the genes that Claire has identified.

The first issue is that '33905_at' is an Affymetrix identifier. We need to find out what the DNA sequence is & how it maps to a gene or protein identifier as used by other databases. You'll need to use the Affymetrix NetAffx tools for this.

  • Go to http://www.affymetrix.com/ and press 'Analysis' at the top of the screen.
  • Choose 'Interactive Query' from the 'ANNOTATION & TOOLS' section.

Our first hurdle is that only registered users can access Affymetrix tools & data, so either (i) register yourself or (ii) borrow my username: alan@ebi.ac.uk & password: mygrid

  • Check the box next to 'HG-U95 Target'. This is the chip set Claire used.
  • Enter '33905_at' into the 'QUICK SEARCH' text field at the top of the screen.
  • Press 'QUICK SEARCH'.

You should see one entry being returned with a number of pieces of information. For the time being, I'm going to ignore the "LINK" option at the top of the page & go for manual browsing instead.

  • Click on 'HG-U95 Target:33905_at_HG-U95Av2'. (N.B. You may notice that the URL for this page is actually a CGI command for SRS.)

You should see a page of information returned. This is our starting point for branching out into other bioinformatics resources, either directly through URL's or by picking out unique identifiers that we can enter into other search services.

Even a cursory look at this file reveals some information that I know I'll want, including:

Basic annotation:

  • Chromosomal location: 18q21
  • Sequence description: [...] Homo sapiens methyl-CpG binding protein MBD2 (MBD2) [...]
  • Gene Title: methyl-CpG binding domain protein 2

References to other databases where I know I'm going to find information:

  • UniGene ID: Hs.25674.
  • Representative Public ID: AF072242.
  • HUGO Gene Symbol: MBD2.
  • LocusLink: 8932.
  • SWISS-PROT: O60535; O95242; Q9UBB5.
  • OMIM: 603547.
  • RefSeq: NM_003927 methyl-CpG binding domain protein 2 isoform 1
  • RefSeq: NM_015832 methyl-CpG binding domain protein 2 testis-specific isoform
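Most of the identifiers listed above have a recognisable shape, which is what makes them usable as search keys in other services. Below is a minimal Python sketch that sorts them by type. The regular expressions are approximations inferred only from the examples on this page (they are not official format definitions), and `classify` is a hypothetical helper name.

```python
import re

# Patterns inferred from the examples in this walk-through only;
# they are approximations, not official identifier specifications.
PATTERNS = {
    "Affymetrix probe set": re.compile(r"^\d+_at$"),
    "UniGene": re.compile(r"^Hs\.\d+$"),
    "RefSeq mRNA": re.compile(r"^NM_\d+$"),
    "EMBL accession": re.compile(r"^[A-Z]{2}\d{6}$"),
    "SWISS-PROT/TrEMBL accession": re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$"),
}


def classify(identifier):
    """Return the first identifier type whose pattern matches, else 'unknown'."""
    for name, pattern in PATTERNS.items():
        if pattern.match(identifier):
            return name
    return "unknown"


# Identifiers copied from the Affymetrix entry described above.
found = ["33905_at", "Hs.25674", "AF072242", "NM_003927", "Q9UBB5", "8932", "603547"]
by_type = {identifier: classify(identifier) for identifier in found}
```

Note that the bare numbers (LocusLink 8932, OMIM 603547) come back as 'unknown': their shape alone does not say which database they belong to, so they must always travel together with the database name.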

Further down the page are 'Functional Annotations'. I'm going to choose to ignore these as I suspect that I can get better & more up-to-date information by going to other resources.

N.B. If you're feeling that this page looks like it'll be a bugger to parse, then don't worry too much, we can use the SRS engine & query language to make it a little easier.

I am now going to choose to jump out of the Affymetrix web site & go to the EBI. I'd rather browse the EMBL & SWISS-PROT database entries at source, rather than the ones Affymetrix have downloaded. An interesting aside: the Affymetrix web site links out to a number of external sources, whereas at the EBI site all of these databases are integrated into the SRS system. Generally this allows for more powerful querying & linking; however, we're not using the authoritative source of the information.

An obvious starting point is the "Representative Public ID: AF072242". Generally, the DNA sequence files are low on information content, but they are a good place to start as other databases refer back to them.

  • In another browser window, go to http://srs.ebi.ac.uk/
  • Click 'Start a Temporary Project'.
  • Enter 'AF072242' in the text field next to 'Quick Search'.
  • Check the box next to 'EMBL' in the 'Sequence libraries - complete'. N.B. Ignore the 'EMBL' options in the 'Sequence libraries - subsections'.
  • Click 'Quick Search'.

You should get back a list of three entries.

  • Click on 'EMBL:AF072242' in the first column. Do not click on 'AF072242' in the second column - that will take you to our archive of sequence versions.

You should see an EMBL entry file.

Things I note immediately:

  • The description line: DE Homo sapiens methyl-CpG binding protein MBD2 (MBD2) mRNA, complete cds.
  • One literature reference: MEDLINE; 98449942.
  • Cross-reference: DR GOA; Q9UBB5;
  • Cross-reference: DR SPTREMBL; Q9UBB5;
  • In the feature lines (FT), there are bits of information, e.g. map="18q21.1"; product="methyl-CpG binding protein MBD2"; note="encodes methyl-CpG binding domain".
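The EMBL entry is a flat file in which every line starts with a two-letter code (DE, DR, FT, ...), so the items noted above can be picked out mechanically. Here is a minimal Python sketch, assuming that layout. The sample text is reconstructed from the fragments quoted on this page (its exact spacing and punctuation are assumed, not copied from the real entry), and `parse_embl` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Sample lines reconstructed from the fragments quoted above; the exact
# spacing and punctuation of the real entry are assumed, not copied.
SAMPLE = '''\
DE   Homo sapiens methyl-CpG binding protein MBD2 (MBD2) mRNA, complete cds.
DR   GOA; Q9UBB5;
DR   SPTREMBL; Q9UBB5;
FT                   /map="18q21.1"
FT                   /product="methyl-CpG binding protein MBD2"
FT                   /note="encodes methyl-CpG binding domain"
'''

# Matches feature qualifiers of the form /name="value".
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_embl(text):
    """Group lines by their two-letter code, then extract DR and FT details."""
    by_code = defaultdict(list)
    for line in text.splitlines():
        code, _, rest = line.partition(" ")
        by_code[code].append(rest.strip())

    # Each DR line is "database; identifier;" - keep the first two fields.
    xrefs = []
    for dr in by_code["DR"]:
        parts = [p.strip() for p in dr.split(";") if p.strip()]
        if len(parts) >= 2:
            xrefs.append((parts[0], parts[1]))

    qualifiers = dict(QUALIFIER.findall("\n".join(by_code["FT"])))
    return {
        "description": " ".join(by_code["DE"]),
        "xrefs": xrefs,
        "qualifiers": qualifiers,
    }


entry = parse_embl(SAMPLE)
```

A real entry has many more line types and multi-line qualifiers, which this sketch does not attempt to handle.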

If you wish, quickly jump into the GOA; Q9UBB5 link. This lists annotated functions taken from the Gene Ontology for the protein. However, I feel that it's better to pick up protein annotation via a protein sequence entry.

My next port of call is SWISS-PROT/TrEMBL - the annotated protein sequence databases. I'm going to follow the link to SPTREMBL from the EMBL file. I could use the Affymetrix entry; however, of the three entries it gives, two are fragments. Again, I'd rather follow the information in the authoritative file.

  • Click on the 'SPTREMBL:Q9UBB5' hyperlink.

We are now in the protein sequence world.

Things I notice immediately as interesting:

  • Description: Methyl-CpG binding protein 2 (Methyl-CpG binding domain protein 2).
  • Two literature references: MEDLINE: 99373255 & 98449942.
  • Cross reference: Genew HGNC:6917; MBD2.
  • Cross-reference: InterPro IPR001739; Methyl-CpG_bind.
  • Cross-reference: Pfam PF01429; MBD; 1.

N.B. For the most part, I'm going to ignore all those EMBL entries. Generally they're just parts of the sequence & provide no new information.

  • Click on the 'Genew HGNC:6917; MBD2.' hyperlink.

We are now accessing the HUGO official gene name database. The HUGO Gene Nomenclature Committee has approved 'MBD2' as the official name for the human gene. Unfortunately, not everyone chooses to listen to them, plus we have legacy naming.

At this point I hit a problem with the EBI version of the HGNC entry. It contains fewer cross-links to other databases than the one at the original source of the database. In particular, it lacks links to GeneCards & (embarrassingly) Ensembl.

  • Click on the 'Symbol MBD2' hyperlink to be taken to UCL.

Unfortunately, we've not been taken to where I wanted to go :-( I actually want the new search engine. Your naive user probably isn't going to realise this!

  • Click on the 'To search both literature aliases and approved symbols, click here' hyperlink.
  • Enter 'MBD2' into the text field.
  • Click 'Search'.
  • Click 'MBD2'.

I've found a number of database links, including:

  • Ensembl (using MBD2 ID).
  • GeneCards (using the MBD2 ID).
  • GDB (using the GDB ID, 9957906).
  • LocusLink (using the LocusLink ID: 8932).
  • OMIM (using the OMIM ID: 603547).
  • RefSeq (using the RefSeq ID: NM_015832).
  • SWISS-PROT (using the SWISS-PROT ID: Q9UBB5).
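Several of these identifiers already appeared on the Affymetrix page, so it is worth checking that the two resources agree. Below is a minimal Python sketch of that cross-check. The identifier values are copied from the two lists in this walk-through, while the dictionary layout and the `compare` helper are assumptions made for illustration.

```python
# Identifiers as reported on the two pages in this walk-through.
affymetrix = {
    "UniGene": {"Hs.25674"},
    "LocusLink": {"8932"},
    "OMIM": {"603547"},
    "RefSeq": {"NM_003927", "NM_015832"},
    "SWISS-PROT": {"O60535", "O95242", "Q9UBB5"},
}
hgnc = {
    "GDB": {"9957906"},
    "LocusLink": {"8932"},
    "OMIM": {"603547"},
    "RefSeq": {"NM_015832"},
    "SWISS-PROT": {"Q9UBB5"},
}


def compare(first, second):
    """For each database both sources mention, split IDs into shared and unique."""
    report = {}
    for db in sorted(first.keys() & second.keys()):
        report[db] = {
            "shared": first[db] & second[db],
            "only_first": first[db] - second[db],
            "only_second": second[db] - first[db],
        }
    return report


report = compare(affymetrix, hgnc)
```

Here the LocusLink and OMIM identifiers agree exactly, while the Affymetrix page lists one extra RefSeq entry (NM_003927) and two extra SWISS-PROT accessions that the HGNC page does not give.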

Now:

  • Click on the Ensembl link.

This takes you to the Ensembl page for the MBD2 gene as it has been mapped to the consensus human genome sequence.

Things I note immediately:

  • View gene in genomic location: 51615682 - 51686265 bp (51.6 Mb) on chromosome 18. [I've got the physical location of the gene on the chromosome.]
  • Description: METHYL-CPG BINDING DOMAIN PROTEIN 2 ISOFORM 1. [Source: RefSeq (NM_003927)]
  • A couple of putative homologues in mouse & rat.
  • Similarity matches & links to GeneCards, LocusLink, RefSeq & SPTREMBL.
  • Links to GO terms describing the function of the gene product:
    • GO:0003677 [DNA binding]
    • GO:0005634 [nucleus]
    • GO:0008327 [methyl-CpG binding]
    • GO:0016481 [negative regulation of transcription]
    • GO:0016564 [transcriptional repressor]
  • InterPro hits:
    • IPR000637 HMG-I and HMG-Y DNA-binding domain (A+T-hook)
    • IPR001739 Methyl-CpG binding
  • Transcript structure.
  • Protein structure.
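The genomic location string above can be turned into numbers for later use, e.g. when placing SNPs relative to the gene. The Python sketch below parses the text exactly as Ensembl displays it here. It assumes the coordinates are inclusive at both ends, and `parse_location` is a hypothetical helper.

```python
import re

# Location text copied from the Ensembl page described above.
LOCATION = "51615682 - 51686265 bp (51.6 Mb) on chromosome 18"


def parse_location(text):
    """Extract start, end and chromosome; length assumes inclusive coordinates."""
    match = re.match(r"(\d+)\s*-\s*(\d+) bp .* on chromosome (\w+)", text)
    if match is None:
        raise ValueError("unrecognised location format: " + text)
    start, end = int(match.group(1)), int(match.group(2))
    return {
        "chromosome": match.group(3),
        "start": start,
        "end": end,
        "length_bp": end - start + 1,
    }


location = parse_location(LOCATION)
```

Under that assumption the gene spans 70584 bp, and the start position rounds to the 51.6 Mb that Ensembl shows.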

Now:

  • Click on the 'View gene in genomic location: 51615682 - 51686265 bp (51.6 Mb) on chromosome 18' hyperlink near the top of the page.

You're now in one of the main Ensembl views. I'm not going to even try to explain this. Suffice to say, you can gather a great deal of information from this user interface (including SNPs) & export it in a number of formats.

You are now at GeneCards, the Weizmann Institute's curated catalogue of human genes (both the HGNC & Ensembl pages above link to it). Once again this has a plethora of information that I need to digest. Apart from cross-references to other sites (most of which we've seen before), there's also new information on the expression of MBD2 in human tissues, as well as information on SNPs & variants in the gene, taken from dbSNP & SWISS-PROT.

Notable new cross-references to other sites are:

  • GeneLynx (using the GeneLynx ID)
  • euGenes (using the LocusLink ID)

Now:

  • Click on the euGenes hyperlink from 'MBD2 in Other Genome Wide Resources'.

More information... We seem to have some extra(?) GO terms:

  • Molecular function: satellite DNA binding; DNA-binding protein.
  • Biological process: methyl-CpG binding domain protein 2, testis-specific isoform; methyl-CpG binding domain protein 2, isoform 1.
  • Cellular component: nucleus; Nuclear; DNA-associated (direct or indirect).

Also a summary of MBD2:

  • "DNA methylation is the major modification of eukaryotic genomes and plays an essential role in mammalian development. Human proteins MECP2, MBD1, MBD2, MBD3, and MBD4 comprise a family of nuclear proteins related by the presence in each of a methyl-CpG binding domain (MBD). Each of these proteins, with the exception of MBD3, is capable of binding specifically to methylated DNA. MECP2, MBD1 and MBD2 can also repress transcription from methylated gene promoters. MBD2 may function as mediators of the biological consequences of the methylation signal. It is also reported that the MBD2 protein functions as a demethylase to activate transcription, as DNA methylation causes gene silencing. However, MBD2 in HeLa cells does not demethylate DNA, probably due to HeLa cell's using an alternative pathway involving MBD2 to silence methylated genes."

Plus cross-references to:

  • RefSeq (using RefSeq ID's).
  • LocusLink (using LocusLink ID).
  • UniGene (using UniGene ID).
  • OMIM (using the OMIM ID).
  • dbSNP (using the LocusLink ID).
  • GDB (using the GDB ID).
  • GeneCards (using the MBD2 ID).

Now:

  • Click on the 'SNP:8932' hyperlink.

We're now at the NCBI and have a list of SNPs that NCBI have identified to occur in the MBD2 gene. The position of SNPs is given relative to the contig NT_033905 for different gene models.

  • Go back to the GeneCards page.
  • Click on the 'GeneLynx' hyperlink.

We're now at the GeneLynx page - there are yet more new cross-references, including:

  • CGAP.
  • GenAtlas.
  • HGVbase.
  • PubGene.

Now:

  • Find & select the CGAP link.

We're at the National Cancer Institute now. I've found new information on:

  • This gene is found in these cDNA libraries from the following tissue types: brain, cerebrum, fetus, nervous.
  • Results from SAGE experiments with a tag for this gene in normal vs cancerous tissues & cell lines.

Now:

  • Go back to the GeneLynx page.
  • Find & select the GenAtlas link.

We're now at Universite Rene Descartes in Paris. There's more annotation here. I notice some notes on the pathology of the MBD2 gene, including:

  • tumor; colorectal and stomach cancers.
  • underexpressed at the early stage of colorectal and stomach carcinogenesis.

Now:

  • Go back to the 'GeneLynx' page.
  • Find & select the 'PubGene' link.

PubGene shows the interactions between genes based on their co-occurrence in literature articles. For example, there are 14 articles that mention both MBD3 & MBD2.
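The idea of literature co-occurrence can be sketched in a few lines of Python. This is a generic illustration only: how PubGene actually builds its network is not described on this page, the `cooccurrence` helper is hypothetical, and the article data below are made-up toy values rather than real literature counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set lists the gene symbols mentioned in one
# article. These are NOT real literature counts.
articles = [
    {"MBD2", "MBD3"},
    {"MBD2", "MBD3", "MECP2"},
    {"MECP2"},
    {"MBD1", "MBD2"},
]


def cooccurrence(article_genes):
    """Count, for every pair of genes, the articles that mention both."""
    counts = Counter()
    for genes in article_genes:
        # Sorting makes each pair a canonical (alphabetical) tuple.
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence(articles)
```

With these toy sets, the pair ('MBD2', 'MBD3') gets a count of 2; on the real PubGene page the corresponding figure is the 14 articles mentioned above.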

Now:

  • Go back to the 'GeneLynx' page.
  • Find the 'HGVbase' link section.

Clicking on any of the HGVbase links will take you to an entry from the HGVbase databank that catalogs & describes SNPs & variation found in the human genome. This is a very rich source of information on SNPs taken from the literature, submission & dbSNP. It is generally of higher quality than data in dbSNP.

Now:

We're now at the NCBI looking at the MBD2 entry in LocusLink, their curated databank. Information that is interesting is:

  • An overview of the gene (Hey! Isn't that identical to the euGenes summary? - Yes it is. In fact, LocusLink should be considered the authoritative version.)
  • There is GeneRIF - A third party annotation facility where people can add relevant literature citations to the current LocusLink entry. Four citations have been added.
  • Gene Ontology terms.
  • Two references with reviewers' comments & links to GenBank.
  • Evidence for this locus (i.e. gene).

Now:

We're now at the On-line Mendelian Inheritance in Man archive.

There's a large amount of knowledge & references here about the MBD2 gene & its alleles. There's a link to a structure for a protein Mecp2, though it's not clear yet if this is the same as the MBD2 protein.

Now:

We're back at the NCBI, in RefSeq, their reference databank of DNA sequences.

  • There are a number of literature citations.
  • There are cross-references to protein domain databases: CDD, SMART & Pfam.
  • There are three SNPs from dbSNP given.

Now:

This InterPro entry provides information on the functional domain found in MBD2 (& other methyl-CpG binding proteins). In particular:

  • References to Pfam & SMART.
  • A molecular function term: DNA binding (GO:0003677).
  • Abstract:
    • The Methyl-CpG binding domain (MBD) binds to DNA that contains one or more symmetrically methylated CpGs [1]. DNA methylation in animals is associated with alterations in chromatin structure and silencing of gene expression. MBD has negligible non-specific affinity for DNA. In vitro foot-printing with MeCP2 showed the MBD can protect a 12 nucleotide region surrounding a methyl CpG pair [1]. MBDs are found in several Methyl-CpG binding proteins and also DNA demethylase [2].
  • Two references about the methyl-CpG binding domain.

Now:

  • Click on the PFAM link: PF01429.

N.B. I find the plain text Pfam entry in SRS easier to read & navigate: http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-e+[PFAMA:'MBD']
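The SRS link above follows a simple pattern: the `wgetz` CGI script, the `-e` flag, and a bracketed `databank:'key'` query. Below is a minimal Python sketch that rebuilds that URL. `srs_entry_url` is a hypothetical helper, and only the PFAMA/MBD combination is confirmed by this page; any other databank name passed to it would be an assumption about how the server is configured.

```python
# Base address copied from the link quoted above.
SRS_BASE = "http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz"


def srs_entry_url(databank, key, base=SRS_BASE):
    """Build an SRS entry URL in the same form as the link quoted above.

    The query is left un-encoded, exactly as it appears on this page.
    """
    return f"{base}?-e+[{databank}:'{key}']"


pfam_url = srs_entry_url("PFAMA", "MBD")
```

The sketch only builds the string; it does not contact the server or check that the entry exists.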

Pfam is (probably) the best protein domain database in the world. Folks in Manchester may disagree ;-)

  • There is a description of the protein domain, however it's the one that InterPro lifted.
  • There are links to three structures in MSD/PDB that have the MBD domain.
  • There is a cross-reference to SCOP.
  • There are cross-references to HOMSTRAD, PfamB, SYSTERS & PANDIT.

Now:

  • Find & select the 'MSD' link for the 1QK9 structure.

We are now at the EBI's version of the Protein Structure Databank. It appears that for 1QK9:

  • This structure is a fragment of a protein called "Methyl-Cpg-Binding Protein 2".
  • There is a cross-link for the gene 'MECP2' - Following that link shows that MBD2 & MECP2 are not the same gene, although they contain the same protein fold. Other than information on the fold, I probably cannot extract too much more information from this structure.
  • Mutations in this gene can cause Rett Syndrome.
  • There are cross-links to structure-derived information for each structure.
  • We could drift into ProtoMap, but we are probably going to have much fun later trying to map SNPs from MBD2 onto the structure of MECP2.

Now:

  • Go back to the InterPro page.
  • Select the SMART hyperlink.

We are now in Heidelberg in Germany. Information available from SMART on this protein domain includes:

  • Methyl-CpG binding domain, also known as the TAM (TTF-IIP5, ARBP, MeCP1) domain.
  • The same description as used in Pfam & InterPro.
  • Links to other proteins having this domain.
  • Links to structures having this domain (which we've seen before).

Let's call it a day now.

Conclusions of Approach 1:

For the most part, I think I've fairly exhaustively searched through the available links of a large number of resources, starting from the EMBL entry. Along the way I passed through a number of resources that attempt to catalogue links between resources, e.g. GeneCards & GeneLynx. However, I have noticed that I never picked up links to some important resources, e.g. IPI.

I happen to know that this is a very good place to pick up identifiers to use in searches of other databases. Also I don't think the links to HGVbase that I picked up were very good. Also I never made proper use of the features of SRS.
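The walk itself can be viewed as a traversal of a graph whose nodes are resources and whose edges are the hyperlinks between them. The Python sketch below runs a breadth-first search over the links noted in this walk-through. The edge list is transcribed from the cross-reference lists on this page (links the page does not mention are simply absent), the `reachable` helper is hypothetical, and none of this queries any live site.

```python
from collections import deque

# Edges transcribed from the cross-reference lists on this page.
LINKS = {
    "EMBL": ["GOA", "SPTREMBL"],
    "SPTREMBL": ["HGNC", "InterPro", "Pfam"],
    "HGNC": ["Ensembl", "GeneCards", "GDB", "LocusLink", "OMIM", "RefSeq", "SWISS-PROT"],
    "Ensembl": ["GeneCards", "LocusLink", "RefSeq", "SPTREMBL", "InterPro"],
    "GeneCards": ["GeneLynx", "euGenes"],
    "euGenes": ["RefSeq", "LocusLink", "UniGene", "OMIM", "dbSNP", "GDB", "GeneCards"],
    "GeneLynx": ["CGAP", "GenAtlas", "HGVbase", "PubGene"],
    "RefSeq": ["CDD", "SMART", "Pfam", "dbSNP"],
    "InterPro": ["Pfam", "SMART"],
    "Pfam": ["MSD", "SCOP", "HOMSTRAD", "PfamB", "SYSTERS", "PANDIT"],
}


def reachable(start, links):
    """Breadth-first search: every resource reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = reachable("EMBL", LINKS)
missing = {"IPI"} - found
```

With these edges, every resource visited above is reachable from the EMBL entry, while IPI is not. That matches the observation that pure link-following never led there.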

-- AlanRobinson - 07 Mar 2003
