r4 - 30 Mar 2003 - 21:24:40 - MarkGreenwood

EMBOSS workflow

This example workflow is based on the multiple sequence alignment section in the EMBOSS tutorial.

Overview

The basic schema is:

    seqret-1 -> getorf -> transeq --|
                                    |
                                     --> prophet - RESULT!
                                    |
    seqret-2 -> emma -> prophecy ---|

seqret-1 is the user's starting nucleotide sequence, e.g. an mRNA.

seqret-2 should be replaced by a (t)blastn of the mRNA vs. SWISS-PROT.

Brief Description

An example input to seqret-1 is embl:xlrhodop (this identifies L07770, the Xenopus laevis rhodopsin mRNA).

The result of seqret-1 is the DNA sequence (in FASTA format), and this is input to getorf, along with two user-defined parameters that define the open reading frame: orfminsize=400 and orffind=3.

The result of getorf is passed to the activity transeq, which translates the nucleic acid sequence into the corresponding peptide sequence.

An example input to seqret-2 is swallid:ops2_* (this identifies a set of proteins in SWISS-PROT).

The result of seqret-2 is all SWISS-PROT sequences whose identifiers begin with ops2, and these are input to the activity emma (which is an EMBOSS interface to the ClustalW multiple alignment program).

The result of emma (the alignment of the sequences) is input to the activity prophecy, which creates a profile from a multiple sequence alignment. (There is an additional parameter for the profile type; G stands for Gribskov.)

The activity prophet combines two inputs: the protein sequence, from seqret-1 -> getorf -> transeq, and the profile, from seqret-2 -> emma -> prophecy. It tests the similarity of the sequence to the profile.

On reading the output, the prophet section of the EMBOSS tutorial says: "The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function."
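The data flow described in this section can be sketched as a small dependency graph. This is only an illustration of the ordering constraints between the activities, not the myGrid workflow definition itself; the dictionary layout and the function name execution_order are assumptions made for this sketch. It uses the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities whose output it consumes.
# The activity names come from the schema above; the structure is illustrative.
WORKFLOW = {
    "seqret-1": set(),
    "getorf": {"seqret-1"},
    "transeq": {"getorf"},
    "seqret-2": set(),
    "emma": {"seqret-2"},
    "prophecy": {"emma"},
    "prophet": {"transeq", "prophecy"},
}


def execution_order(workflow):
    """Return one valid order in which the activities could be run."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```

The two branches (seqret-1 -> getorf -> transeq and seqret-2 -> emma -> prophecy) have no dependencies on each other, so an enactment engine would be free to run them in parallel; only prophet has to wait for both.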

Longer Description

The EMBOSS tutorial workflow demonstrates a bioinformatics workflow for the case where you suspect you have discovered a previously unpublished protein domain (i.e. a pattern of amino acids in a sequence) that is conserved across a group of proteins. (Personally, I'd be more likely to use hmmer than profit & prophecy.) N.B. If you want to discover whether there are any previously published protein domains in a sequence, then I would submit the sequence to InterProScan first.

         "A DNA sequence" -> getorf -> transeq--|
                                                |
                                                 --> prophet - RESULT!
                                                |
 "Proteins from a family" -> emma -> prophecy --|

Our biologist is building up a collection of protein sequences from different species that all have a similar function, e.g. they catalyse the same reaction. Our biologist hypothesises that their common function may mean that they are from the same protein family & all contain a similar pattern of amino acid sequences that catalyse this function.

By using a multiple sequence alignment, they are able to identify amino acids that are conserved together across the different proteins. A multiple sequence alignment algorithm recognises not only identical amino acids that are conserved, but also different ones that share similar physicochemical properties, and thus are substitutable without affecting the correct functioning of the protein.

As a next step, this multiple sequence alignment may be summarised as a model. In the example here, the complete multiple sequence alignment is summarised as a profile matrix that essentially records the probability of any amino acid occurring at each position in the sequence. Thus positions where particular amino acids are conserved will have high probabilities for those amino acids, and ~0% for all the other amino acids. Other bioinformatics tools may use Hidden Markov models or support vector machines to model the alignment of the protein sequences.
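As a rough illustration of the idea of a profile matrix, the sketch below computes per-column residue frequencies from an alignment. It is a deliberately simplified stand-in: a Gribskov profile as built by prophecy also takes a substitution matrix and gap penalties into account, which is not reproduced here. The function name column_frequencies and the toy alignment are hypothetical and are not real opsin data.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return, for each alignment column, a dict of residue -> relative frequency.

    Gap characters ('-') are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Hypothetical toy alignment of four short sequences.
toy_alignment = ["MKLV", "MKIV", "MRLV", "MK-V"]
profile = column_frequencies(toy_alignment)
for position, column in enumerate(profile, start=1):
    print(position, column)
```

In this toy case the first and last columns are fully conserved (a single residue with frequency 1.0), while the middle columns record the observed mix of residues.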

Once the model has been built, it can be used to test new protein sequences to see if they also belong to the family. If they're a very good match, then these could be added to the other protein sequences & used to build a new & hopefully better model.

In the workflow that is shown, the biologist takes a collection of protein sequences that they have discovered previously & have decided belong to the same family. The application 'emma' is used to perform a multiple sequence alignment. The output of 'emma' is passed to 'prophecy' which generates a profile matrix that models the amino acid sequence composition.

During another experiment, our biologist isolates a DNA sequence from a new organism (e.g. by PCR cloning) which they suspect may encode a member of the protein family. To be able to test this sequence against the model, they first need to translate the DNA into a protein sequence. This involves a sequence of operations in which they first identify long "Open Reading Frames" (ORFs) in the DNA using "getorf", and then translate these from DNA sequence into protein sequence using the universal genetic code with the "transeq" tool.
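The two steps just described (find ORFs, then translate them) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reimplementation of getorf or transeq: it scans only the three forward reading frames, treats an ORF as the nucleotides from an ATG up to (but not including) the next in-frame stop codon, and uses the standard genetic code. The function names find_orfs and translate and the toy sequence are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def find_orfs(dna, min_size):
    """Return forward-strand ORFs (ATG up to, not including, the stop codon)
    that are at least min_size nucleotides long."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if i - start >= min_size:
                    orfs.append(dna[start:i])
                start = None
    return orfs


def translate(orf):
    """Translate a nucleotide sequence codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3))


# Hypothetical toy sequence; a real run would use a much larger min_size.
toy_dna = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(toy_dna, min_size=9)
peptides = [translate(orf) for orf in orfs]
print(orfs, peptides)
```

Keeping ORF finding separate from translation mirrors the workflow, where getorf passes nucleotide ORFs on to transeq rather than translating them itself.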

Having identified the probable protein sequence from the DNA, our biologist uses the "prophet" tool to compare the new protein sequence with the profile matrix model generated previously by "prophecy".

The final output is a score for how well the new sequence fits the model, & an alignment of the new protein sequence with the consensus sequence of the multiple alignment. If the match is statistically significant, then our biologist will add the new sequence to their list of protein sequences & re-build the model using 'emma' & 'prophecy'.
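To give a feel for what "a score for how well the new sequence fits" can mean, here is a much simplified sketch that slides a frequency profile along a sequence and sums log-odds scores without gaps. It is not how prophet works internally (prophet performs a gapped alignment against the profile); the uniform background of 0.05, the pseudocount, the function names, and the toy profile are all assumptions made for illustration.

```python
import math


def score_window(profile, window, background=0.05, pseudo=0.01):
    """Sum of log-odds scores of `window` against an ungapped frequency profile.

    `background` assumes all 20 amino acids are equally likely; `pseudo`
    avoids taking the log of zero for residues never seen in a column.
    """
    return sum(
        math.log((column.get(res, 0.0) + pseudo) / background)
        for column, res in zip(profile, window)
    )


def best_match(profile, sequence):
    """Slide the profile along `sequence`; return (best_score, offset)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical four-column profile and query sequence.
toy_profile = [{"M": 1.0}, {"K": 0.75, "R": 0.25}, {"L": 0.67, "I": 0.33}, {"V": 1.0}]
score, offset = best_match(toy_profile, "GGMKLVGG")
print(f"best score {score:.2f} at offset {offset}")
```

Whether such a score counts as significant would have to be judged against scores obtained for unrelated sequences; that step is not shown here.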

Provenance

A raw XML provenance record for this workflow is attached below (emboss_soaplab2prov.xml). A version that incorporates a style sheet is at http://twiki.mygrid.info/twiki/pub/Mygrid/WorkflowResources/emboss_soaplab2provSTY2.xml. For more details on this see WorkflowResources.

EmbossProvenanceExample provides some explanatory text for parts of this provenance example.

EmbossProvenanceExampleResults provides some provenance records from different runs of this workflow.


This workflow brings together multiple sequence analysis and profile analysis. It was first suggested as a candidate workflow by Alan Robinson at the EBI (http://industry.ebi.ac.uk).

-- MarkGreenwood - 17 Jan 2003

Topic attachments

    Attachment               Size     Date                 Who            Comment
    emboss_soaplab2prov.xml  209.5 K  17 Jan 2003 - 16:00  MarkGreenwood  Provenance record from Example workflow