Message 1
Dear Dr Laughton
I'm an RA in the School of Computer Science at Nottingham, working on a project called myGrid (www.mygrid.org.uk). myGrid's main research aim is to make high-end, remote bioinformatics resources easily accessible from scientists' desktops, and to make it easier for scientists to automate their use of those resources. One example is service discovery: a scientist selects a piece of information they wish to process, and is then given a list of resources that could process that particular type of file. myGrid is focussing particularly on using web services to make remote resources available over a network, and aims to start working with grid services soon.
My role on the project is to talk to people who might be potential users of myGrid software, and to find out what facilities would be useful to them. I met up with Jonathan Hirst from Biological and Biomolecular Chemistry yesterday, and he suggested that you might be a useful person to talk to.
So ... I was wondering if either you or someone working in your group might have some time to talk to me and give me some more ideas about what facilities should go into myGrid. Some initial software development has already been done, and I've written a very brief (two-page) introduction to the research topics covered so far. If you are interested, I can email this to you so you can find out a bit more.
Stefan Rennick Egglestone
Mixed Reality Laboratory
School of Computer Science
sre@cs.nott.ac.uk
***************************
Message 2
Hi Stefan,
Sorry for taking so long to get back to you - yes, it would be very good to get together for a talk. As you may know, I am involved in another GRID project - BioSimGRID - and it would be interesting to see how/if they might interact.
Would you be free one day next week? Monday 12th is particularly good for me. And if you could send me a copy of the document you have written that would be good, too.
Cheers,
Charlie
************************
Message 3
Hi Stefan,
I'm afraid I can't do the afternoon of the 15th, but the following week is pretty clear, except the Wednesday.
Cheers,
Charlie
************************
Message 4
Hi Stefan,
OK, how about 11am on Monday? Do you want to come over to my office? I'm in Room A10 on the ground floor of the new Centre for Biomolecular Sciences Building, by the footbridge over to the QMC.
Regards,
Charlie
***************************
Message 5
I've taken a look at the scripts that you gave me when I came to see you, and I think that software built in myGrid may be able to do some useful stuff. I'm first going to see if I can replicate the process that your scripts do, and I was wondering if you could help me clear up my understanding of some of the stages.
As far as I can see, the steps involved are:
1. extract SWISS-PROT accession numbers from allergens.txt and place them in allergens.lis
2. look up SWISS-PROT records for each accession number, and place them in sprot.lis
The next step is one I don't understand
3. take file sprot.lis and file dir.cla.scop.txt (downloaded from scop.mrc-lmb.cam.ac.uk/scop/parse/index.html), and run script sprotprocess.sh on them, producing allergens.scop
Examples of entries in allergens.scop seem to be lines such as
C76821; HSSP 1LID b.60.1.2
P50635; NO STRUCTURE
I was wondering if you could tell me what these mean.
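To show how I am reading the files, here is a minimal Python sketch of step 1 and of splitting an allergens.scop line. It is illustrative only and is not taken from your scripts: the function names are my own, and the accession pattern (one letter, a digit, three alphanumerics, a digit) is an assumption based on the examples above. The second function deliberately assigns no meaning to the text after the semicolon.

```python
import re

# ASSUMPTION: an accession number looks like one letter, one digit,
# three alphanumerics and a final digit (e.g. P50635). This pattern is
# inferred from the example lines, not from the original scripts.
ACCESSION = re.compile(r"\b[A-Z][0-9][A-Z0-9]{3}[0-9]\b")


def extract_accessions(text):
    """Return accession-like tokens in order of first appearance, no duplicates."""
    seen = set()
    ordered = []
    for match in ACCESSION.finditer(text):
        accession = match.group(0)
        if accession not in seen:
            seen.add(accession)
            ordered.append(accession)
    return ordered


def parse_scop_line(line):
    """Split an allergens.scop line at the first ';' into (accession, rest).

    The rest of the line is returned untouched; it is not interpreted.
    """
    accession, _, rest = line.partition(";")
    return accession.strip(), rest.strip()
```

With this, `parse_scop_line("P50635; NO STRUCTURE")` gives the accession and the remaining text as two separate strings, which is as far as I can get without knowing what the remaining text means.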
Cheers
Stefan
****************************
Message 6
A couple of months ago, I came to see you to find out about any bioinformatics work that you were involved in, and you gave me an example of a script that processed file allergen.txt from
http://www.expasy.org/cgi-bin/lists?allergen.txt.
I've talked to a few other biologists, and identified a fairly common problem: people want to keep track of new sequences being added to the databases they are interested in, and then to look up information about those sequences automatically.
I've started working with a Perl script called BSU (Blast Search Updater). BSU takes its input from a directory of protein sequences. Each time it is run, it BLASTs each sequence in the input directory against a protein database at the NCBI, and generates a file of any matching sequences that have been added to the database since the last time the script was run.
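The "new since the last run" part is the bit that matters for what follows, so here is a rough Python sketch of that idea. It is not how BSU itself is written (BSU is Perl, and I am not reproducing its code); the function names and the JSON state file are my own assumptions, and the BLAST search is left out entirely, so the hit identifiers are simply passed in as a list.

```python
import json
from pathlib import Path


def new_hits(current_hits, seen):
    """Return (fresh, updated_seen).

    fresh is the sorted list of hit identifiers not present in seen;
    updated_seen is a new set containing everything seen so far.
    The seen set passed in is left unmodified.
    """
    current = set(current_hits)
    fresh = sorted(current - seen)
    return fresh, seen | current


def load_seen(state_file):
    """Load identifiers recorded by earlier runs (empty set on the first run)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_seen(seen, state_file):
    """Record all identifiers seen so far, ready for the next run."""
    Path(state_file).write_text(json.dumps(sorted(seen)))
```

Each run would call `load_seen`, pass the latest hits to `new_hits`, report the fresh ones, and then call `save_seen` with the updated set.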
I've just got the script up and working, and the plan is to interface it with an application we've been developing called Taverna. Taverna allows users to specify a series of tasks and to execute them automatically (e.g. trying to find any literature references for a given gene sequence). So, hopefully, when it is all working, a program will run once a day, find new sequences which have been added to databases, and then go and look up information about them.
As example input data, I was going to use some of the proteins from allergens.txt, but I haven't yet worked out a way of generating input files directly from it. This would involve generating, for each protein in allergens.txt, a file in FASTA format that contains the protein's sequence and is named after the protein. So I was going to create the files by hand. Given there are about 200 proteins in allergens.txt, I can't do them all, so I thought I'd do about 20, perhaps from the same species. It seems sensible to choose the proteins most likely to generate new sequences (I guess this means proteins from the species on which the most research is taking place), so I was wondering if you could suggest which proteins might be sensible to use.
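For what it's worth, the file-writing half of this could be sketched roughly as below in Python. It assumes the protein names and sequences have already been obtained somehow (that is the part I haven't worked out), and the function names, the 60-character line width and the ".fasta" extension are all my own choices rather than anything BSU requires.

```python
import re
from pathlib import Path


def safe_name(name):
    """Turn a protein name into something usable as a file name."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")


def to_fasta(name, sequence, width=60):
    """Format one sequence as FASTA: a '>' header line, then wrapped sequence lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def write_fasta_files(proteins, out_dir):
    """Write one FASTA file per protein; proteins maps name -> sequence.

    Returns the list of paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, sequence in proteins.items():
        path = out / f"{safe_name(name)}.fasta"
        path.write_text(to_fasta(name, sequence))
        paths.append(path)
    return paths
```

Calling `write_fasta_files({"Bet v 1": "ACDEFG"}, "bsu_input")` would write a single file, bsu_input/Bet_v_1.fasta; the sequence here is a made-up placeholder, not real data.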
Thanks for your help
Stefan