Minutes of Access Grid meeting, 28 Feb 2003
Attendees
EBI: Peter Rice, Alan Robinson
Manchester: Mark Greenwood, Phil Lord, Nick Sharman, Chris Wroe
Newcastle: Peter Li, Savas Parastatidis, Paul Watson
Nottingham: Kevin Glover, Chris Greenhalgh, Milena Radenkovic
Southampton: Simon Miles, Juri Papay, Victor Tan
Agenda
The purpose of the meeting will be to review the proposed repository schema
against the interactions matrix, to check that the former captures all the
information needed to support the latter (and especially the interactions
used in the scenario).
Discussion
Chris G first asked whether we were assessing the overall myGrid
information model, or just the candidate MIR schema. There was
general agreement that, as Paul suggested, we needed an overall
information model and that mapping parts of it to elements in
particular services (such as the MIR) was a follow-on activity.
We then decided to proceed by walking through the
GravesDiseaseScenario, identifying those service interactions that
were needed to support it, identifying the entities involved and
checking those against the information model.
The overall scenario starts with Affymetrix microarray studies, to
identify genes associated with Graves' disease. This initial
activity is outside the scope of the demonstrator for All Hands 2003,
and we take the resulting set of candidate genes as our starting
point. These are inputs to the three remaining legs of the scenario,
which we considered in turn.
Annotation pipeline
The first leg of the scenario is a straight workflow invocation.
Thus the first interaction is GW -> IR: retrieve workflow definition.
In the information model, WorkflowDefinition isA ActionDefinition isA
DataThing isA Thing. We would want to find the workflow by semantic
description, so there needs to be some link between ActionDefinition
and ServiceDescription or, more likely, ServiceOperation.
The next interaction is GW -> EE: execute workflow. The enactment
engine is itself a Service. Simon asked whether we needed a new
entity type, FactoryService isA Service, assuming that each executing
workflow was a new service instance created by the enactment engine.
In fact, this is not currently so: the enactment engine allocates a
new ID for each running workflow, and the client passes the ID
explicitly to the enactment engine itself when checking the status of
a workflow. However, the notion of FactoryService will be useful in an
OGSI setting, and the enactment engine might later use that model.
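The current arrangement (one engine service, one ID per running workflow) can be sketched roughly as follows. This is an illustration only: the class name, method names, ID format and status strings are all assumptions, not the enactment engine's actual interface.

```python
import itertools


class EnactmentEngine:
    """Hypothetical stand-in: a single service that tracks many workflow runs."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._status = {}

    def execute(self, workflow_definition):
        # Allocate a fresh ID for this run and hand it back to the client.
        workflow_id = f"wf-{next(self._ids)}"
        self._status[workflow_id] = "RUNNING"
        return workflow_id

    def status(self, workflow_id):
        # The client passes the ID explicitly; there is no per-run service.
        return self._status[workflow_id]

    def mark_complete(self, workflow_id):
        # Stand-in for the engine finishing a run.
        self._status[workflow_id] = "COMPLETE"
```

Under a FactoryService model, execute would instead return a handle to a new per-workflow service instance, and status would be asked of that instance rather than of the engine.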
We then discussed the possibility of status notifications from the
enactment engine to the gateway. Should the enactment engine itself
generate these notifications, or should they be generated by specific
services embedded in the workflow? In an OGSI setting, a workflow
instance might well make its status visible via its service data, in
which case it could be an OGSI notification source.
- Open Issue: are workflow status notifications embedded into the enactment engine, or planted in the WSFL document? If planted, by whom -- the user, or some undefined software?
Chris G noted that, for some things, notifications may be "out there":
producers publish them to a NS topic, whether or not there are any
subscribers, and consumers subscribe to a topic, not any specific
producer(s). This needs some shared understanding of the set of
available topics, and also some retention QoS for messages sent to a
topic. The WP-2 team propose a notion of "bookmarks", whereby all
messages between two consecutive bookmarks are assumed to be related,
and so new subscribers should receive at least the messages since the
last bookmark.
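One possible reading of the bookmark idea, as a minimal sketch: a topic retains only the messages published since its last bookmark and replays them to each new subscriber. The Topic class and its methods are hypothetical and do not describe the WP-2 notification service; the replay rule is an assumption about what the bookmark proposal intends.

```python
class Topic:
    """Hypothetical topic that retains messages since the last bookmark."""

    def __init__(self, name):
        self.name = name
        self._since_bookmark = []   # retained messages
        self._subscribers = []      # delivery callbacks

    def publish(self, message):
        # Producers publish whether or not anyone is subscribed.
        self._since_bookmark.append(message)
        for deliver in self._subscribers:
            deliver(message)

    def bookmark(self):
        # Messages before a bookmark are assumed unrelated to what follows,
        # so they need not be retained for later subscribers.
        self._since_bookmark = []

    def subscribe(self, deliver):
        # Replay the retained messages, then deliver new ones as they arrive.
        for message in self._since_bookmark:
            deliver(message)
        self._subscribers.append(deliver)
```

A consumer subscribing late to a workflow-status topic would then see every message of the current run, but nothing from earlier runs.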
It was also noted that we may need similar retention QoS for the MIR
(the only currently supported QoS is "forever, unless explicitly
deleted").
Back to the GW -> EE: execute workflow. When the workflow ends,
the Gateway collects the output and provenance document from the
enactment engine, and the next interaction is GW -> IR: store for
each of the output (a DataThing) and provenance log (a
WFProvenanceLog isA ProvenanceLog isA Report isA DataThing).
We will need to associate the output DataThing with semantic and
syntactic type information, taken from WSDL, WSFL and/or
ActionDefinition.
[At some point around here, the Gateway will also need to discover the
workflow's input (a DataThing), create an ActionPerformed (isA
ProxyThing isA Thing) and an Input, and associate these and the
Service, WFProvenanceLog, ActionDefinition and input DataThing.
Question: Do we need an Output type to associate the output
DataThing to the ActionPerformed?]
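The isA chains and associations mentioned so far can be written down as a rough class sketch. The entity names come from the discussion; the fields, the use of Python dataclasses, the free-standing Service class and the outputs list (which corresponds to the open Output question) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Thing:
    name: str


@dataclass
class DataThing(Thing):
    pass


@dataclass
class ActionDefinition(DataThing):
    pass


@dataclass
class WorkflowDefinition(ActionDefinition):
    pass


@dataclass
class Report(DataThing):
    pass


@dataclass
class ProvenanceLog(Report):
    pass


@dataclass
class WFProvenanceLog(ProvenanceLog):
    pass


@dataclass
class ProxyThing(Thing):
    pass


@dataclass
class Service:
    # Assumption: Service's place in the hierarchy was not discussed.
    name: str


@dataclass
class Input:
    # Associates an input DataThing with an ActionPerformed.
    data: DataThing


@dataclass
class ActionPerformed(ProxyThing):
    definition: ActionDefinition
    service: Service
    log: WFProvenanceLog
    inputs: List[Input] = field(default_factory=list)
    # Assumption: a plain list stands in for the undecided Output type.
    outputs: List[DataThing] = field(default_factory=list)
```

For example, one run of the annotation pipeline would be an ActionPerformed that points at its WorkflowDefinition, the enactment engine Service and the WFProvenanceLog, with the candidate genes attached through an Input.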
We discussed whether we needed to parse the provenance log and convert
it to stored subsidiary ActionPerformed objects to represent the full
derivation path. The consensus was to expand on receipt (by default,
possibly subject to a user-selectable property). If this proves
expensive in terms of storage, then we will explore on-demand,
on-the-fly expansion into in-memory objects when browsing the
dependency graphs.
One way of reducing the storage overhead might be to not extract
intermediate results from the provenance, but just create
placeholders to represent them. [a ProxyThing? or AbstractDataThing
where DataThing isA AbstractDataThing isA Thing? The placeholder
could contain an XPath expression that burrows into the provenance
log]
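A minimal sketch of such a placeholder, assuming the provenance log is XML: the placeholder stores only a path expression and fetches the intermediate result from the log on demand. The log layout, element names and the Placeholder class are invented for illustration; the real provenance format was not specified here. Python's xml.etree.ElementTree is used, which supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical provenance log; the real format is not defined in these minutes.
PROVENANCE_LOG = """
<provenance>
  <step name="blast">
    <output>intermediate hits</output>
  </step>
  <step name="annotate">
    <output>final annotation</output>
  </step>
</provenance>
"""


class Placeholder:
    """Stands in for an intermediate result that stays inside the log."""

    def __init__(self, log_xml, xpath):
        self.log_xml = log_xml
        self.xpath = xpath

    def resolve(self):
        # Parse the log only when the value is actually wanted.
        root = ET.fromstring(self.log_xml.strip())
        node = root.find(self.xpath)
        return None if node is None else node.text
```

Here Placeholder(PROVENANCE_LOG, "./step[@name='blast']/output").resolve() reads the intermediate value out of the log, so the repository never stores a second copy of it.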
The above discussion applies to all workflow executions. We then
discussed the particular bioinformatics services to be
invoked from the workflow. The only issues raised were on dbSNP and
ENSEMBL.
- dbSNP: there is no known web service, but it can be accessed via SRS or ENSEMBL. Alan and Peter R would need to know how dbSNP is to be used before they could help further. See below for SRS details.
- ENSEMBL: the ENSEMBL implementers are not willing to implement an ENSEMBL web service themselves. We have two choices: either to treat ENSEMBL as an SQL database (and provide an OGSA-DAI interface) - not recommended; the schema is complex and subject to change - or to implement a web service wrapper round some tailored uses of the ENSEMBL (Java) API. The latter approach was seen as the better of the two, but again the EBI team would need details of the requirements in this case.
- SRS: Alan had implemented a quick-and-dirty web service wrapper for SRS for the pre-prototype, but could not recommend it for the demonstrator. Lion are not expected to produce a product anytime soon, but Thure Etzold may be visiting Manchester [no news as yet] and, if so, we should raise the issue. Otherwise, Peter R reckoned a web service wrapper should be straightforward.
- AJR (11/3/03): An OGSA-fied SRS? (or OGSA-DAI?)
Action: Peter Li: provide Alan & Peter R with detailed requirements
for using dbSNP and ENSEMBL.
The final step in this leg of the scenario is to display the results
to the user, perhaps letting the user select relevant fields in the
output file. [This generalizes to viewing any (Data)Thing in the
repository.]
This raised two issues:
- Data Typing: we need bridging services for (syntactic) type translation - custom? generic? 'standard' XML formats? use of XPath/XQuery?
- Presentation: specifying viewers for different formats
Peter R noted that for gene/protein sequences, EMBOSS makes
transformation trivial, though we do need to understand the formats
of outputs (to see whether they are usable as inputs for other
services).
Genotype assay design system
Alan remarked that this activity is a very interactive process in
the lab. While it could be done automatically, that hasn't been the
approach the EBI has taken in its work with Newcastle. However, from
the previous minutes (AccessGrid21Feb2003):
"... we have three choices here:
- a highly interactive process with the biologist, perhaps using Talisman or some other application. The opportunities that this affords are:
- user interaction with a workflow (halting and resuming a workflow)
- a workflow notifying the user proxy
- launching a third party tool from the workflow in the lab book, and notifying the workflow when the tool is exited
- collecting provenance data as free text notes
- we use an autogenerating primer and just run through the workflow, perhaps picking up user preferences.
- the scenario is thought of as 2 separate workflows with an application in the middle. The lab book would host the launching of the primer application. On its close, a set of possible workflows that could follow could be suggested.
The opinion for June and then Dec was that we should do them in the
order of 2, 3, 1."
We therefore concentrated on option 2. For user preferences, Peter R
suggested four options:
- the defaults
- specified beforehand
- specified interactively
- based on those in a previous execution
From the perspective of the e-Science layer, Chris G suggested three
options:
- If the defaults are stored in the WSFL, they could be edited (to produce a new workflow) for the next run, though there would then be no connection between the two enactments.
- The user agent could prompt for preferences either every time, or could extract a previous enactment's options from the provenance log
- Or we could always use the defaults.
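The four sources of preferences could be combined along the following lines. This is a sketch under stated assumptions: the precedence order (interactive over beforehand over previous run over defaults), the function name and the example parameter names are all hypothetical, and none of this was agreed at the meeting.

```python
from collections import ChainMap


def resolve_preferences(defaults, previous_run=None, beforehand=None,
                        interactive=None):
    """Merge preference sources; earlier layers override later ones."""
    # Assumed precedence: interactive > beforehand > previous run > defaults.
    layers = [layer for layer in (interactive, beforehand, previous_run)
              if layer]
    return dict(ChainMap(*layers, defaults))


# Hypothetical primer-design parameters, for illustration only.
defaults = {"primer_length": 20, "max_tm": 60}
prefs = resolve_preferences(
    defaults,
    previous_run={"max_tm": 62},        # e.g. taken from a provenance log
    interactive={"primer_length": 22},  # e.g. entered via the user agent
)
```

With no other source supplied, the result is simply the defaults, which corresponds to the "always use the defaults" option.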
When gathering a workflow's open parameters, we would like some means of
validating supplied values. Peter R remarked that static checking
was not really enough: for example, if one parameter is a sequence
and another is a position within it, then the second must be less
than the sequence's length. He suggested
[[http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html][EMBOSS ACD]]
as a formalism that can express such restrictions. ACD is
used elsewhere in bioinformatics beyond EMBOSS.
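The kind of restriction Peter R described, where the valid range of one parameter depends on the value of another, can be illustrated in a few lines of Python. This is not ACD syntax and does not use EMBOSS; the function, the constraint list and the zero-based position convention are assumptions chosen only to show why a static, per-parameter check is not enough.

```python
def validate_parameters(params, constraints):
    """Return the messages of all constraints that the parameters violate."""
    return [message for message, check in constraints if not check(params)]


# Each constraint sees the whole parameter set, so it can relate one
# parameter to another. Positions are assumed to be zero-based.
CONSTRAINTS = [
    ("position must not be negative",
     lambda p: p["position"] >= 0),
    ("position must be less than the sequence length",
     lambda p: p["position"] < len(p["sequence"])),
]
```

For example, validate_parameters({"sequence": "ACGT", "position": 4}, CONSTRAINTS) reports the second constraint, although 4 is a perfectly valid integer when checked on its own.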
Question: Does there need to be a Preferences object in the
model?
3D protein structure and SNP visualization
This leg attempts to show the structure of the protein encoded by a
candidate gene, and highlight the amino acid change caused by the
SNP. Alan and Peter R identified a number of issues with this leg as
described at GravesDiseaseScenario.
First, there may be more than one appropriate PDB structure for the
protein (or its homologues, if nothing is known for the protein), and
choosing the appropriate structure(s) is a job for the scientist.
The demonstrator could display the alternatives to the user, but in
real life, she would need to refer to the literature and talk to
colleagues to make a suitable choice.
Second, the 3D display may be of limited value, since the relevant
changes may well be beneath the protein's surface.
We agreed that, as it stands, the third leg is primarily for
demonstration purposes rather than contributing to the Graves' disease
research aspect. A suggested implementation was to search across
SWISS-PROT to discover PDB cross-references, BLAST these against MSD
for homologues [and then search PASTA for active sites].
- AJR (11/3/03): If you have a cross-reference to PDB in SWISS-PROT, you don't need to do a BLAST.
- AJR (11/3/03): How are any BLAST results going to be interpreted?
- AJR (11/3/03): In some instances, using the Pfam/SCOP cross-references to PDB may be better, as they will focus on the structurally important folds.
Finally, Alan raised the question of database update notification
from (e.g. and especially) MSD.
- Can we do this for June?
- How does a user submit a notification profile to a database? Before starting on DQP, Nedim produced a document that covered this: we should revisit that when the current DQP activity eases.
- AJR (11/3/03): Things a user may want to be notified about:
- A structure I'm interested in is updated: Search on PDB identifier.
- A new structure is deposited/updated that includes my keyword(s): Text search.
- A new structure is deposited/updated that is similar to my sequence: BLAST search.
- It will need considerable interaction with MSD.
- AJR (11/3/03): Has anyone outside of EBI from myGrid contacted Kim?
Action: Nick: clarify objectives for update notification from MSD.
Next Meeting
The next meeting in this series will be on Friday 14 Mar 2003 at 0900
over Access Grid as usual. Homework is to digest
AlansLabBookStoryBoard with the object of walking through this and
matching it with the services and interactions identified so far.
--
NickSharman - 07 Mar 2003