r4 - 12 Mar 2003 - 12:22:57 - NickSharman

Minutes of Access Grid meeting, 28 Feb 2003

Attendees

EBI: Peter Rice, Alan Robinson

Manchester: Mark Greenwood, Phil Lord, Nick Sharman, Chris Wroe

Newcastle: Peter Li, Savas Parastatidis, Paul Watson

Nottingham: Kevin Glover, Chris Greenhalgh, Milena Radenkovic

Southampton: Simon Miles, Juri Papay, Victor Tan

Agenda

The purpose of the meeting will be to review the proposed repository schema against the interactions matrix, to check that the former captures all the information needed to support the latter (and especially the interactions used in the scenario).

Discussion

Chris G first asked whether we were assessing the overall myGrid information model, or just the candidate MIR schema. There was general agreement that, as Paul suggested, we needed an overall information model and that mapping parts of it to elements in particular services (such as the MIR) was a follow-on activity.

We then decided to proceed by walking through the GravesDiseaseScenario, identifying those service interactions that were needed to support it, identifying the entities involved and checking those against the information model.

The overall scenario starts with Affymetrix microarray studies, to identify genes associated with Graves' disease. This initial activity is outside the scope of the demonstrator for All Hands 2003, and we take the resulting set of candidate genes as our starting point. These are inputs to the three remaining legs of the scenario, which we considered in turn.

Annotation pipeline

The first leg of the scenario is a straight workflow invocation. Thus the first interaction is GW -> IR: retrieve workflow definition. In the information model, WorkflowDefinition isA ActionDefinition isA DataThing isA Thing. We would want to find the workflow by semantic description, so there needs to be some link between ActionDefinition and ServiceDescription or more likely ServiceOperation.

The next interaction is GW -> EE: execute workflow. The enactment engine is itself a Service. Simon asked whether we needed a new entity type, FactoryService isA Service, assuming that each executing workflow was a new service instance created by the enactment engine. In fact, this is not currently so: the enactment engine allocates a new ID for each running workflow, and the client passes that ID explicitly to the enactment engine itself when checking the status of a workflow. However, the notion of FactoryService will be useful in an OGSI setting, and the enactment engine might later use that model.
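As an illustration of the current (non-factory) arrangement, here is a minimal Python sketch. The class name, method names, ID format and status strings are all hypothetical and are not taken from the actual enactment engine; the only point is that one engine hands out an ID per run and the client quotes that ID back.

```python
import itertools


class EnactmentEngine:
    """Hypothetical stand-in for the enactment engine (EE)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._status = {}  # workflow ID -> status string (assumed values)

    def execute(self, workflow_definition):
        """Start a run and return the ID allocated to it."""
        workflow_id = f"wf-{next(self._ids)}"
        self._status[workflow_id] = "RUNNING"
        return workflow_id

    def status(self, workflow_id):
        """The client passes the ID explicitly; there is no per-run service."""
        return self._status[workflow_id]

    def finish(self, workflow_id):
        """Mark a run as complete (the engine would do this internally)."""
        self._status[workflow_id] = "COMPLETED"


engine = EnactmentEngine()
first = engine.execute("annotation-pipeline")
second = engine.execute("annotation-pipeline")
engine.finish(first)
```

Under the FactoryService model, execute would instead return a handle to a new service instance, and the status query would be addressed to that instance rather than to the engine.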

We then discussed the possibility of status notifications from the enactment engine to the gateway. Should the enactment engine itself generate these notifications, or should they be generated by specific services embedded in the workflow? In an OGSI setting, a workflow instance might well make its status visible via its service data, in which case it could be an OGSI notification source.

  • Open Issue: are workflow status notifications embedded into the enactment engine, or planted in the WSFL document? If planted, by whom -- the user, or some undefined software?

Chris G noted that, for some things, notifications may be "out there": producers publish them to a NS topic, whether or not there are any subscribers, and consumers subscribe to a topic, not any specific producer(s). This needs some shared understanding of the set of available topics, and also some retention QoS for messages sent to a topic. The WP-2 team propose a notion of "bookmarks", whereby all messages between two consecutive bookmarks are assumed to be related, and so new subscribers should receive at least the messages from the last bookmark.
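One possible reading of the bookmark idea, sketched in Python. This is not the WP-2 design: the class, its methods and the in-memory log are assumptions made purely for illustration. Producers publish to a topic regardless of subscribers, and a late subscriber is replayed everything since the most recent bookmark.

```python
from collections import defaultdict

BOOKMARK = object()  # sentinel separating groups of related messages


class NotificationService:
    """Hypothetical topic-based notification service with bookmarks."""

    def __init__(self):
        self._log = defaultdict(list)          # topic -> retained entries
        self._subscribers = defaultdict(list)  # topic -> delivery callbacks

    def publish(self, topic, message):
        """Retain the message and deliver it to any current subscribers."""
        self._log[topic].append(message)
        for deliver in self._subscribers[topic]:
            deliver(message)

    def bookmark(self, topic):
        """Record a boundary; later messages form a new related group."""
        self._log[topic].append(BOOKMARK)

    def subscribe(self, topic, deliver):
        """Replay messages since the last bookmark, then deliver live ones."""
        log = self._log[topic]
        start = 0
        for index, entry in enumerate(log):
            if entry is BOOKMARK:
                start = index + 1
        for message in log[start:]:
            deliver(message)
        self._subscribers[topic].append(deliver)


ns = NotificationService()
ns.publish("workflow-status", "run 1 started")
ns.publish("workflow-status", "run 1 finished")
ns.bookmark("workflow-status")
ns.publish("workflow-status", "run 2 started")

received = []
ns.subscribe("workflow-status", received.append)  # late subscriber
ns.publish("workflow-status", "run 2 finished")
```

In this sketch the retention policy is simply "keep everything"; a real retention QoS could discard entries older than the last bookmark, which is the minimum a new subscriber is owed.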

It was also noted that we may need similar retention QoS for the MIR (the only currently supported QoS is "forever, unless explicitly deleted").

Back to the GW -> EE: execute workflow. When the workflow ends, the Gateway collects the output and provenance document from the enactment engine, and the next interaction is GW -> IR: store for each of the output (a DataThing) and provenance log (a WFProvenanceLog isA ProvenanceLog isA Report isA DataThing).

We will need to associate the output DataThing with semantic and syntactic type information, taken from WSDL, WSFL and/or ActionDefinition.

[At some point around here, the Gateway will also need to discover the workflow's input (a DataThing), create an ActionPerformed (isA ProxyThing isA Thing) and an Input, and associate these with the Service, WFProvenanceLog, ActionDefinition and input DataThing. Question: Do we need an Output type to associate the output DataThing with the ActionPerformed?]
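The isA chains mentioned so far can be written down as a class hierarchy. The Python below is only a sketch of the relationships recorded in these minutes: the field names are invented, Service is left outside the Thing hierarchy because the minutes do not place it, and no Output type is included since that question is still open.

```python
from dataclasses import dataclass


class Thing:
    """Root of the information model."""


class DataThing(Thing):
    """Any stored data item."""


class ProxyThing(Thing):
    """Stands in for something that is not itself stored data."""


class ActionDefinition(DataThing):
    """Definition of something that can be performed."""


class WorkflowDefinition(ActionDefinition):
    """A workflow definition retrieved from the IR."""


class Report(DataThing):
    """A report about an action."""


class ProvenanceLog(Report):
    """A record of how a result was derived."""


class WFProvenanceLog(ProvenanceLog):
    """Provenance log produced by a workflow enactment."""


class Service:
    """A service such as the enactment engine (position in model assumed)."""


@dataclass
class ActionPerformed(ProxyThing):
    """One execution, linked to its definition, service and log."""
    definition: ActionDefinition
    service: Service
    provenance: WFProvenanceLog


@dataclass
class Input:
    """Associates an input DataThing with the action that consumed it."""
    action: ActionPerformed
    data: DataThing


run = ActionPerformed(WorkflowDefinition(), Service(), WFProvenanceLog())
run_input = Input(action=run, data=DataThing())
```

If an Output type is adopted, it would presumably mirror Input, associating the output DataThing with the same ActionPerformed.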

We discussed whether we needed to parse the provenance log and convert it to stored subsidiary ActionPerformed objects to represent the full derivation path. The consensus was to expand on receipt (by default, possibly subject to a user-selectable property). If this proves expensive in terms of storage, then we will explore on-demand, on-the-fly expansion into in-memory objects when browsing the dependency graphs.

One way of reducing the storage overhead might be to not extract intermediate results from the provenance, but just create placeholders to represent them. [a ProxyThing? or AbstractDataThing where DataThing isA AbstractDataThing isA Thing? The placeholder could contain an XPath expression that burrows into the provenance log]
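A minimal Python sketch of such a placeholder follows. The provenance XML layout, element names and class name are invented for illustration (the real provenance log format is not described here), and the standard library's ElementTree supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Made-up provenance document; the real log format is an assumption.
PROVENANCE_XML = """
<provenance>
  <step name="blast">
    <output name="hits">P12345 P67890</output>
  </step>
  <step name="render">
    <output name="report">final.html</output>
  </step>
</provenance>
"""


@dataclass
class IntermediateResultPlaceholder:
    """Stores a path into the provenance log instead of a copy of the data."""
    provenance_xml: str
    xpath: str

    def resolve(self):
        """Extract the intermediate result on demand, or None if absent."""
        root = ET.fromstring(self.provenance_xml)
        node = root.find(self.xpath)
        return None if node is None else node.text


placeholder = IntermediateResultPlaceholder(
    PROVENANCE_XML, "./step[@name='blast']/output[@name='hits']"
)
value = placeholder.resolve()
```

The placeholder costs only the length of the expression; the trade-off is that every access re-parses the log, so some caching might be wanted if dependency graphs are browsed often.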

The above discussion applies to all workflow executions. We then discussed the particular bioinformatics services to be invoked from the workflow. The only issues raised were on dbSNP and ENSEMBL.

  • dbSNP: there is no known web service, but it can be accessed via SRS or ENSEMBL. Alan and Peter R would need to know how dbSNP is to be used before they could help further. See below for SRS details.

  • ENSEMBL: the ENSEMBL implementers are not willing to implement an ENSEMBL web service themselves. We have two choices: either to treat ENSEMBL as an SQL database (and provide an OGSA-DAI interface) - not recommended; the schema is complex and subject to change - or to implement a web service wrapper round some tailored uses of the ENSEMBL (Java) API. The latter approach was seen as the better of the two, but again the EBI team would need details of the requirements in this case.

  • SRS: Alan had implemented a quick-and-dirty web service wrapper for SRS for the pre-prototype, but could not recommend it for the demonstrator. Lion are not expected to produce a product anytime soon, but Thure Etzold may be visiting Manchester [no news as yet] and, if so, we should raise the issue. Otherwise, Peter R reckoned a web service wrapper should be straightforward.
    • AJR (11/3/03): An OGSA-fied SRS? (or OGSA-DAI?)

Action: Peter Li: provide Alan & Peter R with detailed requirements for using dbSNP and ENSEMBL.

The final step in this leg of the scenario is to display the results to the user, perhaps letting the user select relevant fields in the output file. [This generalizes to viewing any (Data)Thing in the repository.]

This raised two issues:

  1. Data Typing: we need bridging services for (syntactic) type translation - custom? generic? 'standard' XML formats? use of XPath/XQuery?

  2. Presentation: specifying viewers for different formats

Peter R noted that for gene/protein sequences, EMBOSS makes transformation trivial, though we do need to understand the formats of outputs (to see whether they are usable as inputs for other services).

Genotype assay design system

Alan remarked that this activity is a very interactive process in the lab. While it could be done automatically, that hasn't been the approach the EBI has taken in its work with Newcastle. However, from the previous minutes (AccessGrid21Feb2003):

"... we have three choices here:

  1. a highly interactive process with the biologist, perhaps using Talisman or some other application. The opportunities that this affords are:
    1. user interaction with a workflow (halting and resuming a workflow)
    2. a workflow notifying the user proxy
    3. launching a third party tool from the workflow in the lab book, and notifying the workflow when the tool is exited
    4. collecting provenance data as free text notes
  2. we use an autogenerating primer and just run through the workflow, perhaps picking up user preferences.
  3. the scenario is thought of as 2 separate workflows with an application in the middle. The lab book would host the launching of the primer application. On its close, a set of possible workflows that could follow could be suggested.

The opinion for June and then Dec was that we should do them in the order of 2, 3, 1."

We therefore concentrated on option 2. For user preferences, Peter R suggested four options:

  1. the defaults
  2. specified beforehand
  3. specified interactively
  4. based on those in a previous execution

From the perspective of the e-Science layer, Chris G suggested three options:

  1. If the defaults are stored in the WSFL, they could be edited (to produce a new workflow) for the next run, though there would then be no connection between the two enactments.
  2. The user agent could prompt for preferences either every time, or could extract a previous enactment's options from the provenance log
  3. Or we could always use the defaults.

When gathering a workflow's open parameters, we would like some means of validating supplied values. Peter R remarked that static checking was not really enough: for example, if one parameter is a sequence and another is a position within it, then the second must be less than the sequence's length. He suggested [[http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Acd/syntax.html][EMBOSS ACD]] as a formalism that can express such restrictions. ACD is used elsewhere in bioinformatics beyond EMBOSS.
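To make the sequence/position example concrete, here is a small Python sketch of a check that depends on two parameters at once. It is not ACD syntax and not an EMBOSS API; the function name and the assumption of 0-based positions are ours.

```python
def validate_parameters(sequence, position):
    """Return a list of error messages; an empty list means valid.

    Assumes positions are 0-based, so valid values are 0 .. len(sequence) - 1.
    """
    errors = []
    if not sequence:
        errors.append("sequence must not be empty")
    if not isinstance(position, int) or position < 0:
        errors.append("position must be a non-negative integer")
    elif position >= len(sequence):
        # This bound cannot be checked statically: it depends on the
        # actual sequence supplied at run time.
        errors.append(
            f"position {position} must be less than "
            f"the sequence length {len(sequence)}"
        )
    return errors


ok = validate_parameters("ACGTACGT", 3)
too_far = validate_parameters("ACGT", 4)
```

The type checks on each parameter alone are the "static" part; the bound on position is the cross-parameter restriction that a formalism such as ACD would have to express.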

Question: Does there need to be a Preferences object in the model?

3D protein structure and SNP visualization

This leg attempts to show the structure of the protein encoded by a candidate gene, and highlight the amino acid change caused by the SNP. Alan and Peter R identified a number of issues with this leg as described at GravesDiseaseScenario.

First, there may be more than one appropriate PDB structure for the protein (or its homologues, if nothing is known for the protein), and choosing the appropriate structure(s) is a job for the scientist. The demonstrator could display the alternatives to the user, but in real life, she would need to refer to the literature and talk to colleagues to make a suitable choice.

Second, the 3D display may be of limited value, since the relevant changes may well be beneath the protein's surface.

We agreed that, as it stands, the third leg is primarily for demonstration purposes rather than contributing to the Graves disease research aspect. A suggested implementation was to search across SWISS-PROT to discover PDB cross-references, BLAST these against MSD for homologues [and then search PASTA for active sites].

  • AJR (11/3/03): If you have a cross-reference to PDB in SWISS-PROT, you don't need to do a BLAST.
  • AJR (11/3/03): How are any BLAST results going to be interpreted?
  • AJR (11/3/03): In some instances, using the Pfam/SCOP cross-references to PDB may be better, as they will focus on the structurally important folds.

Finally, Alan raised the question of database update notification from (e.g. and especially) MSD.

  • Can we do this for June?
  • How does a user submit a notification profile to a database? Before starting on DQP, Nedim produced a document that covered this: we should revisit that when the current DQP activity eases.
    • AJR (11/3/03): Things a user may want to be notified about:
      • A structure I'm interested in is updated: Search on PDB identifier.
      • A new structure is deposited/updated that includes my keyword(s): Text search.
      • A new structure is deposited/updated that is similar to my sequence: BLAST search.
  • It will need considerable interaction with MSD.
    • AJR (11/3/03): Has anyone outside of EBI from myGrid contacted Kim?
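As a starting point for discussion, the first two notification cases listed above (PDB identifier and keyword) could be captured in a profile like the Python sketch below. The class, its fields and the example identifiers are hypothetical and say nothing about how MSD would actually accept or evaluate a profile; the BLAST-similarity case is omitted because it needs a real search service.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationProfile:
    """Hypothetical user profile for database update notifications."""
    pdb_ids: set = field(default_factory=set)   # structures being watched
    keywords: set = field(default_factory=set)  # text-search terms

    def matches(self, entry_id, description):
        """True if a deposited or updated entry should trigger a notice."""
        watched = {pdb_id.upper() for pdb_id in self.pdb_ids}
        if entry_id.upper() in watched:
            return True
        text = description.lower()
        return any(keyword.lower() in text for keyword in self.keywords)


# Identifiers and descriptions below are made up for the example.
profile = NotificationProfile(pdb_ids={"1abc"}, keywords={"thyrotropin"})
by_id = profile.matches("1ABC", "unrelated description")
by_keyword = profile.matches("2XYZ", "Thyrotropin receptor fragment")
no_match = profile.matches("2XYZ", "lysozyme")
```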

Action: Nick: clarify objectives for update notification from MSD.

Next Meeting

The next meeting in this series will be on Friday 14 Mar 2003 at 0900 over Access Grid as usual. Homework is to digest AlansLabBookStoryBoard with the object of walking through this and matching it with the services and interactions identified so far.

-- NickSharman - 07 Mar 2003
