r3 - 11 Mar 2003 - 15:34:09 - ChrisGreenhalgh

A Proposal for a myGrid LabBook Scenario for IF-3 & beyond.

See also AlansLabBookStoryBoardWP6 for some additional WP6 comments inlined. -- ChrisGreenhalgh - 11 Mar 2003

Priors:

- The domain specific part of a demonstration is independent of the general myGrid lab-book requirements, i.e. the same functionality would be needed for both the "Graves disease" & the "EMBOSS tutorial" scenarios.

- What is metadata vs. what is annotation? I take annotation to be a type of metadata that is an interpretation of the data to which it is added, e.g. a gene prediction on a genomic sequence. I take metadata to be more simple, e.g. it's a date or a URL or a formal description of the syntactic type of an input parameter of a service.

- Provenance: Given the above, provenance can be seen as both metadata (Who? When? Where?) & annotation (Why?).

- All required service instances are available as Web services & stored in a registry.

- The following is subject to substantial change according to others' perception of how workflow is described, found, composed & configured in myGrid: I distinguish workflow template (which documents services) from workflow (which documents service instances) from configured workflow (which documents service instances & their parameters) from WSFL (a specific language encoding of a configured workflow). For example a workflow template says "use emma", a workflow says "use emma at the EBI", a configured workflow says "use emma at the EBI with X=10", WSFL is the XML encoding of "use emma at the EBI with the BSA API & set X=10".
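
A minimal Python sketch of the first three levels distinguished above, assuming they nest as described. The class and field names are illustrative only and are not part of myGrid; the WSFL encoding step is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class WorkflowTemplate:
    """Documents services only, e.g. 'emma'."""
    services: Tuple[str, ...]


@dataclass(frozen=True)
class Workflow:
    """Documents which service instance is used for each service."""
    template: WorkflowTemplate
    instances: Dict[str, str]  # service name -> provider of the instance


@dataclass(frozen=True)
class ConfiguredWorkflow:
    """Documents service instances together with their parameters."""
    workflow: Workflow
    parameters: Dict[str, Dict[str, object]]  # service -> {parameter: value}


def describe(obj):
    """Return one 'use ...' sentence per service, at the object's level of detail."""
    if isinstance(obj, WorkflowTemplate):
        return [f"use {s}" for s in obj.services]
    if isinstance(obj, Workflow):
        return [f"use {s} at the {obj.instances[s]}" for s in obj.template.services]
    if isinstance(obj, ConfiguredWorkflow):
        lines = []
        for base, s in zip(describe(obj.workflow), obj.workflow.template.services):
            params = obj.parameters.get(s, {})
            settings = ", ".join(f"{k}={v}" for k, v in sorted(params.items()))
            lines.append(f"{base} with {settings}" if settings else base)
        return lines
    raise TypeError(f"cannot describe {type(obj).__name__}")


template = WorkflowTemplate(services=("emma",))
workflow = Workflow(template, instances={"emma": "EBI"})
configured = ConfiguredWorkflow(workflow, parameters={"emma": {"X": 10}})
```

Each level wraps the previous one rather than copying it, so a configured workflow can always be traced back to the workflow and the template it came from.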

The story board:

1. I log onto the myGrid system to give me access to my mIR & other services.

2. I check the status of any workflows that I started previously to see if they are now finished & I can check the result. I may also have notifications detailing people or services that have annotated the contents of my mIR, e.g. my supervisor has signed off on the work I did last week with her digital signature. N.B. These annotations need not be stored in my personal mIR; however, I must have given the annotator permission to look at my mIR. An annotator may also have created their own annotations on the contents of my mIR, but decided to keep these private.

3. I browse through my projects, experiments, workflows.

[I imagine this browser to be analogous to those of an MP3 player where one can slice the data in different ways & choose to organise in different ways by artist, genre, album title, etc.]

4. I create a new project (myProject1) with a title & a free-text description (provenance annotation). I assume that myGrid will record provenance metadata including date, time, user name, user group, default security permissions, the host from which I created the project. The provenance is associated with myProject1 & stored in the mIR.

5. I upload some data (myData1) that I've obtained to my mIR & that will be part of my new project (myProject1) (automatically trap basic provenance metadata of this operation & the data).

6. I attach some metadata to this uploaded data, e.g. a name, a description, a type (e.g. unordered collection of protein sequences), where did it come from, how was it produced. Perhaps also some other annotation, e.g. what do I think of its quality.

[At another time, I may want to know "what projects did I use myData1 in?" & "What things are associated with myProject1?".]
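
Steps 4 to 6 separate provenance that is trapped automatically from metadata & annotation that I supply myself. A rough Python sketch of that split, assuming a simple in-memory store; `LabBook`, `capture_provenance` and the field names are hypothetical stand-ins, not the mIR interface.

```python
import getpass
import socket
from datetime import datetime, timezone


def capture_provenance(user=None, host=None, when=None):
    """Collect the basic who/where/when facts without asking the user."""
    return {
        "user": user or getpass.getuser(),
        "host": host or socket.gethostname(),
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
    }


class LabBook:
    """Hypothetical stand-in for the mIR: named entries plus what is known about them."""

    def __init__(self):
        self.entries = {}

    def store(self, name, content, provenance, **metadata):
        self.entries[name] = {
            "content": content,
            "provenance": provenance,    # trapped automatically (steps 4 & 5)
            "metadata": dict(metadata),  # supplied by the user (step 6)
            "annotations": [],           # interpretations, e.g. perceived quality
        }

    def annotate(self, name, text):
        self.entries[name]["annotations"].append(text)
```

Passing `user`, `host` and `when` explicitly is only there to make the sketch easy to exercise; in normal use they would be left to default.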

7. I have an idea of what I want to do with the data as part of an in silico experiment, so I create a new experiment (myExpt1) that is associated with my project (myProject1) with a title, free text description, plus the automatically trapped basic provenance metadata.

8. Somehow I find a workflow template (WFTemplate1) that satisfies my requirements & describes the services that are to be used in this in silico experiment.

[I assume that this workflow template doesn't record service instances. I assume that each workflow template comprises a number of services that run without breakpoints, but that have parameters that need to be pre-configured. Within a workflow template, parameters for a service may have recommended values.]

9. I want to associate this workflow template (WFTemplate1) with my experiment (myExpt1), plus some annotation about why I've chosen this workflow template. I trust myGrid to have recorded in the mIR where WFTemplate1 came from, plus provenance metadata about time, date, permissions, etc.

[At another time I may want to find both "what experiments did I use WFTemplate1 in?" & "In myExpt1, what workflow templates did I use?".]
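
Both questions are lookups over the same association, asked from opposite ends. One way to support them, sketched in Python under the assumption that an association is a plain (container, item) pair; the `Associations` class is illustrative only.

```python
from collections import defaultdict


class Associations:
    """Keeps every link indexed in both directions."""

    def __init__(self):
        self._items = defaultdict(set)       # container -> items linked to it
        self._containers = defaultdict(set)  # item -> containers it is linked to

    def link(self, container, item):
        self._items[container].add(item)
        self._containers[item].add(container)

    def items_in(self, container):
        """E.g. 'In myExpt1, what workflow templates did I use?'"""
        return sorted(self._items.get(container, ()))

    def containers_of(self, item):
        """E.g. 'What experiments did I use WFTemplate1 in?'"""
        return sorted(self._containers.get(item, ()))


links = Associations()
links.link("myExpt1", "WFTemplate1")
links.link("myExpt2", "WFTemplate2")
links.link("myProject1", "myData1")
```

The same structure would answer the earlier questions about myData1 & myProject1, since a project containing data is just another kind of link.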

10. I need a registry service that can return specific instances of the services that the workflow template requires. The registry will return metadata about all service instances fitting the criteria. As well as criteria that the workflow template may require, e.g. a particular version of an application, I may have my own personal preferences, e.g. it has to be free.

[Is my interpretation of finding & configuring a workflow correct and/or reasonable? Before running a workflow, I personally want to know where it is running. Perhaps others have better faith? Perhaps only service instances that are trusted by my supervisor are present in the service repository & that is the basis of my trust? Will the workflow enactment engine be able to use all the service instances in the registry, i.e. will they have a suitable API?]

11. I may need to choose between alternative service instances (e.g. emma@EBI vs. emma@HGMP) to be used in the workflow, using the metadata returned by the service registry.

[Do I get to choose between different service instances? How much choice do I have? For example, I presume that the workflow enactment engine may only communicate with services using specific APIs. At what point is the check made that a service instance supports an API that the workflow enactment engine can use?]
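
To make steps 10 & 11 concrete, here is a small Python sketch of a registry lookup that applies the template's criteria, my own preferences & an API check in one pass. The metadata fields (`version`, `free_to`, `api`) and their values are invented for illustration; the real registry schema is not defined here.

```python
def find_instances(registry, service, required=None, preferred=None, engine_apis=None):
    """Return metadata for every instance of `service` that fits all criteria."""
    criteria = {**(required or {}), **(preferred or {})}
    matches = []
    for instance in registry:
        if instance["service"] != service:
            continue
        # Only offer instances the enactment engine could actually call.
        if engine_apis is not None and instance["api"] not in engine_apis:
            continue
        if all(instance.get(key) == value for key, value in criteria.items()):
            matches.append(instance)
    return matches


# Illustrative entries only; the field values are assumptions.
registry = [
    {"service": "emma", "provider": "EBI", "version": "2", "free_to": "anyone", "api": "BSA"},
    {"service": "emma", "provider": "HGMP", "version": "2", "free_to": "registered users", "api": "BSA"},
    {"service": "emma", "provider": "elsewhere", "version": "2", "free_to": "anyone", "api": "other"},
]

choices = find_instances(
    registry,
    "emma",
    required={"version": "2"},       # demanded by the workflow template
    preferred={"free_to": "anyone"}, # my personal preference
    engine_apis={"BSA"},             # what the enactment engine understands
)
```

Doing the API check at lookup time is one possible answer to the question above; it could equally be made later, when the workflow is handed to the enactment engine.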

12. I want to record why I made a choice in favour of particular service instances for this workflow, e.g. emma@EBI is free to anyone, but emma@HGMP is free to registered users only.

[To what do I attach the annotation about my choices of service instance? Is a part of the workflow annotatable? Or only the whole thing? I.e. is a service instance recognised in the mIR as a separate entity to which metadata may be attached? I have a vision of a large XML file with multiple name spaces: how conflated are data & metadata in an XML representation of the workflow?]

13. I configure each activity of the workflow with my choice of parameters or default to those recommended in the workflow for the service instance, as well as my starting data. I want to annotate why I made those choices on the parameters.

[Some type checking by myGrid should help prevent me doing incorrect things. Do I need to have a service instance before I can do the parameter configuration?]
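
The kind of type checking meant here could look like the following Python sketch, assuming each parameter is declared with a type & a recommended value. The `configure` function and the declaration format are hypothetical.

```python
def configure(declared, chosen):
    """Merge my choices with the recommended values, rejecting bad input.

    declared: {parameter name: (expected type, recommended value)}
    chosen:   {parameter name: value I picked}
    """
    unknown = sorted(set(chosen) - set(declared))
    if unknown:
        raise KeyError(f"unknown parameter(s): {', '.join(unknown)}")
    configured = {}
    for name, (expected_type, recommended) in declared.items():
        value = chosen.get(name, recommended)  # default to the recommendation
        if not isinstance(value, expected_type):
            raise TypeError(
                f"{name} expects {expected_type.__name__}, got {type(value).__name__}"
            )
        configured[name] = value
    return configured


declared = {"X": (int, 5), "matrix": (str, "blosum")}
```

Because the check needs only the declarations, this sketch does not require a service instance to have been chosen first; whether that holds in myGrid depends on where the declarations live.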

14. I want to store the details of my configured workflow (WFConfigured1) for this experiment (myExpt1) in the mIR.

15. I decide it's time to run my workflow.

[Do I have a choice of which workflow enactment service I use? The configured workflow is converted to a workflow enactment language suitable for my workflow enactment engine. How tightly coupled is the service registry, my choice of service instances & the workflow enactment engine? Could I end up choosing service instances with APIs that the enactment engine doesn't understand? Are only service instances with APIs that the enactment engine can understand present in the service registry?]

16. As each activity in the workflow is completed, I'd like to be notified & to have the intermediate results stored in the mIR & associated with WFConfigured1 so that I can find them easily. I expect that provenance metadata about the service instances which are run is stored along with the results: location, input parameters I specified, default parameters the service instance used, including resources that the service instance used, e.g. which version of SWISS-PROT did the BLAST server use. For each result stored in the mIR, as well as the usual provenance metadata (who, date, time), I expect that its metadata includes a syntactic & semantic type for the data (taken from where?).

[Who writes the results & provenance metadata to my mIR? - The enactment engine? The service instance? The Gateway? Who do I trust with my credentials? I might trust my lab's workflow enactment engine to write to my mIR, but I probably wouldn't trust a service instance or a public enactment engine. Is it inefficient to have everything shuttled through the Gateway?

I would expect that the results generated from a service instance & stored to the mIR should be immutable to protect against fraud - maybe use PKI to monitor if the stored results are the same as those sent originally?]
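
As a simplified illustration of the tamper check, the Python sketch below records a SHA-256 digest when a result is stored & compares it later. This is a plain hash comparison, not PKI: a real scheme would have the service instance sign the digest so that it cannot be silently replaced along with the data. The function names are assumptions.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Fingerprint of a result as sent by the service instance."""
    return hashlib.sha256(data).hexdigest()


def seal(result: bytes) -> dict:
    """What gets written to the store: the data plus its original digest."""
    return {"data": result, "sha256": digest(result)}


def is_untampered(stored: dict, original_digest: str) -> bool:
    """Recompute the digest of the stored data and compare it to the original."""
    return hmac.compare_digest(digest(stored["data"]), original_digest)


record = seal(b"alignment output")
sent_digest = record["sha256"]  # kept by, or obtained again from, the sender
```

The check is only as trustworthy as the place where the original digest is kept, which is why the question of who writes to the mIR matters.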

17. For each activity, I want to look at the results & record thoughts about them (i.e. annotate them). If I don't like the intermediate results & the workflow is still running, I may want to terminate it.

18. I log out of myGrid.

19. Later I return & look at my results…

20. After logging in, I am notified that my in silico experiment has finished & the results have been stored, including the final result (as R1), along with metadata & provenance, in myExpt1 of myProject1.

21. I find & select that I want to look at the final result (R1) & a viewer displays it for me.

[I expect that myGrid & the mIR will store the type of my data as part of the metadata, e.g. a MIME type. I may have personalised myGrid so that it uses my preferred viewer. If it's a data type for which I do not have a viewer already specified, then I'd expect that myGrid would help me find & choose an appropriate one, cf. Netscape Plug-Ins.]

22. I decide to change some of the parameters for a service instance in the workflow (WFTemplate1) from their recommended values to produce a new workflow (WFConfigured1.1) & see how this changes the final result (R1.1). I'll need to record why I've done this & have all the parameter values captured & stored in the mIR.

[I assume that this new run of the workflow is still part of myExpt1.]

23. I decide that the original result using the recommended parameters is the best & I add two annotations to the final result (R1): one is my conclusions (myNote1), the other is ideas for the next experiment (myNote2).

24. I decide the results are so good, that I'm going to share them with my supervisor & the colleague from whom I got the original data & I send them the location of my result (R1) so that they can view it.

25. Since she's my boss, my supervisor has permissions to see everything I've done. From the result (R1), she can follow the trail back to see how it was generated, using which services, from which workflow template (WFTemplate1), with which parameters (WFConfigured1) & using which starting data (myData1). At each stage, she may also see the annotation that I may have attached to the objects describing why I made particular choices, e.g. which workflow template, which service instances & which parameters. She may make some comments of her own, and/or sign off on the work. Before signing off on the work, she may want to check that I haven't altered R1 to falsify the results. She could do this either by re-running the workflow (WFConfigured1) herself with my data (myData1), or, if we had a PKI, by checking with a service instance that the result hasn't been tampered with.

[Having a system to detect possible fraud would be interesting, but I'm not sure about the plausibility.]

26. I'm a little more paranoid about my colleague. Although I'd like her to see my final results (R1), conclusions (myNote1) & how I got there (WFTemplate1 & WFConfigured1), I don't want her to see the annotations about what I want to do next (myNote2), or possibly comments I made about my perceptions of the quality of her data attached to myData1.

[I need to be able to grant & revoke privileges to other people on selected items in my mIR for reading & writing.]
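
A minimal Python sketch of per-item grant & revoke, assuming rights are simple strings such as "read" & "write" held per (user, item) pair. The `Permissions` class is illustrative; it says nothing about how myGrid would authenticate users.

```python
class Permissions:
    """Per-item access rights; the owner can always do everything."""

    def __init__(self, owner):
        self.owner = owner
        self._grants = {}  # (user, item) -> set of rights

    def grant(self, user, item, *rights):
        self._grants.setdefault((user, item), set()).update(rights)

    def revoke(self, user, item, *rights):
        # Revoking a right that was never granted is a harmless no-op.
        self._grants.get((user, item), set()).difference_update(rights)

    def allows(self, user, item, right):
        if user == self.owner:
            return True
        return right in self._grants.get((user, item), set())


acl = Permissions(owner="me")
for item in ("R1", "myNote1", "WFTemplate1", "WFConfigured1"):
    acl.grant("colleague", item, "read")
# myNote2 is never granted, so it stays hidden from the colleague.
```

Denying by default keeps step 26 simple: anything I have not explicitly shared remains private.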

27. Following comments from my supervisor & colleague, I decide to retrieve my original conclusions (myNote1) & re-write parts of it (myNote1.1).

[I & my supervisor need to be sure that the original copy of myNote1 is still available & not overwritten or deleted by myNote1.1.]
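
One way to guarantee this is an append-only store in which saving never overwrites. The Python sketch below numbers revisions 1, 2, ...; how those map onto names such as myNote1.1 is left open, and the `VersionedStore` class is hypothetical.

```python
class VersionedStore:
    """Append-only: a save adds a revision and never replaces an old one."""

    def __init__(self):
        self._revisions = {}  # name -> list of texts, oldest first

    def save(self, name, text):
        revisions = self._revisions.setdefault(name, [])
        revisions.append(text)
        return len(revisions)  # revision number of what was just saved

    def get(self, name, revision=None):
        """Fetch a given revision (1-based), or the latest if none is given."""
        revisions = self._revisions[name]
        if revision is None:
            return revisions[-1]
        if not 1 <= revision <= len(revisions):
            raise IndexError(f"{name} has no revision {revision}")
        return revisions[revision - 1]

    def history(self, name):
        return list(self._revisions.get(name, []))  # a copy, so callers cannot edit it


notes = VersionedStore()
first = notes.save("myNote1", "Original conclusions.")
second = notes.save("myNote1", "Conclusions, revised after comments.")
```

There is intentionally no delete or update method, so the original text stays retrievable for as long as the store exists.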

28. I run two further in silico experiments in this project: myExpt2 & myExpt3. myExpt2 is a different workflow template (WFTemplate2) which has the same function as WFTemplate1, but uses a different methodology - I decide the result (R2) of this experiment isn't as good as the first one, which I document & have stored in the mIR with R2. The other in silico experiment, myExpt3, takes the final result (R1) of the workflow (WFConfigured1) in the first in silico experiment (myExpt1), plus some new data (myData2) & runs it through a new workflow (WFTemplate3 & WFConfigured3) to produce a final result (R3).

29. Some time later, I am in the process of writing up the paper about these experiments. I find my final result (R3). I need to know how & why that was generated all the way back to myData1, i.e. trace back through two workflows.

[The mIR needs to be able to capture the relationships that R1 was part of the input data to WFConfigured3 in myExpt3 & is also the result of WFConfigured1 in myExpt1 that also used myData1 as input. So the mIR is able to store different types of entities & the relationship between them, as well as the annotation/metadata that is associated with them - Does this sound like an OODBMS?]
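
The trace in step 29 amounts to walking "derived from" relationships backwards. A small Python sketch, assuming the relationships are held as a mapping from each entity to the things it was produced from; the mapping mirrors the story board, but the representation itself is an assumption.

```python
# entity -> the entities it was directly derived from
derived_from = {
    "R1": ["WFConfigured1", "myData1"],
    "WFConfigured1": ["WFTemplate1"],
    "R3": ["WFConfigured3", "R1", "myData2"],
    "WFConfigured3": ["WFTemplate3"],
}


def trace(entity, derived_from):
    """Return everything `entity` depends on, directly or indirectly."""
    found = []
    pending = [entity]
    while pending:
        current = pending.pop()
        for parent in derived_from.get(current, []):
            if parent not in found:  # also guards against accidental cycles
                found.append(parent)
                pending.append(parent)
    return found


ancestry = trace("R3", derived_from)
```

A plain mapping is enough for this query, so the relationships need not imply an OODBMS; any store that can hold typed entities & named links between them would do.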

30. My supervisor wants to identify what people in the group have been working on & where there's commonality.

[A text analysis engine is run over the contents of everyone's mIR (or at least parts of them such as the descriptions & annotations) to identify terms & concepts. Look for people, projects & experiments where the same concepts are being used.]
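
A deliberately naive Python sketch of that idea: reduce each person's descriptions to a set of terms & report the overlaps. Real concept extraction would need far more than word matching; the stop-word list, the names & the sample descriptions are all invented for illustration.

```python
import re
from itertools import combinations

STOPWORDS = {"the", "and", "with", "for", "from", "that", "this"}


def terms(text):
    """Lower-case words of three or more letters, minus a few stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def shared_concepts(descriptions):
    """Map each pair of people to the terms their descriptions have in common."""
    overlaps = {}
    for (a, text_a), (b, text_b) in combinations(sorted(descriptions.items()), 2):
        common = terms(text_a) & terms(text_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Invented sample descriptions, one per (hypothetical) group member.
descriptions = {
    "alice": "Multiple sequence alignment of a protein family",
    "bob": "Protein domain search with a profile model",
    "carol": "Workflow enactment engine benchmarks",
}
overlaps = shared_concepts(descriptions)
```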

In summary:

I create a new project (myProject1) in which I have created three in silico experiments (myExpt1-3) where I have run workflows. In myExpt1, I used the same workflow template twice, but with different parameters. In myExpt2 I used a different workflow template to myExpt1, but that was expected to do a similar type of analysis. In myExpt3, I took the final result of myExpt1, combined it with some new data & ran it through a new workflow to produce a new final result. For example, myExpt1 & myExpt2 may represent different methodologies to build a multiple sequence alignment & model from a protein family (e.g. prophet & prophecy vs. hmmer & hmmalign). Then myExpt3 is taking the model & a new protein sequence to determine if it contains the protein domain.

-- AlanRobinson - 22 Jan 2003
