Introduction
myGrid is a computer science project working in the field of bioinformatics. One of the main focuses of the myGrid project is the use of workflow technology to automate common bioinformatics processes. The term workflow has been used within the myGrid project to refer to graphs of external and local services. myGrid has developed a workflow workbench called Taverna which allows the construction and enactment of workflows. External services which can be used in Taverna include those exposed as SOAP web-services whose interface is defined through WSDL. An increasing number of services which are exposed through custom clients have also been integrated into Taverna. These include services such as BioMART.
Users of Taverna construct workflows by manipulating a graphical interface. They create instances of processors which will be used to invoke services when the workflow is enacted, create workflow inputs and outputs, and connect all of these up by edges which indicate where data should flow. Taverna’s internal representation of workflow graphs is called SCUFL, and consists of an object model and an XML format to serialize objects into. When a user wishes to enact a workflow, they enter input values into an input dialog which has been constructed by examining the names and types of workflow inputs as defined in the SCUFL model. Enacting the workflow with these inputs produces a set of results, which are visualized using another dialog constructed by introspection of the workflow definition.
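To make the graph structure described above concrete, here is a minimal Java sketch. It is an illustration only, and not Taverna's SCUFL object model: the class WorkflowSketch, its methods, and the node names "sequence", "alignSequence" and "report" are all assumptions invented for this example. It shows inputs, processors and outputs connected by data-flow edges, and how the fields of an input dialog could be derived by inspecting the workflow's inputs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified workflow graph; this is NOT Taverna's SCUFL object model.
class WorkflowSketch {
    final List<String> inputs = new ArrayList<>();      // workflow input names
    final List<String> processors = new ArrayList<>();  // nodes that would invoke services
    final List<String> outputs = new ArrayList<>();     // workflow output names
    final List<String[]> links = new ArrayList<>();     // data-flow edges as {source, sink}

    // Records that data flows from source to sink.
    void link(String source, String sink) {
        links.add(new String[] {source, sink});
    }

    // An input dialog can be derived by introspection: one field per workflow input.
    List<String> inputFormFields() {
        return new ArrayList<>(inputs);
    }

    // Names of the nodes that receive data directly from the given node.
    List<String> downstreamOf(String node) {
        List<String> sinks = new ArrayList<>();
        for (String[] edge : links) {
            if (edge[0].equals(node)) {
                sinks.add(edge[1]);
            }
        }
        return sinks;
    }

    // Illustrative workflow (assumed names): one input, one processor, one output.
    static WorkflowSketch example() {
        WorkflowSketch wf = new WorkflowSketch();
        wf.inputs.add("sequence");
        wf.processors.add("alignSequence");
        wf.outputs.add("report");
        wf.link("sequence", "alignSequence");
        wf.link("alignSequence", "report");
        return wf;
    }

    public static void main(String[] args) {
        WorkflowSketch wf = example();
        System.out.println("Input form fields: " + wf.inputFormFields());
        System.out.println("Downstream of sequence: " + wf.downstreamOf("sequence"));
    }
}
```

A real model would also carry the type of each input and the service binding of each processor; both are left out here to keep the sketch short.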
Although Taverna’s interface is well-designed and usable, it is still complex and takes time to learn. For the expert bioinformatician, spending time installing and learning Taverna may well be worthwhile, as the ability to write workflows may speed up their work considerably. However, there is a large community of biologists who are less expert in computing and therefore less likely to be willing to invest the time that it takes to learn how to author and enact workflows in Taverna. Workflows written by other, more expert users may nevertheless be useful to this community.
It therefore became apparent that, by offering a simpler interface to workflow enactment and results browsing technology, myGrid could become relevant to a much wider community. This simple interface might allow users to enact a limited number of registered workflows which have previously been authored in Taverna. It could include input and output forms designed specifically for a particular workflow, which can therefore be more usable than those generated by introspection of the workflow definition. myGrid has decided to develop this interface as web-pages in a web-portal. Part of the reasoning behind this is that web-based interfaces are familiar to biologists, who already use them to invoke a number of services, and that they require users to install no software beyond a web-browser.
Portal frameworks
A number of portal frameworks exist which provide functionality to make web-development easier for developers. These portal frameworks are web-applications into which content can be deployed, and they commonly provide a login system and security features to control access to the content. Content in the portal is provided through instances of portlets. A portlet is a Java class which implements a particular interface, providing methods which can be queried by the portal framework to obtain HTML to display to the user. Often, generic implementations of this interface have been written which, for example, allow a developer to specify a JSP page to be used to generate HTML for the portlet. Commonly, portal frameworks allow a user to place a number of portlets onto one HTML page, providing layout managers to specify where on the page each portlet should be located (e.g. a user could choose to display three portlets on one page, each taking up one column and being allocated 33% of the total page width). When a portal framework web-application is passed an HTTP request for a particular page, it calls methods on each of the portlets which the user has placed in the page, obtains a fragment of HTML from each, and then combines these fragments to produce the HTML returned to the calling web-browser.
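The aggregation step described above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions: SketchPortlet and PortalPageSketch are hypothetical names invented here, and real portal frameworks define much richer portlet interfaces that receive request and response objects. The sketch asks each portlet for an HTML fragment and places the fragments in equal-width columns, as in the three-column example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal portlet contract; real frameworks define richer interfaces.
interface SketchPortlet {
    String renderFragment();
}

// Hypothetical page that combines portlet fragments into one HTML document.
class PortalPageSketch {
    private final List<SketchPortlet> portlets = new ArrayList<>();

    void add(SketchPortlet portlet) {
        portlets.add(portlet);
    }

    // Asks each portlet for its fragment and lays the fragments out in equal-width columns.
    String render() {
        // Integer division: three portlets get 33% of the page width each.
        int width = portlets.isEmpty() ? 100 : 100 / portlets.size();
        StringBuilder html = new StringBuilder("<html><body><table><tr>");
        for (SketchPortlet portlet : portlets) {
            html.append("<td width=\"").append(width).append("%\">")
                .append(portlet.renderFragment())
                .append("</td>");
        }
        return html.append("</tr></table></body></html>").toString();
    }

    public static void main(String[] args) {
        PortalPageSketch page = new PortalPageSketch();
        page.add(() -> "<p>Available workflows</p>");
        page.add(() -> "<p>Input form</p>");
        page.add(() -> "<p>Results</p>");
        System.out.println(page.render());
    }
}
```

Keeping the portlet contract to a single method means the page needs no knowledge of how a fragment is produced, which is what allows a generic portlet to delegate to a JSP.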
All portal frameworks use the concept of a portlet to represent content, but in earlier frameworks the portlet interface was not standardised. An example is the Apache Jetspeed portal framework, in which developers create a portlet by implementing the interface org.apache.jetspeed.portal.Portlet. More recently, the Java community has produced a standardization attempt called JSR-168, which is associated with a set of interfaces in package javax.portlet. JSR-168 is implemented by, amongst others, Gridsphere and U-portal, and will be implemented by Jetspeed-2, which is yet to be released.
Summary of work so far
Work so far has included development of an initial prototype portal capable of enacting workflows and browsing results, further investigation into how results visualization can be embedded into workflows, and the start of development on a second portal prototype which can be used to enact workflows and browse results.
The initial prototype portal was developed in the Apache Jetspeed portal framework. Jetspeed was chosen because it has a large and helpful community of users and developers. At the time of choosing, there was also some effort in the UK e-Science community to standardize on Chef, a portal framework building on Jetspeed, for e-Science portal provision.
The initial prototype portal application consists of a database, with tables to store workflows represented as serialized SCUFL XML strings, sets of input values to be used to enact particular workflows, and sets of results produced by these enactments. To allow a user to enact a particular workflow stored in the database, a developer must configure the portal framework with an input JSP and an output JSP, with the input JSP being used to collect input values for an enactment, and the output JSP being used to render the results of the enactment. Individual users are then allocated accounts in the portal, and can enact the workflows available to them, producing results that only they can browse. This prototype is available in myGrid CVS, under directory mygrid/workflowportal, though some maintenance may be required to enable it to work with more recent releases of Taverna.
There are a number of disadvantages to working this way. Developers who wish to make a workflow available in the portal must be familiar with the workflow, with JSP and with Jetspeed. Functionality is split between the workflow itself, which is used to produce results, and the JSP which is used to visualize them, meaning that if the workflow changes the JSP has to be updated as well. After demonstrating this prototype at the e-Science All Hands Meeting in September 2004, it was decided that some effort should be put into working out how the visualization stage could be embedded into the workflow itself, rather than being expressed via a separate JSP.
A set of workflows developed by Hannah Tipney at Manchester was chosen as a case-study for investigating methods of embedding visualization into workflows. These workflows have been developed by Hannah in co-operation with members of the myGrid team, and involve both external services and a number of custom services developed in myGrid. Hannah has written three workflows, labelled A, B and C, which have been used to investigate Williams-Beuren Syndrome (WBS), a disease with a genetic basis. They are run sequentially, with the input values of B being taken from the output values of A and the input values of C being taken from the output values of B, and they produce a large volume of results. Workflow B, in particular, can typically produce 20 MB of raw text spread over roughly 10 different outputs. The volume of results produced makes them difficult to visualize, and Hannah described herself as beginning to feel swamped by them.
Much of the detail of the work that was done to add visualizations of these workflows is described in this paper, which was submitted to WWW2005, but I will give a brief synopsis here.
Each workflow was modified to have outputs which, when saved to disk as part of a set of results produced by enacting the workflow in Taverna, would be valid web-pages. These web-pages can contain hyperlinks to results placed onto other outputs of the workflow, and were generated by extra processors added to the workflow. Some of these processors ran BeanShell scripts, and some generated cross-references to results produced by other processors. The web-pages produced by the modified workflows commonly contain index pages, with links which can be clicked through to obtain more detail. An example index page from workflow B contains a gene sequence rendered as a JPEG. Clicking on a region of the JPEG reveals information about the region of sequence represented by that area of the JPEG, which was extracted by processing steps in the workflow.
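As a rough illustration of the kind of page-generating step described above, the following Java sketch builds an index page whose hyperlinks point at files holding other workflow outputs. It is not the code used in the modified workflows: the class IndexPageSketch, its methods, and the output names and file names are assumptions made up for this example. Plain Java is used because BeanShell scripts are written in a Java-like syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an index-page generator; not the scripts used in the actual workflows.
class IndexPageSketch {

    // Builds an HTML index page with one hyperlink per entry (label -> relative file name).
    static String buildIndex(String title, Map<String, String> outputFiles) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(title)).append("</title></head><body>\n");
        html.append("<h1>").append(escape(title)).append("</h1>\n<ul>\n");
        for (Map.Entry<String, String> entry : outputFiles.entrySet()) {
            html.append("<li><a href=\"").append(escape(entry.getValue())).append("\">")
                .append(escape(entry.getKey())).append("</a></li>\n");
        }
        html.append("</ul>\n</body></html>\n");
        return html.toString();
    }

    // Escapes characters that have special meaning in HTML text and attribute values.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Assumed output names and file names, for illustration only.
        Map<String, String> outputs = new LinkedHashMap<>();
        outputs.put("First output", "output1.txt");
        outputs.put("Second output", "output2.html");
        System.out.print(buildIndex("Results index", outputs));
    }
}
```

Relative file names are used for the links so that the pages keep working when a set of results is saved to disk and moved as a whole.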