Table of Contents
Table of Contents
The Taverna Workbench allows users to construct complex analysis workflows from components located on both remote and local machines, run these workflows on their own data and visualise the results. To support this core functionality it also allows various operations on the components themselves such as discovery and description and the selection of personalised libraries of components previously discovered to be useful to a particular application.
Throughout this document various specialized terms will be used, unless otherwise stated the sense intended is as follows:
Workflow - A set of components and relations between them used to define a complex process from simple building blocks. Relations may be in the form of data links, transferring information from the output of one component to the input of another, or in the form of control links which state some conditions on the execution of a component. An example of a control link is the basic temporal ordering 'do not run component A until component B has completed'. In Taverna a workflow is realized by an instance of the workflow data model, this appears on disk or on the web as an XML file in the XScufl format.
Component - A component is a reusable building block which performs some well defined function within a process. In the bioinformatics domain we can regard any command line tool or PERL script as a component, the critical definition is that this component should be atomic in nature and cannot be split into smaller units. Components may consume information and may emit information, for example a BLAST job is a component which consumes a sequence and some search parameters (library, matrix, sensitivity etc) and which emits a report containing the sequence similarities found. Components may be located on any computational resource accessible via the internet or on the user's local workstation.
Service - All services are also Components, we use the term Service explicitly to refer to those components which are hosted on a computational resource external to the user's local workstation. Services have some underlying implementation technology such as SOAP (Simple Object Access Protocol) although this is hidden behind Taverna's abstraction layer as far as end users are concerned.
Enactor - A workflow enactor is the entity responsible for coordinating the invocation of Components within Workflows. It may be manifested as a Service itself, in which case it would consume a Workflow definition and some input data and emit the results, or, as is the case with this release of Taverna, as a software component within the workbench suite. The enactor manages the entire invocation process including progress reporting, data transfer between Components and any other housekeeping required.
Taverna, as with all new technologies, has a certain 'activation barrier' before it becomes truly useful in terms of time invested from the user's perspective. In order to lower this we present some possible reasons why workflow technologies might save time and effort in the long run:
Efficiency - Taverna saves users a great deal of time in various ways.
Analysis Design - Design of new analyses is accelerated over alternative approaches through a combination of easy visualisation of the current analysis and ready availability of new component with which to extend it. Users can start with something familiar and incorporate new functional modules with very little effort. For example, a trivial workflow might fetch a sequence and its associated terms in the Gene Ontology (GO), a user might then extend this to also fetch the GC concentration of the sequence - using traditional approaches such as scripting in PERL this would involve editing the code using some kind of text editor, possibly installing the GC analysis tool and then some testing to determine whether the correct results were being achieved, using Taverna this becomes a simple search for a GC service, a drag and drop operation to incorporate this tool into the workflow and a further operation within the graphical interface to connect the sequence fetch to the GC analysis.
Experiment Invocation - Most traditional, small-scale bioinformatics (excluding projects such as whole genome annotation) is performed via a combination of web browser based and traditional UNIX style command line tools. When some combination of tools is required the data is manually transferred between components through cut and paste in the case of web pages or ftp and similar tools. These manual stages are time consuming, prone to error and generally not 'scientifically interesting' to the user, they're technological barriers in the way of the science. In contrast, workflow systems such as Taverna handle all the mundane housekeeping such as data transport and tracking and can run their sets of components without any external intervention. This allows users to start a workflow invocation then go do something else (including potentially other workflows), even if the workflow itself takes significant time to complete the user is free to do other things during this time.
Component Management - By using components at remote sites such as the EBI users are freed from the requirement to keep the components up to date, install software and run complex hardware such as compute clusters. Effectively Taverna gives any user with a reasonably modern PC access to a large number of supercomputing resources across the planet with very little or no administrative requirements. Where the user is developing a novel algorithm or tool this allows the user to focus exclusively on the provision of that particular service rather than having to also support all the 'standard' services as well. This should in turn lead to higher quality tools, the time saving translating into more resources for the specific tool development. An example might be a novel secondary structure prediction algorithm; if the group developing and providing this service to users had to also provide all the ancillary functions such as public sequence database fetches for source data they would have a significant administrative overhead, by using workflow technology they can simply provide the single prediction service and rely on users accessing the other services from more suitable sources such as the major bioinformatics service providers.
Invocation Performance - Although modern workstations are significant computational resources in their own right there are significant numbers of algorithms which require industrial scale compute capacity. By using remote components the user can take advantage of whatever backing hardware the service provider has available. For example, InterproScan - a tool developed at the EBI which aggregates search results from a number of different functional and domain prediction algorithms will not run in any sensible time on a typical workstation but by accessing it directly from the EBI the user has access to hundreds of nodes in a compute farm, several orders of magnitude more powerful than the machine they are sitting in front of but with no power requirements or air conditioning, not to mention the cost of purchasing expensive cluster systems. The end result is that workflows can complete significantly faster than the equivalent scripts running entirely on the user's workstation, translating in turn to faster turnaround times for the underlying science.