Chapter 1. User's Guide

Table of Contents

1. Workbench Overview
1.1. Definitions
1.2. Benefits of workflow technology
2. Getting started with Taverna
2.1. Installation
2.1.1. Windows
2.1.2. Mac OS
2.1.3. Linux
2.2. Configuration
2.2.1. Artefact repository location
2.2.2. Proxy configuration
2.2.3. The file
2.2.4. Default services
2.2.5. Image types and the mysterious 'Error 4'
2.2.6. Setting an alternative location for the 'dot' tool
2.3. Running Taverna
2.3.1. Enacting a predefined workflow
    Loading a workflow
    Enacting the current workflow
    Browsing results
    Saving a workflow
    Closing a workflow
2.3.2. Creating a (very) simple workflow
    Workflow inputs and outputs
    A single sequence fetch component
    Connecting everything together
    Describing the input
    Enacting the workflow
2.4. Example workflows
2.5. Custom perspectives
3. Workbench windows in detail
3.1. Advanced Model Explorer
3.1.1. Entity table
    Workflow metadata
    Workflow inputs and outputs
    (Slight digression) - Some facts about using MIME types in Taverna
    Connecting workflow inputs to processors
    Processors (Top level node)
    Individual processor nodes
    Data link nodes
    Control link nodes
3.2. Interactive Diagram
3.2.1. Adding new processors to a workflow
3.2.2. Adding workflow inputs and outputs
3.2.3. Connecting Data and Control links
3.2.4. Nested workflows
3.2.5. Configuring processors
3.2.6. Removing Processors and Links
3.3. Workflow diagram
3.3.1. Toolbar
    Diagram Save Options
3.3.2. Diagram configuration
3.3.3. Show types
3.3.4. Expanding nested workflows
3.3.5. Hide boring processors
3.3.6. Port display section
3.3.7. Alignment control
3.3.8. Image navigation - where has the 'Fit to window' checkbox gone?
3.3.9. Processor colours
3.4. Available services
3.4.1. Service panel organisation
3.4.2. Adding instances of a service to the workflow
    Creation using drag and drop
    Creation using menu options
    Discovering the service for a given processor
    Importing a workflow from the service panel
    Searching over the service panel
    Fetching service descriptions
    Populating the services panel
3.5. Enactor launch panel
3.5.1. Enactment status panel
    Enactment status
    Processor states
    Inspecting intermediate results
    Results browser
    Process report
4. Scufl language and Workbench features
4.1. Implicit iteration
4.1.1. Implicit iteration over multiple inputs
4.2. Conditional branching
4.3. Beanshell scripting
4.3.1. Creating a new beanshell instance
4.3.2. Defining inputs and outputs
4.3.3. Configuring the script
4.3.4. Sharing and reuse of scripts
4.3.5. Depending on third party libraries
    Using dependencies
    Dependency classloaders
    JNI-based native libraries
4.4. R-scripts with the RShell processor
4.4.1. Introduction and installation
    Installing on Windows
4.4.2. Using the RServe processor
4.4.3. Connection and advanced port types
4.4.4. Graph output
4.5. Biomart query integration
4.5.1. Describing a Biomart service
4.5.2. Creating a new Biomart query
4.5.3. Configuring filters
4.5.4. Parameterised filters
4.5.5. Configuring attributes
    Selecting attributes
    Result modes
4.5.6. Second dataset filters
4.6. Soaplab configuration
4.6.1. Metadata display
4.6.2. Polling
4.7. WSDL processor
4.7.1. WSDL scavenger
4.7.2. XML Splitters
4.7.3. Optional elements in return data
4.7.4. Cyclic references
4.8. Breakpoints - debugging and steering workflow invocation
4.8.1. Breakpoints
4.8.2. Editing intermediate data
    Effect on LSIDs
4.9. Executing a workflow without the GUI
5. Taverna plugins
5.1. Semantic search of services with Feta
5.1.1. Creating service search requests
5.1.2. Results displayed and integration into the workflow
5.2. Taverna LogBook
5.2.1. LogBook Wiki
5.3. Taverna Interaction Service
5.3.1. Using the Interaction Service from Taverna
5.4. Taverna Remote Execution
5.4.1. Specifying Remote Execution Servers
5.4.2. Running a workflow remotely
5.5. myExperiment and WHIP Plugin (beta) 0.1.3
6. Additional optional tools
6.1. API Consumer
6.1.1. Prerequisites
6.1.2. Setup
6.1.3. Usage
6.1.4. Adding methods to API definition
6.1.5. API level metadata
6.1.6. Saving the API definition file
6.1.7. Using the API consumer processor from Taverna
6.1.8. A word of advice
6.2. Webservice Data Proxy
6.2.1. Description
6.2.2. Initial Configuration and Installation
6.2.3. Adding Webservices
6.2.4. Configuring Webservices
6.2.5. Referencing in action
6.2.6. Data dereferencing
6.2.7. Data housekeeping
6.2.8. Current constraints and future work
6.3. Interaction Service Server
6.3.1. Setup
6.4. Remote Execution Server
6.4.1. Configuring the database
6.4.2. Installation
6.4.3. Administration
6.4.4. Security Considerations

1. Workbench Overview

The Taverna Workbench allows users to construct complex analysis workflows from components located on both remote and local machines, run these workflows on their own data, and visualise the results. To support this core functionality it also provides various operations on the components themselves, such as discovery and description, and the assembly of personalised libraries of components previously found useful for a particular application.

1.1. Definitions

Throughout this document various specialised terms are used; unless otherwise stated, the sense intended is as follows:

  • Workflow - A set of components and the relations between them, used to define a complex process from simple building blocks. Relations may take the form of data links, which transfer information from the output of one component to the input of another, or control links, which state conditions on the execution of a component. An example of a control link is the basic temporal ordering 'do not run component A until component B has completed'. In Taverna a workflow is realised by an instance of the workflow data model; this appears on disk or on the web as an XML file in the XScufl format.

  • Component - A component is a reusable building block which performs some well-defined function within a process. In the bioinformatics domain any command line tool or Perl script can be regarded as a component; the critical property is that a component is atomic in nature and cannot be split into smaller units. Components may consume and emit information: for example, a BLAST job is a component which consumes a sequence and some search parameters (library, matrix, sensitivity etc.) and emits a report containing the sequence similarities found. Components may be located on any computational resource accessible via the internet, or on the user's local workstation.

  • Service - All Services are also Components; we use the term Service explicitly to refer to those components hosted on a computational resource external to the user's local workstation. Services have some underlying implementation technology such as SOAP (Simple Object Access Protocol), although as far as end users are concerned this is hidden behind Taverna's abstraction layer.

  • Enactor - A workflow enactor is the entity responsible for coordinating the invocation of Components within Workflows. It may be manifested as a Service itself, in which case it would consume a Workflow definition and some input data and emit the results, or, as in this release of Taverna, as a software component within the workbench suite. The enactor manages the entire invocation process, including progress reporting, data transfer between Components and any other housekeeping required.
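As the Workflow definition above notes, a workflow is serialised as an XML file in the XScufl format. The fragment below is a minimal sketch of what such a file might look like for a workflow with one input, one component and one output; the namespace follows XScufl 0.1alpha, but the processor body, names and attributes shown here are illustrative assumptions rather than a complete, valid definition:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XScufl sketch: one workflow input, one processor, one output. -->
<s:scufl xmlns:s="http://org.embl.ebi.escience/xscufl/0.1alpha" version="0.2">
  <s:workflowdescription title="Fetch a sequence" author="Example Author"/>
  <s:processor name="FetchSequence">
    <!-- The processor body depends on the component type
         (WSDL service, Soaplab service, Beanshell script, ...) -->
  </s:processor>
  <!-- Data links: workflow input -> processor port -> workflow output -->
  <s:link source="SequenceID" sink="FetchSequence:id"/>
  <s:link source="FetchSequence:sequence" sink="Sequence"/>
  <s:source name="SequenceID"/>
  <s:sink name="Sequence"/>
  <!-- A control link ('do not run A until B has completed') would be
       expressed as a separate coordination element between processors. -->
</s:scufl>
```

In practice such files are produced and consumed by the Workbench itself; users rarely need to edit the XML by hand.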

1.2. Benefits of workflow technology

Like all new technologies, Taverna presents a certain 'activation barrier': users must invest some time before it becomes truly useful. To lower this barrier, we present some reasons why workflow technology can save time and effort in the long run:

  • Efficiency - Taverna saves users a great deal of time, in the various ways described by the points below.

  • Analysis Design - Designing a new analysis is faster than with alternative approaches, thanks to a combination of easy visualisation of the current analysis and the ready availability of new components with which to extend it. Users can start with something familiar and incorporate new functional modules with very little effort. For example, a trivial workflow might fetch a sequence and its associated terms in the Gene Ontology (GO); a user might then extend this to also fetch the GC content of the sequence. Using traditional approaches such as Perl scripting, this would involve editing the code in a text editor, possibly installing the GC analysis tool, and then testing to determine whether the correct results were being produced. Using Taverna, it becomes a simple search for a GC service, a drag and drop operation to incorporate the tool into the workflow, and a further operation within the graphical interface to connect the sequence fetch to the GC analysis.

  • Experiment Invocation - Most traditional, small-scale bioinformatics (excluding projects such as whole-genome annotation) is performed via a combination of web browser based tools and traditional UNIX-style command line tools. When a combination of tools is required, data is transferred between components manually: through cut and paste in the case of web pages, or via ftp and similar tools. These manual stages are time consuming, error prone and generally not 'scientifically interesting' to the user; they are technological barriers in the way of the science. In contrast, workflow systems such as Taverna handle all the mundane housekeeping, such as data transport and tracking, and can run their sets of components without any external intervention. Users can therefore start a workflow invocation and then do something else (including running other workflows); even if the workflow takes significant time to complete, the user is free to work on other things in the meantime.

  • Component Management - By using components at remote sites such as the EBI, users are freed from the need to keep components up to date, install software, or run complex hardware such as compute clusters. Effectively, Taverna gives any user with a reasonably modern PC access to a large number of supercomputing resources across the planet, with little or no administrative overhead. Where a user is developing a novel algorithm or tool, this allows them to focus exclusively on providing that particular service rather than having to support all the 'standard' services as well. This should in turn lead to higher quality tools, the time saved translating into more resources for development of the specific tool. Consider, for example, a novel secondary structure prediction algorithm: if the group developing and providing this service also had to provide all the ancillary functions, such as public sequence database fetches for source data, they would incur a significant administrative overhead. By using workflow technology they can simply provide the single prediction service and rely on users accessing the other services from more suitable sources, such as the major bioinformatics service providers.

  • Invocation Performance - Although modern workstations are significant computational resources in their own right, many algorithms require industrial-scale compute capacity. By using remote components, the user can take advantage of whatever backing hardware the service provider has available. For example, InterProScan, a tool developed at the EBI which aggregates search results from a number of different functional and domain prediction algorithms, will not run in any sensible time on a typical workstation; accessed directly at the EBI, however, it gives the user hundreds of nodes in a compute farm, several orders of magnitude more powerful than the machine in front of them, with none of the power, air conditioning or purchase costs of an expensive cluster system. The end result is that workflows can complete significantly faster than equivalent scripts running entirely on the user's workstation, translating in turn into faster turnaround for the underlying science.