Taverna 2 has a new implementation of the workflow structure and enactment engine compared to Taverna 1, to address various issues with FreeFluo and Scufl. The Taverna 2 workbench comes with translators that are able to open most Taverna 1 workflows, but there are a few corner cases, detailed in this document, where the translated workflow would not have the same semantics as in Taverna 1.
When building workflows from scratch in Taverna 2, there are certain workflow patterns from Taverna 1 that are no longer the best choice. This document also highlights these cases and the recommended way to achieve the same result using Taverna 2.
Note that this document doesn't deal with differences in the user interface or the APIs. We're trying to build the t2 workbench so that it's recognizable for Taverna 1 users, but we're also trying to make it easier to get started with for fresh users, in addition to supporting the new features of t2. If you are interested in this you might want to sign up to the Taverna 2 beta tester programme at myExperiment to try early versions of the Taverna 2 workbench.
The APIs of t2 should be easier to work with than the Taverna 1 APIs, and we'll also come back later with a detailed developer guide. If you are interested in t2 development, you might also be interested in the upcoming Taverna Platform and Extension Developers' workshop.
New workflow format
As we have reworked the workflow definition structure, the workflow is no longer represented as XScufl. The semantics of enacting the new workflow definition will be largely the same as for Scufl for the default configuration, but the new structure allows for more flexibility and extension points that were not possible with Scufl, such as conditional branching and while-loops described later in this document.
For this reason, when saving a workflow from Taverna 2, it is stored in a new XML format that unfortunately cannot be opened in Taverna 1. However, most Taverna 1 workflows will open in Taverna 2 thanks to translators that build a Taverna 2 workflow based on a Taverna 1 definition. When looking at such a workflow in Taverna 2 you might notice a few visual differences, such as default values now being represented in the workflow as string constants.
Our goal is to be able to open and run any Taverna 1 workflow or, if this fails for any reason, to display an explanation of why a particular workflow cannot be fully translated. We're still working on improving the error reporting and adding a recovery mode where you can fix the bits of the workflow that are not Taverna 2 compatible.
This document will try to highlight the situations where the semantics of running the workflow will be different from running a Taverna 1 workflow.
Local workers are beanshell scripts
The Local workers of Taverna 1 were a way to add simple functionality without making your own Processor implementation. Taverna comes bundled with a set of local workers that mainly do shim-like activities such as selecting an item from a list, replacing a string with another, reading files, etc.
In Taverna 2 local workers are implemented as beanshell scripts. We have provided translations of all the local workers of Taverna 1, and the local workers are still identified as such in the workbench. The user now has the ability to modify a local worker in a specific workflow in case it doesn't do exactly what's needed.
Some users have extended Taverna 1 with their own Processor implementations or their own local workers. These implementations would need to be ported to Taverna 2, and a translation piece is also needed to be able to load Taverna 1 workflows using these implementations. We're developing a developer's porting guide, and have a set of Maven archetypes that can get you started quickly with adding your own activity to Taverna 2, including translators.
No default values
In Taverna 1, default values could be set on an input port that would be provided to the processor if no links were connected to it, as an alternative to making String constants. In Taverna 2, all default values are translated as String constants (unless there was a link connected to the port). We're improving the T2 workbench so that string constants can be added as quickly as default values could be set in Taverna 1.
Error documents
In Taverna 1, if a processor failed on an item during an iteration, the whole processor would fail, and nothing would be output. In Taverna 2 we have introduced error documents, so if a processor fails on a specific input while iterating, the processor will still return a list. In place of the value where the service failed, there will be an error document with a message and stack trace of what went wrong.
This means semantically that the workflow will not come to a complete halt if a single error occurs, but that you will still get partial results. In fact, the iteration will continue for the rest of the input values as well. When such a list containing error documents is received at a downstream processor that is to do iteration, it will simply skip invocation on the error documents and pass them along (wrapped) to its outputs. When inspecting the final results you can trace back through the workflow to where the error originally occurred.
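The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not Taverna code: the `ErrorDocument` class, the `iterate` function and the `reciprocal` service are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ErrorDocument:
    """Hypothetical stand-in for a t2 error document."""
    message: str
    cause: Any = None  # the upstream error document this one wraps, if any


def iterate(service: Callable[[Any], Any], inputs: List[Any]) -> List[Any]:
    """Call `service` once per item, keeping the output list the same length."""
    outputs = []
    for item in inputs:
        if isinstance(item, ErrorDocument):
            # Skip the invocation and pass the error along, wrapped.
            outputs.append(ErrorDocument("skipped: input was an error", cause=item))
            continue
        try:
            outputs.append(service(item))
        except Exception as exc:
            # A failure becomes an error document in place of the value.
            outputs.append(ErrorDocument(str(exc)))
    return outputs


def reciprocal(x: float) -> float:
    """Hypothetical service that fails on one particular input."""
    return 1 / x


first = iterate(reciprocal, [1, 0, 4])      # the middle item fails
second = iterate(lambda v: v * 10, first)   # a downstream processor
```

Here `first` and `second` both have three elements; the middle element of `second` wraps the error from `first`, so following `cause` leads back to where the failure originally occurred.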
If a processor expects a list instead of a single value, it will normally fail (i.e. return an error document) if the input list contains an error (or contains a list that contains an error). It will in the future be possible for some activities to tell t2 that they will accept error documents, and the activity can choose to ignore them if needed. For instance, a beanshell script that picks the 3 "best values" from a list of 100 candidates could just ignore any error documents when sorting, or it could have an internal threshold so that the whole thing fails if more than 50% of the items are error documents.
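As a rough sketch of what such an error-tolerant script might look like, here is the "best values" example written in Python rather than beanshell. The `ErrorDocument` class, the `pick_best` function and its parameters are assumptions for illustration; the mechanism by which an activity would declare that it accepts error documents is not shown, since it does not exist yet.

```python
class ErrorDocument:
    """Hypothetical placeholder for a t2 error document."""

    def __init__(self, message):
        self.message = message


def pick_best(candidates, count=3, max_error_fraction=0.5):
    """Return the `count` highest values, ignoring error documents.

    Raises ValueError (which would itself end up as an error document)
    when more than `max_error_fraction` of the candidates are errors.
    """
    errors = [c for c in candidates if isinstance(c, ErrorDocument)]
    if candidates and len(errors) / len(candidates) > max_error_fraction:
        raise ValueError("too many error documents in input list")
    values = [c for c in candidates if not isinstance(c, ErrorDocument)]
    return sorted(values, reverse=True)[:count]


best = pick_best([7, ErrorDocument("timeout"), 3, 9, 5])
```

With one error among five candidates the script simply sorts the four real values; with two errors among three candidates the threshold is exceeded and the call fails as a whole.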
An activity that returns lists is free to return a list that contains some real values and some error documents, which means that the activity can recognise error messages from the service and wrap them as error documents.
Pipelining
One of the main improvements of Taverna 2 is the support for pipelining. In Taverna 1, when implicit iteration was calling a service multiple times, processors expecting the output of the iterated processor would not be invoked before the full iteration was finished, even if they only needed single values of the service's outputs. The reason for this is that from the outside view, the processor takes a list as an input and produces a list of results, even if the activity within the processor deals with single values.
In Taverna 2, as soon as the first output has been returned, the processor will send it on to all downstream processors, while the iteration continues. This means that a workflow with, say, 4 steps in a chain, each taking 1 second of processing per input value, running with 10 inputs, would previously run in at least 4*10*1 = 40 seconds. In Taverna 2, step 2 can start immediately after step 1 has finished the first value, so the minimum processing time is reduced to 4*1 + (10-1)*1 = 13 seconds: 4 seconds for the first value to pass through the chain, after which the remaining 9 values follow one second apart.
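The timing argument can be checked with a small simulation. This is an idealised model, not a measurement: it assumes that each step handles one value at a time, that every call takes exactly the same time, and that there is no scheduling or network overhead. The function names are made up for this sketch.

```python
def sequential_time(steps, items, seconds_per_call=1):
    """Taverna 1 style: each step finishes its whole iteration first."""
    return steps * items * seconds_per_call


def pipelined_time(steps, items, seconds_per_call=1):
    """Taverna 2 style: a step handles an item as soon as the previous
    step has produced it and the step itself is free."""
    finished = [0] * steps  # time at which each step last became free
    for _ in range(items):
        ready = 0  # time at which the current item becomes available
        for s in range(steps):
            start = max(ready, finished[s])
            finished[s] = start + seconds_per_call
            ready = finished[s]
    return finished[-1]


print(sequential_time(4, 10))  # 40
print(pipelined_time(4, 10))   # 13
```

Under these assumptions the first value needs 4 seconds to traverse the chain and each of the remaining 9 values leaves the last step one second later, giving 13 seconds in total. Real runs would add overhead on top of that.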
Note that, semantically, this means services might be invoked in a slightly different order than in Taverna 1. For instance, if both the services are on the same machine, and the first iteration of the second service assumes for some reason that the full iteration is finished on the first service, the service might get confused. If this is the case, one would add an additional control link between service2 and service1 - forcing service2 to be on hold until the full iteration has finished on the first service. This behaviour would then be semantically the same as when running in Taverna 1.
Pipelining can also be provided by the activity itself. For instance, the BioMart activity takes a set of database query parameters, and essentially returns a list of matching database rows for the columns that have been selected in the configuration dialogue. On the wire, the BioMart service will send back the result row by row, separated by linefeeds, which means that the BioMart activity on the Taverna side can pipeline each row as it is received. This means the downstream processors can start even before the full query results have been transferred over the network. In this case the activity is using the streaming of the protocol to do pipelining in Taverna.
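The idea of turning a line-oriented response into a pipeline of rows can be illustrated with Python generators. This is a sketch of the principle only and does not use the BioMart API: `stream_rows` and `downstream` are hypothetical names, the tab-separated columns are an assumption, and an in-memory `io.StringIO` stands in for the network response.

```python
import io


def stream_rows(response):
    """Yield one row per line as soon as that line has been read."""
    for line in response:
        yield line.rstrip("\n").split("\t")


def downstream(rows):
    """Consume rows one at a time, before the stream is exhausted."""
    for row in rows:
        yield row[0].upper()


# io.StringIO stands in for a network response that arrives line by line.
fake_response = io.StringIO("brca1\t17\nbrca2\t13\n")
results = list(downstream(stream_rows(fake_response)))
```

Because both functions are generators, `downstream` receives the first row before the second line has been read from the response.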
Another kind of streaming relates to large data values themselves, such as images. This is supported indirectly in Taverna 2, as services can return references instead of values, and these references can be passed along to downstream services which can then dereference them (say by downloading an HTTP reference). In that sense data will stream directly from one service to another, given that they can understand each other's references. If this is not the case, Taverna will either do a translation of the references or, as a last resort, fall back to downloading from the first service and uploading the data to the second.
Multiple links to same input port
Taverna 1 behaviour on multiple inputs
In Taverna 1, when connecting several links to the same input port, there is a choice between Select first link (the default) and Merge all data. As an example, have a look at this mock workflow: 
The choice between the two options can be made in the right click menu for the input port in question:

Select first link
The default in Taverna 1 was Select first link - which means that whichever processor finishes first (getNucSeq or btit in this example) would be the input (passed to showalign). The other output is simply ignored.
This functionality is often used in a pattern with conditional links to make sure only one branch of processors run, using the conditional Fail_if_true/Fail_if_false processors:

In this case, whoever runs the workflow can decide if they want to use getNucSeq or btit by providing the input true or false. There's a conditional link from the web services to the conditional processors - so getNucSeq won't run until Fail_if_true completes. Fail_if_true will fail if the input is true - which means that it won't complete and hence getNucSeq will never run. Fail_if_false will in this case not fail, but complete, and btit will run, ultimately providing the input to showalign.
At first glance this seems like a nice way to get conditional branching in a dataflow oriented language. However, it comes with its challenges. For instance, what happens if you give the input I don't know? Well, neither of the purple conditionals will fail, i.e. they will both complete, and both the green services will run. However, only one of their outputs is passed on to showalign. You don't know in advance which one - it depends on which one finished first. It is also confusing to read the workflow, as there's a conditional to a "failing processor" - so you have to do the boolean logic in your head to figure out when a processor will run.
Although some of these issues could be worked around (making a Success_on_true processor instead), there are other potential issues with this pattern, for instance it doesn't handle iteration. Completion events and failures are calculated for the outermost list (outside iteration), so if you run this workflow with the input list [true, true, false, false, true] it will execute btit twice (it will fail Fail_if_true on the first iteration), and then the workflow will stop - because Fail_if_false also fails, and then neither getNucSeq nor btit has a chance to continue. A workaround for this in Taverna 1 is to put just this bit of the workflow in a nested workflow - in that case the iteration should happen outside the nested workflow, allowing Fail_if_* to fail every time.
So Select first link in Taverna 1 is in somewhat dangerous territory. Let's forget about the conditional links for now and look at the other possibility, Merge all data.
Merge all data
Merge all data will wait for both incoming data links, and then wrap them in a list. So in this case there would be a list of the inputs from both getNucSeq and btit. All well. But in what order is the data? Well, it's still the first data to arrive that's first in the list. So if getNucSeq is consistently slower than btit, it will always be last. However, in many cases the two services will complete around the same time - making the ordering nondeterministic. That means that two consecutive workflow runs with services returning the same data could produce two differently ordered lists.
Another issue with the merging is that it's easy to construct invalid workflows by accident, say if getNucSeq produces a list, and btit produces a single value, the input to showalign would be something like [ [nucVal1, nucVal2], titVal ] - which means the list doesn't have a consistent depth (2 or 1?). This is not really legal in Taverna, and in this example this would break the iteration strategy for showalign, or the code within showalign if it accepted a list as input.
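The depth problem can be made concrete with a small check written in Python. This is a hypothetical helper for illustration, not Taverna's own validation code; treating an empty list as depth 1 is an assumption of the sketch.

```python
def depth(value):
    """Return the list depth of `value`, or raise if it is inconsistent."""
    if not isinstance(value, list):
        return 0
    depths = {depth(item) for item in value}
    if len(depths) > 1:
        raise ValueError(f"inconsistent list depth: {sorted(depths)}")
    # An empty list is treated as depth 1 in this sketch.
    return 1 + (depths.pop() if depths else 0)


consistent = depth([["nucVal1", "nucVal2"], ["titVal"]])

try:
    depth([["nucVal1", "nucVal2"], "titVal"])
    mixed_is_valid = True
except ValueError:
    mixed_is_valid = False
```

The first list has a uniform depth of 2, while the second mixes a depth-1 list with a depth-0 value and is rejected.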
You can't really use Merge all data together with the conditional pattern from above. If one of the processors leading to the merge never runs, the complete (merged) input list for showalign would never arrive, hence it would never run at all.
Taverna 2 approach: Always merge
Taverna 2 has removed the inconsistent and difficult to use Select first, but kept the Merge. Additionally, now the merge itself is a separate entity in the workflow, so you can even see it in the diagram:

It makes sense that the merge is outside the processor, as it is the list coming out of the merge that is used for iteration within the processor. Internally the merge always has an output that has a list depth of one more than its inputs. The validation check of a workflow will verify this before you are allowed to run the workflow.
The merge is created automatically if you connect several links to the same input in the t2 workbench. As a curiosity - you can delete one of the links to the merge, which would leave a merge with a single input - it would wrap the input in a list.
The order of the inputs within the merged list is preserved; currently it's the same order as the one in which the links were created. We also intend to build a light UI component to be shown when selecting the merge, where the user can reorder the links.
Semantically this means that a workflow with a merge might produce data that look different (different order within the list) from when run in Taverna 1 - but now the data is consistent between runs within t2.
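A minimal Python sketch of this ordering rule follows. The `merge` function and its arguments are hypothetical and only model the behaviour described here: the position in the merged list is decided by the link order, never by which value arrived first.

```python
def merge(inputs_by_link, link_order):
    """Wrap the values from several links in one list, in link order.

    `inputs_by_link` maps a link name to the value it delivered;
    `link_order` is the order in which the links were created.
    """
    return [inputs_by_link[name] for name in link_order]


# btit happened to finish first, so its value was stored first.
arrived = {"btit": "titVal", "getNucSeq": "nucVal"}
merged = merge(arrived, ["getNucSeq", "btit"])

# A merge with a single remaining link still wraps its input in a list.
single = merge({"getNucSeq": "nucVal"}, ["getNucSeq"])
```

Running the same workflow twice therefore gives the same list both times, even if the services finish in a different order.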
This also means that the conditional branching pattern of t1 does not work out of the box. You could still do it with the merge, but the workflow would still never really finish as there would be an item missing in the list - and also the semantics would be slightly different. However, the streaming of Taverna would still invoke the processors downstream as long as they don't expect the outermost list of the showalign output.
Branching by error document
Another way to do this is to use the error documents of t2 together with a merge. All of the current activities will simply skip an input containing error documents, just returning a new (wrapped) error document. This means that you could in theory "branch by error" - a programming principle that might not be very popular among Java programmers, but should work fine within Taverna 2 workflows:

In the workflow above (t2conditional.t2flow), there are no coordination links, and the conditional branching works with iteration.
isFish returns true if the input is fish, or false otherwise. Continue_on_true will throw an exception unless the input is true, while Continue_on_false throws an exception unless the input is false. The output of these conditional checking beanshell scripts is just the input; it doesn't really matter what the output is as long as there is one. These are then put into the corresponding shouldRun inputs of run_on_true and run_on_false.
Thus the idea here is that for each input provided, either run_on_true or run_on_false should execute, but not both or neither. The outputs of those are merged, and are used for the rest of the workflow. The run_on_X processors don't have to be beanshell scripts; they can be nested workflows, which makes it possible to show/hide what happens inside the branch and ensures isolation.
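The pattern can be imitated in plain Python to see how the error documents steer the branches. This is a sketch of the semantics only: the `call` helper models how an activity is skipped when an input is an error document, the string values "true" and "false" and the "swims"/"walks" outputs are assumptions, and the function names merely mirror the processor names in the workflow.

```python
class ErrorDocument:
    """Hypothetical placeholder for a t2 error document."""

    def __init__(self, message):
        self.message = message


def call(activity, *inputs):
    """Run `activity`, or skip it when any input is an error document."""
    if any(isinstance(i, ErrorDocument) for i in inputs):
        return ErrorDocument("upstream error")
    try:
        return activity(*inputs)
    except Exception as exc:
        return ErrorDocument(str(exc))


def is_fish(value):
    return "true" if value == "fish" else "false"


def continue_on_true(flag):
    if flag != "true":
        raise ValueError("input was not true")
    return flag


def continue_on_false(flag):
    if flag != "false":
        raise ValueError("input was not false")
    return flag


def run_on_true(value, should_run):
    return f"{value} swims"


def run_on_false(value, should_run):
    return f"{value} walks"


def branch(value):
    flag = call(is_fish, value)
    on_true = call(run_on_true, value, call(continue_on_true, flag))
    on_false = call(run_on_false, value, call(continue_on_false, flag))
    return [on_true, on_false]  # the merge, in link order


results = [branch(v) for v in ["fish", "cat"]]
```

For every input exactly one of the two merged entries is a real value and the other is an error document, and because `branch` is applied per item the pattern also works when iterating over a list.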
Native branching
Although it is possible to do the conditional branching using the above pattern, it does not feel very natural, and it is still a bit error-prone. For instance there is nothing ensuring that the two branches don't talk to each other, so if by mistake you had a processorC in branch A that took one of its auxiliary inputs from a processor in branch B - processorC would never run and its only output would be error documents. The workflow would be valid - it just wouldn't do much real work.
One easy solution to this is to put each branch in a nested workflow. Graphically they will be shown as separate boxes, you can edit and test them separately, and you can easily see what their links to the rest of the workflow are, as everything would have to be connected through a nested workflow input or output.
This leads us down to a selection between different nested workflows - or generally - a selection of activities. This is something the processor already can do, and it will be possible for us to later extend this functionality to implement branching natively.
while-loops
Taverna 1 does not directly support while-loops. Although it is often asked for by computer scientists, it's something that does not fit directly into a functional dataflow language. However, that does not mean it was not possible, and a pattern for doing loops in Taverna 1 is like this:

(Note that due to a bug in the diagram the coordination link from getResults to the nested workflow not_finished is not shown in this diagram).
The nested workflow here is checking if the submitted job has finished - if not it will fail the Fail_if_false processor. This processor has been ticked as critical within the nested workflow, which means that if it fails, the whole nested workflow will fail as well. The coordination link from getResults to not_finished means that getResults is not executed until the nested workflow completes. In the parent workflow, a retry count of say 10000 with a delay of 500 ms has been set, which means that Taverna will retry running the nested workflow for about an hour and a half, checking every half a second if the job has finished. If it has, getResults will be called.
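In ordinary code the same pattern is a bounded polling loop. The Python sketch below is an analogy rather than Taverna behaviour: `poll_until_finished` and its parameters are hypothetical, and the `sleep` argument exists only so that the loop can be tried without actually waiting.

```python
import time


def poll_until_finished(check, max_retries=10000, delay_s=0.5, sleep=time.sleep):
    """Call `check()` until it returns True or the retries run out.

    Returns the number of attempts used, or None if the limit was hit.
    """
    for attempt in range(1, max_retries + 1):
        if check():
            return attempt
        sleep(delay_s)
    return None


# Upper bound on the time spent in delays alone: 10000 * 0.5 s.
max_wait_minutes = 10000 * 0.5 / 60
```

The delays alone add up to roughly 83 minutes; the time taken by each status check comes on top of that, which is consistent with the estimate of about an hour and a half.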
There are some limitations in this design, for instance the loop can't feed into its next iteration, and you are forced to set a maximum number of retries.
No Critical flag in t2 (yet!)
The Critical flag on a processor in Taverna 1 is currently not implemented in Taverna 2, because its semantics would be slightly different.
In Taverna 1, the Critical flag fails the whole workflow directly: the workflow engine immediately stops sending messages around and does not start any new jobs. In Taverna 2 there really is no monolithic workflow engine; rather, it is the different processors that send each other the data references. A big "stop workflow" flag would probably need to be added to the processor dispatch stack, to prevent jobs flowing up or down the stack and to stop the workflow gracefully. As always, actually stopping the service is usually difficult.
Another issue is that a nested workflow in Taverna 1 internally waited until the nested workflow was "complete", and only then picked up its outputs and returned them to the processor. In t2 the nested workflow activity collects outputs as they come from the processors in the nested workflow, and when it has collected all the outputs for all the output ports, it sends them up to the processor. However, this does not mean that the nested workflow is actually finished at that stage. In the example workflow above, the output value on status would be ready before Fail_if_false is even called, so by the time the failure happens the data has already travelled out to the parent workflow.
There are ways to implement this and a general "stop" button for Taverna, like a dispatch stack layer that checks a flag to see if the workflow should continue to run, and if not just prevents any jobs from flowing. We will look into this at a later stage and re-introduce the Critical flag at that point.
The semantics of the critical flag in t1 also meant that if a processor failed, nothing more would be executed after that. However, in Taverna 2 the streaming means that while an iteration is going on several processors downstream might already have started. In this sense it could be that the failing-processor has been used as a kind of coordination link in Taverna 1, and this semantic would also be different in t2.
Native while-loops in t2
Similar to how we can extend t2 to support conditional branching, we can also extend t2 to support while-loops "natively" without having to use the Failing and Critical hack from above.
Imagine two nested workflows, where one workflow is the workflow that is to be looped over, and the other is the conditional workflow that will determine if the loop is to continue. Add both of these as activities to a processor in t2, and activate the (not yet developed) while-layer for the processor. You will have to configure the mapping of input- and output-ports between the processor and the activities. It would even be possible to map outputs from the looped workflow to the conditional workflow's inputs, but to do this you would also need to specify what the initial value would be on the first run.
When running this loop, the while-layer would first select the conditional workflow and check its shouldContinue output port; if it contains the value true, it will run the looped workflow again. When the value is false, the last returned values of the looped workflow are returned from the processor.
One can imagine various configurations of this behaviour, for instance do-while is probably easier (guaranteeing at least one run), or returning a list of all the results instead of just the last results. There is no reason why the two activities have to be nested workflows; in many cases a beanshell script would do as the conditional and a web service call as the activity to loop.
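Since the while-layer has not been developed yet, the following Python sketch only illustrates the control flow described above. The function names, the `collect` option and the use of plain functions in place of activities are all assumptions made for this example.

```python
def run_while(should_continue, body, initial):
    """Check the condition first, then run `body`; return the last result.

    `initial` plays the role of the initial value that must be given
    when outputs of the loop feed the condition.
    """
    result = initial
    while should_continue(result):
        result = body(result)
    return result


def run_do_while(should_continue, body, initial, collect=False):
    """Run `body` at least once; optionally return all results as a list."""
    results = []
    result = initial
    while True:
        result = body(result)
        results.append(result)
        if not should_continue(result):
            break
    return results if collect else result


last = run_while(lambda n: n < 100, lambda n: n * 2, 1)
history = run_do_while(lambda n: n < 3, lambda n: n + 1, 0, collect=True)
```

Here `last` is the first doubled value that is no longer below 100, and `history` shows the variant that returns every intermediate result instead of only the final one.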
This is another extension of t2 that we will look at building once we've got the core t2 workbench ready.
Layered iteration strategies
Iteration strategies are somewhat of an advanced feature of Taverna 1 that is most of the time defined when the default cross-product is not what you mean, as some of the input ports are to be provided with matching pairs (or triples, etc). To specify this one will change the iteration strategy to have a dot product between those ports.
For some cases, in advanced workflows you might run into an issue where you have a list of lists going into a processor taking single values, and you want to do a cross-product (all to all) of the outermost lists, but you always want pairs of the inner-most lists.
This was a situation that was difficult to define in Taverna 1 as the iteration strategy always will do cross- and dot-product down to the granularity of the values expected by the processor, in this case single values. A workaround for this that was proposed was to put the processor in a nested workflow, and use the echo_list local worker between the input port of the nested workflow and the processor. This would force the nested workflow to expect lists as inputs instead of single values, and an iteration strategy could then be set on the processor for the nested workflow.
In Taverna 2, it is possible to have layers of iteration strategies, where each layer "digs down" in the list values to a certain level. So the first iteration strategy could do a cross-product of the lists of lists down to lists, and the second strategy could do the dot-product. The need for the nested workflow disappears, although such a workflow should still work as expected in Taverna 2. Additionally, the echo_list local worker would not be needed as one can specify the desired depth of the lists coming in to an input port of a nested workflow.
We have not yet added the user interface for defining such complex iteration strategies to the Taverna 2 workbench, but this is something that is in the plan.
A deeper explanation from an email exchange:
For all purposes there currently is only one iteration strategy. But - in the code there's support for a stack of iteration strategies. Normally we only have one layer.
The reason for supporting a stack is that there are some iterations that are not possible to describe with our iteration strategy. Most interestingly is the problem of iterating over lists of lists for a service that takes single inputs.
What is not easy to do with a single iteration strategy is to do a cross product of the outermost lists, and a dot product of the inner lists, but this is often something that is needed in cases where the inner lists are vectors of the same length, and are not to be iterated over, just matched one-to-one with each-other.
See http://www.myexperiment.org/workflows/185 for a description of the problem and how it can be solved in Taverna 1 using nested workflows. Note the use of the "echo list" local worker, the only reason for this is to make sure the inner nested workflow takes a list as an input - which would prevent the outer iteration strategy on the nested workflow from digging down the innermost list. In Taverna 2 one can do the same without the echo list, as an input port itself describes the desired depth.
But - with native support for staged iteration, there is no need for this nested workflow, but the end result would be the same.
Each layer of the iteration strategy stack can "dig down" to a desired depth, which would then be the inputs to the next iteration strategy in the stack. The lowest iteration strategy will have to dig down to the depth needed by the ports of the processor.
So one could have two inputs A and B:
    A(depth=2): { { a,b,c }, { d,e,f } }
    B(depth=2): { { g,h,i }, { j,k,l } }
    iter1(depth(a)=1, depth(b)=1): A x B
    iter2(depth(a)=0, depth(b)=0): A . B
This could then fit into a simple activity that takes two single inputs at depth 0.
iter1 is to dig a and b down to depth 1 (these depths don't need to match, but each would be reduced by 1), and would do the cross product (forgive me if I mixed the order of A x B vs B x A in this example):
    { { A={ a,b,c }, B={ g,h,i }
        A={ a,b,c }, B={ j,k,l } }
      { A={ d,e,f }, B={ g,h,i }
        A={ d,e,f }, B={ j,k,l } } }
The second iteration strategy does the dot product, so we would match each of the As with the corresponding item in B. Luckily for us in this example both lists are of the same length.
    { { { A=a, B=g   A=b, B=h   A=c, B=i }
        { A=a, B=j   A=b, B=k   A=c, B=l } }
      { { A=d, B=g   A=e, B=h   A=f, B=i }
        { A=d, B=j   A=e, B=k   A=f, B=l } } }
It is naturally quite difficult to build an intuitive user interface for building such iteration strategies, which is why we've put it on hold for now to expose this feature in the UI.
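The two-layer example from the email can be reproduced with nested loops in Python. This is a sketch of the resulting data shape only, not of the t2 iteration strategy classes: `layered_iteration` is a made-up name, and string concatenation stands in for the activity that takes two single inputs.

```python
def layered_iteration(a, b, activity):
    """Cross product over the outer lists, dot product over the inner lists."""
    results = []
    for inner_a in a:          # layer 1: cross product of the outer lists
        row = []
        for inner_b in b:
            # layer 2: dot product, pairing items position by position
            row.append([activity(x, y) for x, y in zip(inner_a, inner_b)])
        results.append(row)
    return results


A = [["a", "b", "c"], ["d", "e", "f"]]
B = [["g", "h", "i"], ["j", "k", "l"]]
out = layered_iteration(A, B, lambda x, y: x + y)
```

The result has depth 3, with the pairs in the same grouping as the email shows. Note that `zip` silently truncates to the shorter list, so this sketch does not check that the inner lists have the same length.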
Summary
Taverna 2 is based on a new structure for defining the workflow. Most Taverna 1 workflows should be translated automatically when opened in the Taverna 2 workbench, but a few workflow patterns such as multiple links to the same input or conditional branching using Fail-if-processors and the Critical flag will not behave as in Taverna 1. Most of these cases can be rebuilt with cleaner constructs in Taverna 2, and forthcoming extensions for t2 will include conditional branching and while-loops.
When enacting a Taverna 2 workflow, pipelining will skip unnecessary delays when iterating over large collections, but this could invoke services in a slightly different order than in Taverna 1. An error in invocation is no longer fatal: iterations will continue and the failure is recorded as an error document in place of the expected value.
