High Level Requirement Specification
Handling large amounts of data
| | Taverna 1.3 | Taverna 2.0 |
| Priority | | 1 |
| Rough estimate | - | MONTH(s) |
Overview
The ability to handle the shipping of large-scale data, both during enactment and during results browsing. At present, when a large BLAST report comes back from a BLAST service, Taverna hangs while trying to display the results.
Overall Goals
Design and development of a mechanism for handling large amounts of data, leading to a version of Taverna that does not fall over when dealing with large BLAST reports or other sources of large data.
This will involve representing each data item as a token that carries its own referencing scheme, describing the different access methods to the data, whether local or remote. Where large amounts of data are involved, it is preferable to keep the data remote, alongside the processes that need to operate on it. Data will also need to be moved when the processes that need to work on it are in different locations (see Appendix).
Whether the data is local or remote, and how its content is retrieved or updated, should be encapsulated by the data token and should not be a concern of those parts of Taverna operating with or upon a data token.
Assessment
This is completely covered by the new enactment engine.
Affected Components
Taverna workbench, Provenance browser, Freefluo enactment engine.
Key Tasks
- Requires a fair bit of investigation to determine the best approach (and whether it is possible within Taverna 1.3)
Appendix
Anatomy of a Data Token
The data tokens are reference types with a flexible scheme allowing for multiple
simultaneous kinds of reference scheme within the same token representing different
access methods for the same data. This description applies to the contents of the
token; the index array information is part of its context rather than the inherent value.
The intent of this is to allow data to be used in its native form wherever possible.
Consider the case we have in some of our current workflows where there are
consecutive operations hosted on the same server (some considerable distance from
the enactment engine). In the previous architecture all data was passed as a value: the
output of one process would be shipped back to the enactment engine, then sent back
out to the next process. This is clearly not efficient in the case where the processes
are on the same machine. As a compromise, some service types in the previous
architecture allowed the passing of references, but the enactment engine had no
awareness that these were inherently reference types; as far as it was concerned they
were values, albeit short ones. While this worked between two instances of a particular
service type, additional operations were required to explicitly resolve these references
before the data could be passed on to other types of process. This explicit resolution
became part of the workflow, a bad thing as our aim is to insulate the user from
exactly this kind of complexity.
A variety of reference types exist in the wild. The most common reference, used by
almost all internet users, is a URL or URI. Any data grid or storage system will also
have a way of identifying a data item within that system (this being a primary role of
such services). For each reference scheme we can introduce an implementation of the
data reference contract. This contract requires two things: that a serializable form of
the reference exists and is defined, and that the reference may be resolved to a stream
of bytes.
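The two-part contract above can be sketched as an abstract class. This is a minimal illustration only: the names `DataReference`, `serialize`, `open_stream`, `UrlReference` and `read_all` are assumptions made for this example and are not taken from the Taverna code base.

```python
import urllib.request
from abc import ABC, abstractmethod
from typing import BinaryIO


class DataReference(ABC):
    """Hypothetical data reference contract (names are illustrative)."""

    @abstractmethod
    def serialize(self) -> str:
        """Return the serializable form of this reference."""

    @abstractmethod
    def open_stream(self) -> BinaryIO:
        """Resolve this reference to a stream of bytes."""


class UrlReference(DataReference):
    """One possible scheme: a plain URL, resolved with the standard library."""

    def __init__(self, url: str) -> None:
        self.url = url

    def serialize(self) -> str:
        # For a URL the serialized form is simply the URL itself.
        return self.url

    def open_stream(self) -> BinaryIO:
        # urlopen returns a file-like object that yields bytes.
        return urllib.request.urlopen(self.url)


def read_all(ref: DataReference) -> bytes:
    """Drain a reference completely; works for any scheme honouring the contract."""
    with ref.open_stream() as stream:
        return stream.read()
```

Code that consumes a reference only needs `serialize` and `open_stream`, so a grid or storage-system scheme could be added as another subclass without changing callers.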
As a base case we define a local cache reference type. This works in conjunction
with a blob store accessible to the workflow enactment engine and is used whenever a
service requires or emits a value type.
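A rough sketch of the local cache reference and its blob store follows. The in-memory dictionary, the content-hash keys and the `local-cache:` prefix are assumptions chosen to keep the example short; the source does not specify how the blob store is implemented or keyed.

```python
import hashlib
import io
from typing import BinaryIO, Dict


class BlobStore:
    """In-memory stand-in for the enactor's blob store (assumed design)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption: blobs are keyed by the SHA-256 of their content.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalCacheReference:
    """Reference to a value held in the blob store (pass by value)."""

    SCHEME = "local-cache"  # hypothetical scheme name

    def __init__(self, store: BlobStore, key: str) -> None:
        self._store = store
        self.key = key

    def serialize(self) -> str:
        return f"{self.SCHEME}:{self.key}"

    def open_stream(self) -> BinaryIO:
        # Resolve to a byte stream by reading the blob back from the store.
        return io.BytesIO(self._store.get(self.key))
```

A service that emits a value would have its output written with `put`, and the resulting key wrapped in a `LocalCacheReference`.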
Each data token contains a set of one or more named data reference implementations.
All operations within the process graph must declare, for each input port, what set of
data reference types it accepts, with the additional requirement that the local cache
reference must always be accepted; that is, all processors must accept pass by
value in addition to any pass by reference schemes.
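The relationship between tokens and input ports might be modelled as below. `DataToken`, `InputPort`, `usable_schemes` and the scheme names are hypothetical; the only rule taken from the text is that the local cache reference is always accepted.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable

LOCAL_CACHE = "local-cache"  # hypothetical name for the pass-by-value scheme


@dataclass
class DataToken:
    """A token holding one or more named references to the same data."""

    token_id: str
    # scheme name -> serialized form of the reference
    references: Dict[str, str] = field(default_factory=dict)


class InputPort:
    """An input port declaring which reference schemes it accepts."""

    def __init__(self, name: str, accepted_schemes: Iterable[str]) -> None:
        self.name = name
        # The local cache scheme is added unconditionally, so every
        # port accepts pass by value whatever else it declares.
        self.accepted: FrozenSet[str] = frozenset(accepted_schemes) | {LOCAL_CACHE}

    def usable_schemes(self, token: DataToken) -> FrozenSet[str]:
        """Schemes present in the token that this port can consume directly."""
        return self.accepted & frozenset(token.references)
```

An empty result from `usable_schemes` is the case in which the data would have to be pulled into the blob store before the port can use it.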
Immediately before a set of input data tokens, in the form of a job, is applied to a
process worker, the enactor compares the data reference types in each token to those
accepted by the corresponding input port. Any case where there are no types in
common triggers an automatic de-reference of the data within the token. This makes use
of the ability of all data references to resolve to a byte stream, pulling the byte stream
into the blob store and inserting a new local cache reference into the token in addition
to the original set of types. As all process workers must accept at least the local cache
reference type, the operation now has access to all its input data.
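The comparison and automatic de-reference step could look roughly like this. It is a simplified sketch under stated assumptions: references are modelled as zero-argument callables returning a byte stream, the blob store is a plain dictionary, and `Token`, `prepare_input` and the `blob-N` keys are invented for the example.

```python
import io
from typing import BinaryIO, Callable, Dict, Iterable, Set

LOCAL_CACHE = "local-cache"  # hypothetical scheme name

Resolver = Callable[[], BinaryIO]


class Token:
    """Token whose references are modelled as stream-producing callables."""

    def __init__(self, references: Dict[str, Resolver]) -> None:
        # The text guarantees at least one reference per token.
        self.references = dict(references)


def prepare_input(token: Token, accepted: Iterable[str],
                  blob_store: Dict[str, bytes]) -> Set[str]:
    """Return the schemes the port can use, de-referencing only if needed."""
    accepted_set = set(accepted) | {LOCAL_CACHE}
    common = accepted_set & set(token.references)
    if common:
        # At least one scheme is shared: nothing is transferred.
        return common

    # No scheme in common: resolve any existing reference to bytes,
    # copy them into the blob store and add a local cache reference.
    resolve = next(iter(token.references.values()))
    with resolve() as stream:
        data = stream.read()
    key = f"blob-{len(blob_store)}"
    blob_store[key] = data
    token.references[LOCAL_CACHE] = lambda: io.BytesIO(blob_store[key])
    return {LOCAL_CACHE}
```

The original references stay in the token, so a later process that does accept the remote scheme can still use it without another transfer.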
This approach avoids de-referencing non-local reference types unless absolutely
required by the process consuming the tokens. Implicitly, this allows the workflow
system to take advantage of any available third party transfer mechanisms while
simplifying the construction of the workflow from the user's perspective (she no
longer needs to distinguish between the different reference and value schemes).
The data token itself can be serialized to XML; such documents are very short.
Similarly, the token has an identity independent of any reference schemes within the
reference set, and it is this identity that is used to manage data provenance. In the previous
architecture these identifiers were allocated by an LSID assigning service, with the
LSID protocol used to expose them through a suitable authority implementation for
publishing. The exact naming scheme for this generation of the architecture is as yet
undecided.
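To illustrate how short such a document can be, here is one possible XML form of a token. The element and attribute names (`dataToken`, `reference`, `id`, `scheme`) and the identifier format are assumptions; the source defines neither an XML schema nor a naming scheme.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def token_to_xml(token_id: str, references: Dict[str, str]) -> str:
    """Serialize a token identity plus its named references to XML."""
    root = ET.Element("dataToken", {"id": token_id})
    # Sort by scheme name so the output is deterministic.
    for scheme, serialized in sorted(references.items()):
        ref = ET.SubElement(root, "reference", {"scheme": scheme})
        ref.text = serialized
    return ET.tostring(root, encoding="unicode")


def token_from_xml(text: str) -> Tuple[str, Dict[str, str]]:
    """Parse the XML form back into (token_id, references)."""
    root = ET.fromstring(text)
    references = {r.get("scheme"): r.text for r in root.findall("reference")}
    return root.get("id"), references
```

Only the identity and the serialized references are written, never the data itself, which is why the document stays small however large the referenced data is.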
Collection structures are similarly serialized to XML. It is assumed that this
serialization is obvious: the collections are inherently tree structured, so a trivial
XML mapping exists.
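One such trivial mapping, assuming collections are nested lists whose leaves are token identifiers, is sketched below. The `list` and `token` element names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Union

Collection = Union[str, List["Collection"]]


def collection_to_xml(node: Collection) -> ET.Element:
    """Map a nested list of token identifiers onto an XML tree."""
    if isinstance(node, list):
        element = ET.Element("list")
        for child in node:
            element.append(collection_to_xml(child))
        return element
    # Leaf: a single token, referred to by its identifier.
    return ET.Element("token", {"id": node})


def collection_from_xml(element: ET.Element) -> Collection:
    """Inverse mapping: rebuild the nested list from the XML tree."""
    if element.tag == "list":
        return [collection_from_xml(child) for child in element]
    return element.get("id")
```

Each level of nesting in the collection becomes one level of nesting in the XML, so the round trip loses no structure.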
(Taverna v2 Aims and Vision, TomOinn)