r11 - 17 May 2006 - 12:54:15 - JuneFinch

High Level Requirement Specification

Handling large amounts of data

Reference Techreq.LargeData
Referenced Use-cases QtlMicroarray LandscapeGenomics SystemsBiology NanoCmos MicroArray
Dependencies  
Champion StuartOwen
Status DEFERRED

  Taverna 1.3 Taverna 2.0
Priority   1
Rough estimate - MONTH(s)

Overview

The ability to deal with the shipping of large-scale data, both during enactment and during results browsing. At present, when a large blast report comes back from a blast service, Taverna hangs when trying to display the results.

Overall Goals

Design and development of a mechanism for handling large amounts of data, leading to a version of Taverna that does not fall over when dealing with large blast reports, or other sources of large data.

This will involve representing each data item as a token that carries its own referencing scheme, describing the different access methods to the data, whether it is local or remote. Where large amounts of data are involved, it would be preferable to keep the data remote, alongside the processes that need to operate on it. There will also be a need to move data when the processes that need to work on it are in different locations. (See Appendix.)

Whether the data is local or remote, and the mechanism for retrieving or updating its content, should be encapsulated by the data token and should not be a concern of those parts of Taverna that operate with or upon a data token.

Assessment

This is completely covered by the new enactment engine.

Affected Components

Taverna workbench, Provenance browser, Freefluo enactment engine.

Key Tasks

  • Requires a fair bit of investigation to determine the best approach (and whether it's possible within Taverna 1.3)

Appendix

Anatomy of a Data Token

The data tokens are reference types with a flexible scheme allowing for multiple simultaneous kinds of reference scheme within the same token representing different access methods for the same data. This description applies to the contents of the token; the index array information is part of its context rather than the inherent value.

The intent of this is to allow data to be used in its native form wherever possible. Consider the case we have in some of our current workflows where consecutive operations are hosted on the same server, some considerable distance from the enactment engine. In the previous architecture all data was passed as a value: the output of one process would be shipped back to the enactment engine and then sent back out to the next process. This is clearly inefficient when the processes are on the same machine. As a compromise, some service types in the previous architecture allowed the passing of references, but the enactment engine had no awareness that these were inherently reference types; as far as it was concerned they were values, albeit short ones. While this worked between two instances of a particular service type, additional operations were required to explicitly resolve these references before the data could be passed on to other types of process. This explicit resolution became part of the workflow, which is a bad thing, as our aim is to insulate the user from exactly this kind of complexity.

A variety of reference types exist in the wild. The most common reference, used by almost all internet users, is a URL or URI. Any data grid or storage system will also have a way of identifying a data item within that system (this being a primary role of such services). For each reference scheme we can introduce an implementation of the data reference contract. This contract requires two things: that a serializable form of the reference exists and is defined, and that the reference may be resolved to a stream of bytes.

As a base case we define a local cache reference type – this works in conjunction with a blob store accessible to the workflow enactment engine and is used whenever a service requires or emits a value type.
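
The data reference contract and the local cache base case might be sketched as follows. This is a minimal illustration only: the class names, the method names (serialize, open_stream), the "local-cache" scheme label and the in-memory blob store are all assumptions made for the example, not part of any Taverna API.

```python
import io
import uuid
from abc import ABC, abstractmethod
from typing import BinaryIO, Dict


class DataReference(ABC):
    """Hypothetical rendering of the data reference contract."""

    scheme: str  # name of the reference scheme (assumed labels)

    @abstractmethod
    def serialize(self) -> str:
        """Return a serializable (string) form of the reference."""

    @abstractmethod
    def open_stream(self) -> BinaryIO:
        """Resolve the reference to a stream of bytes."""


class BlobStore:
    """In-memory stand-in for the blob store available to the enactor."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Store the bytes under a freshly generated key and return the key.
        key = uuid.uuid4().hex
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalCacheReference(DataReference):
    """Base case: a reference to a value held in the blob store."""

    scheme = "local-cache"

    def __init__(self, store: BlobStore, key: str) -> None:
        self._store = store
        self.key = key

    def serialize(self) -> str:
        return f"{self.scheme}:{self.key}"

    def open_stream(self) -> BinaryIO:
        return io.BytesIO(self._store.get(self.key))


# Usage: a service emits a value, which is wrapped as a local cache reference.
store = BlobStore()
ref = LocalCacheReference(store, store.put(b"large blast report"))
payload = ref.open_stream().read()
```

Other schemes (a URL, a data grid identifier) would be further subclasses satisfying the same two requirements.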

Each data token contains a set of one or more named data reference implementations. Every operation within the process graph must declare, for each input port, the set of data reference types that port accepts, with the additional requirement that the local cache reference must always be accepted; that is, all processors must accept pass by value in addition to any pass by reference schemes.
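
As an illustration of the token and port declarations described above, the following sketch models a token as a mapping from scheme name to serialized reference, and an input port as a set of accepted scheme names. DataToken, InputPort, usable_schemes and the scheme labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable

LOCAL_CACHE = "local-cache"  # assumed label for the base-case scheme


@dataclass
class DataToken:
    """One or more named references, all pointing at the same data."""

    references: Dict[str, str] = field(default_factory=dict)  # scheme -> serialized form

    def __post_init__(self) -> None:
        if not self.references:
            raise ValueError("a data token needs at least one reference")

    def schemes(self) -> FrozenSet[str]:
        return frozenset(self.references)


class InputPort:
    """An input port declaring which reference schemes it accepts."""

    def __init__(self, name: str, accepts: Iterable[str] = ()) -> None:
        self.name = name
        # The local cache reference is always accepted (pass by value).
        self.accepts = frozenset(accepts) | {LOCAL_CACHE}

    def usable_schemes(self, token: DataToken) -> FrozenSet[str]:
        # Schemes the token offers that this port can consume directly.
        return self.accepts & token.schemes()


token = DataToken({"url": "http://example.org/report.txt"})
reference_port = InputPort("report", accepts=["url"])
value_only_port = InputPort("report")
```

Here reference_port can consume the token as it stands, while value_only_port shares no scheme with it and would need the data brought into the local cache first.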

Immediately before a set of input data tokens, in the form of a job, is applied to a process worker, the enactor compares the data reference types in each token to those accepted by the corresponding input port. Any case where there are no types in common triggers an automatic de-reference of the data within the token. This makes use of the ability of all data references to resolve to a byte stream: the byte stream is pulled into the blob store and a new local cache reference is inserted into the token in addition to the original set of types. As all process workers must accept at least the local cache reference type, the operation now has access to all its input data.
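
The comparison and automatic de-reference step could look roughly like the sketch below. It is not the enactor's actual implementation: the function name prepare_input, the resolver table, the dictionary-based blob store and the "grid-a" scheme are assumptions introduced purely for illustration.

```python
import io
import uuid
from typing import BinaryIO, Callable, Dict, Set

LOCAL_CACHE = "local-cache"  # assumed label for the base-case scheme

# Hypothetical resolver table: scheme name -> function that turns a
# serialized reference into a byte stream.
Resolver = Callable[[str], BinaryIO]


def prepare_input(
    token: Dict[str, str],
    accepted: Set[str],
    resolvers: Dict[str, Resolver],
    blob_store: Dict[str, bytes],
) -> Dict[str, str]:
    """Return a token the port can consume, de-referencing only if needed."""
    if set(token) & accepted:
        return token  # a shared scheme exists, so the data stays where it is
    # No scheme in common: resolve one of the references to bytes.
    scheme, serialized = next(iter(token.items()))
    data = resolvers[scheme](serialized).read()
    # Pull the bytes into the blob store.
    key = uuid.uuid4().hex
    blob_store[key] = data
    # Add a local cache reference alongside the original references.
    return {**token, LOCAL_CACHE: key}


# Usage with a pretend remote storage system.
remote_items = {"item-1": b"ACGTACGT"}
resolvers = {"grid-a": lambda ref: io.BytesIO(remote_items[ref])}
blob_store: Dict[str, bytes] = {}
token = {"grid-a": "item-1"}

same_site = prepare_input(token, {"grid-a", LOCAL_CACHE}, resolvers, blob_store)
elsewhere = prepare_input(token, {LOCAL_CACHE}, resolvers, blob_store)
```

In the first call nothing is transferred; only the second call, where the port accepts nothing but the local cache reference, copies the bytes into the blob store.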

This approach avoids de-referencing non-local reference types unless absolutely required by the process consuming the tokens. Implicitly, this allows the workflow system to take advantage of any available third party transfer mechanisms while simplifying the construction of the workflow from the user’s perspective (she no longer needs to distinguish between the different reference and value schemes). The data token itself can be serialized to XML; such documents are very short. Similarly, the token has an identity independent of any reference schemes within the reference set, and it is this identity that is used to manage data provenance. In the previous architecture these identifiers were allocated by an LSID assigning service, with the LSID protocol used to expose them through a suitable authority implementation for publishing. The exact naming scheme for this generation of the architecture is as yet undecided.
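
To give a feel for how short a serialized token can be, here is one possible XML mapping. The element and attribute names (dataToken, reference, id, scheme) and the example identifier are invented for this sketch; the source does not define the XML format or the naming scheme.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def token_to_xml(token_id: str, references: Dict[str, str]) -> str:
    """Serialize a token identity plus its reference set to an XML string."""
    root = ET.Element("dataToken", {"id": token_id})
    for scheme, serialized in sorted(references.items()):
        ref = ET.SubElement(root, "reference", {"scheme": scheme})
        ref.text = serialized
    return ET.tostring(root, encoding="unicode")


def token_from_xml(document: str) -> Tuple[str, Dict[str, str]]:
    """Recover the token identity and reference set from the XML string."""
    root = ET.fromstring(document)
    refs = {el.get("scheme"): el.text for el in root.findall("reference")}
    return root.get("id"), refs


references = {
    "url": "http://example.org/report.txt",
    "local-cache": "3f2a9c",
}
document = token_to_xml("urn:example:token:1", references)
restored_id, restored_refs = token_from_xml(document)
```

The identity lives on the root element, independent of the references beneath it, so references can be added or dropped without changing the identity used for provenance.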

Collection structures are similarly serialized to XML. It is assumed that this serialization is obvious: the collections are inherently tree structured, so a trivial XML mapping exists.
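
One such trivial mapping is sketched below, treating a collection as a nested list whose leaves are token identifiers. The element names collection and tokenRef are assumptions for the example, as is the choice to store identifiers rather than whole tokens at the leaves.

```python
import xml.etree.ElementTree as ET
from typing import List, Union

# A leaf is a token identifier; a list is a (possibly nested) collection.
Tree = Union[str, List["Tree"]]


def collection_to_element(node: Tree) -> ET.Element:
    """Map a nested collection onto a tree of XML elements."""
    if isinstance(node, str):
        leaf = ET.Element("tokenRef")
        leaf.set("id", node)
        return leaf
    parent = ET.Element("collection")
    for child in node:
        parent.append(collection_to_element(child))
    return parent


def element_to_collection(element: ET.Element) -> Tree:
    """Inverse mapping: rebuild the nested collection from XML elements."""
    if element.tag == "tokenRef":
        return element.get("id")
    return [element_to_collection(child) for child in element]


nested = [["t1", "t2"], ["t3"]]
xml_text = ET.tostring(collection_to_element(nested), encoding="unicode")
round_trip = element_to_collection(ET.fromstring(xml_text))
```

Because each list becomes exactly one element and each leaf exactly one child, the nesting depth of the collection is preserved by the document structure itself.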

(Taverna v2 Aims and Vision, TomOinn)
