r8 - 17 May 2006 - 12:30:58 - JuneFinch
High Level Requirement Specification

Providing Database Storage Facilities for Workflow Data

Reference Techreq.DatabaseStorage
Referenced Use-cases QtlMicroarray LandscapeGenomics
Dependencies LargeData T2Architecture
Champion Tom Oinn
Status DEFERRED

  Taverna 1.3 Taverna 2.0
Priority   3
Rough estimate MONTH MONTH

Overview

A generic mechanism to populate a relational database with the results of workflows, providing a ‘data cache’ for high-throughput experiments that produce large amounts of complex data. Typically, this data needs further analysis and/or mining before useful conclusions can be drawn from it.

Overall Goals

Provide a mechanism for capturing, and therefore also displaying, results from complex workflows in a form that can be stored and queried. There should be a level of abstraction that does not force the data to be stored in an SQL database, but also allows alternative mechanisms for serializing the data.

Care must be taken that, on average, the cache can be populated as quickly as the data is gathered; otherwise a bottleneck will occur, leading to a build-up of data in memory and potentially to an out-of-memory exception. A suggested solution is to use a buffer, and to pause between Processor units (either a full process or an iteration) until the buffer has been cleared.
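The buffering idea can be illustrated with a bounded queue. The following is only a minimal sketch, not a Taverna API: the class name `BufferedResultWriter`, its methods, and the in-memory list standing in for the database are all hypothetical. It also uses a slightly weaker condition than the one suggested above: the producer pauses only while the buffer is full, rather than until it has been emptied completely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: none of these names come from Taverna.
final class BufferedResultWriter {
    // Sentinel compared by identity, so a result equal to "END" is still stored.
    private static final String END = new String("END");

    private final BlockingQueue<String> buffer;
    private final List<String> stored = new ArrayList<>(); // stands in for the database
    private final Thread writer;

    BufferedResultWriter(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(this::drain, "result-writer");
        this.writer.start();
    }

    // Called after each processor invocation or iteration.
    // Blocks while the buffer is full, which pauses the workflow
    // instead of letting results pile up in memory.
    void submit(String result) {
        put(result);
    }

    // Signals end of input, waits for the writer to finish, returns what was stored.
    List<String> close() {
        put(END);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stored;
    }

    private void put(String item) {
        try {
            buffer.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Writer thread: takes items in arrival order until the sentinel is seen.
    private void drain() {
        try {
            for (String item = buffer.take(); item != END; item = buffer.take()) {
                stored.add(item); // a real implementation would write to the store here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a capacity of 2, for example, a workflow that produces results faster than they can be written never holds more than two unwritten results in the buffer.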

Assessment

Affected Components

Taverna workbench, data store, metadata store, Provenance

Key Tasks

  • Identify mechanism for capturing results.
  • Identify mechanism for storing captured results.
  • Design and implement an abstract framework to provide data serialization.
  • Implement a default SQL serialization framework built upon the above abstraction.
  • Implement a component for querying / viewing stored results.

Appendix

Explicit Data Serialization

Experience with the previous architecture has identified a common requirement for workflows to populate custom databases with intermediate and final results. This led to workflows where the analysis and storage functions were intermingled, in turn lowering the comprehensibility and reusability of those workflows.

While the original design for myGrid specified a generic data store, it has become apparent that this cannot satisfy the data management requirements coming from our users. The life sciences community in particular has a great deal of experience and effort invested in custom data management systems. We therefore have an obligation to provide a clean and well-defined mechanism whereby a workflow may store data into such a management system while avoiding the current conflation of analysis and storage concerns.

To this end we introduce an output data serialization framework. This framework allows the injection of code into a container through which output data tokens from each process are routed, and is configured on a per-process-node basis. The framework is called each time a single job invocation completes and produces a set of outputs. Any registered serialization components within the container have full access to all result items in the set, and may optionally modify their reference sets by inserting additional reference types if appropriate.
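One way to picture the container described above is as a per-process registry of callbacks. This is an illustrative sketch under stated assumptions: the names `DataToken`, `OutputSerializer` and `SerializationContainer` are hypothetical and do not come from the Taverna code base, and references are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: one output item plus the set of references through which it can be reached.
final class DataToken {
    final String port;
    final Object value;
    final List<String> references = new ArrayList<>();

    DataToken(String port, Object value) {
        this.port = port;
        this.value = value;
    }
}

// Hypothetical: a serialization component injected into the container.
interface OutputSerializer {
    // Sees every result item of one completed job and may add references to any of them.
    void onJobCompleted(String processName, List<DataToken> outputs);
}

// Hypothetical: the container through which output tokens are routed.
final class SerializationContainer {
    // Components are configured per process node.
    private final Map<String, List<OutputSerializer>> byProcess = new LinkedHashMap<>();

    void register(String processName, OutputSerializer serializer) {
        byProcess.computeIfAbsent(processName, k -> new ArrayList<>()).add(serializer);
    }

    // Called once per completed job invocation, before the outputs go downstream.
    List<DataToken> jobCompleted(String processName, List<DataToken> outputs) {
        List<OutputSerializer> registered =
                byProcess.getOrDefault(processName, Collections.emptyList());
        for (OutputSerializer serializer : registered) {
            serializer.onJobCompleted(processName, outputs);
        }
        return outputs;
    }
}
```

Because components are registered against a process name, a process with no registered component passes its tokens through unchanged.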

As an example, such a component may choose to store the data to a local relational database. As the data now exists in a form that can be referenced and potentially used directly by other services, the component can create a new data reference object, possibly in the form of an SQL query definition, and attach it to the token before the results of the operation are visible to the next stage in the workflow. Another, more complex case would be a component which accesses a library of functions capable of parsing structured data and transforming it to RDF to be inserted into a triple store, thus augmenting the coarse-grained data provenance with finer domain-specific knowledge that would otherwise remain opaque to the result comprehension tools.
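The relational example might look roughly as follows. This sketch makes several assumptions: the class names, the table `results(id, value)` and the query text are all invented for illustration, and an in-memory map stands in for the database so that the example runs without a JDBC driver.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reference type: a parameterised query plus the key that selects the stored row.
final class SqlReference {
    final String query;
    final long id;

    SqlReference(String query, long id) {
        this.query = query;
        this.id = id;
    }
}

// Hypothetical component; the map stands in for an assumed table results(id, value).
final class RelationalStoreComponent {
    private final Map<Long, String> table = new HashMap<>();
    private long nextId = 1;

    // Stores one result value and returns a reference that can be
    // attached to the token before it reaches the next workflow stage.
    SqlReference store(String value) {
        long id = nextId++;
        table.put(id, value); // a real component would issue an INSERT here
        return new SqlReference("SELECT value FROM results WHERE id = ?", id);
    }

    // What a downstream service would obtain by running the referenced query.
    String resolve(SqlReference ref) {
        return table.get(ref.id);
    }
}
```

A downstream service holding the `SqlReference` can then fetch the value from the database directly instead of receiving it inline through the workflow.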

(Taverna v2 Aims and Vision, Tom Oinn)
