Suggestions for designing a future MIR
Provenance publication issues
- If data generated by Taverna is published somewhere, the provenance of the data should be associated with it. But what exactly should be associated? How can we make this publication a configurable process?
- If a data item is manipulated, or a number of data items are combined in a process, how do we keep the provenance of the original data along with the manipulated data, or integrate the provenance for the composite data object?
LSID issues
- An experiment data outcome can be a file, an atomic object or a composite object. Each of these types of outcome should be allocated an LSID in myGrid. Moreover, the composite data object or the file should provide the context for the atomic data objects it contains. Thus, we need to provide an LSID container.
- Mapping between myGrid internal LSIDs: one object should have only one persistent and consistent LSID associated with it, even if it exists in several places in the metadata repository.
- Mapping myGrid internal LSIDs to the life science community LSIDs, e.g. mapping the LSID for a GenBank sequence, which could be published by Taverna or by the GenBank/NCBI database providers.
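The three LSID points above can be sketched in a few lines. This is only an illustration, not existing myGrid code: the `Lsid` and `LsidRegistry` classes and all their method names are hypothetical, and the only thing taken from the LSID standard is the `urn:lsid:<authority>:<namespace>:<object>[:<revision>]` syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Lsid:
    """Parsed form of urn:lsid:<authority>:<namespace>:<object>[:<revision>]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "Lsid":
        parts = text.split(":")
        if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
            raise ValueError(f"not an LSID: {text!r}")
        return cls(parts[2], parts[3], parts[4], parts[5] if len(parts) == 6 else None)

    def __str__(self) -> str:
        base = f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"
        return base if self.revision is None else f"{base}:{self.revision}"


class LsidRegistry:
    """Hypothetical registry: one LSID per object, containers, community aliases."""

    def __init__(self) -> None:
        self._by_object: Dict[str, Lsid] = {}       # object key -> its single LSID
        self._members: Dict[Lsid, List[Lsid]] = {}  # container LSID -> member LSIDs
        self._external: Dict[Lsid, Lsid] = {}       # internal LSID -> community LSID

    def assign(self, object_key: str, candidate: Lsid) -> Lsid:
        # The first LSID assigned wins, however many places hold the object.
        return self._by_object.setdefault(object_key, candidate)

    def add_to_container(self, container: Lsid, member: Lsid) -> None:
        members = self._members.setdefault(container, [])
        if member not in members:
            members.append(member)

    def containers_of(self, member: Lsid) -> List[Lsid]:
        # The containers give an atomic object its context.
        return [c for c, members in self._members.items() if member in members]

    def map_external(self, internal: Lsid, community: Lsid) -> None:
        self._external[internal] = community

    def external_for(self, internal: Lsid) -> Optional[Lsid]:
        return self._external.get(internal)
```

The design choice worth noting is that `assign` never replaces an LSID once one has been given to an object, which is what "persistent and consistent" would require. How an object key is derived (content hash, database id, something else) is left open here.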
User requirements from Hannah
- not interested in intermediate values, but just wants easy access to results produced by enactments
- if a workflow breaks, won't examine intermediate values but will find an expert to do this
- generates large volumes of data (e.g. 40 MB per workflow enactment, excluding intermediate values)
- wants to use results from previous workflow enactments as inputs to other workflow enactments
User requirements from Andy Brass
- has a number of long-running workflows - starts them at the start of the day, and wants data to be written to storage for later examination
- wants both intermediate values and results to be saved to storage so that he can work out how results were produced
- wants to be able to reconstruct something like the Taverna enactorinvocation panel from stored data
Current Taverna interface support for local file storage
- workflows loaded/saved from local filesystem
- workflow results can be saved to the file system in an XML document, which can be loaded into a separate dataviewer app for later browsing
- workflow results can be saved to the file system in one file per data item - useful for further processing in other applications
- data can be loaded from separate files for use as input data to an enactment
Problems with Taverna storage support
- no support for saving intermediate values
- if results are saved to an XML document, it is difficult to use an item of data stored there as input to another workflow enactment
- user must rely on file system structures / tools to search for workflow / results
The advantages of providing remote storage services
- data held in remote storage can be backed up more easily than data on local disk
- remote storage resources might be faster/larger/more reliable
- we can impose a constrained interface on remote storage resources - not possible with a file system, where the user can mess up any structure we might create
- constrained interface can support easy, transparent querying of data
The disadvantages of providing remote storage services
- users are familiar with local storage technology and manipulating files (for example through standard file browsers)
- users may not be familiar with issues of remote storage such as temporary unavailability of storage due to network problems
- we need good GUI support to help users use remote storage services
- unless significant advantages are provided by remote storage services, unlikely to be worthwhile
Current system
- Information Model (IM) defines structure of data that needs to be passed around
- MIR provides storage for IM data types
- MIR browser displays contents of the MIR in Taverna
- MIR plugin intercepts events from enactor and saves enactment data to MIR. If current workflow has not been run before, it also saves that.
- MIR browser specifies context for plugin (which user is currently operating, which experiment design they are working with)
- many man-years of development...
Taverna integration problems
- MIR browser is not a real application, just a dump of MIR contents. Finding any item of data using the browser involves searching through deeply-nested and confusing tree structures.
- in particular, it is not easy to find workflows or enactment data that have been stored in the MIR
- workflows are stored to the MIR only as a side-effect of being run. We need an explicit workflow storage operation.
- the plugin is not integrated into the interface - any exceptions (e.g. indicating that the MIR is unavailable) are just dumped as a stack trace to the command line, and there is no obvious way to turn the plugin off
- plugin is still buggy
- if results have previously been stored to the local file-system, there is no way to integrate them into the MIR
- no support for downloading stored results into databrowser for further manipulation
- not easy to use stored data values as input to a workflow enactment - cut and paste from the MIR browser is the only option (not good for large data values)
Info model problems
- info model is monolithic - attempts to model organizations, people, projects, experiments, workflows, enactment data (including results and intermediate values)
- because it is so complex, not enough attention has been paid to getting the details right. So, for example, enactment data from nested workflows can't be represented in information model (and nested workflows are used regularly by the taverna community)
- labbookview is a basic server side query mechanism, but is far too difficult to use (so never is!)
- info model complexity is one reason why the MIR browser is so complicated and difficult to navigate through
- info model complexity makes storage implementation in MIR difficult
- the hierarchical workflow storage structure imposed by the information model - programme, study, nested studies, experiment design - is no better than the hierarchical structure provided by the file system.
- in particular, workflows are only identified by a name, an LSID and a location in the hierarchy - so providing advanced tools to search the MIR for workflows is difficult
- since the info model (and particularly the sections which attempt to model workflows and enactment data) is separated from Taverna, there will always be maintenance problems when Taverna's idea of what a workflow is changes
MIR problems
- MIR is monolithic - attempts to implement data transfer, data storage and a security mechanism
- it is working reasonably now, but not as well as it could
- data transmission has to take place through SOAP, and this is inefficient for large blocks of data
- info model is so complex that Hibernate has been used - but this has caused loads of bugs
- if we're ever going to do security, we have to add it on ourselves
- no natural query mechanism - queries have to be written in SQL and are executed through OGSA-DAI WSI. External software therefore has to know about the structure of the MIR database. This means that external code will be difficult to maintain. It also imposes a database-centric view on MIR programming. We shouldn't even need to know that a database is the implementation mechanism.
- client-side API is poor - e.g. if we have used it to download an ExperimentDesign object, it is not possible to call a simple method to get all Operation objects.
- lack of transaction control will cause problems with multiple users
Solutions
- maintenance problem can be ameliorated by storing workflows, enactment inputs, enactment outputs and intermediate values as whole files rather than fracturing them into objects - though Taverna will need to be modified to output a file of intermediate results
- if we're not going to fracture into objects, and we're ditching organizational and experimental modelling entities from the info model, then we no longer need an implementation using a relational database, as there won't be that many relations between objects
- as an alternative, we can use SRB to store these files, as we can then take advantage of SRB's fast file transfer mechanism
- SRB also gives us users/roles/groups and a security mechanism (username/password or GSI)
- attach metadata to these files to make it easier to search for them
- provide interface support in Taverna to allow searches to be constructed easily (see Feta as an example - it does this well)
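A minimal sketch of the "whole files plus attached metadata" idea follows. It uses an in-memory dictionary as a stand-in for the remote store and does not use or imitate the SRB client API. The class `MetadataFileStore`, its methods, the file paths and the metadata keys (`kind`, `owner`) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class MetadataFileStore:
    """In-memory stand-in for a remote file store; NOT the SRB API."""

    def __init__(self) -> None:
        # path -> (file content, metadata attached to that file)
        self._files: Dict[str, Tuple[bytes, Dict[str, str]]] = {}

    def put(self, path: str, content: bytes, **metadata: str) -> None:
        # The file is stored whole; nothing is fractured into objects.
        self._files[path] = (content, dict(metadata))

    def get(self, path: str) -> bytes:
        return self._files[path][0]

    def search(self, **criteria: str) -> List[str]:
        # A file matches when every requested key has the requested value.
        return sorted(
            path
            for path, (_, meta) in self._files.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


store = MetadataFileStore()
store.put("workflows/blast.xml", b"<workflow/>", kind="workflow", owner="hannah")
store.put("results/run1.xml", b"<results/>", kind="results", owner="hannah")
store.put("workflows/align.xml", b"<workflow/>", kind="workflow", owner="andy")

hannah_workflows = store.search(kind="workflow", owner="hannah")
```

Because the search only looks at the attached metadata, client code never needs to know how the files are held underneath, which is the property the notes above ask for.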
Software development strategy
- Several stages, each building on the previous stage, and with each stage aiming to build something useful
- stage 1 - workflow storage functionality. Provide functionality in Taverna to store workflows to SRB and to search for workflows that have been stored in SRB.
- stage 2 - results storage functionality. Add functionality to save results from a workflow enactment to SRB, to search for results, and to load results back into dataviewer.
- stage 3 - intermediate storage functionality. Add functionality to store intermediate results, and to reconstruct enactor dialog view.
- THEN abstract out information model so that different types of remote store can be brought in. Hopefully, info model will then be a useful abstraction, rather than an imposed monolithic solution
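One way the staged plan and the final abstraction step could fit together is sketched below. Everything here is hypothetical: `RemoteStore`, `InMemoryStore`, `EnactmentArchive` and the path layout are invented for the example. An SRB-backed store would be another subclass of `RemoteStore`, which is not attempted here.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class RemoteStore(ABC):
    """Hypothetical minimal interface that any remote store would implement."""

    @abstractmethod
    def write(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def list_paths(self, prefix: str) -> List[str]: ...


class InMemoryStore(RemoteStore):
    """Test double; a real backend would talk to a remote service instead."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def write(self, path: str, content: bytes) -> None:
        self._data[path] = content

    def read(self, path: str) -> bytes:
        return self._data[path]

    def list_paths(self, prefix: str) -> List[str]:
        return sorted(p for p in self._data if p.startswith(prefix))


class EnactmentArchive:
    """Client-side code that depends only on the RemoteStore interface."""

    def __init__(self, store: RemoteStore) -> None:
        self._store = store

    def save_workflow(self, name: str, definition: bytes) -> None:
        # Stage 1: explicit workflow storage.
        self._store.write(f"workflows/{name}", definition)

    def find_workflows(self) -> List[str]:
        prefix = "workflows/"
        return [p[len(prefix):] for p in self._store.list_paths(prefix)]

    def save_results(self, enactment_id: str, results: Dict[str, bytes]) -> None:
        # Stage 2: one stored file per output port.
        for port, value in results.items():
            self._store.write(f"enactments/{enactment_id}/results/{port}", value)

    def load_results(self, enactment_id: str) -> Dict[str, bytes]:
        return self._load(f"enactments/{enactment_id}/results/")

    def save_intermediate(self, enactment_id: str, processor: str,
                          port: str, value: bytes) -> None:
        # Stage 3: intermediate values, keyed by processor and port.
        path = f"enactments/{enactment_id}/intermediate/{processor}/{port}"
        self._store.write(path, value)

    def load_intermediates(self, enactment_id: str) -> Dict[str, bytes]:
        return self._load(f"enactments/{enactment_id}/intermediate/")

    def _load(self, prefix: str) -> Dict[str, bytes]:
        paths = self._store.list_paths(prefix)
        return {p[len(prefix):]: self._store.read(p) for p in paths}
```

Each stage adds methods to `EnactmentArchive` without touching the store interface, so the abstraction in the last step falls out of the earlier stages rather than being imposed up front.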