4.4. R-scripts with the RShell processor

4.4.1. Introduction and installation

R is a popular scripting language oriented towards statistical computing and, with the addition of the BioConductor packages, is well suited to biological data analysis. Taverna comes with support for executing R scripts as part of a workflow. The functionality is very similar to that of the Beanshell processor, so this section only covers what is specific to the RShell processor.

First of all, R and any required R packages, such as BioConductor, must be installed locally on the machines that will execute the workflow. Installing R is outside the scope of this manual; refer to the R FAQ for instructions. Once R is installed, you can start it either from the command line with the command R or by using the appropriate application shortcut. You should get a shell that looks somewhat like this:

: stain@mira ~;R

R version 2.4.1 (2006-12-18)
Copyright (C) 2006 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 
> sin(pi)
[1] 1.224606e-16

If this is working, it should be quite easy to install required R packages such as BioConductor:

> source("http://bioconductor.org/biocLite.R")
> 
> biocLite()
Running biocinstall version 1.9.9 with R version 2.4.1 
Your version of R requires version 1.9 of Bioconductor.
Will install the following packages:
 [1] "affy"        "affydata"    "affyPLM"     "annaffy"     "annotate"   
 [6] "Biobase"     "Biostrings"  "DynDoc"      "gcrma"       "genefilter" 
[11] "geneplotter" "hgu95av2"    "limma"       "marray"      "matchprobes"
[16] "multtest"    "ROC"         "vsn"         "xtable"     
Please wait...
(..)
The downloaded packages are in
        /tmp/RtmpNXlF02/downloaded_packages
> 
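
Because every machine that executes the workflow needs the same packages, it can help to check what is actually installed before running anything. The following is a minimal sketch, assuming the workflow's scripts need Biobase and limma (a hypothetical selection from the list above; adjust the vector to your own scripts):

```r
# Packages this workflow's scripts are assumed to need (hypothetical list).
required <- c("Biobase", "limma")

# Names of all packages visible in the current library paths.
available <- rownames(installed.packages())

# Anything listed here still needs to be installed on this machine.
not_installed <- setdiff(required, available)

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Running this in the same R installation that RServe will use gives a quick indication of whether a script is likely to fail on a missing library.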

Taverna communicates with the local R installation using the RServe protocol. RServe is a network-based service that allows you to submit a script to be run within an R environment. In our setup this means that the R script is executed by the RServe server process, not by the Taverna workbench. The service can be configured to admit different network users identified by passwords, but since those users would be able to execute essentially any code on that machine, for security reasons we recommend that you stick with the default, which is to listen only on localhost without requiring a password. If your machine has multiple users, we recommend that you enable usernames and passwords to make sure only you can access the RServe service.

Follow the installation instructions for RServe for information on how to install and start the RServe service. Here's the short version for version 0.4-3:

: stain@mira ~/Desktop;curl -fO http://rosuda.org/Rserve/dist/Rserve_0.4-3.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 86336  100 86336    0     0   239k      0 --:--:-- --:--:-- --:--:--  465k

: stain@mira ~/Desktop;R CMD INSTALL Rserve_0.4-3.tar.gz
* Installing *source* package 'Rserve' ...
checking for gcc... gcc-4.0 -arch i386
(..)
** building package indices ...
* DONE (Rserve)

: stain@mira ~/Desktop;cd /tmp
: stain@mira /tmp;sudo -u nobody R CMD Rserve
R version 2.4.1 (2006-12-18)
(..)
Rserv started in daemon mode.

: stain@mira /tmp;

Note that you will have to execute R CMD Rserve again to restart the service if you reboot the computer. For security reasons, we recommend that you run RServe under a separate, non-privileged user account on your machine, so that if there is a security problem, the R script won't be able to access your files and can be easily isolated.

The RServe documentation describes several RServe clients. Note that the RShell processor is based on the JRclient library.

4.4.1.1. Installing on Windows

Up to R-2.4.0

Download the corresponding binary Rserve.exe from http://rosuda.org/Rserve/dist/rserve-win.html into the directory where R.dll is located (by default C:\Program Files\R\R-2.4.0\bin).

Start Rserve by double-clicking Rserve.exe in that bin directory.

You can now use the R processor in Taverna.

R-2.4.1 and later versions

  • Install Rserve: From the R menu, select Packages → Install package(s) → (select a mirror) → Rserve.

  • Load Rserve: From the R menu, select Packages → Load package → Rserve.

  • Start Rserve: At the R prompt, type: Rserve(port = 6311)

You can now use the R processor in Taverna.

Note that you will need to repeat steps 2 and 3 (load and start) each time you start Rserve.

4.4.2. Using the RShell processor

To add an RShell processor to a workflow, locate RShell under Local Services in the service scavenger panel. Either drag the processor to the Advanced model explorer, or right-click and select Add to model.

Right-click on the processor and select Configure RShell to bring up the RShell configure dialogue.

The first tab of the dialogue lets you type in the script, much like the editor for Beanshell processors. You can also open an existing script from a file. For this example we'll use a rather trivial sine function.

Just as Beanshell inputs and outputs are accessed through variable names, the RShell processor makes input ports available as variables named after the port, and each output port reads the variable of the same name after the script has executed. That is, the last value assigned to the variable is the one returned from the processor. So for this script to make sense we have to create an input port x and an output port y. Flip to the Input ports tab and click Create input port, specifying the port name x. Next, specify the type this variable will have within the R script. Although Taverna normally operates by passing around text strings, R is a typed language, so you need to specify that in this case x is to be parsed as a double, for example 0.45.

Create the output port y in the same way in the Output ports tab, and remember to set its type to double as well. You should now be able to build the workflow, connect the ports, and run it with an example input of 0.5, which should give you the output 0.479425538604203.
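
The script for this example is not reproduced in the text. A minimal sketch that is consistent with the ports x and y and with the expected output could look like the following. The assignment to x is only there so that the sketch runs on its own in a plain R session; inside the RShell processor x would be supplied by the input port and that line would be left out.

```r
# Stand-in for the value delivered by input port x (omit inside Taverna).
x <- 0.5

# The output port y returns whatever was last assigned to the variable y.
y <- sin(x)

# Show enough digits to compare with the value quoted in the text.
print(y, digits = 15)
```

Trying the script body in an interactive R session first, with hand-set inputs like this, is an easy way to separate script errors from workflow or connection problems.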

4.4.3. Connection and advanced port types

If you configured RServe to use a different port, or to require a username and password, flip to the Connection settings tab to set these connection parameters. In addition, you can tick Keep session alive, which re-uses the same connection each time the script is executed. This means that if the script assigns objects to other variable names, say z=x+1337, z will still be available in the R namespace for the next execution, for example in an iteration. However, we generally recommend transferring such state through the workflow instead of keeping it in the R environment.
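
To illustrate what a kept-alive session implies, the sketch below simulates two executions of the same script against one shared environment. It runs in plain R without Taverna: the session environment and the run_once helper are hypothetical stand-ins for the R namespace that RServe would keep between executions, and the script body is an invented running total rather than anything shipped with Taverna.

```r
# Hypothetical script body; x is the input port, y the output port.
script <- quote({
  if (!exists("z", inherits = FALSE)) z <- 0  # first run in this session
  z <- z + x                                  # z survives between runs
  y <- z
})

# Stand-in for one kept-alive R session.
session <- new.env()

# Simulates one execution of the processor with the given input value.
run_once <- function(input) {
  assign("x", input, envir = session)
  eval(script, envir = session)
  get("y", envir = session)
}

first  <- run_once(2)   # 2
second <- run_once(3)   # 5, because z was kept from the first run
```

With a fresh session for each execution, both runs would simply return their own input, which is why hidden state of this kind can make a workflow harder to reason about.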

The input and output port type R-expression can be used to link several R processors together without regard to the internal data type. This is useful when passing complex R objects from one R script to another. However, because the whole object is serialised, it is not recommended for very large structures; in those situations it might be better to use the Keep session alive option and share a global variable.
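
One rough way to judge whether an object is "very large" is to serialise it in plain R and look at the size. The sketch below uses R's built-in serialize function and the cars example dataset; this is an assumption made for illustration, and the encoding that RShell and RServe actually use on the wire may differ, so treat the byte count as an estimate only.

```r
# A moderately complex R object: a fitted linear model on a built-in dataset.
fit <- lm(dist ~ speed, data = cars)

# Serialise to a raw vector in memory instead of writing to a connection.
payload <- serialize(fit, connection = NULL)

# Approximate number of bytes needed to pass the object along.
size_bytes <- length(payload)

# Round trip: the restored object carries the same coefficients.
restored <- unserialize(payload)
same_coef <- all.equal(coef(fit), coef(restored))
```

If size_bytes runs into many megabytes, sharing the object through a kept-alive session is likely to be the cheaper option.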

If you select one of the array datatypes, such as double[], integer[] or string[], the processor input port will consume a full list of values of the specified type, which is useful if the R script is to do array indexing or statistical analysis on a vector of items. Similarly, an array output port can be used if you want to return more than one value. The port types are:

boolean

true or false (1 and 0 also allowed)

double

a floating point number

integer

a whole number

R-expression

R-expression to pass between RShell processors

string

string value

double[]

a list of doubles

integer[]

a list of integers

string[]

a list of strings

PNG-image

an image created by the plotting device (for outputs), see section Graph output below

Text-file

(unknown)
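
As an illustration of how scalar and array port types might be combined in one script, here is a minimal sketch. The port names values, average, centered and count are hypothetical; the first line stands in for a double[] input port so that the sketch runs in a plain R session.

```r
# Stand-in for a double[] input port named values (omit inside Taverna).
values <- c(1.5, 2.5, 4.0)

# double output port: a single number.
average <- mean(values)

# double[] output port: one value per input element.
centered <- values - average

# integer output port: how many values were received.
count <- length(values)
```

In R every one of these is a vector; the port type only tells the processor whether to expect one element or a whole list.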

4.4.4. Graph output

In the interactive R environment you might be used to creating fancy graphs. You can create graphs in R through Taverna as well, but instead of popping up directly on your screen, they have to be returned to the workflow as image data. The graphs can then be viewed as part of the workflow output. Make a new output port called g and set its type to PNG-image. In your R script, use png(g) to enable PNG output to a variable called g, and dev.off(); when you are finished plotting. Example:

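# g is the variable for the output port declared with type PNG-image
png(g);
# plot 100 random values drawn from a standard normal distribution
plot(rnorm(100));
# close the PNG device when plotting is finished
dev.off();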