
Key: TAV-709
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: David Withers
Reporter: Stuart Owen
Votes: 0
Watchers: 0
myGrid

T2 Enactment Error with Paul's workflow

Created: 2008-01-23 14:59   Updated: 2008-01-31 13:27
Component/s: None
Affects Version/s: 1.7
Fix Version/s: 1.7.1

Time Tracking:
Not Specified


Description
A workflow of Paul's attached to TAV-706 (phenotype_to_pubmed.xml, which takes the input "african trypanosomiasis AND mouse") fails to run in T2, although its processor types and construction indicate that it should. The workflow "sticks" when the queuesize of the nested workflow is 89. I've found this is consistent when first running the workflow, but not necessarily when clicking Reset and re-running.

It is very difficult to determine what the problem is because of the lack of decent error reporting and monitoring.



David Withers - 2008-01-24 12:57
This seems to be a problem with the monitor. I'm getting lots of stack traces similar to the one below.
Exception in thread "net.sf.taverna.t2.workflowmodel.processor.dispatch.events.DispatchJobEvent@3f0f86" 
  java.lang.IllegalStateException: Timer already cancelled. 
  at java.util.Timer.sched(Timer.java:354) 
  at java.util.Timer.schedule(Timer.java:170) 
  at net.sf.taverna.t2.monitor.impl.MonitorImpl.deregisterNode(MonitorImpl.java:136) 
  at net.sf.taverna.t2.workflowmodel.processor.dispatch.layers.Invoke$1.receiveResult(Invoke.java:196) 
  at net.sf.taverna.t2.activities.wsdl.WSDLActivity$1.run(WSDLActivity.java:142) 
  at java.lang.Thread.run(Thread.java:613)

If I turn the monitoring off, this workflow completes and I get the same results as in Taverna 1.


David Withers - 2008-01-24 16:57
The sequence of events that causes this is:
  1. The invoke layer calls MonitorImpl.deregisterNode() which schedules nodeRemovalTimer to call monitorTree.removeNodeFromParent(nodeToRemove).
  2. monitorTree.removeNodeFromParent() throws IllegalArgumentException: node does not have a parent; this kills the timer thread.
  3. The next call to MonitorImpl.deregisterNode() results in nodeRemovalTimer.schedule() throwing IllegalStateException: Timer already cancelled.
  4. This exception propagates back to the invoke layer so the activity invocation doesn't happen.
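Steps 2 and 3 can be reproduced with java.util.Timer alone. The following is a minimal sketch, not the MonitorImpl code: the class name TimerKillDemo and the method reproduce() are made up for illustration, and the first task simply throws the same exception type that removeNodeFromParent() is reported to throw.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it only imitates the failure sequence described above.
class TimerKillDemo {

    /** Returns the message of the exception thrown by the second schedule() call, or null if none. */
    static String reproduce() {
        Timer timer = new Timer("nodeRemovalTimer", true);
        final AtomicReference<Thread> timerThread = new AtomicReference<>();
        final CountDownLatch started = new CountDownLatch(1);

        // Step 2: the scheduled task throws, which terminates the timer thread.
        timer.schedule(new TimerTask() {
            @Override public void run() {
                timerThread.set(Thread.currentThread());
                started.countDown();
                throw new IllegalArgumentException("node does not have a parent");
            }
        }, 0);

        try {
            started.await();          // wait until the task has begun
            timerThread.get().join(); // wait until the timer thread has died
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }

        // Step 3: any later schedule() call on the same Timer is rejected.
        try {
            timer.schedule(new TimerTask() { @Override public void run() { } }, 0);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
    }
}
```

The uncaught IllegalArgumentException is also printed to stderr by the default handler, which is the only visible sign that the timer thread has gone.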

There are several problems here:

  1. nodeToRemove doesn't have a parent. Not too sure why but I think it's a timing issue: the parent gets removed before the child? The parent node always seems to be DataflowActivity.
  2. The scheduled TimerTask shouldn't allow an exception to kill the timer thread.
  3. Calls to MonitorImpl.deregisterNode() shouldn't allow monitoring exceptions to stop the activity invocation.
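For problem 2, one possible approach is to wrap the body of every scheduled task so that nothing it throws reaches the timer thread. This is a sketch under assumptions: guarded(), GuardedTimerDemo and the failure counter are hypothetical names, not part of the Taverna code base, and catching RuntimeException rather than Throwable is a design choice that could reasonably go either way.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class showing a TimerTask that cannot kill its Timer.
class GuardedTimerDemo {

    /** Wraps a Runnable so that runtime exceptions are counted instead of escaping. */
    static TimerTask guarded(final Runnable body, final AtomicInteger failures) {
        return new TimerTask() {
            @Override public void run() {
                try {
                    body.run();
                } catch (RuntimeException e) {
                    // A real monitor would log this; here we only count it.
                    failures.incrementAndGet();
                }
            }
        };
    }

    /** Returns true if a task scheduled after a failing task still runs. */
    static boolean timerSurvivesFailure() {
        Timer timer = new Timer("nodeRemovalTimer", true);
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch secondRan = new CountDownLatch(1);
        try {
            // First task fails in the same way as the parentless node removal.
            timer.schedule(guarded(() -> {
                throw new IllegalArgumentException("node does not have a parent");
            }, failures), 0);
            // Second task is scheduled slightly later and must still execute.
            timer.schedule(guarded(secondRan::countDown, failures), 50);
            return secondRan.await(5, TimeUnit.SECONDS) && failures.get() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println(timerSurvivesFailure());
    }
}
```

java.util.concurrent.ScheduledThreadPoolExecutor is an alternative to Timer whose worker threads are not lost when a one-shot task throws, although the failure then has to be read from the returned future.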

I think a solution would be to separate the monitoring and invocation code; perhaps by adding a monitor layer before the invoke layer in the dispatch stack.
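The separation could look roughly like the decorator below. It is only an illustration of the idea, not the dispatch stack API: Monitor, ResultCallback and monitored() are hypothetical stand-ins, and a real monitor layer would sit in the dispatch stack rather than wrap a callback.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the interfaces below are NOT the Taverna interfaces.
class MonitorIsolationDemo {

    /** Stand-in for the monitor. */
    interface Monitor {
        void deregisterNode(String nodeId);
    }

    /** Stand-in for the callback that receives an activity's result. */
    interface ResultCallback {
        void receiveResult(String result);
    }

    /** Decorates a callback so monitor failures are recorded but never block the result. */
    static ResultCallback monitored(Monitor monitor, String nodeId,
                                    ResultCallback next, List<String> log) {
        return result -> {
            try {
                monitor.deregisterNode(nodeId);
            } catch (RuntimeException e) {
                log.add("monitor error: " + e.getMessage());
            }
            // Result delivery happens regardless of what the monitor did.
            next.receiveResult(result);
        };
    }

    /** Runs one invocation against a monitor that always fails and returns the log. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Monitor broken = nodeId -> {
            throw new IllegalStateException("Timer already cancelled.");
        };
        ResultCallback callback =
                monitored(broken, "node-1", r -> log.add("result: " + r), log);
        callback.receiveResult("ok");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With this shape, a broken monitor costs a log entry instead of a lost invocation result.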


David Withers - 2008-01-31 13:27
The root of this bug is child nodes being removed from the monitor tree after their parents have already been removed. I've checked in changes to DataflowActivity and WorkflowInstanceFacadeImpl to fix this.