QSAR WORLD

Workflow and Pipelining in Cheminformatics

Differences between pipelining and workflow

Data pipelining is a specific form of workflow. In InforSense’s workflow methodology, task 1 is completed on the whole data set, then the data are handed off to task 2, which is completed before the data are handed on to task 3, and so on. In pipelining, task 1 is completed on compound 1 and the result is passed to task 2; task 1 can then start on the next compound. The whole data set is never passed on at one time, and there might be only a few records in memory at any moment. The process can therefore scale without impact on memory, and efficiency is gained if a downstream operation can start on some records while an upstream operation is still working on others. With a workflow method, on the other hand, it is easier to "cache" the data (save them for security reasons or for future use) at the end of each task.
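
The difference in execution order described above can be sketched in a few lines of Python. This is only an illustration, not the way InforSense or Pipeline Pilot is implemented: the record structure, the two tasks (add_weight, flag_heavy) and the traced helper are all hypothetical names invented for the example.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict  # one compound and its data (hypothetical structure)
Task = Callable[[Record], Record]


def add_weight(rec: Record) -> Record:
    # Hypothetical task 1: attach a placeholder "weight" (here, the name length).
    return {**rec, "weight": len(rec["name"])}


def flag_heavy(rec: Record) -> Record:
    # Hypothetical task 2: depends on the output of task 1.
    return {**rec, "heavy": rec["weight"] > 6}


def traced(label: str, fn: Task, log: List[str]) -> Task:
    """Wrap a task so that every call is recorded in `log`."""
    def task(rec: Record) -> Record:
        log.append(f"{label}:{rec['name']}")
        return fn(rec)
    return task


def run_workflow(records: Iterable[Record], tasks: List[Task]) -> List[Record]:
    """Table-by-table: each task finishes on all records before the next starts."""
    table = list(records)
    for task in tasks:
        # The complete intermediate table exists here and could be cached.
        table = [task(rec) for rec in table]
    return table


def run_pipeline(records: Iterable[Record], tasks: List[Task]) -> Iterator[Record]:
    """Record-by-record: each record passes through every task in turn."""
    stream: Iterator[Record] = iter(records)
    for task in tasks:
        stream = map(task, stream)  # lazy: nothing runs until the stream is consumed
    return stream


compounds = [{"name": "aspirin"}, {"name": "urea"}]
log: List[str] = []
tasks = [traced("t1", add_weight, log), traced("t2", flag_heavy, log)]

workflow_result = run_workflow(compounds, tasks)
workflow_order = list(log)

log.clear()
pipeline_result = list(run_pipeline(compounds, tasks))
pipeline_order = list(log)
```

Both functions return the same records. Only the order of the calls differs: the workflow version runs task 1 on every compound before task 2 starts, while the pipeline version sends each compound through both tasks before touching the next one, so only one record needs to be held at a time.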

Like InforSense, KNIME adopts a table-by-table approach rather than a row-by-row approach (where a row represents one compound and its data, in a table of compounds), so KNIME too needs, in most cases, to store the result of each node in a file, whereas Pipeline Pilot behaves like a real pipeline. KNIME does not have to write to disk, however: the default is to write to memory, which makes node execution faster. In terms of reproducing experiments and wrapping up workflows with their data, caching is a plus from a legal standpoint and could appeal to chemistry management. From the technical viewpoint, there are pros and cons for both methods: if you are clustering and generating a pairwise distance matrix, you need all the rows at once; if you are calling an external program, you probably do not want to call it once per record.
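
The two technical cases mentioned above can be illustrated with a minimal Python sketch. It assumes Euclidean distance on plain numeric rows, which the text does not specify, and the helper names pairwise_distances and batched are invented for the example. The first function cannot start until it has the whole table; the second groups a stream of records into batches so that an external program could be called once per batch rather than once per record.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def pairwise_distances(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Needs the whole table: every row is compared with every other row.

    Euclidean distance is an assumption made for this sketch.
    """
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5
            dist[i][j] = dist[j][i] = d
    return dist


def batched(stream: Iterable, size: int) -> Iterator[list]:
    """Group a record stream into lists of at most `size` records."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


matrix = pairwise_distances([[0.0, 0.0], [3.0, 4.0]])
batches = list(batched(range(5), 2))
```

Batching is a middle course between the two approaches: memory use stays bounded by the batch size, while the number of external calls falls from one per record to one per batch.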

The developers of KNIME say that table-by-table processing offers substantial benefits: multiple iterations over the same data, which is rather important for many data mining algorithms; the ability to view intermediate results on the connections between nodes at any time, even after the workflow has been executed; and the ability to restart the workflow at any intermediate node if the user, for example, changes some settings in the middle. The penalty is the need to store the data somewhere. KNIME tries to be smart about this and stores only the differences between consecutive nodes, but it ultimately stores the data on disk. SciTegic has pointed out that in data pipelining, a cache (of all the data) can be added to the components as a "finish here and resume" component.
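
The idea of keeping each node's output so that a workflow can be inspected and restarted part-way through can be sketched as follows. This is a simplified illustration in Python, not KNIME's actual storage mechanism: the CachedWorkflow class is hypothetical, it keeps complete tables in memory rather than storing differences on disk, and the two example nodes are invented.

```python
from typing import Callable, Dict, List

Table = List[dict]
Node = Callable[[Table], Table]


class CachedWorkflow:
    """Run nodes table by table, keeping every node's output (hypothetical)."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self.cache: Dict[int, Table] = {}  # node index -> that node's output table

    def run(self, source: Table, start: int = 0) -> Table:
        """Execute nodes from index `start`; upstream output comes from the cache."""
        if start == 0:
            table = source
        else:
            # Raises KeyError if the upstream node has never been executed.
            table = self.cache[start - 1]
        for i in range(start, len(self.nodes)):
            table = self.nodes[i](table)
            self.cache[i] = table  # intermediate result stays viewable after the run
        return table


calls: List[int] = []


def add_one(table: Table) -> Table:
    calls.append(0)
    return [{"x": row["x"] + 1} for row in table]


def times_ten(table: Table) -> Table:
    calls.append(1)
    return [{"x": row["x"] * 10} for row in table]


wf = CachedWorkflow([add_one, times_ten])
first = wf.run([{"x": 1}])

# Change the settings of the second node, then restart from it only.
wf.nodes[1] = lambda table: [{"x": row["x"] * 100} for row in table]
second = wf.run([], start=1)
```

After the restart, the first node has not been executed again: the second node reads its input from the stored output of the first. The cost is the one the text names, namely that every intermediate table has to be kept somewhere.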

This is not the place to delve deeper into computer science issues. The tabular data structure approach in KNIME was singled out for mention because of the current commercial interest in KNIME. The reader who wishes to go more deeply into workflow technologies should also look into the Collection-Oriented Modeling and Design (COMAD) paradigm in Kepler, and into Taverna’s rich iteration semantics which allow some complex operations to be expressed very simply.


Facilitated by Strand Life Sciences Pvt. Ltd