Workflow and Pipelining in Cheminformatics
Wendy Warr, editorial advisor of QSARWorld, writes on the workflow paradigm as a mechanism to integrate different data resources, software, algorithms, Web services, and more.
The workflow paradigm is a generic mechanism to integrate different data resources, software applications and algorithms, Web services and shared expertise. Such technologies allow a form of integration and data analysis that is not limited by the restrictive tables of a conventional database system. They enable scientists to construct their own research data processing networks (sometimes called "protocols") for scientific analytics and decision making by connecting various information resources and software applications together in an intuitive manner, without any programming. As a result, software from a variety of vendors can be assembled into the workflow best suited to the end user. These systems are, purportedly, easy to use for controlling the flow and analysis of data. In practice, certainly in chemical analyses, they are not generally used by novices: the best use of such a system is to have a computational chemist set up the steps in a protocol and then "publish" it on the Web for use by other scientists.
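To make the idea of a protocol concrete, the following is a minimal Python sketch of steps chained so that records flow from one to the next. It does not represent the interface of any product discussed here; the names `read_records`, `add_property`, `keep_if`, and `run_protocol`, and the sample compound data, are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # one data record, e.g. a compound and its properties
Component = Callable[[Iterable[Record]], Iterator[Record]]


def read_records(rows: Iterable[Record]) -> Iterator[Record]:
    """Source step: yield a copy of each record from an in-memory list."""
    for row in rows:
        yield dict(row)


def add_property(name: str, func: Callable[[Record], object]) -> Component:
    """Build a step that computes a new property for every record."""
    def component(records: Iterable[Record]) -> Iterator[Record]:
        for rec in records:
            rec = dict(rec)          # copy so the input record is not changed
            rec[name] = func(rec)
            yield rec
    return component


def keep_if(predicate: Callable[[Record], bool]) -> Component:
    """Build a step that passes on only the records matching a condition."""
    def component(records: Iterable[Record]) -> Iterator[Record]:
        return (rec for rec in records if predicate(rec))
    return component


def run_protocol(source: Iterable[Record], components: list) -> list:
    """Connect the steps in order and collect what comes out of the last one."""
    stream = source
    for component in components:
        stream = component(stream)
    return list(stream)


# Hypothetical sample data, invented for this example.
compounds = [
    {"name": "A", "mol_weight": 180.2, "logp": 1.2},
    {"name": "B", "mol_weight": 612.7, "logp": 5.9},
    {"name": "C", "mol_weight": 342.4, "logp": 3.1},
]

# The "protocol": an ordered list of configured steps.
protocol = [
    add_property("heavy", lambda r: r["mol_weight"] > 500),
    keep_if(lambda r: not r["heavy"] and r["logp"] <= 5),
]

result = run_protocol(read_records(compounds), protocol)
print([r["name"] for r in result])
```

In a graphical workflow tool the list of steps would be assembled by drawing connections rather than by writing code, but the underlying structure, an ordered set of configured steps that a specialist defines once and others reuse, is the same.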
Until recently, in the cheminformatics field, only two solutions were in common use for capturing and executing such multi-step procedures, processing entire data sets in real time through "pipelines" or "workflows". The two technologies, both of them commercial solutions, are from InforSense,1 which uses a workflow paradigm in its InforSense platform, and SciTegic2 (now part of Accelrys) which uses data pipelining in Pipeline Pilot. New entrants to the market now open up many more options.
Pipeline Pilot
In Pipeline Pilot, users can graphically compose protocols, using hundreds of different configurable components for operations such as data retrieval, manipulation, computational filtering, and display. There are three options for the interface: the Professional Client, the Lite Client, and the Web Port Client. SciTegic offers collections of components covering chemistry, ADME/Tox, chemically-intelligent text mining, decision trees, gene expression, materials, modeling, R statistics, reporting, imaging, sequence analysis, and text analytics, as well as the software packages Catalyst and CHARMm.
The Integration Collection provides mechanisms to link external applications and databases into a Pipeline Pilot data processing protocol. After a tool is integrated and added to the library as a new component, end-users can employ it as they would any other Pipeline Pilot component, regardless of where the external application resides or how the integration works behind the scenes. For controlling Pipeline Pilot protocol execution from proprietary or third-party applications, SciTegic provides three different software development kits (SDKs): a Java SDK, a .NET SDK, and a JavaScript SDK.
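The general idea of wrapping an external application so that it behaves like any other component can be sketched in Python as follows. This is not how the Integration Collection or the SciTegic SDKs work internally, which the text does not describe; it is a generic illustration that assumes the external tool reads and writes one JSON record per line. The function `external_component` and the stand-in tool are hypothetical.

```python
import json
import subprocess
import sys


def external_component(command):
    """Wrap a command-line tool as a pipeline step.

    Assumption for this sketch: the tool reads JSON records, one per line,
    on standard input and writes JSON records, one per line, on standard output.
    """
    def component(records):
        payload = "\n".join(json.dumps(rec) for rec in records)
        completed = subprocess.run(
            command, input=payload, capture_output=True, text=True, check=True
        )
        for line in completed.stdout.splitlines():
            if line.strip():
                yield json.loads(line)
    return component


# Stand-in for a third-party program: a tiny script run by the same Python
# interpreter. A real integration would call the actual executable instead.
TOOL_SOURCE = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    line = line.strip()\n"
    "    if not line:\n"
    "        continue\n"
    "    rec = json.loads(line)\n"
    "    rec['name_length'] = len(rec['name'])\n"
    "    print(json.dumps(rec))\n"
)

annotate = external_component([sys.executable, "-c", TOOL_SOURCE])

# The caller uses the wrapped tool like any other step and does not need to
# know where the program lives or how it is invoked.
annotated = list(annotate([{"name": "aspirin"}, {"name": "caffeine"}]))
print(annotated)
```

Hiding the invocation details behind a uniform call is what lets end users treat an integrated tool the same way as a built-in step.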
Spotfire3 and SciTegic have coupled Spotfire DecisionSite’s interactive visual analytics with Pipeline Pilot’s data processing protocols. Researchers can embed Pipeline Pilot computations in DecisionSite (without any scripting or programming) and deploy these throughout the enterprise. DecisionSite users can run analyses in Pipeline Pilot without leaving the DecisionSite environment. Pipeline Pilot is supported on Linux and Windows and is used by over 200 pharmaceutical, biotechnology, and chemicals companies. Applications have been reported in the cheminformatics literature.4,5 SciTegic has recently announced a free academic version of Pipeline Pilot to facilitate dissemination of scientific innovations to industry.