Workflow and Pipelining in Cheminformatics
Other open source solutions
It should be admitted at this point that this article has a cheminformatics bias. Scientists in a variety of disciplines (e.g., biology, ecology and astronomy) have made extensive use of Kepler17, but Kepler has had little impact on cheminformatics. Furthermore, open source software is commonly used in bioinformatics, whereas users of cheminformatics systems have tended to take a more "traditional" stance. Some pharmaceutical industry users are now beginning to think that it is important to lower the barriers between bio- and cheminformatics applications; initiatives such as the "druggable genome" do require closer integration5. The extensibility and flexibility of the more open solutions may offer something here that the more monolithic systems cannot.
The SOMA2 open source modeling environment18,19 has been developed at the Finnish IT Center for Science (CSC). A workflow program, Grape, and XML descriptions of scientific programs allow researchers to link molecular modeling software into workflows. SOMA2 collects the data calculated in the workflow and stores the information in Chemical Markup Language (CML)20-26 format. An extranet interface is used for user authentication, building the program interfaces and workflows, and sorting, filtering, and visualizing the results. The SOMA2 interface also uses third-party applications. For example, molecular structures are visualized using ChemAxon’s Marvin software package.
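To make the idea of storing calculated data in CML concrete, the following is a minimal sketch that writes a molecule and one computed property as CML-style XML using only the Python standard library. It is not SOMA2 code and does not reproduce SOMA2's actual output; the function name `molecule_to_cml`, the dictionary reference `example:dipoleMoment` and the sample value are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the CML schema.
CML_NS = "http://www.xml-cml.org/schema"


def molecule_to_cml(mol_id, atoms, properties):
    """Serialize a molecule and its calculated properties as CML-style XML.

    atoms: list of (atom_id, element_symbol) tuples.
    properties: dict mapping a dictionary reference (hypothetical names here)
    to a floating-point value produced by some workflow step.
    """
    # Emit CML as the default namespace instead of an auto-generated prefix.
    ET.register_namespace("", CML_NS)
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for atom_id, symbol in atoms:
        ET.SubElement(
            atom_array,
            f"{{{CML_NS}}}atom",
            {"id": atom_id, "elementType": symbol},
        )
    for dict_ref, value in properties.items():
        prop = ET.SubElement(
            molecule, f"{{{CML_NS}}}property", {"dictRef": dict_ref}
        )
        scalar = ET.SubElement(
            prop, f"{{{CML_NS}}}scalar", {"dataType": "xsd:double"}
        )
        scalar.text = repr(value)
    return ET.tostring(molecule, encoding="unicode")


# Illustrative input: water with a made-up property entry.
cml_text = molecule_to_cml(
    "m1",
    [("a1", "O"), ("a2", "H"), ("a3", "H")],
    {"example:dipoleMoment": 1.85},
)
print(cml_text)
```

Because the result is ordinary XML, any downstream step that can parse XML can sort, filter or display the stored values without knowing which program produced them.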
Another open source package is Taverna27,28, which aims to provide language and software tools to facilitate the use of workflow and distributed computing technology within the e-science community. It allows a bioscientist with a limited computing background and limited technical resources and support to construct complex analyses over public and private data and computational resources, all from a standard PC, UNIX box or Apple computer. Taverna is now funded as part of the Open Middleware Infrastructure Institute UK (OMII-UK), which has a mandate to ensure the existence of a supported and sustainable foundation upon which other projects can build. Development is hosted at the EBI and Manchester University in a model that is closer to that of a software house than an academic group, so, unlike many open source projects, Taverna has some guarantee of continued existence.
One commercial user has integrated the Schrödinger tools into Taverna, running a system on the North-West grid in the UK, and was able to do this relatively quickly because of the open nature of the platform. Integrating a cheminformatics tool is no different from integrating a bioinformatics tool. Taverna does not supply tools: it provides links for tools in an easily extensible environment (as do some other open source workflow projects). Taverna also supports internal or external sharing and reuse of workflows, for example the sharing of portals in the myExperiment project.
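The notion of a platform that links tools rather than supplying them can be illustrated with a small, generic sketch. This is not Taverna's API: the `run_workflow` function and the two example nodes are hypothetical, and the records are a hand-made list used only to show that a workflow engine needs to know nothing about what each tool does, only how to pass one tool's output to the next.

```python
from typing import Any, Callable, Iterable

# A node is any callable that takes one input and returns one output.
Node = Callable[[Any], Any]


def run_workflow(nodes: Iterable[Node], data: Any) -> Any:
    """Feed the output of each node into the next, in order."""
    for node in nodes:
        data = node(data)
    return data


def drop_heavy(records):
    """Hypothetical filter node: keep records with molecular weight <= 500."""
    return [r for r in records if r["mol_weight"] <= 500.0]


def names_only(records):
    """Hypothetical reporting node: return the compound names, sorted."""
    return sorted(r["name"] for r in records)


records = [
    {"name": "aspirin", "mol_weight": 180.16},
    {"name": "cyclosporin", "mol_weight": 1202.6},
    {"name": "caffeine", "mol_weight": 194.19},
]

result = run_workflow([drop_heavy, names_only], records)
print(result)
```

In this picture a cheminformatics tool and a bioinformatics tool are integrated in the same way: each is wrapped as a node with a defined input and output, and the engine only handles the links.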
The CDK-Taverna29 solution combines three open-source projects: Taverna as a workflow container, Christoph Steinbeck’s Chemistry Development Kit (CDK)30,31 as a basic chemo- and bioinformatics library of more than 100 components, and Bioclipse as an Eclipse-based result viewer. Potential workflows provided by the CDK-Taverna solution address data filtering, migration and transformation, information retrieval, QSAR/QSPR or pharmacophore-related tasks, data analysis (statistics, clustering, computational intelligence), analytical and spectroscopic support, and molecular modeling. The CDK-Taverna solution aims to offer the “best of both worlds”: to be as flexible and extensible as software libraries such as CDK, and as user-friendly as professional, industrial IT systems.
Rajarshi Guha at Indiana University is working on a flexible and generalizable approach32 to the deployment of predictive models, based on a Web service infrastructure using R. The infrastructure allows users to access the functionality of these models using a variety of approaches ranging from Web pages to workflow tools. This approach is lower level, and requires programming, but is more general than an approach such as KNIME. Guha feels that one of the disadvantages of KNIME is its lack of support for Web service nodes. Such nodes would make it very easy to integrate a wide range of functionality, especially in bioinformatics.
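A Web service node of the kind described above can be sketched as a thin wrapper that sends descriptor values to a remote model and returns its predictions. The sketch below is not Guha's infrastructure and does not describe any real endpoint: the URL, the JSON field names `descriptors` and `predictions`, the class `ModelServiceNode` and the stand-in `fake_transport` are all assumptions chosen for illustration. The transport is injectable so that the example runs without a network connection.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply (real network call)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class ModelServiceNode:
    """Workflow node that delegates prediction to a remote model service."""

    def __init__(self, url: str,
                 transport: Callable[[str, dict], dict] = http_post_json):
        self.url = url
        self.transport = transport

    def __call__(self, descriptors: List[Dict[str, float]]) -> List[float]:
        # The request and reply layout is hypothetical, not a published API.
        reply = self.transport(self.url, {"descriptors": descriptors})
        return reply["predictions"]


def fake_transport(url: str, payload: dict) -> dict:
    """Stand-in for the remote model: a made-up linear function of logp."""
    rows = payload["descriptors"]
    return {"predictions": [2.0 * row["logp"] for row in rows]}


# Placeholder URL; no request is sent because the fake transport is used.
node = ModelServiceNode(
    "http://example.org/models/solubility", transport=fake_transport
)
predictions = node([{"logp": 1.5}, {"logp": -0.5}])
print(predictions)
```

Because the node is an ordinary callable, the same wrapper could sit behind a Web page or be dropped into a workflow tool, which is the flexibility the Web service approach is meant to provide.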