Chemistry is an important and high-value vertical in the modern world, and the “semantification” of chemistry will be crucial for further rapid innovation not only in the discipline itself, but also in related areas such as drug discovery, medicine and materials design. This article provides a short overview of the current technological state of the art in semantic chemistry and also discusses some obstacles which have, so far, impeded the widespread uptake of semantic technologies in the domain.
Chemistry is arguably the most central of the physical sciences and at the heart of many fundamental industries: developments in chemical science very directly affect sectors such as the pharmaceutical and medical industry, the producers and processors of modern materials such as polymers and, of course, the chemical industry itself.
In modern science, it is important to realise that most of the truly exciting scientific and technological progress now happens at the interfaces between two or more scientific and technical disciplines. As such, the development of new knowledge or products can, from an informatician’s point of view, be considered an exercise in the integration of data from different scientific domains. Chemistry overlaps with almost all domains of modern science, from pharmacology, biochemistry and toxicology to genetics and materials science.
It is therefore of prime importance to develop a comprehensive semantic apparatus for the discipline, which can contribute to the “data integration” process. This short article is subdivided into three parts. The first part discusses the state of the art in semantic chemistry at the time of writing (early 2009), the second part looks at current efforts in “semantification” within the domain of chemistry, and the third part discusses some of the technical and “socio-political” obstacles semantic chemistry is facing today.
The general semantic web toolkit in common use today consists of three major components: XML dialects, RDF(S) vocabularies and OWL ontologies (Figure 1).
Let us look at each of these components in turn and how they have been applied to the field of chemistry.
In terms of markup languages, the most relevant to chemistry is Chemical Markup Language (CML), developed over the last decade by Murray-Rust, Rzepa and others.[1-7]
CML is designed to hold a large variety of chemical information, such as molecular structures (the atomic composition of a molecule and the spatial arrangement and connectivity of its atoms), materials structures (in particular polymers), spectroscopic and other analytical data, and crystallographic and computational information. An example of the CML-based representation of a molecular structure is shown in Figure 2.
The CML document describes an entity of type <molecule>. The <molecule> is a data container for two further data containers called <atomArray> and <bondArray>. The <atomArray> element contains a list of all the atoms present in the molecule, together with IDs, element types and, in this case, 2D coordinates specifying the spatial arrangement of atoms in the molecule. The <bondArray> element by analogy contains a list of bonds, together with bond IDs, a specification of which atoms are connected by each bond, and the bond order (single, double, triple or any other type of bond). Furthermore, CML can hold many other types of annotations on atoms, bonds and associated chemical data. CML was recently extended to deal with fuzzy materials such as polymers. This extension also introduced free variables into an otherwise purely declarative language, by injecting XSLT into CML specifications and evaluating expressions in a lazy manner.[7]
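The container structure described above can be made concrete with a small sketch. The fragment below is an illustrative CML-style document for a three-atom C-C-O skeleton, not a reproduction of Figure 2. It assumes the CML attribute names elementType, x2, y2, atomRefs2 and order, and it omits the CML namespace declaration for brevity. The function name read_molecule is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative CML-style fragment (hydrogens omitted, namespace left out).
CML_DOC = """
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="0.00" y2="0.00"/>
    <atom id="a2" elementType="C" x2="1.30" y2="0.75"/>
    <atom id="a3" elementType="O" x2="2.60" y2="0.00"/>
  </atomArray>
  <bondArray>
    <bond id="b1" atomRefs2="a1 a2" order="1"/>
    <bond id="b2" atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""


def read_molecule(cml_text):
    """Return (atoms, bonds) extracted from a <molecule> document."""
    root = ET.fromstring(cml_text.strip())
    # Map each atom ID to its element type and 2D coordinates.
    atoms = {
        atom.get("id"): {
            "element": atom.get("elementType"),
            "xy": (float(atom.get("x2")), float(atom.get("y2"))),
        }
        for atom in root.find("atomArray")
    }
    # Each bond names the two atoms it connects and carries a bond order.
    bonds = []
    for bond in root.find("bondArray"):
        start, end = bond.get("atomRefs2").split()
        bonds.append((start, end, bond.get("order")))
    return atoms, bonds


atoms, bonds = read_molecule(CML_DOC)
for start, end, order in bonds:
    print(f"{atoms[start]['element']}-{atoms[end]['element']} (order {order})")
```

Because bonds refer to atoms only by ID, the connectivity can be resolved with a simple dictionary lookup. A reader for real CML files would additionally need to handle the namespace and optional attributes.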
Scientific and chemical information in free unstructured text such as scientific papers, theses and reports can be marked up in an analogous manner. Figure 3 shows the markup of a sentence contained in a scientific paper using a mixture of SciXML and a technology developed by our group here in Cambridge.
Figure 3: An abstract (ref-39) (A) prior to markup, (B) after markup with OSCAR 3.
The first part of the figure (A) presents the title of the paper and the first sentence of the abstract, and (B) shows the same sentence after (automated) markup through the OSCAR 3 natural language processing system.[8] In this example, chemical entities such as “oleic acid” or “magnetite” are marked up as chemical moieties (type="CM"), and additional information, such as in-line representations of chemical structure (SMILES and InChI) as well as ontology terms, can be added.
Other markup languages of relevance to chemistry include Analytical Information Markup Language (AnIML),[9] ThermoML (a markup language for thermochemical and thermophysical property data),[10] MathML (Mathematical Markup Language)[11] and SciXML.[8, 12] Furthermore, Indian researchers have recently reported the development of an alternative to CML for the markup of chemical reaction information.[13, 14]
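To illustrate what inline entity markup of this kind looks like, here is a toy sketch. It is not OSCAR 3's method, which relies on far richer recognition than a lookup table. The lexicon, the function name tag_entities and the element name ne are assumptions made for illustration. Only the type="CM" attribute is taken from the description above.

```python
import re
from xml.sax.saxutils import escape

# Toy lexicon mapping known chemical names to an entity type (assumption).
LEXICON = {
    "oleic acid": "CM",
    "magnetite": "CM",
}


def tag_entities(sentence, lexicon=LEXICON):
    """Wrap every lexicon term found in the sentence in an inline element."""
    # Longest terms first so that multi-word names win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        surface = match.group(0)
        entity_type = lexicon[surface.lower()]
        return f'<ne type="{entity_type}">{surface}</ne>'

    # Escape XML special characters before inserting the markup.
    return pattern.sub(wrap, escape(sentence))


print(tag_entities("Magnetite nanoparticles were coated with oleic acid."))
```

A dictionary lookup of this kind misses unseen names and cannot attach structure identifiers, which is why dedicated systems are needed for real chemical text.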
While the ecosystem for markup languages in chemistry is relatively well developed, the same cannot currently be said for the availability of RDF vocabularies for the domain. The most notable efforts were reported by Frey et al. as part of the CombeChem project.[15-17] The proposed vocabulary provides the basic mechanism to describe both state-independent (e.g. identifiers, molecular weights etc.) and condition-dependent (e.g. experimentally determined physical properties where the property is dependent on, for example, measurement or environmental conditions) entities associated with molecules, as well as provenance information for both molecules and data (Figure 4).
Furthermore, the same authors have also modelled a synthetic chemistry experiment in RDF.[18] There are sporadic efforts to model aspects of molecular structure in both RDF and OWL, but this must be considered to be developing work at this stage.[19, 20] In further studies, RDF has been exploited for the purposes of publishing in the chemical domain[21, 22] and for developing technologies which could lead to the generation of “research interest” (social) networks for chemists.[22, 23]
While at least some RDF vocabularies for chemistry are therefore available, what is decidedly missing are mashup examples. This can be explained by the difficulty associated with getting hold of chemical data: unlike biology or physics, chemistry has not (yet) developed a culture of data sharing and is extremely conservative in its adoption of a more open culture. We will discuss this further below.
Ontologies are computable conceptualisations of a knowledge domain and thus crucially important for adding “meaning” to data. To date, only a few attempts have been made to construct formal ontologies for chemistry. Very early attempts predate the arrival of the semantic web and indeed the internet: in the 1980s, Gordon considered the syntax, semantics and history of structural formulae as well as the semantics and formal attributes of chemical transformations in a set of papers, which led to a formalised language for relational chemistry.[24-26] Somewhat later, van der Vet published construction rules for some very fundamental chemical concepts, such as “pure substance”, “phase” and “heterogeneous system”, as the basis for the development of further axiomatisations relevant to chemistry.[27]
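The distinction between state-independent and condition-dependent data can be sketched with plain subject-predicate-object triples. The namespace and all property names below are hypothetical and are not taken from the CombeChem vocabulary. The numbers are approximate textbook values for ethanol, and the laboratory identifier is a placeholder.

```python
# Hypothetical namespace and property names, used only for illustration.
EX = "http://example.org/chem#"

triples = [
    # State-independent facts hang directly off the molecule.
    (EX + "ethanol", EX + "hasSmiles", "CCO"),
    (EX + "ethanol", EX + "hasMolecularWeight", 46.07),
    # A condition-dependent value gets its own node that carries the
    # measurement conditions and provenance alongside the value.
    (EX + "ethanol", EX + "hasMeasurement", EX + "m1"),
    (EX + "m1", EX + "property", EX + "density"),
    (EX + "m1", EX + "value", 0.789),
    (EX + "m1", EX + "unit", "g/cm3"),
    (EX + "m1", EX + "temperatureCelsius", 20),
    (EX + "m1", EX + "recordedBy", EX + "exampleLab"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def measurements(molecule):
    """Collect each measurement node of a molecule as a plain dict."""
    result = []
    for node in objects(molecule, EX + "hasMeasurement"):
        # Strip the namespace prefix so the keys are short property names.
        result.append({p[len(EX):]: o for s, p, o in triples if s == node})
    return result


print(objects(EX + "ethanol", EX + "hasMolecularWeight"))
print(measurements(EX + "ethanol"))
```

Giving each measurement its own node is one common way to attach conditions and provenance to a value. A production system would use an RDF library and a published vocabulary rather than bare tuples.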
The currently most widely used chemical ontology is the European Bioinformatics Institute’s (EBI) “Chemical Entities of Biological Interest” (ChEBI) ontology.[28] ChEBI combines information from three main sources, namely IntEnz,[29] COMPOUND and the Chemical Ontology (CO),[30] and contains ontological associations which specify chemical relationships (e.g. “chloroform is a chloroalkane”), biological roles, and uses and applications of the molecules contained in the ontology. ChEBI is stored in a relational database, but can be exported to OBO format and translated into OWL. Other ontologies currently maintained by the EBI are REX[31] and FIX.[32] REX terms describe physicochemical processes, whereas FIX mainly describes physicochemical measurement methods. Again, both ontologies are available in the OBO format. There have been other attempts to model aspects of chemistry, such as chemical structure,[19] laboratory processes,[15-18] chemical reactions[13, 14] and polymers,[33] but these are isolated and somewhat small-scale efforts. There is currently no discernible community effort to develop a formalisation of chemical concepts.
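Relationships such as “chloroform is a chloroalkane” become useful once they can be followed transitively. The sketch below shows this with a small hand-written hierarchy. Only the first link comes from the text. The remaining links and the function names are assumptions added to demonstrate the idea, and they are not an extract from ChEBI.

```python
# Illustrative is_a links from each term to its direct parents.
IS_A = {
    "chloroform": ["chloroalkane"],
    "chloroalkane": ["haloalkane"],
    "haloalkane": ["organic molecule"],
}


def ancestors(term, is_a=IS_A):
    """Return every class reachable from term by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subclass_of(term, candidate, is_a=IS_A):
    """True if candidate is a direct or indirect parent of term."""
    return candidate in ancestors(term, is_a)


print(is_subclass_of("chloroform", "haloalkane"))
print(sorted(ancestors("chloroform")))
```

A query for all haloalkanes would then also return chloroform, even though that fact was never stated directly. This is the kind of inference an ontology adds on top of plain data.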
A significant amount of chemical data is currently tied up in unstructured sources such as scientific papers, theses and patents. As such, natural language processing (NLP) of these sources is often required to extract relevant information and data and to add metadata. While there is considerable activity in processing text in the biological, biochemical and medical literature by both companies (e.g. Temis, Linguamatics and others) and academic groups (e.g. GENIA, PennBioIE), chemistry is sadly lagging behind in this area, although a number of reports have appeared in the literature over the past several years.[34-37] The principal open tool for the extraction and semantic markup of chemical entities at the moment is the OSCAR 3 system, which is currently being developed by Corbett and Murray-Rust.[8] OSCAR 3 is part of the SciBorg project[38] for the deep parsing and analysis of scientific texts, but can also be used standalone or integrated with other NLP systems. A typical example of OSCAR’s output has been provided in Figure 3.
So far, we have only discussed the technical aspects of semantic chemistry. While the field is in many ways still in its infancy (note the absence of a significant body of RDF vocabularies and ontologies), this situation is currently being addressed by a number of academic groups as well as commercial entities, and it is reasonable to expect that a substantial amount of work will become available over the next several years.
The real challenges associated with semantic chemistry are not so much technological as “socio-cultural”. We have already alluded to the fact that, unlike other scientific, technical and medical fields, chemistry has not evolved a culture of data and knowledge sharing. Rather, chemistry has ceded the dissemination of data and knowledge almost entirely to commercial entities in the form of publishing businesses. However, as is the case in mainstream publishing, the internet is currently in the process of destroying the business model associated with scientific publishing, in which publishers justify subscriptions and revenue by organising manuscript collection, peer review, editorial work, printing and distribution of the journal issue to subscribers. As a consequence, scientific publishers are increasingly shifting their value proposition to content, i.e. scientific data, and seem to be attempting to prevent the automatic extraction of data (i.e. non-copyrightable facts) from their journals. For obvious reasons, disciplines which have already evolved both the technological and the cultural mechanisms for data sharing are less severely impacted by this than chemistry, which currently has neither. Sooner or later, this will adversely affect the progress of science as a whole: the biosciences, for example, are crucially dependent on chemical data, and without the ability to mash up chemical and biological data, progress in biology and related fields will undoubtedly be impeded.
The crucial task for anyone interested in the use of semantics in the chemical domain, therefore, is not only to develop the necessary technology, but first and foremost to make a contribution towards changing hearts and minds in the discipline and to create “data awareness” in practising scientists who are not also informaticians. The Open Access movement is making slow and steady progress in this respect (several very significant universities have recently adopted open access publishing mandates) and the current generation of undergraduate and postgraduate students is keenly aware of the possibilities and the promise of semantic technologies. Therefore, there is considerable reason for optimism that we will see the transition from “chemistry” to “semantic chemistry” and full participation of the discipline in the semantic web in the not too distant future.
Chemistry is a conservative discipline which is nevertheless starting to participate in the semantic web. There is a considerable and useful infrastructure of markup languages available for the dissemination and exchange of chemical data. While not currently highly developed, some first drafts of RDF vocabularies and ontologies are also coming on-stream, and good progress is being made in the extraction of chemical entities from unstructured sources. The main obstacle currently holding up both the further development of semantics in the chemical domain and its further adoption as a technology is socio-cultural in nature: to date, chemistry has not evolved a culture of data sharing and therefore neither the cultural nor the technical mechanisms are in place, which results in a scarcity of available data sets. Nevertheless, the increasing adoption of open access and the further penetration of semantic technology into chemistry will force change to occur, and there is every reason to remain optimistic.