Using Semantic Web Standards for Improved Text Mining

Better text mining makes it possible to connect information in a variety of sources. The technology can connect information in CRM databases with consumer e-mails and help desk reports to provide a more complete view of the customer. Text mining can also be used in national security applications to better identify terrorists and security threats; it can assist in marketing to mine reviews for feedback on products such as movies, books and music. It can help in scientific research by providing a way to better connect scientific articles.
While there are many applications for text mining, there are also big challenges in its use. Text is written for people to read; no program available today can “read” text and understand its full meaning, and none will for the foreseeable future. With this challenge in mind, can Semantic Web technologies help improve text mining? The answer is yes, but the reason may come as a surprise to many.
Broadly speaking, the algorithms used for text mining fall into two categories: linguistic analysis and statistical pattern matching:

  • Linguistic analysis examines the structure of sentences. This means that the software must be extended to support different languages. Among many vendors in this space are Temis, Stratify and Attensity.
  • Statistical pattern matching relies on mathematical techniques that are language independent; no attempts are made to recognize what the text means – it is all symbols to the software. Bayesian belief networks are often used. The most notable vendor is Autonomy.

Linguistic Analysis Remains the Cornerstone of Text Mining
Today, a given text mining product likely supports some combination of both approaches, but linguistic analysis remains the cornerstone of text mining and the most reliable way to get accurate results. Mathematical techniques can be useful as a supplement to linguistic techniques. On their own, however, they don’t work reliably, and they have the big drawback of being a “black box.” The software needs to be trained to process particular content, and once trained, it behaves a certain way. If the training set is changed, the behavior may change as well. Exactly why the software produces the results it does, and how to improve those results when they are not satisfactory, is hard to pinpoint.
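The training-dependent, “black box” behavior described above can be illustrated with a minimal naive Bayes text classifier, a simpler relative of the Bayesian belief networks mentioned earlier. This is an illustrative sketch, not any vendor's actual implementation; the class name and the two-document training set are invented for the example. Note that the classifier treats text purely as symbols: change the training documents and its verdicts change, with no explanation why.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTextClassifier:
    """Toy naive Bayes classifier: text is just symbols, no linguistic knowledge."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of training docs
        self.vocab = set()

    def train(self, documents):
        for text, label in documents:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesTextClassifier()
clf.train([
    ("the product failed and support was slow", "complaint"),
    ("great product fast delivery", "praise"),
])
print(clf.classify("support was slow"))  # decided from word statistics alone
```

The numbers inside `classify` explain nothing to a user; retraining on a different corpus silently shifts every score, which is exactly the maintenance problem described above.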
Linguistic analysis does not require investment in training the software; it works more reliably and accurately and is easier to understand, improve and extend. It does, however, require investment in developing so-called vocabulary cartridges: domain-specific vocabularies. For example:

  • A vendor like Calais, which specializes in recognizing names, places, companies and roles, has developed vocabularies of geographic locations, organizations and names. The vocabulary records that United Kingdom and England are synonymous and that both are countries, and that Lockheed Martin, Lockheed and LMCO all refer to the same company.
  • Attensity invested time working with life sciences companies. They have vocabularies that include different terms for the biological and chemical elements.
  • Stratify, after several years of positioning its software as a general solution, now specializes in analyzing legal documents.

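At its core, a vocabulary cartridge is a mapping from surface forms (aliases) to canonical concepts and their types. The toy sketch below invents its own entries (it is not Calais's actual data) but shows how synonyms like England/United Kingdom and LMCO/Lockheed Martin resolve to one canonical concept:

```python
# A toy vocabulary "cartridge": surface form -> (canonical concept, type).
# Entries are illustrative, not any vendor's actual vocabulary.
VOCABULARY = {
    "united kingdom":  ("United Kingdom", "country"),
    "england":         ("United Kingdom", "country"),
    "lockheed martin": ("Lockheed Martin", "company"),
    "lockheed":        ("Lockheed Martin", "company"),
    "lmco":            ("Lockheed Martin", "company"),
}

def extract_entities(text):
    """Return canonical (concept, type) pairs for vocabulary terms found in text."""
    found = []
    lowered = text.lower()
    # Try longest aliases first, so "lockheed martin" wins over "lockheed"
    for alias in sorted(VOCABULARY, key=len, reverse=True):
        if alias in lowered:
            concept = VOCABULARY[alias]
            if concept not in found:
                found.append(concept)
            lowered = lowered.replace(alias, " ")  # don't rematch shorter aliases
    return found

print(extract_entities("LMCO signed a deal in England"))
# Canonical forms come back no matter which synonym the text used
```

The lookup itself is trivial; the hard, labor-intensive part is building and maintaining the `VOCABULARY` table, which is exactly the cost discussed below.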
Of course, vocabularies are not the only thing that makes text mining possible. For example, there are modules for identifying words with the same root, a process called stemming; stemming deals with prefixes and suffixes. There are also modules for recognizing sentence structure, so that the software can detect when a word is used as a noun as opposed to a verb, adverb or adjective. Such modules are well developed, are commonly used in spell checking and are available as open source.
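A stemming module of the kind just described can be sketched as a handful of suffix-stripping rules. Real open-source stemmers such as the Porter stemmer use much richer, ordered rule sets with conditions on the remaining stem; the rules below are purely illustrative:

```python
# A toy suffix-stripping stemmer in the spirit of the Porter algorithm.
# Rules are tried in order; the length check avoids mangling short words.
SUFFIX_RULES = [
    ("ing", ""),   # mining -> min
    ("ies", "y"),  # categories -> category
    ("ed", ""),    # mined -> min
    ("s", ""),     # mines -> mine
]

def stem(word):
    """Strip the first matching suffix, leaving a stem of at least 3 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ["mining", "mined", "mines", "categories"]:
    print(w, "->", stem(w))
```

Even this crude version maps "mining", "mined" and "mines" close enough together to be matched as one root, which is all a text miner needs from the module.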
There have certainly been no breakthroughs in text mining approaches and algorithms in years. If you have a well developed vocabulary or knowledge model of the domain, you get good results; otherwise, results can be mediocre. Though true, this presents a challenge, as many would say that vocabulary creation is an awful lot of work!
The vocabulary/knowledge-model approach to text mining has proven to work well and reliably, but it is often dismissed or used in a limited way because it is so labor intensive. Much effort has been expended on finding a more “scalable” approach, one based on some magical algorithm. So far, the magic has not been discovered, and there have been no fundamentally new ideas on what it could be.
Text Mining using Semantic Web Standards: Crowd-sourcing Vocabulary Creation
RDF and OWL do not offer anything specific to text mining, except for one key feature: a standard way to describe vocabularies and distribute them over the web. What does this mean?
Suddenly, vocabulary creation can be crowdsourced and, thus, become very scalable. A single company does not need to develop and systematize all the knowledge itself; it can leverage the work of others (vendors, research programs, end user organizations, standards bodies, individuals), anyone and everyone. Described in RDF, vocabularies developed by different parties can come together and can refer to each other. The work of one party can be extended, clarified, specialized and used by another party, in its entirety or selectively.
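The reason vocabularies combine so easily is that RDF reduces everything to subject-predicate-object triples, so graphs published independently merge by plain set union. The sketch below models triples as Python tuples; the prefixed names are illustrative shorthand, not actual published vocabularies:

```python
# RDF reduces a vocabulary to subject-predicate-object triples, so graphs
# from different publishers merge by set union. Names below are illustrative.
vendor_vocab = {
    ("ex:UK", "rdfs:label", "United Kingdom"),
    ("ex:UK", "rdf:type", "ex:Country"),
}
community_vocab = {
    ("ex:UK", "owl:sameAs", "dbpedia:United_Kingdom"),
    ("dbpedia:United_Kingdom", "rdfs:label", "England"),  # synonym added elsewhere
}

merged = vendor_vocab | community_vocab  # crowdsourced merge is just union

def labels_for(graph, subject):
    """Collect labels for a subject, following owl:sameAs links one hop."""
    same = {subject} | {o for s, p, o in graph
                        if s == subject and p == "owl:sameAs"}
    return sorted(o for s, p, o in graph
                  if p == "rdfs:label" and s in same)

print(labels_for(merged, "ex:UK"))  # labels contributed by both parties
```

Neither party had to coordinate with the other in advance; the shared data model and the `owl:sameAs` link are enough to make one party's synonyms usable by the other.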
Vocabularies in RDF have already started to appear on the web, from simple and generic to more complex and domain specific. On the simple side, TopQuadrant includes with TopBraid Composer (and makes available on the web) a vocabulary of country names with links to DBpedia. DBpedia already contains, for each country, synonyms and language-specific names, so TopQuadrant decided not to replicate this knowledge since it is readily available on DBpedia. For an example on how to use it, go to
Domain-specific vocabularies available in RDF include AGROVOC from the United Nations’ Food and Agriculture Organization and the Economic Thesaurus from the German National Library of Economics.
The holy grail of text mining, accurate concept extraction from unstructured text, will come soon, although it will likely not stem from advances in mathematical algorithms or some super-smart artificial intelligence. Instead, it is coming as a result of advances in the web that enable us to better use human intelligence.


Text-mining comes to the aid of the Semantic Web

Nice post Irene. I agree with Seth Grimes’s point of view, and yours. Both perspectives are part of the solution to achieve the Semantic Web.
You are right that the standards promoted by the W3C (RDF/OWL/SPARQL and others) are great blueprints. They allow interlinking and sharing of knowledge. They allow leveraging knowledge stored in remote repositories (ontologies). Semantic Web technologies that implement these standards are no longer reserved for early adopters and are becoming mainstream, with RDF and SPARQL endpoints driven by powerful RDF triple stores. These are the tools.
Sure, text-mining solutions are starting to use them, but text mining is more than that: more than standards and the knowledge repositories that use them. Text-mining solutions like the Text Mining Engine (TME) developed by Nstein are able to add semantic annotations (keywords, categories, named entities, sentiment analysis, events) to the unstructured textual content that clutters up the Web even more each day. Without these semantic annotations, the Semantic Web vision is a pipe dream. These semantic annotations, as their richness and relevancy improve each year, allow querying the web in a different way, closer to natural language (and thinking). Semantic annotations allow using high-level criteria for better quality and less noise in the results. Searching the Web only with terms and keywords, which returns thousands of irrelevant documents, will become legacy. Text-mining solutions are able to annotate millions of documents per day, with consistency and accuracy. These semantic annotators are providing the building blocks for the Semantic Web. Text analytics solutions will take over and leverage these annotations, aggregating and interlinking them to offer something pretty close to the vision Tim Berners-Lee formulated 20 years ago.

Martin Brousseau
Product Designer at Nstein Technologies

You have it backwards

I think you have it backwards. It’s text mining that’s leading to advances in the Semantic Web. Manual annotation, and sometimes generation of vocabularies etc., is too expensive and cumbersome without automated methods, i.e., text mining. The best process is often a hybrid one that combines automatic methods with manual curation, but this is still overall a text-analytics process.

My 2 cents.

"It’s text mining that’s

“It’s text mining that’s leading to advances in the Semantic Web.” This is quite an intriguing comment. So far I have not heard anyone in the Semantic Web community make such statements. Can you explain?

My article is really about the power of crowdsourcing: the idea that work that is too expensive or time consuming for an individual or even a single company can, in fact, be done quickly and effectively when it is crowdsourced. Witness Wikipedia, but there is also a growing number of other examples.

Semantic Web standards enable crowdsourcing of vocabulary creation.

Everyone knows the term AI (Artificial Intelligence). A colleague told me that some time ago Carnegie Mellon University coined the term IA (Intelligence Assisted). This is about using technology to harness human intelligence. Amazon Mechanical Turk can be thought of as one example of IA. Using social networking, in general, can also be put in this category.

Btw, I’ve learned about this book on crowdsourcing: I have not read it yet, but it looks interesting.