The current information management tools and techniques have not kept pace with the dramatic growth of data within the enterprise. Much of this new data is represented in an unstructured or semi-structured format. The volume of the data makes it unmanageable by humans and the structure of the data makes it unavailable for machine processing. This has created a situation where information is now hidden or lost within the enterprise. This lost information has a significant business impact in the form of unmanaged risk and lost opportunities for revenue or savings.
The introduction of semantic technology within the information management landscape provides an exciting opportunity to dramatically change the state of the art for enterprise applications and to allow organizations to fully leverage the information that is available to them.
The enterprise information landscape has rapidly evolved into a collection of structured, unstructured, analytical and collaborative data. The lack of management of this data has a significant business impact in the form of lost revenue, unmanaged risk, and missed savings opportunities. While companies continue to invest in enterprise solutions, much of the risk and opportunity remains hidden within unstructured or semi-structured content.
The explosion of information within the enterprise is not unlike the explosion of information on the web. It’s no longer possible for humans to leverage all the intelligence available in enterprise data manually or with existing tools; new tools and techniques are needed. Semantic technology offers an opportunity to change the state of the art for enterprise applications and to let organizations make full use of the information available to them.
One example of information that is lost in the system is the text of an organization’s legal contract documents. The formal relationship between an organization and its business partners is defined by the text of a legal contract document. While operational details such as product, price, quantity, and delivery terms are typically extracted from these documents to enable transactions, many of the relationship-related terms, such as obligations, rights, and opportunities defined in the contract, remain hidden in the text. Although the text is often stored in document management systems, in reality, only the individuals involved in any given negotiation maintain any knowledge of the relationship defined by the words of the contract. If the individuals involved move on to another project or another company, the knowledge of the relationship is lost. A typical large enterprise might manage 50,000 or more active contracts in any given year.
This example highlights both the problem and the opportunity in better understanding the information an organization has available. You’ll be able to imagine similar examples in human resources, in supplier and customer relationships, and in the sales and operations applications in your company today.
As with any new technology, there is a great deal of hype, promise, and disagreement on the topic of semantic technology. Even the definition of the terms is disputed. What does semantic mean? What is semantic technology or the semantic web? At the heart of the discussion is the term “semantic,” which describes how we understand the meaning or intent of information. While humans can have a complex understanding of written language, this written communication is often ambiguous or imprecise, making it difficult for machines to understand the meaning or intent of the written words. The semantic web is an evolution of our technology that provides a framework for representing and interoperating with data. Similar to the way in which HTML provided a common framework for how data is presented and navigated, the semantic web provides a framework for how data is understood and integrated.
It would be easy to say that semantic technology might be worth investigating in the future. However, our inability to effectively manage and leverage the information we have today is having a significant impact on our business. While the semantic web is still an emerging and rapidly changing set of technologies, there are strong indications that now is the time to develop a strategy and start leveraging the power and the promise of the semantic web. Specifically, the semantic web has:
• Momentum. Standards organizations, academic researchers, and industry continue to refine and standardize the framework and architecture for semantic technologies.
• An evolutionary approach that extends our existing web technologies. It’s a layered approach that allows each layer of the technology stack to evolve independently of the other layers. This also enables the introduction of technology in an incremental fashion.
To develop a semantic strategy for your enterprise, you must:
1) Understand your current information landscape
2) Develop a domain-specific data model
3) Represent data from your environment in the new model
4) Query information using the new model
The first step in developing a semantic strategy for your organization is to understand your current information landscape. Identify the data sources you have available within the organization, such as the data warehouse, ERP transaction data, or an enterprise document management system. Next, identify the data sources available outside the organization. These may include rating or credit agencies, discussion or product review forums, supplier catalogs, or diversity status information. Finally, consider the business benefit of integrating these disparate data sources into a comprehensive view. It’s useful in developing a semantic strategy to maintain a comprehensive view of your data landscape, but to focus on a very specific portion to prove the technology and provide immediate business value. Consider the “what if” questions that are critical to your organization. Possible questions include:
1) What are the unmet financial obligations of existing contracts? Which obligations can I cancel or defer? What is the penalty for cancelling the obligation? Which contracts can I use to expand or increase my purchasing? What are the rebates or additional discounts available to me under the increased purchasing level?
2) How is our product perceived in the market? What are the most used features? How are we different from our competitors?
3) What are the critical skills required in my industry? What are the top skills of my current workforce? Who in our current talent pool is available and has the skills to work on a special project?
Working through a few key scenarios, you will identify a candidate for semantic technology, providing an important advantage to your organization. For example, let’s assume we are interested in extracting additional intelligence from our contract documents. In order to do this, we need to understand what specific intelligence would bring additional value to the organization. The next step is to use a common data model and to develop a domain-specific vocabulary or ontology that can be used for describing the contracts domain.
To define an ontology or domain-specific vocabulary that computer programs can understand, we need a language to express the data elements, their constraints, and their relationships in machine-readable form. OWL, the Web Ontology Language, is endorsed by the W3C and is an active part of its semantic web activity. A data model expressed in OWL is known as an OWL ontology. OWL ontologies are most often serialized in RDF/XML. XML has been used as a markup language on the web for many years, and as a data exchange format that allows multiple systems to share a common representation of data. Although XML provides a syntax for transferring or transforming data elements, it has no formal mechanism to transfer meaning or intent. That meaning is defined in the RDF and OWL layers, which are represented in XML format. The Resource Description Framework (RDF) provides the primitives from which we build our data model.
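To make the layering concrete, the following is a minimal sketch of how two classes of a contracts ontology might be serialized as RDF/XML using only the Python standard library. The `http://example.org/contracts#` namespace, the class names, and the `owl_class` helper are assumptions made for illustration; a real ontology would normally be built with an ontology editor or a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

# Standard W3C namespaces for the RDF, RDFS, and OWL layers.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
# Hypothetical namespace for our own contracts vocabulary.
EX = "http://example.org/contracts#"

# Ask ElementTree to emit readable prefixes instead of ns0, ns1, ...
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("owl", OWL)):
    ET.register_namespace(prefix, uri)


def owl_class(parent, name, superclass=None):
    """Append an owl:Class element, optionally with one rdfs:subClassOf."""
    cls = ET.SubElement(parent, f"{{{OWL}}}Class", {f"{{{RDF}}}about": EX + name})
    if superclass is not None:
        ET.SubElement(
            cls,
            f"{{{RDFS}}}subClassOf",
            {f"{{{RDF}}}resource": EX + superclass},
        )
    return cls


root = ET.Element(f"{{{RDF}}}RDF")
owl_class(root, "Contract")
owl_class(root, "NonDisclosureAgreement", superclass="Contract")

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

The XML carries only the syntax. The statement that a non-disclosure agreement is a kind of contract comes from the `owl:Class` and `rdfs:subClassOf` terms defined in the layers above it.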
The fundamental idea in RDF is a statement. Statements are sometimes referred to as triples because they contain three parts [subject PREDICATE object].
The following are some examples of statements we could make regarding the types of contracts we are storing in our contract management system:
Figure 1 provides an example of how we might classify the types of commercial contracts using the RDF triples.
Figure 1: Types of Commercial Contracts
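As a rough sketch of the same idea in code, each statement can be held as a (subject, predicate, object) tuple. The class names below, including `PurchaseAgreement`, are hypothetical and only stand in for whatever contract types your organization manages.

```python
# Each statement is a (subject, predicate, object) triple.
# All class names are hypothetical examples.
TRIPLES = [
    ("NonDisclosureAgreement", "subClassOf", "CommercialContract"),
    ("PurchaseAgreement", "subClassOf", "CommercialContract"),
    ("EmployeeNDA", "subClassOf", "NonDisclosureAgreement"),
]


def subjects(triples, predicate, obj):
    """Return every subject that has the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Which contract types are direct kinds of commercial contract?
contract_types = subjects(TRIPLES, "subClassOf", "CommercialContract")
print(contract_types)
```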
For this example, let’s focus on one of the simplest types of contracts in our system: the non-disclosure agreement (NDA) and the specific instance “NDA123”.
Figure 2 illustrates how NDA123 could be modeled using our contracts ontology.
Figure 2: NDA123 instance of employee NDA
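One way to picture an instance such as NDA123 is as a handful of triples that share the same subject. The property names (`type`, `signedBy`, `effectiveDate`), the employee name, and the date in this sketch are assumptions chosen for illustration, not values taken from a real contract.

```python
from collections import defaultdict

# Hypothetical statements about one contract instance.
NDA123 = [
    ("NDA123", "type", "EmployeeNDA"),
    ("NDA123", "signedBy", "Mary"),
    ("NDA123", "effectiveDate", "2009-01-15"),
]


def describe(triples, subject):
    """Group every predicate/object pair recorded for one subject."""
    description = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            description[p].append(o)
    return dict(description)


nda123 = describe(NDA123, "NDA123")
print(nda123)
```

Because every fact is a separate statement, a new property can be added to NDA123 later without changing a table schema.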
This simple example demonstrates the power of having a semantic understanding of the data in our information landscape. To elaborate further on this example, let’s assume that all employee profiles are stored in our HR system and that all NDAs are stored by Legal in our contract management system. How can we write the query that says: “Show all employees who do not have a signed NDA on file with Legal”?
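A minimal sketch of that query, assuming both systems can export their data as triples: the employee names, the `signedBy` property, and the `employees_without_nda` helper are hypothetical.

```python
# Hypothetical extract from the HR system.
HR_TRIPLES = [
    ("Joe", "type", "Employee"),
    ("Mary", "type", "Employee"),
    ("Ann", "type", "Employee"),
]

# Hypothetical extract from the contract management system.
LEGAL_TRIPLES = [
    ("NDA123", "type", "EmployeeNDA"),
    ("NDA123", "signedBy", "Mary"),
]


def employees_without_nda(hr, legal, nda_class="EmployeeNDA"):
    """Employees for whom no NDA of the given class has been signed."""
    graph = hr + legal  # merging two sources is just list concatenation
    employees = {s for s, p, o in graph if p == "type" and o == "Employee"}
    ndas = {s for s, p, o in graph if p == "type" and o == nda_class}
    signers = {o for s, p, o in graph if p == "signedBy" and s in ndas}
    return sorted(employees - signers)


missing = employees_without_nda(HR_TRIPLES, LEGAL_TRIPLES)
print(missing)
```

Here the two sources are combined simply by putting their statements in one graph; no shared schema or foreign key had to be designed in advance.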
In a traditional environment, we might add a checkbox to the employee profile and require the HR representative to check the box when the employee has returned the NDA. Unfortunately, this approach is a static solution that cannot evolve or scale as business needs change. Let’s add some complexity and assume that we have just started the top secret joint project “Gemstone” with one of our partners, and all employees assigned to work on project Gemstone must sign a Gemstone-specific NDA. Our information model might be updated to include:
[Chart 3 (W. Haubner): updated information model]
Figure 3 illustrates the Gemstone project NDA456.
Figure 3: NDA456 instance of Gemstone NDA
If the information in the system is modeled in this way, we can run a query joining information from the HR, Project Management, and Contract Management systems that asks questions such as “Show all the employees assigned to the Gemstone project who have not signed a Gemstone NDA”. The Gemstone NDA also covers the normal employee NDA, so if Joe has signed the Gemstone NDA, he doesn’t need to sign the standard employee NDA. Now try to run our original query to show all employees who have not signed an employee NDA. In a standard relational join, you would have to join all the different types of NDAs, but a semantic query can infer that a Gemstone NDA also meets the criteria of an employee NDA.
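The inference described here can be sketched with a single `subClassOf` statement and a transitive lookup. This is a toy illustration, not how a production reasoner works; the graph contents and the helper names (`subclasses_of`, `signers_of`, `unsigned`) are assumptions.

```python
# Hypothetical merged graph from HR, Project Management, and Legal.
GRAPH = [
    ("GemstoneNDA", "subClassOf", "EmployeeNDA"),
    ("Joe", "type", "Employee"),
    ("Mary", "type", "Employee"),
    ("Ann", "type", "Employee"),
    ("Joe", "assignedTo", "Gemstone"),
    ("Mary", "assignedTo", "Gemstone"),
    ("NDA123", "type", "EmployeeNDA"),
    ("NDA123", "signedBy", "Mary"),
    ("NDA456", "type", "GemstoneNDA"),
    ("NDA456", "signedBy", "Joe"),
]


def subclasses_of(graph, cls):
    """Return cls plus every class that is transitively a subClassOf it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in graph:
            if p == "subClassOf" and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def signers_of(graph, nda_class):
    """People who signed an NDA of nda_class or of any of its subclasses."""
    classes = subclasses_of(graph, nda_class)
    ndas = {s for s, p, o in graph if p == "type" and o in classes}
    return {o for s, p, o in graph if p == "signedBy" and s in ndas}


def unsigned(graph, nda_class, project=None):
    """Employees (optionally limited to one project) lacking such an NDA."""
    people = {s for s, p, o in graph if p == "type" and o == "Employee"}
    if project is not None:
        people &= {s for s, p, o in graph if p == "assignedTo" and o == project}
    return sorted(people - signers_of(graph, nda_class))


no_employee_nda = unsigned(GRAPH, "EmployeeNDA")
no_gemstone_nda = unsigned(GRAPH, "GemstoneNDA", project="Gemstone")
print(no_employee_nda, no_gemstone_nda)
```

Joe never signed a standard employee NDA, yet he does not appear in the first result, because his Gemstone NDA is inferred to be an employee NDA. Removing the `subClassOf` statement from the graph would put him back on the list.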
Now that we have a data model, we can focus on the challenge of making our existing data available for processing within this new model. In some ways, this is similar to the way we would make our information available to a full-text search engine, with the additional task of making the metadata that describes the meaning available. For transactional and analytical systems, a great deal of the metadata we need may already exist or be encoded in the logic of the target application. In these scenarios, we can leverage many of our traditional integration techniques to make the data and metadata available either dynamically or in a batch extract.
For unstructured data, we have two basic choices: we can manually tag the data with the metadata, or we can use tools to extract the metadata we need from the unstructured content. These tools can be as simple as pattern-matching tools that extract key items from the document, such as name, effective date, and contract type. Alternatively, they can be sophisticated tools that use Natural Language Processing (NLP) or statistical analysis to derive the metadata from the written words. Many of these tools have become quite powerful and commercially available over the past few years. Finally, there is no reason why we shouldn’t do a better job of creating this unstructured content in the first place. Advances in word processing software, together with templates and reusable components, make it possible to mark up items such as clause types, key entities, or variables within the document as it is being created. The document then carries the metadata along with the original content.
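The simple end of that spectrum can be sketched with a few regular expressions that turn contract text into triples. The sample text, the patterns, and the predicate names are all assumptions; real contracts vary far more in layout and wording, which is where NLP tools earn their keep.

```python
import re

# Hypothetical contract text with a predictable layout.
CONTRACT_TEXT = """NON-DISCLOSURE AGREEMENT
This Agreement is entered into by Acme Corp and Joe Smith.
Effective Date: 2009-03-01
"""

# One pattern per piece of metadata; each capture group becomes an object.
PATTERNS = {
    "contractType": re.compile(
        r"^(NON-DISCLOSURE AGREEMENT|PURCHASE AGREEMENT)$", re.MULTILINE
    ),
    "effectiveDate": re.compile(r"Effective Date:\s*(\d{4}-\d{2}-\d{2})"),
    "party": re.compile(r"entered into by (.+?) and (.+?)\."),
}


def extract_triples(doc_id, text):
    """Return (doc_id, predicate, value) triples for every pattern that matches."""
    triples = []
    for predicate, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            for value in match.groups():
                triples.append((doc_id, predicate, value))
    return triples


extracted = extract_triples("NDA123", CONTRACT_TEXT)
for triple in extracted:
    print(triple)
```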
Now that our data is represented in a useful ontology, we can run queries such as “Show all the employees on project Gemstone with a Gemstone-specific NDA.” End users rarely write SQL by hand to query a relational database; the tools they use often generate the SQL for them. Similarly, the W3C recommends a query language called SPARQL, which is specifically designed to query the RDF triples discussed previously. SPARQL has been implemented in many languages and tools.
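To give a feel for how such a query works, the sketch below shows one possible SPARQL form of the Gemstone question, followed by a toy matcher that evaluates the same triple patterns over an in-memory graph. The `ex:` namespace, the property names, and the `match` and `solve` helpers are assumptions; the matcher covers only basic pattern matching and is not a SPARQL engine.

```python
# A possible SPARQL form of the question (shown for reference, not executed).
SPARQL_QUERY = """
PREFIX ex: <http://example.org/contracts#>
SELECT ?employee WHERE {
  ?employee ex:assignedTo ex:Gemstone .
  ?nda a ex:GemstoneNDA ;
       ex:signedBy ?employee .
}
"""

# Hypothetical graph.
GRAPH = [
    ("Joe", "assignedTo", "Gemstone"),
    ("Mary", "assignedTo", "Gemstone"),
    ("NDA123", "type", "EmployeeNDA"),
    ("NDA123", "signedBy", "Mary"),
    ("NDA456", "type", "GemstoneNDA"),
    ("NDA456", "signedBy", "Joe"),
]

# The same query as triple patterns; terms starting with "?" are variables.
PATTERNS = [
    ("?employee", "assignedTo", "Gemstone"),
    ("?nda", "type", "GemstoneNDA"),
    ("?nda", "signedBy", "?employee"),
]


def match(pattern, triple, binding):
    """Extend binding so pattern equals triple, or return None on conflict."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def solve(patterns, graph):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


employees = sorted({b["?employee"] for b in solve(PATTERNS, GRAPH)})
print(employees)
```

A variable that appears in more than one pattern, such as `?employee`, acts as the join between statements; that is the role a join condition plays in SQL.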
The challenge of managing information within the enterprise is not a new problem. However, we have been lacking a clear strategy for addressing the issue. The momentum of the standards community to provide a uniform set of technologies now affords us a framework to build an enterprise semantic strategy and to leverage the tools and support of a large and growing community.