Controlled vocabularies, taxonomies and thesauri have been in use in a wide variety of organizations for decades. With the information explosion fueled by the internet, the importance of these organization structures has become more and more apparent. The problem isn't where to find vocabularies or how to build them; on the contrary, enterprises typically find that they have several mini-vocabularies, each tuned to a special purpose or business need, just as so-called "folksonomies" have appeared in popular websites. The problem enterprises are facing today is how to manage all these vocabularies in a coherent way and eventually to integrate them so that they can make cross-references from one to another.
It is easy to think of the idea of tagging information for retrieval as a modern, Web 2.0 idea. Sites like del.icio.us and flickr gained considerable traction when they allowed their users to collaborate on information organization by 'tagging' articles and photos. This sort of collaborative organization is a form of 'crowdsourcing' - allowing large groups of people to create a useful information asset.
But this idea is not new to the Web 2.0 generation. As early as the 19th century, West Publishing saw the value in tagging legal briefs to make them easier to find and created the West Key Number System. Librarians have been 'tagging' books for their catalogs for generations. The idea of using a little bit of work from a large number of interested or informed people to index information has a distinguished history. But these systems learned very quickly the same lesson that del.icio.us and flickr learned: a pure crowdsourcing solution to tagging quickly gets out of control. Left unchecked, collaborative tagging systems result in a proliferation of inconsistent tags. The result is a system in which systematic search is frustrating at best. This is a situation that is barely acceptable on popular sites like flickr and del.icio.us, but is completely inappropriate for a controlled enterprise environment.
Enter the notion of a controlled vocabulary. If all the taggers are familiar with a pre-defined set of tags, then they can tag the materials more consistently. In the case of the West Key Number System, the idea of a controlled vocabulary started quite early - in the late 19th century. The maintenance of the controlled vocabulary became a business process in its own right: deciding which terms would be used, and documenting how they would be used.
A controlled vocabulary has a number of audiences:
- "Taggers" - knowledge workers for a content provider who index items with certain controlled terms
- Searchers - customers of a repository who want to find the correct items, find them easily, and know that they found them
- Vocabulary managers - knowledge workers who determine what concepts the vocabulary covers, and how the various terms describe them.
In an enterprise setting, controlled vocabularies are used for a variety of business purposes, including:
- Enhancing the 'findability' of data and information assets
- Communicating and integrating data across the supply chain
- Standardizing publication of company product information to external partners and customers
Forms of controlled vocabularies
In this article, we use the general term "controlled vocabulary" for any system of terminology that mediates a crowdsourced tagging system. But as you might imagine for a practice that has been in place for generations, there are a number of subtleties surrounding the structure, deployment, maintenance and management of a controlled vocabulary, resulting in different types of vocabularies. Some of the more common kinds of controlled vocabularies are:
- Taxonomies. A taxonomy has some sort of hierarchical or tree-like structure, with some terms having broader scope than others. There is a whole field of study on the nature of taxonomies, including considerations about how the terms in the hierarchy relate to one another, and how terms relate to external knowledge.
- Thesauri. A thesaurus is usually considered to be a more powerful structure than a taxonomy, including non-hierarchical relationships between terms. Some thesauri are rather vague about this relationship, others more specific.
- Ontologies. An ontology is the most powerful (and least well-defined) controlled vocabulary. An ontology can include detailed logical constraints between terms, for example "an Elizabethan play is one written by a playwright whose career spanned the reign of Elizabeth I".
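The difference between the first two forms can be made concrete with a small sketch. The `Term` class and its field names below are hypothetical, chosen only for illustration: a taxonomy needs nothing more than the `broader` link, while a thesaurus also records non-hierarchical `related` links. The logical constraints of an ontology are not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """One entry in a controlled vocabulary (hypothetical structure)."""
    label: str
    broader: Optional[Term] = None                    # taxonomy: hierarchical link
    related: list[Term] = field(default_factory=list)  # thesaurus: associative links


def broader_chain(term: Term) -> list[str]:
    """Walk up the hierarchy and collect the labels of all broader terms."""
    chain = []
    current = term.broader
    while current is not None:
        chain.append(current.label)
        current = current.broader
    return chain


# A tiny taxonomy: Drama > Play > Tragedy
drama = Term("Drama")
play = Term("Play", broader=drama)
tragedy = Term("Tragedy", broader=play)

# A thesaurus-style association that sits outside the hierarchy
playwright = Term("Playwright")
play.related.append(playwright)

print(broader_chain(tragedy))
print([t.label for t in play.related])
```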
The wide variety of forms of controlled vocabularies is one contributor to the main problem that enterprises are facing with vocabularies today - how to understand, best manage and utilize the large number of vocabularies that the organization already has and uses.
Sources of controlled vocabularies
Where do controlled vocabularies come from? Most organizations don't have the venerable tradition of the West Key Number System, where they have been managing a controlled vocabulary for over a century. Most organizations have vocabularies in one or more of the following forms:
- Spreadsheets. By far the most common way to organize a vocabulary. At some point in the organization's past, someone saw the need for an organized list of terminology, so they started a spreadsheet. If there was some structure to the vocabulary, they put that into the spreadsheet. Depending on the experience of the person putting it together, this might be very orderly or just a mess. In any case, since there is no standard for vocabularies in spreadsheets, it is idiosyncratic.
- Business applications. Some applications (e.g., CRM applications) have identified the need for a controlled vocabulary in the application itself, and provide a means for organizing these. While these vocabularies are represented in an orderly fashion, the vocabularies themselves are local to the application.
- Dedicated vocabulary management tools. More advanced organizations will have invested in a vocabulary management tool. Such tools usually provide a wide range of import / export filters for various vocabulary standards.
- External vocabulary sources. More and more industries are developing industry standard vocabularies that promise to allow interoperability between enterprises. The West Key Number System vocabulary is one example, but many others can be found, often supported by international organizations like the United Nations. UNSPSC (e-commerce) and Agrovoc (food and agriculture) are two examples.
This variety of sources and requirements results in a number of challenges for managing vocabularies at the enterprise level. Some of the major concerns include:
- Vocabularies from different systems are available in contrasting and often idiosyncratic forms. In the case of dedicated vocabulary management systems, there is often a variety of export filters, but this is not the case for many vocabularies that are part of some other work process.
- Different groups don't communicate while developing vocabulary, resulting in overlapping coverage and differences in representation. This makes it difficult for an organization to move toward a 'single version of the truth', since there are established business processes around the various versions of the truth manifest in each group's vocabulary.
- External vocabulary resources are valuable, but typically are not designed in a way that is compatible with enterprise needs. Published vocabularies usually aim for greatest coverage, including all information that might be needed for any use. This makes them too cumbersome to be used conveniently for any particular use.
- Spreadsheets are a common tool but have a number of technical limitations, including size, structure, and linkage. Spreadsheet users show great ingenuity in encoding vocabulary structure in spreadsheets. It often requires a dedicated IT task to move the vocabulary into any other form.
- Taxonomy tools are often built into other apps (CRM, BPM, etc.), with limited and disconnected functionality. A dedicated vocabulary management system will typically support re-use of terms and distributed editing. Such detailed management functionality is not typical in systems that are not dedicated to vocabularies.
- Vocabulary Management is treated as an isolated activity, though it has impact throughout the enterprise. Categories used in CRM are useful for supply chain management or elsewhere in the business process, but vocabulary development is not done with global application in mind.
- Terms from one vocabulary often need to link to another. Two different business processes have legitimate reasons for using terminology in different ways. An enterprise vocabulary solution needs to be able to manage the fact that different terms may be used for the same thing in different business contexts, or that the same term will be used in different ways. How does my notion of "Customer" relate to yours?
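To illustrate the spreadsheet challenge in the list above, here is a minimal sketch of the kind of conversion task involved. It assumes one particular (hypothetical) encoding, in which the column a term appears in gives its depth in the hierarchy; real spreadsheets use many other conventions, which is exactly why each one tends to need its own conversion step. The sample data and the function name `rows_to_pairs` are invented for this example.

```python
import csv
import io
from typing import Optional

# Hypothetical spreadsheet export: one term per row, and the column the
# term appears in encodes its depth in the hierarchy.
SHEET = """\
Products,,
,Hardware,
,,Laptops
,,Printers
,Software,
"""


def rows_to_pairs(text: str) -> list[tuple[str, Optional[str]]]:
    """Return (term, parent) pairs from a column-indented sheet."""
    pairs = []
    path = []  # path[d] is the most recent term seen at depth d
    for row in csv.reader(io.StringIO(text)):
        for depth, cell in enumerate(row):
            if cell.strip():
                term = cell.strip()
                del path[depth:]  # forget terms at this depth or deeper
                parent = path[-1] if path else None
                pairs.append((term, parent))
                path.append(term)
                break  # only the first non-empty cell in a row is used
    return pairs


for term, parent in rows_to_pairs(SHEET):
    print(f"{term} -> {parent}")
```

Once the structure is explicit as term and parent pairs, it can be loaded into whatever vocabulary representation the enterprise settles on.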
Enterprise Vocabulary Management
This set of challenges identifies a new need in the enterprise that we call Enterprise Vocabulary Management (EVM). A common approach to vocabulary management on an enterprise scale is to create a single centralized reference vocabulary for the whole enterprise. While this is an attractive idea, it is simply not tenable in a modern enterprise setting. Different business units and different business processes have legitimate and well-entrenched reasons to use terminology in certain, typically conflicting, ways. It simply does not make business sense to tell one unit that they have to change the way they communicate in their everyday processes to conform to a corporate standard. Even in the cases where such efforts succeed, what with mergers and acquisitions and modifications of product lines and new industries, enterprises today are so dynamic that a central reference is obsolete before it is complete.
In light of such dynamism, the situation may seem hopeless - how can any information management system hope to keep up? The answer is inspired by the web - the one information system that thrives in highly dynamic environments. Instead of trying to create a single vocabulary system for the enterprise, an EVM solution must create a web of vocabulary structures linking them together. This approach puts a few requirements on the underlying technology of a solution:
- Global naming - in order to talk about something, you have to agree on a name for it. This doesn't mean you have to agree on how to use any particular word, but if we want to talk about how CRM's notion of "Customer" relates to AR's notion of "Customer", we have to have a way to refer to these two notions of the word "Customer".
- Linking - a solution has to be able to talk about how terms are related. Terms within a vocabulary can be related to one another (like taxonomies and thesauri), but also terms between vocabularies are related to one another.
- Flexibility - the vocabulary network must be able to represent information typically found in spreadsheets, databases, XML feeds, business systems, and vocabulary systems.
- Distribution - the solution must be distributed and networked. Vocabularies will be maintained by different systems in different locations organized by different IT departments.
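The first two requirements, global naming and linking, can be sketched in a few lines. Everything below is illustrative: the `vocab.example.com` namespaces, the term names, and the relation names `overlapsWith` and `narrowerThan` are assumptions made up for this example, not part of any standard. The point is only that once each department's "Customer" has its own global name, a statement about how the two relate becomes a simple piece of data.

```python
# Hypothetical base URIs: each system publishes its terms under its own namespace.
CRM = "http://vocab.example.com/crm/"
AR = "http://vocab.example.com/ar/"

# A link is a (subject, relation, object) triple over global names.
# The relation names here are placeholders, not a standard vocabulary.
links = {
    (CRM + "Customer", "overlapsWith", AR + "Customer"),
    (CRM + "KeyAccount", "narrowerThan", CRM + "Customer"),
}


def linked_terms(term: str) -> set[str]:
    """Return every term directly linked to `term`, in either direction."""
    result = set()
    for subject, _relation, obj in links:
        if subject == term:
            result.add(obj)
        elif obj == term:
            result.add(subject)
    return result


# The two notions of "Customer" stay distinct but can be related to each other.
print(sorted(linked_terms(CRM + "Customer")))
```

Because the names are URIs, the two vocabularies can live in different systems and still be referenced from a third place that holds only the links.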
While it is possible for a number of technologies to support such a solution, Semantic Web technology was designed with web distribution in mind. In particular:
- The Semantic Web standards Resource Description Framework (RDF) and RDF Schema provide the infrastructure for creating a web of information, whether on the public internet or within enterprise intranets. Like the familiar World Wide Web, the Semantic Web is easily extensible, breaking out of information silos once and for all, as vocabularies are available as resources on the web for anyone to reference and use. RDF makes use of the same system for global naming (the URI) that is already successful on the web.
- The Semantic Web standard Simple Knowledge Organization System (SKOS) provides an extensible language for describing structured terminology systems like taxonomies and thesauri. Based on RDF, it inherits the power of flexibility and distribution. SKOS extends RDF with specific notions of linkage between structured vocabularies, making it particularly useful for enterprise vocabulary management.
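As a rough illustration of what SKOS data looks like, the sketch below renders one concept in Turtle syntax using plain string formatting. The SKOS namespace and the properties `skos:prefLabel`, `skos:broader` and `skos:closeMatch` come from the SKOS standard; the `vocab.example.com` URIs, the `Party` concept and the helper `concept_to_turtle` are hypothetical. A real project would more likely use an RDF library, and this sketch does not escape special characters in labels.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def concept_to_turtle(uri, pref_label, broader=None, close_match=None):
    """Render one SKOS concept as a Turtle snippet (no label escaping)."""
    statements = ["a skos:Concept", f'skos:prefLabel "{pref_label}"@en']
    if broader:
        # Hierarchical link within one vocabulary
        statements.append(f"skos:broader <{broader}>")
    if close_match:
        # Mapping link to a concept in another vocabulary
        statements.append(f"skos:closeMatch <{close_match}>")
    body = " ;\n    ".join(statements)
    return f"<{uri}>\n    {body} ."


# Hypothetical URIs for the CRM and AR notions of "Customer"
document = f"@prefix skos: <{SKOS_NS}> .\n\n" + concept_to_turtle(
    "http://vocab.example.com/crm/Customer",
    "Customer",
    broader="http://vocab.example.com/crm/Party",
    close_match="http://vocab.example.com/ar/Customer",
)
print(document)
```

The `skos:closeMatch` statement is the kind of cross-vocabulary linkage referred to above: it relates the CRM concept to the AR concept without requiring either vocabulary to change.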
The features of Semantic Web technology put it in a unique position as a key enabling technology for enterprise vocabulary management. While a complete solution will also require comprehensive enterprise integration capabilities (e.g., connection to LDAP services and web application platforms) as well as flexible user interfaces, the most promising EVM solutions will start with an underlying technology that is capable of meeting these core needs.