Executive Summary   

XML keyword search remains a popular academic subject, but it has not yet been adopted or recognized by commercial XML and Internet products. The concepts involved are also important to the semantic web. The semantics industry, with its focus on higher-level semantics such as ontologies and taxonomies, has overlooked the importance of using the semantics of hierarchically structured data such as XML. When working with hierarchically structured data, the first level of semantic handling must be to recognize the hierarchical structure and its lower-level hierarchical semantics. That structure is then used to eliminate false keyword search results, that is, apparent matches whose keywords are not meaningfully related within the hierarchy. Otherwise these false matches pass undetected to the higher-level semantic processing, which will not catch them either because it is not concerned with the structure of the data, and meaningless results are returned.
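As a rough illustration of the idea, the sketch below contrasts a flat keyword match with a structure-aware one. It is only a minimal sketch under stated assumptions, not the method of any particular product or paper: the sample document, the function names, and the rule that a match related only through the document root counts as a false result are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two departments, one employee each.
DOC = """
<company>
  <dept><name>Sales</name>
    <emp><name>Jones</name><city>Boston</city></emp>
  </dept>
  <dept><name>Research</name>
    <emp><name>Smith</name><city>Denver</city></emp>
  </dept>
</company>
"""

def contains_all(elem, keywords):
    """Flat test: does the subtree text contain every keyword?"""
    text = " ".join(elem.itertext()).lower()
    return all(k.lower() in text for k in keywords)

def smallest_matches(elem, keywords):
    """Return the deepest elements whose subtree contains every keyword."""
    if not contains_all(elem, keywords):
        return []
    deeper = [m for child in elem for m in smallest_matches(child, keywords)]
    return deeper or [elem]

def meaningful_matches(root, keywords):
    """Assumed rule: keywords related only through the root are unrelated."""
    return [e for e in smallest_matches(root, keywords) if e is not root]

root = ET.fromstring(DOC)

# A flat search reports a hit: both words occur somewhere in the document.
flat_hit = contains_all(root, ["Smith", "Sales"])

# The structure-aware search rejects it: Smith works in Research, not Sales.
false_match = meaningful_matches(root, ["Smith", "Sales"])

# Keywords that share a single employee record are kept.
true_match = meaningful_matches(root, ["Smith", "Denver"])
```

Here `flat_hit` is true while `false_match` is empty, because the two keywords sit in different `dept` subtrees; `true_match` contains the single `emp` element that holds both keywords.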


Entity extraction is the process of automatically extracting document metadata from unstructured text documents. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from free-form text can empower organizations not only to improve keyword search but also to open the door to semantic search, faceted search and document repurposing. This article defines the field of entity extraction, describes some of the technical challenges involved, and shows how RDF can be used to store document annotations. It then shows how new tools such as Apache UIMA are poised to make entity extraction much more cost-effective for organizations.
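To make the idea of storing extracted entities as RDF concrete, the following minimal sketch pulls a few entities out of a sentence and writes them as N-Triples statements. It does not use Apache UIMA; the gazetteer lists, the date pattern, the `http://example.org/` namespace and the `mentions...` predicate names are all hypothetical and chosen only for illustration. A real extractor would need far more than exact string lookup.

```python
import re

EX = "http://example.org/"          # hypothetical namespace
PEOPLE = {"Jane Doe"}               # toy gazetteer of person names (assumed)
LOCATIONS = {"Paris"}               # toy gazetteer of locations (assumed)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO-style dates only

def extract_entities(text):
    """Return (entity_type, surface_text) pairs found in the text."""
    found = []
    for name in sorted(PEOPLE):
        if name in text:
            found.append(("Person", name))
    for name in sorted(LOCATIONS):
        if name in text:
            found.append(("Location", name))
    found += [("Date", d) for d in DATE_RE.findall(text)]
    return found

def to_ntriples(doc_id, entities):
    """Serialize each annotation as one N-Triples line about the document."""
    lines = []
    for etype, value in entities:
        # Escape backslashes and quotes so the literal stays well-formed.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{EX}doc/{doc_id}> <{EX}mentions{etype}> "{escaped}" .')
    return lines

text = "Jane Doe visited Paris on 2009-03-14."
entities = extract_entities(text)
triples = to_ntriples("d1", entities)
```

Each resulting line states one fact about the document, for example that document `d1` mentions the person "Jane Doe", which is the kind of annotation a semantic or faceted search layer could later query.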