This article discusses approaches to semantic processing in a search engine. Major search engines have one main type of result – a list of links to matching Web sources. There are many enhancements on top of it, but the core premise remains the same: a search result consists of a set of individual pages, and the user is expected to drill down into the individual sources. An alternative type of result is suggested: an essay compiled from a number of relevant, ordered sentences. The search engine in this case parses Web sources, understands their semantics, and creates an overall summary of the topic. The idea is to save the user time by providing a quick overview of the topic. A Web sentiment analysis application based on semantic analysis is also introduced.
The Semantic Web is most often perceived as the Web of data. In order to be visible on the Semantic Web, a webpage needs to be described by tags – using RDFa, microformats, or another mechanism. When a semantic-aware application accesses the page, it is able to understand it and serve the relevant content to the user.
But with tons of legacy pages, and the 24/7 cycle of new content creation on the Web, a great number of the pages a user comes across will not be semantically tagged. In these vast areas of the Web “where no RDF has gone before”, a search engine has to understand a page on its own in order to assess its relevance to the user – by applying some sort of semantic analysis. The engine can apply a taxonomy or ontology, either specific to a particular field (e.g., medical, legal, or social) or generic. The analysis can be performed on the fly, or offline – pre-tagging the crawled pages for future matching, as in the sketch below. This semantic analysis of a webpage is the focus of this article.
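As a minimal sketch of what such offline pre-tagging could look like, the following Python fragment matches a page’s text against a tiny hand-made taxonomy. The taxonomy, the tag_page function, and the hit threshold are illustrative assumptions for this article, not the internals of a production engine:

import re
from collections import Counter

# Illustrative mini-taxonomy: concept -> indicative terms.
# A real engine would use a full ontology, generic or domain-specific.
TAXONOMY = {
    "medicine": {"symptom", "diagnosis", "treatment", "clinical", "patient"},
    "law":      {"plaintiff", "statute", "liability", "court", "contract"},
    "finance":  {"equity", "bailout", "liquidity", "investor", "debt"},
}

def tag_page(text, min_hits=2):
    """Assign taxonomy concepts to an untagged page based on term overlap."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    tags = []
    for concept, terms in TAXONOMY.items():
        hits = sum(words[t] for t in terms)
        if hits >= min_hits:          # crude relevance threshold
            tags.append(concept)
    return tags

page = "The court held that the contract imposed no liability on the defendant."
print(tag_page(page))                 # -> ['law']

A real engine would replace the keyword sets with a full ontology and score the matches far more carefully, but the shape of the task – text in, concepts out – stays the same.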
First of all, let’s understand which types of user queries will benefit from such semantic analysis. Users make a number of different types of queries when searching the Web. A widely cited classification distinguishes between transactional, informational, and navigational queries, although the borders between the types are often fuzzy. (A thorough analysis can be found at http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_user_intent.pdf.) A typical navigational query is made when a user is looking for a particular Web site, e.g., “Microsoft”. A transactional query is one in which the user is looking for a site where a transaction can be performed, e.g., buying a product or making a reservation (“Domino’s Pizza in Astoria”).
An informational query is one in which the user’s goal is to understand a concept or to form an opinion on a subject. Examples include the Afghan war, the British elections, mesothelioma, the iPad, and global warming.
One can say that in the case of an informational query, the user is searching for an answer, even when the query is formulated as a combination of keywords and not as a human-language question. A query like “When was Barack Obama born?” is looking for a direct answer, whereas the query “Barack Obama” is probably looking for an answer in the form of an overview of the topic, answering a few questions at once: who, when, what is he famous for, etc.
One option for handling such a query is to find the page that best answers it. Ask.com has tried that, as have some smaller search engines. It is easier to find an answer to a specific question, where an engine can attempt to find affirmative statements expressed similarly to the question. However, in many cases the question cannot be formulated concisely and clearly – for instance, when the user is trying to understand a concept or grasp the meaning of a subject. In such cases, the results often fall short of satisfying the user.
In general, major search engines have only one main type of result – a list of links to matching Web sources. There are many enhancements on top of it, including federated search and Google’s recent additions in the left sidebar. However, the core premise remains the same: a search result consists of a set of individual pages. In order to get an answer to the query, the user needs to dig into the individual results one by one, eventually constructing a full picture in his mind.
An alternative type of result addressing this problem is an essay compiled of a number of relevant and ordered sentences. The search engine in this case parses the Web sources, understands them semantically, and creates an overall summary of the topic constructed from fragments of multiple pages.
One recent application that has started providing results in this format is Cpedia (www.cpedia.com), which automatically generates encyclopedia-style answers to user queries. This approach has been pioneered by our semantic search engine SenseBot (www.sensebot.net) since early 2007. (Author’s note: a patent detailing the approach has been filed with the USPTO.)
The idea of creating a digest of Web sources seems much closer to satisfying the user’s actual needs than the list of “10 blue links”. Naturally, the implementation is difficult, and it is unlikely that the quality of an automatic summary will match an essay prepared by a human expert. The task is further complicated by the variety of Web content sources, written in different styles and targeting different audiences. Yet we believe that solid NLP/text-mining and semantic techniques, paired with smart heuristic rules, can reach an acceptable level of user satisfaction. The benefit lies in drastically reducing the time the user needs to find a satisfactory answer.
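To illustrate the extractive approach in its simplest form – and this is only a rough sketch under naive assumptions, not SenseBot’s patented algorithm – the Python fragment below scores sentences from several sources by the corpus-wide frequencies of their content words (standing in for semantic concepts) and assembles the top ones into a short digest:

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "was", "to", "for"}

def content_words(sentence):
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def summarize(documents, max_sentences=3):
    """Pick the sentences that best cover the dominant terms of all sources."""
    sentences = [s.strip() for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    # Corpus-wide term frequencies stand in for "semantic concepts" here.
    freq = Counter(w for s in sentences for w in content_words(s))
    def score(s):
        words = content_words(s)
        return sum(freq[w] for w in words) / (len(words) or 1)
    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Keep the original reading order so the digest flows like an essay.
    return sorted(ranked, key=sentences.index)

docs = [
    "Galileo Galilei was an Italian astronomer. He improved the telescope.",
    "Galileo supported heliocentrism. The telescope made his observations possible.",
]
print("\n".join(summarize(docs)))

Ordering the selected sentences by their original position is an example of the “smart heuristic rules” mentioned above: it keeps the digest reading like an essay rather than a ranked list.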
Let’s look at some examples of an essay as an answer. The first example shows the search results returned by a major search engine for the query “Galileo Galilei”.
[Figure: Example of search results returned by a major search engine on a query “Galileo Galilei”]
As you can see,
• Only two facts are noted on this page (underlined in items 3 and 7);
• All 8 sites look like they could contain detailed info on our topic (of those, 5 indeed have it when you browse to them: 1, 4, 5, 6, and 7);
• The user would have to click on a number of links and read through them in order to get the answers.
The next example shows a summary on the same query generated by SenseBot. The essay is constructed out of the same sources that were returned before. The sources are referenced after each sentence.
[Figure: Example of a summary on the same query generated by SenseBot]
In this case,
• Most sentences contain facts (underlined);
• The summary attempts to lay out the topic in a coherent way:
a – The beginning introduces Galileo at a high level;
b – The next couple of sentences provide biographical data;
c – The summary then continues with more detail about Galileo’s achievements and biography.
• The user would learn a lot about Galileo just from this one-page summary. To study the topic in detail, he would probably go to source 7 or 1; for a more scientific discussion, he would likely consider 5, 8, or 1.
Not all users are willing, or have the time, to read the text, though – therefore a concise presentation of the results in a semantic tag cloud can be helpful. Here is an example of a summary on the topic of “Financial bailout”, prefaced with a cloud of semantic concepts extracted from the sources. The cloud serves as an overview of the subject, as well as a navigation aid. Clicking on a particular concept regenerates the summary, focusing it on the selected aspect of the subject.
[Figure: Example of a summary on the topic of “Financial bailout” prefaced with a cloud of semantic concepts extracted from the sources]
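As a hedged sketch of how the concept cloud and the click-to-refocus behavior could be wired together, the fragment below weights concepts by their frequency across the summary’s sentences and filters the summary on a selected concept; the function names, the plain substring matching, and the example sentences are all assumptions made for illustration:

from collections import Counter

def concept_cloud(sentences, concepts, top=10):
    """Weight each known concept by how often it occurs across the sentences."""
    counts = Counter()
    for s in sentences:
        low = s.lower()
        for c in concepts:
            if c in low:
                counts[c] += 1
    return counts.most_common(top)

def refocus(sentences, concept):
    """Regenerate the summary around one concept, as a cloud click would."""
    return [s for s in sentences if concept in s.lower()]

sents = [
    "The financial bailout was approved by Congress.",
    "Taxpayers questioned the cost of the bailout.",
    "Banks received liquidity injections.",
]
print(concept_cloud(sents, ["bailout", "taxpayers", "banks", "congress"]))
print(refocus(sents, "bailout"))

In the real interface, a click on a cloud term would trigger something like refocus and re-render the page; here the selection is simulated with a plain function call.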
The summary described in the examples above can be seen as a different type of search result altogether. The essence of the top relevant sources is automatically extracted and presented to the user, in many cases obviating the need to drill down into individual sources. The “lightweight” sources are excluded from the summary, even if they are highly ranked by major search engines.
The primary goal of a system producing this type of result is to save the user time by giving him a quick overview of the topic. Even when the overview is imperfect, the benefits are there – you can get perhaps 80% of the answer in a few seconds, and in many cases that 80% will suffice. Typically, queries of the informational type are the ones that benefit the most from a summary.
In this case, a search engine is going beyond information search and retrieval – to information synthesis. In addition to Web search, a number of other applications can utilize this type of semantic analysis. For example, LinkSensor (www.LinkSensor.com) uses it for contextual ad-matching and in-text discovery. Another promising field of application is sentiment analysis.
Web sentiment analysis tools have been spreading recently, targeting enterprises and social sites. In particular, Twitter analysis has become quite common. However, the overall footprint of sentiment analysis tools is still small. The reason may be insufficient quality when they are applied to generic, all-Web content as opposed to a particular subject domain. We believe that semantics-backed sentiment analysis has a better chance of satisfying Web and enterprise users. The two main benefits that semantics can provide are:
• Higher accuracy in attributing sentiment to the right entity (see the sketch after this list);
• Explaining the sentiment trend by exposing the drivers of sentiment in public opinion.
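To illustrate the first benefit, the toy fragment below attributes each sentiment word to the nearest entity mention in the sentence rather than scoring the text as a whole. The lexicons, the proximity heuristic, and the entity list are assumptions for the sketch, not OpinionCrawl’s actual method:

import re
from collections import defaultdict

POSITIVE = {"strong", "praised", "recovered"}
NEGATIVE = {"weak", "criticized", "collapsed"}

def entity_sentiment(text, entities):
    """Attribute each sentiment word to the closest entity mention."""
    scores = defaultdict(int)
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = sentence.split()
        positions = {i: e for i, t in enumerate(tokens)
                     for e in entities if e.lower() in t.lower()}
        if not positions:
            continue
        for i, t in enumerate(tokens):
            word = t.strip(".,").lower()
            polarity = (word in POSITIVE) - (word in NEGATIVE)
            if polarity:
                nearest = min(positions, key=lambda p: abs(p - i))
                scores[positions[nearest]] += polarity
    return dict(scores)

text = "Acme posted strong results while Globex collapsed. Analysts praised Acme."
print(entity_sentiment(text, ["Acme", "Globex"]))   # {'Acme': 2, 'Globex': -1}

A sentence-level score would have netted Acme’s good news against Globex’s collapse in the first sentence; attributing each sentiment word to its nearest entity keeps the two apart.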
www.OpinionCrawl.com is our new site that allows visitors to assess Web sentiment on a subject – a person, an event, a company, or a product. A pie chart expressing the current real-time sentiment is displayed, along with a list of the latest news headlines, thumbnail images, and a tag cloud of key semantic concepts that the public associates with the subject. The concepts let users see which issues or events drive the sentiment in a positive or negative direction.
To view the sentiment trend, the user can go to the blog section of the site: http://www.OpinionCrawl.com/blog. All posts in the blog are generated automatically. Web crawlers bring in fresh content from blogs, news, forums, etc., and the sentiment is recalculated on a regular basis. The blog posts show the trend of sentiment over time, as well as the positive-to-negative ratio. For example, you can see how public sentiment on the US economy changes over time.
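As a minimal sketch of how such a trend could be computed – the data layout, the monthly bucketing, and the sample numbers are assumed for illustration – consider:

from collections import defaultdict
from datetime import date

# (publication date, sentiment label) pairs, as a crawler might emit them.
mentions = [
    (date(2010, 4, 1), "pos"), (date(2010, 4, 3), "neg"),
    (date(2010, 5, 2), "pos"), (date(2010, 5, 9), "pos"),
    (date(2010, 5, 20), "neg"),
]

def monthly_ratio(mentions):
    """Positive-to-negative ratio per month."""
    buckets = defaultdict(lambda: {"pos": 0, "neg": 0})
    for day, label in mentions:
        buckets[(day.year, day.month)][label] += 1
    return {month: b["pos"] / max(b["neg"], 1)   # avoid division by zero
            for month, b in sorted(buckets.items())}

print(monthly_ratio(mentions))   # {(2010, 4): 1.0, (2010, 5): 2.0}

Charting these per-month ratios is essentially the trend that the automatically generated blog posts display.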
By watching the cloud of concepts over time, one can see how the issues and key topics that the Web public associates with the subject change, and how this affects the sentiment. The concepts are mined from the Web pages by SenseBot. Semantic analysis helps attribute the sentiment to the right entity and identify the key concepts defining public opinion.
For companies and agencies, we produce in-depth sentiment reports on a subscription basis. The reports cover subjects chosen by the client and draw on a large number of specific sources that are of interest to the client. They can be used to monitor the reputation of a brand or a product, or to track a company’s competitors. Both the sentiment and the issues that investors or consumers associate with the topic are presented in the reports.
The author will speak about this topic in more detail at SemTech 2010.