The goal of Probase is to make machines "aware" of the mental world of human beings, so that machines can better understand human communication. We do this by giving certain general knowledge or certain common sense to machines.
A Little Knowledge Goes a Long Way
Our goal is to enable machines to better understand human communication. An important question is: what does the word “understand” mean here? Consider the following example. When we see “25 Oct 1881”, we recognize it as a date, although most of us do not know what it refers to. However, given a little more context, say the date is embedded in the short text “Pablo Picasso, 25 Oct 1881, Spain”, most of us would guess (correctly) that the date is Pablo Picasso’s birthday. We are able to do this because we possess certain knowledge, in this case that “one of the most important dates associated with a person is his birthday.”
As another example, consider a problem in natural language processing. Humans do not find sentences such as “animals other than dogs such as cats” ambiguous, but machine parsing can lead to two possible understandings: “cats are animals” or “cats are dogs.” Common sense tells us that cats cannot be dogs, which renders the second parsing improbable.
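The parsing example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ISA_COUNTS` table, its numbers, and the `pick_attachment` helper are hypothetical and are not part of Probase. The idea is simply to prefer the reading whose isA claim has more supporting evidence.

```python
# Hypothetical toy table: (instance, concept) -> amount of isA evidence.
# The counts are invented for illustration only.
ISA_COUNTS = {
    ("cats", "animals"): 120,
    ("dogs", "animals"): 150,
    ("cats", "dogs"): 0,
}


def pick_attachment(instance, candidate_concepts, isa_counts=ISA_COUNTS):
    """Return the candidate concept with the most isA evidence for `instance`."""
    return max(candidate_concepts, key=lambda c: isa_counts.get((instance, c), 0))


# "animals other than dogs such as cats": does "cats" attach to
# "animals" or to "dogs"?
print(pick_attachment("cats", ["animals", "dogs"]))
```

With these toy counts the "cats are animals" reading wins, because no evidence supports "cats are dogs".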
It turns out that what we need in order to act like a human in the above two examples is nothing more than knowledge about concepts (e.g., persons and animals) and the ability to conceptualize (e.g., cats are animals). This is not a coincidence. Psychologist Gregory Murphy began his highly acclaimed book with the statement “Concepts are the glue that holds our mental world together”. A book review in Nature pointed out that “Without concepts, there would be no mental world in the first place”. Needless to say, having concepts and the ability to conceptualize is one of the defining characteristics of humanity. The question then is: how do we pass human concepts to machines, and how do we enable machines to conceptualize?
Probase: Using the World as its Model
Knowledge in Probase is harnessed from billions of web pages and years' worth of search logs, which are nothing more than the digitized footprints of human communication. In other words, Probase uses the world as its model.
Figure 1: A snippet of Probase's core taxonomy
Figure 1 shows what is inside Probase. The knowledgebase consists of concepts (e.g., emerging markets), instances (e.g., China), attributes and values (e.g., China's population is 1.3 billion), and relationships (e.g., emerging markets, as a concept, is closely related to newly industrialized countries), all of which are automatically derived in an unsupervised manner.
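To make the four kinds of content concrete, here is a minimal sketch of how such records could be held in memory. The class and field names (`Instance`, `Concept`, `related_concepts`) are assumptions made for illustration and do not describe Probase's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """An instance with its attribute/value pairs (hypothetical layout)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Concept:
    """A concept with its instances and related concepts (hypothetical layout)."""
    name: str
    instances: Dict[str, Instance] = field(default_factory=dict)
    related_concepts: List[str] = field(default_factory=list)


# The example from the text, expressed with these illustrative classes.
china = Instance("China", {"population": "1.3 billion"})
emerging = Concept(
    "emerging markets",
    instances={"China": china},
    related_concepts=["newly industrialized countries"],
)

print(emerging.instances["China"].attributes["population"])
```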
But Probase is much more than a traditional ontology/taxonomy. Probase is unique in two aspects. First, Probase has an extremely large concept/category space (2.7 million categories). As these concepts are automatically acquired from web pages authored by millions of users, they probably cover most of the concepts (about worldly facts) in our mental world. Second, data in Probase, like knowledge in our minds, is not black or white. Probase quantifies this uncertainty, and the resulting probabilities serve as the priors and likelihoods that form the foundation of probabilistic reasoning in Probase.
Our mental world contains many concepts about worldly facts, and Probase tries to duplicate them. The core taxonomy of Probase alone contains more than 2.7 million concepts. Figure 2 shows their distribution. The Y axis is the number of instances each concept contains (logarithmic scale), and the X axis lists the 2.7 million concepts ordered by size. In contrast, existing knowledge bases have far fewer concepts (Freebase contains no more than 2,000 concepts, and Cyc has about 120,000 concepts), which falls short of modeling our mental world. As Figure 2 shows, besides popular concepts such as “cities” and “musicians”, which are included in almost every general-purpose taxonomy, Probase has millions of long-tail concepts such as “anti-parkinson treatments”, “celebrity wedding dress designers” and “basic watercolor techniques”, which cannot be found in Freebase or Cyc. Besides concepts, Probase also has a large data space (each concept contains a set of instances or sub-concepts), a large attribute space (each concept is described by a set of attributes), and a large relationship space (e.g., “locatedIn”, “friendOf”, “mayorOf”, as well as relationships that are not easily named, such as the relationship between apple and Newton).
Figure 2: Frequency distribution of the 2.7 million concepts
Table 1: Scale of concept dimension
|name||# of concepts|
We make the bold claim that Probase is a knowledgebase about concepts in our mental world because the concepts in Probase are harnessed from billions of web pages authored by millions of people (see Section 3). With such a rich concept space, Probase has a much better chance of understanding text in natural language (see Section 4). Indeed, we studied 2 years’ worth of Microsoft’s Bing search logs and found that 85% of the searches contain concepts and/or instances that exist in Probase. This means Probase can be a powerful tool for interpreting the user intention behind a search.
Another feature of Probase is that it is probabilistic: every claim in Probase is associated with probabilities that model the claim’s correctness, typicality, ambiguity, and other characteristics. The probabilities are derived from evidence found in web data, search log data, and other existing taxonomies. For example, for typicality (between concepts and instances), Probase contains the following probabilities:
- P(C=company|I=apple): How likely people are to think of the concept “company” when they see the word “apple”.
- P(I=steve jobs|C=ceo): How likely “steve jobs” is to come to mind when people think about the concept “ceo”.
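One simple way to obtain both directions of typicality is to normalize co-occurrence counts of (concept, instance) pairs. The sketch below assumes relative-frequency estimates; the `PAIR_COUNTS` table and its numbers are invented, and the scoring actually used in Probase may differ.

```python
from collections import defaultdict

# Hypothetical (concept, instance) evidence counts; numbers are invented.
PAIR_COUNTS = {
    ("company", "apple"): 600,
    ("fruit", "apple"): 400,
    ("company", "microsoft"): 900,
    ("ceo", "steve jobs"): 30,
    ("ceo", "bill gates"): 20,
}


def typicality(pair_counts):
    """Return (P(C|I), P(I|C)) as dicts keyed by (concept, instance)."""
    concept_totals = defaultdict(int)
    instance_totals = defaultdict(int)
    for (concept, instance), n in pair_counts.items():
        concept_totals[concept] += n
        instance_totals[instance] += n
    # P(C=c | I=i): share of the instance's evidence that falls under concept c.
    p_c_given_i = {
        (c, i): n / instance_totals[i] for (c, i), n in pair_counts.items()
    }
    # P(I=i | C=c): share of the concept's evidence that falls on instance i.
    p_i_given_c = {
        (c, i): n / concept_totals[c] for (c, i), n in pair_counts.items()
    }
    return p_c_given_i, p_i_given_c


p_c_given_i, p_i_given_c = typicality(PAIR_COUNTS)
print(p_c_given_i[("company", "apple")])   # P(C=company | I=apple)
print(p_i_given_c[("ceo", "steve jobs")])  # P(I=steve jobs | C=ceo)
```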
Probase also has typicality scores for concepts and attributes. Another important score in Probase is the similarity between any two concepts y1 and y2 (e.g., celebrities and famous politicians). Thus Probase can tell that natural disasters and politicians are very different concepts, that endangered species and tropical rainforest plants are related to some degree, and that countries and nations are almost the same concept.
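The text does not say how concept similarity is computed. As one plausible illustration, the sketch below compares two concepts by the cosine similarity of their instance distributions. The `PROFILES` data and the `concept_similarity` function are hypothetical and chosen only to mirror the examples above.

```python
import math

# Hypothetical instance distributions P(I | C) for a few concepts.
PROFILES = {
    "countries": {"china": 0.5, "india": 0.3, "brazil": 0.2},
    "nations": {"china": 0.4, "india": 0.4, "brazil": 0.2},
    "politicians": {"barack obama": 0.6, "angela merkel": 0.4},
    "natural disasters": {"earthquake": 0.5, "flood": 0.5},
}


def concept_similarity(y1, y2, profiles=PROFILES):
    """Cosine similarity between the instance vectors of concepts y1 and y2."""
    u, v = profiles[y1], profiles[y2]
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(concept_similarity("countries", "nations"))              # close to 1
print(concept_similarity("natural disasters", "politicians"))  # no shared instances
```

Concepts that share most of their instances score near 1, while concepts with no instances in common score 0.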
These probabilities serve as priors and likelihoods for Bayesian reasoning on top of Probase. In addition, the probabilistic nature of Probase enables it to incorporate data of varied quality from heterogeneous sources: Probase regards such external data as evidence.
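As a minimal sketch of such Bayesian reasoning, the snippet below infers the most likely concept behind a set of terms with a naive Bayes model, P(c | terms) ∝ P(c) · ∏ P(term | c). The `PRIOR` and `LIKELIHOOD` tables, the smoothing constant `eps`, and the `conceptualize` function are assumptions for illustration, not Probase's actual model or API.

```python
# Hypothetical priors P(C) and likelihoods P(I | C); numbers are invented.
PRIOR = {"company": 0.5, "fruit": 0.5}
LIKELIHOOD = {
    "company": {"apple": 0.3, "microsoft": 0.4, "ibm": 0.3},
    "fruit": {"apple": 0.5, "pear": 0.3, "banana": 0.2},
}


def conceptualize(terms, prior=PRIOR, likelihood=LIKELIHOOD, eps=1e-6):
    """Return the posterior P(C | terms) under a naive Bayes assumption.

    Terms unseen under a concept get the small probability `eps`
    instead of zero, so one missing term does not erase a concept.
    """
    scores = {}
    for concept, p_c in prior.items():
        score = p_c
        for term in terms:
            score *= likelihood[concept].get(term, eps)
        scores[concept] = score
    total = sum(scores.values())
    return {concept: s / total for concept, s in scores.items()}


# "apple" alone is ambiguous; extra context shifts the posterior.
print(conceptualize(["apple"]))
print(conceptualize(["apple", "microsoft"]))
print(conceptualize(["apple", "pear"]))
```

With these toy numbers, "apple" next to "microsoft" is read as a company, and "apple" next to "pear" as a fruit.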
The goal of Probase is to enable machines to better understand human communication. For example, in natural language processing and speech analysis, knowledgebases can help reduce the ambiguities in language. As Probase has a knowledgebase as large as the concept space (of worldly facts) in a human mind, it has unique advantages in these applications.
In addition, using the probabilistic knowledge provided by Probase, we have built several interesting applications, such as topic search, web table search, and document understanding, as shown in Figure 3.
Figure 3: Overview of Probase and Its Applications
Please refer to our release notes.
Zhongyuan Wang (zhy.wang @ microsoft . com)