In this work, done in collaboration with the Max Planck Institute for Informatics, we focused on discovering commonsense knowledge, i.e. knowledge shared by most humans, from various sources.
Commonsense knowledge is very important in modern AI because it helps systems understand objects, human behaviours, and general concepts. However, it is challenging to capture: it is rarely stated explicitly, and it is hard to distinguish from contextual knowledge.
For example, elephants are grey, but this fact is rarely stated explicitly. On Google, a quick search gives more than three times as many estimated results for “pink elephant” (7.3 million) as for “grey elephants” (2 million).
To tackle this problem, we devised novel ways of tapping into search-engine query logs and QA forums, and we corroborated the extracted facts using statistical evidence from different sources, such as encyclopedias, image tags, and books. Our pipeline can be represented as follows:
Our main idea for obtaining fact candidates was to consider questions instead of statements. We leveraged human curiosity to extract salient knowledge about the world. Indeed, the way a question is phrased presupposes facts about the world. For instance, “why is the sky blue” implies that the sky is actually blue.
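To make the idea concrete, here is a minimal sketch of how a “why” question could be turned into a fact candidate. It is only an illustration, not our actual extraction code: the function name `question_to_statement` and the single regular expression are assumptions made for this example, and the pattern only covers questions of the form “why is/are <subject> <property>” with a one-word subject.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for "why is/are <subject> <property>".
# Assumption: the subject is a single word, optionally preceded by an article.
_WHY_BE = re.compile(
    r"^why\s+(is|are)\s+(?:(?:the|an|a)\s+)?(\w+)\s+(.+?)\??$",
    re.IGNORECASE,
)


def question_to_statement(question: str) -> Optional[Tuple[str, str, str]]:
    """Return a (subject, verb, property) fact candidate, or None if no match."""
    match = _WHY_BE.match(question.strip())
    if match is None:
        return None
    verb, subject, prop = match.groups()
    return subject.lower(), verb.lower(), prop.lower()


if __name__ == "__main__":
    for q in ("why is the sky blue", "why are elephants grey", "how do birds fly"):
        print(q, "->", question_to_statement(q))
```

A question that does not fit the pattern, such as “how do birds fly”, simply yields no candidate here; a real system would need many more patterns and proper linguistic analysis.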
We therefore constructed a dataset of questions. One of our sources was QA forums such as Reddit or Answers.com, from which we extracted all questions. We then used the autocompletion of search engines such as Google or Bing to simulate access to their query logs. We mainly focused on “why” and “how” questions (such as “why are elephants grey” or “how do birds fly”).
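The autocompletion step can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of our implementation: the list `QUESTION_STEMS`, the helper names, and the `suggest` callable are all hypothetical. In particular, `suggest` stands in for whatever autocompletion service is queried; no real search-engine API is called here.

```python
from typing import Callable, Iterable, List

# Assumed question stems; the stems used in practice may differ.
QUESTION_STEMS = ("why is", "why are", "why do", "how do", "how does")


def build_prefixes(subject: str) -> List[str]:
    """Build the query prefixes sent to the autocompletion service."""
    return [f"{stem} {subject} " for stem in QUESTION_STEMS]


def collect_questions(
    subjects: Iterable[str],
    suggest: Callable[[str], Iterable[str]],
) -> List[str]:
    """Collect deduplicated "why"/"how" questions suggested for each subject.

    `suggest` is a hypothetical callable mapping a prefix to suggestions.
    """
    seen = set()
    questions = []
    for subject in subjects:
        for prefix in build_prefixes(subject):
            for suggestion in suggest(prefix):
                # Lowercase and collapse whitespace so duplicates are detected.
                normalized = " ".join(suggestion.lower().split())
                if normalized.startswith(("why ", "how ")) and normalized not in seen:
                    seen.add(normalized)
                    questions.append(normalized)
    return questions


if __name__ == "__main__":
    # Fake suggestions standing in for a real autocompletion service.
    canned = {"why are elephants ": ["why are elephants grey", "elephants for sale"]}
    print(collect_questions(["elephants"], lambda p: canned.get(p, [])))
```

Passing the suggestion source as a parameter keeps the sketch independent of any particular search engine and makes it easy to test with canned data.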
In the end, we obtained a knowledge base ten times bigger than ConceptNet, a handmade knowledge base, and TupleKB, a knowledge base focusing on high-precision facts. Intrinsic and extrinsic evaluations demonstrated the performance of our approach.
We also published a demo at CIKM 2020, available at http://quasimodo.r2.enst.fr.