question Answering with Subgraph Embeddings
Introduction
Teaching machines are you answer questions automatically in a natural language has been a long standing goal in AI. There has been a rise in large scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. "Freebase: a collaboratively created graph database for structuring human knowledge." In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, to tackle the problem known as open-domain question answers (or open QA). However, the scale and difficulty for machines to interpret natural language still makes this problem challenging.
open QA techniques can be classified into two main categories:
- Information retrieval based: retrieve a broad set of answers be first query the API of the KBs then narrow down the answer using heuristics<ref>O. Kolomiyets and M.-F. Moens. "A survey on question answering technology from an information retrieval perspective." Information Sciences, 181(24):5412–5434, 2011.</ref><ref>C. Unger, L. B¨uhmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano "Template-based Question Answering over RDF Data" In Proceedings of the 21st international conference on World Wide Web, 2012.</ref><ref>X. Yao and B. Van Durme. "Information extraction over structured data: Question answering with freebase." In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref>.
- Semantic parsing based: focus on the correct interpretation of the query. Querying the interpreted question from the KB should return the correct answer <ref>J. Berant, A. Chou, R. Frostig, and P. Liang. [http://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf "Semantic parsing on Freebase from
question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. "Scaling Semantic Parsers with On-the-fly Ontology Matching." In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. "Semantic parsing via paraphrasing." In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. "Open Question Answering Over Curated and Extracted Knowledge Bases." In Proceedings of KDD’14. ACM, 2014.</ref>
Both of these approaches require negligible interventions (hand-craft lexicons, grammars and KB schemas) to be effective.
<ref name=five>A. Bordes, J. Weston, and N. Usunier. "Open question answering with weakly supervised embedding models." In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model to this problem. The goal of this paper is to provide an improved model of <ref name=five/> specifically with the contributions of:
- A more sophisticated inference procedure that is more efficient and can consider longer paths.
- A richer representation of of the answers which encodes the question-answer path and surround subgraph of the KB.
Task Definition
Motivation is to provide a system for open QA able to be trained as long as:
- A training set of questions paired with answers.
- A KB providing a structure among answers.
WebQuestions [1] was used for evaluation benchmark. WebQuestions only contains a few samples, so it was not possible to train the system on only this dataset. The following describes the data sources used for training.
- WebQuestions: the dataset built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers was allowed to only use Freebase as the querying tool).
- Freebase: is a huge database of general facts that are organized in triplets (
subject
,type1.type2.predicate
,object
). The form of the data from Freebase does not correspond to a structure found in natural language and so the questions were converted using the following format: "What is thepredicate
of thetype2 subject
"? Note that all data from Freebase will have a fixed format and this is not realistic (in terms of a NL). - ClubWeb Extractions: The team also used ClueWeb extractions as per [1] and [10]. ClueWeb has the format (
subject
, "text string",object
) and it was ensured that both thesubject
andobject
was linked to Freebase. These triples were also converted into questions using simple patters and Freebase types. - Paraphrases: automatically generated sentences have a rigid format and semi-automatic wording which does not provide a satisfactory modelling of natural language. To overcome this, the team made supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as a rephrasing of each other: [6] harvest 2M distinct questions from WikiAnswers which were grouped into 350k paraphrase clusters.
Table 2 shows some examples sentences from each dataset category.
Embedding Questions and Answers
We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let q denote a question and a denote an answer. Learning embeddings is achieved by learning a score function S(q, a) so that S generates a high score if a is the correct answer to q, and a low score otherwise.
- [math]\displaystyle{ S(q, a) = f(q)^\mathrm{T} g(a) \, }[/math]
Let [math]\displaystyle{ \mathbf{W} }[/math] be a matrix of [math]\displaystyle{ \mathbb{R}^{k \times N} }[/math], where k is the dimension of the embedding space and N is the dictionary of embedding to be learned. The function [math]\displaystyle{ f(\cdot) }[/math] which maps the questions into the embedding space [math]\displaystyle{ \mathbb{R}^{k} }[/math], is defined as [math]\displaystyle{ f(q) = \mathbf{W}\phi(q) }[/math], where [math]\displaystyle{ \phi(q) \in \mathbb{N}^N }[/math], is a sparse vector indicating the number of times each word appears in the question q. Likewise, the function [math]\displaystyle{ g(\cdot) }[/math] which maps the answers to the same embedding space [math]\displaystyle{ \mathbb{R}^{k} }[/math], is defined as [math]\displaystyle{ g(a) = \mathbf{W}\psi(a) }[/math], where [math]\displaystyle{ \psi(a) \in \mathbb{N}^N }[/math], is a sparse vector representation of the answer a. Figure 1 below depicts the subgraph embedding model.
Representing Candidate Answers
Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it.
- (i) Single Entity: The answer is represented as a single entity from Freebase. [math]\displaystyle{ \psi(a) }[/math] is a 1-of-[math]\displaystyle{ N_S }[/math] coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.
- (ii) Path Representation: The answer is represented as a path from the entity in the question to the answer entity. Only 1- or 2-hops paths were considered in the experiments which resulted in a [math]\displaystyle{ \psi(a) }[/math] which is 3-of-[math]\displaystyle{ N_S }[/math] or 4-of-[math]\displaystyle{ N_S }[/math].
- (iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity.
The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, we adopted the subgraph approach.
Training and Loss Function
The model was trained using a margin-based ranking loss function. Let [math]\displaystyle{ D = {(q_i, a_i) : i = 1,..., |D|} }[/math] be the training set of questions [math]\displaystyle{ q_i }[/math] paired with their correct answer [math]\displaystyle{ a_i }[/math]. The loss function we minimize is
- [math]\displaystyle{ \sum_{i \mathop =1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} max\{0,m - S(q_i, a_i) + S(q_i, \overline{a})\}, }[/math]
where m is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix [math]\displaystyle{ \mathbf W }[/math] so the score of a question paired with a correct answer is greater than any incorrect answer [math]\displaystyle{ \overline{a} }[/math] by at least m.
Multitask Training of Embeddings
Since many of the questions in the training cases were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, we also multi-task the training of our model with task of phrase prediction. We do this by alternating the training of S with another scoring function defined as [math]\displaystyle{ S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2) }[/math] which uses the same embedding matrix [math]\displaystyle{ \mathbf{W} }[/math] and makes the same embeddings of a pair of questions if they are similar to each other if they are paraphrases and make them different otherwise.
Inference
Once [math]\displaystyle{ \mathbf{W} }[/math] is trained, at test time, for a given question q the model predicts the answer with:
- [math]\displaystyle{ \hat{a} = argmax_{a^' \in A(q)} S(q, a') }[/math]
where [math]\displaystyle{ A(q) }[/math] is the candidate answer set. For speed and precision issues, we create a candidate set [math]\displaystyle{ A(q) }[/math] for each question.
[math]\displaystyle{ A(q) }[/math] is first populated with all triples involving this entity in Freebase. This allows us to answer simple questions which are directly related to the answer. Let us denote this strategy as [math]\displaystyle{ C_1 }[/math].
A system that answer only such questions would be limited so we also consider 2-hops candidates. 1-hop candidates are weighted by 1.5. This strategy denoted [math]\displaystyle{ C_2 }[/math] is used by default.
Experiments
Table 3 below indicates that their approach outperformed [14], [1] and [5], and performs similarly as [2].
Conclusion
This paper presents an embedding model that learns to perform open QA through training data of question and answer pairs with a KB to provide logical structure among answers. The results have shown that the model can achieve promising performance on the competitive WebQuestions benchmark.