Question Answering with Subgraph Embeddings



Teaching machines to answer questions automatically in natural language has been a long-standing goal in AI. The rise of large-scale structured knowledge bases (KBs), such as Freebase <ref name=one>K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. "Freebase: a collaboratively created graph database for structuring human knowledge." In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.</ref>, has provided new resources for tackling this problem, known as open-domain question answering (or open QA). However, the scale of these KBs and the difficulty machines have in interpreting natural language still make this problem challenging.

Open QA techniques can be classified into two main categories:

question-answer pairs."] . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref name=six>T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. "Scaling Semantic Parsers with On-the-fly Ontology Matching." In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October 2013.</ref><ref>J. Berant and P. Liang. "Semantic parsing via paraphrasing." In Proceedings of the 52nd Annual Meeting of the ACL, 2014.</ref><ref>A. Fader, L. Zettlemoyer, and O. Etzioni. "Open Question Answering Over Curated and Extracted Knowledge Bases." In Proceedings of KDD’14. ACM, 2014.</ref>.

Both of these approaches require non-negligible interventions (hand-crafted lexicons, grammars and KB schemas) to be effective.

Bordes et al.<ref name=five>A. Bordes, J. Weston, and N. Usunier. "Open question answering with weakly supervised embedding models." In Proceedings of ECML-PKDD’14. Springer, 2014.</ref> proposed a vectorial feature representation model for this problem. The goal of this paper is to improve on the model of Bordes et al.<ref name=five/>, specifically through the following contributions:

  • A more sophisticated inference procedure that is more efficient and can consider longer paths.
  • A richer representation of the answers, which encodes the question-answer path and the surrounding subgraph of the KB.

Task Definition

The motivation is to provide an open QA system that can be trained as long as it has access to:

  • A training set of questions paired with answers.
  • A KB providing a structure among the answers.

WebQuestions <ref name=one/> was used as the evaluation benchmark. Because WebQuestions contains relatively few samples, the system could not be trained on this dataset alone. The following data sources were used for training.

  • WebQuestions: This dataset was built using Freebase as the KB and contains 5810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk (Turkers were allowed to use only Freebase as the querying tool).
  • Freebase: A huge database of general facts organized as triplets (subject, type1.type2.predicate, object). The form of the data in Freebase does not correspond to a structure found in natural language, so each triplet was converted into a question using the following format: "What is the predicate of the type2 subject?" Note that all questions generated from Freebase therefore share a fixed format, which is not realistic for natural language. Note also that only Freebase triplets containing at least one entity found in the WebQuestions and ClueWeb datasets (see the next point) were used. This results in a set of 14 million Freebase triplets that were used to generate training questions.
  • ClueWeb Extractions: The team also used ClueWeb extractions as per <ref name=one/> and <ref name=ten>T. Lin, O. Etzioni, et al. "Entity Linking at Web Scale." In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 84–88. Association for Computational Linguistics, 2012.</ref>. Each ClueWeb extraction has the format (subject, "text string", object), and both the subject and the object were ensured to be linked to Freebase. These triples were also converted into questions using simple patterns and Freebase types.

  • Paraphrases: Automatically generated sentences have a rigid format and semi-automatic wording, which does not provide a satisfactory model of natural language. To overcome this, the team supplemented their data with paraphrases collected from WikiAnswers. Users on WikiAnswers can tag sentences as rephrasings of each other: <ref name=six/> harvested 2M distinct questions from WikiAnswers, which were grouped into 350k paraphrase clusters.

Table 2 shows some example sentences from each dataset category.
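The conversion of Freebase triplets into questions described above can be sketched in a few lines. This is only an illustration of the stated template ("What is the predicate of the type2 subject?"); the function name, the underscore handling, and the object entity `child_entity` are assumptions made for this example and are not taken from the paper.

```python
def triplet_to_question(subject, relation, obj):
    """Turn a Freebase-style triplet into a (question, answer) pair.

    `relation` is expected in the form "type1.type2.predicate".
    """
    type1, type2, predicate = relation.split(".")
    # Assumption: identifiers use underscores, which we replace with spaces.
    words = [s.replace("_", " ") for s in (predicate, type2, subject)]
    question = "What is the {} of the {} {}?".format(*words)
    return question, obj.replace("_", " ")


# `child_entity` is a hypothetical placeholder for the object of the triplet.
question, answer = triplet_to_question(
    "david_beckham", "people.person.children", "child_entity"
)
print(question)  # What is the children of the person david beckham?
```

The ungrammatical output illustrates why the generated questions are considered rigid and why paraphrase data is added.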

Embedding Questions and Answers

We wish to train our model such that representations of questions and their corresponding answers are close to each other in the joint embedding space. Let q denote a question and a denote an answer. Learning embeddings is achieved by learning a scoring function S(q, a) such that S generates a high score if a is the correct answer to q, and a low score otherwise.

[math] S(q, a) = f(q)^\mathrm{T} g(a) \,[/math]

Let [math]\mathbf{W}[/math] be a matrix in [math]\mathbb{R}^{k \times N}[/math], where k is the dimension of the embedding space and N is the size of the dictionary of embeddings to be learned. With [math]N_W[/math] denoting the number of words and [math]N_S[/math] the number of entities and relation types, [math]N = N_W + N_S[/math]. The function [math]f(\cdot)[/math], which maps questions into the embedding space [math]\mathbb{R}^{k}[/math], is defined as [math]f(q) = \mathbf{W}\phi(q)[/math], where [math]\phi(q) \in \mathbb{N}^N[/math] is a sparse vector indicating the number of times each word appears in the question q. Likewise, the function [math]g(\cdot)[/math], which maps answers into the same embedding space [math]\mathbb{R}^{k}[/math], is defined as [math]g(a) = \mathbf{W}\psi(a)[/math], where [math]\psi(a) \in \mathbb{N}^N[/math] is a sparse vector representation of the answer a. Figure 1 below depicts the subgraph embedding model.
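To make the scoring function concrete, the following minimal sketch computes S(q, a) as the dot product of the two embedded sparse vectors. It stores each column of W under its dictionary entry; the toy embedding values, the `entity:` prefix, and the example question are all assumptions chosen for illustration, not values from the paper.

```python
from collections import Counter


def embed(W, sparse):
    """Compute W x for a sparse count vector x given as {feature: count}."""
    k = len(next(iter(W.values())))
    out = [0.0] * k
    for feature, count in sparse.items():
        column = W[feature]
        for d in range(k):
            out[d] += count * column[d]
    return out


def score(W, phi_q, psi_a):
    """S(q, a) = f(q)^T g(a), with f(q) = W phi(q) and g(a) = W psi(a)."""
    f_q = embed(W, phi_q)
    g_a = embed(W, psi_a)
    return sum(x * y for x, y in zip(f_q, g_a))


# Toy embedding matrix with k = 2 (hypothetical values); one column per entry.
W = {
    "who": [0.1, 0.0],
    "directed": [0.0, 1.0],
    "avatar": [1.0, 0.0],
    "entity:james_cameron": [0.5, 0.8],
    "entity:new_york": [-0.5, -0.2],
}
phi_q = Counter("who directed avatar".split())  # word counts of the question
good = score(W, phi_q, {"entity:james_cameron": 1})
bad = score(W, phi_q, {"entity:new_york": 1})
print(good > bad)  # True
```

Because questions and answers share the same matrix W, words and KB symbols live in one embedding space, which is what allows a plain dot product to compare them.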

Representing Candidate Answers

Let us now consider possible feature representations for a single candidate answer. We consider three different representations corresponding to different subgraphs of Freebase around it. All of these approaches make use of a procedure that identifies a single entity in Freebase that is present in the question being asked. If the question mentions multiple entities, the entity that is present in the largest number of Freebase triplets is chosen. This entity is used to restrict the number of candidate answers that are considered by the model when it is used to perform inference, as explained below. The three approaches to feature representation are:

(i) Single Entity: The answer is represented as a single entity from Freebase. [math]\psi(a)[/math] is a 1-of-[math]N_S[/math] coded vector with 1 corresponding to the entity of the answer, and 0 elsewhere.
(ii) Path Representation: The answer is represented as a path from the target entity in the question to the answer entity. Only 1-hop and 2-hop paths were considered in the experiments, so [math]\psi(a)[/math] is a 3-of-[math]N_S[/math] or 4-of-[math]N_S[/math] coded vector. Notably, only the entity drawn from the question, the answer entity, and the relations are included in the representation (i.e. the intermediate entity on the path between the question entity and the answer entity is not included in the feature representation). One interesting consequence of this choice is that the feature representation cannot distinguish between paths with identical start and end points that traverse the same relation types in a different order through different intermediary nodes.
(iii) Subgraph Representation: We encode both the path representation from (ii), and the entire subgraph of entities that connect to the answer entity. This subgraph includes every entity that is directly connected to the answer entity. Both the surrounding entities and the relation types that connect them to the answer entity are included in the feature representation. Distinct features are used to represent the entities and relations present in the path representation and those present in the subgraph, which increases the size of the embedding matrix [math]\mathbf{W}[/math].

The hypothesis is that the more information that we include about the answer in its representation space, the better the results, and hence, the authors adopted the subgraph approach.
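The three representations can be sketched as sparse binary feature sets. This is an illustrative encoding only: the tuple-based feature keys and the `subgraph_` prefixes (used to keep subgraph features distinct from path features, as described in (iii)) are assumptions of this sketch, as are the placeholder entity and relation names.

```python
def single_entity(answer):
    """(i) 1-of-N_S coding: only the answer entity is active."""
    return {("entity", answer): 1}


def path_representation(question_entity, relations, answer):
    """(ii) Question entity, the 1 or 2 relation types on the path, and the
    answer entity. Intermediate entities are deliberately left out."""
    feats = {("entity", question_entity): 1, ("entity", answer): 1}
    for rel in relations:
        feats[("relation", rel)] = 1
    return feats


def subgraph_representation(question_entity, relations, answer, neighbours):
    """(iii) Path features plus every (relation, entity) pair directly
    connected to the answer, encoded with distinct feature keys."""
    feats = path_representation(question_entity, relations, answer)
    for rel, ent in neighbours:
        feats[("subgraph_relation", rel)] = 1
        feats[("subgraph_entity", ent)] = 1
    return feats


one_hop = path_representation(
    "david_beckham", ["people.person.children"], "child_entity"
)
print(len(one_hop))  # 3 active features for a 1-hop path
```

Since the path features are stored as an unordered set, two 2-hop paths that use the same relation types in a different order map to the same representation, which is the limitation noted in (ii).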

Training and Loss Function

The model was trained using a margin-based ranking loss function. Let [math]D = \{(q_i, a_i) : i = 1, \ldots, |D|\}[/math] be the training set of questions [math]q_i[/math] paired with their correct answer [math]a_i[/math]. The loss function we minimize is

[math]\sum_{i=1}^{|D|} \sum_{\overline{a} \in \overline{A} (a_i)} \max\{0, m - S(q_i, a_i) + S(q_i, \overline{a})\},[/math]

where m is the margin (fixed to 0.1). Minimizing the loss function learns the embedding matrix [math]\mathbf W[/math] so that the score of a question paired with a correct answer exceeds its score with any incorrect answer [math]\overline{a}[/math] by at least m. [math]\overline{a}[/math] is sampled from the set of incorrect answers [math]\overline{A}[/math] such that 50% of the time the sampled answer entity is an incorrect answer from the candidate set (i.e., related to the Freebase entity in the question) and 50% of the time it is a random incorrect answer entity.
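A single term of this loss and the 50/50 negative sampling scheme can be sketched as follows. The function names, the argument layout, and the fallback used when the candidate set contains no incorrect answer are assumptions of this sketch; the paper does not specify these details.

```python
import random


def hinge_term(score_pos, score_neg, margin=0.1):
    """One term of the ranking loss: max{0, m - S(q, a) + S(q, a_bar)}."""
    return max(0.0, margin - score_pos + score_neg)


def sample_negative(correct, candidates, all_entities, rng):
    """Draw an incorrect answer: half the time from the question's candidate
    set, half the time from the full entity list."""
    source = candidates if rng.random() < 0.5 else all_entities
    pool = [a for a in source if a not in correct]
    if not pool:
        # Assumption: fall back to the full entity list when the candidate
        # set holds no incorrect answer.
        pool = [a for a in all_entities if a not in correct]
    return rng.choice(pool)


rng = random.Random(0)
negative = sample_negative({"a"}, ["a", "b"], ["a", "b", "c", "d"], rng)
print(hinge_term(1.0, 0.2))  # 0.0, the margin is already satisfied
```

Sampling half of the negatives from the candidate set gives the model hard examples (entities near the question entity), while the random half keeps the embeddings of unrelated entities well separated.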

The function is optimized using stochastic gradient descent with the constraint that for every column [math]\,w_i[/math] of [math]\mathbf W[/math], [math]||w_i||_2 \leq 1[/math].
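One way to realize this constrained optimization is projected SGD: take a gradient step on an active hinge term, then rescale any column whose norm exceeds 1. The sketch below assumes this projection scheme, a fixed learning rate, and a dictionary-of-columns layout for W; none of these implementation details are given in the paper. The gradient follows from S(q, a) = (W phi(q))^T (W psi(a)).

```python
import math


def embed(W, sparse, k):
    """Compute W x for a sparse vector x given as {feature: count}."""
    out = [0.0] * k
    for feat, count in sparse.items():
        for d in range(k):
            out[d] += count * W[feat][d]
    return out


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def project_columns(W):
    """Rescale every column with L2 norm above 1 back to norm 1."""
    for col in W.values():
        norm = math.sqrt(sum(x * x for x in col))
        if norm > 1.0:
            for d in range(len(col)):
                col[d] /= norm


def sgd_step(W, phi_q, psi_pos, psi_neg, k, lr=0.1, margin=0.1):
    """One projected SGD update; returns the loss measured before the update."""
    f_q = embed(W, phi_q, k)
    g_pos = embed(W, psi_pos, k)
    g_neg = embed(W, psi_neg, k)
    loss = max(0.0, margin - dot(f_q, g_pos) + dot(f_q, g_neg))
    if loss > 0.0:
        grads = {}
        # Columns of question words: phi_j * (g(a_bar) - g(a)).
        for feat, count in phi_q.items():
            g = grads.setdefault(feat, [0.0] * k)
            for d in range(k):
                g[d] += count * (g_neg[d] - g_pos[d])
        # Columns of answer features: -psi_j * f(q) for the correct answer,
        # +psi_j * f(q) for the incorrect one.
        for sign, psi in ((-1.0, psi_pos), (1.0, psi_neg)):
            for feat, count in psi.items():
                g = grads.setdefault(feat, [0.0] * k)
                for d in range(k):
                    g[d] += sign * count * f_q[d]
        for feat, g in grads.items():
            for d in range(k):
                W[feat][d] -= lr * g[d]
        project_columns(W)
    return loss
```

Only the columns touched by the current question and the two answers are updated, so each step is cheap even when the dictionary is very large.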

Multitask Training of Embeddings

Since many of the questions in the training set were synthetically created, they do not adequately cover the range of syntax used in natural language. Hence, the authors also multi-task the training of the model with the task of paraphrase prediction. They do this by alternating the training of S with that of another scoring function, defined as [math]S_{prp}(q_1, q_2) = f(q_1)^\mathrm{T} f(q_2)[/math], which uses the same embedding matrix [math]\mathbf{W}[/math] and makes the embeddings of a pair of questions similar to each other if they are paraphrases and different otherwise.


Once [math]\mathbf{W}[/math] is trained, at test time, for a given question q the model predicts the answer with:

[math]\hat{a} = \operatorname{argmax}_{a' \in A(q)} S(q, a')[/math]

where [math]A(q)[/math] is the candidate answer set. Scoring every entity in the KB would raise both speed and precision issues, so a restricted candidate set [math]A(q)[/math] is created for each question instead.

[math]A(q)[/math] is first populated with all triples involving the selected Freebase entity from the question under consideration. This allows the model to answer questions whose answer is located within 1 hop of the selected Freebase entity. Call this strategy [math]C_1[/math]. Note that this strategy is actually quite limiting, because it amounts to restricting the search for an answer to the subgraph of Freebase that contains entities directly connected to the question entity.

Given this limitation, the authors also consider 2-hop candidate answers. Any entity connected to the question entity by a path of no more than 2 relations is a potential candidate, but since this results in a very large set of candidates, the authors employ beam search to restrict the candidate entities to those connected to the question entity by a path that includes at least one of the ten relation types most likely expressed in the question. An answer is then selected from this pruned candidate set using the previously discussed scoring method, with the scores of 1-hop candidates weighted by a factor of 1.5 to compensate for the fact that they include fewer elements contributing to the magnitude of the dot product used to score each candidate answer (when compared to 2-hop candidates). This overall strategy, denoted [math]C_2[/math], is used by default.
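The candidate generation and weighted selection just described can be sketched as follows. The adjacency-list KB layout, the `relation_score` callable used to rank relation types, and all entity and relation names are assumptions of this sketch; how the paper ranks relation types is not detailed here.

```python
def candidate_set(kb, question_entity, relation_score, beam=10):
    """Return (path, entity) candidates: all 1-hop neighbours, plus 2-hop
    neighbours whose path uses at least one of the `beam` best relations.

    `kb` maps an entity to a list of (relation, neighbour) edges.
    """
    edges = kb.get(question_entity, [])
    one_hop = [((rel,), ent) for rel, ent in edges]
    all_relations = {rel for out in kb.values() for rel, _ in out}
    top = set(sorted(all_relations, key=relation_score, reverse=True)[:beam])
    two_hop = []
    for rel1, mid in edges:
        for rel2, ent in kb.get(mid, []):
            if ent != question_entity and (rel1 in top or rel2 in top):
                two_hop.append(((rel1, rel2), ent))
    return one_hop + two_hop


def predict(candidates, score_fn, one_hop_weight=1.5):
    """Pick the best candidate, boosting the scores of 1-hop candidates."""
    def weighted(item):
        path, ent = item
        s = score_fn(path, ent)
        return s * one_hop_weight if len(path) == 1 else s
    return max(candidates, key=weighted)[1]


# Hypothetical toy KB and relation scores.
kb = {"q": [("r1", "a"), ("r2", "m")], "m": [("r3", "b")], "a": [("r4", "c")]}
rel_scores = {"r1": 0.9, "r2": 0.1, "r3": 0.8, "r4": 0.0}
cands = candidate_set(kb, "q", rel_scores.get, beam=1)
answer_scores = {"a": 0.5, "m": 0.1, "c": 0.7}
best = predict(cands, lambda path, ent: answer_scores[ent])
print(best)  # a
```

In this toy run the 2-hop entity `c` has the highest raw score (0.7), but the 1-hop entity `a` wins after weighting (0.5 × 1.5 = 0.75), which shows the effect of the 1.5 factor.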
A prediction [math]a'[/math] is often a set of candidate answers rather than a single answer, for example for questions like "Who are David Beckham's children?". This is handled by considering a prediction to be all the entities that lie on the same 1-hop or 2-hop path from the entity found in the question. Hence, all answers to the above question are connected to david_beckham via the same path (david_beckham, people.person.children, *). The feature representation of the prediction is then the average over each candidate entity's features, i.e.
[math] \psi_{all}(a') = \frac{1}{|a'|}\sum_{a'_{j} \in a'}\psi(a'_{j}) [/math]
where [math]a'_{j}[/math] are the individual entities in the overall prediction [math]a'[/math]. In the results, we compare to a baseline method that can only predict single candidates, which understandably performs poorly.
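The averaging of feature vectors over a set-valued prediction can be sketched as below. Sparse vectors are stored as dictionaries, and the feature keys and entity names are hypothetical placeholders.

```python
def average_features(entity_features):
    """psi_all(a') = (1 / |a'|) * sum of psi(a'_j) over the entities in a'.

    `entity_features` is a list of sparse vectors, one dict per entity.
    """
    n = len(entity_features)
    avg = {}
    for feats in entity_features:
        for key, value in feats.items():
            avg[key] = avg.get(key, 0.0) + value / n
    return avg


# Two answers reached through the same path from the question entity.
psi_1 = {"david_beckham": 1, "people.person.children": 1, "child_1": 1}
psi_2 = {"david_beckham": 1, "people.person.children": 1, "child_2": 1}
psi_all = average_features([psi_1, psi_2])
print(psi_all["people.person.children"], psi_all["child_1"])  # 1.0 0.5
```

Features shared by every entity in the set (the question entity and the path) keep their full weight, while entity-specific features are scaled down by the size of the set.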


Table 3 below indicates that their approach outperformed <ref name=four/>, <ref name=one/> and <ref name=five/>, and performs similarly to <ref name=two/>.


This paper presents an embedding model that learns to perform open QA from training data consisting of question-answer pairs, together with a KB that provides logical structure among the answers. The results show that the model achieves promising performance on the competitive WebQuestions benchmark.


<references />