A Knowledge-Grounded Neural Conversation Model: Difference between revisions
(Created page with " == A Knowledge-Grounded Neural Conversation Model == == Introduction == == Grounded Response Generation == == Datasets == == Experimental Setup == == Results == == Dis...") |
|||
Line 12: | Line 12: | ||
== Results == | == Results == | ||
In order to assess the effectiveness of the multi-tasking neural conversation model over the SEQ2SEQ model, the perplexity, BLEU scores, lexical diversity, appropriateness, and informativeness of the output datum were evaluated. Perplexity is measured by how well the model is able to evaluate the input data. Lexical diversity is the ratio of unique words to the total number of words in the input data. The BLEU score measures the quality of the encoded text. Appropriateness is how fitting the output response is to the input data. Informativeness is how useful and actionable the output response is in terms of the input data. Perplexity, lexical diversity, and BLEU scores were all carried out automatically, whereas appropriateness and informativeness were analyzed by human judges. | |||
In terms of perplexity, the MTASK and SEQ2SEQ models all performed almost equally as low when using general data. When using grounded data, the perplexity of all the models increase, however the increase in perplexity for MTASK and MTASK-R are lower than that for SEQ2SEQ and SEQ2SEQ-S. This suggests that the MTASK models perform better than the SEQ2SEQ models in terms of evaluating the inputs. In terms of BLEU scores, MTASK-R produces a high value of 1.08, indicating that it greatly outperforms the other MTASK and SEQ2SEQ models. In terms of lexical diversity, MTASK-RF has the highest percentage of word diversity among all the models. In terms of appropriateness, the performance of the MTASK models were only slightly better than the SEQ2SEQ models. In terms of informativeness, the MTASK-R model generally outperforms the other MTASK and SEQ2SEQ models. It was also discovered that MTASK-F is highly informative but struggles with appropriateness of the conversation. On the other hand, MTASK-R is able to produce appropriate responses but is not significantly better in terms of informativeness. | |||
== Discussion == | == Discussion == |
Revision as of 13:08, 19 March 2018
A Knowledge-Grounded Neural Conversation Model
Introduction
Grounded Response Generation
Datasets
Experimental Setup
Results
In order to assess the effectiveness of the multi-tasking neural conversation model over the SEQ2SEQ model, the perplexity, BLEU scores, lexical diversity, appropriateness, and informativeness of the output datum were evaluated. Perplexity is measured by how well the model is able to evaluate the input data. Lexical diversity is the ratio of unique words to the total number of words in the input data. The BLEU score measures the quality of the encoded text. Appropriateness is how fitting the output response is to the input data. Informativeness is how useful and actionable the output response is in terms of the input data. Perplexity, lexical diversity, and BLEU scores were all carried out automatically, whereas appropriateness and informativeness were analyzed by human judges.
In terms of perplexity, the MTASK and SEQ2SEQ models all performed almost equally as low when using general data. When using grounded data, the perplexity of all the models increase, however the increase in perplexity for MTASK and MTASK-R are lower than that for SEQ2SEQ and SEQ2SEQ-S. This suggests that the MTASK models perform better than the SEQ2SEQ models in terms of evaluating the inputs. In terms of BLEU scores, MTASK-R produces a high value of 1.08, indicating that it greatly outperforms the other MTASK and SEQ2SEQ models. In terms of lexical diversity, MTASK-RF has the highest percentage of word diversity among all the models. In terms of appropriateness, the performance of the MTASK models were only slightly better than the SEQ2SEQ models. In terms of informativeness, the MTASK-R model generally outperforms the other MTASK and SEQ2SEQ models. It was also discovered that MTASK-F is highly informative but struggles with appropriateness of the conversation. On the other hand, MTASK-R is able to produce appropriate responses but is not significantly better in terms of informativeness.