# Difference between revisions of "Semantic Relation Classification——via Convolution Neural Network"

## Revision as of 16:02, 22 November 2020

After all words in the sentence are featurized, a sentence of length [math] N [/math] can be expressed as a vector of length [math] N [/math], $$e=[e_{1},e_{2},\ldots,e_{N}]$$ where each entry represents one token of the sentence. To apply a convolutional neural network, each window of features $$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$ is passed to a weight matrix [math] W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}[/math] to produce a new feature, defined as $$c_{i}=\tanh(W\cdot e_{i:i+k-1}+bias)$$ This process is applied to every window of length [math] k [/math], starting from the first one, which produces the mapped feature vector $$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$
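
The sliding-window step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: each [math] e_{i} [/math] is taken to be a column of dimension [math] d=d^{w}+2d^{wp} [/math], the product [math] W\cdot e_{i:i+k-1} [/math] is read as an elementwise product summed to a scalar, and the function name `convolve_sentence` is hypothetical.

```python
import numpy as np

def convolve_sentence(e, W, bias, k):
    """Return c = [c_1, ..., c_{N-k+1}] for one weight filter.

    e    : array of shape (N, d), one d-dimensional feature vector per token
    W    : array of shape (d, k), a single weight filter (assumed layout)
    bias : scalar bias term
    k    : window length
    """
    N = e.shape[0]
    c = np.empty(N - k + 1)
    for i in range(N - k + 1):
        window = e[i:i + k]  # tokens i .. i+k-1, shape (k, d)
        # Assumption: W . e_{i:i+k-1} is the sum of elementwise products.
        c[i] = np.tanh(np.sum(W * window.T) + bias)
    return c
```

With [math] N=6 [/math] tokens and [math] k=3 [/math], the function returns [math] N-k+1=4 [/math] features, one per window.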

Max pooling is then applied, and [math] \hat{c}=\max\{c\} [/math] is picked. Different weight filters yield different mapped feature vectors, and each one contributes a single pooled value. In this way the original sentence [math] e [/math] is converted into a new representation [math] r_{x} [/math] of fixed length. For example, with 5 filters, 5 features ([math] \hat{c} [/math]) are picked to create [math] r_{x} [/math] for each [math] x [/math].
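
A minimal sketch of how max pooling over several filters gives a fixed-length [math] r_{x} [/math] regardless of sentence length. The names `sentence_representation`, `filters`, and `biases` are hypothetical, and the convolution is interpreted as an elementwise product summed to a scalar, which is an assumption about the notation.

```python
import numpy as np

def sentence_representation(e, filters, biases):
    """Build r_x by max pooling the feature vector of each filter.

    e       : array of shape (N, d), one feature vector per token
    filters : list of arrays of shape (d, k), one per weight filter
    biases  : list of scalars, one per filter
    """
    N = e.shape[0]
    r = np.empty(len(filters))
    for j, (W, b) in enumerate(zip(filters, biases)):
        k = W.shape[1]
        # Mapped feature vector c for this filter (length N - k + 1).
        c = np.array([np.tanh(np.sum(W * e[i:i + k].T) + b)
                      for i in range(N - k + 1)])
        r[j] = c.max()  # c_hat = max{c}
    return r
```

Because each filter contributes exactly one pooled value, 5 filters always give a vector of length 5, whether the sentence has 6 tokens or 60.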

Then the score vector $$s(x)=W^{classes}r_{x}$$ is obtained, which gives a score for each class. The relation between the entities in [math] x [/math] is classified as the class with the highest score. Here [math] W^{classes} [/math] is the weight matrix being trained.
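
The scoring step is a single matrix-vector product followed by an argmax. The sketch below assumes [math] W^{classes} [/math] has one row per class; the function name `classify` and the example numbers are purely illustrative.

```python
import numpy as np

def classify(r_x, W_classes):
    """Return the score vector s(x) = W_classes @ r_x and the predicted class index.

    r_x       : array of shape (n_filters,), the pooled sentence representation
    W_classes : array of shape (n_classes, n_filters), assumed one row per class
    """
    s = W_classes @ r_x
    return s, int(np.argmax(s))

# Illustrative values only: 3 classes, 2 pooled features.
W_classes = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])
r_x = np.array([0.5, -0.2])
scores, predicted = classify(r_x, W_classes)
```

Here the scores are 0.5, -0.2, and 0.3, so the first class (index 0) is predicted.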

To improve performance, "Negative Sampling" was used. Given a training data point [math] \tilde{x} [/math] and its correct class [math] \tilde{y} [/math], let [math] I=Y\setminus\{\tilde{y}\} [/math] represent the incorrect labels for [math] \tilde{x} [/math]. Basically, two quantities should be minimized: the distance between the positive margin and the correct score, and the negative distance (the negative margin plus the largest score among the incorrect labels). So the loss function is $$L=\log\left(1+e^{\gamma(m^{+}-s(\tilde{x})_{\tilde{y}})}\right)+\log\left(1+e^{\gamma(m^{-}+\max_{y'\in I}s(\tilde{x})_{y'})}\right)$$ with margins [math] m^{+} [/math], [math] m^{-} [/math], and penalty scale factor [math] \gamma [/math]. The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, 49,600 of which are unique.
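
The loss can be sketched as below. It follows the verbal description (positive margin minus the correct score, and negative margin plus the largest score among the incorrect labels). The default values for `m_pos`, `m_neg`, and `gamma` are placeholders chosen for illustration, not values stated in this text, and `ranking_loss` is a hypothetical name.

```python
import numpy as np

def ranking_loss(s, y, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """Pairwise ranking loss for one example.

    s     : array of shape (n_classes,), the score vector s(x)
    y     : index of the correct class
    m_pos : positive margin m^+ (placeholder default)
    m_neg : negative margin m^- (placeholder default)
    gamma : penalty scale factor (placeholder default)
    """
    # Largest score among the incorrect labels I = Y \ {y}.
    s_neg = np.delete(s, y).max()
    # np.logaddexp(0, z) computes log(1 + e^z) without overflow.
    pos_term = np.logaddexp(0.0, gamma * (m_pos - s[y]))
    neg_term = np.logaddexp(0.0, gamma * (m_neg + s_neg))
    return float(pos_term + neg_term)
```

The loss is small when the correct score lies well above [math] m^{+} [/math] and every incorrect score lies well below [math] -m^{-} [/math]; it grows when either condition is violated.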