# Semantic Relation Classification via Convolution Neural Network


After featurizing all the words in the sentence, a sentence of length $N$ can be expressed as a vector of length $N$, which looks like

$$e=[e_{1},e_{2},\ldots,e_{N}],$$

where each entry $e_{i}\in\mathbb{R}^{d^{w}+2d^{wp}}$ is the feature vector of one token.
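The featurization itself is not spelled out above; the dimension $d^{w}+2d^{wp}$ of the weight matrix below suggests that each token's feature vector is its word embedding concatenated with two position embeddings (relative distances to the two marked entities). A minimal NumPy sketch under that assumption, with all names and sizes illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_w, d_wp = 50, 5            # word-embedding and position-embedding sizes (illustrative)
vocab_size, max_dist = 1000, 20

word_emb = rng.normal(size=(vocab_size, d_w))          # lookup table for words
pos_emb = rng.normal(size=(2 * max_dist + 1, d_wp))    # lookup table for relative positions

def featurize(token_ids, ent1_idx, ent2_idx):
    """Return e = [e_1, ..., e_N], one column of size d_w + 2*d_wp per token.

    Assumption: e_i = [word embedding; position w.r.t. entity 1; position w.r.t. entity 2].
    """
    cols = []
    for i, tok in enumerate(token_ids):
        p1 = np.clip(i - ent1_idx, -max_dist, max_dist) + max_dist
        p2 = np.clip(i - ent2_idx, -max_dist, max_dist) + max_dist
        cols.append(np.concatenate([word_emb[tok], pos_emb[p1], pos_emb[p2]]))
    return np.stack(cols, axis=1)    # shape (d_w + 2*d_wp, N)

e = featurize(token_ids=[4, 17, 256, 3, 99], ent1_idx=1, ent2_idx=3)
print(e.shape)   # (60, 5), i.e. (d_w + 2*d_wp, N)
```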

To apply a convolutional neural network, each window of $k$ consecutive feature vectors,

$$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}],$$

is given to a weight matrix $W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}$ to produce a new feature, defined as

$$c_{i}=\tanh(W\cdot e_{i:i+k-1}+\text{bias}).$$

This process is applied to all windows of length $k$, starting from the first one. Then a mapped feature vector

$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$

is produced.
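As a minimal NumPy sketch of this convolution step for a single filter (reading $W\cdot e_{i:i+k-1}$ as an elementwise product summed over the window, so that each $c_{i}$ is a scalar, which is an assumption made here to match the max pooling below):

```python
import numpy as np

rng = np.random.default_rng(1)

d, k, N = 60, 3, 5             # feature size d = d_w + 2*d_wp, window k, sentence length N
e = rng.normal(size=(d, N))    # featurized sentence, one column per token
W = rng.normal(size=(d, k))    # one convolution filter
bias = 0.1

# c_i = tanh(<W, e_{i:i+k-1}> + bias), one value per window position
c = np.array([np.tanh(np.sum(W * e[:, i:i + k]) + bias) for i in range(N - k + 1)])
print(c.shape)   # (N - k + 1,)
```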

The max-pooling operation is then applied, and the maximum value $\hat{c}=\max\{c\}$ is picked. With different weight filters, different mapped feature vectors can be obtained. Finally, the original sentence $e$ can be converted into a new representation $r_{x}$ with a fixed length. For example, if there are 5 filters, then there are 5 features ($\hat{c}$) picked to create $r_{x}$ for each $x$.
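A sketch of the pooling step under the same assumptions, using 5 filters so that $r_{x}\in\mathbb{R}^{5}$:

```python
import numpy as np

rng = np.random.default_rng(2)

d, k, N, num_filters = 60, 3, 5, 5
e = rng.normal(size=(d, N))                     # featurized sentence
filters = rng.normal(size=(num_filters, d, k))  # one weight matrix per filter
biases = rng.normal(size=num_filters)

# One mapped feature vector per filter, then hat{c} = max over window positions
C = np.array([[np.tanh(np.sum(filters[f] * e[:, i:i + k]) + biases[f])
               for i in range(N - k + 1)]
              for f in range(num_filters)])     # shape (num_filters, N - k + 1)
r_x = C.max(axis=1)                             # fixed-length sentence representation
print(r_x.shape)   # (5,)
```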

Then the score vector

$$s(x)=W^{classes}r_{x}$$

is obtained, which contains the score for each class; the relation between $x$'s entities will be classified as the one with the highest score. The matrix $W^{classes}$ here is the model being trained.
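The classification step then reduces to a matrix-vector product followed by an argmax; a minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

num_classes, num_filters = 6, 5
W_classes = rng.normal(size=(num_classes, num_filters))  # the trained parameter matrix
r_x = rng.normal(size=num_filters)                       # fixed-length sentence representation

s = W_classes @ r_x               # score vector s(x), one score per relation class
predicted_class = int(np.argmax(s))
print(s, predicted_class)
```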

To improve the performance, “Negative Sampling” was used. Given a training data point $\tilde{x}$ and its correct class $\tilde{y}$, let $I=Y\setminus\{\tilde{y}\}$ represent the incorrect labels for $\tilde{x}$. Basically, the distance between the correct score and the positive margin, and the negative distance (the negative margin plus the largest score among the incorrect classes), should both be minimized. So the loss function is

$$L=\log\left(1+e^{\gamma\left(m^{+}-s(x)_{\tilde{y}}\right)}\right)+\log\left(1+e^{\gamma\left(m^{-}+\max_{y'\in I}s(x)_{y'}\right)}\right)$$

with margins $m^{+}$ and $m^{-}$, and penalty scale factor $\gamma$.
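A sketch of this ranking loss for a single example; the margin and $\gamma$ values below are illustrative defaults, not taken from the paper:

```python
import numpy as np

def ranking_loss(scores, true_class, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """L = log(1 + exp(gamma*(m+ - s_y))) + log(1 + exp(gamma*(m- + max_{y' != y} s_y')))."""
    s_true = scores[true_class]
    s_best_wrong = np.max(np.delete(scores, true_class))  # largest incorrect score
    return (np.log1p(np.exp(gamma * (m_pos - s_true)))
            + np.log1p(np.exp(gamma * (m_neg + s_best_wrong))))

scores = np.array([1.2, -0.3, 0.8, -1.1, 0.1, 2.0])
print(ranking_loss(scores, true_class=5))
```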

The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, 49,600 of which are unique.
