http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=42409
F18-STAT841-Proposal
2018-12-12T18:32:49Z
<p>Q26deng: </p>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge, a Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in the lungs, with symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous, especially to infants and the elderly. In 2015, 920,000 children under the age of 5 died from this disease [2]. Because of its fatality to children, diagnosing pneumonia is a high priority. A common method of diagnosing pneumonia is to obtain a patient's chest radiograph (CXR), a grey-scale x-ray image of the patient's chest. The region infected by pneumonia usually shows as an area or areas of increased opacity [3] on the CXR. However, many other factors can also contribute to increased opacity on a CXR, which makes the diagnosis very challenging. The diagnosis also requires highly skilled clinicians and a great deal of CXR screening time. The Radiological Society of North America (RSNA®) sees an opportunity to use machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from Kaggle. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object and almost certainly differs from person to person. Mislabeled images also present a problem, since they create outliers when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Exploring Kuzushiji-MNIST, a new classification benchmark<br />
<br />
'''Description:''' <br />
<br />
The paper ''Deep Learning for Classical Japanese Literature'' presents a new classification dataset intended to act as a drop-in replacement for MNIST. The paper's authors believe that this dataset is significantly more difficult than MNIST for typical classification methods, while not "capping" performance through indiscernible objects as Fashion-MNIST might. <br />
The goals are to: <br />
<br />
- perform a survey of typical machine-learning algorithms on Kuzushiji-MNIST compared to both MNIST and Fashion-MNIST<br />
<br />
- investigate relevant differences in the structures of the datasets<br />
<br />
- assess whether Fashion-MNIST does indeed seem to have a performance cap that can be overcome with Kuzushiji-MNIST<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Pei Wei, Wang<br />
<br />
Daoyi, Chen<br />
<br />
Yiming, Li<br />
<br />
Ying, Chi<br />
<br />
'''Title:''' Kaggle Challenge: Airbus Ship Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Image classification has attracted more and more attention, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms that are optimized for a specific type of image classification. <br />
<br />
In this project, we will dig into the base classifiers we learned in class and combine them to find an optimal solution for a particular type of image dataset. Currently, we are looking into a dataset from Kaggle: the Airbus Ship Detection Challenge.<br />
<br />
As machine learning students, we are eager to help develop a better classification method. By “better”, we mean one that strikes a balance between simplicity and accuracy. We will start with neural networks using different activation functions in each layer, and we will also combine base classifiers with bagging, random forests, and boosting for ensemble learning. We will also try to regularize our parameters to avoid overfitting on the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem. <br />
<br />
Hopefully, we can not only finish our project successfully, but also make a small contribution to the machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/PLAsTiCC-2018 PLAsTiCC Astronomical Classification ]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the PLAsTiCC Astronomical Classification competition featured on Kaggle. We will explore how well astronomical objects can be classified based on various factors such as brightness.<br />
<br />
These objects vary in time and size, and some are unknown! There are over 100 classes that these objects may belong to, and it will be our job to predict the probability of each class for every object.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Quora Insincere Questions Classification<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition Quora Insincere Questions Classification. In this competition, we must predict whether a question asked on Quora is sincere or not. An insincere question is one intended as a statement rather than a genuine request for helpful answers, and is labelled target = 1. <br />
We will analyze the Quora question text to predict the characteristics of the questions and classify them as sincere or insincere using RStudio. Our report will describe not only the conclusions we reached by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Telecom Customer Churn Prediction<br />
<br />
<br />
'''Description:''' <br />
The traditional telecommunication industry is made up of telecommunication companies and internet service providers, which play an important role in daily life. It is crucial for telecommunication companies to analyze and maintain their relationships with existing customers, as well as to win new customers with marketing strategies. However, it costs five times as much to attract a new customer as to keep an existing one. Therefore, retaining existing customers and building loyal relationships are key concerns for traditional telecommunication companies that want to stay strong in the competition. This project aims to provide insights for telecom companies by predicting the chance of a customer leaving the company. We will apply different classification models such as random forest, gradient boosting, logistic regression, and XGBoost, and then compare each model's performance. <br />
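As a rough illustration only (not the group's actual code), the sketch below compares a few scikit-learn classifiers with cross-validation on synthetic stand-in data; the real project would substitute the telecom churn dataset and add XGBoost.<br />
<pre>
# Hedged sketch: compare several classifiers, as the project proposes to do.
# Synthetic data stands in for the real churn dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
</pre>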
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
McLellan, Isaac<br />
<br />
Brewster, Kristi<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification<br />
<br />
'''Description:''' <br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Human Protein Atlas Image Classification<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm that starts by training a small neural network and then expands it into a larger one. We hypothesize that, with a proper expansion method, we could decrease training time and prevent overfitting. The method we wish to explore is to link input dimensions together based on covariance. When the small neural network reaches convergence, we create a larger neural network without the links between dimensions, using starting values taken from the smaller network. <br />
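The following is only a minimal sketch of how the expansion step could look, written in PyTorch with hypothetical helper names; it is not the group's implementation.<br />
<pre>
# Hedged sketch of the proposed expansion step (illustrative only).
# Highly correlated input dimensions are grouped for the small network; after
# convergence, the first-layer weights are copied back to every member of a
# group to initialize the larger network.
import torch
import torch.nn as nn

def group_by_covariance(X, threshold=0.9):
    """Greedily group feature indices whose absolute correlation exceeds threshold."""
    corr = torch.corrcoef(X.T)
    groups, used = [], set()
    for i in range(X.shape[1]):
        if i in used:
            continue
        group = [i] + [j for j in range(i + 1, X.shape[1])
                       if j not in used and abs(corr[i, j]) > threshold]
        used.update(group)
        groups.append(group)
    return groups

def expand_first_layer(small_layer, groups, n_inputs):
    """Initialize a full-width first layer from the trained grouped layer."""
    big = nn.Linear(n_inputs, small_layer.out_features)
    with torch.no_grad():
        for k, group in enumerate(groups):
            for j in group:
                big.weight[:, j] = small_layer.weight[:, k] / len(group)
        big.bias.copy_(small_layer.bias)
    return big
</pre>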
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve the accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1], which contains gene-expression data from tumour biopsies, to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way the data was collected. Some parts of the dataset have thousands of features (each representing a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose several ways to mitigate the impact of this, including the use of PCA, leave-one-out cross-validation, or regularization. <br />
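The snippet below is a small sketch, on synthetic stand-in data, of one such mitigation: PCA to reduce thousands of gene-expression features, followed by leave-one-out cross-validation of a regularized logistic regression.<br />
<pre>
# Hedged sketch (synthetic stand-in for the Golub et al. data): PCA + LOOCV
# to cope with thousands of features but only a few dozen samples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))     # 40 biopsies, 5000 gene-expression features
y = rng.integers(0, 2, size=40)     # 0 = healthy, 1 = cancer (synthetic labels)

model = make_pipeline(PCA(n_components=10),
                      LogisticRegression(C=0.5, max_iter=1000))
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
</pre>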
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which cabin class in airlines is the most profitable? One might guess economy, but in reality it is the premium classes that show higher returns. According to research by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than the other classes.<br />
<br />
In fact, just like airlines, many companies adopt a business model where the vast majority of revenue is derived from a minority of customers. As a result, data-intensive promotional strategies are receiving more and more attention from marketing teams as a way to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer. We will apply a series of data analytics methods, including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction - A Kaggle Competition<br />
<br />
'''Description:''' The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.<br />
<br />
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.<br />
<br />
In this competition, participants are challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and better use of marketing budgets for companies that choose to use data analysis on top of GA data.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:''' We will analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors, to identify individual whales from images of their tails.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
Mall, Sunil<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
'''Description:''' Our team presents the Unsupervised Lexicon-Based Sentiment Topic Model (ULSTM), a sentiment analysis model for reviews on the popular crowd-sourced review forum Yelp. The model uses unsupervised learning, since supervised methods come with many constraints. Furthermore, instead of employing an existing sentiment lexicon, we develop a sentiment dictionary using the linguistic corpus WordNet; the self-defined lexicon allows more targeted scoring for the evaluated dataset. Finally, the ULSTM adopts the Latent Dirichlet Allocation model to find the most-mentioned topics in reviews of individual businesses.<br />
<br />
'''Dataset''': Yelp Review Dataset from Kaggle<br />
----------------------------------------------------------------------<br />
'''Project # 20'''<br />
Group Members:<br />
<br />
Dong, Yongqi (Michael)<br />
<br />
Kingston, Stephen<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements <br />
<br />
'''Description:''' The movement in the price of a tradeable security, or stock, on any given day is an aggregation of each market participant’s appraisal of the intrinsic value of the underlying company or assets. These values are primarily driven by investors’ expectations of the company’s ability to generate future free cash flow. A steady stream of information on the state of the macro- and micro-economic variables that affect a company’s operations informs these market actors, primarily through news articles and alerts. We would like to take a universe of news headlines and parse the information into features that allow us to classify the direction and ‘intensity’ of a stock’s price move on any given day. Strategies may include various classification methods to determine the most effective solution.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 21'''<br />
Group members:<br />
<br />
Xiao, Alexandre<br />
<br />
Zhang, Richard<br />
<br />
Ash, Hudson<br />
<br />
Zhu, Ziqiu<br />
<br />
'''Title:''' Image Segmentation with Capsule Networks using CRF loss<br />
<br />
'''Description:''' Investigate the impact of changing the loss function/regularizers on image segmentation tasks with capsule networks.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 22'''<br />
Group Members:<br />
<br />
Lee, Yu Xuan<br />
<br />
Heng, Tsen Yee<br />
<br />
'''Title:''' Wine Rating Prediction<br />
<br />
'''Description:''' Predict wine ratings with the help of machine learning. Using variables from a wine-review dataset found on Kaggle, we show that review points, price, and year of production are crucial in determining the value of a bottle of wine. The formula for the price increase per point is taken from www.vivino.com. From this information, we can determine which wines are worth buying!<br />
<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 23'''<br />
Group Members:<br />
<br />
Bayati, Mahdiyeh<br />
<br />
Malek Mohammadi, Saber<br />
<br />
Luong, Vincent<br />
<br />
<br />
'''Title:''' Human Protein Atlas Image Classification<br />
<br />
<br />
'''Description:''' The Human Protein Atlas is a Sweden-based initiative aimed at mapping all human proteins in cells, tissues and organs.<br />
<br />
-------------------------------------------------------------------------<br />
<br />
'''Project # 24'''<br />
Group Members:<br />
<br />
Wu Yutong, <br />
<br />
Wang Shuyue,<br />
<br />
Jiao Yan<br />
<br />
'''Title:''' Kaggle Competition: Quora Insincere Questions Classification<br />
<br />
'''Description:''' Quora is a question-and-answer website where users can ask questions and share opinions. For the company, one key challenge is to identify insincere questions, defined as those founded upon false premises or intended to make a statement rather than seek helpful answers. This report is about classifying Quora questions as "Sincere" or "Insincere". The data used in this project was prepared by Quora and can be found on the Kaggle competition website. We tried a Bi-GRU and Capsule Network model, along with a blend of LSTM and CNN models. Experiments demonstrated that they have similar performance.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38418
stat441F18/TCNLM
2018-11-08T16:47:57Z
<p>Q26deng: /* Model Evaluation */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences<sup>[[#References|[1]]]</sup>. TCNLMs were also developed to address the inability of [[#RNN (LSTM)| RNN]]-based neural language models to capture broad document context. After learning the global semantics, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model is [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA)<sup>[[#References|[4]]]</sup>, which assumes that each document contains various topics, but in different proportions. LDA parameterizes the topic distribution by the Dirichlet distribution and calculates the marginal likelihood as follows:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document and predicts the topic distribution of the document, with the aim of identifying the global semantic meaning of the document.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as follows:<br />
<br />
*<math>d</math> is a document with <math> D </math> distinct vocabulary words<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> is the bag-of-words representation of document <math>d</math> (each element of <math>\boldsymbol{d}</math> is the number of times the corresponding word appears in <math>d</math>)<br />
*<math>\boldsymbol{t}</math> is the topic proportion for document <math>d</math><br />
*<math>T</math> is the number of topics<br />
*<math>z_n</math> is the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> is the transition matrix from the topic distribution trained in the decoder, where <math>\beta_i \in \mathbb{R}^D</math> is the word distribution of the i-th topic.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterizes the multinomial document topic distribution<sup>[[#References|[7]]]</sup>. However, instead of a Dirichlet prior, it draws a Gaussian random vector and passes it through a softmax function. The generative process is the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
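As a numerical illustration, the following is a minimal sketch of this generative process with random (untrained) parameters and hypothetical dimensions:<br />
<pre>
# Hedged sketch of the NTM generative process with random, untrained parameters.
import numpy as np

rng = np.random.default_rng(0)
T, D, N = 5, 100, 20          # topics, vocabulary size, words in the document

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

mu0, sigma0 = 0.0, 1.0
W_hat, b_hat = rng.normal(size=(T, T)), rng.normal(size=T)
beta = np.array([softmax(rng.normal(size=D)) for _ in range(T)])  # topic-word distributions

theta = rng.normal(mu0, sigma0, size=T)            # theta ~ N(mu0, sigma0^2)
t = softmax(W_hat @ theta + b_hat)                 # topic proportions for the document
z = rng.choice(T, size=N, p=t)                     # z_n ~ Discrete(t)
words = [rng.choice(D, p=beta[z_n]) for z_n in z]  # w_n ~ Discrete(beta_{z_n})
</pre>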
<br />
The marginal likelihood for document <math>d</math> is then calculated as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic assignments <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased, low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick<sup>[[#References|[3]]]</sup>. The parameter updates are derived from the variational lower bound, which is discussed in the section [[#Model Inference|model inference]].<br />
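A minimal sketch of the trick, assuming the NTM encoder produces a mean vector mu and a log standard deviation log_sigma (all names and shapes are illustrative):<br />
<pre>
# Hedged sketch of the re-parameterization trick for the Gaussian topic vector.
import torch

def sample_topic_proportions(mu, log_sigma, W_hat, b_hat):
    """theta = mu + sigma * eps with eps ~ N(0, I); t = softmax(W_hat theta + b_hat)."""
    eps = torch.randn_like(mu)               # noise is the only source of randomness
    theta = mu + torch.exp(log_sigma) * eps  # differentiable w.r.t. mu and log_sigma
    return torch.softmax(W_hat @ theta + b_hat, dim=-1)

# Tiny usage example with placeholder parameters
T = 5
mu, log_sigma = torch.zeros(T), torch.zeros(T)
W_hat, b_hat = torch.randn(T, T), torch.zeros(T)
t = sample_topic_proportions(mu, log_sigma, W_hat, b_hat)
</pre>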
<br />
====Diversity Regularizer====<br />
<br />
One problem that many topic models encounter is redundancy in the inferred topics. Therefore, the TCNLM uses a diversity regularizer<sup>[[#References|[6]]]</sup><sup>[[#References|[7]]]</sup> to reduce it. The idea is to regularize the row-wise distance between each pair of topics. <br />
<br />
First, we measure the '''distance''' between a pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, the '''mean''' angle over all pairs of the T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and the '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu = \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
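A small numerical sketch of this regularizer, computed directly from a topic-word matrix <math>\boldsymbol \beta</math> of shape (T, D):<br />
<pre>
# Hedged sketch: topic diversity regularizer R = phi - nu from beta (T topics x D words).
import numpy as np

def diversity_regularizer(beta):
    T = beta.shape[0]
    norms = np.linalg.norm(beta, axis=1)
    cos = np.abs(beta @ beta.T) / np.outer(norms, norms)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))   # a(beta_i, beta_j) for all pairs
    phi = angles.mean()                            # mean angle (sum over all i, j divided by T^2)
    nu = ((angles - phi) ** 2).mean()              # variance of the angles
    return phi - nu

beta = np.random.default_rng(0).dirichlet(np.ones(100), size=5)  # 5 topics over 100 words
print(diversity_regularizer(beta))
</pre>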
<br />
==Language Model==<br />
<br />
A typical language model aims to define the conditional probability of each word <math>y_{m}</math> given all the preceding words <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>: <br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1}) &= p(y_{m}|\boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= f(\boldsymbol h_{m-1}, \boldsymbol x_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent neural networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]s) capture the temporal relationships among input information and output a sequence of input-dependent data. Compared to traditional feedforward neural networks, RNNs maintain internal memory by looping previous information back through the network. Despite this design, RNNs have difficulty learning long-term dependencies because gradients vanish during back-propagation, which prevents states distant in time from contributing to the output of the current state. Long Short-Term Memory ([https://en.wikipedia.org/wiki/Long_short-term_memory LSTM]) and Gated Recurrent Unit ([https://en.wikipedia.org/wiki/Gated_recurrent_unit GRU]) networks are variations of RNNs designed to address the vanishing-gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In the Topic Compositional Neural Language Model, word choices and ordering structure are strongly influenced by the topic distribution of a document. A '''Mixture of Experts''' language model is proposed, where each ‘expert’ is itself a topic-specific LSTM unit whose trained parameters correspond to the latent topic vector <math> t </math> inherited from the neural topic model. In such a model, the generation of words can be considered a weighted average of the predictions from each ‘expert’ model, with the latent topic vector serving as the proportion weights. <br />
<br />
TCNLM extends the weight matrices of each RNN unit to be topic-dependent, owing to the existence of a topic assignment for each individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} \times n_{x} \times T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} \times n_{h} \times T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of the word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, the T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
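The following sketch (hypothetical sizes, random tensors in place of trained parameters) shows how the topic-dependent weights could be composed and used for one recurrent step:<br />
<pre>
# Hedged sketch: compose topic-dependent weights W(t) = sum_k t_k W[k], U(t) = sum_k t_k U[k].
import torch

n_h, n_x, T = 64, 32, 5                     # hypothetical sizes
W = torch.randn(T, n_h, n_x)                # one expert weight matrix per topic
U = torch.randn(T, n_h, n_h)
t = torch.softmax(torch.randn(T), dim=0)    # topic proportions from the NTM

W_t = torch.einsum('k,kij->ij', t, W)       # W(t), shape (n_h, n_x)
U_t = torch.einsum('k,kij->ij', t, U)       # U(t), shape (n_h, n_h)

x_m, h_prev = torch.randn(n_x), torch.randn(n_h)
h_m = torch.sigmoid(W_t @ x_m + U_t @ h_prev)   # one recurrent step of the MoE language model
</pre>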
<br />
====LSTM Architecture====<br />
To generalize to an LSTM, TCNLM requires four sets of parameters, for the input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory state <math> \tilde{c}_{m} </math>, respectively. Recalling a typical LSTM cell, the model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \cdot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied to <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters: each is written as the product of three terms, <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h}\times n_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} \times T}, </math> and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f}\times n_{x}} </math>. This method is inspired by ''Gan et al.'' (2016) and ''Song et al.'' (2016) on semantic concept detection with RNNs. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol W(\boldsymbol t) &= W_{a} \cdot diag(\boldsymbol W_{b} \boldsymbol t) \cdot \boldsymbol W_{c} \\<br />
&= \boldsymbol W_{a} \cdot (\boldsymbol W_{b} \boldsymbol t \odot \boldsymbol W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes the entrywise product.<br />
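A short sketch of the factorization with hypothetical sizes, illustrating the parameter saving relative to the full tensor:<br />
<pre>
# Hedged sketch of the factored topic-dependent weights:
# W(t) = W_a . diag(W_b t) . W_c, which avoids storing a full (n_h, n_x, T) tensor.
import torch

n_h, n_x, n_f, T = 64, 32, 16, 5            # hypothetical sizes; n_f is the factor size
W_a = torch.randn(n_h, n_f)
W_b = torch.randn(n_f, T)
W_c = torch.randn(n_f, n_x)
t = torch.softmax(torch.randn(T), dim=0)

W_t = W_a @ torch.diag(W_b @ t) @ W_c       # shape (n_h, n_x)

# Parameter count: n_h*n_f + n_f*T + n_f*n_x, versus n_h*n_x*T for the full tensor.
</pre>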
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder framework serving as the neural topic model, which reads in the bag-of-words, embeds the document into a topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting the sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, direct optimization of this marginal is intractable, so variational inference is employed to provide an analytical approximation to the posterior over the unobserved <math> \boldsymbol t </math>. Here <math> q(\boldsymbol t|\boldsymbol d) </math>, the distribution over the latent vector <math> \boldsymbol t </math> given the bag-of-words produced by the Neural Topic Model, serves as the variational approximation to the true posterior <math> p(\boldsymbol t|\boldsymbol d) </math>, with the approximation gap measured by the Kullback-Leibler divergence. The log-likelihood <math> \log p(y_{1:M},\boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta) </math> can then be lower-bounded as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} [\log p(\boldsymbol d|\boldsymbol t)] - KL (q(\boldsymbol t|\boldsymbol d)\,||\,p(\boldsymbol t|\mu_{0},\sigma_{0}^{2}))}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} \Big[\sum_{m=1}^{M} \log p(y_{m}|y_{1:m-1}, \boldsymbol t)\Big]}_\text{neural language model} \leq \log \ p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularizer <math> R </math>, i.e.,<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
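A minimal sketch of this objective on a single toy document is given below. Assumptions are marked in the comments: a diagonal-Gaussian recognition network with a standard-normal prior, random stand-in values for the recognition-network outputs, and a fixed stand-in for the language-model log-likelihood; in the real model these all come from neural networks that are trained by backpropagating through this objective.<br />
<pre>
import numpy as np

rng = np.random.default_rng(2)
T, D = 5, 50                                  # topics, vocabulary size (toy sizes)

# --- neural topic model terms (one document) ---
d = rng.poisson(0.3, size=D).astype(float)    # bag-of-words counts (stand-in document)
mu, log_var = rng.normal(size=T), rng.normal(size=T) * 0.1   # q(t|d) parameters (assumed outputs)
eps = rng.normal(size=T)
theta = mu + np.exp(0.5 * log_var) * eps      # re-parameterization trick
t = np.exp(theta) / np.exp(theta).sum()       # topic proportions

beta = rng.dirichlet(np.ones(D), size=T)      # topic-word distributions (rows are topics)
p_words = t @ beta                            # p(w | beta, t)
recon = np.sum(d * np.log(p_words + 1e-10))   # single-sample estimate of E_q[log p(d|t)]
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))   # KL(q || N(0, I)) (standard-normal prior assumed)

# --- language model term (stand-in value; comes from the MoE LSTM in the real model) ---
lm_loglik = -42.0

# --- diversity regularizer R = phi - nu over all topic pairs ---
norms = np.linalg.norm(beta, axis=1)
cos = np.abs(beta @ beta.T) / np.outer(norms, norms)
angles = np.arccos(np.clip(cos, -1.0, 1.0))
phi = angles.mean()
nu = ((angles - phi) ** 2).mean()
R = phi - nu

lam = 1.0                                     # regularization weight lambda (assumed value)
L = recon - kl + lm_loglik                    # variational lower bound
J = L + lam * R                               # objective to maximize
print(J)
</pre>
Gradients flow through the re-parameterized sample of <math> \boldsymbol t </math>, so the topic model and the language model are trained jointly end-to-end.<br />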
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and an MoE language model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM requires the network not only to model documents as bags of words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model that combines the predicted word distributions of a topic model and a standard RNN language model. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing efficient end-to-end training. Further, the topic information is used as guidance for the MoE model design, and under TCNLM's factorization the model yields improved performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
The models are evaluated on three datasets: APNEWS (a collection of Associated Press news articles from 2009 to 2016), IMDB, and BNC. The paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
<center><br />
[[File:lm2.png]]<br />
</center><br />
#All the topic-guided methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and its performance keeps improving as the number of topics increases. <br />
#The improvement of TCNLM over LCLM implies that encoding the document context into meaningful topics is a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM provides a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
<center><br />
[[File:tm7.png]]<br />
</center><br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in slightly worse coherence. One possible explanation is that a larger language model has more influence on the topic model, and the stronger sequential information it passes back may hurt the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Notable applications along these lines include:<br />
<br />
#Google Mail Smart Compose: much like autocomplete in the search bar or on a smartphone keyboard, this AI-powered feature promises not only to work out what you are currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word-choice, and style mistakes in your writing.<br />
#Sentence generators that help people with language barriers express themselves more fluently: Stephen Hawking's main interface to his computer, called ACAT, included a word-prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only had to type the first couple of characters before selecting the whole word.<br />
<br />
=References=<br />
* <sup>[https://arxiv.org/abs/1712.09783 [1]]</sup> W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin. Topic compositional neural language model. arXiv preprint arXiv:1712.09783, 2017.<br />
<br />
* <sup>[https://arxiv.org/abs/1611.08002 [2]]</sup> Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint arXiv:1611.08002, 2016.<br />
<br />
* <sup>[https://arxiv.org/abs/1412.6980 [3]]</sup> D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
* <sup>[http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf [4]]</sup> D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.<br />
<br />
* <sup>[https://arxiv.org/abs/1605.06715 [5]]</sup> J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.<br />
<br />
* <sup>[http://www.cs.cmu.edu/~pengtaox/papers/kdd15_drbm.pdf [6]]</sup> P. Xie, Y. Deng, and E. Xing. Diversifying restricted Boltzmann machine for document modeling. In KDD, 2015.<br />
<br />
* <sup>[https://arxiv.org/abs/1706.00359 [7]]</sup> Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.</div>
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences<sup>[[#References|[1]]]</sup>. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA)<sup>[[#References|[4]]]</sup>, which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution<sup>[[#References|[7]]]</sup>. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick<sup>[[#References|[3]]]</sup>. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer<sup>[[#References|[6]]]</sup><sup>[[#References|[7]]]</sup> to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]s) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory ([https://en.wikipedia.org/wiki/Long_short-term_memory LSTM]) or Gated Recurrent Unit ([https://en.wikipedia.org/wiki/Gated_recurrent_unit GRU]) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
====LSTM Architecture====<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \cdot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by ''Gan et al.'' (2016) and ''Song et al.'' (2016) for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol W(\boldsymbol t) &= W_{a} \cdot diag(\boldsymbol W_{b} \boldsymbol t) \cdot \boldsymbol W_{c} \\<br />
&= \boldsymbol W_{a} \cdot (\boldsymbol W_{b} \boldsymbol t \odot \boldsymbol W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\boldsymbol \sigma_{0}^{2},\beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model, is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence. The log likelihood function of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can be estimated as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (log \ p(\boldsymbol d|\boldsymbol t)) - KL (q(\boldsymbol t|\boldsymbol d)||p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (\sum_{m=1}^{M} log p(y_{m}|y_{1:m-1}, \boldsymbol t)}_\text{neural language model} \leq log \ p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
<center><br />
[[File:lm2.png]]<br />
</center><br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
<center><br />
[[File:tm5.png]]<br />
</center><br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Significant extensions have proposed, including:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, the new AI-powered feature promises to not only intelligently work out what you’re currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generator helps people with language barrier express more fluently: Stephen Hawking's main interface to the computer, called ACAT, includes a word prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only have to type the first couple of characters before he can select the whole word.<br />
<br />
=References=<br />
* <sup>[https://arxiv.org/abs/1712.09783 [1]]</sup>W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin. Topic compositional neural language model. arXiv preprint\ arXiv:1712.09783, 2017.<br />
<br />
* <sup>[https://arxiv.org/abs/1611.08002 [2]]</sup>Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint\ arXiv:1611.08002, 2016.<br />
<br />
* <sup>[https://arxiv.org/abs/1412.6980 [3]]</sup>D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
* <sup>[http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf [4]]</sup>D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.<br />
<br />
* <sup>[https://arxiv.org/abs/1605.06715 [4]]</sup>J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.<br />
<br />
* <sup>[http://www.cs.cmu.edu/~pengtaox/papers/kdd15_drbm.pdf [5]]</sup>P. Xie, Y. Deng, and E. Xing. Diversifying restricted boltzmann machine for document modeling. In KDD, 2015.<br />
<br />
* <sup>[https://arxiv.org/abs/1706.00359 [6]]</sup>Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:tm5.png&diff=38415
File:tm5.png
2018-11-08T16:46:22Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:tm4.png&diff=38412
File:tm4.png
2018-11-08T16:45:29Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:tm3.png&diff=38411
File:tm3.png
2018-11-08T16:44:54Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:tm2.png&diff=38409
File:tm2.png
2018-11-08T16:44:05Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38407
stat441F18/TCNLM
2018-11-08T16:43:21Z
<p>Q26deng: /* Model Evaluation */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences<sup>[[#References|[1]]]</sup>. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]s) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory ([https://en.wikipedia.org/wiki/Long_short-term_memory LSTM]) or Gated Recurrent Unit ([https://en.wikipedia.org/wiki/Gated_recurrent_unit GRU]) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
====LSTM Architecture====<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \cdot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by ''Gan et al.'' (2016) and ''Song et al.'' (2016) for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol W(\boldsymbol t) &= W_{a} \cdot diag(\boldsymbol W_{b} \boldsymbol t) \cdot \boldsymbol W_{c} \\<br />
&= \boldsymbol W_{a} \cdot (\boldsymbol W_{b} \boldsymbol t \odot \boldsymbol W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\boldsymbol \sigma_{0}^{2},\beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model, is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence. The log likelihood function of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can be estimated as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (log \ p(\boldsymbol d|\boldsymbol t)) - KL (q(\boldsymbol t|\boldsymbol d)||p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (\sum_{m=1}^{M} log p(y_{m}|y_{1:m-1}, \boldsymbol t)}_\text{neural language model} \leq log \ p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
<center><br />
[[File:lm2.png]]<br />
</center><br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:tm.png]]<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Significant extensions have proposed, including:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, the new AI-powered feature promises to not only intelligently work out what you’re currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generator helps people with language barrier express more fluently: Stephen Hawking's main interface to the computer, called ACAT, includes a word prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only have to type the first couple of characters before he can select the whole word.<br />
<br />
=References=<br />
* <sup>[https://arxiv.org/abs/1712.09783 [1]]</sup>W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin. Topic compositional neural language model. arXiv preprint\ arXiv:1712.09783, 2017.<br />
<br />
* <sup>[2]</sup>Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint\ arXiv:1611.08002, 2016.<br />
<br />
* <sup>[3]</sup>D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
* <sup>[4]</sup>D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.<br />
<br />
* <sup>[5]</sup>J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.<br />
<br />
* <sup>[6]</sup>P. Xie, Y. Deng, and E. Xing. Diversifying restricted boltzmann machine for document modeling. In KDD, 2015.<br />
<br />
* <sup>[7]</sup>Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:lm2.png&diff=38406
File:lm2.png
2018-11-08T16:42:29Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:1table1.png&diff=38403
File:1table1.png
2018-11-08T16:40:35Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:2table2.png&diff=38400
File:2table2.png
2018-11-08T16:38:35Z
<p>Q26deng: </p>
<hr />
<div></div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38396
stat441F18/TCNLM
2018-11-08T16:30:55Z
<p>Q26deng: /* References */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
P. Xie, Y. Deng, and E. Xing. Diversifying restricted boltzmann machine for document modeling. In KDD, 2015.<br />
<br />
Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]s) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory ([https://en.wikipedia.org/wiki/Long_short-term_memory LSTM]) or Gated Recurrent Unit ([https://en.wikipedia.org/wiki/Gated_recurrent_unit GRU]) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In the Topic Compositional Neural Language Model, word choice and word order are strongly influenced by the topic distribution of a document. A '''Mixture-of-Experts''' (MoE) language model is therefore proposed, where each 'expert' is a topic-specific LSTM unit whose trained parameters correspond to the latent topic vector <math> t </math> inherited from the neural topic model. In such a model, word generation can be viewed as a weighted average of the predictions produced by each 'expert', with the latent topic vector serving as the mixture weights. <br />
<br />
Because each word carries a topic assignment, TCNLM extends the weight matrices of each RNN unit to be topic-dependent, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} \times n_{x} \times T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} \times n_{h} \times T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of the word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, the T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
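<br />
The following numpy sketch illustrates this mixing for a single recurrent step (dimensions and names are placeholders, not the authors' code): each expert's weight matrix is scaled by its topic proportion and the results are summed.<br />
<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_h, n_x, T = 8, 6, 4                            # placeholder sizes
rng = np.random.default_rng(2)
W_tensor = rng.standard_normal((T, n_h, n_x))    # stacked expert matrices W[k]
U_tensor = rng.standard_normal((T, n_h, n_h))    # stacked expert matrices U[k]
t = np.array([0.5, 0.2, 0.2, 0.1])               # topic proportions (sum to 1)

# W(t) = sum_k t_k W[k],  U(t) = sum_k t_k U[k]
W_t = np.tensordot(t, W_tensor, axes=1)
U_t = np.tensordot(t, U_tensor, axes=1)

# One recurrent step of the implied ensemble model.
x_m, h_prev = rng.standard_normal(n_x), np.zeros(n_h)
h_m = sigmoid(W_t @ x_m + U_t @ h_prev)
print(h_m.shape)   # (n_h,)
</pre>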
<br />
====LSTM Architecture====<br />
To generalize to an LSTM, TCNLM requires four sets of parameters, for the input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and candidate memory cell <math> \tilde{c}_{m} </math>, respectively. Recalling a typical LSTM cell, the model can be parameterized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \tanh(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \odot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot \tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
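<br />
As a compact sketch (illustrative shapes and names only, not the authors' code), one step of this topic-conditioned LSTM can be written by mixing each gate's weight tensor with the topic proportions before applying the standard update:<br />
<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tclstm_step(x, h_prev, c_prev, t, W, U):
    """One topic-conditioned LSTM step: every gate g uses the mixture
    W_g(t) = sum_k t_k W_g[k] (and likewise U_g(t))."""
    mix = lambda tensor: np.tensordot(t, tensor, axes=1)
    i = sigmoid(mix(W['i']) @ x + mix(U['i']) @ h_prev)        # input gate
    f = sigmoid(mix(W['f']) @ x + mix(U['f']) @ h_prev)        # forget gate
    o = sigmoid(mix(W['o']) @ x + mix(U['o']) @ h_prev)        # output gate
    c_tilde = np.tanh(mix(W['c']) @ x + mix(U['c']) @ h_prev)  # candidate cell
    c = i * c_tilde + f * c_prev
    h = o * np.tanh(c)
    return h, c

n_h, n_x, T = 8, 6, 4                                          # placeholder sizes
rng = np.random.default_rng(3)
W = {g: rng.standard_normal((T, n_h, n_x)) for g in 'ifoc'}
U = {g: rng.standard_normal((T, n_h, n_h)) for g in 'ifoc'}
t = np.array([0.5, 0.2, 0.2, 0.1])
h, c = tclstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), t, W, U)
</pre>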
<br />
A matrix factorization technique is applied to <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters: each is expressed as a product of three terms, <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h} \times n_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} \times T}, </math> and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f} \times n_{x}} </math>, where <math> n_{f} </math> is the number of factors. This method is inspired by ''Gan et al.'' (2016) and ''Song et al.'' (2016), who use it in RNNs for semantic concept detection. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol W(\boldsymbol t) &= \boldsymbol W_{a} \cdot diag(\boldsymbol W_{b} \boldsymbol t) \cdot \boldsymbol W_{c} \\<br />
&= \boldsymbol W_{a} \cdot (\boldsymbol W_{b} \boldsymbol t \odot \boldsymbol W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes the entrywise product, with the vector <math>\boldsymbol W_{b} \boldsymbol t</math> rescaling the rows of <math>\boldsymbol W_{c}</math>.<br />
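<br />
A small numpy sketch of the factorization (dimensions are placeholders), which also verifies numerically that the two expressions above agree:<br />
<br />
<pre>
import numpy as np

n_h, n_f, n_x, T = 8, 5, 6, 4
rng = np.random.default_rng(4)
W_a = rng.standard_normal((n_h, n_f))
W_b = rng.standard_normal((n_f, T))
W_c = rng.standard_normal((n_f, n_x))
t = np.array([0.5, 0.2, 0.2, 0.1])       # topic proportions

# W(t) = W_a . diag(W_b t) . W_c
W_t_diag = W_a @ np.diag(W_b @ t) @ W_c
# Equivalent form: each row of W_c is rescaled by the matching entry of W_b t.
W_t_rows = W_a @ ((W_b @ t)[:, None] * W_c)

assert np.allclose(W_t_diag, W_t_rows)
print(W_t_diag.shape)   # (n_h, n_x); only n_f*(n_h + T + n_x) parameters are stored
</pre>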
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder framework serving as the neural topic model, which learns a generative process in which the model reads in the bag-of-words, embeds the document into a topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs that predicts the sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, direct optimization of this marginal likelihood is intractable, so variational inference is employed to provide an analytical approximation to the posterior over the unobserved <math>\boldsymbol t</math>. Here <math> q(\boldsymbol t|\boldsymbol d) </math>, the distribution over the latent vector <math>\boldsymbol t</math> given the bag-of-words, produced by the neural topic model, is used as the variational approximation to the true posterior, with the approximation gap measured by the Kullback-Leibler divergence. The log-likelihood <math> \log p(y_{1:M},\boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta) </math> is then bounded from below as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} \left[\log p(\boldsymbol d|\boldsymbol t)\right] - \mathrm{KL}\left(q(\boldsymbol t|\boldsymbol d)\,||\,p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})\right)}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} \left[\sum_{m=1}^{M} \log p(y_{m}|y_{1:m-1}, \boldsymbol t)\right]}_\text{neural language model} \leq \log p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e.,<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
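<br />
Schematically, assuming the individual terms have already been computed elsewhere (the function, the sample values, and the weight lam below are placeholders, not the authors' implementation; the expectations would typically be approximated with a single re-parameterized sample of <math>\boldsymbol t</math>), the training objective combines as:<br />
<br />
<pre>
def training_objective(log_p_d_given_t, kl_q_p, log_p_words, diversity_R, lam=0.1):
    """J = L + lambda * R, where
    L = E_q[log p(d|t)] - KL(q(t|d) || p(t)) + E_q[sum_m log p(y_m | y_{1:m-1}, t)]."""
    elbo = log_p_d_given_t - kl_q_p + log_p_words
    return elbo + lam * diversity_R

# Placeholder per-document values, for illustration only.
print(training_objective(-250.3, 4.1, -612.8, 0.9))
</pre>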
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM requires the neural network not only to model documents as bags-of-words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS (a collection of Associated Press news articles from 2009 to 2016), IMDB, and BNC for model evaluation, the paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
[[File:lm.png]]<br />
#All the topic-aware methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics is a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM supplies a more efficient way to utilize the topic information, through a joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:tm.png]]<br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Significant extensions and applications have been proposed, including:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, the new AI-powered feature promises to not only intelligently work out what you’re currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generators help people with language barriers express themselves more fluently: Stephen Hawking's main computer interface, called ACAT, included a word-prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only had to type the first couple of characters before he could select the whole word.<br />
<br />
=References=<br />
* Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint arXiv:1611.08002, 2016.<br />
<br />
* D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.<br />
<br />
* D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.<br />
<br />
* J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38393
stat441F18/TCNLM
2018-11-08T16:26:58Z
<p>Q26deng: /* References */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks ([https://en.wikipedia.org/wiki/Recurrent_neural_network RNN]s) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory ([https://en.wikipedia.org/wiki/Long_short-term_memory LSTM]) or Gated Recurrent Unit ([https://en.wikipedia.org/wiki/Gated_recurrent_unit GRU]) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
====LSTM Architecture====<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \cdot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by ''Gan et al.'' (2016) and ''Song et al.'' (2016) for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol W(\boldsymbol t) &= W_{a} \cdot diag(\boldsymbol W_{b} \boldsymbol t) \cdot \boldsymbol W_{c} \\<br />
&= \boldsymbol W_{a} \cdot (\boldsymbol W_{b} \boldsymbol t \odot \boldsymbol W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\boldsymbol \sigma_{0}^{2},\beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model, is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence. The log likelihood function of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can be estimated as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (log \ p(\boldsymbol d|\boldsymbol t)) - KL (q(\boldsymbol t|\boldsymbol d)||p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (\sum_{m=1}^{M} log p(y_{m}|y_{1:m-1}, \boldsymbol t)}_\text{neural language model} \leq log \ p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
[[File:lm.png]]<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:tm.png]]<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Significant extensions have proposed, including:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, the new AI-powered feature promises to not only intelligently work out what you’re currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generator helps people with language barrier express more fluently: Stephen Hawking's main interface to the computer, called ACAT, includes a word prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only have to type the first couple of characters before he can select the whole word.<br />
<br />
=References=<br />
* Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint\ arXiv:1611.08002, 2016.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38390
stat441F18/TCNLM
2018-11-08T16:20:23Z
<p>Q26deng: /* References */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameterization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
====LSTM Architecture====<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \cdot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by ''Gan et al.'' (2016) and ''Song et al.'' (2016) for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\boldsymbol \sigma_{0}^{2},\beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model, is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence. The log likelihood function of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can be estimated as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (log \ p(\boldsymbol d|\boldsymbol t)) - KL (q(\boldsymbol t|\boldsymbol d)||p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} (\sum_{m=1}^{M} log p(y_{m}|y_{1:m-1}, \boldsymbol t)}_\text{neural language model} \leq log \ p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
[[File:lm.png]]<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:tm.png]]<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Significant extensions have proposed, including:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, the new AI-powered feature promises to not only intelligently work out what you’re currently trying to write but to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generator helps people with language barrier express more fluently: Stephen Hawking's main interface to the computer, called ACAT, includes a word prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only have to type the first couple of characters before he can select the whole word.<br />
<br />
=References=<br />
* {{Citation | last=Apostol | first=Tom M. | author-link=Tom M. Apostol | title=Calculus, Vol.&nbsp;1: One-Variable Calculus with an Introduction to Linear Algebra | year=1967 | edition=2nd | publisher=Wiley | isbn=978-0-471-00005-1}}</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38386
stat441F18/TCNLM
2018-11-08T16:12:39Z
<p>Q26deng: /* Neural Language Model */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model. <br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of [[#Topic Model| latent topics]], weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
[[File:Screen_Shot_2018-11-08_at_10.35.41_AM.png|thumb|center|700px|alt=model architecture.|[[#Model Architecture|Overall architecture]]]]<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation latent Dirichlet allocation] (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameritization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> \boldsymbol h_{m} </math>. <br />
<br />
<center><br />
<math><br />
p(y_{m}|y_{1:m-1})=p(y_{m}|\boldsymbol h_{m})<br />
\\<br />
\boldsymbol h_{m}= f(\boldsymbol h_{m-1}x_{m})<br />
<br />
<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(\boldsymbol V \boldsymbol h_{m})\\<br />
\boldsymbol h_{m} &= \sigma(W(t)\boldsymbol x_{m} + U(t)\boldsymbol h_{m-1})\\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> are defined as: <br />
<br />
<center> <math> \boldsymbol W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], \boldsymbol U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
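<br />
As a concrete illustration of this composition, the following NumPy sketch builds W(t) and U(t) from the expert tensors and performs one recurrent step followed by the softmax output. It is a minimal sketch with made-up dimensions, not the authors' implementation, and bias terms are omitted as in the equations above.<br />
<br />
<pre>
import numpy as np

# Illustrative sizes only: T topics, n_h hidden units, n_x embedding size, n_v vocabulary size.
T, n_h, n_x, n_v = 10, 16, 8, 100
W = np.random.randn(T, n_h, n_x)   # expert tensor, W[k] for each topic k
U = np.random.randn(T, n_h, n_h)   # expert tensor, U[k] for each topic k
V = np.random.randn(n_v, n_h)      # output projection

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tcnlm_step(t, x_m, h_prev):
    """One topic-compositional step: h_m = sigma(W(t) x_m + U(t) h_{m-1})."""
    W_t = np.tensordot(t, W, axes=1)   # W(t) = sum_k t_k * W[k], shape (n_h, n_x)
    U_t = np.tensordot(t, U, axes=1)   # U(t) = sum_k t_k * U[k], shape (n_h, n_h)
    h_m = sigmoid(W_t @ x_m + U_t @ h_prev)
    return h_m, softmax(V @ h_m)       # hidden state and p(y_m | h_m)

t = np.random.dirichlet(np.ones(T))    # topic proportions (from the neural topic model in practice)
h, p_y = tcnlm_step(t, np.random.randn(n_x), np.zeros(n_h))
</pre>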
<br />
====LSTM Architecture====<br />
To generalize to an LSTM, TCNLM requires four sets of parameters, for the input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and candidate memory cell <math> \tilde{c}_{m} </math>, respectively. Recalling a typical LSTM cell, the model can be parameterized as follows:<br />
<center><br />
[[File:neurallanguage.png|right]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
\boldsymbol i_{m} &= \sigma(\boldsymbol W_{i}(t) \boldsymbol x_{i,m-1} + \boldsymbol U_{i}(t) \boldsymbol h_{i,m-1})\\<br />
\boldsymbol f_{m} &= \sigma(\boldsymbol W_{f}(t) \boldsymbol x_{f,m-1} + \boldsymbol U_{f}(t)\boldsymbol h_{f,m-1})\\<br />
\boldsymbol o_{m} &= \sigma(\boldsymbol W_{o}(t) \boldsymbol x_{o,m-1} +\boldsymbol U_{o}(t)\boldsymbol h_{o,m-1})\\<br />
\tilde{\boldsymbol c}_{m} &= \sigma(\boldsymbol W_{c}(t) \boldsymbol x_{c,m-1} + \boldsymbol U_{c}(t)\boldsymbol h_{c,m-1})\\<br />
\boldsymbol c_{m} &= \boldsymbol i_{m} \odot \tilde{\boldsymbol c}_{m} + \boldsymbol f_{m} \odot \boldsymbol c_{m-1}\\<br />
\boldsymbol h_{m} &= \boldsymbol o_{m} \odot tanh(\boldsymbol c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied to <math> \boldsymbol W(t) </math> and <math> \boldsymbol U(t) </math> to further reduce the number of model parameters: each is decomposed into a product of three terms, <math>\boldsymbol W_{a} \in \mathbb{R}^{n_{h} \times n_{f}}, \boldsymbol W_{b} \in \mathbb{R}^{n_{f} \times T}, </math> and <math> \boldsymbol W_{c} \in \mathbb{R}^{n_{f} \times n_{x}} </math>, where <math> n_{f} </math> is the number of factors. This method is inspired by ''Gan et al.'' (2016) and ''Song et al.'' (2016), who use a similar factorization for semantic concept detection with RNNs. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
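<br />
The factorization replaces each full n_h × n_x × T tensor with three much smaller matrices. The NumPy sketch below, again with illustrative sizes, checks that the diagonal form and the entrywise-product form of W(t) give the same matrix.<br />
<br />
<pre>
import numpy as np

n_h, n_x, n_f, T = 16, 8, 4, 10       # n_f is the factorization size (illustrative)
Wa = np.random.randn(n_h, n_f)
Wb = np.random.randn(n_f, T)
Wc = np.random.randn(n_f, n_x)
t = np.random.dirichlet(np.ones(T))   # topic proportions

# W(t) = Wa . diag(Wb t) . Wc
W_diag = Wa @ np.diag(Wb @ t) @ Wc

# Equivalent entrywise form: scale each row of Wc by the matching entry of (Wb t).
W_hadamard = Wa @ ((Wb @ t)[:, None] * Wc)

assert np.allclose(W_diag, W_hadamard)
print(W_diag.shape)                   # (16, 8), i.e. (n_h, n_x)
</pre>
With these illustrative sizes, the unfactorized tensor would hold n_h · n_x · T = 1280 parameters, while the factorized form holds only n_f (n_h + T + n_x) = 136, which is where the parameter saving comes from.<br />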
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder framework serving as the Neural Topic Model, which learns a generative process: the model reads in the bag-of-words, embeds the document into the topic vector, and reconstructs the bag-of-words as output. This is followed by an ensemble of LSTMs for predicting the sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},\boldsymbol d|\mu_{0},\boldsymbol \sigma_{0}^{2},\beta) = \int_{t}p(\boldsymbol t|\mu_{0},\sigma_{0}^{2})p(\boldsymbol d|\boldsymbol \beta,\boldsymbol t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},\boldsymbol t)d \boldsymbol t </math> </center><br />
<br />
However, direct optimization is intractable, so variational inference is employed to provide an analytical approximation to the posterior over the unobservable <math> \boldsymbol t </math>. Here <math> q(\boldsymbol t|\boldsymbol d) </math>, the distribution over the latent vector <math> \boldsymbol t </math> given the bag-of-words produced by the Neural Topic Model, serves as the variational approximation to the true posterior, with the gap measured by the Kullback-Leibler divergence. A variational lower bound on the log-likelihood <math> \log p(y_{1:M},\boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta) </math> is obtained as follows:<br />
<br />
<center><br />
<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} \left[ \log p(\boldsymbol d|\boldsymbol t) \right] - \text{KL} \left( q(\boldsymbol t|\boldsymbol d) \,||\, p(\boldsymbol t|\mu_{0},\sigma_{0}^{2}) \right)}_\text{neural topic model} \\ &+ \underbrace{\mathbb{E}_{q(\boldsymbol t|\boldsymbol d)} \left[ \sum_{m=1}^{M} \log p(y_{m}|y_{1:m-1}, \boldsymbol t) \right]}_\text{neural language model} \leq \log p(y_{1:M}, \boldsymbol d|\mu_{0},\sigma_{0}^{2},\boldsymbol \beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e.,<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda \cdot R<br />
</math><br />
</center><br />
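<br />
As a piece of bookkeeping, the sketch below shows how the parts of this objective fit together, assuming (as is standard for this kind of VAE) that the variational distribution and the prior are diagonal Gaussians over the latent pre-softmax variable, so the KL term has a closed form. The two expectation terms are assumed to have been estimated already, e.g. by a Monte Carlo sample drawn with the re-parameterization trick; everything here is an illustrative sketch rather than the authors' code.<br />
<br />
<pre>
import numpy as np

def gaussian_kl(mu, log_var, mu0=0.0, var0=1.0):
    """KL( N(mu, exp(log_var)) || N(mu0, var0) ) for diagonal Gaussians, summed over dimensions."""
    var = np.exp(log_var)
    return 0.5 * np.sum(np.log(var0) - log_var + (var + (mu - mu0) ** 2) / var0 - 1.0)

def tcnlm_objective(log_p_d_given_t, log_p_words_given_t, mu, log_var, R, lam=0.1):
    """J = L + lambda * R, where L is the variational lower bound.

    log_p_d_given_t:     estimate of E_q[ log p(d | t) ]                    (bag-of-words reconstruction)
    log_p_words_given_t: estimate of E_q[ sum_m log p(y_m | y_{1:m-1}, t) ] (language model term)
    mu, log_var:         parameters of the Gaussian variational distribution
    R:                   topic diversity regularizer, with weight lam
    """
    L = log_p_d_given_t - gaussian_kl(mu, log_var) + log_p_words_given_t
    return L + lam * R
</pre>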
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM trains the neural network not only to model documents as bags-of-words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
The models are evaluated on three datasets: APNEWS, a collection of Associated Press news articles from 2009 to 2016; IMDB; and BNC. On these, the paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
[[File:lm.png]]<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and its performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM supplies a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:tm.png]]<br />
#TCNLM achieves the best coherence performance over APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Related applications of such language models include:<br />
<br />
#Google Mail Smart Compose: Much like autocomplete in the search bar or on your smartphone’s keyboard, this AI-powered feature promises not only to work out what you’re currently trying to write but also to predict whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Sentence generators help people with language barriers express themselves more fluently: Stephen Hawking's main computer interface, ACAT, includes a word-prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only has to type the first couple of characters before he can select the whole word.<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38361
stat441F18/TCNLM
2018-11-08T15:51:39Z
<p>Q26deng: /* LSTM Architecture */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [https://en.wikipedia.org/wiki/Autoencoder variational autoencoder] framework, coupled with the probability of topic usage, are further trained in a MoE model.<br />
[[File:Screen Shot 2018-11-08 at 10.35.41 AM|thumb|alt=model architecture.|Overall Architecture]]<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
==Topic Model==<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
===LDA===<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as the following:<br />
<br />
<center><br />
<math><br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) = \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
</math><br />
</center><br />
<br />
===Neural Topic Model===<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
<center><br />
[[File:Screen Shot 2018-11-08 at 10.28.02 AM.png]]<br />
</center><br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to [[#LDA|LDA]], the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\begin{align}<br />
&\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2) \\<br />
&\boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}) \\<br />
&z_n \sim Discrete(\boldsymbol t) \\<br />
&w_n \sim Discrete(\boldsymbol \beta_{z_n})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Where <math>\boldsymbol{\hat W}</math> and <math> \boldsymbol{\hat b}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(\boldsymbol d | \mu_0, \sigma_0, \boldsymbol \beta) &= \int_{t} p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n \sum_{z_n} p(w_n | \boldsymbol \beta_{z_n}) p(z_n | \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) \prod_n p(w_n | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t \\<br />
&= \int_t p(\boldsymbol t | \mu_0, \sigma^2_0) p(\boldsymbol d | \boldsymbol \beta, \boldsymbol t) d \boldsymbol t<br />
\end{align}<br />
</math><br />
</center><br />
<br />
The first equality of the equation above holds since we can marginalize out the sampled topic words <math>z_n</math>:<br />
<center><br />
<math><br />
p(w_n|\boldsymbol \beta, \boldsymbol t) = \sum_{z_n} p(w_n | \boldsymbol \beta) p(z_n | \boldsymbol t)<br />
</math><br />
</center><br />
<br />
====Re-Parameritization Trick====<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section [[#Model Inference|model inference]].<br />
<br />
====Diversity Regularizer====<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. <br />
<br />
First, we measure the '''distance''' between pair of topics with:<br />
<br />
<center><br />
<math><br />
a(\boldsymbol \beta_i, \boldsymbol \beta_j) = \arccos(\frac{|\boldsymbol \beta_i \cdot \boldsymbol \beta_j|}{||\boldsymbol \beta_i||_2||\boldsymbol \beta_j||_2})<br />
</math><br />
</center><br />
<br />
Then, '''mean''' angle of all pairs of T topics is <br />
<br />
<center><br />
<math><br />
\phi = \frac{1}{T^2} \sum_i \sum_j a(\boldsymbol \beta_i, \boldsymbol \beta_j)<br />
</math><br />
</center><br />
<br />
and '''variance''' is <br />
<br />
<center><br />
<math><br />
\nu - \frac{1}{T^2} \sum_i \sum_j (a(\boldsymbol \beta_i, \boldsymbol \beta_j) - \phi)^2<br />
</math><br />
</center><br />
<br />
Finally, we identify the topic diversity regularization as <br />
<br />
<center><br />
<math> <br />
R = \phi - \nu<br />
</math><br />
</center> <br />
<br />
which will be used in the [[#Model Inference|model inference]].<br />
<br />
==Language Model==<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}x_{m}) \\<br />
<br />
\end{align}<br />
</math><br />
</center><br />
<br />
===Recurrent Neural Network===<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
===Neural Language Model===<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <br />
<br />
<center> <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
====LSTM Architecture====<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<center><br />
[[File:neurallanguage.png]]<br />
</center><br />
<br />
<center><br />
<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i}(t) x_{i,m-1} + U_{i}(t) h_{i,m-1})\\<br />
f_{m} &= \sigma(W_{f}(t) x_{f,m-1} + U_{f}(t)h_{f,m-1})\\<br />
o_{m} &= \sigma(W_{o}(t) x_{o,m-1} + U_{o}(t)h_{o,m-1})\\<br />
\tilde{c}_{m} &= \sigma(W_{c}(t) x_{c,m-1} + U_{c}(t)h_{c,m-1})\\<br />
c_{m} &= i_{m} \odot \tilde{c}_{m} + f_{m} \cdot c_{m-1}\\<br />
h_{m} &= o_{m} \odot tanh(c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math> W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by () and () for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
==Model Inference==<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) = \int_{t}p(t|\mu_{0},\sigma_{0}^{2})p(d|\beta,t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},t)dt </math> </center><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model, is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence. The log likelihood function of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can be estimated as follows:<br />
<br />
<center><br />
:<math> <br />
\begin{align}<br />
\mathcal{L} =& \ \mathbb{E}_{q(t|d)} (log p(d|t)) - KL (q(t|d)||p(t|\mu_{0},\sigma_{0}^{2}) \\ &+ \mathbb{E}_{q(t|d)} (\sum_{m=1}|^{M} log p(y_{m}|y_{1:m-1}, t) \\ & \leq log p(y_{1:M}, d|\mu_{0},\sigma_{0}^{2},\beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda R<br />
</math><br />
</center><br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM trains the neural network not only to model documents as bags-of-words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model that combines the predicted word distributions of a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing efficient end-to-end training. Further, the topic information is used as guidance for a MoE model design, and the factorization method keeps this ensemble efficient.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
The models are evaluated on three datasets: APNEWS, a collection of Associated Press news articles from 2009 to 2016; IMDB, a collection of movie reviews; and BNC, the British National Corpus. The evaluation produces the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
[[File:language_model.png]]<br />
#All the topic-informed methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM supplies a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
[[File:topic_model.png]]<br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in slightly worse coherence. One possible explanation is that a larger language model has more influence on the topic model, and the stronger sequential information it carries may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
Another advantage of TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Related real-world applications of such topic- and context-conditioned text generation include:<br />
<br />
#Google Mail Smart Compose: much like autocomplete in the search bar or on a smartphone keyboard, this AI-powered feature not only works out what you are currently trying to write but also predicts whole emails.<br />
#Grammarly: Grammarly automatically detects grammar, spelling, punctuation, word choice and style mistakes in your writing.<br />
#Assistive sentence generation helps people with language barriers express themselves more fluently: Stephen Hawking's main computer interface, called ACAT, included a word prediction algorithm provided by SwiftKey, trained on his books and lectures, so he usually only had to type the first couple of characters before he could select the whole word.<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38022
stat441F18/TCNLM
2018-11-06T06:21:58Z
<p>Q26deng: /* Model Inference */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model is latent Dirichlet allocation (LDA), which assumes each document mixes several topics in different proportions. LDA places a Dirichlet prior with parameter <math>\alpha</math> on the topic proportions <math>\boldsymbol{t}</math> and calculates the marginal likelihood of a document as<br />
<center><br />
<math><br />
p(\boldsymbol{d}|\alpha, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t}|\alpha) \prod_{n} \sum_{z_n} p(w_n|z_n, \boldsymbol{\beta}) \, p(z_n|\boldsymbol{t}) \, d\boldsymbol{t},<br />
</math><br />
</center><br />
where the variables <math>\boldsymbol{t}</math>, <math>z_n</math>, <math>w_n</math>, and <math>\boldsymbol{\beta}</math> are defined in the next subsection.<br />
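For intuition, the short sketch below (with toy, made-up sizes) samples one document from LDA's generative story: topic proportions are drawn from the Dirichlet prior, then each position draws a topic assignment and a word.<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): T topics over a vocabulary of D words.
T, D, doc_len = 3, 20, 15
alpha = np.full(T, 0.5)                   # Dirichlet prior parameter
beta = rng.dirichlet(np.ones(D), size=T)  # beta[k] is topic k's distribution over words

# LDA's generative story for one document:
t = rng.dirichlet(alpha)                  # 1. draw topic proportions t ~ Dir(alpha)
words = []
for _ in range(doc_len):
    z = rng.choice(T, p=t)                # 2. draw a topic assignment z_n ~ Mult(t)
    w = rng.choice(D, p=beta[z])          # 3. draw a word w_n ~ Mult(beta_{z_n})
    words.append(int(w))
print(t.round(2), words)
</pre>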
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document and predicts the topic distribution of the document, with the aim of identifying the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be a document over a vocabulary of <math> D </math> distinct words<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of <math>\boldsymbol{d}</math> counts how many times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the matrix of topic distributions trained in the decoder, where <math>\beta_i \in \mathbb{R}^D</math> is the distribution of the <math>i</math>-th topic over the <math>D</math> vocabulary words.<br />
<br />
Similar to LDA, the neural topic model parameterizes a multinomial document topic distribution. However, instead of a Dirichlet prior, it uses a Gaussian random vector passed through a softmax function. The generative process is the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
where <math>\hat{\boldsymbol{W}}</math> and <math>\hat{\boldsymbol{b}}</math> are trainable parameters.<br />
<br />
The marginal likelihood for document <math>d</math> is then calculated as<br />
<center><br />
<math><br />
p(\boldsymbol{d}|\mu_0, \sigma_0^2, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t}|\mu_0, \sigma_0^2) \prod_{n} \sum_{z_n} p(w_n|\beta_{z_n}) \, p(z_n|\boldsymbol{t}) \, d\boldsymbol{t}.<br />
</math><br />
</center><br />
<br />
===Re-Parameterization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick: instead of sampling <math>\boldsymbol{\theta}</math> directly, the model samples <math>\epsilon \sim N(0, I)</math> and sets <math>\boldsymbol{\theta} = \mu(\boldsymbol{d}) + \sigma(\boldsymbol{d}) \odot \epsilon</math>, where <math>\mu(\boldsymbol{d})</math> and <math>\sigma(\boldsymbol{d})</math> are produced by the inference (encoder) network. This moves the randomness outside the trainable parameters so that gradients can flow through <math>\mu</math> and <math>\sigma</math>. The parameter updates are derived from the variational lower bound discussed in the Model Inference section.<br />
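A minimal numpy sketch of this re-parameterized sampling step is shown below, assuming a simple one-layer inference network; all weights and names are illustrative stand-ins, not the architecture used in the paper.<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): vocabulary D, number of topics T, encoder width n_e.
D, T, n_e = 100, 10, 32

# Random stand-ins for a trained inference network and the softmax projection.
W_enc = rng.normal(size=(n_e, D)) * 0.05
W_mu = rng.normal(size=(T, n_e)) * 0.05
W_logvar = rng.normal(size=(T, n_e)) * 0.05
W_hat, b_hat = rng.normal(size=(T, T)) * 0.05, np.zeros(T)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode_topic_proportions(bow):
    """Map a bag-of-words vector to topic proportions t via the re-parameterization trick."""
    h = np.tanh(W_enc @ bow)                  # encoder hidden representation
    mu, logvar = W_mu @ h, W_logvar @ h       # Gaussian posterior parameters
    eps = rng.standard_normal(T)              # noise sampled outside the parameters
    theta = mu + np.exp(0.5 * logvar) * eps   # theta = mu + sigma * eps
    return softmax(W_hat @ theta + b_hat)     # t = softmax(W_hat theta + b_hat)

bow = rng.integers(0, 3, size=D).astype(float)  # a fake word-count vector
print(encode_topic_proportions(bow).round(3))
</pre>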
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is redundancy in the inferred topics. Therefore, TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each pair of topics. First, the distance between a pair of topics is measured by the angle <math>a(\beta_i, \beta_j) = \arccos \left( \frac{|\beta_i \cdot \beta_j|}{\lVert \beta_i \rVert \, \lVert \beta_j \rVert} \right)</math>. Then, the mean angle over all pairs of the T topics is <math>\phi = \frac{1}{T^2}\sum_i \sum_j a(\beta_i, \beta_j)</math>, and the variance is <math>\nu = \frac{1}{T^2}\sum_i \sum_j \left( a(\beta_i, \beta_j) - \phi \right)^2</math>. Finally, the topic diversity regularization is defined as <math>R = \phi - \nu</math>, which will be used in the model inference.<br />
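The regularizer can be computed directly from the topic matrix <math>\boldsymbol{\beta}</math>; the sketch below follows the formulas above, with a randomly generated toy topic matrix standing in for the learned one.<br />
<pre>
import numpy as np

def topic_diversity_regularizer(beta):
    """R = mean pairwise angle - variance of pairwise angles between topic vectors.

    beta: array of shape (T, D), one topic distribution per row.
    """
    unit = beta / np.linalg.norm(beta, axis=1, keepdims=True)
    cos = np.abs(unit @ unit.T)                      # |beta_i . beta_j| / (||beta_i|| ||beta_j||)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))      # pairwise angles a(beta_i, beta_j)
    mean_angle = angles.mean()                       # phi
    variance = ((angles - mean_angle) ** 2).mean()   # nu
    return mean_angle - variance                     # R = phi - nu

beta = np.random.default_rng(0).dirichlet(np.ones(50), size=8)  # toy topic matrix
print(topic_diversity_regularizer(beta))
</pre>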
<br />
=Language Model=<br />
<br />
A typical language model aims to define the conditional probability of each word <math>y_{m}</math> given all the preceding words <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>: <br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}, x_{m}) \\<br />
<br />
\end{align}<br />
</math><br />
</center><br />
<br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among inputs and output a sequence of input-dependent data. Compared to traditional feedforward neural networks, an RNN maintains internal memory by looping previous information back into the network. Despite this design, plain RNNs have difficulty learning long-term dependencies because gradients vanish during back-propagation through time, which prevents states distant in time from contributing to the output of the current state. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells are RNN variants designed to address this vanishing-gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In the Topic Compositional Neural Language Model, word choice and ordering are strongly influenced by the topic distribution of a document. A '''‘Mixture of Experts’''' language model is proposed, where each ‘expert’ is itself a topic-specific LSTM unit whose parameters correspond to the latent topic vector <math> t </math> inherited from the Neural Topic Model. In such a model, word generation can be viewed as a weighted average of the predictions from each ‘expert’, with the latent topic vector serving as the mixture weights. <br />
<br />
TCNLM extends the weight matrices of each RNN unit to be topic-dependent, since each word is associated with the document's topic distribution; this implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} \times n_{x} \times T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} \times n_{h} \times T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of the word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, the T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <br />
<br />
<center> <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
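The sketch below illustrates one forward step of this Mixture-of-Experts RNN with toy sizes: the topic proportions <math>t</math> weight the expert tensors into <math>W(t)</math> and <math>U(t)</math>, which then update the hidden state and predict the next word. All parameters are random stand-ins for trained values.<br />
<pre>
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (hypothetical stand-ins).
V_size, n_x, n_h, T = 30, 6, 12, 4

W = rng.normal(size=(T, n_h, n_x)) * 0.1   # expert input-weight tensor W[k]
U = rng.normal(size=(T, n_h, n_h)) * 0.1   # expert recurrent-weight tensor U[k]
V_out = rng.normal(size=(V_size, n_h)) * 0.1

t = rng.dirichlet(np.ones(T))              # soft topic proportions from the NTM
W_t = np.einsum('k,kij->ij', t, W)         # W(t) = sum_k t_k W[k]
U_t = np.einsum('k,kij->ij', t, U)         # U(t) = sum_k t_k U[k]

x_m = rng.normal(size=n_x)                 # current word embedding
h_prev = np.zeros(n_h)
# h_m = sigma(W(t) x_m + U(t) h_{m-1}); tanh is used as the nonlinearity here
h_m = np.tanh(W_t @ x_m + U_t @ h_prev)
logits = V_out @ h_m
p_y = np.exp(logits - logits.max())
p_y /= p_y.sum()                           # p(y_m) = softmax(V h_m)
print(p_y.argmax(), round(float(p_y.max()), 3))
</pre>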
<br />
===LSTM Architecture===<br />
To generalize to an LSTM, TCNLM requires four sets of parameters for the input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and candidate memory cell <math> \tilde{c}_{m} </math>, respectively. Recalling a typical LSTM cell, (insert an image depicting LSTM) the model can be parameterized as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i}(t) x_{i,m-1} + U_{i}(t) h_{i,m-1})\\<br />
f_{m} &= \sigma(W_{f}(t) x_{f,m-1} + U_{f}(t)h_{f,m-1})\\<br />
o_{m} &= \sigma(W_{o}(t) x_{o,m-1} + U_{o}(t)h_{o,m-1})\\<br />
\tilde{c}_{m} &= \sigma(W_{c}(t) x_{c,m-1} + U_{c}(t)h_{c,m-1})\\<br />
c_{m} &= i_{m} \odot \tilde{c}_{m} + f_{m} \odot c_{m-1}\\<br />
h_{m} &= o_{m} \odot tanh(c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
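For concreteness, one step of the topic-compositional LSTM cell can be sketched as follows, with toy dimensions and random stand-in weights. Note that the candidate cell state uses tanh here, as in a standard LSTM, whereas the equations above write <math>\sigma</math>.<br />
<pre>
import numpy as np

rng = np.random.default_rng(2)
n_x, n_h, T = 6, 12, 4

def compose(tensor, t):
    """Weighted sum over experts: sum_k t_k * tensor[k]."""
    return np.einsum('k,kij->ij', t, tensor)

# One expert weight tensor per gate (random stand-ins for trained values).
gates = ['i', 'f', 'o', 'c']
W = {g: rng.normal(size=(T, n_h, n_x)) * 0.1 for g in gates}
U = {g: rng.normal(size=(T, n_h, n_h)) * 0.1 for g in gates}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tcnlm_lstm_step(x_prev, h_prev, c_prev, t):
    """One LSTM step with topic-composed gate weights W_g(t), U_g(t)."""
    i = sigmoid(compose(W['i'], t) @ x_prev + compose(U['i'], t) @ h_prev)
    f = sigmoid(compose(W['f'], t) @ x_prev + compose(U['f'], t) @ h_prev)
    o = sigmoid(compose(W['o'], t) @ x_prev + compose(U['o'], t) @ h_prev)
    c_tilde = np.tanh(compose(W['c'], t) @ x_prev + compose(U['c'], t) @ h_prev)
    c = i * c_tilde + f * c_prev          # c_m = i (*) c_tilde + f (*) c_{m-1}
    h = o * np.tanh(c)                    # h_m = o (*) tanh(c_m)
    return h, c

t = rng.dirichlet(np.ones(T))
h, c = tcnlm_lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), t)
print(h.shape, c.shape)
</pre>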
<br />
A matrix factorization technique is applied to <math> W(t) </math> and <math> U(t) </math> to further reduce the number of model parameters, decomposing each into a product of three terms: <math> W_{a} \in \mathbb{R}^{n_{h} \times n_{f}}, W_{b} \in \mathbb{R}^{n_{f} \times T}, </math> and <math> W_{c} \in \mathbb{R}^{n_{f} \times n_{x}} </math>, where <math> n_{f} </math> is the number of factors. This method is inspired by () and (), which applied it to RNNs for semantic concept detection. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c}, \\<br />
W(t)x_{m} &= W_{a} \cdot \left( W_{b}t \odot W_{c}x_{m} \right), \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
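The sketch below checks this factorization numerically with toy sizes: forming <math>W(t)</math> explicitly and using the factored expression <math>W_{a}(W_{b}t \odot W_{c}x_{m})</math> give the same result, while the latter never materializes the full <math>n_{h} \times n_{x}</math> matrix.<br />
<pre>
import numpy as np

rng = np.random.default_rng(3)
n_h, n_x, n_f, T = 12, 6, 5, 4   # toy sizes; n_f is the number of factors

Wa = rng.normal(size=(n_h, n_f))
Wb = rng.normal(size=(n_f, T))
Wc = rng.normal(size=(n_f, n_x))
t = rng.dirichlet(np.ones(T))
x = rng.normal(size=n_x)

# Explicit composition: W(t) = Wa diag(Wb t) Wc, then multiply by x.
out_full = (Wa @ np.diag(Wb @ t) @ Wc) @ x

# Factored computation that never forms W(t): Wa ((Wb t) elementwise (Wc x)).
out_factored = Wa @ ((Wb @ t) * (Wc @ x))

print(np.allclose(out_full, out_factored))   # True
</pre>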
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder framework serving as the Neural Topic Model, which reads in the bag-of-words, embeds the document into a topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting the sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<center> <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) = \int_{t}p(t|\mu_{0},\sigma_{0}^{2})p(d|\beta,t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},t)dt </math> </center><br />
<br />
However, direct optimization is intractable, so variational inference is employed to provide an analytical approximation to the posterior distribution of the unobserved <math>t</math>. Here <math> q(t|d) </math>, the distribution over the latent vector <math> t </math> given the bag-of-words produced by the Neural Topic Model, serves as the variational approximation of the true posterior, and the gap between the two is measured by the Kullback-Leibler divergence. The log-likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> can then be lower-bounded as follows:<br />
<br />
<center><br />
:<math> <br />
\begin{align}<br />
\mathcal{L} &= \mathbb{E}_{q(t|d)} \left[ \log p(d|t) \right] - KL \left( q(t|d) \, || \, p(t|\mu_{0},\sigma_{0}^{2}) \right) \\ &+ \mathbb{E}_{q(t|d)} \left[ \sum_{m=1}^{M} \log p(y_{m}|y_{1:m-1}, t) \right] \\ & \leq \log p(y_{1:M}, d|\mu_{0},\sigma_{0}^{2},\beta)<br />
\end{align}<br />
</math><br />
</center><br />
<br />
Hence, the goal of TCNLM is to maximize <math> \mathcal{L} </math> together with the diversity regularization <math> R </math>, i.e.,<br />
<center><br />
<math><br />
\mathcal{J} = \mathcal{L} + \lambda R<br />
</math><br />
</center><br />
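As a rough sketch, the objective for a single document can be assembled from its already-computed pieces as below. The KL term is written in the closed form for diagonal Gaussians, applied to the encoder's Gaussian over <math>\boldsymbol{\theta}</math> against the prior <math>N(\mu_{0}, \sigma_{0}^{2})</math>, which is the usual practice in this VAE setting; all numeric inputs are illustrative placeholders, not values from the paper.<br />
<pre>
import numpy as np

def gaussian_kl(mu, logvar, mu0=0.0, var0=1.0):
    """Closed-form KL( N(mu, exp(logvar)) || N(mu0, var0) ) for diagonal Gaussians."""
    var = np.exp(logvar)
    return 0.5 * np.sum(var / var0 + (mu - mu0) ** 2 / var0 - 1.0 + np.log(var0) - logvar)

def tcnlm_objective(bow_log_lik, lm_log_lik, mu, logvar, diversity_R, lam=0.1):
    """J = L + lambda * R, where L is the variational lower bound."""
    elbo = bow_log_lik - gaussian_kl(mu, logvar) + lm_log_lik
    return elbo + lam * diversity_R

# Placeholder values standing in for quantities computed elsewhere in the model.
rng = np.random.default_rng(4)
J = tcnlm_objective(bow_log_lik=-250.0, lm_log_lik=-180.0,
                    mu=rng.normal(size=10), logvar=0.1 * rng.normal(size=10),
                    diversity_R=1.2)
print(J)
</pre>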
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM requires the neural network not only to model documents as bags of words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline: first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
The models are evaluated on three datasets: APNEWS, a collection of Associated Press news articles from 2009 to 2016; IMDB, a corpus of movie reviews; and BNC, the British National Corpus. The paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and its performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics is a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that the joint variational learning framework, which implicitly trains an ensemble model, provides a more efficient way to utilize the topic information.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in slightly worse coherence. One possible explanation is that a larger language model exerts more influence on the topic model, and the stronger sequential information it carries may hurt the coherence measure. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38018
stat441F18/TCNLM
2018-11-06T05:28:28Z
<p>Q26deng: /* Language Model */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}x_{m}) \\<br />
<br />
\end{align}<br />
</math><br />
</center><br />
<br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
</center><br />
<br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <br />
<br />
<center> <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math> </center><br />
<br />
===LSTM Architecture===<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i}(t) x_{i,m-1} + U_{i}(t) h_{i,m-1})\\<br />
f_{m} &= \sigma(W_{f}(t) x_{f,m-1} + U_{f}(t)h_{f,m-1})\\<br />
o_{m} &= \sigma(W_{o}(t) x_{o,m-1} + U_{o}(t)h_{o,m-1})\\<br />
\tilde{c}_{m} &= \sigma(W_{c}(t) x_{c,m-1} + U_{c}(t)h_{c,m-1})\\<br />
c_{m} &= i_{m} \odot \tilde{c}_{m} + f_{m} \cdot c_{m-1}\\<br />
h_{m} &= o_{m} \odot tanh(c_{m})<br />
\end{align}<br />
</math><br />
</center><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math> W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by () and () for semantic concept detection RNN. Mathematically,<br />
<br />
<center><br />
<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
</center><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) = \int_{t}p(t|\mu_{0},\sigma_{0}^{2})p(d|\beta,t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},t)dt </math><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence, which will then be used to construct ELBO. Hence, <br />
<br />
:<math> <br />
\begin{align}<br />
\mathcal{L} &= \mathbb{E}_{q(t|d)} (log p(d|t)) - KL (q(t|d)||p(t|\mu_{0},\sigma_{0}^{2}) \\ &+ \mathbb{E}_{q(t|d)} (\sum_{m=1}|^{M} log p(y_{m}|y_{1:m-1}, t) \\ & \leq log p(y_{1:M}, d|\mu_{0},\sigma_{0}^{2},\beta)<br />
\end{align}<br />
</math><br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=38015
stat441F18/TCNLM
2018-11-06T05:17:32Z
<p>Q26deng: /* Model Inference */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} \sim N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}x_{m}) \\<br />
<br />
\end{align}<br />
</math><br />
<br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math><br />
<br />
===LSTM Architecture===<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
:<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i}(t) x_{i,m-1} + U_{i}(t) h_{i,m-1})\\<br />
f_{m} &= \sigma(W_{f}(t) x_{f,m-1} + U_{f}(t)h_{f,m-1})\\<br />
o_{m} &= \sigma(W_{o}(t) x_{o,m-1} + U_{o}(t)h_{o,m-1})\\<br />
\tilde{c}_{m} &= \sigma(W_{c}(t) x_{c,m-1} + U_{c}(t)h_{c,m-1})\\<br />
c_{m} &= i_{m} \odot \tilde{c}_{m} + f_{m} \cdot c_{m-1}\\<br />
h_{m} &= o_{m} \odot tanh(c_{m})<br />
\end{align}<br />
</math><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math> W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by () and () for semantic concept detection RNN. Mathematically,<br />
<br />
:<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
<math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) = \int_{t}p(t|\mu_{0},\sigma_{0}^{2})p(d|\beta,t)\prod_{m=1}^{M}p(y_{m}|y_{1:m-1},t)dt </math><br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, <math> q(t|d) </math>, which is the probability of latent vector <math> t </math> given bag-of-words from Neural Topic Model is used to be the variational distribution of the real marginal probability <math>p(t) </math>, compensated by Kullback-Leibler divergence, which will then be used to construct ELBO. Hence, <br />
<br />
:<math> <br />
\begin{align}<br />
\mathcal{L} &= \mathbb{E}_{q(t|d)} (log p(d|t)) - KL (q(t|d)||p(t|\mu_{0},\sigma_{0}^{2}) \\ &+ \mathbb{E}_{q(t|d)} (\sum_{m=1}|^{M} log p(y_{m}|y_{1:m-1}, t) \\ & \leq log p(y_{1:M}, d|\mu_{0},\sigma_{0}^{2},\beta)<br />
\end{align}<br />
</math><br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37999
stat441F18/TCNLM
2018-11-06T04:31:02Z
<p>Q26deng: /* Neural Language Model */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} ~ N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}x_{m}) \\<br />
<br />
\end{align}<br />
</math><br />
<br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. <br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} x n_{x} x T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} x n_{x} x T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, T experts are jointly trained as follows:<br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m}) &= softmax(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math><br />
<br />
===LSTM Architecture===<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate <math> i_{m} </math>, forget gate <math> f_{m} </math>, output gate <math> o_{m} </math>, and memory stage <math> \tilde{c}_{m} </math> respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
:<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i}(t) x_{i,m-1} + U_{i}(t) h_{i,m-1})\\<br />
f_{m} &= \sigma(W_{f}(t) x_{f,m-1} + U_{f}(t)h_{f,m-1})\\<br />
o_{m} &= \sigma(W_{o}(t) x_{o,m-1} + U_{o}(t)h_{o,m-1})\\<br />
\tilde{c}_{m} &= \sigma(W_{c}(t) x_{c,m-1} + U_{c}(t)h_{c,m-1})\\<br />
c_{m} &= i_{m} \odot \tilde{c}_{m} + f_{m} \cdot c_{m-1}\\<br />
h_{m} &= o_{m} \odot tanh(c_{m})<br />
\end{align}<br />
</math><br />
<br />
A matrix decomposition technique is applied onto <math> \mathcal{W}(t) </math> and <math> \mathcal{U}(t) </math> to further reduce the number of model parameters, which is each a multiplication of three terms: <math> W_{a} \in \mathbb{R}^{n_{h}xn_{f}}, W_{b} \in \mathbb{R}^{n_{f} x T}, </math>and <math> W_{c} \in \mathbb{R}^{n_{f}xn_{x}} </math>. This method is enlightened by () and () for semantic concept detection RNN. Mathematically,<br />
<br />
:<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot diag(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
where <math> \odot </math> denotes entrywise product.<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
(16)<br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, q(t|d) (this is the probability from variational autoencoder part) is used to be the variational distribution of p(t), compensated by Kullback-Leibler divergence, which will then be used to construct ELBO: (also insert a structure image of TCNLM)<br />
<br />
(17)<br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**'''Language Models: '''<br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**'''Topic Models:'''<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB, and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37984
stat441F18/TCNLM
2018-11-06T03:41:16Z
<p>Q26deng: /* Language Model */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} ~ N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document <math>d</math> is then calculated as follows:<br />
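<br />
(A sketch consistent with the generative process above and the standard neural topic model formulation; the integral is over the topic proportion <math>\boldsymbol{t}</math> and the sum is over the per-word topic assignments <math>z_n</math>.)<br />
:<math><br />
p(\boldsymbol{d} \mid \mu_0, \sigma_0^2, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t} \mid \mu_0, \sigma_0^2) \prod_{n} \sum_{z_n} p(w_n \mid \beta_{z_n}) \, p(z_n \mid \boldsymbol{t}) \, d\boldsymbol{t}<br />
</math><br />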
<br />
===Re-Parameterization Trick===<br />
<br />
In order to build an unbiased, low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The parameter updates are derived from the variational lower bound, which is discussed in the Model Inference section.<br />
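<br />
Concretely, instead of sampling <math>\boldsymbol{\theta}</math> directly from the Gaussian, the sample is written as a deterministic function of the encoder outputs plus independent noise, so gradients can flow through the encoder parameters (a sketch; <math>\mu(\boldsymbol{d})</math> and <math>\sigma(\boldsymbol{d})</math> denote the encoder's mean and standard deviation for document <math>\boldsymbol{d}</math>):<br />
:<math><br />
\boldsymbol{\theta} = \mu(\boldsymbol{d}) + \sigma(\boldsymbol{d}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})<br />
</math><br />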
<br />
===Diversity Regularizer===<br />
<br />
One problem that many topic models encounter is redundancy among the inferred topics, so TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the pairwise distance between topics: first, the distance between each pair of topics is measured by the angle between the corresponding topic vectors; then the mean angle over all pairs of the <math>T</math> topics and its variance are computed; finally, the topic diversity regularization is defined from this mean and variance and is added to the objective during model inference, as sketched below.<br />
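<br />
One way to write this regularizer, consistent with the description above (the exact normalization may differ from the paper), is:<br />
:<math><br />
a(\beta_i, \beta_j) = \arccos\left( \frac{|\beta_i \cdot \beta_j|}{\lVert \beta_i \rVert_2 \, \lVert \beta_j \rVert_2} \right), \qquad \phi = \frac{1}{T^2}\sum_{i,j} a(\beta_i, \beta_j), \qquad \nu = \frac{1}{T^2}\sum_{i,j} \big( a(\beta_i, \beta_j) - \phi \big)^2, \qquad R = \phi - \nu<br />
</math><br />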
<br />
=Language Model=<br />
<br />
A typical language model aims to define the conditional probability of each word <math>y_{m}</math> given all the preceding words <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>: <br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m}|y_{1:m-1})&=p(y_{m}|h_{m}) \\<br />
h_{m}&= f(h_{m-1}, x_{m})<br />
\end{align}<br />
</math><br />
<br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationships in the input and output a sequence of input-dependent data. Compared to traditional feedforward neural networks, an RNN maintains an internal memory by feeding previous information back into the network at each step. Because of this design, RNNs struggle to learn long-term dependencies: gradients vanish during back-propagation through time, which prevents states far back in time from contributing to the output at the current state. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are variants of the RNN designed to address this vanishing-gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In the Topic Compositional Neural Language Model, word choice and word order are strongly influenced by the topic distribution of a document, and each word has a corresponding topic assignment. A '''Mixture-of-Experts''' language model is proposed, where each 'expert' is a topic-specific LSTM unit whose trained parameters correspond to the latent topic vector <math> t </math> inherited from the Neural Topic Model.<br />
<br />
In such a model, word generation can be viewed as a weighted combination of the predictions produced by each 'expert', with the latent topic vector serving as the mixture weights. Without loss of generality, the Mixture-of-Experts construction is first illustrated with a simple RNN cell and then generalized to the proposed LSTM.<br />
<br />
TCNLM extends the weight matrices of the RNN unit to be topic-dependent, since each word has a topic assignment; this implicitly defines an ensemble of <math>T</math> language models. Define two tensors <math> \mathcal{W} \in \mathbb{R}^{n_{h} \times n_{x} \times T} </math> and <math> \mathcal{U} \in \mathbb{R}^{n_{h} \times n_{h} \times T} </math>, where <math> n_{h} </math> is the number of hidden units and <math> n_{x} </math> is the size of the word embeddings. Each expert <math> E_{k} </math> has a corresponding set of parameters <math> \mathcal{W}[k], \mathcal{U}[k] </math>. Specifically, the <math>T</math> experts are jointly trained as follows:<br />
<br />
:<math><br />
\begin{align}<br />
p(y_{m} \mid y_{1:m-1}, t) &= \mathrm{softmax}(Vh_{m}),\\<br />
h_{m} &= \sigma(W(t)x_{m} + U(t)h_{m-1}), \\<br />
\end{align}<br />
</math><br />
where <math> W(t) </math> and <math> U(t) </math> are defined as: <math> W(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{W}[k], U(t) = \sum_{k=1}^{T}t_{k} \cdot \mathcal{U}[k]. </math><br />
<br />
A matrix factorization technique is applied to <math> W(t) </math> and <math> U(t) </math> to further reduce the number of model parameters, expressing each as a product of three terms, e.g. <math> W_{a} \in \mathbb{R}^{n_{h} \times n_{f}}, W_{b} \in \mathbb{R}^{n_{f} \times T}, </math> and <math> W_{c} \in \mathbb{R}^{n_{f} \times n_{x}} </math>, where <math> n_{f} </math> is the number of factors. This approach is motivated by earlier work on factored weight tensors and on semantic concept detection with RNNs. Mathematically,<br />
<br />
:<math><br />
\begin{align}<br />
W(t) &= W_{a} \cdot \mathrm{diag}(W_{b}t) \cdot W_{c} \\<br />
&= W_{a} \cdot (W_{b}t \odot W_{c}) \\<br />
\end{align}<br />
</math><br />
where <math> \odot </math> denotes the entrywise product; here each row of <math>W_{c}</math> is scaled by the corresponding entry of the vector <math>W_{b}t</math>. A short code sketch of this construction is given below.<br />
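<br />
The following NumPy sketch illustrates the two equivalent ways of building the topic-dependent weight matrix described above: the ensemble form <math>W(t)=\sum_k t_k \mathcal{W}[k]</math> and the factored form <math>W_{a}\,\mathrm{diag}(W_{b}t)\,W_{c}</math>, followed by one topic-dependent RNN step. The dimensions and variable names are illustrative only and are not taken from the paper's implementation.<br />
<pre>
import numpy as np

n_h, n_x, n_f, T = 64, 32, 16, 10             # hidden size, embedding size, factors, topics
rng = np.random.default_rng(0)

t = rng.dirichlet(np.ones(T))                 # topic proportion vector (entries sum to 1)

# (a) Ensemble form: one expert weight matrix per topic, mixed by t.
W_tensor = rng.normal(size=(T, n_h, n_x))     # the tensor W, stored topic-first for convenience
W_ensemble = np.tensordot(t, W_tensor, axes=1)  # W(t) = sum_k t_k * W[k], shape (n_h, n_x)

# (b) Factored form: W(t) = W_a @ diag(W_b t) @ W_c, with far fewer parameters.
W_a = rng.normal(size=(n_h, n_f))
W_b = rng.normal(size=(n_f, T))
W_c = rng.normal(size=(n_f, n_x))
W_factored = W_a @ np.diag(W_b @ t) @ W_c     # same as W_a @ ((W_b @ t)[:, None] * W_c)

# One topic-dependent RNN step: h_m = sigma(W(t) x_m + U(t) h_{m-1}).
# U(t) would be built the same way, with factor shapes (n_h, n_f), (n_f, T), (n_f, n_h).
U_t = rng.normal(size=(n_h, n_h))             # stand-in for U(t) in this sketch
x_m = rng.normal(size=n_x)                    # word embedding of the previous word
h_prev = np.zeros(n_h)
h_m = 1.0 / (1.0 + np.exp(-(W_factored @ x_m + U_t @ h_prev)))  # sigmoid nonlinearity

print(W_ensemble.shape, W_factored.shape, h_m.shape)
</pre>
<br />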
<br />
To generalize this to an LSTM, TCNLM requires four sets of such parameters, one each for the input gate, forget gate, output gate, and memory cell. Recalling a typical LSTM cell (insert an image depicting LSTM), the model can be parametrized as follows:<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
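<br />
For reference, the equations of a standard LSTM cell are listed below, with biases omitted and with the candidate memory written as <math>g_{m}</math> to avoid clashing with the factorization matrices above. This is a sketch of the standard formulation rather than the paper's exact Eqs. (13)–(15); in the topic-compositional version, each weight matrix <math>W_{\ast}</math>, <math>U_{\ast}</math> is replaced by a topic-dependent matrix <math>W_{\ast}(t)</math>, <math>U_{\ast}(t)</math> constructed exactly as in the RNN case.<br />
:<math><br />
\begin{align}<br />
i_{m} &= \sigma(W_{i} x_{m} + U_{i} h_{m-1}), \\<br />
f_{m} &= \sigma(W_{f} x_{m} + U_{f} h_{m-1}), \\<br />
o_{m} &= \sigma(W_{o} x_{m} + U_{o} h_{m-1}), \\<br />
g_{m} &= \tanh(W_{g} x_{m} + U_{g} h_{m-1}), \\<br />
c_{m} &= f_{m} \odot c_{m-1} + i_{m} \odot g_{m}, \\<br />
h_{m} &= o_{m} \odot \tanh(c_{m})<br />
\end{align}<br />
</math><br />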
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder framework serving as the Neural Topic Model, which learns a generative process: it reads in the bag-of-words, embeds the document into the topic vector, and reconstructs the bag-of-words as output. This is followed by an ensemble of LSTMs that predicts the sequence of words in the document. In a nutshell, the joint marginal distribution of the <math>M</math> predicted words and the document is:<br />
<br />
(16)<br />
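<br />
Written out in the notation above, this joint marginal takes the following form (a sketch of the factorization implied by the generative process; see the paper's Eq. (16) for the exact statement):<br />
:<math><br />
p(y_{1:M}, \boldsymbol{d} \mid \mu_0, \sigma_0^2, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t} \mid \mu_0, \sigma_0^2) \left[ \prod_{n} \sum_{z_n} p(w_n \mid \beta_{z_n}) \, p(z_n \mid \boldsymbol{t}) \right] \left[ \prod_{m=1}^{M} p(y_m \mid y_{1:m-1}, \boldsymbol{t}) \right] d\boldsymbol{t}<br />
</math><br />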
<br />
However, direct optimization is intractable, so variational inference is employed to provide an analytical approximation to the posterior over the unobserved <math>t</math>. Here, <math>q(t|d)</math> (the distribution produced by the encoder of the variational autoencoder) is used as the variational approximation to the posterior of <math>t</math>, with a Kullback-Leibler divergence term penalizing its deviation from the prior; these pieces are combined to construct the evidence lower bound (ELBO): (also insert a structure image of TCNLM)<br />
<br />
(17)<br />
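<br />
A sketch of the resulting lower bound: the first two terms form the neural topic model's ELBO (bag-of-words reconstruction minus the KL penalty) and the last term is the expected language-model log-likelihood. The exact weighting and the use of a sampled <math>\boldsymbol{t}</math> follow the paper's Eq. (17), which may differ slightly from this form.<br />
:<math><br />
\mathcal{L} = \mathbb{E}_{q(\boldsymbol{t}|\boldsymbol{d})}\left[ \log p(\boldsymbol{d} \mid \boldsymbol{t}) \right] - \mathrm{KL}\left( q(\boldsymbol{t}|\boldsymbol{d}) \,\|\, p(\boldsymbol{t}) \right) + \mathbb{E}_{q(\boldsymbol{t}|\boldsymbol{d})}\left[ \sum_{m=1}^{M} \log p(y_m \mid y_{1:m-1}, \boldsymbol{t}) \right]<br />
</math><br />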
<br />
Hence, the goal of TCNLM is to maximize this variational lower bound on <math> \log p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math>, together with the diversity regularization term.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM not only models documents as bags of words but also transfers the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model that combines the predicted word distributions of a topic model and a standard RNN language model. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing efficient end-to-end training. Further, the topic information is used as guidance for the MoE model design, and with the factorization described above the model yields improved performance at modest additional cost.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**Language Models: <br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**Topic Models:<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS (a collection of Associated Press news articles from 2009 to 2016), IMDB, and BNC for model evaluation, the paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-augmented methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs best across all datasets, and its performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM provides a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37830
stat441F18/TCNLM
2018-11-05T17:39:11Z
<p>Q26deng: /* RNN (LSTM) */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<center><br />
<math><br />
\boldsymbol{\theta} ~ N(\mu_0, \sigma_0^2), \boldsymbol{t} = g(\boldsymbol{\theta}) = softmax(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}})<br />
</math><br />
</center><br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
[[https://en.wikipedia.org/wiki/Long_short-term_memory|LSTM]]<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document, and each word has its corresponding topic distribution. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.<br />
<br />
In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. Without loss of generality, ‘Mixture-of-Expert’ is first illustrated with a simple RNN cell, which then generalized into the proposed LSTM.<br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Specifically, T experts are jointly trained as follows:<br />
<br />
(9)<br />
<br />
(10)<br />
<br />
(11)<br />
<br />
Where (introducing notations)<br />
<br />
A matrix decomposition technique is applied onto W and U to further reduce the number of model parameters, which is each a multiplication of three terms. This method is enlightened by () and () for semantic concept detection RNN. (element-wise product)<br />
<br />
(12)<br />
<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate, forget gate, output gate, and memory respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
(16)<br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, q(t|d) (this is the probability from variational autoencoder part) is used to be the variational distribution of p(t), compensated by Kullback-Leibler divergence, which will then be used to construct ELBO: (also insert a structure image of TCNLM)<br />
<br />
(17)<br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**Language Models: <br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**Topic Models:<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37828
stat441F18/TCNLM
2018-11-05T17:35:28Z
<p>Q26deng: /* Model Inference */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document, and each word has its corresponding topic distribution. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.<br />
<br />
In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. Without loss of generality, ‘Mixture-of-Expert’ is first illustrated with a simple RNN cell, which then generalized into the proposed LSTM.<br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Specifically, T experts are jointly trained as follows:<br />
<br />
(9)<br />
<br />
(10)<br />
<br />
(11)<br />
<br />
Where (introducing notations)<br />
<br />
A matrix decomposition technique is applied onto W and U to further reduce the number of model parameters, which is each a multiplication of three terms. This method is enlightened by () and () for semantic concept detection RNN. (element-wise product)<br />
<br />
(12)<br />
<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate, forget gate, output gate, and memory respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
(16)<br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, q(t|d) (this is the probability from variational autoencoder part) is used to be the variational distribution of p(t), compensated by Kullback-Leibler divergence, which will then be used to construct ELBO: (also insert a structure image of TCNLM)<br />
<br />
(17)<br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y_{1:M},d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**Language Models: <br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**Topic Models:<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37827
stat441F18/TCNLM
2018-11-05T17:34:20Z
<p>Q26deng: /* Model Inference */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document, and each word has its corresponding topic distribution. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.<br />
<br />
In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. Without loss of generality, ‘Mixture-of-Expert’ is first illustrated with a simple RNN cell, which then generalized into the proposed LSTM.<br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Specifically, T experts are jointly trained as follows:<br />
<br />
(9)<br />
<br />
(10)<br />
<br />
(11)<br />
<br />
Where (introducing notations)<br />
<br />
A matrix decomposition technique is applied onto W and U to further reduce the number of model parameters, which is each a multiplication of three terms. This method is enlightened by () and () for semantic concept detection RNN. (element-wise product)<br />
<br />
(12)<br />
<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate, forget gate, output gate, and memory respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
<br />
=Model Inference=<br />
<br />
To summarize, the proposed model consists of a variational autoencoder frameworks as Neural Topic Model that learns a generative process, where the model reads in the bag-of-words, embeds a document into the topic vector, and reconstructs the bag-of-words as output, followed by an ensemble of LSTMs for predicting a sequence of words in the document. In a nutshell, the joint marginal distribution of the M predicted words and the document is:<br />
<br />
(16)<br />
<br />
However, the direct optimization is intractable, therefore variational inference is employed to provide an analytical approximation to the posterior probability of the unobservable t. Here, q(t|d) (this is the probability from variational autoencoder part) is used to be the variational distribution of p(t), compensated by Kullback-Leibler divergence, which will then be used to construct ELBO: (also insert a structure image of TCNLM)<br />
<br />
(17)<br />
<br />
Hence, the goal of TCNLM is to maximize the approximated log likelihood of <math> p(y1:M,d|\mu_{0},\sigma_{0}^{2},\beta) </math> together with the diversity regularization.<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**Language Models: <br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**Topic Models:<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#A larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#The advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37825
stat441F18/TCNLM
2018-11-05T17:32:07Z
<p>Q26deng: /* Language Model */</p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a [[#Neural Topic Model|neural topic model]] (NTM) and a [[#Neural Language Model| Mixture-of-Experts]] (MoE) language model. The latent topics learned within a [[Autoencoder|variational autoencoder]] framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of [[#RNN (LSTM)| RNN]]-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterized the multinational document topic distribution. However, it uses a Gaussian random vector by passing it through a softmax function. The generative process in the following:<br />
<br />
Where are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
<br />
===Re-Parameritization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The update for the parameters is derived from variational lower bound will be discussed in the section model inference.<br />
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is the redundancy in the inferred topics. Therefore, The TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each paired topics. First, we measure the distance between pair of topics with . Then, mean angle of all pairs of T topics is , and variance is . Finally, we identify the topic diversity regularization as which will be used in the model inference.<br />
<br />
=Language Model=<br />
<br />
A typical Language Model aims to define the conditional probability of each word <math>y_{m} </math>given all the preceding input <math> y_{1},...,y_{m-1} </math>, connected through a hidden state <math> h_{m} </math>. <br />
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among input information and output a sequence of input-dependent data. Comparing to traditional feedforward neural networks, RNNs maintains internal memory by looping over previous information inside each network. For its distinctive design, RNNs have shortcomings when learning from long-term memory as a result of the zero gradients in back-propagation, which prohibits states distant in time from contributing to the output of current state. Long short-term Memory (LSTM) or Gated Recurrent Unit (GRU) are variations of RNNs that were designed to address the vanishing gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In Topic Compositional Neural Language Model, word choices and order structures are highly motivated by the topic distribution of a document, and each word has its corresponding topic distribution. A '''‘Mixture of Expert’''' language model is proposed, where each ‘expert’ itself is a topic specific LSTM unit with trained parameters corresponding to the latent topic vector <math> t </math> inherited from Neural Topic Model.<br />
<br />
In such a model, the generation of words can be considered as a weighted average of topic proportion and predictions resulted from each ‘expert’ model, with latent topic vector served as proportion weights. Without loss of generality, ‘Mixture-of-Expert’ is first illustrated with a simple RNN cell, which then generalized into the proposed LSTM.<br />
<br />
TCNLM extends weight matrices of each RNN unit to be topic-dependent due to the existence of topic assignment for individual word, which implicitly defines an ensemble of T language models. Specifically, T experts are jointly trained as follows:<br />
<br />
(9)<br />
<br />
(10)<br />
<br />
(11)<br />
<br />
Where (introducing notations)<br />
<br />
A matrix decomposition technique is applied onto W and U to further reduce the number of model parameters, which is each a multiplication of three terms. This method is enlightened by () and () for semantic concept detection RNN. (element-wise product)<br />
<br />
(12)<br />
<br />
To generalize into LSTM, TCNLM requires four sets of parameters for input gate, forget gate, output gate, and memory respectively. Recall a typical LSTM cell, (insert an image depicting LSTM) model can be parametrized as follows:<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
<br />
=Model Inference=<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM enforces the neural network not only modeling documents as bag-of-words but also transferring the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.<br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
*Other models for comparison:<br />
**Language Models: <br />
***basic-LSTM<br />
***LDA+LSTM, LCLM (Wang and Cho, 2016)<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
**Topic Models:<br />
***LDA<br />
***NTM<br />
***TDLM (Lau et al., 2017)<br />
***Topic-RNN (Dieng et al., 2016)<br />
<br />
Using the datasets APNEWS, IMDB and BNC. APNEWS, which is a collection of Associated Press news articles from 2009 to 2016, to do the model evaluation, the paper gets the following result:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information through the joint variational learning framework to implicitly train an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieve the best coherence performance over APNEWS and IMDB and are relatively competitive with LDA on BNC. <br />
#We also observe that a larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. <br />
#Additionally, the advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=<br />
<br />
=References=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18/TCNLM&diff=37820
stat441F18/TCNLM
2018-11-05T17:26:05Z
<p>Q26deng: </p>
<hr />
<div>'''Topic Compositional Neural Language Model''' ('''TCNLM''') simultaneously captures both the global semantic meaning and the local word-ordering structure in a document. A common TCNLM incorporates fundamental components of both a neural topic model (NTM) and a Mixture-of-Experts (MoE) language model. The latent topics learned within a variational autoencoder framework, coupled with the probability of topic usage, are further trained in a MoE model. (Insert figure here)<br />
<br />
TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the incapability of RNN-based neural language models in capturing broad document context. After learning the global semantic, the probability of each learned latent topic is used to learn the local structure of a word sequence.<br />
<br />
=Presented by=<br />
*Yan Yu Chen<br />
*Qisi Deng<br />
*Hengxin Li<br />
*Bochao Zhang<br />
<br />
=Model Architecture=<br />
<br />
=Topic Model=<br />
<br />
A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics. <br />
<br />
==LDA==<br />
<br />
A common example of a topic model would be latent Dirichlet allocation (LDA), which assumes each document contains various topics but with different proportion. LDA parameterizes topic distribution by the Dirichlet distribution and calculates the marginal likelihood as (insert formula).<br />
<br />
==Neural Topic Model==<br />
<br />
The neural topic model takes in a bag-of-words representation of a document to predict the topic distribution of the document in wish to identify the global semantic meaning of documents.<br />
<br />
The variables are defined as the following:<br />
<br />
*<math>d</math> be document with <math> D </math> distinct vocabulary<br />
*<math>\boldsymbol{d} \in \mathbb{Z}_+^D</math> be the bag-of-words representation of document d (each element of d is the count of the number of times the corresponding word appears in d),<br />
*<math>\boldsymbol{t}</math> be the topic proportion for document d<br />
*<math>T</math> be the number of topics<br />
*<math>z_n</math> be the topic assignment for word <math>w_n</math><br />
*<math>\boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \}</math> be the transition matrix from the topic distribution trained in the decoder where <math>\beta_i \in \mathbb{R}^D</math>is the topic distribution over the i-th word in the corresponding <math>\boldsymbol{d}</math>.<br />
<br />
Similar to LDA, the neural topic model parameterizes the multinomial document-topic distribution. However, instead of a Dirichlet prior it draws a Gaussian random vector and passes it through a softmax function. The generative process is as follows:<br />
<br />
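The generative-process equations are missing from this draft; a sketch following the neural topic model described in the TCNLM paper, with <math>g(\cdot)</math> denoting the softmax transformation parameterized by <math>\hat{W}</math> and <math>\hat{b}</math> (this notation is an assumption of the sketch), is:<br />
<br />
<math>\boldsymbol{\theta} \sim \mathcal{N}(0, I), \qquad \boldsymbol{t} = g(\boldsymbol{\theta}) = \text{softmax}(\hat{W}\boldsymbol{\theta} + \hat{b}),</math><br />
<br />
<math>z_n \sim \text{Discrete}(\boldsymbol{t}), \qquad w_n \sim \text{Discrete}(\boldsymbol{\beta}_{z_n}),</math><br />
<br />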
where <math>\hat{W}</math> and <math>\hat{b}</math> in the sketch above are trainable parameters.<br />
<br />
The marginal likelihood for document d is then calculated as the following:<br />
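<br />
The formula is missing from this draft; a sketch of the marginal likelihood under the generative process above, assuming the document contains <math>N</math> words, is:<br />
<br />
<math>p(\boldsymbol{d} \mid \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t}) \prod_{n=1}^{N} \sum_{z_n=1}^{T} p(w_n \mid \boldsymbol{\beta}_{z_n})\, p(z_n \mid \boldsymbol{t}) \, d\boldsymbol{t}</math><br />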
<br />
===Re-Parameterization Trick===<br />
<br />
In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The parameter updates are derived from the variational lower bound, which will be discussed in the Model Inference section.<br />
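<br />
As a concrete illustration (not the paper's code), a minimal NumPy sketch of the re-parameterization trick for a Gaussian variational posterior; <code>mu</code> and <code>log_sigma</code> stand for the encoder outputs and are assumptions of this sketch:<br />
<pre>
import numpy as np

def reparameterize(mu, log_sigma, rng=np.random.default_rng()):
    """Draw theta = mu + sigma * eps with eps ~ N(0, I).

    Isolating the randomness in eps lets gradients flow through mu and
    log_sigma, which is what makes the estimator low-variance and unbiased.
    """
    eps = rng.standard_normal(size=mu.shape)   # noise independent of the parameters
    return mu + np.exp(log_sigma) * eps        # differentiable in mu and log_sigma

# Toy usage: one draw of a 50-dimensional latent vector.
theta = reparameterize(np.zeros(50), np.zeros(50))
</pre>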
<br />
===Diversity Regularizer===<br />
<br />
One of the problems that many topic models encounter is redundancy among the inferred topics, so TCNLM uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each pair of topic vectors: first, the angle between every pair of topics is measured; then the mean angle over all pairs of the T topics and its variance are computed; finally, a topic-diversity regularization term built from this mean and variance is added to the objective during model inference. A sketch of one plausible formulation is given below.<br />
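<br />
The concrete formulas were dropped from this draft; one plausible formulation, following the diversity regularizer described in the TCNLM paper, measures the angle between topics <math>i</math> and <math>j</math> as<br />
<br />
<math>a(\beta_i, \beta_j) = \arccos\left( \frac{|\beta_i \cdot \beta_j|}{\lVert \beta_i \rVert_2 \, \lVert \beta_j \rVert_2} \right),</math><br />
<br />
computes the mean <math>\nu</math> and variance of <math>a(\beta_i, \beta_j)</math> over all pairs of the <math>T</math> topics, and adds a term equal to the mean angle minus the variance (scaled by a hyper-parameter) to the training objective, so that topics are pushed apart while keeping the pairwise angles consistent.<br />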
<br />
=Language Model=<br />
<br />
A typical language model aims to define the conditional probability of each word <math>y_m</math> given all the preceding words <math>y_1, \dots, y_{m-1}</math>, connected through the hidden state <math>h_m</math>. <br />
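<br />
In symbols, using <math>V</math> for an output projection matrix (a notational assumption of this sketch):<br />
<br />
<math>p(y_1, \dots, y_M) = \prod_{m=1}^{M} p(y_m \mid y_1, \dots, y_{m-1}), \qquad p(y_m \mid y_1, \dots, y_{m-1}) = \text{softmax}(V h_m).</math><br />
<br />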
==RNN (LSTM)==<br />
Recurrent Neural Networks (RNNs) capture the temporal relationship among the inputs and output a sequence of input-dependent data. Compared with traditional feedforward neural networks, an RNN maintains internal memory by looping over previous information inside the network. Despite this design, RNNs have difficulty learning long-term dependencies because gradients vanish during back-propagation, which prevents states distant in time from contributing to the output of the current state. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are variations of RNNs that were designed to address this vanishing-gradient issue.<br />
<br />
==Neural Language Model==<br />
<br />
In the Topic Compositional Neural Language Model, word choices and ordering structures are strongly influenced by the topic distribution of a document, and each word has its corresponding topic distribution. A 'Mixture-of-Experts' language model is proposed, where each 'expert' is itself a topic-specific LSTM whose parameters correspond to the latent topic vector inherited from the neural topic model.<br />
<br />
In such a model, the generation of words can be considered a weighted average of the predictions produced by each 'expert' model, with the latent topic vector serving as the mixture weights. Without loss of generality, the 'Mixture-of-Experts' construction is first illustrated with a simple RNN cell and then generalized to the proposed LSTM.<br />
<br />
TCNLM extends the weight matrices of each RNN unit to be topic-dependent, because each word has its own topic assignment; this implicitly defines an ensemble of T language models. Specifically, the T experts are jointly trained as follows (equations (9)-(11) below are placeholders; a sketch follows them):<br />
<br />
(9)<br />
<br />
(10)<br />
<br />
(11)<br />
<br />
Where (introducing notations)<br />
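<br />
A sketch of one plausible form of the topic-dependent RNN ensemble, consistent with the description above (here <math>x_m</math> is the embedding of word <math>y_m</math> and <math>V</math> is an output projection, both notational assumptions of this sketch), is:<br />
<br />
<math>h_m = \sigma\left( W(\boldsymbol{t})\, x_m + U(\boldsymbol{t})\, h_{m-1} \right), \qquad p(y_{m+1} \mid y_1, \dots, y_m, \boldsymbol{t}) = \text{softmax}(V h_m),</math><br />
<br />
where <math>W(\boldsymbol{t})</math> and <math>U(\boldsymbol{t})</math> are combinations of the <math>T</math> experts' weight matrices, weighted by the topic proportion <math>\boldsymbol{t}</math>.<br />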
<br />
A matrix factorization technique is applied to W and U to further reduce the number of model parameters: each matrix is decomposed into a product of three terms involving an element-wise product (a sketch of the factorization follows equation (12) below). This method is inspired by () and (), which used a similar factorization for RNN-based semantic concept detection.<br />
<br />
(12)<br />
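<br />
Equation (12) above is a placeholder in this draft; a sketch of the kind of factorization meant, with <math>\odot</math> denoting the element-wise product, is:<br />
<br />
<math>W(\boldsymbol{t})\, x_m = W_a \left[ (W_b\, \boldsymbol{t}) \odot (W_c\, x_m) \right],</math><br />
<br />
so that only the three smaller matrices <math>W_a</math>, <math>W_b</math>, <math>W_c</math> are learned instead of <math>T</math> separate full-size matrices; <math>U(\boldsymbol{t})</math> is factorized in the same way.<br />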
<br />
To generalize to an LSTM, TCNLM requires four sets of parameters, for the input gate, forget gate, output gate, and cell memory respectively. Recalling a typical LSTM cell (insert an image depicting LSTM), the model can be parameterized as follows (equations (13)-(15) below are placeholders; a standard form is sketched after them):<br />
<br />
(13)<br />
<br />
(14)<br />
<br />
(15)<br />
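<br />
For reference, a standard LSTM cell, whose weight matrices the model makes topic-dependent in the same way as the RNN weights above, can be written as:<br />
<br />
<math>i_m = \sigma(W_i x_m + U_i h_{m-1}), \qquad f_m = \sigma(W_f x_m + U_f h_{m-1}), \qquad o_m = \sigma(W_o x_m + U_o h_{m-1}),</math><br />
<br />
<math>\tilde{c}_m = \tanh(W_c x_m + U_c h_{m-1}), \qquad c_m = f_m \odot c_{m-1} + i_m \odot \tilde{c}_m, \qquad h_m = o_m \odot \tanh(c_m).</math><br />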
<br />
=Model Inference=<br />
<br />
=Model Comparison and Evaluation=<br />
==Model Comparison==<br />
In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:<br />
<br />
#The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM requires the neural network not only to model documents as bags of words but also to transfer the inferred topic knowledge to a language model for word-sequence generation. <br />
#Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model that combines the predicted word distributions of a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing efficient end-to-end training. Further, the topic information is used as guidance for the MoE model design, and TCNLM's factorization method allows the model to yield improved performance efficiently. <br />
#TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.<br />
<br />
==Model Evaluation==<br />
The model is evaluated on three datasets: APNEWS (a collection of Associated Press news articles from 2009 to 2016), IMDB, and BNC. The paper reports the following results:<br />
<br />
*'''In the evaluation of Language Model:'''<br />
#All the methods that incorporate topic information outperform the basic LSTM model, indicating the effectiveness of incorporating global semantic topic information. <br />
#TCNLM performs the best across all datasets, and performance keeps improving as the number of topics increases. <br />
#The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model than using the extra context words directly. <br />
#The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM supplies a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.<br />
<br />
*'''In the evaluation of Topic Model:'''<br />
#TCNLM achieves the best coherence performance on APNEWS and IMDB and is competitive with LDA on BNC. <br />
#We also observe that a larger model may result in slightly worse coherence. One possible explanation is that a larger language model has more influence on the topic model, and the stronger sequential information it passes on may hurt the coherence measurement. <br />
#Additionally, the advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.<br />
<br />
=Extensions=</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=37788
stat441F18
2018-11-05T16:35:21Z
<p>Q26deng: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Memory-Based Parameter Adaptation || || <br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going deeper with convolutions ||[https://arxiv.org/pdf/1409.4842.pdf paper] || <br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| [https://arxiv.org/pdf/1712.09783.pdf paper] || <br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797] || ||<br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| tba || || <br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=37398
stat441F18
2018-10-31T01:43:39Z
<p>Q26deng: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Will be added soon || || <br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| Going deeper with convolutions ||[https://arxiv.org/pdf/1409.4842.pdf paper] || <br />
|-<br />
|NOv 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Topic Compositional Neural Language Model|| || <br />
|-<br />
|Nov 15 || Zhaoran Hou, Pei Wei Wang, Chi Zhang, Yiming Li, Daoyi Chen, Ying Chi|| 4|| Extreme Learning Machine for regression and Multi-class Classification|| [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6035797] || ||<br />
|-<br />
|NOv 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek, Brendan Ross, Jon Barenboim, Junqiao Lin, James Bootsma || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|NOv 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Robust Probabilistic Modeling with Bayesian Data Reweighting || [http://proceedings.mlr.press/v70/wang17g/wang17g.pdf Paper] || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|NOv 27 || Mitchell Snaith, Alexandre Xiao, Hudson Ash, Richard Zhang, Stephen Kingston, Ziqiu Zhu || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| tba || || <br />
|-<br />
|NOv 29 || Jameson Ngo, Amy Xu, Aden Grant, Yu Hao Wang, Andrew McMurry, Baizhi Song || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=36796
stat441F18
2018-10-18T20:45:36Z
<p>Q26deng: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || Jason Schneider, Jordyn Walton, Zahraa Abbas, Andrew Na || 1|| Will be added soon || || <br />
|-<br />
|Nov 13 ||Sai Praneeth M, Xudong Peng, Alice Li, Shahrzad Hosseini Vajargah|| 2|| || || <br />
|-<br />
|NOv 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| The Evolution of Sentiment Analysis|| || <br />
|-<br />
|Nov 15 || Eric, Mike, Rebcca, Susan|| 4|| Will be added soon|| || || <br />
|-<br />
|NOv 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent Loung || 6|| Convolutional Neural Networks for Sentence Classification || [https://arxiv.org/pdf/1408.5882.pdf paper] || <br />
|-<br />
|NOv 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai, Colin Stranc, Philomène Bobichon, Aditya Maheshwari, Zepeng An || 7|| Will be added soon || || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu || 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|NOv 27 || Mitchell Snaith, Alexandre Xiao, Hudson Ash, Richard Zhang, Stephen Kingston, Ziqiu Zhu || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam, Yan Jiao, Shuyue Wang, Yutong Wu, Shikun Cui || 10|| tba || || <br />
|-<br />
|NOv 29 || Jameson Ngo, Amy Xu || 11|| TBA || || <br />
|-<br />
|Nov 29 || Qianying Zhao, Hui Huang, Lingyun Yi, Jiayue Zhang, Siao Chen, Rongrong Su, Gezhou Zhang, Meiyu Zhou || 12|| || ||</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=36683
F18-STAT841-Proposal
2018-10-08T03:00:15Z
<p>Q26deng: </p>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in human lungs which has symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous especially to infants and elders. In 2015, 920,000 children under the age of 5 died from this disease [2]. Due to its fatality to children, diagnosing pneumonia has a high order. A common method of diagnosing pneumonia is to obtain patients’ chest radiograph (CXR) which is a gray-scale scan image of patients’ chests using x-ray. The infected region due to pneumonia usually shows as an area or areas of increased opacity [3] on CXR. However, many other factors can also contribute to increase in opacity on CXR which makes the diagnose very challenging. The diagnose also requires highly-skilled clinicians and a lot of time of CXR screening. The Radiological Society of North America (RSNA®) sees the opportunity of using machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 2'''<br />
Group members:<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' <br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object, and it almost certainly differs from person to person. Mislabeled images also present a problem, since they will create outlier points when we train our models. <br />
<br />
We plan on learning more about some of the currently mature image recognition algorithms to inspire and develop our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Reproducibility report: *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks*<br />
<br />
'''Description:''' <br />
<br />
The paper *Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks* [1] has been submitted to ICLR 2019. It aims to "fix" variational Bayes and turn it into a robust inference tool through two innovations. <br />
<br />
Goals are to: <br />
<br />
- reproduce the deterministic variational inference scheme as described in the paper without referencing the original author's code, providing a 3rd party implementation<br />
<br />
- reproduce experiment results with own implementation, using the same NN framework for reference implementations of compared methods described in the paper<br />
<br />
- reproduce experiment results with the author's own implementation<br />
<br />
- explore other possible applications of variational Bayes besides heteroscedastic regression<br />
<br />
[1] OpenReview location: https://openreview.net/forum?id=B1l08oAct7<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Rebecca, Chen<br />
<br />
Susan,<br />
<br />
Mike, Li<br />
<br />
Ted, Wang<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become more and more eye-catching, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms that optimize classification for a specific type of image. <br />
<br />
In this project, we will dig into the base classifiers we learned in class and try to combine them to find an optimal solution for a certain type of image dataset. Currently, we are looking into a dataset from Kaggle: the Quick, Draw! Doodle Recognition Challenge. The dataset in this competition contains 50M drawings across 340 categories and is a subset of the world's largest doodling dataset, which is continually updated by real players of the drawing game. Anyone can contribute by joining it! (quickdraw.withgoogle.com).<br />
<br />
For us, as machine learning students, the goal is to help develop a better classification method. By “better”, we mean finding a balance between simplicity and accuracy. We will start with a neural network using different activation functions in each layer, and we will also combine base classifiers with bagging, random forests, and boosting for ensemble learning. We will regularize our parameters to avoid overfitting on the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem. <br />
<br />
Hopefully, we can not only finish our project successfully, but also make a small contribution to the machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/human-protein-atlas-image-classification Human Protein Atlas Image Classification]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the Human Protein Atlas Image Classification competition featured on Kaggle. We will classify proteins based on patterns seen in microscopic images of human cells.<br />
<br />
Historically, the work done to classify proteins had only developed methods to classify proteins using single patterns of very few cell types at a time. The goal of this challenge is to develop methods to classify proteins based on multiple/mixed patterns and with a larger range of cell types.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition Google Analytics Customer Revenue Prediction. In this competition, we will analyze a customer dataset from the Google Merchandise Store, which sells swag, to predict revenue per customer using RStudio. Our presentation report will cover not only the conclusions we reach by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements<br />
<br />
<br />
'''Description:''' <br />
Stock prices are affected by the news to some extent. What is the influence of news on stock prices, and what is its predictive power? <br />
We are going to use the content of news articles to predict the movement of stock prices. We will mine the data to find the useful information behind it, and as a result we will predict stock price performance when the market reacts to news.<br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
McLellan, Isaac<br />
<br />
Brewster, Kristi<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quick, Draw! Doodle Recognition<br />
<br />
'''Description:''' <br />
<br />
'''Background'''<br />
<br />
Google’s Quick, Draw! is an online game where a user is prompted to draw an image depicting a certain category in under 20 seconds. As the drawing is being completed, the game uses a model which attempts to correctly identify the image being drawn. With the aim to improve the underlying pattern recognition model this game uses, Google is hosting a Kaggle competition asking the public to build a model to correctly identify a given drawing. The model should classify the drawing into one of the 340 label categories within the Quick, Draw! Game in 3 guesses or less.<br />
<br />
'''Proposed Approach'''<br />
<br />
Each image/doodle (input) is considered as a matrix of pixel values. In order to classify images, we apply convolution to reshape an image's matrix of pixel values. This reduces the dimensionality of the input significantly, which in turn reduces the number of parameters of any proposed recognition model. Using filters, pooling layers, and further convolution, a final layer called the fully connected layer is used to correlate images with categories, assigning probabilities (weights) and hence classifying images. <br />
<br />
This approach to image classification is called a convolutional neural network (CNN), and we propose using it to classify the doodles within the Quick, Draw! dataset.<br />
<br />
To control overfitting and underfitting of our proposed model and to minimize the error, we will try different architectures consisting of different types and dimensions of pooling layers and input filters. A rough sketch of such an architecture is given below.<br />
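<br />
As an illustration only (not part of the original proposal), a minimal Keras-style CNN of the kind described above; the 28×28 grayscale input size and the training settings are assumptions:<br />
<pre>
from tensorflow import keras
from tensorflow.keras import layers

# Convolution + pooling layers shrink the pixel matrix; a fully connected
# softmax layer then assigns a probability to each of the 340 categories.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),             # assumed raster size for a doodle
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                        # one simple way to control overfitting
    layers.Dense(340, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
</pre>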
<br />
'''Challenges'''<br />
<br />
This project presents a number of interesting challenges:<br />
* The data given for training is noisy in that it contains drawings that are incomplete or simply poorly drawn. Dealing with this noise will be a significant part of our work. <br />
* There are 340 label categories within the Quick, Draw! dataset, this means that the model created must be able to classify drawings based on a large pool of information while making effective use of powerful computational resources.<br />
<br />
'''Tools & Resources'''<br />
<br />
* We will use Python & MATLAB.<br />
* We will use the Quick, Draw! Dataset available on the Kaggle competition website. <https://www.kaggle.com/c/quickdraw-doodle-recognition/data><br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm that starts by training a small neural network and then expands it into a larger one. We hypothesize that with the proper expansion method we could decrease training time and prevent overfitting. The method we wish to explore is to link together input dimensions based on covariance. Then, when the neural network reaches convergence, we create a larger neural network without the links between dimensions, using starting values from the smaller neural network. <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1] which contains data on gene expression on tumour biopsies to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way that the data was collected. Some parts of the dataset have thousands of features (which each represent a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose some ways to mitigate the impact of this; including the use of PCA, leave-one-out cross validation, or regularization. <br />
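<br />
As an illustration only (not part of the original proposal), a small scikit-learn sketch of the mitigation strategies mentioned above, combining PCA, an L2-regularized logistic regression, and leave-one-out cross-validation; the placeholder arrays <code>X</code> and <code>y</code> stand in for the gene-expression matrix and the tumour labels:<br />
<pre>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: ~20 samples with thousands of gene-expression features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5000))
y = rng.integers(0, 2, size=20)

# PCA shrinks the feature space, L2 regularization limits overfitting,
# and leave-one-out CV makes the most of the small sample size.
clf = make_pipeline(PCA(n_components=10),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
</pre>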
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which cabin class in airlines is the most profitable? One might guess economy - but in reality, it's the premium classes that show higher returns. According to research conducted by Wendover productions [2], despite having less than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than other classes.<br />
<br />
In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze the Google Merchandise Store's customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements - A Kaggle Competition<br />
<br />
By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world.<br />
<br />
Data for this competition comes from the following sources:<br />
<br />
Market data provided by Intrinio.<br />
News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' Construction of a classifier that can learn from noisy training data and generalize to a clean test set. The training data comes from the Google game "Quick, Draw!".<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
<br />
Qisi Deng<br />
<br />
Hengxin Li<br />
<br />
Bochao Zhang<br />
<br />
Our team currently has two interested topics at hand, and we have summarized the objective of each topic below. Please note that we will narrow down our choices after further discussions with the instructor.<br />
<br />
'''Description 1:''' With 14 percent of Americans claiming that social media is their most dominant news source, fake news shared on Facebook and Twitter is invading people's information-gathering experience. Concomitantly, the quality and nature of online news have been gradually diluted by fake news that is sometimes imperceptible. With the aim of creating an unalloyed Internet surfing experience, we seek to develop a tool that performs fake news detection and classification. <br />
<br />
'''Description 2:''' Statistics Canada has recently reported an increasing trend in Toronto's violent crime score. Though the Royal Canadian Mounted Police has put in effort to track crimes, the ambiguous snapshots captured by outdated cameras often hamper investigations. Motivated by this circumstance, our second interest focuses on accurate identification of numerals and letters in variable-resolution images.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F18-STAT841-Proposal&diff=36682
F18-STAT841-Proposal
2018-10-08T02:59:35Z
<p>Q26deng: </p>
<hr />
<div><br />
'''Use this format (Don’t remove Project 0)'''<br />
<br />
'''Project # 0'''<br />
Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
'''Title:''' Making a String Telephone<br />
<br />
'''Description:''' We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1'''<br />
Group members:<br />
<br />
Weng, Jiacheng<br />
<br />
Li, Keqi<br />
<br />
Qian, Yi<br />
<br />
Liu, Bomeng<br />
<br />
'''Title:''' RSNA Pneumonia Detection Challenge<br />
<br />
'''Description:''' <br />
<br />
Our team’s project is the RSNA Pneumonia Detection Challenge from Kaggle competition. The primary goal of this project is to develop a machine learning tool to detect patients with pneumonia based on their chest radiographs (CXR). <br />
<br />
Pneumonia is an infection that inflames the air sacs in human lungs which has symptoms such as chest pain, cough, and fever [1]. Pneumonia can be very dangerous especially to infants and elders. In 2015, 920,000 children under the age of 5 died from this disease [2]. Due to its fatality to children, diagnosing pneumonia has a high order. A common method of diagnosing pneumonia is to obtain patients’ chest radiograph (CXR) which is a gray-scale scan image of patients’ chests using x-ray. The infected region due to pneumonia usually shows as an area or areas of increased opacity [3] on CXR. However, many other factors can also contribute to increase in opacity on CXR which makes the diagnose very challenging. The diagnose also requires highly-skilled clinicians and a lot of time of CXR screening. The Radiological Society of North America (RSNA®) sees the opportunity of using machine learning to potentially accelerate the initial CXR screening process. <br />
<br />
For the scope of this project, our team plans to contribute to solving this problem by applying our machine learning knowledge in image processing and classification. Team members are going to apply techniques that include, but are not limited to: logistic regression, random forest, SVM, kNN, CNN, etc., in order to successfully detect CXRs with pneumonia.<br />
<br />
<br />
[1] (Accessed 2018, Oct. 4). Pneumonia [Online]. MAYO CLINIC. Available from: https://www.mayoclinic.org/diseases-conditions/pneumonia/symptoms-causes/syc-20354204<br />
[2] (Accessed 2018, Oct. 4). RSNA Pneumonia Detection Challenge [Online]. Kaggle. Available from: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge<br />
[3] Franquet T. Imaging of community-acquired pneumonia. J Thorac Imaging 2018 (epub ahead of print). PMID 30036297<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 2'''<br />
Group members:<br />
<br />
Hou, Zhaoran<br />
<br />
Zhang, Chi<br />
<br />
'''Title:''' <br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 3'''<br />
Group members:<br />
<br />
Hanzhen Yang<br />
<br />
Jing Pu Sun<br />
<br />
Ganyuan Xuan<br />
<br />
Yu Su<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
Our team chose the [https://www.kaggle.com/c/quickdraw-doodle-recognition Quick, Draw! Doodle Recognition Challenge] from the Kaggle Competition. The goal of the competition is to build an image recognition tool that can classify hand-drawn doodles into one of the 340 categories.<br />
<br />
The main challenge of the project is that the training set is very noisy. Hand-drawn artwork may deviate substantially from the actual object, and it almost certainly differs from person to person. Mislabeled images also present a problem, since they create outlier points when we train our models. <br />
<br />
We plan to study some of the mature image recognition algorithms currently available to inspire and guide the development of our own model.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 4'''<br />
Group members:<br />
<br />
Snaith, Mitchell<br />
<br />
'''Title:''' Reproducibility report: ''Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks''<br />
<br />
'''Description:''' <br />
<br />
The paper ''Fixing Variational Bayes: Deterministic Variational Inference for Bayesian Neural Networks'' [1] has been submitted to ICLR 2019. It aims to "fix" variational Bayes and turn it into a robust inference tool through two innovations. <br />
<br />
Goals are to: <br />
<br />
* reproduce the deterministic variational inference scheme as described in the paper without referencing the original authors' code, providing a third-party implementation (a toy sketch of the core computation appears after this list)<br />
<br />
* reproduce the experimental results with our own implementation, using the same neural-network framework as the reference implementations of the compared methods described in the paper<br />
<br />
* reproduce the experimental results with the authors' own implementation<br />
<br />
* explore other possible applications of variational Bayes besides heteroscedastic regression<br />
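<br />
As a warm-up for the reproduction, here is a toy sketch (our own illustration, not the authors' code) of deterministic moment propagation through a single linear layer with a mean-field Gaussian posterior over its weights, which is the kind of computation a deterministic scheme relies on:<br />
<pre>
# Toy illustration of deterministic moment propagation through one linear
# layer whose weights have a mean-field Gaussian posterior (not the authors' code).
import numpy as np

def linear_layer_moments(x, w_mean, w_var):
    """For y = x @ W with independent W[i, j] ~ N(w_mean[i, j], w_var[i, j]),
    return the exact mean and variance of each output unit."""
    y_mean = x @ w_mean
    y_var = (x ** 2) @ w_var
    return y_mean, y_var

rng = np.random.default_rng(0)
x = rng.normal(size=5)                     # deterministic input
w_mean = rng.normal(size=(5, 3))           # posterior means
w_var = np.abs(rng.normal(size=(5, 3)))    # posterior variances
print(linear_layer_moments(x, w_mean, w_var))
</pre>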
<br />
[1] OpenReview location: https://openreview.net/forum?id=B1l08oAct7<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 5'''<br />
Group members:<br />
<br />
Rebecca, Chen<br />
<br />
Susan,<br />
<br />
Mike, Li<br />
<br />
Ted, Wang<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
Classification has become a more and more eye-catching topic, especially with the rise of machine learning in recent years. Our team is particularly interested in machine learning algorithms that optimize classification for specific types of images. <br />
<br />
In this project, we will dig into the base classifiers we learned in class and combine them to find an optimal solution for a particular type of image dataset. Currently, we are looking into a dataset from Kaggle: the Quick, Draw! Doodle Recognition Challenge. The dataset in this competition contains 50M drawings across 340 categories and is a subset of the world’s largest doodling dataset, which is continually updated by real players of the drawing game. Anyone can contribute by joining in! (quickdraw.withgoogle.com).<br />
<br />
For us, as machine learning students, we are eager to help develop a better classification method. By “better”, we mean finding a balance between simplicity and accuracy. We will start with neural networks using different activation functions in each layer, and we will also combine base classifiers with bagging, random forests, and boosting for ensemble learning. We will also regularize our parameters to avoid overfitting on the training dataset. Finally, we will summarize the features of this type of image dataset, formulate our solutions, and standardize our steps for solving this kind of problem. <br />
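<br />
A minimal sketch of the ensemble step (our own illustration; it assumes scikit-learn, and X and y are random placeholders standing in for rasterized doodles and their labels):<br />
<pre>
# Illustrative ensemble of base classifiers; X and y are random placeholders
# standing in for rasterized doodles and their category labels.
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 28 * 28)     # placeholder: 500 rasterized doodles
y = np.random.randint(0, 10, 500)    # placeholder: 10 of the 340 categories

ensembles = {
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "random forest": RandomForestClassifier(n_estimators=100),
    "boosted stumps": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100),
}

for name, model in ensembles.items():
    acc = cross_val_score(model, X, y, cv=3).mean()
    print(f"{name}: accuracy = {acc:.3f}")
</pre>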
<br />
Hopefully, we can not only finish our project successfully but also make a small contribution to the machine learning research field.<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 6'''<br />
Group members:<br />
<br />
Ngo, Jameson<br />
<br />
Xu, Amy<br />
<br />
'''Title:''' Kaggle Challenge: [https://www.kaggle.com/c/human-protein-atlas-image-classification Human Protein Atlas Image Classification]<br />
<br />
'''Description:''' <br />
<br />
We will participate in the Human Protein Atlas Image Classification competition featured on Kaggle. We will classify proteins based on patterns seen in microscopic images of human cells.<br />
<br />
Historically, protein classification methods have only handled single patterns in very few cell types at a time. The goal of this challenge is to develop methods that classify proteins based on multiple/mixed patterns across a larger range of cell types.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 7'''<br />
Group members:<br />
<br />
Qianying Zhao<br />
<br />
Hui Huang<br />
<br />
Meiyu Zhou<br />
<br />
Gezhou Zhang<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction<br />
<br />
'''Description:''' <br />
Our group will participate in the featured Kaggle competition Google Analytics Customer Revenue Prediction. In this competition, we will analyze a customer dataset from the Google Merchandise Store, which sells Google swag, to predict revenue per customer using RStudio. Our report will cover not only the conclusions we reach by classifying and analyzing the provided data with appropriate models, but also how we performed in the contest.<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 8'''<br />
Group members:<br />
<br />
Jiayue Zhang<br />
<br />
Lingyun Yi<br />
<br />
Rongrong Su<br />
<br />
Siao Chen<br />
<br />
<br />
'''Title:''' Kaggle--Two Sigma: Using News to Predict Stock Movements<br />
<br />
<br />
'''Description:''' <br />
Stock prices are affected by the news to some extent. What is the influence of news on stock prices, and what is its predictive power? <br />
We are going to use the content of news articles to predict the direction of stock prices. We will mine the data to find the useful information hidden in it, and as a result, predict how stock prices perform when the market reacts to news.<br />
<br />
<br />
--------------------------------------------------------------------<br />
'''Project # 9'''<br />
Group members:<br />
<br />
Hassan, Ahmad Nayar<br />
<br />
McLellan, Isaac<br />
<br />
Brewster, Kristi<br />
<br />
Melek, Marina Medhat Rassmi <br />
<br />
<br />
'''Title:''' Quick, Draw! Doodle Recognition<br />
<br />
'''Description:''' <br />
<br />
'''Background'''<br />
<br />
Google’s Quick, Draw! is an online game where a user is prompted to draw an image depicting a certain category in under 20 seconds. As the drawing is being completed, the game uses a model that attempts to correctly identify the image being drawn. With the aim of improving the underlying pattern recognition model this game uses, Google is hosting a Kaggle competition asking the public to build a model to correctly identify a given drawing. The model should classify the drawing into one of the 340 label categories within the Quick, Draw! game in 3 guesses or fewer.<br />
<br />
'''Proposed Approach'''<br />
<br />
Each image/doodle (input) is treated as a matrix of pixel values. To classify images, we apply convolution to this matrix: learned filters slide over the image and summarize local patterns. This significantly reduces the dimensionality of the input, which in turn reduces the number of parameters of any proposed recognition model. Using filters, pooling layers, and further convolutions, a final fully connected layer correlates images with categories, assigning probabilities (weights) and hence classifying images. <br />
<br />
This approach to image classification is called a convolutional neural network (CNN), and we propose using it to classify the doodles within the Quick, Draw! dataset.<br />
<br />
To control overfitting and underfitting of our proposed model and to minimize the error, we will experiment with different architectures consisting of different types and dimensions of pooling layers and input filters.<br />
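<br />
A minimal sketch of one candidate architecture (our own illustration; it assumes Keras/TensorFlow, and the 28×28 input size and layer widths are placeholder choices rather than final design decisions):<br />
<pre>
# Illustrative small CNN: convolution and pooling layers followed by a
# fully connected layer over the 340 categories (sizes are placeholders).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),              # assumed rasterized doodle size
    layers.Conv2D(32, (3, 3), activation="relu"), # filters extract local features
    layers.MaxPooling2D((2, 2)),                  # pooling reduces spatial dimensions
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                          # helps control overfitting
    layers.Dense(340, activation="softmax"),      # one output per label category
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
</pre>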
<br />
'''Challenges'''<br />
<br />
This project presents a number of interesting challenges:<br />
* The data given for training is noisy in that it contains drawings that are incomplete or simply poorly drawn. Dealing with this noise will be a significant part of our work. <br />
* There are 340 label categories within the Quick, Draw! dataset; this means that the model must be able to classify drawings based on a large pool of information while making effective use of powerful computational resources.<br />
<br />
'''Tools & Resources'''<br />
<br />
* We will use Python & MATLAB.<br />
* We will use the Quick, Draw! Dataset available on the Kaggle competition website. <https://www.kaggle.com/c/quickdraw-doodle-recognition/data><br />
<br />
--------------------------------------------------------------------<br />
'''Project # 10'''<br />
Group members:<br />
<br />
Lam, Amanda<br />
<br />
Huang, Xiaoran<br />
<br />
Chu, Qi<br />
<br />
Sang, Di<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:'''<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 11'''<br />
Group members:<br />
<br />
Bobichon, Philomene<br />
<br />
Maheshwari, Aditya<br />
<br />
An, Zepeng<br />
<br />
Stranc, Colin<br />
<br />
'''Title:''' Kaggle Challenge: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 12'''<br />
Group members:<br />
<br />
Huo, Qingxi<br />
<br />
Yang, Yanmin<br />
<br />
Cai, Yuanjing<br />
<br />
Wang, Jiaqi<br />
<br />
'''Title:''' <br />
<br />
'''Description:''' <br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 13'''<br />
Group members:<br />
<br />
Ross, Brendan<br />
<br />
Barenboim, Jon<br />
<br />
Lin, Junqiao<br />
<br />
Bootsma, James<br />
<br />
'''Title:''' Expanding Neural Network<br />
<br />
'''Description:''' The goal of our project is to create an expanding neural network algorithm, which starts by training a small neural network and then expands it to a larger one. We hypothesize that, with a proper expansion method, we can decrease training time and prevent overfitting. The method we wish to explore is to link input dimensions together based on covariance; then, when the small network reaches convergence, we create a larger network without the links between dimensions, using starting values from the smaller network. <br />
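<br />
A toy sketch of the expansion step for a single hidden layer (our own illustration; the layer sizes are arbitrary, and the covariance-based linking of input dimensions is not shown):<br />
<pre>
# Toy illustration of the expansion step: reuse a small trained weight
# matrix as starting values inside a larger, randomly initialized one.
import numpy as np

rng = np.random.default_rng(0)

d_in, h_small, h_large = 20, 8, 32
W_small = rng.normal(size=(d_in, h_small))   # stands in for trained small-network weights

W_large = rng.normal(scale=0.01, size=(d_in, h_large))  # fresh units start near zero
W_large[:, :h_small] = W_small               # copy trained weights into the first units

print(W_large.shape)                                 # (20, 32)
print(np.allclose(W_large[:, :h_small], W_small))    # True
</pre>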
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 14'''<br />
Group members:<br />
<br />
Schneider, Jason <br />
<br />
Walton, Jordyn <br />
<br />
Abbas, Zahraa<br />
<br />
Na, Andrew<br />
<br />
'''Title:''' Application of ML Classification to Cancer Identification<br />
<br />
'''Description:''' The application of machine learning to cancer classification based on gene expression is a topic of great interest to physicians and biostatisticians alike. We would like to work on this for our final project to encourage the application of proven ML techniques to improve the accuracy of cancer classification and diagnosis. In this project, we will use the dataset from Golub et al. [1], which contains gene expression data from tumour biopsies, to train a model and classify healthy individuals and individuals who have cancer.<br />
<br />
One challenge we may face pertains to the way the data was collected. Some parts of the dataset have thousands of features (each representing a quantitative measure of the expression of a certain gene) but as few as twenty samples. We propose several ways to mitigate the impact of this, including PCA, leave-one-out cross-validation, and regularization. <br />
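<br />
A minimal sketch of one such pipeline (our own illustration; it assumes scikit-learn, and the gene expression matrix below is a random placeholder rather than the Golub et al. data):<br />
<pre>
# Illustrative pipeline: PCA plus regularized logistic regression, scored
# with leave-one-out cross-validation; the data below are random placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.random.rand(38, 7000)       # placeholder: 38 samples, ~7000 gene features
y = np.random.randint(0, 2, 38)    # placeholder: cancer vs healthy labels

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                     # compress thousands of genes to 10 components
    LogisticRegression(penalty="l2", C=1.0),  # regularization guards against overfitting
)

scores = cross_val_score(pipeline, X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}")
</pre>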
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 15'''<br />
Group members:<br />
<br />
Praneeth, Sai<br />
<br />
Peng, Xudong <br />
<br />
Li, Alice<br />
<br />
Vajargah, Shahrzad<br />
<br />
'''Title:''' Google Analytics Customer Revenue Prediction [1] - A Kaggle Competition<br />
<br />
'''Description:''' Guess which airline cabin class is the most profitable? One might guess economy, but in reality it's the premium classes that show higher returns. According to research conducted by Wendover Productions [2], despite having fewer than 50 seats and taking up more space than the economy class, premium classes end up driving more revenue than other classes.<br />
<br />
In fact, just like airlines, many companies adopt the business model where the vast majority of revenue is derived from a minority group of customers. As a result, data-intensive promotional strategies are getting more and more attention nowadays from marketing teams to further improve company returns.<br />
<br />
In this Kaggle competition, we are challenged to analyze the Google Merchandise Store's customer dataset to predict revenue per customer. We will implement a series of data analytics methods including pre-processing, data augmentation, and parameter tuning. Different classification algorithms will be compared and optimized in order to achieve the best results.<br />
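<br />
A minimal sketch of the parameter tuning step (our own illustration; it assumes scikit-learn, uses a regression model for demonstration, and the per-customer features and revenue targets below are random placeholders rather than the competition data):<br />
<pre>
# Illustrative hyperparameter tuning with a regression model; the features
# and revenue targets below are random placeholders, not the competition data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X = np.random.rand(1000, 12)    # placeholder per-customer features
y = np.random.rand(1000)        # placeholder log-transformed revenue

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(GradientBoostingRegressor(), param_grid,
                      cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
</pre>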
<br />
'''Reference:'''<br />
<br />
[1] Kaggle. (2018, Sep 18). Google Analytics Customer Revenue Prediction. Retrieved from https://www.kaggle.com/c/ga-customer-revenue-prediction<br />
<br />
[2] Kottke, J (2017, Mar 17). The economics of airline classes. Retrieved from https://kottke.org/17/03/the-economics-of-airline-classes<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 16'''<br />
Group members:<br />
<br />
Wang, Yu Hao<br />
<br />
Grant, Aden <br />
<br />
McMurray, Andrew<br />
<br />
Song, Baizhi<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements - A Kaggle Competition<br />
<br />
'''Description:''' <br />
<br />
By analyzing news data to predict stock prices, Kagglers have a unique opportunity to advance the state of research in understanding the predictive power of the news. This power, if harnessed, could help predict financial outcomes and generate significant economic impact all over the world.<br />
<br />
Data for this competition comes from the following sources:<br />
<br />
Market data provided by Intrinio.<br />
News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.<br />
<br />
We will test a variety of classification algorithms to determine an appropriate model.<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 17'''<br />
Group Members:<br />
<br />
Jiang, Ya Fan<br />
<br />
Zhang, Yuan<br />
<br />
Hu, Jerry Jie<br />
<br />
'''Title:''' Kaggle Competition: Quick, Draw! Doodle Recognition Challenge<br />
<br />
'''Description:''' Construction of a classifier that can learn from noisy training data and generalize to a clean test set. The training data comes from the Google game "Quick, Draw!".<br />
<br />
----------------------------------------------------------------------<br />
<br />
'''Project # 18'''<br />
Group Members:<br />
<br />
Zhang, Ben<br />
<br />
'''Title:''' Two Sigma: Using News to Predict Stock Movements<br />
<br />
'''Description:''' Use news analytics to predict stock price performance. This is subject to change.<br />
<br />
----------------------------------------------------------------------<br />
'''Project # 19'''<br />
Group Members:<br />
<br />
Yan Yu Chen<br />
Qisi Deng<br />
Hengxin Li<br />
Bochao Zhang<br />
<br />
Our team currently has two topics of interest at hand, and we have summarized the objective of each topic below. Please note that we will narrow down our choices after further discussion with the instructor.<br />
<br />
'''Description 1:''' With 14 percent of Americans claiming that social media is their dominant news source, fake news shared on Facebook and Twitter is invading people’s information-gathering experience. Concomitantly, the quality and nature of online news have been gradually diluted by fake news that is sometimes imperceptible. With the aim of creating an unalloyed Internet surfing experience, we seek to develop a tool that performs fake news detection and classification. <br />
<br />
'''Description 2:''' Statistics Canada has recently reported an increasing trend in Toronto’s violent crime score. Though the Royal Canadian Mounted Police has put considerable effort into tracking crimes, the ambiguous snapshots captured by outdated cameras often hamper investigations. Motivated by this circumstance, our second interest focuses on accurate numeral and letter identification in variable-resolution images.</div>
Q26deng
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F18&diff=36570
stat441F18
2018-10-04T20:16:23Z
<p>Q26deng: </p>
<hr />
<div><br />
<br />
== [[F18-STAT841-Proposal| Project Proposal ]] ==<br />
<br />
[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Nov 13 || || 1|| || || <br />
|-<br />
|Nov 13 || || 2|| || || <br />
|-<br />
|Nov 15 || Yan Yu Chen, Qisi Deng, Hengxin Li, Bochao Zhang|| 3|| Will be added soon|| || <br />
|-<br />
|Nov 15 || || 4|| || || <br />
|-<br />
|Nov 20 || Kristi Brewster, Isaac McLellan, Ahmad Nayar Hassan, Marina Medhat Rassmi Melek || 5|| A Neural Representation of Sketch Drawings || || <br />
|-<br />
|Nov 20 || Maya(Mahdiyeh) Bayati, Saber Malekmohammadi, Vincent || 6|| Will be added soon || || <br />
|-<br />
|Nov 22 || Qingxi Huo, Yanmin Yang, Jiaqi Wang, Yuanjing Cai || 7|| Will be added soon || || <br />
|-<br />
|Nov 22 || Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su|| 8|| Deep Residual Learning for Image Recognition || [http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf Paper] || <br />
|-<br />
|Nov 27 || Mitchell Snaith || 9|| You Only Look Once: Unified, Real-Time Object Detection, V1 -> V3 || [https://arxiv.org/pdf/1506.02640.pdf Paper] || <br />
|-<br />
|Nov 27 || Qi Chu, Gloria Huang, Dylan Sang, Amanda Lam|| 10|| tba || || <br />
|-<br />
|Nov 29 || Jameson Ngo, Amy Xu || 11|| TBA || || <br />
|-<br />
|Nov 29 || || 12|| || ||</div>
Q26deng