Wide and Deep Learning for Recommender Systems

Latest revision as of 04:28, 1 December 2021

Presented by

Junbin Pan

Introduction

This paper presents Wide & Deep Learning, an architecture that jointly trains wide linear models and deep neural networks. Deep neural networks, which are good at generalization, and generalized linear models with nonlinear feature transformations, which are good at memorization, have both been widely used in recommender systems. Combining the two can achieve memorization and generalization at the same time. By jointly training wide linear models and deep neural networks, the paper demonstrates that Wide & Deep learning outperforms wide-only and deep-only models on Google Play, an app store with over one billion active users and over one million apps.

Related Work

1. Embedding-based models like factorization machines [5] factorize the interactions between two variables as a dot product between two low-dimensional embedding vectors to achieve generalization.

2. Joint training of RNNs and maximum entropy models with n-gram features in language models has significantly reduced the complexity of the RNN by learning direct weights between inputs and outputs [4].

3. Deep residual learning [2] can reduce the difficulty of training deeper models and improve accuracy with shortcut connections.

4. Collaborative deep learning has been used to couple deep learning for content information with collaborative filtering for the rating matrix [7].

Motivation

Can we build a model that achieves both memorization and generalization? This question motivates the idea of jointly training wide and deep models, specifically Wide & Deep Learning.

The performance of generalized linear models with cross-product transformations can be improved by adding features that are less granular. However, this requires a lot of feature engineering work. On the other hand, the performance of embedding-based models can be improved by adding linear models with cross-product feature transformations, which memorize the rules with a small number of parameters.

Thus, one way to handle both problems is to combine the wide and deep models in the training phase. The architecture proposed by Cheng et al. [6] overcomes these difficulties by jointly training wide and deep models, taking advantage of both memorization and generalization.

Model Architecture

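The wide component is a generalized linear model (GLM) of the form [math]\displaystyle{ y=w^Tx+b }[/math], as illustrated in the left part of Figure 1, where y is the prediction, x is a vector of d features, w is the d-dimensional vector of model parameters, and b is the bias. The feature set includes transformed features obtained with the cross-product transformation, which can be defined as:

[math]\displaystyle{ \phi_k(x)=\prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki}\in\{0,1\} }[/math]

where [math]\displaystyle{ c_{ki} }[/math] is 1 if the i-th feature is part of the k-th transformation and 0 otherwise. This transformation adds nonlinearity to the GLM and captures the interactions between the binary features.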
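The cross-product transformation can be illustrated with a small sketch. It assumes the usual definition, in which the k-th transformed feature is the product of the binary inputs selected by a 0/1 mask c_k, so it equals 1 only when all selected features are 1. The function name and the feature names below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def cross_product(x: Sequence[int], c_k: Sequence[int]) -> int:
    """Return prod_i x_i ** c_ki for binary features x and a binary mask c_k."""
    if len(x) != len(c_k):
        raise ValueError("x and c_k must have the same length")
    result = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki == 1:  # features outside the mask contribute a factor of 1
            result *= x_i
    return result


# Hypothetical one-hot features: [gender=female, language=en, language=es]
x = [1, 1, 0]
and_female_en = [1, 1, 0]  # AND(gender=female, language=en)
and_female_es = [1, 0, 1]  # AND(gender=female, language=es)

print(cross_product(x, and_female_en))  # 1: both selected features are active
print(cross_product(x, and_female_es))  # 0: language=es is not active
```

Each transformed feature gets its own weight in the linear model, which is how the wide part can memorize specific feature co-occurrences.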

The deep component is a feed-forward neural network, as illustrated in the right part of Figure 1. For the sparse inputs, each high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector (an embedding vector). The embedding vectors are initialized randomly and then trained to minimize the final loss function. Finally, the low-dimensional dense embedding vectors are fed into the hidden layers of the network in the forward pass.
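The forward pass of the deep component can be sketched as follows. This is a minimal illustration and not the authors' implementation: the vocabulary sizes, embedding dimension, layer widths, initialization range, and the use of ReLU activations are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.05) -> Matrix:
    """Randomly initialized parameters (embedding table or layer weights)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def dense_relu(inputs: Vector, weights: Matrix, biases: Vector) -> Vector:
    """One hidden layer: ReLU(W a + b), where each row of W is one output unit."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def deep_forward(
    category_ids: Sequence[int],
    tables: Sequence[Matrix],
    layers: Sequence[Tuple[Matrix, Vector]],
) -> Vector:
    """Look up one embedding per categorical feature, concatenate, apply the hidden layers."""
    activations: Vector = []
    for table, category_id in zip(tables, category_ids):
        activations.extend(table[category_id])  # embedding lookup
    for weights, biases in layers:
        activations = dense_relu(activations, weights, biases)
    return activations


rng = random.Random(0)
# Two hypothetical categorical features with vocabularies of 5 and 3, embedding dimension 4.
tables = [random_matrix(5, 4, rng), random_matrix(3, 4, rng)]
# Hidden layers: 8 concatenated inputs -> 6 units -> 2 units.
layers = [
    (random_matrix(6, 8, rng), [0.0] * 6),
    (random_matrix(2, 6, rng), [0.0] * 2),
]
hidden = deep_forward([2, 1], tables, layers)
print(len(hidden))  # 2
```

During training, both the embedding tables and the layer weights would be updated by back-propagation; the sketch only covers the forward pass.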

During training, the wide and deep components are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function. Joint training back-propagates the gradients from the output to both parts of the model simultaneously using mini-batch stochastic optimization. The authors used FTRL [3] with L1 regularization as the optimizer for the wide part and AdaGrad [1] for the deep part.
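The joint prediction and the shared loss can be sketched as below. The sketch assumes each component already produces a scalar log-odds contribution (the weighting is taken to be inside each component's final linear layer), and all function names are hypothetical. It shows why the training is joint: the gradient of the logistic loss with respect to the summed log odds is the same quantity, p - y, for both components.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def joint_predict(wide_logit: float, deep_logit: float, bias: float = 0.0) -> float:
    """P(Y=1 | x): sigmoid of the summed log odds of the two components."""
    return sigmoid(wide_logit + deep_logit + bias)


def logistic_loss(y: int, p: float) -> float:
    """Log loss for a single example with label y in {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def logit_gradient(y: int, p: float) -> float:
    """d(loss)/d(logit) = p - y; this signal is back-propagated to both parts."""
    return p - y


p = joint_predict(wide_logit=0.8, deep_logit=-0.3)
print(logistic_loss(1, p), logit_gradient(1, p))
```

In the setup described above, the wide weights would consume this gradient through FTRL with L1 regularization and the deep weights through AdaGrad; the optimizers themselves are not implemented in this sketch.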

Model Results

The proposed architecture was implemented and evaluated in a real-world recommender system, Google Play, in two aspects: app acquisitions and serving performance.

For app acquisitions, the authors conducted live online experiments in an A/B testing framework for 3 weeks. In the control group, 1% of users were randomly selected and presented with recommendations from the previous, wide-only model. In the experiment group, 1% of users were randomly selected and presented with the Wide & Deep model, which used the same features as the wide model. Another 1% of users were randomly selected and presented with only the deep part of the model, with the same network structure and features. As shown in Table 1, the Wide & Deep model improves the online acquisition gain by 3.9% relative to the wide-only control and by about 1% relative to the deep-only model, which itself gains 2.9% over the control. In offline experiments, the Wide & Deep model outperforms the wide model and the deep model by 0.002 and 0.006, respectively, in terms of AUC. Note that the difference is relatively small offline compared to online, since the labels in offline data are fixed while the online system can generate new exploratory recommendations using both memorization and generalization.

For serving performance, the authors implemented multithreading and split each batch into smaller batches to handle peak traffic, which reduced the client-side latency from 31 ms to 14 ms, as shown in Table 2.
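The batch-splitting idea can be sketched with a thread pool. This is only an illustration of the technique, not the production serving code: `score_batch` is a hypothetical stand-in for model inference, and the batch size of 200 and the 4 threads are example values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_batch(items: Sequence[str], num_chunks: int) -> List[Sequence[str]]:
    """Split one large batch into at most num_chunks smaller batches."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if not items:
        return []
    chunk_size = -(-len(items) // num_chunks)  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def score_batch(candidates: Sequence[str]) -> List[float]:
    """Hypothetical scoring function standing in for the model's forward pass."""
    return [len(candidate) * 0.1 for candidate in candidates]


def score_parallel(candidates: Sequence[str], num_threads: int = 4) -> List[float]:
    """Score the smaller batches in parallel and return scores in the original order."""
    chunks = split_batch(candidates, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_chunk = list(pool.map(score_batch, chunks))  # map preserves chunk order
    return [score for chunk_scores in per_chunk for score in chunk_scores]


apps = [f"app_{i}" for i in range(200)]
scores = score_parallel(apps, num_threads=4)
print(len(scores))  # 200
```

In pure Python, threads only help when the scoring call releases the interpreter lock (for example, inside a native inference library), so the latency gain depends on how scoring is implemented.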

Conclusion

Achieving both memorization and generalization is important in recommender systems. The Wide & Deep learning framework proposed in the paper combines a wide model and a deep model to achieve both: the wide linear model memorizes sparse feature interactions using cross-product feature transformations, while the deep neural network uses low-dimensional representations to generalize to unseen feature interactions. The proposed model led to a significant improvement in app acquisitions over wide-only and deep-only models on the Google Play recommender system.

Critiques

The Wide & Deep learning framework has been dominant in recommender systems over the last 5 years and is widely used in industry. However, the model tends to extract either low-dimensional or high-dimensional combined features and cannot extract both types at the same time. As a result, it requires specialized domain knowledge for feature engineering, and the model does not learn well on low-dimensional combinational features.

References

[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.

[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[3] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.

[4] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.

[5] S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.

[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792v1 [cs.LG] 24 Jun 2016

[7] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.