Wide and Deep Learning for Recommender Systems
Revision as of 03:04, 1 December 2021
Presented by
Junbin Pan
Introduction
This paper presents Wide & Deep Learning, an architecture that jointly trains wide linear models and deep neural networks. Generalized linear models with nonlinear feature transformations, which are good at memorization, and deep neural networks, which are good at generalization, have both been widely used in recommender systems. Combining the two achieves memorization and generalization at the same time. By jointly training wide linear models and deep neural networks, the paper demonstrates that Wide & Deep learning outperforms wide-only and deep-only models on Google Play, a mobile app store with over one billion active users and over one million apps.
Related Work
1. Embedding-based models such as factorization machines [5] factorize the interaction between two variables as a dot product between two low-dimensional embedding vectors to achieve generalization.
2. Joint training of RNNs and maximum entropy models with n-gram features in language models has significantly reduced the complexity of the RNN by learning direct weights between inputs and outputs [4].
3. Deep residual learning [2] reduces the difficulty of training deeper models and improves accuracy with shortcut connections.
4. Collaborative deep learning has been used to couple deep learning for content information with collaborative filtering for the rating matrix [7].
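To make item 1 concrete, the following is a minimal sketch of the pairwise interaction term of a factorization machine, where each feature has an embedding vector and the weight of an interaction is the dot product of two embeddings. The function names, the feature vector x, and the embedding matrix V are illustrative assumptions, not taken from the paper or from libFM.

```python
import numpy as np

def fm_pairwise_score(x, V):
    """Sum over all pairs i < j of <V[i], V[j]> * x[i] * x[j]."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # The interaction weight is a dot product of two embeddings.
            total += float(np.dot(V[i], V[j])) * x[i] * x[j]
    return total

def fm_pairwise_score_fast(x, V):
    """Same value computed in O(n * k) with the usual algebraic identity."""
    xv = V * x[:, None]  # scale each embedding by its feature value
    return 0.5 * float(np.sum(np.sum(xv, axis=0) ** 2 - np.sum(xv ** 2, axis=0)))

# Hypothetical example: 4 features, embedding size 3.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 1.0])
V = rng.normal(size=(4, 3))
print(fm_pairwise_score(x, V), fm_pairwise_score_fast(x, V))
```

Because the interaction weights are shared through the embeddings, a pair of features that never co-occurred in training still receives a nonzero weight, which is the generalization behaviour described above.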
Motivation
Can we build a model to achieve both memorization and generalization? This question motivates the idea of jointly training wide and deep models, specifically Wide & Deep Learning.
The performance of generalized linear models with cross-product transformations can be improved by adding features that are less granular. However, this requires a lot of feature engineering work. On the other hand, the performance of embedding-based models can be improved by combining them with linear models that use cross-product feature transformations to memorize rules with a small number of parameters.
Thus, one way to handle both problems is to combine the wide and deep models in the training phase. This motivates the architecture of Cheng et al. [9], which overcomes these difficulties by jointly training wide and deep models and so takes advantage of both memorization and generalization.
Model Architecture
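The wide component is a GLM of the form [math]\displaystyle{ y=w^Tx+b }[/math], where [math]\displaystyle{ y }[/math] is the prediction, [math]\displaystyle{ x }[/math] is the feature vector, [math]\displaystyle{ w }[/math] is the vector of model parameters, and [math]\displaystyle{ b }[/math] is the bias.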
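As a rough illustration of how the wide component fits into the joint model, the sketch below computes a single Wide & Deep prediction with NumPy. It assumes binary sparse features, cross-product features formed as the product (logical AND) of selected features, and a single ReLU hidden layer standing in for the embeddings and hidden layers of the deep part. The parameter names, layer sizes, and feature groups are all hypothetical, and the code shows a forward pass only, not training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_product(x, index_groups):
    """For binary features, each cross-product is 1 only if all chosen features are 1."""
    return np.array([np.prod(x[list(g)]) for g in index_groups])

def wide_and_deep_predict(x_sparse, x_dense, params, index_groups):
    # Wide part: linear model over raw features plus cross-product features.
    wide_in = np.concatenate([x_sparse, cross_product(x_sparse, index_groups)])
    wide_logit = params["w_wide"] @ wide_in

    # Deep part (assumed shape): one ReLU layer over dense inputs.
    hidden = np.maximum(0.0, params["W1"] @ x_dense + params["b1"])
    deep_logit = params["w_deep"] @ hidden

    # Both logits and a shared bias feed one sigmoid output.
    return sigmoid(wide_logit + deep_logit + params["b"])

# Hypothetical sizes: 3 sparse features, 2 crosses, 2 dense inputs, 4 hidden units.
rng = np.random.default_rng(0)
groups = [(0, 2), (0, 1)]
params = {
    "w_wide": rng.normal(size=5),
    "W1": rng.normal(size=(4, 2)),
    "b1": np.zeros(4),
    "w_deep": rng.normal(size=4),
    "b": 0.0,
}
x_sparse = np.array([1.0, 0.0, 1.0])
x_dense = np.array([0.3, -1.2])
print(wide_and_deep_predict(x_sparse, x_dense, params, groups))
```

In joint training, the gradient of the loss on this single output would be propagated to both the wide and the deep parameters at once, which is what distinguishes it from an ensemble of separately trained models.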
Model Results
The proposed architecture was implemented and evaluated in a real-world recommender system, the Google Play app store, in two respects: app acquisitions and serving performance.
For app acquisitions, the authors conducted live online experiments in an A/B testing framework for 3 weeks. In the control group, 1% of users were randomly selected and presented with recommendations from the previous wide-only model. In the experiment group, 1% of users were randomly selected and presented with the Wide & Deep model, which used the same features as the wide model. A further 1% of users were randomly selected and presented with the deep part of the model alone, with the same network structure and features. In Table 1, relative to the wide-only control, the deep model improves the online acquisition rate by 2.9% and the Wide & Deep model by 3.9%, so Wide & Deep gains a further 1% over the deep-only model. In the offline experiments, the Wide & Deep model outperforms the wide model and the deep model by 0.002 and 0.006 respectively in terms of AUC. Note that the difference is relatively small offline compared to online, since the labels in offline data are fixed while the online system can generate new exploratory recommendations using both memorization and generalization.
For serving performance, to cope with peak traffic, the authors implemented multithreading and split each batch into smaller batches, which reduced the client-side latency from 31 ms to 14 ms as shown in Table 2.
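The batch-splitting idea can be sketched as follows: one large batch of candidates is cut into smaller chunks that are scored on separate threads, and the results are concatenated in the original order. This is only an illustration under stated assumptions; score_batch is a hypothetical placeholder for model inference, and the function names and thread count are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def score_batch(candidates):
    # Hypothetical scoring function; a real system would run model inference here.
    return [len(c) * 0.1 for c in candidates]

def score_in_parallel(candidates, num_threads):
    """Split one batch into num_threads smaller batches and score them concurrently."""
    if not candidates:
        return []
    chunk = -(-len(candidates) // num_threads)  # ceiling division
    parts = [candidates[i:i + chunk] for i in range(0, len(candidates), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(score_batch, parts)  # map preserves input order
        return [score for part in results for score in part]

# Hypothetical request with 200 candidate apps, split across 4 threads.
candidates = ["app%d" % i for i in range(200)]
scores = score_in_parallel(candidates, num_threads=4)
print(len(scores))
```

In pure Python, threads only help when the scoring call releases the interpreter lock, for example inside a native inference library, so the sketch shows the structure of the approach rather than a speedup.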
Conclusion
Achieving both memorization and generalization is important in recommender systems. The Wide & Deep learning framework proposed in the paper combines a wide model and a deep model to achieve both: the wide linear model memorizes sparse feature interactions through cross-product feature transformations, while the deep neural network uses low-dimensional representations to generalize to unseen feature interactions. The proposed model led to a significant improvement in app acquisitions over wide-only and deep-only models on the Google Play recommender system.
Critiques
The Wide & Deep learning framework has been widely adopted in industrial recommender systems over the last 5 years. However, the model tends to extract either low-dimensional or high-dimensional combined features and cannot extract both types at the same time. As a result, it requires specialized domain knowledge for feature engineering, and it does not learn low-dimensional combinational features well.
References
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[3] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.
[4] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.
[5] S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
[6] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 1799–1807. 2014.
[7] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.
[8] B. Yan and G. Chen. AppJoy: Personalized mobile application discovery. In MobiSys, pages 113–126, 2011.
[9] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792v1 [cs.LG], 24 Jun 2016.