Wide and Deep Learning for Recommender Systems
This paper presents Wide & Deep Learning, an architecture that jointly trains wide linear models and deep neural networks. Deep neural networks, which are good at generalization, and generalized linear models with nonlinear feature transformations, which are good at memorization, have both been widely used in recommender systems. Combining the two can achieve memorization and generalization at the same time. By jointly training wide linear models and deep neural networks, the paper demonstrates that Wide & Deep Learning outperforms wide-only and deep-only models on Google Play, which has over one billion active users and over one million apps.
1. Generalized linear models such as logistic regression are trained on binarized sparse features with one-hot encoding, and use cross-product transformations to achieve memorization. However, these models do not generalize to query-item feature pairs that never appeared in the training data, so they lack generalization.
2. Embedding-based models such as factorization machines (S. Rendle, 2012) or deep neural networks learn a low-dimensional dense embedding vector for each query and item feature. This lets them generalize to query-item feature pairs that have never been seen before, with less feature engineering. However, when the query-item matrix is sparse and high-rank, it is hard to learn an effective low-dimensional representation, so these models lack memorization.
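The contrast between the two model families can be illustrated with a small sketch. It is only an illustration under stated assumptions: the feature names, the single example, and the embedding values are hypothetical, and the cross-product is taken to be the product of the selected binary features (1 only when all of them are active).

```python
def one_hot(active, vocabulary):
    """Binarize a set of active feature names against a fixed vocabulary."""
    return [1 if name in active else 0 for name in vocabulary]


def cross_product(x, vocabulary, crossed):
    """Return 1 only when every feature named in `crossed` is active in x."""
    index = {name: i for i, name in enumerate(vocabulary)}
    value = 1
    for name in crossed:
        value *= x[index[name]]
    return value


def embedding_score(query_vec, item_vec):
    """Dot product of two dense embedding vectors."""
    return sum(q * i for q, i in zip(query_vec, item_vec))


# Hypothetical feature vocabulary and a single example.
vocabulary = ["installed=video_app", "installed=maps_app",
              "impression=music_app", "impression=news_app"]
x = one_hot({"installed=video_app", "impression=music_app"}, vocabulary)

# Pair present in the example: the cross-product fires, so a weight can memorize it.
seen = cross_product(x, vocabulary, ["installed=video_app", "impression=music_app"])
# Pair absent from the example: the cross-product is 0 and contributes nothing.
unseen = cross_product(x, vocabulary, ["installed=maps_app", "impression=news_app"])

# Hypothetical learned embeddings: the same absent pair still gets a nonzero score.
embeddings = {"installed=maps_app": [0.5, 1.0], "impression=news_app": [1.0, 0.5]}
dense_score = embedding_score(embeddings["installed=maps_app"],
                              embeddings["impression=news_app"])
```

The cross-product feature is exact but silent on pairs it has not observed, while the dense score is always defined, which is also why it can over-generalize when the underlying matrix is sparse and high-rank.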
Can we build a model that achieves both memorization and generalization? This question motivates jointly training wide and deep models, which is the idea behind Wide & Deep Learning.
The performance of generalized linear models with cross-product transformations can be improved by adding less granular features, but this requires a lot of feature engineering. Conversely, the performance of embedding-based models can be improved by combining them with linear models that use cross-product feature transformations to memorize the rules with a small number of parameters.
Thus, one way to handle both problems is to combine the wide and deep models in the training phase. This motivated the Wide & Deep architecture of Cheng et al., in which the wide linear component and the deep neural network component are trained jointly rather than separately.
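A minimal sketch of how the two components might be combined at prediction time is given below. It assumes the wide logit and the deep logit are summed with a bias and passed through a sigmoid, so that during joint training one loss would update both parts. The function names, the single ReLU hidden layer, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def relu_layer(inputs, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def wide_and_deep_predict(wide_features, wide_weights,
                          embedding, hidden_weights, hidden_biases,
                          deep_out_weights, bias):
    """Sum the wide and deep logits, then apply a sigmoid."""
    # Wide part: linear model over raw and cross-product binary features.
    wide_logit = sum(w * x for w, x in zip(wide_weights, wide_features))
    # Deep part: dense embedding fed through a hidden layer.
    hidden = relu_layer(embedding, hidden_weights, hidden_biases)
    deep_logit = sum(w * h for w, h in zip(deep_out_weights, hidden))
    return sigmoid(wide_logit + deep_logit + bias)


# Hypothetical inputs: two raw binary features plus one cross-product feature.
p = wide_and_deep_predict(
    wide_features=[1, 0, 1],
    wide_weights=[0.5, -0.25, 1.0],
    embedding=[0.5, -1.0],
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    deep_out_weights=[2.0, 3.0],
    bias=-2.5,
)
```

Because both logits feed a single output, the wide part only needs to correct the cases the deep part gets wrong, rather than being a full-size model of its own.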
 Heng-Tze Cheng
 J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.
 T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.
 S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
 J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 1799–1807. 2014.
 H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.
 B. Yan and G. Chen. AppJoy: Personalized mobile application discovery. In MobiSys, pages 113–126, 2011.