Wide and Deep Learning for Recommender Systems
This paper presents Wide & Deep Learning, an architecture that jointly trains wide linear models and deep neural networks. Recommender systems have widely used two kinds of models: generalized linear models with nonlinear feature transformations, which are good at memorization, and deep neural networks, which are good at generalization. Combining the two can achieve both memorization and generalization at the same time. By jointly training wide linear models and deep neural networks, the paper demonstrates that Wide & Deep Learning outperforms wide-only and deep-only models on Google Play, a recommender system with over one billion active users and over one million apps.
1. Generalized linear models such as logistic regression are trained on binarized sparse features with one-hot encoding and use cross-product transformations to achieve memorization. However, these models do not generalize to query-item feature pairs that never appeared in the training data, so they lack generalization.
2. Embedding-based models such as factorization machines (S. Rendle, 2012) or deep neural networks learn a low-dimensional dense embedding vector for each query and item feature. This lets them generalize to query-item feature pairs that have never been seen before, with less feature engineering. However, when the query-item matrix is sparse and high-rank, it is hard to learn an effective low-dimensional representation, so these models lack memorization.
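The two kinds of input described above can be illustrated with a minimal sketch. It assumes a toy one-hot vocabulary; the feature names, the embedding size, and the helper `cross_product` are hypothetical and chosen only for illustration. The cross-product feature is the logical AND of binary features, and the embeddings are random stand-ins for vectors that would be learned during training.

```python
import random

def cross_product(x, indices):
    """AND of the selected binary features: 1 only if all of them are 1."""
    return int(all(x[i] == 1 for i in indices))

# Hypothetical one-hot vocabulary: two query (user) features, two item features.
vocab = [
    "installed=netflix",
    "installed=maps",
    "impression=pandora",
    "impression=hulu",
]
x = [1, 0, 1, 0]  # the user installed netflix and is shown pandora

# Wide side: a cross feature fires only for the exact pair it was built for.
seen_pair = cross_product(x, [0, 2])    # installed=netflix AND impression=pandora
unseen_pair = cross_product(x, [1, 3])  # installed=maps AND impression=hulu

# Deep side: every sparse feature maps to a dense vector.
# Random values here; in a real model these vectors are learned.
random.seed(0)
EMBED_DIM = 4  # assumed size, for illustration only
embeddings = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab]

# Concatenate the embeddings of the active features to form the dense input.
dense_input = [v for i, active in enumerate(x) if active for v in embeddings[i]]

print(seen_pair, unseen_pair, len(dense_input))
```

A cross feature contributes nothing for a pair it has not seen, which is the memorization side, while the dense input is defined for any combination of known features, which is what lets the deep side generalize.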
Can we build a model that achieves both memorization and generalization? This question motivates jointly training wide and deep models, which is the idea behind Wide & Deep Learning.
Generalized linear models with cross-product transformations can be improved by adding less granular features, but this requires a lot of manual feature engineering. Embedding-based models, on the other hand, can be improved by adding linear models with cross-product feature transformations, which memorize the rules with a small number of parameters.
Thus, one way to handle both problems is to combine the wide and deep models in the training phase. This motivates the architecture of Cheng et al., which overcomes these difficulties by jointly training the wide and deep models and so takes advantage of both memorization and generalization.
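Joint training can be sketched as follows, assuming a single hidden layer with ReLU activation, a log-loss objective, and plain stochastic gradient descent. These are simplifications chosen for brevity, not the optimizers or layer sizes of the paper, and the parameter names (`w_wide`, `W_hidden`, `w_deep`, `bias`) are hypothetical. The point is that the wide and deep logits are summed before a single sigmoid, so one error signal updates both parts in the same step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def wide_deep_predict(wide_x, deep_x, params):
    """Sum the wide and deep logits, then apply one shared sigmoid."""
    wide_logit = sum(w * a for w, a in zip(params["w_wide"], wide_x))
    hidden = [max(0.0, h) for h in matvec(params["W_hidden"], deep_x)]  # ReLU
    deep_logit = sum(w * h for w, h in zip(params["w_deep"], hidden))
    return sigmoid(wide_logit + deep_logit + params["bias"]), hidden

def joint_sgd_step(wide_x, deep_x, label, params, lr=0.1):
    """One SGD step on log loss; the same error updates wide and deep weights."""
    p, hidden = wide_deep_predict(wide_x, deep_x, params)
    err = p - label  # derivative of log loss with respect to the summed logit
    # Hidden-layer update first, so it uses the pre-update output weights.
    for j, row in enumerate(params["W_hidden"]):
        if hidden[j] > 0.0:  # ReLU passes gradient only where it is active
            g = err * params["w_deep"][j]
            for k in range(len(row)):
                row[k] -= lr * g * deep_x[k]
    params["w_wide"] = [w - lr * err * a for w, a in zip(params["w_wide"], wide_x)]
    params["w_deep"] = [w - lr * err * h for w, h in zip(params["w_deep"], hidden)]
    params["bias"] -= lr * err
    return p

# Toy example: one positive training example, repeated for a few steps.
params = {
    "w_wide": [0.0, 0.0],
    "W_hidden": [[0.1, -0.2, 0.3], [0.2, 0.1, -0.1]],
    "w_deep": [0.5, -0.5],
    "bias": 0.0,
}
wide_x = [1.0, 0.0]        # e.g. one cross-product feature is active
deep_x = [0.5, -1.0, 2.0]  # e.g. concatenated embeddings
before, _ = wide_deep_predict(wide_x, deep_x, params)
for _ in range(50):
    joint_sgd_step(wide_x, deep_x, 1, params)
after, _ = wide_deep_predict(wide_x, deep_x, params)
print(before < after)
```

This differs from an ensemble, where each model would be trained separately and combined only at prediction time; here the wide part only needs to cover what the deep part gets wrong, because both are fitted against the same summed logit.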
The proposed architecture was implemented and evaluated on a real-world recommender system, the Google Play app store, in two respects: app acquisitions and serving performance.
For app acquisitions, the authors conducted live online experiments in an A/B testing framework for 3 weeks. In the control group, 1% of users were randomly selected and presented with recommendations from the previous ranking model, the wide-only model. In the experiment group, 1% of users were randomly selected and presented with recommendations from the Wide & Deep model, trained with the same set of features. Another 1% of users were randomly selected and presented with recommendations from the deep part of the model alone, with the same network structure and features. In Table 1, the Wide & Deep model improves the online acquisition gain by 3.9% relative to the wide-only control, while the deep-only model improves it by 2.9%, so Wide & Deep gains about 1% over deep-only. In the offline experiments, the Wide & Deep model outperforms the wide model and the deep model by 0.002 and 0.006 in AUC, respectively. The difference is relatively small offline compared to online because the labels in offline data are fixed, while the online system can generate new exploratory recommendations using both memorization and generalization.
For serving performance, the authors implemented multithreading and split each batch into smaller batches to handle peak traffic, which reduced the client-side latency from 31 ms to 14 ms, as shown in Table 2.
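The batch-splitting idea can be sketched as below. This is only an illustration under stated assumptions: `score_batch` is a hypothetical placeholder for one model forward pass, and the batch of 200 candidates and the 4 threads are illustrative values, not a description of the production serving system.

```python
from concurrent.futures import ThreadPoolExecutor

def score_batch(candidates):
    """Hypothetical placeholder scorer standing in for one model forward pass."""
    return [len(app_id) * 0.1 for app_id in candidates]

def split(batch, n_parts):
    """Split a batch into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, -(-len(batch) // n_parts))  # ceiling division, at least 1
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def score_in_parallel(batch, n_threads=4):
    """Score the chunks concurrently and return scores in the original order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() preserves input order, so the chunks come back in sequence.
        chunk_scores = list(pool.map(score_batch, split(batch, n_threads)))
    return [s for chunk in chunk_scores for s in chunk]

candidates = [f"app_{i}" for i in range(200)]
scores = score_in_parallel(candidates, n_threads=4)
print(len(scores))
```

In CPython, threads only shorten wall-clock time when the scoring call releases the global interpreter lock, as native inference libraries and remote calls typically do; with a pure-Python scorer like the placeholder above, the sketch shows the structure rather than an actual speedup.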
 Heng-Tze Cheng
 J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.
 T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.
 S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
 J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 1799–1807. 2014.
 H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.
 B. Yan and G. Chen. AppJoy: Personalized mobile application discovery. In MobiSys, pages 113–126, 2011.