Wide and Deep Learning for Recommender Systems
This paper presents Wide & Deep Learning, an architecture that jointly trains wide linear models and deep neural networks. Generalized linear models with nonlinear feature transformations, which are good at memorization, and deep neural networks, which are good at generalization, have both been widely used in recommender systems. Combining the two achieves both memorization and generalization at the same time. By jointly training wide linear models and deep neural networks, the paper demonstrates that the proposed Wide & Deep Learning outperforms wide-only and deep-only models on the Google Play recommender system, which serves over one billion active users and over one million apps.
1. Embedding-based models like factorization machines factorize the interactions between two variables as a dot product of two low-dimensional embedding vectors to achieve generalization.
2. Joint training of RNNs and maximum-entropy models with n-gram features in language modeling has significantly reduced the complexity of the RNN by learning direct weights between inputs and outputs.
3. Deep residual learning reduces the difficulty of training deeper models and improves accuracy with shortcut connections.
4. Collaborative deep learning has been used to couple deep learning for content information with collaborative filtering for the rating matrix.
Can we build a model that achieves both memorization and generalization? This question motivates jointly training wide and deep models, i.e., Wide & Deep Learning.
The generalization of linear models with cross-product transformations can be improved by adding features that are less granular, but this requires substantial manual feature engineering. Conversely, embedding-based models can over-generalize on sparse data, and they can be complemented by linear models with cross-product feature transformations, which memorize exception rules with a small number of parameters.
Thus, one way to handle both problems is to combine the wide and deep models during training. The architecture, proposed by Heng-Tze Cheng et al., overcomes these difficulties by jointly training the wide and deep models together, taking advantage of both memorization and generalization.
The wide component is a GLM of the form [math]y=w^Tx+b[/math], as illustrated in the left part of Figure 1, where y is the prediction, x is a vector of d features, w is the d-dimensional vector of model parameters, and b is the bias. The feature set includes transformed features produced by the cross-product transformation, which can be defined as:

[math]\phi_k(x)=\prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki}\in\{0,1\}[/math]

where [math]c_{ki}[/math] is 1 if the i-th feature is part of the k-th transformation and 0 otherwise. This transformation adds nonlinearity to the GLM and captures the interactions between binary features.
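As a minimal sketch of the cross-product transformation (the feature names and crossed pairs below are illustrative, not taken from the paper):

```python
import numpy as np

def cross_product_transform(x, crosses):
    """Cross-product transformation: phi_k(x) = prod_i x_i^{c_ki}.

    x       : 1-D binary feature vector.
    crosses : list of index tuples; each tuple lists the features whose
              conjunction (logical AND) defines one transformed feature.
    Returns a binary vector with one entry per cross.
    """
    x = np.asarray(x)
    return np.array([int(np.all(x[list(idx)] == 1)) for idx in crosses])

# Hypothetical binary features: [gender=female, language=en, installed_app=netflix]
x = np.array([1, 1, 0])
crosses = [(0, 1), (1, 2)]  # AND(gender=female, language=en), AND(language=en, installed_app=netflix)
phi = cross_product_transform(x, crosses)
# phi -> [1, 0]: only the first conjunction fires, since installed_app=netflix is 0
```

Each crossed feature is 1 only when all of its constituent binary features are 1, which is how the wide part memorizes specific feature co-occurrences.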
The deep component is a feed-forward neural network, as illustrated in the right part of Figure 1. High-dimensional sparse categorical features are first converted into low-dimensional, dense real-valued vectors (embedding vectors). The embedding vectors are initialized randomly and trained to minimize the final loss function during training. The dense embedding vectors are then fed into the hidden layers of the network in the forward pass.
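A minimal forward pass of the deep component could be sketched as follows, assuming small illustrative vocabulary sizes, embedding dimension, and layer widths (these are not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical vocabularies and embedding dimension.
vocab_sizes = {"app_id": 1000, "category": 30}
emb_dim = 8

# Embedding tables: random initialization, trained jointly with the network.
tables = {f: rng.normal(0.0, 0.01, size=(v, emb_dim)) for f, v in vocab_sizes.items()}

def relu(z):
    return np.maximum(0.0, z)

def deep_forward(sparse_ids, weights):
    """Look up embeddings, concatenate, and run through ReLU hidden layers."""
    a = np.concatenate([tables[f][i] for f, i in sparse_ids.items()])
    for W, b in weights:
        a = relu(W @ a + b)
    return a  # final hidden-layer activations

# Two hidden layers of width 16 (illustrative sizes).
d_in = emb_dim * len(vocab_sizes)
weights = [(rng.normal(0.0, 0.1, (16, d_in)), np.zeros(16)),
           (rng.normal(0.0, 0.1, (16, 16)), np.zeros(16))]

h = deep_forward({"app_id": 42, "category": 3}, weights)
```

The dense embeddings let the network generalize to feature combinations never seen in training, since nearby points in embedding space receive similar activations.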
During the training phase, the wide and deep components are combined using a weighted sum of their output log odds. This gives the prediction, which is then fed to one common logistic loss function for joint training, back-propagating the gradients from the output to both parts of the model simultaneously using mini-batch stochastic optimization. The authors used FTRL with L1 regularization as the optimizer for the wide part and AdaGrad for the deep part.
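The joint prediction can be sketched as a single logit that sums the wide and deep contributions before one sigmoid; the weights and activations below are random placeholders for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_predict(x_wide, w_wide, b, a_deep, w_deep):
    """P(y=1|x) = sigmoid(w_wide^T [x, phi(x)] + w_deep^T a_deep + b).

    The wide logit and the deep logit are summed (a weighted sum of
    log odds), and a single logistic loss then trains both parts jointly.
    """
    logit = w_wide @ x_wide + w_deep @ a_deep + b
    return sigmoid(logit)

rng = np.random.default_rng(1)
x_wide = rng.integers(0, 2, size=10).astype(float)  # raw + crossed binary features
a_deep = rng.random(16)                             # final hidden layer of the deep part
p = wide_deep_predict(x_wide, rng.normal(size=10), 0.0, a_deep, rng.normal(size=16))
```

Because both parts feed one loss, this is joint training rather than an ensemble: each component only needs to cover what the other misses.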
The proposed architecture was implemented and evaluated in a real-world recommender system, the Google Play app store, in two aspects: app acquisitions and serving performance.
For app acquisition, the authors conducted live online experiments in an A/B testing framework for 3 weeks. In the control group, 1% of users were randomly selected and presented with recommendations from the previous (wide-only) model; in one experiment group, 1% of users were presented with the Wide & Deep model trained with the same features; and in another, 1% of users were presented with the deep-only part of the model with the same network structure and features. As shown in Table 1, the Wide & Deep model improved the online acquisition rate by +3.9% over the wide (control) model, compared with +2.9% for the deep-only model. In offline experiments, the Wide & Deep model outperformed the wide and deep models by 0.002 and 0.006 AUC, respectively. Note that the difference is relatively small offline compared to online, since the labels in offline data are fixed, whereas the online system can generate new exploratory recommendations using both memorization and generalization.
For serving performance, during peak traffic the authors implemented multithreading and split each batch into smaller sizes, which reduced the client-side latency from 31 ms to 14 ms, as shown in Table 2.
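The serving optimization can be sketched as splitting one large scoring request into smaller batches run on parallel threads; the scoring function and shard count here are placeholders, not the production system:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def score_batch(batch):
    """Stand-in for model inference on a mini-batch of candidate apps."""
    return [float(np.tanh(x)) for x in batch]

def serve(candidates, n_shards=4):
    """Split one large scoring request into smaller batches and score
    them on parallel threads, mimicking the paper's latency optimization."""
    shards = np.array_split(np.asarray(list(candidates), dtype=float), n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = pool.map(score_batch, shards)
    return [s for shard in results for s in shard]

scores = serve(range(100))
```

Smaller parallel batches trade a little per-batch efficiency for much lower tail latency, which is what matters when every request must score hundreds of candidates within the serving deadline.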
Achieving both memorization and generalization is important in recommender systems. The Wide & Deep Learning proposed in the paper combines a wide model and a deep model to achieve both: the wide linear model memorizes sparse feature interactions through cross-product feature transformations, while the deep neural network uses low-dimensional representations to generalize to unseen feature interactions. The proposed model led to a significant improvement in app acquisitions over wide-only and deep-only models on the Google Play recommender system.
The Wide & Deep Learning framework has been widely adopted in industrial recommender systems in the years since its publication. However, the wide component still depends on specialized domain knowledge for cross-product feature engineering, and the model cannot automatically learn both low-order and high-order feature interactions at the same time.
 J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS, 2011.
 T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky. Strategies for training large scale neural network language models. In IEEE Automatic Speech Recognition & Understanding Workshop, 2011.
S. Rendle. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & Deep learning for recommender systems. arXiv:1606.07792, June 2016.
 H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Proc. KDD, pages 1235–1244, 2015.