                "title": "Reinforcement Learning of Theorem Proving",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "== Introduction ==\nAutomated reasoning over mathematical proofs was a major motivation for the development of computer science. Automated theorem provers(ATPs) can, in principle, be used to attack any formally stated mathematical problem and is a research area that has been present since the early 20th century [1]. As of today, state-of-art ATP systems rely on the fast implementation of complete proof calculi, such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for their performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.\n\nIn this paper, the authors propose a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi are the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It is shown that reinforcement learning results in a 42.1% performance increase compared to the base prover(without learning).\n\n== Related Work ==\nC. Kalizyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, whose underlying proof calculi is the same as this paper in 2015 [3]. Their algorithm learns from existing proofs to choose the next tableau extension step. Since the MaLARea [8] system, number of iterations of a feedback loop between proving and learning have been explored, remarkably improving over human-designed heuristics when reasoning in large theories. However, such systems are known to only learn a high-level selection of relevant facts from a large knowledge base and delegate the internal proof search to standard ATP systems. S. Loos, et al. developed an supervised learning ATP system in 2017 [4], with superposition as their proof calculi. However, they chose deep neural network (CNNs and RNNs) as feature extractor. These systems are treated as black boxes in literature with not much understanding of their performances possible. \n\nIn leanCoP [9], one of the simpler connection tableau systems, the next tableau extension step could be selected using supervised learning. In addition, the first experiments with Monte-Carlo guided proof search [5] have been done for connection tableau systems. The improvement over the baseline measured in that work is much less significant than here. This is closest to the authors' approach but the performance is poorer than this paper.\n\nOn a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [6], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights into how to process datasets of formally stated mathematical problems.\n\n== First Order Logic and Connection Calculi ==\nHere we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. 
Let us try to prove the following first-order sentence.

[[file:fof_sentence.png|frameless|450px|center]]

This sentence can be transformed into a formula in Skolemized Disjunctive Normal Form (DNF), which is referred to as the "matrix".

[[file:skolemized_dnf.png|frameless|450px|center]]
[[file:matrix.png|frameless|center]]

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculus attempts to show that the Skolemized DNF formula is a tautology by constructing a tableau. We start at a special node, the root, which is an open leaf. At each step, we select a clause (for example, clause <math display="inline">P \wedge R</math> is selected in the first step) and add its literals as children of an existing open leaf. For every open leaf, we examine the path from the root to this leaf. If two literals on this path are unifiable (for example, <math display="inline">Qx'</math> is unifiable with <math display="inline">\neg Qc</math>), this leaf is closed. An example of a closed tableau is shown in Figure 1. In standard terminology, we say that a connection is found on this branch.

[[file:tableaux_example.png|thumb|center|Figure 1. An example of a closed tableau. Adapted from [2]]]

The goal is to close every leaf, i.e. to find a connection on every branch. If such a state is reached, we have shown that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableau, the example sentence is indeed valid.

In formal terms, the rules of the connection calculus are shown in Figure 2, and the formal tableau for the example sentence is shown in Figure 3. Each leaf is denoted as <math display="inline">subgoal, M, path</math>, where <math display="inline">subgoal</math> is a list of literals for which we still need to find connections, <math display="inline">M</math> stands for the matrix, and <math display="inline">path</math> stands for the path leading to this leaf.

[[file:formal_calculi.png|thumb|600px|center|Figure 2. Formal connection calculus. Adapted from [2].]]
[[file:formal_tableaux.png|thumb|600px|center|Figure 3. Formal tableau constructed from the example sentence. Adapted from [2].]]

To sum up, the bare prover follows a very simple algorithm: given a matrix, a non-negated clause is chosen as the first subgoal. The function ''prove(subgoal, M, path)'' is stated as follows:
* If ''subgoal'' is empty
**    return ''TRUE''
* If reduction is possible
**    Perform reduction, generating ''new_subgoal'', ''new_path''
**    return ''prove(new_subgoal, M, new_path)''
* For all clauses in ''M''
**    If a clause can do extension with ''subgoal''
***    Perform extension, generating ''new_subgoal1'', ''new_path'', ''new_subgoal2''
***    return ''prove(new_subgoal1, M, new_path)'' and ''prove(new_subgoal2, M, path)''
* return ''FALSE''
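Below is a minimal Python sketch of this recursion. It is not the authors' OCaml implementation: literals are restricted to propositional (sign, symbol) pairs, so the unifiability test degenerates to finding a complementary literal, and a crude depth limit stands in for iterative deepening. All names and the example matrix are illustrative.

<syntaxhighlight lang="python">
# Minimal propositional sketch of the bare connection prover described above.
# A literal is a (sign, symbol) pair, a clause is a list of literals, and the
# matrix is a list of clauses.

def negate(lit):
    sign, symbol = lit
    return (not sign, symbol)

def prove(subgoals, matrix, path, depth):
    if not subgoals:                      # every literal on this branch is closed
        return True
    if depth == 0:                        # crude depth limit instead of iterative deepening
        return False
    goal, rest = subgoals[0], subgoals[1:]
    # Reduction: the complement of the current goal already lies on the path.
    if negate(goal) in path:
        return prove(rest, matrix, path, depth - 1)
    # Extension: pick a clause containing the complement of the current goal.
    for clause in matrix:
        if negate(goal) in clause:
            new_subgoal1 = [l for l in clause if l != negate(goal)]
            if prove(new_subgoal1, matrix, path + [goal], depth - 1) and \
               prove(rest, matrix, path, depth - 1):
                return True
    return False

# A small tautological DNF matrix: (P and R) or (not P and not Q) or (not R and not Q) or Q
matrix = [[(True, 'P'), (True, 'R')],
          [(False, 'P'), (False, 'Q')],
          [(False, 'R'), (False, 'Q')],
          [(True, 'Q')]]
print(prove([(True, 'P'), (True, 'R')], matrix, [], depth=8))  # True: a closed tableau exists
</syntaxhighlight>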
It is important to note that the bare prover implemented in this paper is incomplete. Here is a pathological example. Suppose the following matrix (which is trivially a tautology) is fed into the bare prover, and let clause <math display="inline">P(0)</math> be the first subgoal. Clearly, choosing <math display="inline">\neg P(0)</math> to extend will complete the proof.

[[file:pathological.png|frameless|400px|center]]

However, if we choose <math display="inline">\neg P(x) \lor P(s(x))</math> for the extension, the algorithm will generate an infinite branch <math display="inline">P(0), P(s(0)), P(s(s(0))) ...</math>. It is the task of reinforcement learning to guide the prover towards a successful proof in such scenarios.

A technique called iterative deepening can be used to avoid such infinite branches, making the bare prover complete. Iterative deepening forces the prover to try all shorter proofs before moving on to longer ones; it is effective, but it also wastes valuable computing resources enumerating all short proofs.

In addition, the provability of first-order sentences is in general undecidable (a result due to Church and Turing on the Entscheidungsproblem), which sheds light on the difficulty of automated theorem proving.

== Mizar Math Library ==
The Mizar Math Library (MML) [7, 10] is a library of mathematical theories. The axioms behind the library are those of Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article for the theorem "If <math display="inline"> p </math> is prime, then <math display="inline"> \sqrt p </math> is irrational."

[[file:mizar_article.png|thumb|center|Figure 4. An article from MML. Adapted from [6].]]

The training and testing data for this paper is a subset of MML, Mizar40, which consists of 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar40 library; it states that with ''d3_xboole_0'' and ''t3_xboole_0'' as premises, we can prove ''t5_xboole_0''.

[[file:mizar40_0.png|frameless|400px|center]]
[[file:mizar40_1.png|frameless|600px|center]]
[[file:mizar40_2.png|frameless|600px|center]]
[[file:mizar40_3.png|frameless|600px|center]]

== Monte Carlo Guidance ==

Monte Carlo tree search (MCTS) is a heuristic search algorithm for certain kinds of decision processes. The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The outcomes of these expansions are then used to weight the nodes in the search tree.

In the reinforcement learning setting, an action is defined as one inference (either reduction or extension), and the proof state is defined as the whole tableau. To implement Monte-Carlo tree search, each proof state <math display="inline"> i </math> maintains three parameters: its prior probability <math display="inline"> p_i </math>, its total reward <math display="inline"> w_i </math>, and its number of visits <math display="inline"> n_i </math>. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: if leaf state <math display="inline"> i </math> has <math display="inline"> G_i </math> open subgoals, the reward is computed as <math display="inline"> 0.95 ^ {G_i} </math>. This heuristic is replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts
\begin{align}
{\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}}
\end{align}
where <math display="inline"> N </math> stands for the total number of visits of the parent node.

The bare prover is asked to play <math display="inline"> b </math> playouts of length <math display="inline"> d </math> from the empty tableau; each playout backpropagates the values of the proof states it visits. After these <math display="inline"> b </math> playouts a special action (inference) is made, corresponding to an actual move, resulting in a new bigstep tableau. The next <math display="inline"> b </math> playouts start from this tableau, followed by another bigstep, and so on.
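A small sketch of this selection rule, assuming each tree node stores the <math display="inline">(p_i, w_i, n_i)</math> triple described above; the node class and the leaf-reward helper are illustrative, not the paper's OCaml code.

<syntaxhighlight lang="python">
import math

class Node:
    """One proof state in the MCTS tree."""
    def __init__(self, prior=1.0):
        self.p = prior        # prior probability (1.0 when no policy is learned)
        self.w = 0.0          # total reward accumulated by playouts
        self.n = 0            # number of visits
        self.children = []

def leaf_reward(open_subgoals):
    """Heuristic reward 0.95**G_i for a leaf with G_i open subgoals."""
    return 0.95 ** open_subgoals

def uct_score(child, parent_visits):
    """UCT value w_i/n_i + 2 * p_i * sqrt(log N / n_i) from the formula above."""
    if child.n == 0:
        return float('inf')   # always try unvisited children first
    exploit = child.w / child.n
    explore = 2.0 * child.p * math.sqrt(math.log(parent_visits) / child.n)
    return exploit + explore

def select_child(parent):
    return max(parent.children, key=lambda c: uct_score(c, parent.n))
</syntaxhighlight>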
== Policy Learning and Guidance ==

From many runs of MCTS, we can estimate how often each action (inference) is taken in particular proof states: we extract the frequency of each action <math display="inline"> a </math> and normalize it by the average action frequency at that state, resulting in a relative proportion <math display="inline"> r_a \in (0, \infty) </math>. We characterize proof states for policy learning by extracting human-engineered features, and we characterize actions by features extracted from the chosen clause and the chosen literal. This gives a feature vector <math display="inline"> (f_s, f_a) </math>.

The feature vector <math display="inline"> (f_s, f_a) </math> is regressed against the associated <math display="inline"> r_a </math>.

During the proof search, the prior probabilities <math display="inline"> p_i </math> of the available actions <math display="inline"> a_i </math> in a state <math display="inline"> s </math> are computed as the softmax of their predictions.

Training examples are only extracted from bigstep states, making the amount of training data manageable.

== Value Learning and Guidance ==

Bigstep states are also used for proof state evaluation. For a proof state <math display="inline"> s </math> that corresponds to a successful proof, the value is assigned as <math display="inline"> v_s = 1 </math>; for a failed proof, <math display="inline"> v_s = 0 </math>. Otherwise, denoting the distance between state <math display="inline"> s </math> and a successful state by <math display="inline"> d_s </math>, the value is assigned as <math display="inline"> v_s = 0.99^{d_s} </math>.

The proof state features <math display="inline"> f_s </math> are regressed against the value <math display="inline"> v_s </math>. During the proof search, the rewards of leaf states are computed from this prediction.
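The following sketch shows how these two learned predictors could plug into the search, under the description above. The regressor objects and their ''predict'' interface are placeholders (the paper uses XGBoost models called from OCaml), so this is only a schematic illustration.

<syntaxhighlight lang="python">
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def action_priors(policy_model, f_s, action_features):
    """Priors p_i for the available actions: softmax of the policy model's
    predictions on the concatenated (f_s, f_a) feature vectors."""
    preds = [policy_model.predict(f_s + f_a) for f_a in action_features]
    return softmax(preds)

def leaf_value(value_model, f_s):
    """Learned replacement for the 0.95**G_i heuristic leaf reward."""
    return value_model.predict(f_s)

def value_target(proved, failed, distance_to_proof):
    """Training target v_s for a bigstep state, as defined above."""
    if proved:
        return 1.0
    if failed:
        return 0.0
    return 0.99 ** distance_to_proof
</syntaxhighlight>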
== Features and Learners ==
For proof states, features are collected from the whole tableau (subgoals, matrix, and paths). Each unique symbol is represented by an integer, so a tableau can be represented as a sequence of integers. A term walk combines such a sequence into a single integer by repeatedly multiplying by a fixed large prime and adding the components; the resulting integer is then reduced to a smaller feature space by taking it modulo a large prime.

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks, the authors also added several common features: the number of goals, the total symbol size of all goals, the length of the active paths, the number of current variable instantiations, and the most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.
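A small sketch of this term-walk hashing as described above; the primes and the symbol table are arbitrary illustrative choices, not the values used in the paper.

<syntaxhighlight lang="python">
# Illustrative sketch of term-walk feature hashing.
BIG_PRIME = 1000003        # multiplier used to combine the sequence
FEATURE_SPACE = 262139     # modulus defining the reduced feature space

def term_walk(symbol_ids):
    """Combine a sequence of symbol integers into one integer, then reduce it."""
    h = 0
    for s in symbol_ids:
        h = h * BIG_PRIME + s
    return h % FEATURE_SPACE

# Example: hash the symbol sequence of a small term such as P(s(x)).
symbol_table = {'P': 1, 's': 2, 'x': 3}
print(term_walk([symbol_table[t] for t in ('P', 's', 'x')]))
</syntaxhighlight>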
== Experimental Results ==
The dataset used in the paper is Mizar40, divided into training and testing sets with a ratio of 9 to 1; according to the authors, the split is random. During the experiments, the authors' method was able to prove 32,524 statements out of 146,700 statements. The problems are first transformed from first-order logic into DNF (disjunctive normal form). The authors use the smaller M2k dataset to compare the performance of mlCoP, the bare prover, and rlCoP using only UCT; 577 test problems were used with rlCoP.

*Performance without Learning
Table 3 shows the baseline result. The performance of the bare prover is significantly lower than that of mlCoP and of rlCoP without policy/value learning.
[[file:table3.png|550px|center]]
*Reinforcement Learning of Policy Only
In this experiment, the authors evaluated rlCoP with UCT using policy learning only. After each iteration, the policy training data from previous iterations were used to train a new predictor, which means only the first iteration ran without a policy, while all later iterations used the previously collected policy training data.
From Table 4, rlCoP is better than mlCoP run with the much higher <math>4 \times 10^{6}</math> inference limit after the fourth iteration.
[[file:table4.png|550px|center]]
*Reinforcement Learning of Value Only
This experiment was similar to the previous one; however, only learned values rather than a learned policy were used. From Table 5, the performance of rlCoP is close to mlCoP but below it after 20 iterations, and it is far below rlCoP using only policy learning.
[[file:table5.png|550px|center]]
*Reinforcement Learning of Policy and Value
From Table 6, after 20 iterations the performance of rlCoP is 19.4% higher than mlCoP with <math>4 \times 10^{6}</math> inferences, 13.6% higher than the best iteration of rlCoP with policy only, and 44.3% higher than the best iteration of rlCoP with value only.
[[file:table6.png|550px|center]]
In addition, the authors evaluated the effect of the joint reinforcement learning of both policy and value: replacing the final policy or value with the best one from the policy-only or value-only runs decreased performance in both cases.

*Evaluation on the Whole Mizar40 Dataset
The authors split the Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem, and 10 iterations of policy and value learning are performed (based on MCTS). The training and testing results are shown below. In the tables, ''mlCoP'' stands for the bare prover with iterative deepening (i.e. a complete automated theorem prover with the connection calculus), and ''bare prover'' stands for the prover implemented in this paper, without MCTS guidance.

[[file:atp_result0.jpg|frame|550px|center|Figure 5a. Experimental results on the Mizar40 dataset]]
[[file:atp_result1.jpg|frame|550px|center|Figure 5b. More experimental results on the Mizar40 dataset]]

As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving; the 42.1% improvement is unusually high, since published improvements in this field are typically between 3% and 10% [1].

Besides these results, it was also found that some test problems could be solved easily by rlCoP but not by mlCoP.

[[file:picture3.png|frame|550px|center|Figure 6: The MCTS tree for the WAYBEL_0:28 problem at the moment when the proof is found. For each node we display the predicted probability p, the number of visits n, and the average reward r = w/n. For the (thicker) nodes leading to the proof, the corresponding local proof goals are presented on the right.]]

== Conclusions ==
In this work, the authors developed an automated theorem prover that uses no domain engineering and instead relies on Monte Carlo tree search guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach [1].

The authors suggest that future research could include stronger learning algorithms to characterize mathematical data. The development of suitable deep learning architectures would help characterize semantic and syntactic features of mathematical objects, which will be crucial for creating strong assistants for mathematics and the hard sciences.

== Critiques ==
Automated reasoning is still relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that, with a proper neural network architecture, features extracted by deep learning will be superior to human-engineered features, as is also shown in [4, 6].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:
* Loop
** Run MCTS with the previous model on the entire dataset
** Collect the MCTS data
** Train a new model
If we adapted this to an online learning scheme, learning as soon as MCTS generates new data and updating the model immediately, there might be some performance increase; a sketch of the batch loop is given below.
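For concreteness, here is a schematic sketch of the batch loop described above. The helper functions are supplied by the caller and are purely illustrative; the sketch only captures the control flow in which the predictors are retrained after a full pass over the dataset.

<syntaxhighlight lang="python">
def reinforcement_loop(problems, run_prover, train_policy, train_value, iterations=10):
    """Batch policy/value learning loop.
    run_prover(problem, policy, value) -> list of training examples collected
    from bigstep states; train_policy / train_value fit new predictors."""
    policy, value = None, None           # the first iteration runs without guidance
    for _ in range(iterations):
        data = []
        for problem in problems:
            data.extend(run_prover(problem, policy, value))
        policy = train_policy(data)      # models are retrained only after the full pass
        value = train_value(data)
    return policy, value
</syntaxhighlight>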
The experimental design of this paper has some flaws. The authors compare the performance of ''mlCoP'' and ''rlCoP'' by limiting them to the same number of inference steps. However, every inference step of ''rlCoP'' requires an additional machine learning prediction, which costs more time. A better way to compare their performance would be to set a time limit.

It would also be interesting to study automated theorem proving in another logic system, such as higher-order logic, because many mathematical concepts can only be expressed in higher-order logic.

== References ==
[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science, vol. 9450, pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo Tableau Proof Search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563-579. Springer, 2017.

[6] A. Alemi, et al. DeepMath - Deep Sequence Models for Premise Selection. NIPS 2016.

[7] Mizar Math Library. http://mizar.org/library/

[8] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441-456. Springer, 2008.

[9] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. J. Symb. Comput., 36(1-2):139-161, 2003.

[10] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a Nutshell. J. Formalized Reasoning, 3(2):153-245, 2010.
                "title": "Representations of Words and Phrases and their Compositionality",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "Representations of Words and Phrases and their Compositionality is a popular paper published by the Google team led by Tomas Mikolov  in 2013. It is known for its impact in the field of Natural Language Processing and the techniques described below are still used today.\n\n= Presented by = \n*F. Jiang\n*J. Hu\n*Y. Zhang\n\n\n= Introduction =\nThis paper, \"Distributed Representations of Words and Phrases and their Compositionality\" proposes several methods of improving the performance metrics of the Skip-gram model introduced in a previous paper, a Natural Language Processing technique of encoding words as arbitrary dimensional vectors using a neural network framework. Notably, the Skip-gram model can be made to train faster and produce higher accuracy via a number of simple adjustments; the replacement of the hierarchical soft max function with simple negative sampling, and the subsampling of frequent words.\n\n= Skip Gram Model =\nThe Skip-gram model is a Natural Language Processing method based upon a neural network structure designed to learn vector representations of words in such a way as to produce similar encodings for words that appear in similar contexts. While the model can be used to evaluate certain probabilities, this is considered a side effect of its learning process; its primary function is that of a Word2Vec encoder.\n\nSkip-gram is structured as a one-layer neural network with no non-linear activation function in the hidden layer but a soft-max classifier in the output layer. Words or phrases are encoded using one-hot encoding; the input and output vectors are constructed such that the index of a certain word is indicated by the number 1 within a length determined by a pre-specified vocabulary or corpus (e.g. the word \"ant\" indicated by 1 at its position in the corpus vector while everything else is 0). The size of the hidden layer is also specified as a hyper parameter; larger sizes of the hidden layer will result in encodings of better quality but take longer to train.\n\n[[File:skipmod2.PNG|600px|thumb|center]]     \n  \nThe central premise behind Skip-gram's learning process is that words or phrases that appear close together regularly in the training set are deemed to have similar contexts and should therefore be encoded in such a way as to maximize the probability of the model predicting their appearance together. Training data is prepared by producing a series of word pairs from the training test via a \"window size\" hyper-parameter that specifies all words a certain number ahead and behind the target word as the desired output, while iterating through all the words of the passage. For example, the model will may learn from the training set that \"steering\" and \"wheel\" appear in similar contexts. This means that one is a good predictor of the other, but also that \"driving\" is a good predictor of both. Thus, feeding any one of them into the model should produce high probabilities (of each appearing in the same context) for the all the others. 
Once we have a neural net that predicts contextual probabilities to an acceptable degree, the hidden layer weights are saved as the desired Word2Vec encodings (as an <math>n \times d</math> matrix in which each row is the encoding of the corpus word at that row index).

[[File:window_skip.PNG|left]]
[[File:table_skip.PNG|center]]

One advantage of the Skip-gram model over older N-gram models is that the encodings preserve certain linguistic patterns that manifest in surprisingly clear and intuitive ways. For example, linear operations work on Skip-gram encodings in a surprisingly logical way; the paper notes that on their trained model, vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector in the corpus. In a sense, subtracting “Spain” from “Madrid” extracts the notion of a capital city that, when added to “France”, produces “Paris”. This property is so attractive that the paper uses it as a benchmark for its Skip-gram implementations (“Do linear operations produce logical results?”).

= Hierarchical Softmax =
Although the Skip-gram method is described using the softmax function <math>(1)</math> for the purposes of computing <math>\nabla \log p(w_{O}|w_{I})</math> in the backpropagation stages, practical considerations make it difficult to calculate this gradient, particularly if the corpus contains many words, since its computational complexity scales linearly with <math>W</math>.

<math>(1)\quad p(w_{O}|w_{I}) = \frac{\exp(v_{w_{O}}^{'T}v_{w_{I}})}{\sum_{w=1}^W \exp(v_{w}^{'T}v_{w_{I}})} </math>
where <math>v_{w}</math> and <math>v_{w}'</math> are the input and output representations of <math>w</math>, and <math>W</math> is the vocabulary size.

Instead, the Skip-gram model described in Mikolov et al. uses an approximation called hierarchical softmax, which provides better asymptotic performance for large numbers of output nodes. Rather than evaluate <math>W</math> output nodes, hierarchical softmax can instead evaluate about <math>\log W</math> nodes. This is done by encoding the output layer using a binary or Huffman tree where the <math>W</math> words are represented as leaves and each node represents the relative probability of all its child nodes. Optimal asymptotic performance is achieved if the tree is a balanced binary tree, in which case <math>\log W</math> complexity is possible. Softmax probabilities are calculated using <math>(2)</math>.

<math>(2)\quad p(w_{O}|w_{I}) = \prod_{j=1}^{L(w)-1} \sigma \Bigl(\bigl\|n(w,j+1)=ch(n(w,j))\bigr\| \cdot v_{n(w,j)}^{'T}v_{w_{I}} \Bigr) </math>
where <math>\sigma(x) = 1/(1+\exp(-x))</math>, <math>n(w,j)</math> is the j-th node on the path from the root to <math>w</math>, <math>L(w)</math> is the length of that path, <math>ch(n)</math> is an arbitrary fixed child of <math>n</math>, <math>v_{n}'</math> is the vector representation of inner node <math>n</math>, and <math>\|x\|</math> is 1 if <math>x</math> is true and -1 otherwise.
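A compact sketch of evaluating formula <math>(2)</math> for one output word, assuming the root-to-leaf path has already been extracted from the tree; the data layout and the toy vectors are illustrative.

<syntaxhighlight lang="python">
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hierarchical_softmax_prob(path, v_input):
    """Evaluate formula (2): `path` is a list of (node_vector, sign) pairs along
    the root-to-leaf path of the output word, where sign is +1 if the next node
    on the path is the designated child ch(n) and -1 otherwise."""
    prob = 1.0
    for node_vector, sign in path:
        dot = sum(a * b for a, b in zip(node_vector, v_input))
        prob *= sigmoid(sign * dot)
    return prob

# Toy example: a path of two inner nodes with 3-dimensional vectors.
path = [([0.1, -0.2, 0.3], +1), ([0.0, 0.5, -0.1], -1)]
print(hierarchical_softmax_prob(path, [0.2, 0.1, 0.4]))
</syntaxhighlight>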
= Negative Sampling =

Using the Skip-gram model, for each input word in a 1M-word dictionary we adjust 1M weights on the output layer, which can be very slow. NCE was the previous state-of-the-art solution for efficiently reducing this cost. This paper presents a new technique: negative sampling.

Noise Contrastive Estimation (NCE) was introduced in 2012 by Gutmann and Hyvärinen. It uses logistic regression to differentiate data from noise, and it approximately maximizes the log probability of the softmax. This, however, is not needed for the Skip-gram model, since our goal is only to learn high-quality vector representations for context encoding. Negative sampling is defined by the following formula:

[[File:wordembedding negativesampling.png|700px|thumb|center]]

It retains the quality of the Skip-gram model while updating only a small subset of the output weights for each training example: the positive sample plus <math>k</math> negative samples. The value of <math>k</math> can be set arbitrarily, though Mikolov et al. recommend 2-5 for a large dataset and 5-20 for a smaller dataset. The main difference from NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples.

To determine which negative samples should be chosen, a unigram distribution is used, chosen based on empirical results. It is defined as:

[[File:Screen_Shot_2018-11-21_at_2.03.32_AM.png|700px|thumb|center]]

The probability function is based on the frequency of each word in the dataset.

= Subsampling of Frequent Words =
Frequently occurring words often do not provide as much information as rare words. For example, the word pair ("boat", "the") is likely to occur far more often than ("boat", "sea"), yet the latter provides the opportunity to encode important contextual information.

This is a typical case: word pairs containing commonly occurring words often do not provide as much information as rare words. Thus, in order to speed up our implementation of Skip-gram, we discard the word <math>w_{i}</math> from our sample text with probability <math> P(w_{i})=1-\sqrt{\frac{t}{f(w_{i})}} </math>, where <math>f(w_{i})</math> is the frequency of word <math>w_{i}</math> and <math>t</math> is a chosen threshold, typically around <math>10^{-5}</math>.

As a word becomes rarer, the chance of discarding it decreases, approaching 0 as the word's frequency approaches <math>t</math>. The threshold <math>t</math> was chosen empirically, as it was shown to work well in practice; it aggressively subsamples words whose frequency is greater than <math>t</math> while preserving the ranking of the frequencies. One thing to note is that <math> P(w_{i}) </math> becomes negative for a word with frequency less than <math>t</math>; a simple solution is to set <math> P(w_{i}) = 0 </math> for any such word.

This procedure provides a significant speedup to the algorithm, as there are many frequently occurring words that can be cut, yet they often encode minimally important information. As the results show, both accuracy and training speed increase.
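A small sketch of this subsampling rule; the word frequencies and threshold are illustrative.

<syntaxhighlight lang="python">
import math
import random

def keep_word(word, freq, t=1e-5):
    """Decide whether to keep one occurrence of `word`.
    freq[word] is its relative frequency f(w); the word is discarded with
    probability P(w) = 1 - sqrt(t / f(w)), clamped at 0 for rare words."""
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq[word]))
    return random.random() >= p_discard

# Toy frequencies: "the" is very common and gets aggressively subsampled,
# while "boat" is rare and is almost always kept.
freq = {'the': 0.05, 'boat': 2e-5}
tokens = ['the', 'boat', 'the', 'the', 'boat']
print([w for w in tokens if keep_word(w, freq)])
</syntaxhighlight>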
= Empirical Results =

To evaluate the results of these optimizations, Mikolov et al. used an internal dataset at Google containing about one billion words. By removing all words which occurred fewer than 5 times, the vocabulary size dropped to 692K words. Two types of analogies were examined: syntactic and semantic. Syntactic analogies relate different forms of the same word (e.g. “quick” : “quickly” :: “slow” : “slowly”). Semantic analogies relate pairs of words with the same semantic relationship; for example, “Berlin” : “Germany” and “Paris” : “France” are semantic analogies.

[[File:wordembedding empiricalresults.png|700px|thumb|center]]

Finally, the model was compared to state-of-the-art models from 2013 to evaluate its accuracy. The word2vec project was trained on a dataset of 30 billion words with 1000-dimensional vectors. A sample of its results on infrequently used words, compared with the models of Collobert, Turian, and Mnih, is shown below. We can see that the Skip-gram phrase model is comparatively much faster to train and produces very accurate results.

[[File:wordembedding_studies.png|700px|thumb|center]]

= References =
[1] McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com

[2] McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com

[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546. ''(The paper in question)''

[4] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. ''(Earlier paper)''