Reinforcement Learning of Theorem Proving

Introduction

Automated reasoning over mathematical proof was a major motivation for the development of computer science. Automated theorem provers (ATP) can in principle be used to attack any formally stated mathematical problem, and is a research area that has been studied since the early 20th century. [1] As of today, state-of-art ATP systems rely on fast implementation of complete proof calculi. such as resolution and tableau. However, they are still far weaker than trained mathematicians. Within current ATP systems, many heuristics are essential for their performance. As a result, in recent years machine learning has been used to replace such heuristics and improve the performance of ATPs.

In this paper the authors implemented a reinforcement learning based ATP, rlCoP. The proposed ATP reasons within first-order logic. The underlying proof calculi is the connection calculi [2], and the reinforcement learning method is Monte Carlo tree search along with policy and value learning. It was shown that reinforcement learning results in a 42.1% performance increase compared to the base prover (without learning).

Related Work

C. Kalizyk and J. Urban proposed a supervised learning based ATP, FEMaLeCoP, whose underlying proof calculi is the same as this paper in 2015 [3]. Their algorithm learns from existing proofs to choose the next tableau extension step. Similarly, S. Loos, et al. developed an supervised learning ATP system in 2017 [4], with superposition as their proof calculi. However they chose deep neural network (CNNs and RNNs) as feature extractor.

On a different note, A. Alemi, et al. proposed a deep sequence model for premise selection in 2016 [5], and they claim to be the first team to involve deep neural networks in ATPs. Although premise selection is not directly linked to automated reasoning, it is still an important component in ATPs, and their paper provides some insights about how to process datasets of formally stated mathematical problems.

First Order Logic and Connection Calculi

Here we assume basic first-order logic and theorem proving terminology, and we will offer a brief introduction of the bare prover and connection calculi. For example, let us try to prove the following first-order sentence.

This sentence can be transformed into a formula in Skolemized DNF form, which is referred to as the "matrix".

The original first-order sentence is valid if and only if the Skolemized DNF formula is a tautology. The connection calculi attempts to show that the Skolemized DNF formula is a tautology by constructing a tableaux. You will start at the special node, root, which is an open leaf. At each step, you select a clause (for example, clause [math]\displaystyle{ P \wedge R }[/math] is selected in the first step), and add the literals as children for an existing open leaf. For every open leaf, examine the path from root to this leaf. If two literals on this path are unifiable (for example, [math]\displaystyle{ Qx' }[/math] is unifiable with [math]\displaystyle{ \neg Qc }[/math]), this leaf is then closed. In standard terminology it states that a connection is found on this branch.

Our goal is to close every leaf, i.e. on every branch, there exists a connection. If such state is reached, we have showed that the Skolemized DNF formula is a tautology, thus proving the original first-order sentence. As we can see from the constructed tableaux, the example sentence is indeed valid.

In formal terms, the rules of connection calculi is shown in Figure 2, and the formal tableaux for the example sentence is shown in Figure 3. Each leaf is denoted as [math]\displaystyle{ subgoal, M, path }[/math] where [math]\displaystyle{ subgoal }[/math] is a list of literals that we need to find connection later, [math]\displaystyle{ M }[/math] stands for the matrix, and [math]\displaystyle{ path }[/math] stands for the path leading to this leaf.

To sum up, the bare prover follows a very simple algorithm. given a matrix, a non-negated clause is chosen as the first subgoal. The function prove(subgoal, M, path) is stated as follows:

If subgoal is empty
- return TRUE
If reduction is possible
- Perform reduction, generating new_subgoal, new_path
- return prove(new_subgoal, M, new_path)
For all clauses in M
- If a clause can do extension with subgoal
- Perform extension, generating new_subgoal1, new_path, new_subgoal2
- return prove(new_subgoal1, M, new_path) and prove(new_subgoal2, M, path)
return FALSE

It is important to note that the bare prover implemented in this paper is incomplete. For example, here is a pathological example. Suppose the following matrix (which is trivially a tautology) is feed into the bare prover. Let clause [math]\displaystyle{ P(0) }[/math] be the first subgoal. Clearly choosing [math]\displaystyle{ \neg P(0) }[/math] to extend will complete the proof.

However, if we choose [math]\displaystyle{ \neg P(x) \lor P(s(x)) }[/math] to do extension, the algorithm will generate an infinite branch [math]\displaystyle{ P(0), P(s(0)), P(s(s(0))) ... }[/math]. It is the task of reinforcement learning to guide the prover in such scenarios towards a successful proof.

In addition, the provability of first order sentences is generally undecidable (this result is named the Church-Turing Thesis), which sheds light on the difficulty of automated theorem proving.

Mizar Math Library

Mizar Math Library (MML) is a library of mathematical theories. The axioms behind the library is the Tarski-Grothendieck set theory, written in first-order logic. The library contains 57,000+ theorems and their proofs, along with many other lemmas, as well as unproven conjectures. Figure 4 shows a Mizar article of the theorem "If [math]\displaystyle{ p }[/math] is prime, then [math]\displaystyle{ \sqrt p }[/math] is irrational."

The training and testing data for this paper is a subset of MML, the Mizar 40, which is 32,524 theorems proved by automated theorem provers. Below is an example from the Mizar 40 library, it states that with d3_xboole_0 and t3_xboole_0 as premises, we can prove t5_xboole_0.

Monte Carlo Guidance

In the reinforcement learning, the action is defined as one inference (either reduction or extension). The proof state is defined as the whole tableaux. To implement Monte-Carlo tree search, each proof state [math]\displaystyle{ i }[/math] needs to maintain three parameters, its prior probability [math]\displaystyle{ p_i }[/math], its total reward [math]\displaystyle{ w_i }[/math], and number of its visits [math]\displaystyle{ n_i }[/math]. If no policy learning is used, the prior probabilities are all equal to one.

A simple heuristic is used to estimate the future reward of leaf states: suppose leaf state [math]\displaystyle{ i }[/math] has [math]\displaystyle{ G_i }[/math] open subgoals, the reward is computed as [math]\displaystyle{ 0.95 ^ {G_i} }[/math]. This will be replaced once value learning is implemented.

The standard UCT formula is chosen to select the next actions in the playouts \begin{align} {\frac{w_i}{n_i}} + 2 \cdot p_i \cdot {\sqrt{\frac{\log N}{n_i}}} \end{align} where [math]\displaystyle{ N }[/math] stands for the total number of visits of the parent node.

The bare prover is asked to play [math]\displaystyle{ b }[/math] playouts of length [math]\displaystyle{ d }[/math] from the empty tableaux, each playout backpropagates the values of proof states it visits. After these [math]\displaystyle{ b }[/math] playouts a special action (inference) is made, corresponding to an actual move, resulting in a new bigstep tableaux. The next [math]\displaystyle{ b }[/math] playouts will start from this tableaux, followed by another bigstep, etc.

Policy Learning and Guidance

From many runs of MCT, we will know the prior probability of actions in particular proof states, we can extract the frequency of each action [math]\displaystyle{ a }[/math], and normalize it by dividing with the average action frequency at that state, resulting in a relative proportion [math]\displaystyle{ r_a \in (0, \infty) }[/math]. We characterise the proof states for policy learning by extracting human-engineered features. Also, we characterise actions by extracting features from the clause chosen and literal chosen as well. Thus we will have a feature vector [math]\displaystyle{ (f_s, f_a) }[/math].

The feature vector [math]\displaystyle{ (f_s, f_a) }[/math] is regressed against the associated [math]\displaystyle{ r_a }[/math].

During the proof search, the prior probabilities [math]\displaystyle{ p_i }[/math] of available actions [math]\displaystyle{ a_i }[/math] in a state [math]\displaystyle{ s }[/math] is computed as the softmax of their predictions.

Training examples are only extracted from bigstep states, making the amount of training data manageable.

Value Learning and Guidance

Bigstep states are also used for proof state evaluation. For a proof state [math]\displaystyle{ s }[/math], if it corresponds to a successful proof, the value is assigned as [math]\displaystyle{ v_s = 1 }[/math]. If it corresponds to a failed proof, the value is assigned as [math]\displaystyle{ v_s = 0 }[/math]. For other scenarios, denote the distance between state [math]\displaystyle{ s }[/math] and a successful state as [math]\displaystyle{ d_s }[/math], then the value is assigned as [math]\displaystyle{ v_s = 0.99^{d_s} }[/math]

Proof state feature [math]\displaystyle{ f_s }[/math] is regressed against the value [math]\displaystyle{ v_s }[/math]. During the proof search, the reward of leaf states are computed from this prediction.

Features and Learners

For proof states, features are collected from the whole tableaux (subgoals, matrix, and paths). Each unique symbol is represented by an integer, and the tableaux can be represented as a sequence of integers. Term walk is implemented to combine a sequence of integers into a single integer by multiplying components by a fixed large prime and adding them up. Then the result integer is reduced to a smaller feature space by taking modulo by a large prime.

For actions the feature extraction process is similar, but the term walk is over the chosen literal and the chosen clause.

In addition to the term walks they also added several common features: number of goals, total symbol size of all goals, length of active paths, number of current variable instantiations, most common symbols.

The whole project is implemented in OCaml, and XGBoost is ported into OCaml as the learner.

Experimental Results

The authors splits Mizar40 dataset into 90% training examples and 10% testing examples. 200,000 inferences are allowed for each problem. 10 iterations of policy and value learning are performed (based on MCT). The training and testing results are shown as follows. In the table, mlCoP represents for the bare prover with iterative deepening (i.e. a complete automated theorem prover with connection calculi), and bare prover stands for the prover implemented in this paper, without MCT guidance.

As shown by these results, reinforcement learning leads to a significant performance increase for automated theorem proving, the 42.1% performance improvement is unusually high, since the published improvement in this field is typically between 3% and 10%. [1]

Conclusions

In this work the authors developed an automated theorem prover that uses no domain engineering and instead replies on MCT guided by reinforcement learning. The resulting system is more than 40% stronger than the baseline system. The authors believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers by reinforcement learning is a viable approach. [1]

Critique

Until now, automated reasoning is relatively new to the field of machine learning, and this paper shows a lot of promise in this research area.

The feature extraction part of this paper is less than optimal. It is my opinion that with proper neural network architecture, deep learning extracted features will be superior than human-engineered features, which is also shown in [4, 5].

Also, the policy-value learning iteration is quite inefficient. The learning loop is:

Loop
- Run MCT with previous model on entire dataset
- Collect MCT data
- Train a new model

If we adapt this to an online learning scheme by learning as soon as MCT generates new data, and update the model immediately, there might be some performance increase.

It would also be interesting to study automated theorem proving in other logic system, like high order logic.

References

[1] C. Kaliszyk, et al. Reinforcement Learning of Theorem Proving. NIPS 2018.

[2] J. Otten and W. Bibel. leanCoP: Lean Connection-Based Theorem Proving. Journal of Symbolic Computation, vol. 36, pp. 139-161, 2003.

[3] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly Efficient Machine Learning Connection Prover. Lecture Notes in Computer Science. vol. 9450. pp. 88-96, 2015.

[4] S. Loos, et al. Deep Network Guided Proof Search. LPAR-21, 2017.

[5] A. Alemi, et al. DeepMath-Deep Sequence Models for Premise Selection. NIPS 2016.

[6] Mizar Math Library. http://mizar.org/library/