# Neural ODEs

## Continuous Normalizing Flows

Section 4 of the paper applies continuous-depth neural networks to normalizing flows. Before turning to the implementation, the first part of the section develops the theory. The authors start from the change of variables formula presented in other works (Rezende and Mohamed, 2015; Dinh et al., 2014), which gives the change in a probability density when sample points are transformed through a bijective function $f$:

$z_1=f(z_0) \Rightarrow \log p(z_1)=\log p(z_0)-\log\left|\det\frac{\partial f}{\partial z_0}\right|$

Here $p(z)$ is the probability density of the samples and $\det\frac{\partial f}{\partial z_0}$ is the determinant of the Jacobian of $f$. Computing this determinant has a cubic cost in the dimension of $z$, or equivalently in the number of hidden units in the network. The authors found, however, that replacing the discrete set of hidden layers in a normalizing flow with a continuous transformation simplifies the computation significantly, primarily because of the following theorem:

Theorem 1: (Instantaneous Change of Variables). Let $z(t)$ be a finite continuous random variable with probability $p(z(t))$ dependent on time. Let $\frac{dz}{dt}=f(z(t),t)$ be a differential equation describing a continuous-in-time transformation of $z(t)$. Assuming that $f$ is uniformly Lipschitz continuous in $z$ and continuous in $t$, the change in log probability also follows a differential equation:

$\frac{\partial \log p(z(t))}{\partial t}=-\operatorname{tr}\left(\frac{df}{dz(t)}\right)$
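As a sanity check on the theorem, the sketch below integrates $z$ and $\log p$ together for a toy one-dimensional flow and compares the result with a closed-form density. The linear dynamics $dz/dt = A z$, the rate `A`, the standard normal starting density, and the fixed-step `rk4` helper are all assumptions chosen for illustration; they are not taken from the paper.

```python
import math

A = 0.7  # hypothetical rate for the toy dynamics dz/dt = A * z

def std_normal_logpdf(z):
    """Log density of a standard normal, used here as p(z(0))."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def augmented_dynamics(state, t):
    """Joint dynamics of (z, log p): dz/dt = f(z, t), dlogp/dt = -tr(df/dz)."""
    z, _logp = state
    dz = A * z      # f(z, t)
    dlogp = -A      # in one dimension the Jacobian trace is df/dz = A
    return [dz, dlogp]

def rk4(func, state, t0, t1, steps=200):
    """Fixed-step classical Runge-Kutta integration from t0 to t1."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(state, t)
        k2 = func([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
        k3 = func([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
        k4 = func([s + h * k for s, k in zip(state, k3)], t + h)
        state = [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        t += h
    return state

def exact_logpdf_at(z1, t1):
    """z(t1) = exp(A * t1) * z(0) is Gaussian with standard deviation exp(A * t1)."""
    s = math.exp(A * t1)
    return -0.5 * ((z1 / s) ** 2 + math.log(2.0 * math.pi * s * s))

z0 = 0.5
z1, logp1 = rk4(augmented_dynamics, [z0, std_normal_logpdf(z0)], 0.0, 1.0)
print(logp1, exact_logpdf_at(z1, 1.0))  # the two values should agree closely
```

For this linear case the discrete formula gives the same answer, since $\log|\det \partial f/\partial z_0| = A \cdot t_1$; the continuous version obtains it by accumulating the trace along the path.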

The biggest advantage of this theorem is that it replaces the determinant with a trace, which is much cheaper to compute. The trace is also a linear function, so if the dynamics $f$ are given by a sum of functions, then the differential equation for the log density is a sum as well. This means that flow models can now be computed at a cost that is only linear in the number of hidden units, $M$. In standard normalizing flow models the cost is $O(M^3)$, so such models are generally built from many layers with a single hidden unit in each layer.
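The linearity argument can be illustrated with planar-style dynamics, where each hidden unit contributes $u_n \tanh(w_n^\top z + b_n)$. In this sketch the trace of the summed dynamics is computed unit by unit in closed form and checked against a finite-difference estimate of the Jacobian diagonal. The `tanh` nonlinearity, the parameter values in `UNITS`, and the function names are illustrative assumptions.

```python
import math

def planar_unit(z, u, w, b):
    """One hidden unit of planar dynamics: f_n(z) = u * tanh(w . z + b)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return [ui * math.tanh(a) for ui in u]

def planar_unit_trace(z, u, w, b):
    """tr(df_n/dz) = (u . w) * tanh'(w . z + b), without forming the Jacobian."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return sum(ui * wi for ui, wi in zip(u, w)) * (1.0 - math.tanh(a) ** 2)

def dynamics(z, units):
    """Full dynamics f(z): the sum over all hidden units."""
    out = [0.0] * len(z)
    for u, w, b in units:
        for i, v in enumerate(planar_unit(z, u, w, b)):
            out[i] += v
    return out

def trace_by_linearity(z, units):
    """Trace of the summed dynamics: one cheap term per hidden unit."""
    return sum(planar_unit_trace(z, u, w, b) for u, w, b in units)

def trace_by_finite_differences(z, units, eps=1e-6):
    """Reference value: central differences on the Jacobian diagonal."""
    total = 0.0
    for i in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[i] += eps
        lo[i] -= eps
        total += (dynamics(hi, units)[i] - dynamics(lo, units)[i]) / (2.0 * eps)
    return total

# Hypothetical parameters (u, w, b) for M = 2 hidden units in 3 dimensions.
UNITS = [
    ([0.5, -0.3, 0.2], [1.0, 0.4, -0.7], 0.1),
    ([-0.2, 0.6, 0.1], [0.3, -0.5, 0.9], -0.4),
]
Z = [0.2, -0.1, 0.4]

fast = trace_by_linearity(Z, UNITS)
slow = trace_by_finite_differences(Z, UNITS)
print(fast, slow)
```

Adding a hidden unit adds one more term to `trace_by_linearity`, which is the linear scaling in $M$ described above.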

Finally, the authors use these results to construct continuous normalizing flows (CNFs) by specifying the parameters of the flow as a function of $t$, i.e., $f(z(t),t)$. They also use a gating mechanism for each hidden unit, $\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)$, where $\sigma_n(t)\in (0,1)$ is produced by a separate neural network that learns when to apply each dynamic $f_n$.
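A minimal sketch of the gated dynamics follows. The paper learns $\sigma_n(t)$ with a neural network; here a hand-written `sigmoid(alpha * t + beta)` stands in for that network, and the `alpha`, `beta`, and unit parameters are hypothetical values picked so that one unit is active early and the other late.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(t, alpha, beta):
    """Hypothetical gate sigma_n(t) in (0, 1); a stand-in for the gating network."""
    return sigmoid(alpha * t + beta)

def planar_unit(z, u, w, b):
    """Dynamics of one hidden unit: f_n(z) = u * tanh(w . z + b)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return [ui * math.tanh(a) for ui in u]

def gated_dynamics(z, t, units, gates):
    """dz/dt = sum_n sigma_n(t) * f_n(z)."""
    dz = [0.0] * len(z)
    for (u, w, b), (alpha, beta) in zip(units, gates):
        s = gate(t, alpha, beta)
        for i, v in enumerate(planar_unit(z, u, w, b)):
            dz[i] += s * v
    return dz

UNITS = [([0.5, -0.3], [1.0, 0.4], 0.1), ([-0.2, 0.6], [0.3, -0.5], -0.4)]
GATES = [(-10.0, 5.0), (10.0, -5.0)]  # unit 0 switches off near t = 0.5, unit 1 switches on
Z = [0.2, -0.1]

early = gated_dynamics(Z, 0.0, UNITS, GATES)  # dominated by unit 0
late = gated_dynamics(Z, 1.0, UNITS, GATES)   # dominated by unit 1
```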

### Section 4.1: Implementation

The authors construct two types of neural networks to compare against each other. The first is the standard planar normalizing flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained for 500,000 iterations using RMSprop, and the CNF is trained for 10,000 iterations using Adam. The loss function is $KL(q(x)||p(x))$, where $q(x)$ is the flow model and $p(x)$ is the target probability density.

One of the biggest advantages of CNFs is that the flow parameters can be trained by maximum likelihood estimation, maximizing the expected value of $\log q(x)$ under the target $p(x)$, where $q(x)$ is computed via the theorem above. The CNF can then be reversed to generate random samples from $q(x)$. This reversal costs about the same as the forward pass, which is not the case for an NF network. The following two figures demonstrate the ability of CNFs to generate more expressive and accurate output than standard NF networks.

Figure 4 shows clearly that the CNF achieves a significantly lower loss than the NF. In Figure 5, both networks were tasked with transforming a standard Gaussian distribution into a target distribution. Not only was the CNF more accurate on the two moons target, but the intermediate steps it took are also much more intuitive than those of the NF.
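The reversibility claim can be illustrated with a round trip: draw a base sample, integrate the dynamics forward, then integrate backward with the same solver and swapped time limits. The sketch assumes hypothetical hand-written dynamics in place of a trained CNF and a fixed-step `rk4` helper in place of an adaptive ODE solver.

```python
import math
import random

def dynamics(z, t):
    """Hypothetical smooth 2-D dynamics f(z, t); a stand-in for a trained CNF."""
    a = 1.2 * z[0] + 0.7 * z[1] + 0.1
    scale = math.cos(t)
    return [0.8 * scale * math.tanh(a), -0.5 * scale * math.tanh(a)]

def rk4(func, state, t0, t1, steps=200):
    """Fixed-step Runge-Kutta; passing t1 < t0 integrates backward in time."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(state, t)
        k2 = func([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
        k3 = func([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
        k4 = func([s + h * k for s, k in zip(state, k3)], t + h)
        state = [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        t += h
    return state

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # sample from the base distribution
x = rk4(dynamics, z0, 0.0, 1.0)                  # forward pass
z0_recovered = rk4(dynamics, x, 1.0, 0.0)        # reverse pass, same number of steps

round_trip_error = max(abs(a - b) for a, b in zip(z0, z0_recovered))
print(round_trip_error)
```

Both passes evaluate `dynamics` the same number of times, which is the sense in which the reverse direction costs about as much as the forward one in this toy setup.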

## Scope and Limitations

Section 6 discusses the scope and limitations of the paper. First, while “batching” the training data is a useful step in standard neural nets, it is less straightforward here. It can still be applied by combining the ODEs associated with each batch element into a single system, but the authors note that controlling the error of the combined system may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.
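One way to read the batching idea is to concatenate the states of all batch elements and solve a single, larger ODE. The sketch below does this for hypothetical scalar dynamics and a fixed-step `rk4` helper (both assumptions), and checks that the combined solve matches solving each element on its own. With an adaptive solver the combined system shares one step size, which is where the extra evaluations mentioned above could arise.

```python
import math

def f(z, t):
    """Hypothetical scalar dynamics shared by every batch element."""
    return math.sin(t) - z ** 3

def batched_dynamics(states, t):
    """Combined ODE: the K states are concatenated and f is applied to each."""
    return [f(z, t) for z in states]

def rk4(func, state, t0, t1, steps=200):
    """Fixed-step classical Runge-Kutta integration from t0 to t1."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = func(state, t)
        k2 = func([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
        k3 = func([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
        k4 = func([s + h * k for s, k in zip(state, k3)], t + h)
        state = [s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        t += h
    return state

batch = [-1.0, 0.0, 0.5, 2.0]                        # K = 4 initial states
together = rk4(batched_dynamics, batch, 0.0, 1.0)    # one solve of the combined system
separately = [rk4(batched_dynamics, [z0], 0.0, 1.0)[0] for z0 in batch]
```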

As long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard’s existence theorem (Coddington and Levinson, 1955) applies, guaranteeing that the solution to the initial value problem (IVP) exists and is unique.

The framework lets the user trade speed for precision by choosing the solver's error tolerance. The authors report that they were able to relax the tolerance to approximately $10^{-3}$ for classification and $10^{-5}$ for density estimation without degrading performance.

The authors note that reconstructing state trajectories by running the dynamics backwards can introduce extra numerical error if the reconstructed trajectory diverges from the original. They propose checkpointing as a possible solution: intermediate values of $z$ are stored at certain time steps on the forward pass, and the trajectory is then reconstructed piece by piece by re-integrating between checkpoints. The authors did not find this to be a practical problem; they report only an informal check that reversing many layers of CNFs recovered the initial states.
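The checkpointing idea can be sketched as follows: the forward pass keeps only a few `(t, z)` pairs, and the trajectory is rebuilt by re-integrating forward from each stored pair, compared here against rebuilding it backward from the final state. The dynamics `f`, the step size `H`, the segment layout, and the fixed-step solver are all illustrative assumptions, not the paper's setup.

```python
import math

def f(z, t):
    """Hypothetical dynamics that contract forward in time and expand backward."""
    return -5.0 * z + math.sin(t)

def rk4_step(z, t, h):
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(z + h * k3, t + h)
    return z + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate_path(z, t, h, steps):
    """Return the list [(t, z), ...] of visited states, including the start."""
    path = [(t, z)]
    for _ in range(steps):
        z = rk4_step(z, t, h)
        t = t + h
        path.append((t, z))
    return path

H, STEPS_PER_SEGMENT, SEGMENTS = 0.01, 50, 4

def forward_with_checkpoints(z0, t0=0.0):
    """Forward pass that stores only the segment boundaries."""
    checkpoints = [(t0, z0)]
    t, z = t0, z0
    for _ in range(SEGMENTS):
        t, z = integrate_path(z, t, H, STEPS_PER_SEGMENT)[-1]
        checkpoints.append((t, z))
    return checkpoints

def reconstruct_from_checkpoints(checkpoints):
    """Rebuild the trajectory by re-integrating forward from each checkpoint."""
    trajectory = []
    for t, z in checkpoints[:-1]:
        trajectory.extend(integrate_path(z, t, H, STEPS_PER_SEGMENT)[:-1])
    trajectory.append(checkpoints[-1])
    return trajectory

def max_error(trajectory, reference):
    return max(abs(z - zr) for (_, z), (_, zr) in zip(trajectory, reference))

total_steps = STEPS_PER_SEGMENT * SEGMENTS
reference = integrate_path(1.0, 0.0, H, total_steps)   # full forward trajectory
checkpoints = forward_with_checkpoints(1.0)
from_checkpoints = reconstruct_from_checkpoints(checkpoints)
t_end, z_end = checkpoints[-1]
backwards = integrate_path(z_end, t_end, -H, total_steps)[::-1]

checkpoint_error = max_error(from_checkpoints, reference)
backward_error = max_error(backwards, reference)
print(checkpoint_error, backward_error)
```

The trade-off is memory: more checkpoints mean shorter segments to re-integrate but more stored states.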