---
title: "Mize"
author: "James Melville"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Mize}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>")
library(mize)
```

`mize` is a package for doing unconstrained numerical optimization. This 
vignette describes the use of the `mize` function, which takes as input
a function to optimize, its gradient, and a starting point, and returns the 
optimized parameters.

First, I'll describe the required structure of the function and gradient
to optimize, then the various convergence options that apply no matter what
optimization methods are being used.

In the last section, I'll describe the methods which are available and any
options that apply to them specifically.

## The `fg` list

If you look at `stats::optim`, you'll see you pass the function to be optimized
and its gradient separately. By contrast, `mize` asks you to bundle them up
in a list, which I lazily refer to as an `fg` list. Here, for example, is the 
venerable Rosenbrock function:

```{r Definining a function and gradient to optimize}
rb_fg <- list(
   fn = function(x) { 100 * (x[2] - x[1] * x[1]) ^ 2 + (1 - x[1]) ^ 2  },
   gr = function(x) { c( -400 * x[1] * (x[2] - x[1] * x[1]) - 2 * (1 - x[1]),
                          200 *        (x[2] - x[1] * x[1])) })
```
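
Written out, the function being minimized and its gradient are:

$$f(x) = 100 \left(x_2 - x_1^2\right)^2 + \left(1 - x_1\right)^2$$

$$\nabla f(x) = \begin{pmatrix} -400 x_1 \left(x_2 - x_1^2\right) - 2\left(1 - x_1\right) \\ 200 \left(x_2 - x_1^2\right) \end{pmatrix}$$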

For minimizing a simple function, there's not a lot of advantage to this. But
for more complex optimizations, there are some advantages:

* If `fn` and `gr` require extra parameterization, you can store the extra
state in a function that creates and returns the `fg` list, and then `fn`
and `gr` can access that state via the created closure. `stats::optim` takes
an alternative approach where extra data is passed as `...` to the `optim` 
function and then internally this is passed to `fn` and `gr` when needed.
* For some optimization methods (e.g. those using a strong Wolfe line search),
both `fn` and `gr` need to be evaluated for the same value of `par`. If there's
a lot of shared work between `fn` and `gr`, this can be quite inefficient.
If you provide an optional routine called `fg`, which calculates the function
and gradient simultaneously, this will be used by some parts of `mize` in 
preference to calling `fn` and `gr` separately, which can save a bit of time.
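
As a minimal sketch of the closure approach (the quadratic function and its
weight `a` here are invented purely for illustration), a factory function can
capture the extra state:

```{r An fg list created by a closure}
# Hypothetical example: the weight `a` is captured in a closure,
# so fn and gr can use it without it being passed as an argument
make_quad_fg <- function(a) {
  list(
    fn = function(x) { a * sum(x * x) },
    gr = function(x) { 2 * a * x }
  )
}
quad_fg <- make_quad_fg(2)
```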

A version of `rb_fg` with an optional `fg` function could look like:

```{r A function list with an optional fg item}
rb_fg <- list(
   fn = function(x) { 100 * (x[2] - x[1] * x[1]) ^ 2 + (1 - x[1]) ^ 2  },
   gr = function(x) { c( -400 * x[1] * (x[2] - x[1] * x[1]) - 2 * (1 - x[1]),
                          200 *        (x[2] - x[1] * x[1])) },
   fg = function(x) {   
     a <- x[2] - x[1] * x[1]
     b <- 1 - x[1]
     list( 
       fn = 100 * a ^ 2 + b ^ 2,
       gr = c( -400 * x[1] * a - 2 * b,
                200 * a)
     )
   }
)
```

In the case of the Rosenbrock function, I'm not suggesting you should actually
do this. See the vignette on [Metric Multi-Dimensional Scaling](mmds.html) for 
the sort of scenario where it could make a difference.

For the rest of this vignette, we'll just assume you're only providing `fn`
and `gr`.

## The parameters to optimize

The parameters to optimize are passed as a numeric vector, `par`.

As well as using the `rb_fg` list we defined above to describe the Rosenbrock
function, we'll use the following starting point:

```{r Defining a starting point}
rb0 <- c(-1.2, 1)
```

## Basic options

If you don't particularly care about how your function gets optimized, the
default setting is to use the limited-memory Broyden-Fletcher-Goldfarb-Shanno
(L-BFGS) optimizer, which is a good choice for pretty much anything.

You may still need to tweak some options that are independent of the method,
like the maximum number of iterations to use, or other convergence criteria.

### Defaults

By default, `mize` will grind away on the function you give it for up to 100
iterations. But it will give up early if it senses that no further progress
is being made.

The return value provides both the optimized parameters and some 
information about how much work was done to get there.

```{r Defaults}
res <- mize(rb0, rb_fg)
# What were the final parameter values? (should be close to c(1, 1))
res$par

# What was the function value at that point? (should be close to 0)
res$f

# How many iterations did it take?
res$iter

# How many function evaluations?
res$nf

# How many gradient evaluations?
res$ng

# Why did the optimization terminate?
res$terminate
```

As you can see from looking at `res$iter`, the L-BFGS method didn't use all 100 
iterations it had available. The difference between consecutive function values
became sufficiently small that the absolute tolerance criterion was met: 
`res$terminate$what` will contain a string code that communicates what happened,
while `res$terminate$val` provides the specific value that caused the 
termination, although it's rarely interesting.

There is an entire vignette on the 
[different convergence criteria](convergence.html) available. The defaults 
should do a reasonable, albeit fairly conservative, job. 

### Logging progress to console

If you want to keep track of the progress of the optimization, set the `verbose`
option to `TRUE`:

```{r Verbose mode}
res <- mize(rb0, rb_fg, grad_tol = 1e-3, ginf_tol = 1e-3, max_iter = 10, 
            verbose = TRUE)
```

As you can see, information about the state of the optimization is logged to
the console every time the convergence criteria are checked. The values logged 
are:

* `f` - The current value of the function.
* `g2` - The current value of the gradient 2-norm (if you set a non-`NULL` 
`grad_tol`).
* `ginf` - The current value of the gradient infinity norm (if you set a 
non-`NULL` `ginf_tol`).
* `nf` - The total number of function evaluations so far.
* `ng` - The total number of gradient evaluations so far.
* `step` - The magnitude of the change between the previous version of `par` and
the current value.

Note that values are also provided for iteration 0, which represents the state
of `par` at its starting point. Because this requires a function evaluation that
might otherwise not be needed, you may find that `nf` and `ng` are one larger
than they would be with `verbose = FALSE`.

Logging every iteration might be more output than you want. You can also set the 
`log_every` parameter to make logging slightly less noisy:

```{r Log every 10 iterations}
res <- mize(rb0, rb_fg, grad_tol = 1e-3, verbose = TRUE, log_every = 10)
```

On the assumption that you probably want to see information about the last
iteration, this is also reported, even if it wouldn't normally be reported
according to the setting for `log_every`.

### Storing progress

You may wish to do more than stare at the progress numbers logged to the console
during optimization when `verbose = TRUE`. If you would like access to these
values after optimization, set the `store_progress` parameter to `TRUE`, and
a data frame called `progress` will be part of the return value:

```{r Returning stored progress}
res <- mize(rb0, rb_fg, store_progress = TRUE, log_every = 10)
res$progress
```

As can be seen, `store_progress` also obeys the `log_every` parameter, so
progress is only stored for the iterations that would have been logged.

## Optimization Methods

The available optimization methods are listed below. I won't be describing the
algorithms behind them in any depth. For that, you should look in
the book by [Nocedal and Wright](https://doi.org/10.1007/978-0-387-40065-5).

Many of these methods can be thought of as coming up with a direction to move
in, while treating the distance to travel in that direction as someone else's
problem: a line search method must also be used to determine the
step size. We'll discuss the available step size methods separately.

### Steepest Descent (`"SD"`)

The good news is that you always get some improvement with steepest descent.
The bad news is everything else: it often takes a long time to make any 
progress, unless you are 
[very clever with the step size](https://doi.org/10.1093/imanum/8.1.141). 
It is, however, the basis for most other gradient descent techniques, 
and many of them will revert to this direction, at least temporarily, if 
things start going wrong.

```{r Steepest descent}
res <- mize(rb0, rb_fg, max_iter = 10, method = "SD")
```

### Broyden-Fletcher-Goldfarb-Shanno (`"BFGS"`)

The BFGS method is a quasi-Newton method: it constructs an approximation to the
inverse of the Hessian. This saves on the annoyance of working out what the
second derivatives are for your problem, as well as on the time it takes to
calculate them on every iteration, but not on the storage cost, which is $O(N^2)$.
It's therefore not appropriate for large problems.

```{r BFGS}
res <- mize(rb0, rb_fg, max_iter = 10, method = "BFGS")
```

By default, after the first iteration, the inverse Hessian approximation is 
scaled to make it more similar to the real inverse Hessian 
(in terms of eigenvalues). If you don't want to do this, set the `scale_hess`
parameter to `FALSE`:

```{r BFGS without scaled Hessian}
res <- mize(rb0, rb_fg, max_iter = 10, method = "BFGS", scale_hess = FALSE)
```

### Limited-Memory BFGS (`"L-BFGS"`)

The L-BFGS method builds the inverse Hessian approximation implicitly, yielding
something close to the BFGS direction without ever directly creating a matrix.
This makes it suitable for larger problems. The closeness to the BFGS result
is determined by the `memory` parameter, which sets the number of
previous updates that are used to create the direction for the next iteration. 
The higher this value, the closer the result will be to that of BFGS, but the 
more storage is required. In practice, the recommended `memory` size is between
3 and 7. The default is 5.

```{r LBFGS}
res <- mize(rb0, rb_fg, max_iter = 10, method = "L-BFGS", memory = 7)
```

The L-BFGS method applies a Hessian scaling similar to that used by the BFGS
method, but does so at every iteration. If you don't want it for some reason,
you can also set `scale_hess` to `FALSE`:

```{r LBFGS without scaled Hessian}
res <- mize(rb0, rb_fg, max_iter = 10, method = "L-BFGS", scale_hess = FALSE)
```

### Conjugate Gradient (`"CG"`)

When someone is pulling a decent default optimization off the shelf, if they're
not reaching for L-BFGS, they're probably going for a conjugate gradient method.
In their favor, they require less storage than L-BFGS. However, they tend
to be a lot more sensitive to poor scaling of the function (i.e. when changing
different parameters causes very different changes to the objective function)
and so need tighter line search criteria to make progress.

There are in fact multiple different methods out there under the name 'conjugate
gradient', which differ in how the direction is determined from the current and 
previous gradient and the previous direction. The default method in `mize` is
the "PR+" method (sometimes referred to as "PRP+"), which refers to its 
inventors, Polak and Ribiere (the third "P" is Polyak who independently 
discovered it). The "+" indicates that a restart to steepest descent is included
under certain conditions, which prevents the possibility of convergence failing.
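
For reference, with $g_k$ denoting the gradient at iteration $k$, the PR+
update computes the coefficient that mixes the previous direction into the new
one as:

$$\beta_k^{PR+} = \max\left(\frac{g_{k+1}^T \left(g_{k+1} - g_k\right)}{g_k^T g_k},\, 0\right)$$

with the new direction $d_{k+1} = -g_{k+1} + \beta_k^{PR+} d_k$; clamping
$\beta$ at zero is what provides the restart to steepest descent.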

```{r CG with PR+}
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG")
```

Other CG update methods are available via the `cg_update` parameter. Methods 
which seem competitive with or even superior to the `"PR+"` update are the `"HS+"`
update (named after Hestenes and Stiefel) and the `"HZ+"` update (named
after Hager and Zhang, and part of the 
[CG_DESCENT](https://users.clas.ufl.edu/hager/papers/Software/) optimizer):

```{r CG with HZ+}
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", cg_update = "HZ+")
```

### Nesterov Accelerated Gradient (`"NAG"`)

The Nesterov Accelerated Gradient method consists of doing a steepest
descent step immediately followed by a step that resembles momentum (see below),
but with the previous gradient term replaced with the current gradient.

```{r NAG}
res <- mize(rb0, rb_fg, max_iter = 10, method = "NAG")
```

Be aware that the convergence of NAG is not guaranteed to be monotone: that is,
you may (and probably will) see the function value increase between iterations.
Getting the optimum convergence rates guaranteed by theory requires you to be
able to estimate how strong the convexity of the function you are minimizing is.

To see this in action, let's increase the iteration count to 100 and plot the 
progress of the function:

```{r NAG with 100 steps}
res <- mize(rb0, rb_fg, max_iter = 100, method = "NAG", store_progress = TRUE)
plot(res$progress$nf, log(res$progress$f), type = "l")
res$f
```

Yikes. If you let this optimization continue, it does eventually settle down,
but it's a real test of patience. Clearly, in this case, a better estimate of
the strong convexity parameter is needed.

You can control the strong convexity estimate with the `nest_q` parameter, which
is inversely related to the amount of momentum that is applied. It should take 
a value between `0` (you think your function is strongly convex) and `1` 
(the method becomes steepest descent). The default is zero, and this is in fact
the assumed value in most applications and papers.

Let's try turning down the momentum a little bit and repeat:

```{r NAG with 100 steps and less aggressive momentum}
resq <- mize(rb0, rb_fg, max_iter = 100, method = "NAG", nest_q = 0.001, 
            store_progress = TRUE)
plot(res$progress$nf, log(res$progress$f), type = "l",
     ylim = range(log(res$progress$f), log(resq$progress$f)))
lines(resq$progress$nf, log(resq$progress$f), col = "red")
resq$f
```

The red line is the result with `nest_q = 0.001`. That's better. Unfortunately, 
I have no advice on how you should go about estimating the right value for 
`nest_q`. However, see the section on Adaptive Restart for ways to ameliorate 
the problem.

To get the theoretical guarantees about convergence, you actually need to follow
a fairly strict formula about what the step size should be (it's related to the
reciprocal of the Lipschitz constant of the function), and 
[Sutskever](https://hdl.handle.net/1807/36012) notes that for non-convex 
functions (with sparse stochastic gradient) the theoretically optimal momentum 
and step size values are too aggressive. Alas, the best advice given (at least
for deep learning applications) was "manual tuning".

There are connections between NAG and classical momentum, and there is some
evidence (at least in deep learning) that replacing classical momentum
with the NAG method, but using the same momentum schedule (see below) can 
improve results. I have written more (much more) on the subject 
[here](https://jlmelville.github.io/mize/nesterov.html).

### Classical Momentum (`"MOM"`)

Momentum methods use steepest descent and then add a fraction of the previous
iteration's direction. The effect of this approach is often likened to inertia:
it damps the oscillations in direction that slow the progress of steepest
descent on poorly-scaled functions.
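
In a common formulation, with momentum $\mu_k$ and step size $\epsilon_k$, the
update is:

$$v_{k+1} = \mu_k v_k - \epsilon_k \nabla f\left(x_k\right), \qquad x_{k+1} = x_k + v_{k+1}$$

where $v$ is the update vector: a steepest descent step plus a fraction
$\mu_k$ of the previous update.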

In neural network training, where momentum methods were originally popular, the
momentum value is often a constant set to a fairly high value, like `0.9`. You 
can set a constant momentum by setting the `mom_schedule` parameter:

```{r Momentum}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = 0.9)
```

Momentum-based methods share the same issues as Nesterov Accelerated Gradient:
convergence need not be monotone. The plot below demonstrates:

```{r Momentum plot}
res <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9,
            store_progress = TRUE)
plot(res$progress$nf, log(res$progress$f), type = "l")
res$f
```

To be fair, there is little reason to believe that the sort of settings used
for neural network training would be effective for minimizing the Rosenbrock
function.

Non-constant momentum schedules are also available, by passing a string to 
`mom_schedule`. These include:

* A `switch` function, that steps the momentum from one value (`mom_init`), 
to another (`mom_final`) at a specified iteration (`mom_switch_iter`):
```{r momentum with a switch function}
# Switch from a momentum of 0.4 to 0.8 at iteration 5
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = "switch",
            mom_init = 0.4, mom_final = 0.8, mom_switch_iter = 5)
```
* A `ramp` function, that linearly increases the momentum from `mom_init`
to `mom_final` over `max_iter` iterations:
```{r momentum with a ramp function}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = "ramp",
            mom_init = 0.4, mom_final = 0.8)
```
* The schedule used in NAG can be used here by asking for a `"nsconvex"` 
schedule.
```{r momentum with nesterov schedule}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = "nsconvex")
```
* The `nest_q` parameter can also be used here as it can be with NAG:
```{r momentum with nesterov schedule and non-zero q}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = "nsconvex",
            nest_q = 0.001)
```

If none of this meets your needs, you can pass a function that takes the current
iteration number and the maximum number of iterations as input and returns a 
scalar, which will be used as the momentum value, although the value will be 
clamped between 0 and 1.

For example, you could create a momentum schedule that randomly picks a value
between 0 and 1:
```{r momentum with random momentum}
mom_fn <- function(iter, max_iter) {
  runif(n = 1, min = 0, max = 1)
}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = mom_fn)
```
I'm not saying it's a *good* idea.

### Simplified Nesterov Momentum

Interest in NAG in the neural network community (specifically deep learning)
was sparked by papers from 
[Sutskever and co-workers](https://proceedings.mlr.press/v28/sutskever13.html)
and [Bengio and co-workers](https://arxiv.org/abs/1212.0901) who showed that 
it could be related to classical momentum in a way that made it easy to 
implement if you already had a working classical momentum implementation, and
more importantly, that it seemed to give improvements over classical momentum.

As `mize` already offers NAG by setting `method = "NAG"`, being able to use it
in a different form related to classical momentum may seem superfluous. But in 
fact, the use of what's often called "NAG" in deep learning is slightly more
general (and as far as I'm aware, has no convergence theory to go with it, so
improvements are not guaranteed). It's probably better to refer to it as 
(simplified) Nesterov momentum.

At any rate, the key observation of Sutskever was that NAG could be considered
a momentum method where the momentum stage occurs before the gradient descent
stage. In this formulation, you can have whatever momentum schedule you like,
and the hope is that the gradient descent stage can more easily correct if the 
momentum schedule is too aggressive because it gets to "see" the result of the 
momentum update and hence the descent occurs from a different location.
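
In this formulation, the only change from classical momentum is where the
gradient is evaluated: at the parameters after the momentum update, rather than
before it:

$$v_{k+1} = \mu_k v_k - \epsilon_k \nabla f\left(x_k + \mu_k v_k\right), \qquad x_{k+1} = x_k + v_{k+1}$$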

To get Nesterov momentum, you need only add a `mom_type = "nesterov"` parameter
to a momentum optimizer:

```{r Simplified Nesterov momentum}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", mom_schedule = 0.9, 
            mom_type = "nesterov")
```

I know you're itching to see whether Nesterov momentum can calm down the bad 
case of non-monotone convergence we got with a large classical momentum value
on the Rosenbrock function:

```{r Nesterov versus classical momentum}
resc <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE)
resn <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             mom_type = "nesterov", 
             store_progress = TRUE)
# Best f found for Nesterov momentum
resn$f
# Best f found for classical momentum
resc$f
plot(resc$progress$nf, log(resc$progress$f), type = "l",
     ylim = range(log(resc$progress$f), log(resn$progress$f)))
lines(resn$progress$nf, log(resn$progress$f), col = "red")
```

The Nesterov momentum result is in red. And yes, it's a lot smoother. On the 
other hand, you can see that the cost function increases and then levels off.
Clearly, some tweaking of the momentum schedule is needed.

Sutskever also came up with a simplified approximation to the momentum schedule
used in NAG. It doesn't have a big effect on the run time, only applies for
the default `nest_q` of 0, and at early iterations the momentum is much larger.
To use it, set `nest_convex_approx = TRUE` when using 
`mom_schedule = "nsconvex"`:

```{r Nesterov momentum with convex approximation}
res <- mize(rb0, rb_fg, max_iter = 10, method = "MOM", 
            mom_schedule = "nsconvex", nest_convex_approx = TRUE, 
            mom_type = "nesterov")
```

## Line Searches

All the above methods need a line search to find out how far to move in the
gradient descent (i.e. non-momentum) part.

### Wolfe Line Search

The default for all methods mentioned so far is a line search that satisfies
the strong Wolfe conditions. Specifically, the
[More-Thuente line search](https://doi.org/10.1145/192115.192132), 
converted from 
[Dianne O'Leary's Matlab code](https://www.cs.umd.edu/users/oleary/software/),
is used. This employs cubic interpolation to find a point close to a minimizer
along the search direction at each iteration, which requires a function and
gradient evaluation at each candidate point.

If the More-Thuente line search fails to work for some reason, alternative
Wolfe line search methods from Carl Edward Rasmussen's 
[conjugate gradient routine](https://gaussianprocess.org/gpml/code/matlab/doc/)
and Mark Schmidt's 
[minimization routines](https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html),
as well as an implementation of the [Hager-Zhang](https://doi.org/10.1137/030601880)
line search, are also available (although they are all quite similar):

```{r other Wolfe line search}
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "Rasmussen")
# Use Mark Schmidt's minFunc line search 
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "Schmidt")
# Hager-Zhang line search  
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "Hager-Zhang")
# Hager-Zhang can be abbreviated to "HZ"
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "HZ")
# You can explicitly set More-Thuente too
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "More-Thuente")
# More-Thuente can be abbreviated to "MT"
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "MT")
```

A variety of interpolation methods are available in the Matlab minFunc code, but 
`mize` only uses the default: cubic interpolation and extrapolation,
falling back to a back-tracking Armijo search (also using cubic interpolation) if
a non-finite value is encountered.

Default values for the line search tolerance use the advice given in the Nocedal
and Wright book: the `c1` parameter is set to `1e-4`, while `c2` is set to `0.9`
for quasi-Newton methods (BFGS and L-BFGS) and `0.1` for everything else. For a
line search involving the strong Wolfe conditions, the `c2` parameter measures
how strict the line search is: the smaller it is (it cannot be set smaller than
`c1` or greater than 1), the tighter the line search. This is an important
property for some conjugate gradient methods to maintain their convergence
properties.
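
For reference, a step size $\alpha$ along a descent direction $p$ satisfies
the strong Wolfe conditions when both the sufficient decrease (Armijo)
condition and the strong curvature condition hold:

$$f\left(x + \alpha p\right) \le f\left(x\right) + c_1 \alpha \nabla f\left(x\right)^T p$$

$$\left|\nabla f\left(x + \alpha p\right)^T p\right| \le c_2 \left|\nabla f\left(x\right)^T p\right|$$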

You may be able to get away with a looser line search (larger `c2`) for some 
methods, e.g.:

```{r Line search parameters}
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", cg_update = "HZ+", 
            c2 = 0.5, c1 = 0.1)
```

#### Initial Step Size Guess

Another important choice is what step size to start each iteration from. Nocedal
and Wright suggest two methods based on the result achieved for the previous 
iteration, one involving the ratio of the slopes at consecutive iterations, and 
one involving a quadratic interpolation. By default, for Wolfe line searches
other than `"Hager-Zhang"`, the quadratic interpolation method is tried.

To try the slope ratio method, set `step_next_init = "slope"`.

```{r Line search with slope ratio}
res <- mize(rb0, rb_fg, max_iter = 10, step_next_init = "slope")
```
In the [CG_DESCENT](https://doi.org/10.1145/1132973.1132979) software, Hager
and Zhang suggest using a "QuadStep" method. In this case, a small trial step is
attempted followed by a quadratic interpolation. If the resulting quadratic is
strongly convex (for a 1D quadratic $Ax^2 + Bx + C$, this just means $A > 0$),
then the minimizer of the quadratic is used as the initial guess. Otherwise,
a multiple of the previous step length is used (in `mize`, the old step length
is doubled). This is the default if `line_search = "hz"` is used. To explicitly
set it for other methods, use `step_next_init = "hz"` (or `"hager-zhang"`):

```{r Line search with Hager-Zhang QuadStep}
res <- mize(rb0, rb_fg, max_iter = 10, step_next_init = "hz", 
            line_search = "mt")
```
For the budget-conscious, be aware that using this method always incurs an 
extra function evaluation per iteration, even if the quadratic minimizer isn't
used as the guess.

#### Initial Step Size Guess on First Iteration

This only leaves what to do on the very first iteration, where there is no
previous iteration to look at. You don't really have much information at that
point apart from the gradient, so all the implementations I've seen use a step 
size that is effectively inversely proportional to a norm of the gradient:

* Carl Edward Rasmussen's 
[conjugate gradient routine](https://gaussianprocess.org/gpml/code/matlab/doc/)
uses $\frac{1}{1+\left\| g \right\|_2^2}$
* [SciPy](https://scipy.org/)'s `optimize.py` uses
$\frac{1}{\left\| g \right\|_2}$
* Mark Schmidt's 
[minimization routines](https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html)
uses $\frac{1}{1+\left\| g \right\|_1}$
* Hager and Zhang use a slightly more complicated procedure for their 
[CG_DESCENT](https://doi.org/10.1145/1132973.1132979) software: 
use $\psi_0\frac{\left\|x\right\|_\infty}{\left\|g\right\|_\infty}$, where $x$
is the initial parameter vector and $\psi_0$ is a small positive value (e.g.
0.01), or if that doesn't produce a finite positive value, use 
$\psi_0\frac{\left|f\right|}{\left\|g\right\|_{2}^{2}}$ where $\left|f\right|$ 
is the absolute value of the function evaluated at $x$. And if that doesn't 
work, just use 1.

By default, the Hager-Zhang line search uses the CG_DESCENT method. For other 
Wolfe line searches I have arbitrarily plumped for the Rasmussen method. If you
think it will make a huge difference you can explicitly choose by passing one of
`"rasmussen"`, `"scipy"`, `"schmidt"` or `"hz"` to the `step0` argument, e.g.:

```{r Line search with scipy initialization}
res <- mize(rb0, rb_fg, max_iter = 10, step0 = "scipy")
```

You may also pass a number to `step0` and it will be used as-is:

```{r Line search with initial step length of 1}
# An initial guess of 1 for the step length isn't bad for L-BFGS
res <- mize(rb0, rb_fg, max_iter = 10, step0 = 1, method = "L-BFGS")
```

The L-BFGS and BFGS methods will also attempt the "natural" Newton step length
of 1, in the way suggested in Nocedal and Wright. If you don't want to do this
for some reason, it can be explicitly turned off or on via the 
`try_newton_step` parameter:

```{r BFGS with no Newton step}
res <- mize(rb0, rb_fg, max_iter = 10, method = "BFGS", try_newton_step = FALSE)
```

For every other method this is off by default, and there's no particular
reason to turn it on.

All Wolfe line searches except `Hager-Zhang` use the strong curvature condition
to determine when to terminate the line search. The Hager-Zhang line search uses
the standard curvature condition by default. It also uses an approximation to 
the Armijo sufficient decrease condition, which may prevent premature termination
of a line search when the minimizer lies close to the initial step size under
some circumstances. These variations on the Wolfe conditions can be
applied to any of the Wolfe line searches by supplying the `strong_curvature`
and `approx_armijo` options:

```{r alternative Wolfe conditions}
# Rasmussen line search with standard Wolfe conditions
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "Rasmussen",
            strong_curvature = FALSE)
# Hager-Zhang with strong Wolfe conditions
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "HZ",
            strong_curvature = TRUE, approx_armijo = FALSE)
# More-Thuente with approx Armijo conditions
res <- mize(rb0, rb_fg, max_iter = 10, method = "CG", line_search = "MT",
            approx_armijo = TRUE)
```

It's not obvious to me that any proof of convergence that exists for a given
line search method will hold up when the curvature and Armijo conditions are
altered, so be careful when specifying these options.

### Other Line Searches

The Wolfe line search is your best bet, but there are some alternatives
(of somewhat dubious value). 

You can set a constant value for the step size. But note that the total step
size is actually determined by the product of the step size and the length
of the gradient vector, so you may want to set `norm_direction` too:

```{r constant step size}
res <- mize(rb0, rb_fg, max_iter = 10, method = "SD", line_search = "constant",
            norm_direction = TRUE, step0 = 0.01)
```

There is also a backtracking line search, which repeatedly reduces the step size
starting from `step0`, until the sufficient decrease condition (also known as
the Armijo condition) of the Wolfe conditions is fulfilled. This is
controlled by `c1`: the larger the value, the larger the decrease in the
function value needed for a step size to be accepted. By default, the cubic
interpolation method from 
[minFunc](https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html) is used to
find the step size.

```{r backtracking with cubic interpolation}
res <- mize(rb0, rb_fg, max_iter = 10, line_search = "backtracking", step0 = 1, 
            c1 = 0.1)
```

You will often see a simple fixed step size reduction used in combination 
with this line search, mainly due to ease of implementation. Cubic interpolation
is probably always preferable, but if you want to use a constant step size 
reduction, provide a non-`NULL` value to `step_down`. The step size will be 
multiplied by this value so it should be a value between zero and one. Set it
to `0.5`, for example, to halve the step size while backtracking:

```{r backtracking with halved step size}
res <- mize(rb0, rb_fg, max_iter = 10, line_search = "backtracking",
            step0 = 1, c1 = 0.1, step_down = 0.5)
```

A similar approach is the `"bold driver"`, which also backtracks, but starts
from the step size found for the last iteration, increased by a factor 
`step_up`.

```{r bold driver}
# increase step size by 10%, but reduce by 50%
res <- mize(rb0, rb_fg, max_iter = 10, line_search = "bold",
            step0 = 1, step_down = 0.5, step_up = 1.1)
```

Unlike backtracking line search, the bold driver accepts any function decrease,
and does not use `c1`.

### Maximum function or gradient evaluations per line search

It's a good idea to specify a maximum number of function evaluations that can
be spent in any one line search. This guards against the possibility of poorly
chosen parameters, badly behaved functions or other numerical issues causing
the line search to fail to converge. By default, no more than 20 function
evaluations are allowed per line search. The `ls_max_fn`, `ls_max_gr` and 
`ls_max_fg` arguments can be used to control this:

```{r max line search functions}
# No more than 10 gradient evaluations allowed per line search
res <- mize(rb0, rb_fg, max_iter = 10, ls_max_gr = 10)
```

## Adaptive Learning Rate Methods

So far, the `method` and `line_search` arguments can be freely mixed and 
matched. However, some methods choose both the direction and the step size
directly. The 
[Delta-Bar-Delta method](https://doi.org/10.1016/0893-6080(88)90003-2) has its 
roots in neural network research but was also used in the currently popular 
[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/tsne/) 
data visualization method.

The way DBD works is to consider, for each component of the gradient, a weighted
average of the gradients of previous iterations. If the sign of that component
has changed, this suggests the minimum has been skipped, so the step size (where 
each component is initialized with value `step0`) is reduced by a factor 
`step_down` in that direction. Otherwise, the step size is increased by a factor 
`step_up`. The weighting of the average of previous gradients is controlled by 
`dbd_weight`, which varies between 0 and 1.
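As a minimal sketch of the idea (illustrative only; the `dbd_update` helper is
hypothetical and not part of `mize`):

```{r dbd update sketch}
# Illustrative sketch of the delta-bar-delta update: each component keeps
# its own step size, which grows when the sign of the current gradient
# agrees with a weighted average of past gradients and shrinks otherwise.
dbd_update <- function(step, grad, grad_avg, weight = 0.5,
                       step_up = 1.1, step_down = 0.5) {
  agree <- sign(grad) == sign(grad_avg)
  step <- ifelse(agree, step * step_up, step * step_down)
  # update the weighted average of past gradients
  grad_avg <- (1 - weight) * grad + weight * grad_avg
  list(step = step, grad_avg = grad_avg)
}
```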

The same line search parameters that control the bold driver method can be used
in DBD:

```{r}
res <- mize(rb0, rb_fg, max_iter = 10, method = "DBD",
            step0 = "rasmussen", step_down = 0.5, step_up = 1.1,
            dbd_weight = 0.5)
```

One extra parameter is available: `step_up_fun`. In the original DBD paper, 
when the step size for a component is to be increased, the value of `step_up`
is added to the current step size.
[Janet and co-workers](https://doi.org/10.1109/IJCNN.1998.687205) noted that
when the step sizes are small compared to the value of `step_up`, this can lead
to far too large a step size increase. Instead, they suggest multiplying by
`step_up`, so that a percentage increase in the step size is achieved.
Multiplication is the default behavior in `mize`, but the t-SNE optimization
method uses the additive update. To get that behavior, set `step_up_fun` to
`"+"`:

```{r t-SNE style DBD parameters}
res <- mize(rb0, rb_fg, max_iter = 10, method = "DBD",
            step0 = "rasmussen", step_down = 0.8, step_up = 0.2,
            step_up_fun = "+")
```

The DBD method can be used with momentum. If it is, the `dbd_weight`
parameter is ignored and, instead of storing its own history of weighted 
gradients, the update vector stored by the momentum routines is used.
This is not mentioned in the original paper, but is used in the t-SNE
implementation. 

The DBD method is interesting in that it uses no function evaluations at all
in its optimization. It doesn't even use the values of the gradient components,
only their signs. As a result, however, it's very easy for DBD to diverge
horribly and quickly, so you will want to ensure that you are checking
convergence. This is one of the methods where using `grad_tol` or `ginf_tol`
is a better idea than `abs_tol` and `rel_tol`, because it provides some check
on divergence without calculating the function every iteration purely for
convergence checking purposes.

```{r}
# DBD with rel_tol and abs_tol explicitly set
res <- mize(rb0, rb_fg, max_iter = 10, method = "DBD",
             step0 = "rasmussen", step_down = 0.8, step_up = 0.2,
             step_up_fun = "+", rel_tol = 1e-8, abs_tol = 1e-8)
# 10 gradient calculations as expected
res$ng
# But 10 function calculations too, only used in the tolerance check
res$nf

# Turn off the rel_tol and abs_tol and let max_iter handle termination
res <- mize(rb0, rb_fg, max_iter = 10, method = "DBD",
            step0 = "rasmussen", step_down = 0.8, step_up = 0.2,
            step_up_fun = "+", rel_tol = NULL, abs_tol = NULL,
            grad_tol = 1e-5)
# 11 gradient calculations
res$ng
# Only one function evaluation needed (to calculate res$f)
res$nf
```

Using `grad_tol` results in one extra gradient calculation (because it uses
gradients for the convergence check), but only one function evaluation. The
more iterations the optimization runs for, the better this trade-off becomes,
although in the event of a diverging optimization, the "best" `par` that is
returned may not be as good as the one that would have been found if function
evaluations were also being used.

## Adaptive Restart

[O'Donoghue and Candes](https://arxiv.org/abs/1204.3982) described a restart
scheme for NAG as a cure for its non-monotonic convergence. They suggest using
either a function-based check (did the function value after the gradient descent
stage decrease between iterations?) or a gradient-based check (is the momentum
direction a descent direction?). If the check fails, then the momentum is 
restarted: the previous update vector is discarded and the momentum set back to
zero.
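The two checks amount to something like the following (illustrative helpers,
not part of `mize`'s API):

```{r restart check sketches}
# Function-based check: restart if the function value failed to decrease.
fn_restart <- function(f_new, f_old) f_new > f_old

# Gradient-based check: restart if the previous update vector v is not a
# descent direction, i.e. it has a positive projection onto the gradient.
gr_restart <- function(grad, v) sum(grad * v) > 0
```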

`mize` allows these checks to be used to restart any momentum scheme, not just 
NAG. The only change is that the function-based check is carried out at the
beginning and end of each iteration, rather than at the end of the gradient
descent stage. Restarting is applied via the `restart` argument, supplying
either `"fn"` or `"gr"` for function-based and gradient-based restarting,
respectively:

```{r momentum with restart}
resc <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE)
resf <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE, restart = "fn")
resg <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE, restart = "gr")
plot(resc$progress$nf, log(resc$progress$f), type = "l", 
     ylim = range(log(resc$progress$f), log(resf$progress$f),
                  log(resg$progress$f)))
lines(resf$progress$nf, log(resf$progress$f), col = "red")
lines(resg$progress$nf, log(resg$progress$f), col = "blue")
```

Adaptive restart (red and blue lines) certainly seems to help here, with
gradient-based and function-based restarts appearing equally effective.

The only downside to adaptive restart is the potential overhead of extra
function and gradient calls, so choose the restart type accordingly. In
practice, the gradient-based restart is economical because in most cases the
gradient calculated at the end of an iteration can be re-used for the gradient
descent at the start of the next. The exception is Nesterov momentum, where
the gradients aren't calculated at the right location to be of use. In that
case, you are better off with function-based restarting, although if you
aren't already checking convergence with a function calculation or using a
line search method that evaluates the function at the starting point, you will
obviously incur the overhead of the function call. This is one reason to
favour using the NAG method directly rather than emulating it with Nesterov
momentum.

There is also a `restart_wait` parameter. This determines how many iterations 
to wait between attempted restarts. By default, using the value from the paper
of Su and co-workers, we wait 10 iterations before allowing another restart.
Making the wait time too short may lead to premature convergence:

```{r momentum with restart and wait time}
resfw <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE, restart = "fn", restart_wait = 1)
resgw <- mize(rb0, rb_fg, max_iter = 100, method = "MOM", mom_schedule = 0.9, 
             store_progress = TRUE, restart = "gr", restart_wait = 1)
plot(resc$progress$nf, log(resc$progress$f), type = "l", 
     ylim = range(log(resc$progress$f), log(resf$progress$f),
                  log(resg$progress$f), log(resfw$progress$f),
                  log(resgw$progress$f)))
lines(resf$progress$nf, log(resf$progress$f), col = "red")
lines(resfw$progress$nf, log(resfw$progress$f), col = "blue")
lines(resgw$progress$nf, log(resgw$progress$f), col = "orange")
```

Here, reducing the wait time to 1 had a great effect on gradient restarting
(the orange line): it substantially outperforms the other methods.
On the other hand, function restarting more often (the blue line) led to worse 
performance.

There is also a `"speed"` restart type, related to a restart strategy introduced
by [Su and co-workers](https://papers.nips.cc/paper/5322-a-differential-equation-for-modeling-nesterovs-accelerated-gradient-method-theory-and-insights),
which restarts if the update vector doesn't increase in length (as measured by
the Euclidean 2-norm) between iterations. In their experiments it performed
less well than gradient-based restarting, but it might provide more stability
under some circumstances.
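A sketch of the check (illustrative only; `speed_restart` is a hypothetical
helper, not part of `mize`):

```{r speed restart sketch}
# Restart if the new update vector is no longer than the previous one,
# as measured by the Euclidean 2-norm.
speed_restart <- function(v_new, v_old) {
  sqrt(sum(v_new ^ 2)) <= sqrt(sum(v_old ^ 2))
}
```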
