Gradients Without Backpropagation (arxiv.org)
87 points by hcks on March 2, 2022 | 35 comments


Whenever I see an ML paper with surprising results I'm usually sceptical, as the results often come down to one of the following:

1. The assumptions don't generalize

2. The observations don't imply the conclusions from the paper

3. Hyper-parameter shenanigans: It's usually possible to choose hyper-parameters where method A is better than B, even if generally B is better.

In this case, the lower-left part of Figure 4 makes me suspicious: forward mode appears to beat reverse mode even on a per-iteration basis, which suggests they chose hyper-parameters that put reverse mode at a disadvantage.


Yes, I've worked with this exact algorithm before, and others like it, and it should be much worse. Better than finite differences or REINFORCE, but way worse than backprop's exact gradients, to the point of being unusable. Moreover, the variance of the estimate grows with the number of parameters, so it gets worse still on bigger problems.


Quite smart. No perpetual motion machine here. This forward differentiation method is not as powerful as the backward one, it does not give you the gradient of a function. Instead, you choose a direction, and it gives you the derivative along that direction. You could essentially get that by “bump and recalc”, at the cost of an additional function evaluation. I even doubt this comes at a cheaper compute cost than the brute force bump and recalculate.

What about stochastic gradient descent if you don't know the gradient? Well, you know the sign of the directional derivative, so you know which way is up and which is down. At each step you pick a random direction and you move down along it. That's it.

Edit: I just implemented this descent algorithm; let's call it Random Direction Descent. I was sure that gradient descent would outperform it more and more in higher and higher dimensions. The exact opposite happens: this one outperforms gradient descent, and the higher the dimension, the more so. This is a huge surprise for me. This descent algorithm might revolutionize ML.
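
For reference, the update I'm describing is roughly the following (a minimal numpy sketch on a made-up quadratic, with made-up dimension and step size, not the exact code or problem I ran):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50                                   # made-up dimension
    A = rng.standard_normal((N, N)) / np.sqrt(N)
    H = A.T @ A + np.eye(N)                  # positive-definite quadratic: f(x) = 0.5 x'Hx

    def f(x):
        return 0.5 * x @ H @ x

    def grad(x):
        return H @ x

    eta = 0.01                               # made-up step size
    x_gd = rng.standard_normal(N)            # gradient descent iterate
    x_rd = x_gd.copy()                       # random direction descent iterate

    for step in range(5000):
        x_gd = x_gd - eta * grad(x_gd)               # classical gradient descent
        v = rng.standard_normal(N)                   # random direction v ~ N(0, I)
        x_rd = x_rd - eta * (grad(x_rd) @ v) * v     # step by -eta times the directional derivative

    print(f(x_gd), f(x_rd))

The only difference from classical gradient descent is that the step is -eta times the directional derivative, taken along a fresh random direction v at each iteration.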


That does not make sense, can you expand? The steepest descent direction is far more useful than a random direction in high dimensions because that random direction has a greater chance of being nearly orthogonal to the steepest descent direction.


The random directions don’t have unit length. They are drawn from a multivariate normal. They have equal variance along all directions, including the direction of the gradient. The descent along the orthogonal directions somehow cancels out. The descent along the gradient direction somehow becomes more efficient, I don’t know why yet.


I agree, in expectation, the orthogonal directions cancel out. But the variance does not cancel out. Are you using some sort of momentum-like optimizer?


> But the variance does not cancel out.

They are multiplied with the derivative in that direction, which is zero.


Oh, you're talking about using the forward gradient method. This still does not make much sense because if the projection of the gradient in that direction is zero or near zero, that is a very weak step in reducing the loss.


Ok, to recap: this whole article that we are discussing here is about the forward "gradient" method. Which, despite the name, only calculates a single directional derivative. If this were all, that would be a bit interesting, but not much else.

The article goes on to show how to make use of this in optimization. I called their descent algorithm "random direction descent", because that's what it is. You choose a random direction from a multivariate normal distribution with zero mean and identity covariance matrix (section 3.4). You choose a "step size" eta. Then you go along the chosen direction by eta times the directional derivative (times -1, so you go "downhill"). That is all explained at the top of page 4, Algorithm 1.

Why does this work if, as you correctly noticed, the chosen direction is almost perpendicular to the gradient (one of the "features" of high-dimensional spaces: most of the volume of a unit sphere is close to the equator)?

The answer is this: if you split the chosen random direction into the component along the gradient and the orthogonal component, then the first one has variance 1 and the second variance N-1, since the overall variance is N, where N is the number of dimensions. The component along the gradient makes you move downhill, while the orthogonal component makes you move level-wise. The orthogonal component doesn't make things any better or worse.
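
The "doesn't make things any better or worse" part, i.e. that the estimate is unbiased, is easy to sanity-check numerically (a quick numpy sketch with an arbitrary stand-in gradient vector):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100
    g = rng.standard_normal(N)           # stand-in for the true gradient at a point

    # average (g.v) v over many random directions v ~ N(0, I)
    V = rng.standard_normal((100000, N))
    estimate = ((V @ g)[:, None] * V).mean(axis=0)

    print(np.linalg.norm(estimate - g) / np.linalg.norm(g))   # small: the estimate is unbiased
    print(np.var(V @ g) / (g @ g))                            # ~1: unit variance along the gradient

The estimate is unbiased, although, as other commenters point out, the per-sample variance grows with the dimension.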

Another commenter in this thread claims that this method adds noise, and the noise is sometimes so high that the method is useless. Well, that's not what I observed.

Why did I observe that random direction descent outperforms gradient descent? With the argument so far, you'd expect it to do no worse, but why does it do better? This I don't have an explanation for, but maybe it's the fact that you don't have a constant step size, but rather a step size with variance 1. With classical gradient descent, as you approach the minimum, the step becomes smaller and smaller. With this one it still becomes smaller and smaller on average, but sometimes it's bigger and sometimes even smaller. There seems to be some gain from the extra stochasticity, so you end up reaching the minimum faster. I'm not sure why, but that's what I observed.


The more dimensions, the simpler the surface. With enough dimensionality, everything is a straight line. I'd expect "random direction" grad desc to improve the more dimensionality the data has, as in that space, the data surface itself is simple.


> The more dimensions, the simpler the surface. With enough dimensionality, everything is a straight line.

Even assuming this statement were true (which is a big if), that does not explain how it would outperform gradient descent.


As dimensionality grows, data is more linearly separable, so fewer dimensions are significant in distinguishing the data (though for any pair of points, those dimensions will be different).

Grad desc, as an exhaustive search over all those dimensions, may be less performant in very high dimensions, where we have enough data to be "right some of the time" when randomly choosing which dimension to discriminate on.

If we force zero loss, i.e., an interpolation regime, then we're getting interpolation as usual. Can we get there faster when dimensionality increases?

It's plausible if count(relevant dimensions) << count(dimensions), and if the discriminating dimensions for any two random points are themselves random.


I believe what you are describing is grid search, not gradient descent. Gradient descent is not an exhaustive search.


It is if you see vec_a . vec_b as a traversal over the dimensions of both vectors.


I skimmed the paper, so I'm not sure I got this right, but my understanding is the following:

1. pick a random direction

2. compute the derivative along that direction using forward-mode differentiation

3. update the parameters along that direction based on the derivative

The idea being that this gives an unbiased (albeit noisy) approximation of the actual gradient. You thus need a smaller learning rate, but you also need less memory and computation and, net net, they argue it's a win. Is this correct?
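
Concretely, I imagine one step would look something like this (a sketch assuming a recent PyTorch with torch.func.jvp; the least-squares toy problem and all names are made up, and the paper's own implementation may differ):

    import torch

    torch.manual_seed(0)

    # hypothetical toy problem: least-squares regression on random data
    X = torch.randn(256, 10)
    y = X @ torch.randn(10) + 0.1 * torch.randn(256)

    def loss(w):
        return ((X @ w - y) ** 2).mean()

    w = torch.zeros(10)
    eta = 0.01

    for step in range(1000):
        v = torch.randn_like(w)                    # 1. pick a random direction
        _, dd = torch.func.jvp(loss, (w,), (v,))   # 2. directional derivative via forward mode only
        w = w - eta * dd * v                       # 3. step along v, scaled by the derivative

    print(loss(w).item())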


I wonder if there's any value in picking not one but a couple random directions, and then finding the update in their span. Increases memory consumption but reduces the noise. It could be that the trade-off is best when picking a single direction, but who knows?

If you pick 2 directions at each pass, one of them could be the direction of the last update and the other a random one, allowing for some kind of momentum.


Yes, there is value, but there is also some cost, and this value can be captured naturally by using an optimizer that can make use of multiple directions.

Optimizers like O-LBFGS internally maintain a low-dimensional quadratic approximation of the cost function that is updated with each noisy gradient evaluation; the optimizer then takes a step in the direction that would minimize that approximation.

In the same line of thought, one can ask whether a line-search like in usual optimization algorithms is worth it or not.

The memory cost is usually storing the parameters multiple times, plus an increased computing cost per iteration. Whether the trade-off is worth it is usually problem dependent. Using a quadratic approximation is more costly than using a linear interpolation, but it often pays off when encountering saddle points.

There are also techniques like Polyak averaging (parameter tracking, teacher-student, ...) that have had some success in reinforcement learning, where they help to regularize the energy landscape.

OpenAI "Evolution Strategy" used a low number of random directions, and they found that they needed roughly ~100 (= implicit dimension of the problem) directions to have equivalent performance than backprop, but they were computing the derivative with finite difference instead of forward differentiation.


Choosing the direction of travel to measure the gradient in, or sampling around it, makes sense to me too.


Copy of the abstract for quick local reference:

> Using backpropagation to compute gradients of objective functions for optimization has remained a mainstay of machine learning. Backpropagation, or reverse-mode differentiation, is a special case within the general family of automatic differentiation algorithms that also includes the forward mode. We present a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode. We call this formulation the forward gradient, an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely eliminating the need for backpropagation in gradient descent. We demonstrate forward gradient descent in a range of problems, showing substantial savings in computation and enabling training up to twice as fast in some cases.

--

They have an implementation of the "forward gradient" for PyTorch. Is any implementation in C available?


http://ceres-solver.org/ works well, in my experience. It does require code to be templated, which might increase computation time. It is possible to explicitly define derivatives for individual functions; this can help prevent numerical instabilities in corner cases.


Forward-mode differentiation is easy to implement in C++ with templates, operator overloading, and dual numbers (https://en.wikipedia.org/wiki/Automatic_differentiation#Auto...). Some libraries such as autodiff (https://github.com/autodiff/autodiff) and CppAD (https://github.com/coin-or/CppAD) use this method to calculate gradients and Jacobians.
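
For anyone unfamiliar with the trick, here is the dual-number idea sketched in Python rather than C++ (a toy illustration only; the libraries above do essentially this with templates and operator overloading):

    class Dual:
        """Minimal dual number a + b*eps with eps^2 = 0; b carries the derivative."""
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.a + other.a, self.b + other.b)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
        __rmul__ = __mul__

    def f(x):
        return 3 * x * x + 2 * x + 1

    # seed the tangent with 1 to get df/dx alongside f(x) in a single forward pass
    out = f(Dual(2.0, 1.0))
    print(out.a, out.b)   # 17.0 and 14.0 (= 6*2 + 2)

Forward mode propagates the derivative alongside the value, so one pass gives the derivative along whatever direction you seed the tangent with.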


You can use Tapenade in tangent mode if you want to generate the forward derivative in the desired direction. http://www-tapenade.inria.fr:8080/tapenade/


Seems to be almost identical to a previous ICLR 2022 paper, Learning by Directional Gradient Descent, by Silver et al. (https://openreview.net/pdf?id=5i7lJLuhTm)

That paper has some additional experimental analysis of the properties of the algorithm (including how many samples you should estimate per epoch), and also extends the method to training RNNs and to reinforcement learning.


I would like some expert feedback on how meaningful this line of research is. The paper is quite straightforward, but the implications are unclear to me. Is there a chance that backpropagation is superseded by this method for some use cases? Another question: if this result is as interesting as it seems to me, how come it is only being published in 2022, given the past research effort in the area of deep learning?


It's not that novel; they're just computing the "directional derivative" in a random direction. It's similar to the research OpenAI has done on Evolution Strategies - basically, trading off a noisier gradient against optimization with less inter-process communication, achieving scale. This work is somewhere in between ES and normal backprop. Usually progress moves in a see-saw fashion, with scaling up (and the techniques needed for it) bringing improvement, then those techniques being compressed or found unnecessary as the real cause is found, etc.


I'm sure you know this, but for others: ES is gradient-free. Population sampling for optimization is distinct from 'gradient based' techniques in the literature.


They cited some of LeCun's solo work from the 80s on related matters; note that he also collaborated with Geoffrey Hinton on a related paper from NIPS 1988 in which they used random perturbations to derive a gradient signal. One of the limitations of that approach was that it required each layer to have fewer units than the preceding one.

GEMINI: Gradient Estimation Through Matrix Inversion After Noise Injection, by Yann Le Cun, Conrad C. Galland and Geoffrey E. Hinton: https://proceedings.neurips.cc/paper/1988/file/a0a080f42e6f1...

I cited this in a recent paper because it was representative of different cases for stochastic injections in neural networks. Interesting to see that similar lines of inquiry are continuing to this day.


The idea is interesting. Unfortunately the authors didn’t bother to test it, and haven’t published the code. So until they do, or someone implements and tries it with something like resnet-50 and imagenet we don’t know how well it works.


Could this be closer to how the brain works? Have we ruled out brains using backpropagation?


This paper from a couple of weeks ago suggested that neurons may learn as part of maximizing their energy efficiency. That suggests each neuron learns on its own, but also helps other neurons by giving some feedback. I don't think it fully answered how that feedback happens (backprop or multiple forward passes), but it may be of interest to you nonetheless.

https://www.nature.com/articles/s42256-021-00430-y


I think there's a chance it's similar, though I'm not a neuroscientist. The paper says:

"This could potentially change the computational complexity of typical ML training pipelines, reduce the time and energy costs of training, influence ML hardware design, and even have implications regarding the biological plausibility of backpropagation in the brain (Bengio et al., 2015; Lillicrap et al., 2020)."


How does this approach relate to the astonishing Hinton results from 3-5 years ago?


Already discussed a few days ago on HN (don't have the link...)


Click the "past" link under the article title above to search for previous discussions.


Looks like it was submitted a couple of times but without traction or comments

https://hn.algolia.com/?dateRange=pastMonth&page=0&prefix=tr...



