Ok, to recap: the article we are discussing here is about the forward "gradient" method. Which, despite the name, only calculates a single directional derivative. If that were all, it would be mildly interesting, but not much else.
The article goes on to show how to make use of this in optimization. I called their descent algorithm "random direction descent", because that's what it is. You choose a random direction, drawn from a multivariate normal distribution with zero mean and identity covariance matrix (section 3.4). You choose a "step size" eta. Then you move along the chosen direction by eta times the directional derivative (times -1, so you go "downhill"). That is all explained at the top of page 4, Algorithm 1.
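To make that concrete, here is a minimal numpy sketch of the update rule as I described it (the function name and quadratic test problem are mine; in the paper the directional derivative comes from forward-mode AD, not from an explicit gradient, which I use here only as a stand-in):

```python
import numpy as np

def random_direction_descent(f, grad_f, x0, eta=0.1, steps=500, seed=0):
    """Sketch of Algorithm 1: draw v ~ N(0, I), compute the directional
    derivative g = grad_f(x) . v, then step by -eta * g * v."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        v = rng.standard_normal(x.shape)   # random direction ~ N(0, I)
        g = grad_f(x) @ v                  # directional derivative along v
        x = x - eta * g * v                # move downhill along v
    return x

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x_min = random_direction_descent(f, grad_f, np.ones(10), eta=0.05)
# f(x_min) ends up close to 0
```

Note that only the scalar g needs forward-mode AD; the full gradient is never formed, which is the whole point of the method.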
Why does this work, if, as you correctly noticed, the chosen direction is almost perpendicular to the gradient (one of the "features" of high-dimensional spaces: most of the volume of a unit sphere is close to the equator)?
The answer is this: if you split the chosen random direction into the component along the gradient and the orthogonal component, then the first one has variance 1, and the second variance (N-1), since the overall variance is N, where N is the number of dimensions. The component along the gradient makes you move downhill, while the orthogonal component makes you move level-wise. The orthogonal component makes things neither better nor worse (to first order).
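You can check this variance split empirically. In the sketch below, u stands in for the (fixed, unit-norm) gradient direction; the projection of v onto u has variance 1, and the orthogonal remainder carries the other N-1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, samples = 100, 200_000
u = rng.standard_normal(N)
u /= np.linalg.norm(u)                 # fixed unit "gradient" direction

v = rng.standard_normal((samples, N))  # random directions ~ N(0, I_N)
along = v @ u                          # scalar component along u
orth = v - np.outer(along, u)          # orthogonal remainder

print(along.var())                     # close to 1
print((orth**2).sum(axis=1).mean())    # close to N - 1 = 99
```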
Another commenter in this thread claims that this method adds noise, and the noise is sometimes so high that the method is useless. Well, that's not what I observed.
Why did I observe that random direction descent outperforms gradient descent? With the argument so far, you'd expect it to do no worse, but why does it do better? This I don't have an explanation for, but maybe it's the fact that you don't have a constant step size, but rather a step size whose component along the gradient has variance 1. With classical gradient descent, as you approach the minimum, the step becomes smaller and smaller. With this one, it still becomes smaller and smaller on average, but sometimes it's bigger, and sometimes even smaller. It appears there's some gain from the extra stochasticity, so you end up reaching the minimum faster. I'm not sure why, but that's what I observed.