I couldn't disagree more. The defaults don't just work, and the architecture of the network could also be considered a hyperparameter, in which case what would be a reasonable default for all the types of problems ANNs are used for?
Are you using batch normalization? If you are, an issue I see all the time is folks not setting the EMA filter coefficient correctly. In Keras, it defaults to something like 0.99, which in my mind makes no sense. I use something around 0.6 and life is good. You want to get an overall good measurement of the statistics, and in my mind the frequency cutoff when coef=0.99 is just way too high for most applications. You usually want something that filters out just about everything except very close to DC.
The response to "the defaults should work just fine without any hyperparameter tuning" is "try fiddling with the EMA filter coefficient hyperparameter"?
It's like the joke of the mathematician giving an exposition of a complex proof. At one point he says "It is obvious that X", pauses, scratches his head, does a few calculations. He leaves the room for twenty minutes and returns. Then he continues, "It is obvious that X", and goes on to the next step.
Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric fields and tuning methods] since they're welcome to "humble brag" as much as they want. But when this gets in the way of figuring out what really "just works" it's more of a problem.
I think they're referring to the momentum parameter at [1]. The exponential moving average (EMA) of the batch mean/variance is used in the batch normalizing transform (Algorithm 1 in [2]).
The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
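To make that concrete, here's a minimal sketch of the update as I understand it: the running statistic keeps a `momentum` fraction of its old value and takes the rest from the current batch. The helper names (`ema_update`, `track`) and the step-shaped input are made up for illustration, and this is plain Python rather than the actual layer code.

```python
def ema_update(running, batch_stat, momentum):
    """One EMA step: keep `momentum` of the old value, blend in the rest."""
    return momentum * running + (1.0 - momentum) * batch_stat


def track(batch_stats, momentum, initial=0.0):
    """Return the running estimate after each batch (hypothetical helper)."""
    running = initial
    history = []
    for stat in batch_stats:
        running = ema_update(running, stat, momentum)
        history.append(running)
    return history


# Assumed toy input: the running mean starts at 0 and every batch mean is 1.
batch_means = [1.0] * 10

slow = track(batch_means, momentum=0.99)
fast = track(batch_means, momentum=0.6)

print(f"momentum=0.99 after 10 batches: {slow[-1]:.4f}")
print(f"momentum=0.6  after 10 batches: {fast[-1]:.4f}")
```

After ten batches the 0.99 estimate has covered only about a tenth of the distance to the new value, while the 0.6 estimate is essentially there. For a constant input of 1 starting from 0, the estimate after n batches is 1 - momentum^n.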
The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
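If you want numbers for the cutoff, here's a rough sketch. It assumes the EMA is the first-order recursion y[n] = m*y[n-1] + (1-m)*x[n], with m the momentum, and solves for the frequency where the power gain falls to half its DC value (-3 dB). The function name is my own, and the frequency is in radians per batch, not anything a framework reports.

```python
import math


def cutoff_rad_per_batch(momentum):
    """-3 dB cutoff of y[n] = m*y[n-1] + (1-m)*x[n], in radians per batch.

    Returns None when the response never drops to -3 dB, which happens
    for very small momentum (below roughly 0.17).
    """
    if not 0.0 < momentum < 1.0:
        raise ValueError("momentum must be strictly between 0 and 1")
    # |H(e^{jw})|^2 = (1-m)^2 / (1 - 2*m*cos(w) + m^2); set it to 1/2
    # and solve for cos(w).
    cos_w = (4.0 * momentum - 1.0 - momentum ** 2) / (2.0 * momentum)
    if cos_w < -1.0:
        return None
    # Clamp to guard against rounding just above 1.
    return math.acos(min(1.0, cos_w))


for m in (0.99, 0.6):
    print(f"momentum={m}: cutoff ~ {cutoff_rad_per_batch(m):.4f} rad/batch")
```

Under those assumptions, 0.99 gives a cutoff of about 0.01 rad/batch and 0.6 gives about 0.52 rad/batch, so the higher momentum is the one that passes only content near DC.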