In terms of trying to break free of dependence on hand-optimized kernels: a few people, myself included, have been working on some theoretical approaches to generating cache-efficient rearrangements for neural-net-like problems. We've worked it out for convolution-like problems [1] and have some upcoming results generalizing these techniques to other problems. Please feel free to email me if you'd like to talk.
[1] https://arxiv.org/abs/1802.06905