I believe the Transformer-XL pre-trained model can also be downloaded, providing long-term memory functionality similar to the Compressive Transformer. I don't have a direct link, but it's available via Hugging Face: https://huggingface.co/transformers/pretrained_models.html
Yeah. I didn't mention Transformer-XL because I'm not sure how much of a long-range dependency it actually learns to handle. The only papers I've seen on recurrence indicate that recurrent models tend to learn very short-range dependencies, while something like Reformer, with direct access to thousands of timesteps, seems more likely to actually make use of them.
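For context, the recurrence being discussed is Transformer-XL's segment-level recurrence: each segment attends over its own states plus cached hidden states from the previous segment. Here's a toy NumPy sketch of that idea (not the real implementation — it assumes identity Q/K/V projections, a single head, and no positional encodings, purely to show how the cached memory extends the attention context):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(segment, memory, d=8):
    """Single-head attention where keys/values span cached memory + current
    segment, but queries come only from the current segment (as in
    Transformer-XL's segment-level recurrence)."""
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    q, k, v = segment, context, context  # identity projections for the toy
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seg_len, d = 4, 8
memory = None
for step in range(3):
    segment = rng.standard_normal((seg_len, d))
    out = attend_with_memory(segment, memory, d)
    # Segment-level recurrence: cache this segment's states for the next one
    # (Transformer-XL stops gradients through the cache).
    memory = segment
```

The point of the critique above is that even with this cache, the effective receptive field is bounded by how many cached segments the model actually attends to, whereas Reformer's LSH attention gives queries direct access to distant timesteps.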