I believe the Transformer-XL pre-trained model can also be downloaded, providing long-term memory functionality similar to the Compressive Transformer. I don't have a direct link, but it's available via Hugging Face: https://huggingface.co/transformers/pretrained_models.html
Yeah. I didn't mention Transformer-XL because I'm not sure how much of a long-range dependency it actually learns to handle. The only papers I've seen on recurrence indicate that recurrent models tend to learn very short-range dependencies, while something like Reformer, with direct access to thousands of timesteps, seems more likely to actually make use of them.
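For context, the recurrence being discussed is Transformer-XL's segment-level recurrence: each segment attends over its own states plus cached hidden states from the previous segment. Here's a toy NumPy sketch of that idea (not the real implementation — it assumes identity Q/K/V projections, a single head, and no positional encodings, purely to show how the cached memory extends the attention context):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(segment, memory, d=8):
    """Single-head attention where keys/values span cached memory + current
    segment, but queries come only from the current segment (as in
    Transformer-XL's segment-level recurrence)."""
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    q, k, v = segment, context, context  # identity projections for the toy
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seg_len, d = 4, 8
memory = None
for step in range(3):
    segment = rng.standard_normal((seg_len, d))
    out = attend_with_memory(segment, memory, d)
    # Segment-level recurrence: cache this segment's states for the next one
    # (Transformer-XL stops gradients through the cache).
    memory = segment
```

The point of the critique above is that even with this cache, the effective receptive field is bounded by how many cached segments the model actually attends to, whereas Reformer's LSH attention gives queries direct access to distant timesteps.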