Okay, this is a serious question. For me, not an official RH position. In my tim...

bayindirh · on Dec 9, 2020

Not much, but in our setup the image is not something which can evolve or change over time. This practice has some very practical reasons though.

Scientific applications can be very picky about the libraries they use or need, down to the minor version, since the results they produce are very, very precise. Even if a result is not very accurate, you need to know its inaccuracy. An optimization in a math library can change this, and that's not something we want. Also, program verification and certification generally include the versions of the libraries used.
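A minimal illustration of why a library optimization can change results: floating-point addition is not associative, so an optimization that only reorders a sum can produce a different answer. The values below are chosen for illustration and do not come from a real workload.

```python
import math


def sum_left_to_right(values):
    """Plain sequential sum, in the order a naive loop would use."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e16, 1.0, -1e16]      # illustrative values only
reordered = [1e16, -1e16, 1.0]   # same numbers, different order

naive = sum_left_to_right(values)        # the 1.0 is rounded away when added to 1e16
swapped = sum_left_to_right(reordered)   # the large terms cancel first, so 1.0 survives
exact = math.fsum(values)                # correctly rounded sum, independent of order

print(naive, swapped, exact)
```

The same three numbers give two different sums depending on evaluation order, which is why pinning library versions matters when results have to be reproducible.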

Piecewise upgrades are a no-go too. Your cluster generally can't work well in heterogeneous configurations (due to library mismatches), and draining a node is not a straightforward task (due to the length of the jobs). If your cluster has a steady stream of incoming jobs, reducing resources also means queue bloat, and recovering from it is sometimes not easy. If you want to drain the whole cluster, it takes almost 2-3 weeks, so you lose ~1 month of productivity. When you start an empty cluster to churn through its queues, saturation takes time, so it doesn't go to 11 directly.

Also, worker nodes are highly isolated from the user's point of view. No users can log in, only known people submit jobs, etc. Unless there's a rogue academic trying to do nefarious things, the place is pretty safe and worry-free. In the past 15 years, we got two rootkit infections due to a server which is world-accessible by design. Other than that, nothing ever got infected.

At the end of the day, this approach has some valid reasons to be alive. It's not that we're a bunch of lazy academics who refrain from applying good system administration practices. :D

Addendum: The images generally get updated when new hardware is added, since new processors tend to work better with newer kernels. Also, sometimes we bite the bullet and update the whole cluster at once. xCAT helps a lot in this space. If your image is sane, you can install batches of 150+ servers in 15 minutes while sipping your coffee.

mattdm · on Dec 9, 2020

Right, so: for this case, CentOS Stream will be virtually identical to the CentOS Linux RHEL rebuild.

bayindirh · on Dec 9, 2020

We will certainly try. We need to mirror a repo, freeze it, and update our installation infra so it points to the local repo rather than the national mirror.

All repo settings will point to the local repo, so we'd have no dependency problems or version creep if we need to install an additional package.
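As a rough sketch of what pointing the nodes at a frozen local mirror could look like, the snippet below renders a yum/dnf `.repo` file. The repo id, the description, and the mirror URL are all hypothetical placeholders; only the `name`, `baseurl`, `enabled`, and `gpgcheck` keys are standard repo options.

```python
import configparser
import io


def render_repo_file(repo_id, name, baseurl):
    """Return the text of a .repo file containing one enabled repository."""
    parser = configparser.ConfigParser(interpolation=None)
    parser[repo_id] = {
        "name": name,
        "baseurl": baseurl,
        "enabled": "1",
        # Assumes the distribution signing key is already imported on the nodes.
        "gpgcheck": "1",
    }
    buffer = io.StringIO()
    # Write "key=value" without spaces around the delimiter.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical repo id and mirror URL: substitute the real local mirror.
repo_text = render_repo_file(
    "local-frozen-baseos",
    "Frozen local mirror (hypothetical)",
    "http://mirror.cluster.example/frozen/BaseOS/",
)
print(repo_text)
```

The rendered text would go into a file under `/etc/yum.repos.d/` in the node image, with the stock repo files disabled so that nothing resolves against an upstream mirror.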

Haven't completely thought through how to handle the occasional emergency update, though.

Also, we need to compile some packages ourselves. Hope they won't break. High-performance stuff needs optimized/customized builds.

I just want to add: I hope the packages in CentOS Stream won't end up too cutting-edge for the scientific software community. These communities move slowly due to stability requirements. We'll certainly see, but it might be another potential problem.

mattdm · on Dec 9, 2020

I can totally reassure you on your last concern: everything that goes into Stream is approved for a minor release in RHEL. That's not changing at all. Cutting edge is still Fedora's turf. :)

bayindirh · on Dec 9, 2020

Thanks, because that last point would actually be breaking in some cases.

I think HN is the only place where you can casually provide feedback and get answers about an OS project from one of the core people in it. Fun!

Glad to meet you, BTW.

mattdm · on Dec 12, 2020

To be clear, I'm RHEL and CentOS _adjacent_, rather than actively _in_ them. But I think (rough launch and more than a few communication issues aside) this is generally gonna be positive.

shiftpgdn · on Dec 9, 2020

I think that's because HPC users are largely non-technical developers. We changed a DHCP schema at one point and had a bunch of angry academics in the IT office because their Matlab scripts were broken. Many of them had been hard-coding IP addresses into the code itself.

cozzyd · on Dec 9, 2020

The login nodes on our cluster (UChicago) can reach uptimes over a hundred days (which my tmux sessions love).

Seems like the kernel was last updated in May.

    $ uname -r -v
    3.10.0-1127.8.2.el7.x86_64 #1 SMP Wed May 13 10:45:47 CDT
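For the uptime side of this, a small sketch: on Linux, the first field of `/proc/uptime` is the number of seconds since boot, so converting it to days is one division. The sample string below is made up for illustration.

```python
def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime to days since boot."""
    # The first whitespace-separated field is seconds since boot.
    seconds_since_boot = float(proc_uptime_text.split()[0])
    return seconds_since_boot / 86400.0


sample = "8640000.00 12345678.90"  # made-up sample, not from a real machine
print(uptime_days(sample))

# On a real Linux host, read the live value instead:
#     with open("/proc/uptime") as f:
#         print(uptime_days(f.read()))
```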

bromonkey · on Dec 8, 2020

No, they haven't changed in my experience.