
I'd say we have about 60% library code and 40% application code internally, so making it easier to write libraries is really nice. E.g. you don't want to have to write different statistical calculations for every type you use.

In particular, the fact that these types would silently give you different answers if you called `.std()` was a big headache.

It was very common for us to want to be generic over pandas series and numpy arrays. A bit less so with pytorch tensors, but that was because we just aggressively converted them to numpy arrays. Fundamentally these three are all very similar types so it's frustrating to have to treat them differently.
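To illustrate the `.std()` mismatch: a minimal sketch using only numpy, with made-up sample data. `numpy.std` defaults to `ddof=0` (population standard deviation), while pandas `Series.std()` defaults to `ddof=1` (sample standard deviation), so the same data gives two different numbers unless you pass `ddof` explicitly.

```python
import math

import numpy as np

# Made-up sample data, chosen so the population std is exactly 2.0.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy default: ddof=0, i.e. divide by N (population standard deviation).
population_std = float(np.std(data))

# ddof=1 divides by N - 1 (sample standard deviation). This is the
# convention pandas Series.std() uses when called with no arguments.
sample_std = float(np.std(data, ddof=1))

print(f"ddof=0: {population_std:.6f}")
print(f"ddof=1: {sample_std:.6f}")
print("same answer?", math.isclose(population_std, sample_std))
```

The gap shrinks as the sample grows, which is part of why it is easy to miss.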



Which implementation of `std()` did you go with?

I was writing unit tests once against histograms. That code is super finicky, and I couldn't get pandas and polars numbers to tie out. I wasn't super concerned with the exact output for my application, just that the number of buckets was the same and they were roughly the same size. Just bumping to a new version of numpy would result in test breakages because of floating point rounding errors. I'm sure there are much more robust numerical testing things I could have done.
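One way to test "same number of buckets, roughly the same size" without pinning exact floats, as a rough sketch. The helper name `histograms_roughly_equal` and its tolerances are hypothetical, not from any library; it takes `(counts, edges)` pairs in the shape `numpy.histogram` returns.

```python
import numpy as np

def histograms_roughly_equal(hist_a, hist_b, rtol=1e-9, max_count_diff=1):
    """Compare two (counts, edges) pairs loosely (hypothetical helper).

    Requires the same number of buckets, bucket edges equal within a
    relative tolerance, and per-bucket counts differing by at most
    max_count_diff (a value sitting on an edge can flip buckets).
    """
    counts_a = np.asarray(hist_a[0], dtype=np.int64)
    counts_b = np.asarray(hist_b[0], dtype=np.int64)
    edges_a = np.asarray(hist_a[1], dtype=np.float64)
    edges_b = np.asarray(hist_b[1], dtype=np.float64)

    # Different bucket counts can never tie out.
    if counts_a.shape != counts_b.shape or edges_a.shape != edges_b.shape:
        return False
    # Edges only need to agree up to rounding noise.
    if not np.allclose(edges_a, edges_b, rtol=rtol, atol=1e-12):
        return False
    return bool(np.all(np.abs(counts_a - counts_b) <= max_count_diff))

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
hist_20 = np.histogram(data, bins=20)
hist_25 = np.histogram(data, bins=25)

print(histograms_roughly_equal(hist_20, hist_20))  # identical inputs
print(histograms_roughly_equal(hist_20, hist_25))  # bucket counts differ
```

The tolerances are knobs you would have to tune for your own data; the defaults here are just placeholders.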


Yeah, this kind of stuff drove us crazy. Numpy uses things like SIMD everywhere with no option to turn it off, which is good for performance but makes life really hard when you want consistently reproducible results across different machines.

Before switching to Julia, we eventually standardized on `numpy.std` with `ddof=1`, plus some error tolerances in our tests.
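Roughly what that could look like, as a sketch rather than our actual code. The wrapper name `sample_std` and the `rtol` value are made up for illustration; `numpy.asarray`, `numpy.std`, and `numpy.testing.assert_allclose` are real numpy calls. Going through `np.asarray` means the same function accepts lists and numpy arrays, and should also accept a pandas Series.

```python
import numpy as np

def sample_std(values, ddof=1):
    """Standard deviation with one fixed convention (hypothetical wrapper).

    Everything is converted to a float64 numpy array first, so the result
    does not depend on which container type the caller passed in.
    """
    arr = np.asarray(values, dtype=np.float64)
    return float(np.std(arr, ddof=ddof))

def test_sample_std():
    data = [1.0, 2.0, 3.0, 4.0]
    # Squared deviations from the mean (2.5) sum to 5.0; divide by N - 1 = 3.
    expected = (5.0 / 3.0) ** 0.5
    # Compare with a relative tolerance instead of exact equality, so small
    # rounding differences between machines or versions do not fail the test.
    np.testing.assert_allclose(sample_std(data), expected, rtol=1e-9)
    np.testing.assert_allclose(sample_std(np.array(data)), expected, rtol=1e-9)

test_sample_std()
```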



