
Every time I build something complex with dataframes in either R or Python (pandas; I haven't used Polars yet), I end up really wishing I could have statically typed dataframes. I miss the security of knowing that when I change common code, the compiler will catch it if I break, say, a totally different part of a dashboard.

I'm aware of Pandera[1], which supports Polars as well, but while nice, it doesn't make the code fail to compile; it only fails at runtime. To me this is the Achilles' heel of analysis in both Python and R.

Does anybody have ideas on how this situation could be improved?

[1] https://pandera.readthedocs.io/en/stable/
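One partial workaround, short of a new language, is to model rows with `typing.TypedDict` so that a static checker like mypy flags a misspelled column name before the code runs. A minimal sketch (the `QuoteRow` type and `spread` function are hypothetical, not part of any library):

```python
from typing import TypedDict

# Hypothetical row type for a quotes table. mypy rejects an unknown
# key like row["asks"] statically, while a plain dict only fails at runtime.
class QuoteRow(TypedDict):
    bid: float
    ask: float

def spread(row: QuoteRow) -> float:
    # Writing row["asks"] here would be a static type error under mypy;
    # at runtime a TypedDict is still an ordinary dict.
    return (row["ask"] - row["bid"]) / row["bid"]

print(spread({"bid": 100.0, "ask": 101.0}))  # 0.01
```

This only covers row-shaped access, not whole-dataframe operations, but it shows the kind of check being asked for.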



Statically typed dataframes are exactly why I created the Empirical programming language:

https://www.empirical-soft.com

It can infer the column names and types from a CSV file at compile time.

Here's an example that misspells the "ask" column as if it were plural:

  let quotes = load("quotes.csv")
  sort quotes by (asks - bid) / bid
The error is caught before the script is run:

  Error: symbol asks was not found
I had to use a lot of computer-science techniques to get this working, like type providers and compile-time function evaluation. I'm really proud of the novelty of it and even won Y Combinator's Startup School grant for it.

Unfortunately, it didn't go anywhere as a project. Turns out that static typing isn't enough of a selling point for people to drop Python. I haven't touched Empirical in four years, but my code and my notes are still publicly available on the website.


Wow this is amazing!! Thanks for sharing!

I love how you really expanded on the idea of executing code at compile time. You should be proud.

You probably already know this, but for people like me to switch, "all" it would take would be:

1. A plotting library like ggplot2 or plotnine

2. A machine learning library, like scikit

3. A dashboard framework like streamlit or shiny

4. Support for Empirical in my cloud workspace environment, which is Jupyter-based and where I have to execute all the code, because that's where the data is and has to stay for security reasons

Just like how Polars is written in Rust with Python bindings, I wonder if there's a market for 1 and 2 written in Rust with bindings for Python, Empirical, R, Julia, etc. I feel like 4 is just a matter of time if Empirical becomes popular, but I think 3 would have to be implemented specifically for Empirical.

I think the idea of statically typed dataframes is really useful and you were ahead of your time. Maybe one day the time will be right.


Does this require the file to be available locally, or does it do network IO at compile time?


The inference logic needs to sample the file, so (1) the file path must be determined at compile time and (2) the file must be available to be read at compile time. If either condition doesn't hold---say, the filename is a runtime parameter---then the user must supply the type in advance.

There is no magic here. No language can guess the type of anything without seeing what the thing is.
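The sampling step can be sketched in plain Python: read the first N rows of a CSV and widen each column's guessed type until every sampled cell parses. This is a rough runtime stand-in for what a type provider does at compile time (the helper names are made up for illustration):

```python
import csv
import io

def guess(value, current):
    # Widen the current guess (int -> float -> str) until the cell parses.
    order = [int, float, str]
    for t in order[order.index(current):]:
        try:
            t(value)
            return t
        except ValueError:
            continue
    return str

def infer_schema(f, sample_rows=100):
    """Guess a column -> type mapping by sampling the first rows of a CSV."""
    reader = csv.DictReader(f)
    schema = {name: int for name in reader.fieldnames}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, value in row.items():
            schema[name] = guess(value, schema[name])
    return schema

quotes = io.StringIO("symbol,bid,ask\nAAPL,189.1,189.2\nMSFT,402.0,402.1\n")
print(infer_schema(quotes))  # symbol -> str, bid -> float, ask -> float
```

A type provider does essentially this at compile time, which is why the file has to exist when the compiler runs.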


Yeah, I think that's what limits the utility of such systems. Polars does type checking at query-planning time, so before you really do any computation. I don't expect much can improve over this model, given the aforementioned limitations.

I think needing network access or file access at compile time is a semi-hard blocker for statically typed dataframes.


Scala Spark - a bit absurd if you don't need the parallelism, though. Most of the development can be done in quick compilation iterations or copied from the sbt REPL. Python/pandas feels Stone Age in comparison: you waste a lot of time iterating with runtime testing.


Why Scala Spark over PySpark?


Scala (and Java) have a typed Dataset API.[0] PySpark only provides the DataFrame API, which is not typed.

[0] https://spark.apache.org/docs/latest/sql-programming-guide.h...


thanks!


The pandas mypy stubs attempt to address this to some extent, but honestly, it's really painful. It's not helped by pandas' hodgepodge API design, to be fair, but I think even a perfect API would still be annoying to statically type. Imagine needing to annotate every function that takes a dataframe with 20 columns...

A tantalising idea I have not explored is to hook up Polars' lazy query planner to a static-typing plugin. The planner already has essentially complete knowledge of the schema at every point, right?

So in theory this could be used to give the really good inference abilities that a static typing system needs to be nice to use.
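The core of plan-time checking can be sketched in a few lines: propagate a column -> type schema through a plan of operations and fail on any unknown column before a single row is read. This is a toy model, not Polars' actual planner:

```python
def check_plan(schema, plan):
    """Walk a plan of (operation, columns) steps, propagating the schema
    and failing on unknown columns before any data is touched."""
    for op, cols in plan:
        missing = [c for c in cols if c not in schema]
        if missing:
            raise KeyError(f"{op}: unknown column(s) {missing}")
        if op == "select":
            # Narrow the schema for downstream steps.
            schema = {c: schema[c] for c in cols}
    return schema

schema = {"symbol": str, "bid": float, "ask": float}
plan = [("filter", ["bid"]), ("select", ["asks"])]  # "asks" is misspelled
# check_plan(schema, plan) raises KeyError at planning time, with no data loaded
```

A static-typing plugin that consumed this kind of propagated schema would, in principle, move the same error from query time to edit time.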


It depends: the schema is resolved at runtime, so there's no way to have a truly compile-time static schema (unless you specify a schema upfront).


Polars is also usable as a Rust library, so one can use that for static typing. I wonder what the downsides are - maybe losing access to the Python data-science libraries.


Polars dataframes in Rust are still dynamically typed. For example:

    let df = df![
        "name" => ["Alice", "Bob", "Charlie"],
        "age" => [25, 30, 35]
    ]?;

    let ages = df.column("age")?;
There’s no Rust type-level knowledge of what type the “age” or “name” column is, for example. The result of df.column is a Series, which has to be cast to a Rust type based on the developer’s knowledge of what the column is expected to contain.

You can do things like this:

    let oldies = df.filter(&df.column("age")?.gt(30)?)?;
So the casting can be automatic, but this will fail at runtime if the age column doesn’t contain numeric values.

One type-related feature Polars does have: because the contents of a Series are represented as a single Rust type, all values in a Series must have the same type. That's a constraint compared to traditional dataframes, but it provides a performance benefit when processing large series: you can cast an entire Series to a typed Rust value efficiently, and then operate on the result in a typed fashion.

But as you said, you can’t use Python libraries directly with Polars dataframes. You’d need conversion and foreign function interfaces. If you need that, you’d probably be better off just using Python.


Pandas, Dask, etc. also have runtime-typed columns (dtypes), which got even stronger in pandas 2 and, when used with Arrow, extends to data-representation typing for interop/IO. (That's half of the performance trick of Polars.)

And yeah, my problem with all of these is that, lacking dependent typing or an equivalent for row types, it's hard for mypy and friends to statically track individual columns existing and having specific types. And even if we're willing to explicitly wrap each DF with a manual definition, basically an Arrow schema, I don't think any of these libraries make that convenient. (And is that natively supported by any?)
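The "wrap each DF with a manual definition" idea can be sketched without any particular library: declare an expected column -> type schema and check a frame against it at function boundaries. Shown here on a plain dict-of-lists so it stays library-agnostic; the decorator name is made up:

```python
def expect_schema(expected):
    """Decorator that checks a dict-of-lists 'frame' against a declared
    column -> type schema before the function body runs. A stand-in for
    the same check against pandas dtypes or an Arrow schema."""
    def wrap(fn):
        def inner(frame, *args, **kwargs):
            for col, typ in expected.items():
                if col not in frame:
                    raise KeyError(f"missing column {col!r}")
                if not all(isinstance(v, typ) for v in frame[col]):
                    raise TypeError(f"column {col!r} is not all {typ.__name__}")
            return fn(frame, *args, **kwargs)
        return inner
    return wrap

@expect_schema({"age": int})
def mean_age(frame):
    return sum(frame["age"]) / len(frame["age"])

print(mean_age({"age": [25, 30, 35]}))  # 30.0
```

This is still a runtime check, of course, which is exactly the complaint: the declaration exists, but no static checker consumes it.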

In louie.ai, we generate Python for users, so we can have it generate the types as well... but we haven't found a satisfactory library for that so far.


Thanks, I'm in the process of choosing a dataframe library and had just naively assumed that the Rust interface would be statically typed.


I agree with this so much! I recently started using Patito, a type-safe, Pydantic-based library for Polars. I'm not really deep into it yet, but I prefer Polars' syntax and the extra functions that Patito adds to dataframes. (https://patito.readthedocs.io/en/latest/)

Otherwise, it feels so broken to just pass a dataframe around. It’s like typing everything as a “dict” and hoping for the best. It’s awful.


Frames is a type-safe dataframe library for Haskell:

https://hackage.haskell.org/package/Frames


I agree, and I suspect there are large numbers of unknown bugs in a lot of dataframe-based applications.

But to do it right you'd need a pretty good type system because these applications implicitly use a lot of isomorphisms between different mathematical objects. The current solution is just to ignore types and treat everything as a bag of floats with some shape. If you start tracking types you need a way to handle these isomorphisms.


If you use a feature store to store your DataFrames (most provide APIs for storing Polars, Pandas, PySpark DataFrames in backing Lakehouse/real-time DBs), then you get type checks when writing data to the DataFrame's backing Feature Group (Lakehouse + real-time tables).

Many also add an additional layer of data validation on top of schema validation, using frameworks like Great Expectations. For example, it's not enough to know 'age' is an Integer, it should be an integer in the range 0..150.

Disclaimer: I work for Hopsworks.
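The range check described above is easy to see without any framework; Great Expectations expresses it declaratively, but the core is just a value-level predicate on top of the type check (the function name here is made up, not the GE API):

```python
def values_out_of_range(column, low, high):
    # Value-level validation on top of type checking: return every value
    # outside [low, high] instead of raising, so all failures are reported.
    return [v for v in column if not (low <= v <= high)]

ages = [25, 30, 212]
print(values_out_of_range(ages, 0, 150))  # [212]
```

Schema validation alone would accept 212 as a perfectly valid integer; only the value-level rule catches it.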


Bodo is a JIT compiler for pandas/NumPy code that statically types the code at compilation time. However, it infers types automatically (looking at file metadata, etc.) and isn't really built for manually typing everything.

https://github.com/bodo-ai/Bodo (I work on Bodo)


It's really not the same as built-in strict typing, but we addressed this issue by running all of our "final" data products through a Great Expectations[1] suite that was autogenerated from a YAML schema.

[1] https://docs.greatexpectations.io/docs/core/introduction/gx_...
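The autogeneration step can be sketched with a dict standing in for the parsed YAML (so no yaml dependency is needed here); the function names and check shapes are illustrative, not the Great Expectations API:

```python
# A dict standing in for a parsed YAML schema (e.g. from yaml.safe_load).
schema = {
    "age":  {"type": int, "min": 0, "max": 150},
    "name": {"type": str},
}

def build_checks(schema):
    """Turn a schema into a list of (column, predicate) checks -- the same
    shape of thing an autogenerated expectation suite encodes."""
    checks = []
    for col, spec in schema.items():
        checks.append((col, lambda v, t=spec["type"]: isinstance(v, t)))
        if "min" in spec:
            checks.append((col, lambda v, lo=spec["min"], hi=spec["max"]:
                           lo <= v <= hi))
    return checks

def validate(frame, checks):
    # Return every (column, value) pair that fails any check.
    return [(col, v) for col, pred in checks
            for v in frame[col] if not pred(v)]

frame = {"age": [25, 212], "name": ["Alice", "Bob"]}
print(validate(frame, build_checks(schema)))  # [('age', 212)]
```

Keeping the schema in one YAML file and generating the checks from it means the "final" products can't silently drift from the declared contract.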



