Hacker News | motiwari's comments

Hmmmm.... not sure exactly what you mean. I believe the setting you're describing is: we have a dataset we're trying to fit with a GMM, we know the number of components k, and we're trying to determine the parameters of the GMM, correct?

I suppose that you could adaptively sample points from the dataset to update your parameters of the GMM, and sample more points for parameters of the GMM that you're less certain about.
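
To make that concrete, here's a toy sketch of what I mean (purely illustrative, not from the paper): bandit-style adaptive sampling that allocates more draws to the component whose mean estimate has the widest confidence interval. Real GMM fitting wouldn't know the component assignments; that simplification just keeps the example short.

```python
import random

random.seed(0)

# Hypothetical sketch: adaptively sample to estimate per-component means of a
# 1-D mixture, pulling more from the component we're least certain about.
components = {
    0: [random.gauss(0.0, 1.0) for _ in range(10_000)],  # low variance
    1: [random.gauss(5.0, 3.0) for _ in range(10_000)],  # high variance
}

counts = {k: 0 for k in components}
sums = {k: 0.0 for k in components}
sumsqs = {k: 0.0 for k in components}

def ci_width(k):
    # Rough half-width of a confidence interval on the mean: ~ sigma / sqrt(n)
    n = counts[k]
    if n < 2:
        return float("inf")
    mean = sums[k] / n
    var = max(sumsqs[k] / n - mean * mean, 1e-12)
    return (var / n) ** 0.5

for _ in range(2_000):
    # Pull from the component with the widest confidence interval
    k = max(components, key=ci_width)
    x = random.choice(components[k])
    counts[k] += 1
    sums[k] += x
    sumsqs[k] += x * x

means = {k: sums[k] / counts[k] for k in components}
print(counts)  # the high-variance component receives more samples
print(means)
```

Note the budget concentrates roughly in proportion to each component's variance, which is the bandit intuition behind BanditPAM as well.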

(To understand how the parameter estimates would converge to their true values, you'd likely need to use the delta method; see Appendix 3 in https://ar5iv.org/pdf/2212.07473.pdf for an example)

Is that what you had in mind?


Definitely possible, but it would require some extensions to the algorithm. More specifically, as new datapoints enter the stream, each one could be compared against the existing medoids to see whether swapping it in would lower the clustering loss.
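
A naive sketch of that swap check (hypothetical code, not part of the BanditPAM API -- a real version would use BanditPAM-style sampling instead of evaluating the loss exactly on every point):

```python
# Streaming medoid update sketch: when a new point arrives, try swapping it in
# for each current medoid and keep the swap that most lowers the clustering
# loss. Naive O(k * n) per point; illustrative only.

def loss(points, medoids, dist):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def stream_update(points, medoids, new_point, dist):
    points = points + [new_point]
    best_medoids, best_loss = medoids, loss(points, medoids, dist)
    for i in range(len(medoids)):
        candidate = medoids[:i] + [new_point] + medoids[i + 1:]
        cand_loss = loss(points, candidate, dist)
        if cand_loss < best_loss:
            best_medoids, best_loss = candidate, cand_loss
    return points, best_medoids

# Toy usage with 1-D points and absolute difference as the dissimilarity
dist = lambda a, b: abs(a - b)
points, medoids = [0.0, 0.1, 0.2], [0.0, 0.1]
points, medoids = stream_update(points, medoids, 10.0, dist)
print(medoids)  # [10.0, 0.1] -- the distant new point replaces one medoid
```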

This would be a nontrivial engineering effort and I likely won't be able to do it myself (I'm a PhD student about to graduate), but if you or your team is interested in adapting BanditPAM to the streaming setting, please feel free to reach out! My email's motiwari@stanford.edu


Where is this benchmark from? We'd be happy to run BanditPAM on these datasets and report the results.



A strength if you're strictly looking to minimize squared L2 loss from each point to its closest mean -- but for a lot of other applications, it's a weakness! As the other poster mentioned, with k-medoids you can use arbitrary loss functions and cluster exotic objects (you're not restricted to metrics on a vector space).
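
As a concrete example of clustering "exotic" objects: a medoid assignment over strings under edit distance (a plain-Python sketch, not the BanditPAM API):

```python
# Medoid-based clustering of objects with no vector-space structure: strings
# under edit (Levenshtein) distance. Exhaustive assignment step, for
# illustration only.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

words = ["kitten", "mitten", "sitting", "fitting", "bitten"]
medoids = ["kitten", "sitting"]

# Assign each word to its nearest medoid
clusters = {m: [] for m in medoids}
for w in words:
    nearest = min(medoids, key=lambda m: edit_distance(w, m))
    clusters[nearest].append(w)
print(clusters)
# {'kitten': ['kitten', 'mitten', 'bitten'], 'sitting': ['sitting', 'fitting']}
```

The medoids themselves are always real data points, so they stay interpretable even when a "mean string" wouldn't make sense.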


You got me! As @dang mentioned, once I got feedback from him I was allowed to repost and enter the second-chance pool

(My real first post was submitted too hastily without receiving @dang's feedback)


Hey, thanks! Shout out to Daniel @ HN who gave me a lot of great feedback on how to make this post better.


Thanks for the bug report and repro steps! I've filed this issue at https://github.com/motiwari/BanditPAM/issues/244 on our repo.

I suspect that this is because the scikit-learn implementation of KMeans subsamples the data and uses some highly optimized data structures for larger datasets. I've asked the team to look into how we can use some of those techniques in BanditPAM, and I'll update the GitHub repo as we learn more and improve our implementation.


Right! Though to clarify a few nits: we use successive elimination instead of successive halving. And we talk about the Maximum Inner Product Search problem (very similar to the nearest-neighbor search problem) in our follow-up work: https://ar5iv.org/abs/2212.07551


We talk exactly about clustering pictures in our blog post! https://ai.stanford.edu/blog/banditpam/


One thing that's important to note is that k-medoids supports arbitrary dissimilarity measures -- in fact, your dissimilarity need not even be a metric (it can be negative, asymmetric, violate the triangle inequality, etc.)

An implication of this is that clustering after an invertible data transformation is equivalent to clustering the raw data with a different dissimilarity measure -- namely, the one induced by composing your original measure with the transformation. So if you're willing to engineer your dissimilarity measure accordingly, you can skip the transformation entirely.
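
A minimal sketch of this equivalence (the names f and d here are illustrative, not any package's API): clustering transformed points under d gives the same assignments as clustering raw points under the induced dissimilarity d'(x, y) = d(f(x), f(y)).

```python
import math

# Equivalence of (transform, then cluster) and (cluster with an induced
# dissimilarity). f is an invertible transformation; d is the base measure.

f = math.log                      # invertible transformation
d = lambda a, b: abs(a - b)       # base dissimilarity
d_induced = lambda x, y: d(f(x), f(y))

points = [1.0, 2.0, 100.0, 200.0]
medoids = [2.0, 100.0]

# Option 1: transform the data, then assign with d
assign_transformed = [min(medoids, key=lambda m: d(f(p), f(m))) for p in points]
# Option 2: assign the raw data with the induced dissimilarity
assign_induced = [min(medoids, key=lambda m: d_induced(p, m)) for p in points]

print(assign_transformed == assign_induced)  # True -- identical by construction
```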

Without more details, it's hard to say exactly what would happen to the clustering results under custom dissimilarity measures or data transformations -- but our package supports both use cases!

