Hacker News | motiwari's comments

Hmmmm.... not sure exactly what you mean. I believe the setting you're describing is: we have a dataset we're trying to fit with a GMM, we know the number of components k, and we're trying to determine the parameters of the GMM, correct?

I suppose that you could adaptively sample points from the dataset to update your parameters of the GMM, and sample more points for parameters of the GMM that you're less certain about.
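
To make that concrete, here's a toy sketch of what I mean (purely illustrative, not from the paper): bandit-style adaptive sampling that allocates more draws to the component whose mean estimate has the widest confidence interval. Real GMM fitting wouldn't know the component assignments; that simplification just keeps the example short.

```python
import random

random.seed(0)

# Hypothetical sketch: adaptively sample to estimate per-component means of a
# 1-D mixture, pulling more from the component we're least certain about.
components = {
    0: [random.gauss(0.0, 1.0) for _ in range(10_000)],  # low variance
    1: [random.gauss(5.0, 3.0) for _ in range(10_000)],  # high variance
}

counts = {k: 0 for k in components}
sums = {k: 0.0 for k in components}
sumsqs = {k: 0.0 for k in components}

def ci_width(k):
    # Rough half-width of a confidence interval on the mean: ~ sigma / sqrt(n)
    n = counts[k]
    if n < 2:
        return float("inf")
    mean = sums[k] / n
    var = max(sumsqs[k] / n - mean * mean, 1e-12)
    return (var / n) ** 0.5

for _ in range(2_000):
    # Pull from the component with the widest confidence interval
    k = max(components, key=ci_width)
    x = random.choice(components[k])
    counts[k] += 1
    sums[k] += x
    sumsqs[k] += x * x

means = {k: sums[k] / counts[k] for k in components}
print(counts)  # the high-variance component receives more samples
print(means)
```

Note the budget concentrates roughly in proportion to each component's variance, which is the bandit intuition behind BanditPAM as well.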

(To understand how the parameter estimates would converge to their true values, you'd likely need to use the delta method; see Appendix 3 in https://ar5iv.org/pdf/2212.07473.pdf for an example)

Is that what you had in mind?


Definitely possible, but it would require some extensions to the algorithm. More specifically, as new datapoints enter the stream, each one could be compared against the existing medoids to see whether swapping it in would lower the clustering loss.
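
A naive sketch of that swap check (hypothetical code, not part of the BanditPAM API -- a real version would use BanditPAM-style sampling instead of evaluating the loss exactly on every point):

```python
# Streaming medoid update sketch: when a new point arrives, try swapping it in
# for each current medoid and keep the swap that most lowers the clustering
# loss. Naive O(k * n) per point; illustrative only.

def loss(points, medoids, dist):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def stream_update(points, medoids, new_point, dist):
    points = points + [new_point]
    best_medoids, best_loss = medoids, loss(points, medoids, dist)
    for i in range(len(medoids)):
        candidate = medoids[:i] + [new_point] + medoids[i + 1:]
        cand_loss = loss(points, candidate, dist)
        if cand_loss < best_loss:
            best_medoids, best_loss = candidate, cand_loss
    return points, best_medoids

# Toy usage with 1-D points and absolute difference as the dissimilarity
dist = lambda a, b: abs(a - b)
points, medoids = [0.0, 0.1, 0.2], [0.0, 0.1]
points, medoids = stream_update(points, medoids, 10.0, dist)
print(medoids)  # [10.0, 0.1] -- the distant new point replaces one medoid
```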

This would be a nontrivial engineering effort and I likely won't be able to do it myself (I'm a PhD student about to graduate), but if you or your team is interested in adapting BanditPAM to the streaming setting, please feel free to reach out! My email's motiwari@stanford.edu


Where is this benchmark from? We'd be happy to run BanditPAM on these datasets and report the results.



A strength if you're strictly looking to minimize squared L2 loss from each point to its closest mean -- but for a lot of other applications, it's a weakness! As the other poster mentioned, with k-medoids you can use arbitrary loss functions and cluster exotic objects (you're not restricted to metrics on a vector space).
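
As a concrete example of clustering "exotic" objects: a medoid assignment over strings under edit distance (a plain-Python sketch, not the BanditPAM API):

```python
# Medoid-based clustering of objects with no vector-space structure: strings
# under edit (Levenshtein) distance. Exhaustive assignment step, for
# illustration only.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

words = ["kitten", "mitten", "sitting", "fitting", "bitten"]
medoids = ["kitten", "sitting"]

# Assign each word to its nearest medoid
clusters = {m: [] for m in medoids}
for w in words:
    nearest = min(medoids, key=lambda m: edit_distance(w, m))
    clusters[nearest].append(w)
print(clusters)
# {'kitten': ['kitten', 'mitten', 'bitten'], 'sitting': ['sitting', 'fitting']}
```

The medoids themselves are always real data points, so they stay interpretable even when a "mean string" wouldn't make sense.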


You got me! As @dang mentioned, once I got feedback from him I was allowed to repost and enter the second-chance pool

(My real first post was submitted too hastily without receiving @dang's feedback)


Hey, thanks! Shout out to Daniel @ HN who gave me a lot of great feedback on how to make this post better.


Thanks for the bug report and repro steps! I've filed this issue at https://github.com/motiwari/BanditPAM/issues/244 on our repo.

I suspect that this is because the scikit-learn implementation of KMeans subsamples the data and uses some highly optimized data structures for larger datasets. I've asked the team to look into how we can use some of those techniques in BanditPAM, and I'll update the GitHub repo as we learn more and improve our implementation.


Right! Though to clarify a few nits: we use successive elimination instead of successive halving. And we talk about the Maximum Inner Product Search problem (very similar to the nearest-neighbor search problem) in our follow-up work: https://ar5iv.org/abs/2212.07551


We talk exactly about clustering pictures in our blog post! https://ai.stanford.edu/blog/banditpam/


One thing that's important to note is that k-medoids supports arbitrary dissimilarity measures -- in fact, your dissimilarity need not even be a metric (it can be negative, asymmetric, violate the triangle inequality, etc.)

An implication of this is that clustering after an invertible data transformation is equivalent to clustering the raw data with a different dissimilarity measure -- namely, the one induced by composing your original measure with the transformation. So if you're willing to engineer your dissimilarity measure accordingly, you can skip the transformation entirely.
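
A minimal sketch of this equivalence (the names f and d here are illustrative, not any package's API): clustering transformed points under d gives the same assignments as clustering raw points under the induced dissimilarity d'(x, y) = d(f(x), f(y)).

```python
import math

# Equivalence of (transform, then cluster) and (cluster with an induced
# dissimilarity). f is an invertible transformation; d is the base measure.

f = math.log                      # invertible transformation
d = lambda a, b: abs(a - b)       # base dissimilarity
d_induced = lambda x, y: d(f(x), f(y))

points = [1.0, 2.0, 100.0, 200.0]
medoids = [2.0, 100.0]

# Option 1: transform the data, then assign with d
assign_transformed = [min(medoids, key=lambda m: d(f(p), f(m))) for p in points]
# Option 2: assign the raw data with the induced dissimilarity
assign_induced = [min(medoids, key=lambda m: d_induced(p, m)) for p in points]

print(assign_transformed == assign_induced)  # True -- identical by construction
```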

Without more details, it's hard to say exactly what would happen to the clustering results under custom dissimilarity measures or data transformations -- but our package supports both use cases!

