Kafka's not a message queue; it's a distributed log.
The main advantage of Kafka is that you can:
1. Write a shitload of data to it fast
2. Experience a cluster failure or degradation with low risk of data loss
3. Retain absolute ordering, with caveats, if you're careful.
4. Run large numbers of concurrent consumers without imposing significant overhead on the cluster
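On point 2, what "low risk of data loss" looks like in practice is mostly producer configuration. A minimal sketch, assuming the standard Kafka producer settings (the broker address is a placeholder; the dict keys match what confluent-kafka and the Java client accept, but which client library you hand this to is up to you):

```python
def durable_producer_config(bootstrap_servers):
    """Producer settings that favour durability over latency."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # Wait for every in-sync replica to ack the write before
        # considering it successful.
        "acks": "all",
        # Retry transient broker errors instead of dropping the record.
        "retries": 2147483647,
        # Idempotence prevents duplicates introduced by those retries
        # and preserves per-partition ordering across them.
        "enable.idempotence": True,
    }

conf = durable_producer_config("broker-1:9092")
print(conf["acks"])  # -> all
```

Without `acks=all` the broker acks as soon as the leader has the record, and a leader failure can silently lose it.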
That's pretty much it. I've made a good career the last few years out of Kafka, but there are _so_ many companies using it that don't need it, and what strikes me as passing odd is that when I've been consulting on their Kafka adoption and tell them so, they don't want to hear it.
Which is really weird IMO because I stand to make far more money by encouraging them to use it unnecessarily because...
...Kafka is a complex solution to a complicated problem, and often people using it don't have the sheer scale of data that would necessitate taking on that complexity.
> In 2009, I was processing a TB per month of 837 claims on even worse hardware.
Kafka starts to shine when you're creating X TiB of data a day, and would like not to lose it.
It's best mentally modelled as a series of rather resilient tubes. That way people can stop thinking it's just a super duper scalable message queue that they can drop in as a black box and get all the semantics they're used to from ActiveMQ / RabbitMQ etc.
Even Pulsar, which tries to emulate MQ semantics, isn't a drop-in replacement, at all. And it has even more moving parts than Kafka.
I've had the most luck at enterprises with large legacy foundations that they are desperate to modernize. These companies have tried to migrate to the cloud. They have tried building replacements for their legacy systems. They have tried paying millions of dollars to consulting companies to build a one size fits all replacement for their legacy mess. Kafka has been the fastest way to start taming some of the ridiculous amounts of data going through these systems that are so convoluted that no one knows where all the bodies are buried. I've found it gives a very good layer of separation where you can still interface with and get data from these legacy systems while also enabling modern application development with said data.
Company after company has spent years trying to just classify and normalize all of their data. These big data warehouse style environments always end up so brittle and take so much longer to get anything useful out of compared to the sort of immediate and incremental improvement you can get with a Kafka style migration away from these older systems. You're right that it's overkill for so many companies though. I'm curious to know where you've seen success with Kafka.
> I'm curious to know where you've seen success with Kafka.
Much the same as you, as a replacement for brittle data pipelines where multiple services are generating data, and then a cron moves it across a network file system, or the apps are pushing to something like ES directly, which has caused data loss when their ES wasn't configured correctly, etc. etc.
The nice thing about Kafka is that it decouples those dependencies, which gives you room to scale the ingestion side, and also allows for failures without data loss.
The one caveat, though, is that Kafka itself becomes the dependency; but in my years of using Kafka in anger (since 0.8), I've only ever encountered full cluster unavailability a few times.
Memorable ones:
* a sysop who decided that the FD limit for processes on the boxes running their self-managed Kafka should be a nice low number, which prevented Kafka from opening sufficient sockets
* a 3 node cluster where, to "prevent data loss", they were using acks=all and had configured both replicas and min insync replicas to be 3; then a broker went down, no partition could satisfy minISR, and all writes stopped. Yeah, not great, that.
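To see why that config was a foot-gun: with replication factor 3 and min.insync.replicas=3, losing a single broker drops the in-sync replica count below the minimum, and every acks=all write is rejected. A toy model of that arithmetic (not a Kafka API, just the availability math):

```python
def writes_accepted(replication_factor, min_insync_replicas, brokers_down):
    # acks=all writes succeed only while the in-sync replica count
    # stays at or above min.insync.replicas.
    in_sync = replication_factor - brokers_down
    return in_sync >= min_insync_replicas

# Their setup: RF=3, minISR=3 -> one dead broker stops all writes.
print(writes_accepted(3, 3, brokers_down=1))  # -> False

# The usual choice: RF=3, minISR=2 tolerates one broker failure
# while still guaranteeing each ack'd record lives on two replicas.
print(writes_accepted(3, 2, brokers_down=1))  # -> True
```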
* A 2.5 stretch cluster (brokers and ZKs in two AZs, tie-breaker ZK in a 3rd) suffered a gradual network partition when an AWS DC in Frankfurt overheated, and they ended up with two leaders for the same topic partition in each of their main AZs, that were still accepting writes as they could hit minISR within their own DCs. There was some other weirdness in how they interacted with the tie-breaker ZK. And when the network partition was ended, neither of them could be elected leader. Had to dump the records from both brokers from the time the partition began, roll the brokers to allow an unclean leader election, then roll them again to turn it off once the election had succeeded, and then the client got to sift through the data and figure out what to do with it - as they weren't relying on absolute ordering, it was reasonably trivial for them to just write it again IIRC.
So for absolutely-must-not-lose data: if Kafka connectivity is lost, the app should make itself unready and fail over to writing any uncommitted data to disk for a side-car to upload to S3 or similar.
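A sketch of that fallback, with the Kafka send stubbed out as any callable that may raise; the spool directory, record shape, and side-car are assumptions for illustration, not a real API:

```python
import json
import os
import tempfile

SPOOL_DIR = tempfile.mkdtemp()  # in production: a volume the side-car watches

def send_or_spool(send, record):
    """Try the normal Kafka send; on failure, persist the record to
    local disk so a side-car can upload it to S3 (or similar) later."""
    try:
        send(record)
        return "sent"
    except Exception:
        # One file per record keeps the side-car's job trivial.
        path = os.path.join(SPOOL_DIR, f"{record['id']}.json")
        with open(path, "w") as f:
            json.dump(record, f)
        return "spooled"

def broken_send(record):  # stand-in for a producer with no reachable brokers
    raise ConnectionError("all brokers down")

status = send_or_spool(broken_send, {"id": "42", "payload": "claim"})
print(status)  # -> spooled
```

The readiness flip matters too: an unready app stops taking new traffic while the spool drains, instead of accumulating unbounded uncommitted data.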
Kafka's great, but you have to code deliberately for it, which is where I saw the biggest mistakes occur - strapping Kafka to ActiveMQ with Camel and expecting Kafka to work the same as ActiveMQ...
Disclaimer: I work on NATS and am employed by Synadia, maintainer of NATS.
NATS is much more lightweight than the alternatives, has a much simpler operational story, and can run on low-resource hardware very easily.
Kafka fits a very specific use case, and NATS tries to solve a much larger set of distributed computing use cases, including micro-services, distributed KV, streaming, pub/sub, object store and more.
It’s a very different philosophy, but can be very powerful when it’s deployed as a utility inside an organization.
We’ve seen so many success stories, especially from companies moving to the edge, and we continue to see NATS used as a common transport for all kinds of use cases in finance, IoT, edge, Industry 4.0, and retail.
How robust is NATS's persistence layer? The strength of Kafka lies in its proven ability to maintain a high load while making sure you won't lose the events you're streaming.
Disclaimer: I'm a NATS maintainer working in Synadia (the company behind NATS)
It is robust and performant, with subject-based querying, HA, scalability, streams, key-value store, mirrors and at-least-once and exactly-once semantics.
Used in production on a large scale by companies across many industries.