We've been running Kubernetes (500+ containers) in production for over a year no...

contingencies · on Dec 22, 2016

Look at the list of PV types for Kube

What I see is a lot of complex network filesystems, vendor-specific solutions and gateway protocols to expensive SAN solutions, which are already chalk and cheese in terms of features and performance.

Arguably one of the best features of unix-style systems is support for arbitrary mount points, filesystem drivers and (network or local) blockstores. Storage is, essentially, a well-solved problem at the OS level. The fact that this option is marked "single node testing only – local storage is not supported in any way and WILL NOT WORK in a multi-node cluster" raises eyebrows.

By choosing to expose individual remote storage model semantics as Kube-level PV drivers instead of just leaving this to the OS, what I would argue we essentially see here is the legacy of a cluster orchestration system that came out of Google... a system optimized for large, homogenous, dynamic workloads to provide organization-internal IaaS, and not reduced feature-set systems with simpler architectural properties (eg. no multi-client network-aware filesystem locking).

I would argue that, in fact, what many people actually want is simpler, and the current pressure to use 'one size fits all' cluster orchestration systems with a high minimum bar of functionality and nodecount (read: minimum hardware investment) is misplaced. At the very least, there's some legitimacy to this line of thinking.

cookiecaper · on Dec 22, 2016

Yes. k8s is cool but it is vastly overcomplicated for the needs of the non-Googles. We've been porting my company's production infrastructure to it over the last year and while it's been fun, I don't think it's been the correct thing for us.

Since suggesting your company is not in the same class of the companies that see literally billions of unique users every day, and thus may not need such overcomplicated solutions, is sure to make your boss irate, it's a good idea to familiarize with whatever new hotness has Facebook or Google's name attached to it.

Your clueless colleagues will race each other to announce the latest Google/FB engineering blog post in Slack so they can look the smartest and then convince your boss that since your Google-dom will be upon you tomorrow, you must adopt HotNewStuff today. This impulse is behind the proliferation of Hadoop and "Big Data", containers and orchestration, and MongoDB and NoSQL. All of these are useful tools that are valuable and good as necessary, but widely abused because people who don't really know what they're doing think this will give them an out.

You'll be stuck maintaining something interesting but really not mature or production-ready like k8s for years, just about long enough for it to become smooth and stable, at which time something else will come along to repeat the cycle. :)

ownagefool · on Dec 22, 2016

Out of interest, what are you migrating from?

cookiecaper · on Dec 22, 2016

Deployment across EC2 nodes, managed with devops scripts from a few different tools and monitored with conventional monitoring solutions like Nagios/Munin. We migrated from colocated racks to that a few years back.

Personally, while there is undoubtedly a convenience factor with being pure EC2 and a cool factor with k8s, I think 80% of our stuff would be better off in the racks (which included a couple of hypervisors, so we still had some cloud-style flexibility and could do things like auto-scaling).

ownagefool · on Dec 23, 2016

Whilst I often agree with you that we're a hype driven machine that more than often just creates more work for ourselves, I actually think kubernetes is an improvement over tying an app directly to EC2.

Obviously I imagine you know other tools better and it depends how you do it, but kube gives you a lot more by default. Arguably more importantly, I can lift and shift kubernetes and put it in any cloud or on-premise. I'm not really sure what the benefit of running a VMs would give you, other than possibly live migrations.

TheIronYuppie · on Dec 23, 2016

May I ask - what's the biggest issue you've been facing? Anything we can do to make it easier/more useful? We've found that there are a ton of things that people just end up reinventing unless it comes in the box (e.g. autoscaling, rolling deployments, roll backs, replications, aggregated logging/monitoring, etc).

Disclosure: I work at Google on Kubernetes

cookiecaper · on Dec 23, 2016

To be honest, I haven't gotten super-into-the-weeds on Kubernetes. Another guy is the main k8s guy, but I have used the cluster he's configured and deployed a few containers on it. I've also had to troubleshoot a few nodes. A lot of these complaints may be things that are already solved, but we just don't know how/where/why yet. I think we're also using a relatively "old" version of k8s (in young technology, "old" is anything more than a few months old), so some of these issues may have already been addressed.

First issue for me: the recommended way to run k8s for local testing, etc., is minikube. I run a hybrid Windows-Linux desktop env since June (full-time Linux for 10+ years before that), where Windows is the host OS and my Linux install is running as a VBox guest with raw disk passthrough. I have it configured essentially so that Windows acts like a Linux DE that can run Photoshop and play games, while I do all my real work through an SSH session to the local VM, which is my Linux install (and which I can boot into natively if desired, but dual-booting always impairs workflow, which was the reason I switched to this setup in the first place; previously, I would reboot into Windows maybe once a year even though there were games and things I wanted to try and photo editing in VMs hosted on my Linux box was painfully slow).

This means that minikube, itself dependent on VMs to spin up fake cluster members, won't work because VM hardware extensions aren't emulated through VirtualBox's fake CPU. So that's the first hurdle that has stopped me from tinkering more seriously with k8s clusters. I know there is "k8s the hard way" and stuff like that too, but it'd be really nice if we had a semi-easy way to get a test/local k8s up and running without requiring VM extensions, as I imagine (but don't actually know) most cloud rentals don't support nested VMs either.

Besides this big hurdle to starting out, many of the issues are high-level complexity things that create a barrier to entry more than things that actively get in the way of daily use once you understand them.

For example, we have 3 YAML files per service that need to be edited correctly before something can be deployed: [service]-configmap.yaml, [service]-deployment.yaml, and [service]-service.yaml. We have dozens of services deployed on this cluster, so we have hundreds of these things floating around. They're well-organized, but this alone is a headache. The specific keys have to be looked up, they have to be in the right type of configuration; if something that is supposed to be in the configmap is in the deployment file, k8s will be unhappy, the right env variable won't get set (more dangerous than it sounds sometimes), the wanted shared resource won't get mounted correctly (and my experience is that it's not always obvious when this is the case, and the mount behavior is not always consistent), or whatever. Keys must be valid DNS names, or something like that, because etcd, which runs under the covers here somewhere, doesn't accept names that would be invalid DNS entries. This means no underscores. There's nothing wrong with any of that per se, but it's a lot to wield/remember.

I also remember mostly thinking the errors related to k8s configurations and commands were unhelpful. For example, it took me a long time (a frustrating 60-90 minutes probably) to realize that `kubectl create --from-file` wasn't reading in my maps as config structures, but rather as literal strings. This seems like something that should've been made obvious through something like a warning on import ("--from-file imports your map as a literal; if you want to parse the contents and use it as a config, please use `apply -f`" (and `apply -f` means apply the config read and parsed from the file, not "apply with force", while `create --from-file` means "create a literal string as a resource instead of parsing and creating this config as a config object; however, be careful with kubectl apply, because it will try to silently merge existing configs with new values, which is sometimes helpful, and sometimes can drive you nuts if you forget about this behavior; I dunno if kubectl delete configmap my-configmap.yaml and recreate is always feasible or if that would give dependency conflicts or what?)).

To deploy, kubectl apply -f changed-yaml.yaml, which sometimes does and sometimes doesn't clean up the running pod (service configuration thing? or is it a matter of which config type I'm applying, cm, deployment, or service?), `kubectl delete pod old_pod_id` if not automatically reaped, restarting is automatic under our config after a delete which I'd guess is configurable too, then you have to `kubectl list pods | grep service_name` to get the new pod id, and `kubectl logs pod_id` to get the logs and make sure everything started up normally, though this just shows the logs output by the container's stdout, not necessarily the relevant/necessary logs. Container-level issues won't show on `kubectl logs`, but require `kubectl describe pod pod_id -o wide`.

Then you have to `kubectl exec -it pod_id /bin/whatever` to get into the right container if you need to poke around in the shell (and I know, you're not supposed to need to do this often). Side note here, tons of people trying to containerize their apps that run on Ubuntu or Debian today onto Alpine, another mostly-unnecessary distraction, and seems to result in just grabbing a random container image from Docker Registry that claims to provide a good Ruby runtime on Alpine or something without looking into the Dockerfile to confirm, which IMO is a much larger security risk than just running a full Ubuntu container.

Lots of extended options like `kubectl get pods -o [something]` are non-intuitive. I guess they're JSON pathspecs or something like that? Again, that probably makes sense, but it's pretty unwieldy. Often have to do `kubectl describe pod pod_id -o wide` to get useful container state detail.

When a running pod was going bananas, we had to `kubectl describe nodes`, again a long and unwieldy output format, and we have to try to decipher from the 4 numbers given there what kind of performance profile a pod is encountering. This leads us into setting resource quotas to make sure that pods on the same node don't starve each other out, which is something I know the main k8s guy has had to tinker with a lot to get reasonably workable.

Yes, we have frontend visualizers like Datadog that help smooth some of this over by giving a near-real-time graph with performance info, but there's still a lot of requisite kubectl-fu before we can get anything done. I also know that there are a ton of k8s and container ecosystem startups that claim to offer a sane GUI into all of this, but I haven't tried many yet, probably because I'm not really convinced any of this is necessary as opposed to just cool, which it undoubtedly is, but that's not how engineers are supposed to run production environments.

I mean all of this doesn't even scratch the surface, and I know they're not huge complaints, but they just speak to the complexity of this, and a reasonable person has to have some incentive to do it besides "It makes us more like Google". Haven't talked about configuring logging (which requires cooperation from the container to dump to the right place), inability to set a reliable and specific hostname for a container in a pod that will persist through deployments, YAML/JSON/etcd naming and syntax peculiarities in the deployment configs, getting load balancing right, crash recovery, pod deployments breaking bill-by-agent services like NewRelic and Datadog and making account execs mad, misguided people desparetely trying to stuff things like databases into this system that automatically throws away all changes to a container whenever it gets poked, because everything MUST be using k8s, since you already promised the boss you were Google Jr. and he will accept nothing less, and a whole bunch of other stuff.

All of this ON TOP OF the immaturity and complexity of Docker, which itself is no small beast, on top of EC2.

That's QUITE the scaffolding to get your moderate-traffic system running when, to be honest, straightforward provisioning with more conventional tooling like Ansible would be more than sufficient -- it would be downright sane!

SOOOOOOOOO ok. Again, I'm not saying there's anything wrong with how any of this is done per se, and I'm sure some organizations really do need to deal with all of this and build custom interfaces and glue code and visualizers to make it grokable and workable, and of course Google is among them as this is the third-generation orchestration system in use there. None of this should be taken as disrespectful to any of the engineers who've built this amazing contraption, because it truly is impressive. It's just not necessary for the types of deployments we're seeing everyone doing, which has nothing to do with the k8s team itself.

I'm sure that given the popularity of k8s, people will develop the porcelain on top of the plumbing and make it pretty reasonable here in the not-so-distant future (3-5 years). However, like I said in my original post in this thread, I don't think this is benefiting many of the medium-sized companies that are using it. I think, to be completely frank, most deployments are engineers over-engineering for fun and resume points. And there's nothing wrong with that if their companies want to support it, I guess. But there's no way it's necessary for non-billion-user companies unless you REALLY want to try hard to make it that way.

I could write something extremely similar to this about "Big Data". Instead of concluding with suggesting Ansible, we could conclude with suggesting just using a real SQL server instead of Hadooping it up with all of those moving parts and quirky Apache-Something-New-From-Last-Week gadgets and then installing Hive or something so you can pretend it's still a SQL database.

Is there a way to make over-engineering unsexy? That's the real problem technologists who value their sanity should be focusing on.

justinsb · on Dec 23, 2016

If you're deploying kubernetes to AWS, you should probably be using kops (but then I would say that, because I started the project. But OTOH I started it because nothing else fit the bill!)

Also, if you aren't already a member, come join us in #sig-aws on the kubernetes slack - we're a group of Kubernetes on AWS users, mostly happy - and working together to figure out the pieces where things could be better!

toomuchtodo · on Dec 23, 2016

Bookmarking this comment and yours further down thread. Very insightful. Regret I only have one upvote.

Disclaimer: DevOps going through the same process (everyone wanting to move to Kubernetes/k8 because its the new hotness/"orchestrated containers").

tazjin · on Dec 22, 2016

> Storage is, essentially, a well-solved problem at the OS level. The fact that this option is marked "single node testing only – local storage is not supported in any way and WILL NOT WORK in a multi-node cluster" raises eyebrows.

Just to clarify this a bit: Persistent volumes as an API _resource_ in Kubernetes are independent of which node a container requesting them is scheduled on, which is why it makes little sense to have a host-independent host volume.

If you have your storage sorted out on the hosts you can use a "simple" volume to mount it correctly [1]. Scheduling can also be restricted to the correct nodes with that storage by using node selectors / labels.

1: http://kubernetes.io/docs/user-guide/volumes/#hostpath

TheIronYuppie · on Dec 23, 2016

Yep, can't repeat this enough. If you've solved node storage management, you've solved k8s storage management too. Use hostpath and call it a day.

Disclosure: I work at Google on Kubernetes

chrissnell · on Dec 23, 2016

Absolutely agree. We just haven't solved node storage management. We need a better way to deal with data storage than simply using StatefulSets and tying database cluster members to a given Kube node.

That's what I'm hoping for in 2017.

thockingoog · on Dec 23, 2016

You can either use local disk (thereby tying to a node) or network disk (fully supported, but apparently not good enough in any number of dimensions) or local+replication which kube does NOT currently solve cleanly.

The model I want to see is fast network-centric access to replicated local data. Some vendors are pushing into this space now. 2017 will be exciting.

thockingoog · on Dec 23, 2016

It "WILL NOT WORK" because we need to use additional information for scheduling against a PV that uses local data. That work isn't done because we are still trying to find the right balance of API for local vs durable.

It HAPPENS to work in a single node cluster because the scheduler doesn't have a whole lot of wiggle room to do the wrong thing :)

jtroyer · on Dec 22, 2016

Along with Docker's efforts, there are others working on container-based storage. This landscape[1] lists Ceph, Gluster, but also Portworx, Minio, Diamanti, Dell EMC's Rex-Ray, and SolidFire. I think also folks like StorageOS and Supergiant and frankly the whole storage & platform industry players are running in this direction. [1] https://github.com/cncf/landscape

luka-birsa · on Dec 22, 2016

Same here. Running ~100 app servers in K8S and the rest (databases & legacy apps) as regular GCE instances with PD drives. But long term going K8S-only is instrumental for us to prevent vendor lock-in.

I really hope that storage for K8S happens this year in a form that is simpler than Gluster/Ceph/etc and preferably integrated. Right now we're using NFS and it's ok for the simple applications but I right now I wouldn't dare deploy my databases in K8S and feel good about it.

What gives me hope is the speed the K8S is moving forward and the superb experience we had thus far.

brazzledazzle · on Dec 23, 2016

Correct me if I'm wrong, but if you have the storage solved for the rest why not just use that same storage mechanism? If you're using local storage on your legacy applications and are essentially bound to the one node why couldn't you tag one node and use hostpath in kubernetes?

Since these applications are fundamentally not designed for something like kubernetes what everyone seems to be asking for is Kubernetes to provide network storage that has the good things from NFS but faster. I could be way off base here but asking for a simple and performant solution to this problem from projects like kubernetes just doesn't seem reasonable. Network storage, particularly the type that provides traditional filesystem semantics, is a hard space all on its own.

To put it another way, it sounds like people are asking kubernetes to take single node applications dependent on traditional filesystems and provide a magical performant network storage layer to make those failover seamlessly between kubernetes nodes with no (or tunable?) data loss. For those asking for this how would _you_ go about creating such a system? Just thinking through that a little should make it clear that what you're asking for isn't just a solution to a very difficult problem, it's a solution that is, I think, worth quite a lot of money to the person/group that solves it elegantly.

I would genuinely love to be proven wrong here so by all means destroy my argument with extreme prejudice.

icebraining · on Dec 23, 2016

There's ZREP [1]. It uses the snapshots feature of ZFS, which can send them over the network, to continually (but asynchronously) mirror the local filesystem to a remote machine.

It seems to provide exactly the features required (fast local access with failover), at the expense of a window of lost data (since the replication isn't synchronous and confirmed).

[1] http://www.bolthole.com/solaris/zrep/

TheIronYuppie · on Dec 23, 2016

What would it look like to be integrated? Could you just use your existing node solutions and hostpath and be set?

Disclosure: I work at Google on Kubernetes

adictator · on Dec 23, 2016

I think NFS without its issues (performance & security) would be ideal as an integrated solution.

brazzledazzle · on Dec 23, 2016

If you can solve that elegantly and safely I suggest you build a startup around it.

bogomipz · on Dec 22, 2016

Any idea how does GCE handles PD drives underneath your K8? Is it network-attacched storage? Are those NVMe drives I wonder?

bogomipz · on Dec 23, 2016

I thought this was a valid question as I am curious how you you able to run databases using K8 on GCE.

Why would somebody downvote asking a question?

blorgle · on Dec 23, 2016

This doesn't really make much sense to me.

If your systems support software level replication (Elasticsearch, Cassandra, MySQL, MongoDB all do) then why do you need persistent storage? You just need container scheduling anti-affinity and enough replicas.

You only need persistent storage for systems which don't support that replication. Ceph can certainly be deployed as performant for DB workloads.

You say "Cinder has the stench of OpenStack" but Cinder is just a Python based webapp which povides an API to arbitrary storage backends (Ceph RBD, iSCSI, NetApp ONTAP, whatever). How can it be "better now"? It doesn't provide storage on its own. If your ops team was using the default "proof of concept" LVM backend then I could see how you might get a bad impression but that just means your ops team doesn't know much about OpenStack.

Am I missing something obvious?

chrissnell · on Dec 23, 2016

Yes, you're missing a few things. Maybe not obvious, though.

First off, the replication thing. It is true that ES, C*, and Mongo replicate within their cluster mostly automatically. However, this is not without cost. It takes non-trivial amounts of network capacity, disk I/O, and CPU cycles to migrate shards from a failed (or downed) node to a newly stood-up node. Often, many GBs must be moved and for something like ES, where shard replicas reside on many different nodes, that means much of your cluster feels the impact of this. The cluster can heal, but healing isn't easy.

Why would a cluster node go down? It's not always hardware failure. CoreOS regularly self-updates and reboots itself without intervention. In a Kubernetes cluster, this is a non-event because pods are simply rescheduled elsewhere the the degradation is momentary. If we were talking about 300 GB of persistent data, though, that's a serious amount of data that will get reshuffled every time there is a node reboot, especially when you consider that an Elasticsearch cluster may span dozens of physical nodes and experience dozens of node reboots in the course of a normal day. Maybe we could hack something that would disable shard reallocation in ES (there's a setting for this) when scheduled reboots happen but that's pretty hacky. Besides, ES is just one of a number of different datastores in use at my workplace.

As for Cinder, it's reliant on OpenStack APIs which (at least as of Juno) are reliant on things like RabbitMQ. We've seen a number of OpenStack failures due to RabbitMQ partitioning and split-brained scenarios. We're also back to the disk-on-network problem again: SCSI backplane ---ethernet---> client will never be as fast as local disk.

blorgle · on Dec 23, 2016

> First off, the replication thing. It is true that ES, C*, and Mongo replicate within their cluster mostly automatically. However, this is not without cost. It takes non-trivial amounts of network capacity, disk I/O, and CPU cycles to migrate shards from a failed (or downed) node to a newly stood-up node. Often, many GBs must be moved and for something like ES, where shard replicas reside on many different nodes, that means much of your cluster feels the impact of this. The cluster can heal, but healing isn't easy.

I'm talking about replication, not sharding though. If the data is actually lost then you have to bear the penalty of re-replicating it to match your replica count regardless, there's no magic wand here to do with "persistent storage". If the data isn't actually lost (e.g. due to CoreOS automagic reboots) then you absolutely should be putting the cluster into maintenance mode until the reboots are complete.

> As for Cinder, it's reliant on OpenStack APIs which (at least as of Juno) are reliant on things like RabbitMQ. We've seen a number of OpenStack failures due to RabbitMQ partitioning and split-brained scenarios.

Still pretty confused when you mention OpenStack. Cinder doesn't rely on OpenStack APIs per se, it provides an OpenStack API (for block storage). RabbitMQ clustering has longstanding issues with partitions which are mentioned explicitly in the documentation, nothing to do with OpenStack, everything to do with Erlang MNESIA DB. Any decent OpenStack team has learned by now to use singleton RabbitMQs with a master/slave configuration loadbalancer (i.e. haproxy) in front.

> We're also back to the disk-on-network problem again: SCSI backplane ---ethernet---> client will never be as fast as local disk.

Right. But wasn't the comment about persistent storage? You're never going to have persistent storage in your k8s cluster that magically avoids that problem, so not really sure what the point is here.

parasubvert · on Dec 22, 2016

I'd suggest looking at ScaleIO from EMC which is free for unsupported use I believe. It is blazing fast, runs on bog standard Linux, and supports K8S and Cinder. It's the most impressive block storage product for high performance I've seen.

http://cloudscaling.com/blog/cloud-computing/killing-the-sto...

pas · on Dec 28, 2016

Any chance to see a comparison on recent Ceph version(s)? (Maybe with BlueStore as Ceph's backend?)

tw04 · on Dec 22, 2016

>NFS? No way. Not performant.

Really?

https://www.spec.org/sfs2008/results/res2011q4/sfs2008-20111...

And those are nothing compared to modern systems.

chrissnell · on Dec 22, 2016

Let me clarify: not performant in any way that we want to implement. The NetApp referenced in that analysis would cost as much as we've spent on both of our OpenStack and Kube clusters. NetApp is great if you're a hospital or a bank but not an internet company.

We need something built on common PC chassis, either as distributed local storage or some type of high speed interconnect.

otterley · on Dec 22, 2016

Storage is definitely one of those businesses where "fast, good, cheap: pick any two" has always applied, and still does.

rickycook · on Dec 23, 2016

and as an industry, we are tending towards fast, cheap, and you can write code to work around not having "good"

bottled_poe · on Dec 23, 2016

In my experience, if it isn't good relatively quickly, it certainly doesn't remain cheap for long.

user5994461 · on Dec 23, 2016

It's not so much about the code as having a usage that can accommodate being run on "not good".

Banks and hospitals don't.

kureikain · on Dec 22, 2016

I have lots of issue to get NFS works on some particular db as well. Such as RethinkDB and InfluxDB...

user5994461 · on Dec 23, 2016

Last I checked, databases came with numerous warning in the docs "don't run me on a network drive".

Is there any reason why you are trying this at all?

tw04 · on Dec 23, 2016

Depends entirely on the database as to if they work and why. Oracle, for instance, has a native NFS client built into the database. Modern versions of MS SQL support running on SMB3. Postgres is fine with NFS as well.

The biggest issue people have with NFS is when they roll their own. To get performance they allow asynchronous writes | but this means the write cache can potentially be lost during a power failure. Enterprise NAS systems like the NetApp referenced above have battery backed cache so writes are never acknowledged until they're in a secure medium.

brazzledazzle · on Dec 23, 2016

I think that's a good rule of thumb but you can certainly do it if you know what you're doing (which includes having the right hardware).

kureikain · on Dec 23, 2016

I don't know how to solve problem of stateful container and data storage so I tried to use a NFS for this and failed.

tw04 · on Dec 24, 2016

Not to beat the NetApp horse to death, but they've solved that problem (they just do a poor job of advertising it):

https://github.com/NetApp/netappdvp

seanp2k2 · on Dec 22, 2016

I've had many problems with NFS, but they were not performance related. Mostly FS hangs with some program waiting for IO that never comes.

cookiecaper · on Dec 22, 2016

This is the biggest issue I've had with NFS. If the mount goes stale, programs can get stuck in an uninterruptable state that takes a reboot to clear out.

otterley · on Dec 22, 2016

Just mount the file system with "soft,intr" options and you won't have that problem. Otherwise NFS is (perhaps unrealistically) optimistic that the server will come back.

bogomipz · on Dec 22, 2016

I'm in a similar boat. Does Kubernetes have a persistent local storage API story on the horizon?

The one thing I have found on Mesos that I really liked are persisted volume resources:

http://mesos.apache.org/documentation/latest/persistent-volu...

I was hoping K8 has something similar but it didn't the last time that I looked.

Edit: I just looked at your 1st link. I see the PV docs, none of those fit my use case(local) unfortunately.

jat850 · on Dec 22, 2016

Curious if you've tried gluster at all. Using Kubernetes also and about to cross the threshold of stateful data - performance is important but not critical since we're using it in a fairly low volume, low throughput way, but it will grow over time and we want to future-proof it a bit. gluster seemed like the best-case fit for us but have done no empirical testing yet.

dankohn1 · on Dec 23, 2016

Hi, I'm the executive director of CNCF (which hosts Kubernetes) and the co-author of our landscape document, which has a section on storage [0].

I'll just state the obvious that you're very much correct about stateful storage still being the most immature aspect of the Kubernetes and cloud native story, but there are definitely tons of folks working on it. And in the meantime, people are succeeding in production using last-generation solutions on bare metal or provider offerings in the cloud.

I'll make a quick pitch that if you are a cloud native end user interested in engaging with the community, please email me about joining CNCF's end user board (my email is in my profile). We just dropped the price from $50 K to $4.5 K, and it now includes 5 tickets to CloudNativeCon/KubeCon ($1.7 K and 2 tickets for startups).

[0] https://github.com/cncf/landscape

cryptica · on Dec 23, 2016

Yes I think that Kubernetes is too stubborn with regards to how it wants to decouple Pods/Containers from specific Nodes.

Many new databases (especially NoSQL ones) already support clustering and rely on being tightly coupled to specific nodes/machines in order to work. I think K8s currently makes this a bit too difficult - They need to improve support/documentation for using host storage for these kinds of clustered DBs.

NFS doesn't make sense for storing structured data because it doesn't know what the best way to partition/search your data is going to be (the directory tree structure isn't always what we need) - I think that this can only be solved at the DB layer unfortunately.

user5994461 · on Dec 23, 2016

I don't want to break your hopes but stateful containers will only ever run on GCE and AWS.

The entire existence of stateful containers depends on having network storage at hands.

The only good network storages are GCE and AWS volumes, which are proprietary trade-secret technologies only available there.

If you want to play it old-school. You can run virtual machines with VmWare on bare metal servers with SAN disks (iSCSI/Fibber Channel). The Vm can be hot migrated from an host to another. It works and it's been battle tested for almost a decade. (IMO: Docker is not only new but a toy in comparison to that).

benley · on Dec 23, 2016

If virtual machines with live migration is what you're after, vmware is not your only option. You can get largely the same effect without involving any iscsi/FC/SAN tech with another open source project, which happens to originate from the same place as Kubernetes: http://www.ganeti.org/

Ganeti has been battle-tested for about a decade too, supports an assortment of storage backends, including some clustered ones like Ceph, can do live migration between hypervisor nodes, and it's a nicely maintained Python app with some Haskell parts in it.

ganeti came out of my team at google in the mid 2000s, fwiw. I did not work on it, but I've certainly used it. It's pretty nice.

WestCoastJustin · on Dec 23, 2016

Thank you. I've used Ganeti extensively and it help wrangle our Xen/DRBD rats nest very quickly. It's an awesome project.

davidopp__ · on Dec 23, 2016

> I don't want to break your hopes but stateful containers will only ever run on GCE and AWS.

Actually Kubernetes is starting to work on support for persistent local volumes; we know the lack of this feature is a significant barrier for running some stateful applications on Kubernetes, particularly on bare metal. The concrete proposal for how we are thinking to do it is at https://github.com/kubernetes/kubernetes/pull/30044

The high-level feature requests are at https://github.com/kubernetes/features/issues/121 and https://github.com/kubernetes/kubernetes/issues/7562

(Disclosure: I work on Kubernetes at Google.)

user5994461 · on Dec 23, 2016

Having containers with local volumes is counter productive. They're just pet that can't be moved around and killed/recreated whenever you want. (Though I understand that it can be useful at times for some testing).

IMO: It's a marketing and usage problems. You should re-focus people on running exclusively stateless containers. Sell the strengths of containers, what it's good at and what it's meant to do. Containers = stateless.

Stateful containers are an hyped aberration. People barely get stateless containers working but they want to do stateful.

davidopp__ · on Dec 24, 2016

I don't understand why you say "Having containers with local volumes is counter productive." I would agree it's probably not a good architecture if you're running a huge single-node Oracle database, but it's an excellent way to run data stores like Cassandra, MongoDB, ElasticSearch, Redis, etcd, Zookeeper, and so on. Many people are already doing this, and as one large-scale real-world example, all of Google's storage systems run in containers. The first containerized applications (both at Google and in the "real world") were indeed stateless, but there's nothing fundamental about containers that makes them fundamentally ill-suited for stateful applications.

user5994461 · on Dec 31, 2016

You don't understand because you are blinded and spoiled by Google.

Go see the outside world => They have none of your internal tech and services. Stateful containers do not exist there. "Containers" means "docker" which is experimental at best.

a-robinson · on Dec 22, 2016

Yeah, I'm really looking forward to a true local storage option. I'd recommend watching https://github.com/kubernetes/kubernetes/issues/7562 and https://github.com/kubernetes/kubernetes/pull/30044 if you want to keep up with how things develop.

colanderman · on Dec 22, 2016

Check out the startup I work for, ClearSky Data [1]. We provide cloud-hosted block storage with SAN-level performance to your private datacenter and enterprise-grade durability and availability at a competitive price. I'd be glad to answer any questions you have (I'm an engineer) or point you to someone who can.

[1] http://www.clearskydata.com/

hodgesrm · on Dec 22, 2016

Given the topic you are posting on, my question would be what happens if ClearSky goes out of business? Potentially any file system hosted on your storage would just disappear, right? (And the DR capability too, if I am reading your website correctly.)

I don't mean to be negative but I'm having some trouble seeing how the ClearSky feature set justifies assuming what looks like an existential risk to business-critical data on the service. Interested in your thoughts on this.

edit: typo

chrissnell · on Dec 22, 2016

Exactly. Besides the obvious technical implications of off-site, "cloud-based" storage, what happens if ClearSky can't pay it's hosting bill? Presumably, your many TBs of data (which could take weeks to transfer) would vanish into thin air.

colanderman · on Dec 23, 2016

This is a valid concern and I am the wrong person to answer it (not being on the business side of things) but I can get you in touch with someone who can.

chrissnell · on Dec 22, 2016

We don't use public cloud. We're hosted on dedicated hardware for a number of reasons. While we're not opposed to commercial solutions, we strongly prefer open source solutions for obvious reasons (like, this ClusterHQ situation).

colanderman · on Dec 22, 2016

You don't have to. We provide iSCSI or FC access points in your private datacenter.

mst · on Dec 22, 2016

Is the hardware storing the data in-datacentre too? I think that was the key thing.

colanderman · on Dec 22, 2016

We (ClearSky) store your data off-site (except a small cache), but provide performance as if the data were on-site.

chrissnell · on Dec 22, 2016

I don't believe you. There's no way you're coming even close to local disk performance with an off-prem solution.

Feel free to prove me wrong but I don't think that this is a reasonable solution for backing storage for a database.

brazzledazzle · on Dec 23, 2016

Well it seems clear to me that they can only provide that level of performance for blocks already in their onsite cache. Presumably they're using a novel compression/de-duplication scheme and maybe prioritizing blocks using historical and/or predictive analysis to cache the right data at the right time but you can only transfer data as fast as you can transfer it. I'm guessing a full export of all of your data (that isn't cached) is going to go as fast as the line/compression allows.

colanderman · on Dec 23, 2016

Bandwidth is much less a problem for OLTP workloads than latency is. (If all you care about is bandwidth, S3 is your friend.)

With ClearSky, even for workloads that don't fit in the edge cache (which we believe are few), you'll still see single-digit millisecond random read (and write) latencies. This is made possible by our points of presence (PoPs) located in each metro area we serve. These PoPs house the lion's share of data in a private cloud, and are connected to each customer site with private lines with sub-millisecond latency.

In other words, the speed of light is very fast when it goes in a straight line with nothing in its way ;) While we do have some secret sauce in the data pipeline, it is because we own the network to the customer that we can provide the performance we do.

(Fancy PDF with a few more details here: http://cdn2.hubspot.net/hubfs/445689/2015_assets/ClearSky-Da...)

colanderman · on Dec 23, 2016

Check out this report if you don't believe me: http://www.clearskydata.com/clearsky-takes-primary-storage-t...

Several members of our team were core developers of EqualLogic (pre-Dell buyout). We have significant investment from Akamai. I promise you this is not snake oil.

chrissnell · on Dec 23, 2016

Check out this report? A link to a marketing data collection form? Come on. Link me to the PDF. Let's see the technical details of exactly how your solution is built out.

colanderman · on Dec 23, 2016

Technical details are here [1], though if entering a name and e-mail in a form is off-putting to you, I'm not sure anything could convince you to take the step of switching enterprise storage vendors.

Aside, I can't express how validating it is how much you (and others, given the downvotes) disbelieve me. It makes me quite proud to have helped develop a service considered so impossible that it is written off as black magic. It does make it hard to market the damn thing though ;)

[1] http://www.clearskydata.com/clearsky-global-storage-network-...

jlgaddis · on Dec 24, 2016

Wow.

First, this is a technical audience. Plenty (most?) of us couldn't give a shit about what marketing says. I believe about half of what comes out of the mouths of sales/marketing folks -- for good reason. We don't care what Gardner or some firm you paid to write up a report says.

Personally, I have a real dislike of sales/marketing folks and I will avoid them at all costs... so, no, I don't want to give you my name or e-mail address. I don't want your people calling me, interrupting real work. I don't want to view your webinar. I want to look at the technical details -- the facts -- and decide for myself and then, quite possibly, completely forget I ever heard about your company and go about my day.

Last, don't fool yourself. The downvotes aren't "validation" -- at all -- but, hey, go on living in your fantasy world. If it really was as awesome as you seem to think it is, it wouldn't be hard to market. To the contrary, the damn thing would sell itself and you wouldn't even need a marketing department.

colanderman · on Dec 24, 2016

Your loss buddy.

monkeywork · on Dec 22, 2016

So you are essentially doing AWS Storage Gateway then?

http://docs.aws.amazon.com/storagegateway/latest/userguide/G...

colanderman · on Dec 23, 2016

No. ClearSky provides several features that Storage Gateway does not:

* ClearSky has a dedicated private network to every customer ensuring low latency.

* ClearSky fully manages disaster recovery; you lose no data if your datacenter is destroyed.

* ClearSky provides transparent instantaneous data mobility: you can move volumes to any other data center in a matter of seconds.

* ClearSky provides and manages the edge cache.

All of the above must be provided by the customer with Amazon's offering.

cryptarch · on Dec 22, 2016

So it's stored on a public cloud?

colanderman · on Dec 23, 2016

thockingoog · on Dec 23, 2016

I get that local storage is good for perf, and everyone has some, but you HAVE to understand how disastrously bad it is for availability (unless you do replication yourself, a la Cassandra).

That said, we hear you, we're contemplating options for local-disk volumes.

X-Istence · on Dec 22, 2016

Ceph RBD is plenty fast for having a database on... was able to get better performance than fibre channel or NFS to a NetApp using Ceph, ran some nice large Oracle instances on VM's on top of OpenStack backed by Ceph RBD.

chrissnell · on Dec 22, 2016

My testing showed otherwise but I'd love to see what you've done. What sort of equipment did you use, what kind of network, and what how many IOPS did you see?

X-Istence · on Dec 23, 2016

SuperMicro has their IOPS optimized Ceph storage SKU's, that is what was used. Looks like they have updated since we purchased:

https://www.supermicro.com/solutions/storage_ceph.cfm

We went for upgraded network capacity though, 20 Gbit/sec cluster backend, 20 Gbit/sec cluster frontend...

12 * 8 TB drives, with 800 GB NVMe for the Ceph journal. Fast, large Ceph journal was key.

Total installation was about 3 PB raw, that is 1 PB useable with replication size 3. 33 Ceph OSD nodes, 3 Ceph monitor nodes and Juniper low latency switching using the QFX5100.

Full IPv6 network on both frontend/backend. 11 nodes per rack, each rack being it's own /64 routed domain. 3 racks.

I'm no longer doing contract work for the company, but last I heard they were expanding it out to 6 racks with an additional 3 PB raw capacity added on because of growing datasets.

It's an OpenStack cluster that is connected to this Ceph cluster, 40 Gbit/sec storage backend network, with 40 Gbit/sec front-end that VM's have all their traffic on. So storage and standard traffic don't mix.

The performance and IOPS even virtualized were enough that the entire company is moving their bare metal databases to VM's. I am unable to disclose IOPS or Oracle database performance due to contractual obligations unfortunately.

geertj · on Dec 22, 2016

> The options just aren't good enough.

I'd be curious to understand your storage requirements as a production Kubernetes user. What would you like to see for performance, cost and RPO/RTO?

bogomipz · on Dec 22, 2016

local NAND flash?

basch · on Dec 22, 2016

does VsphereVolume mean you can use data virtualization slash software defined storage like vSAN?

https://en.wikipedia.org/wiki/Data_virtualization

https://en.wikipedia.org/wiki/Software-defined_storage

sandGorgon · on Dec 22, 2016

curious to know what you are doing right now ? I'm planning to deploy a redis cluster on a private cloud and am wondering what to use.

do you use hostpath right now ?

chrissnell · on Dec 22, 2016

We don't put any persistent data into Kube. Everything goes into OpenStack instances (Ubuntu), orchestrated by Chef. We hate it. OpenStack SDN has been flaky, Chef is a pain and doesn't support the latest Ubuntu releases well, none of the devs or technical ops engineers like it.

It's my #1 goal for 2017: figure out persistent volumes for Kube.

logronoide · on Dec 22, 2016

Change Chef for Ansible and I have the same architecture. And you know what? I have decided to put on hold K8S until persistence is properly managed. Meanwhile, we have decided to give a chance to serverless architectures with AWS lambda.

markbnj · on Dec 22, 2016

I agree that persistence in k8s is tricky, especially at scale, but at least in our case that doesn't drive us way from the platform. Kube is awesome for services, and if we have to keep a few things on gcloud instances bolted to reliable storage for the moment that's at least less heterogeneous than what we had before kube came along. In other words I don't think you have to kube all the things to see a lot of benefit.

thickice · on Dec 22, 2016

Are you running Openstack yourself or a vendor backed version ? My understanding is deploying OpenStack in production is a nightmare. Would you mind sharing your experience (cluster size, upgrade, support etc.) ?

parasubvert · on Dec 22, 2016

Disclaimer I work at Pivotal but - I'd suggest taking a look at http://bosh.io which can handle full stack automation ; there already are quality releases for MySQL, elastic, mongo (from anynines), and Cassandra, with commercially supported releases in some cases with fully orchestrated vm and volume management on Openstack.

These are for cloud foundry nominally but (yay actually collaborating foundations) the open service broker API ( https://www.openservicebrokerapi.org/ ) allows you to hook these into Kube and (the best part) standardize how your CI/CD pipelines manage the lifecycle of these services independently of whether they're backed by Openstack today or Kubernetes tomorrow.

Persistent volumes for these sorts of services on Kube will require the new PetSet primitive to mature.

objectivefs · on Dec 22, 2016

If you have access to an object store, you can get persistent file system storage for your containers using ObjectiveFS[0]. You can run it either on the hosts or inside the containers depending on your goal.

[0]: https://objectivefs.com