This is a question I’ve been toying with off and on since Kubernetes was first made available. It goes hand-in-glove with a similar question about microservices, but I think of them as two separate questions that happen to be related (because there are loud voices urging us to deploy microservice architectures on Kubernetes), not as one question that can be looked at from two different angles.
The thesis is that, yes, Kubernetes does actually encourage you to build fragile systems, because it does such a poor job of making it clear how traffic flows into, within, and out of the cluster. The mere existence of service meshes is proof that, while k8s is a good scheduler of work, it lacks the native ability to effectively disaggregate traffic.
First off, k8s doesn’t really allow you, in any practical or operational way, to create a single cluster that spans data centers, so you are forced to either build a system that is inherently unable to deal with data-center outages, or roll out some kind of global traffic management that directs users to a particular data center. Typically that is done with DNS and BGP tricks that were never intended by their designers (but they do work, and in any case, it’s a different topic).
Then, once you arrive at the data center, you likely have multiple clusters, so once again you have some sort of traffic management that directs users to a particular cluster. This is typically done with a device that can see the traffic through protocols in layers 4-7 of the OSI networking model, so think of next-generation firewalls, load balancers, application delivery controllers, and application layer gateways.
And, once at the cluster, you have either the Kubernetes cluster ingress or the cluster gateway, which finally routes traffic to a particular service, which in turn abstracts all of the subordinate pods behind it as a virtual endpoint.
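For orientation, that last hop looks roughly like the minimal sketch below in stock Kubernetes manifests; the names here (my-ingress, my-service, example.com, the app: my-app label, and port 8443) are illustrative, not taken from any real deployment.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: my-service     # the cluster's routing decision ends here
                port:
                  number: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app                      # every ready pod with this label becomes a backend
  ports:
    - port: 8443
      targetPort: 8443

The Service is the virtual endpoint; nothing in either object says anything about the data-center or cluster decisions that happened before the packet ever arrived.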
Why is this fragile?
First, none of the Pick functions knows anything about the other Pick functions. In an ideal world there is information shared between the Data Center Pick, the Cluster Pick, and the Service Pick that informs the decision each one makes, but there isn’t an inherent structure or data set that allows this sharing to be standardized. So any sharing that does happen is slow and beholden to API calls that amount to the picking function asking the members of its pool whether they are ready to receive traffic, and the members answering that they are. That is very slow to detect failures, very fragile because it is subject to all kinds of networking and security shenanigans, and, because of the first two, limited in how much information can be exchanged.
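Inside the cluster, that poll-and-answer loop is the readiness probe. A minimal sketch, assuming an HTTP health endpoint at /healthz and roughly default timings (the path, port, image, and numbers are all illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: api-pod
  labels:
    app: my-app
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0   # illustrative image
      ports:
        - containerPort: 8443
      readinessProbe:            # the poll: "are you ready to receive traffic?"
        httpGet:
          path: /healthz
          port: 8443
        periodSeconds: 10        # asked every ten seconds...
        failureThreshold: 3      # ...so roughly thirty seconds of failed answers
                                 # before the pod is dropped from the Service's endpoints

With numbers like these, a dead backend can keep receiving traffic for tens of seconds before the poll notices, and the Cluster Pick and Data Center Pick above it each have to repeat the same pattern with whatever health-check mechanism they happen to support.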
Second, when a pod is down, it gets removed from the roster of available destinations for the service. If you have cross-pod dependencies (for example, a pod with all the front-ends and a different pod that needs GPUs) then you can’t accurately describe the readiness of the service. But this is precisely how you have to deploy hardware-dependent applications: the GPUs may only exist on a particular node, so to get all your GPU workloads onto it, you put them all in a pod and assign a node affinity using labels, resources, or extended resources. So what you have to do is think of the Service as the Unit of Availability, give it a threshold for the amount of resources that are required, in total, to be considered “up”, and configure it to be unavailable if that threshold isn’t met. That means that if you don’t have enough vector-processor instances, because there isn’t enough hardware available to meet the needs of the vector-processor pod, the service is unavailable, and so on.
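For reference, the node-affinity half of that description looks roughly like this; the accelerator: gpu label, the image, and the nvidia.com/gpu extended resource (the name the NVIDIA device plugin commonly registers) are assumptions made for the example:

apiVersion: v1
kind: Pod
metadata:
  name: vector-processor
  labels:
    app: my-app
    mode: gpu
spec:
  nodeSelector:
    accelerator: gpu        # only schedule onto nodes carrying this label
  containers:
    - name: vector-processor
      image: registry.example.com/vector-processor:1.0   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource: only placeable where a GPU exists

Nothing in the Service fronting these pods knows that this pod is harder to replace than one of the stateless front-ends, which is why the threshold of required resources is something you have to impose yourself.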
And it means that if you have stateful applications (meaning they need to hold onto information between processing activities) or sequential processes that need to pass data between themselves to complete processing, then you have dependency trees, and a failure at the end of a branch means the entire branch is unavailable.
And, circling back to microservices, if you have deployed your microservice application as a k8s service that is able to run in two or more modes, one that has all the branches in working order and one (or more) that doesn’t, then you really need to decompose that service and redeploy it as separate Kubernetes Services, so that you aren’t running degraded services in production and calling them “OK”.
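That decomposition might look like the sketch below: one Kubernetes Service per mode, each selecting only the pods that mode needs, reusing the service-with-gpu and service-without-gpu names that appear in the examples further down. The ports and labels are illustrative.

apiVersion: v1
kind: Service
metadata:
  name: service-with-gpu
spec:
  selector:
    app: my-app
    mode: gpu            # only the GPU-backed pods
  ports:
    - port: 8443
      targetPort: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: service-without-gpu
spec:
  selector:
    app: my-app
    mode: cpu-fallback   # the fallback pods that can run without the scarce hardware
  ports:
    - port: 8443
      targetPort: 8443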
Then, you can move the mode selection from inside your service to the Service Pick itself, by asking for a prioritization of the preferred mode, and using the alternate mode as a fallback when the scarce resource isn’t available.
Can you do all of this with Kubernetes today? Sure. Is it easy? It is not. And the way Kubernetes is talked about doesn’t really leave you with the impression that this is the way to do it, which probably leads to design decisions and priorities that reinforce how hard it is to deploy your services like this.
What might an easy deployment scheme look like?
Working back to front, being able to weigh and prioritize Service Pick decisions to get the service with the best, most available resources is a start.
control my-uri:
uri: "https://example.com/api/*"
match-site-schema:
"example.com/api/openapi-v3.schema.json"
list my-service:
service-with-gpu:
priority: 10
weight: 100
service-without-gpu:
priority: 25
weight: 250
service my-service:
admit: control.my-uri
next-hop: list.my-service
Then being able to do the same with Cluster Pick decisions to get the most available cluster in the datacenter seems obvious.
service my-service:
listen "https://example.com/api/"
next-hop-strategy:
dynamic: internet-location-then-dynamic-weight
next-hop-members:
cluster-1 10 dyn 2001:db8:ae21:77dc::/64
cluster-2 10 dyn 2001:db8:ae21:7781::/64
cluster-3 10 dyn 2001:db8:ae21:7739::/64
As does using DNS to hand out dynamic responses to get the user to the closest, best datacenter.
service my-service:
listen "https://example.com"
next-hop-strategy:
dynamic: internet-location-then-dynamic-weight
next-hop-members:
datacenter-1 10 dyn "example.com AAAA IN 90 2001:db8:ae21:7700::/56"
datacenter-2 10 dyn "example.com AAAA IN 90 2001:db8:0478:2800::/56"
datacenter-3 10 dyn "example.com AAAA IN 90 2001:db8:c001:e400::/56"
As does using anycast addressing to get clients to the DNS servers themselves.
process dns:
run systemd.bind9.service
listen: 2001:db8::53
process announce-routes:
dns:
check: dns
routes neighbor 2001:db8:0478:2853::5353
local-address 2001:db8:c001:e453::5353;
local-asn 64511;
peer-asn 65536;
bgp: announce-routes.dns
And, while all this is great, the final thing we should be able to do is to tie all these together with a nice bow by having the various control points report their availability rather than passively waiting to be polled.
Again, this isn’t unlike what Kubernetes lets you do today within the cluster. It also doesn’t remove the need to perform reachability tests with synthetic monitors to detect network problems. But what it does let you do is manage your service deployment’s workload scheduling and traffic disaggregation holistically, which creates an environment that doesn’t need a hand-off between the two. With this you might be able to think about deployment more like this:
delivery global
anycast-dns:
infra.application.anycast-dns
publish:
presence @availability as otel.event
readiness @availability as otel.event
utilization @telemetry as otel.metric
errors @telemetry as otel.event
subscribe:
presence @availability to membership
readiness @availability to dynamic-weight
utilization @telemetry to dynamic-weight
datacenter-pick-service my-service:
listen: "https://example.com"
next-hop-strategy:
dynamic: internet-location-then-dynamic-weight
next-hop-members:
dynamic: membership.my-service.datacenter
datacenter-1
cluster-pick-service:
control my-uri:
uri: "https://example.com/api/*"
match-site-schema:
"example.com/api/openapi-v3.schema.json"
my-service:
admit: control.my-uri
next-hop-strategy:
dynamic: internet-location-then-dynamic-weight
next-hop-members:
dynamic: membership.my-service.cluster
cluster-1
service-pick-service:
list my-service:
service-with-gpu:
priority: 10
weight:
dynamic from dynamic-weight.my-service
members:
dynamic from membership.service-with-gpu
service-without-gpu:
priority: 25
weight:
dynamic from dynamic-weight.my-service
members:
dynamic from membership.service-without-gpu
next-hop-strategy:
priority-then-weight
next-hop:
list.my-service
Now you, the owner of the application, are able to define the delivery of the application all the way out to the anycast DNS servers. The membership of each potential next-hop is actively controlled by an affirmative signal (what you get when you send updates from the edge to the core) rather than passively controlled by a negative signal (what you get when you send probes from the core to the edge), and there is no hand-off between deploying the service in Kubernetes and getting traffic to it from the Internet. That hand-off is the subtle difference between writing software that you run on Kubernetes and building products that are delivered on the Internet.
And, I suppose, it is the persistence of, and acceptance of, that hand-off in the direction of development that Kubernetes has taken since about 2018 that makes me feel like Kubernetes encourages us to build systems that are fragile in their totality even if they are locally robust. And that’s a big “if” when you consider how complex Kubernetes has become in its short life.