
Deciphering the hype of Service Mesh

Service Mesh is not a new topic anymore. Most of us in the industry are already familiar with it, and there are tons of articles on the internet about its why and how. In my opinion, it has a significant influence on application architecture. Here's a bit of DevSecOps humor to start the discussion (it will make sense as you read along).


This is part 1 of my 3-part blog series on Service Mesh.

Part 1: Deciphering the hype of Service Mesh

Part 2: Understanding The Ingress and The Mesh components of Service Mesh.

Part 3: Understanding the observability component of Service Mesh (TBD). 


In this post, I am going to approach Service Mesh from an application architecture point of view. I will also score some of its basic features on a scale of 1 to 5, where 1 is the least important to me and 5 is the most important.

Table of contents:

  1. Common Q&As
  2. Features
  3. Implementing using Tanzu Service Mesh
  4. Conclusion


Some common questions and my short answers:

Q: Does a micro-services architecture inherently warrant a service mesh?
A: No.

Q: Is it a networking tool? Is it a security tool?
A: Yes and no. In my opinion, it provides application service discovery, application resiliency, connectivity and observability at the very least. Thus, networking and security are parts of it, but not all of it.

Q: If the application architecture is API driven, does it still need a service mesh?
A: Maybe. Unnecessarily exposing API endpoints just so an internal service can access them is, in fact, an anti-pattern. A Service Mesh can help standardise and simplify things a great deal here. To me, it absolutely makes sense to have a combination of API gateway and service mesh in the application architecture. I described this topic in this post.

Q: Does a service mesh require changes to the application source code to allow for sidecars?
A: It should not. I touched on this briefly in this post, but it deserves its own technical post.

Q: I need security for my application. Can service mesh provide that?
A: Yes. However, a similar degree of application security can be achieved without a service mesh too. In my opinion, the decision to adopt a service mesh needs to be made beyond just the security context. Read the rest of the post to understand this better.


Now let's jump right into it.

Feature #1 - mTLS:

This is the most sought-after feature of a Service Mesh, and everybody talks about it. The main use case for this feature is to secure communication between workloads. To understand how it protects workloads, see the below diagram:


Importance score: 3

Below are my reasons for scoring low:
  1. If the workloads of the application are deployed in a secured K8s cluster with appropriate security policies (such as registry policy, image signing, RBAC etc.) and deployment is only allowed through a secured supply chain or an approved CD process, then that will prevent a malicious or imposter app from getting deployed in the first place. The mTLS of a Service Mesh contributes very little in this context.

  2. If a bad actor gets access to a cluster and the cluster does not have appropriate security policies, then it is pretty easy to deploy a malicious/imposter app inside the mesh. In that case, mTLS will not be able to prevent the attack.

  3. If the workloads of the app are deployed within a single cluster with appropriate security policies and a secured supply chain, then there's very little use for mTLS.

However, mTLS becomes super effective and a critical requirement from an app security perspective if the below conditions are fulfilled:
  1. The micro-services of the app are spread across multiple clusters. In most cases they will be.
  2. Workloads are only deployed via an approved supply chain during the CI/CD process. This can be restricted using appropriate RBAC.
  3. The K8s clusters have appropriate security policies. Three minimum requirements of such policies are registry policy, signed image policy and RBAC.
If and only if the above are fulfilled correctly, then the importance score will be: 5
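
For reference, below is a minimal sketch of how strict mTLS is typically switched on in Istio with a mesh-wide PeerAuthentication policy. It assumes Istio is installed with istio-system as its root namespace; a commercial product (such as Tanzu Service Mesh) may roll this out for you instead.

```yaml
# Minimal sketch: enforce strict mTLS for every workload in the mesh.
# Assumes istio-system is the Istio root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT             # reject any plain-text traffic between sidecars
```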


Feature #2 - Service discovery:

In my opinion, this should be the number 1 feature driving the choice of a Service Mesh in an application's architecture. See the below diagram:


Importance score: 5

Below are my reasons for scoring high:
  1. Without a service mesh, there is additional overhead for service-to-service communication, whether via K8s default DNS, which follows the pattern {svc_name}.{namespace}.svc.cluster.local, or via an API gateway. It is impractical to map micro-services using the default DNS pattern as they will be developed and deployed by different teams, at different times, across different namespaces, and often spread across different clusters. If all the micro-services are deployed in the same namespace then it works, as they can be called by name, but nobody does that. In short, it is not practical and cannot be managed or scaled.
    On the contrary, with a Service Mesh this becomes easy to roll out and highly scalable. This is because a Service Mesh exposes services in the mesh and enables communication among services by service name (the control plane takes care of this out-of-the-box). See the sketch at the end of this section.

  2. I have seen API Gateways being used for exposing internal services (although that was not their original intention, it mutated over time and got out of control). In my opinion, unnecessarily creating API endpoints for discoverability, connectivity or security among internal services is an anti-pattern and scope creep for the list of APIs. This can scale, but it slows down delivery and over time leads to a huge number of unused APIs. Managing and maintaining the unnecessary APIs (eg: version control, API definitions, documentation etc.) becomes an overhead. A combination of Service Mesh and API Gateway can (and will) standardise and simplify the application and infra architecture significantly, from both development and operations perspectives, and reduce complexity and overhead by a lot.

  3. In most implementations, micro-services are spread across multiple K8s clusters. This is so that we can achieve things like workload isolation, workloads grouped by service category (eg: cache service, queue processing service etc. may be in a different cluster to inventory processing, order management etc.), or services placed in specific clusters for compliance reasons (eg: the payment processing service needs to remain on-prem for compliance). For these scenarios, a cross-cluster service mesh adds simplicity to the app and infra architecture. Without a service mesh there will be additional (and possibly unnecessary, in my opinion) overhead and complexity in exposing services or endpoints through an API Gateway, or worse, through LBs and Ingress (even if they may be internal by nature).

  4. In multi-cluster (and multi-cloud) scenarios, a Service Mesh provides security by default, leveraging its mTLS functionality to secure communications among micro-services over the network. This requires no change to the application architecture or source code and is handled by the sidecars or proxies. Without it, maintaining TLS for app security becomes super complex and non-scalable.
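
To make point 1 a bit more tangible, below is a minimal sketch of how a client workload might address a dependency without a mesh versus inside a mesh, where the short service name is discoverable from any member of the mesh (eg: a Global Namespace in Tanzu Service Mesh). The namespace payments, the environment variable name and the image are hypothetical, used only for illustration.

```yaml
# Minimal sketch (hypothetical names): service addressing with and without a mesh.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.registry.local/frontend:latest   # placeholder image
          env:
            # Without a mesh, a cross-namespace (or cross-cluster) call needs the
            # full DNS name, an API gateway or an LB endpoint wired in per environment:
            #   value: http://account.payments.svc.cluster.local
            # With the mesh, the short service name resolves wherever the service
            # is a member of the mesh:
            - name: ACCOUNT_SVC_URL
              value: http://account
```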

Feature #3 - Meshing:

I am familiar with Istio's meshing capabilities; these could be true for other tools as well. Below are 2 considerations when planning micro-services and service-to-service communication (from a meshing perspective) which, I think, are uniquely available via a service mesh (minimal sketches of both follow this list):
  1. Peer Authentication: Using peer authentication we can apply mTLS within a mesh appropriately. This allows fine-grained control over how workloads outside the mesh talk to workloads inside the mesh without compromising the mesh's integrity or burdening the application lifecycle. See below diagram:




  2. Circuit Breaker: Circuit breaking is an important consideration when planning a micro-service architecture for an application. It provides the below advantages:
    • Limits the impact of cascading failures by immediately failing the faulty service in the mesh.
    • Limits the impact of latency spikes.
    • This increases the overall uptime of an application and protects its underlying infrastructure from exhaustion.
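
As promised above, here is a minimal sketch of a namespace-scoped PeerAuthentication that overrides the mesh-wide default for a single workload. The namespace legacy and the app label below are hypothetical, used only to illustrate the fine-grained control described in point 1.

```yaml
# Minimal sketch (hypothetical namespace/labels): permissive mTLS for one legacy
# workload while the rest of the mesh stays strict.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-workload-override
  namespace: legacy
spec:
  selector:
    matchLabels:
      app: legacy-billing        # hypothetical workload label
  mtls:
    mode: PERMISSIVE             # accept both plain-text and mTLS traffic
```

And for point 2, circuit breaking in Istio is configured through a DestinationRule; below is a minimal sketch with illustrative thresholds and the hypothetical in-mesh host account:

```yaml
# Minimal sketch (illustrative thresholds): circuit breaking for the account service.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: account-circuit-breaker
spec:
  host: account                   # hypothetical in-mesh service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100       # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 5     # eject a backend after 5 consecutive 5xx responses
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```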

 

Importance score: 4

Below are my reasons for scoring high:
  1. Dev and Ops segregation. Eg: no source code change is needed to implement security restrictions or application resiliency.
  2. Fine-grained control of the mTLS mode (eg: global by applying to the istio-system namespace, overridden at the namespace level, overridden again with a workload selector) provides the flexibility to apply a security context to a mesh that fits almost 95% of use cases.
  3. App resiliency features (such as circuit breaking) are available out-of-the-box (without needing changes in the source code).

Feature #4 - Ingress, Gateway, VS etc:

This feature often gets misinterpreted. Service Meshes (like Istio, Linkerd etc.) provide these features out-of-the-box with advanced capabilities such as traffic routing for canary or blue-green deployments, A/B testing etc. But this has nothing to do with the mesh itself. I would rather think of these as the functionalities needed for a user request to "ingress into the mesh". In my opinion these are "good to have" features because I can achieve similar functionality through modern ingress controllers such as Contour, NGINX etc. without needing a Service Mesh like Istio or Linkerd. The below diagram shows ingress via an ingress controller vs ingress via a mesh gateway (Istio):


Important: There must be a carefully considered design decision on where to offload/terminate TLS for user requests. Of course, I have my opinion on this too, but that's a topic for another day. In short, below are my compelling reasons for offloading TLS at the AWS Application Load Balancer:
  • Keep the load of processing TLS off the ingress controller or gateway and shift that responsibility to the dedicated LB itself. --> This allows me to use AWS managed certs (and minimise certificate cost) and frees my K8s clusters from managing certs for user requests.

  • Attach an AWS WAF rule to the AWS ALB. --> This is the primary reason for offloading TLS at the LB.
This TLS-offload-at-the-LB scenario holds true for any other load balancer (such as VMware NSX ALB) with TLS offloading capability or operating at L7.


Importance score: 2

Below are my reasons for scoring low:
  1. I can achieve A/B testing using Contour Ingress. So, I do not need to choose a Service Mesh for it.
  2. I can achieve traffic distribution using Contour Ingress. So, I do not need Istio Gateway.
  3. I can achieve TLS offload with Contour Ingress. So, I do not need Istio Gateway.
  4. And many more...
I have looked into and tried out the most commonly sought-after ingress/gateway features of a service mesh (Istio) and was able to replicate all of them using Contour Ingress (see the sketch below). In fact, in some scenarios, it might make sense to keep an existing ingress alongside the Service Mesh to avoid changes in the deployment process.
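
To back up points 1 and 2, below is a minimal sketch of weighted traffic distribution using a Contour HTTPProxy; the host name and service names are hypothetical. A similar object with a TLS secret reference covers point 3.

```yaml
# Minimal sketch (hypothetical names): 90/10 traffic split between two versions
# of the frontend service using Contour's HTTPProxy, with no mesh involved.
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: frontend
spec:
  virtualhost:
    fqdn: app.example.com         # hypothetical host
  routes:
    - services:
        - name: frontend          # stable version
          port: 80
          weight: 90
        - name: frontend-canary   # canary version
          port: 80
          weight: 10
```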

Feature #5 - Telemetries:

Many opt in for a Service Mesh for its telemetry feature alone. A Service Mesh provides, out-of-the-box, some really good telemetry about the micro-services of an application, and this telemetry is important for SRE practices in an operational context. Telemetry includes (but is not limited to) service behaviors (such as metrics, traces, logs etc.) without needing to add anything extra to the micro-services' source code. See the below screenshot of an example app I deployed in Tanzu Service Mesh:


In this screenshot I can see:
  • the p99 telemetry
  • the failure percentage
  • a graph of the Service Mesh (eg: a user request originating from the ingress and the services involved in responding to the request).
  • there is a lot of other telemetry data I can get as well which is not shown in this screenshot (that would be another post for another day).

Importance score: 3.5

Below are my reasons for scoring medium:
  1. This feature is completely loosely coupled, in terms of value add, from an app architecture perspective. In my opinion, this by itself should not warrant service mesh adoption. We can get a similar level of telemetry (if not more) from specialised observability tools like Tanzu Intelligence, New Relic etc. But, I must admit, it is an important good-to-have feature from an application's (and its micro-services') lifecycle perspective.

Feature #6 - Enterprise products and offerings:

Now, this is where everything I have described so far comes together. There are several enterprise-grade commercial products for Service Mesh out there. These products are typically backed by an open source Service Mesh technology. For example, Tanzu Service Mesh is a commercial, enterprise-grade platform that uses Istio as its Service Mesh component. Tanzu Service Mesh takes the core capabilities of Istio and adds additional value on top of it. Below are some of my favorites:

  1. Global Namespace: This is, in my opinion, the critical component for choosing Service Mesh for applications. It extends the service mesh capabilities (discovery, connectivity, security and observability) across multiple clusters (and clouds, both public and private). I described the importance of this in features 1 and 2.
  2. Global control plane:  It provides a unified management interface for Istio deployments across multiple clusters, allowing operators to manage their Istio environment in a centralized manner.
  3. Application SLOs: These allow SLO rules to be applied to a mesh, which autoscale services (out and down) based on predictable application response times. This helps maintain the required uptime for services and improves app resiliency.
  4. API security and visibility: This is achieved through API vulnerability detection and mitigation, API baselining, and API drift detection (including API parameter and schema validation). API security is monitored using API discovery, security posture dashboards, and rich event auditing.
  5. PII segmentation and detection: PII data is segmented using attribute-based access control (ABAC) and is detected via proper PII data detection and tracking, and end-user detection mechanisms.
Image source: Tanzu (by Broadcom) documentation.

See more details in the official documentation here.

Implementing Service Mesh using Tanzu Service Mesh:

I used Tanzu Service Mesh (backed by Istio) to implement service mesh capabilities across multiple K8s clusters and deployed my application into the mesh. See the below diagram:


Here's how I implemented the above:
  1. The 3 micro-services in the diagram are part of a larger application (I am using only 3 here for brevity). The functionality these 3 micro-services provide: display/accept account information only if the username is verified based on a few different conditions. 2 of the micro-services (frontend and account service) are written in such a way that they take the service names from environment variables (eg: http://account/get).
  2. I am using the SaaS orgs of Tanzu Mission Control (aka TMC) and Tanzu Service Mesh (aka TSM). My TMC platform is also integrated with my TSM platform.
  3. I created 2 EKS clusters using TMC. Then, from the TMC console, I rolled out the Istio service mesh through TSM. Here, TSM manages Istio and maintains the cross-cluster mesh capability. It deployed an AWS NLB per cluster (2 in total) for meshing purposes. For service-to-service communication across clusters it uses the NLBs to do TLS passthrough for the mTLS connections between services. TSM does not handle or keep any data communicated among services (across clusters); it simply keeps track of (and stores) the K8s and associated LB addresses.
  4. I deployed the micro-services as usual, with an additional annotation in the deployment definition for the service mesh: sidecar.istio.io/inject: 'true'. This prompted Istio to inject sidecars into the deployed pods. The frontend micro-service also needed to accept user requests, so I deployed a Gateway and a VirtualService for the frontend micro-service only. The account and verification micro-services do not require user request ingress from outside (they only accept requests from other micro-services), so they only have service definitions (a sketch of these objects follows after this list). In reality, I did not write the yaml files by hand or deploy them manually; they were auto-generated and auto-deployed by Tanzu Application Platform as part of my CI and CD process, which deserves its own post.
  5. When TSM deploys Istio, its default behavior is to use an NLB. But I wanted to use an AWS ALB for ingress (for the reasons specified in the Feature #4 section). For this, I needed to deploy another Kind: Service of Type: NodePort in the istio-system namespace and point the AWS ALB to it (eg: associate the target group with the exposed ports); see the sketch after this list.
  6. I applied security policies through TSM console to my mesh in order to achieve zero trust architecture. These are the enterprise features of TSM. Below are some important ones: 
    • PeerAuthentication globally and at pod level.
    • Global Namespace level service group isolation. This is all about defining which services are allowed to communicate, enforcing workload isolation in a micro-service landscape.
    • Several threat detection policies, such as protocol attacks, injection, session fixation, data leakage etc.
Here's a screenshot of some of the deployment definitions:
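
Since the screenshot may be hard to read, below is a minimal sketch of what the frontend's mesh-facing definitions could look like based on step 4. Only the sidecar.istio.io/inject annotation and the resource kinds come from the steps above; the host name, ports, labels and image are hypothetical.

```yaml
# Minimal sketch (hypothetical host/labels/image): frontend Deployment with sidecar
# injection, plus the Gateway and VirtualService that let user requests ingress
# into the mesh. The account and verification micro-services would only need a
# plain Service definition.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
      annotations:
        sidecar.istio.io/inject: 'true'   # ask Istio to inject the sidecar proxy
    spec:
      containers:
        - name: frontend
          image: example.registry.local/frontend:latest   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: frontend-gateway
spec:
  selector:
    istio: ingressgateway           # default Istio ingress gateway workload
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "app.example.com"         # hypothetical host
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
    - "app.example.com"
  gateways:
    - frontend-gateway
  http:
    - route:
        - destination:
            host: frontend
            port:
              number: 8080
```

And for step 5, a minimal sketch of the extra NodePort Service in the istio-system namespace that an AWS ALB target group could point at (the nodePort value below is an arbitrary example):

```yaml
# Minimal sketch (arbitrary nodePort): expose the Istio ingress gateway pods via
# NodePort so an AWS ALB target group can front them.
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway-nodeport
  namespace: istio-system
spec:
  type: NodePort
  selector:
    istio: ingressgateway           # targets the Istio ingress gateway pods
  ports:
    - name: http2
      port: 80
      targetPort: 8080              # ingress gateway container port for HTTP
      nodePort: 30080
```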

Conclusion:

If service mesh is such a great technology, why isn't everybody using it? Good question. Although I have described (and possibly advocated for) Service Mesh here, below are some of the reasons that, I think, make service mesh adoption challenging:
  1. Not every application is designed as micro-services. Now, I feel, there might be a chicken-and-egg situation for the applications that have already been decoupled from a monolith and adopted the API Gateway pattern, which has weirdly morphed into masking the service mesh use cases. The rest might not be "big" enough for a service mesh.
  2. Every new technology adoption inherently incurs a cost of ownership (OpEx + CapEx + ChangeEx). Although service mesh is a great piece of technology, it is no exception to this rule. Hence, the landscape of the application (or applications) needs to be "big" enough to justify the cost of a service mesh.
  3. The human element is another critical piece. For developers, a service mesh is probably going to be a good thing and a win, since they won't need to write awkward logic in the source code for service security and connectivity. However, for SREs or Ops it may be a love-and-hate scenario, as they will have to deal with things like what happens when sidecars start to play up, or what happens when the global meshing components for the cross-cluster mesh fail to deploy (which they probably will in some context over a 5-year runway). Sure, the vendors (commercial products) will offer support, but the first line of defense is the SRE or platform Ops team. So if they are not up for it, then the service mesh will just add more chaos to the mix.
So, here's what I think about service mesh in general: 
  1. Most FOSS service meshes (such as Istio, Linkerd etc.) provide only in-cluster mesh capability, which, in my opinion, does not add that much value for "big" applications. The value of a service mesh is truly realised when using the commercial offerings (such as VMware's Tanzu Service Mesh), which magnify the OSS mesh's (eg: Istio's) capability across multiple clusters (and multiple clouds) through a global control plane and augment it with advanced policy management features through a central console / control plane. So, I would only choose a service mesh for my app portfolio if I had the budget to buy the commercial offering of my favorite service mesh (eg: Istio).
  2. Before adopting it as the shiny new toy, one must weigh the cost of ownership against its architectural benefits in the context of the existing and near-future app portfolio. I do not feel that looking at it through a single lens, like security tooling, network tooling or observability tooling, is the right way to evaluate it. Hence, I scored its features, at least the ones that I think are important, from an app architecture perspective. And thus the obvious conclusion: if the benefits outweigh the cost, then Bob's your uncle. But please, keep it simple; don't overbake the bake-off.
That's it. Thanks for reading. 

Happy meshing.

