Spring Retry - A Way To Handle Failures


In the microservices world, services talk to each other, and one common style of communication is synchronous. However, in cloud environments we cannot completely avoid network glitches or brief outages (a restart or a crash lasting no more than a few seconds). When clients need real-time data and a downstream service stops responding momentarily, users may be impacted, so you would like a retry mechanism. There are many options in Java for this; in this blog, I am going to talk about Spring Retry. We will build a small application and see how Spring Retry works. Before we start, let's first understand a few basics about the Retry pattern:
  • Retry should be attempted only if you believe it will actually satisfy the requirement; don't apply it to every use case. It is not something you build right from the beginning, but something you add based on learnings during development or testing. For example, suppose you find while testing that hitting a resource works once, gives a timeout error the next time, and works fine again when hit after that. If, after checking with the downstream system's team, no root cause or solution can be found, you might want to build a retry feature on your application's side. But the first attempt should be to fix the problem downstream; don't jump quickly to building a solution at your end.
  • Retries can cause resource clogging and make things even worse, preventing the application from recovering, so the number of retries has to be limited. Start with a minimum count, e.g. 3, and avoid going beyond 5 or so.
  • Retry should not be done for every exception; it should be coded only for particular types of exceptions. For example, instead of putting retry code around Exception.class, do it for SQLException.class.
  • Retries can cause multiple threads to access the same shared resource, and locking can become a big issue. So an exponential backoff algorithm should be applied to progressively increase the delay between retries until a maximum limit is reached.
  • While applying retries, idempotency has to be handled: triggering the same request again should not create a duplicate transaction in the system.
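Before bringing in Spring Retry, the principles above (retry only a specific exception, cap the attempts, back off exponentially) can be sketched in plain Java. This is only an illustrative sketch; the method and parameter names here are my own:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public class SimpleRetry {
    // Retries the call only on SQLException, up to maxAttempts,
    // doubling the wait between attempts (exponential backoff).
    public static <T> T withRetry(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (SQLException e) {      // only this exception type is retried
                if (attempt >= maxAttempts) {
                    throw e;                // attempts exhausted: give up
                }
                Thread.sleep(delay);
                delay *= 2;                 // back off before the next attempt
            }
        }
    }
}
```

A call like `withRetry(() -> billingCall(), 3, 100)` would try the flaky call at most three times, waiting 100 ms and then 200 ms between attempts.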
Now, let's build a simple service showcasing how Spring Retry implements retries.
Pre-Requisites
  • Spring Boot 2.1.x
  • Gradle
  • Visual Studio Code/Eclipse
Gradle Dependencies
Spring Retry uses Spring AOP internally, so spring-boot-starter-aop also needs to be added as a dependency.
dependencies {
    implementation('org.springframework.boot:spring-boot-starter-web')
    implementation('org.springframework.retry:spring-retry')
    implementation('org.springframework.boot:spring-boot-starter-aop')

    testImplementation('org.springframework.boot:spring-boot-starter-test')
}

Enable Retry
Put the @EnableRetry annotation on the Spring Boot main class.
@EnableRetry
@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
Put Retryable Logic in Service
The @Retryable annotation has to be applied to a method that needs retry logic. In this code, I have added a counter variable to show in the logs how many times the retry logic is attempted.
  • You can configure which exceptions should trigger a retry.
  • You can also define how many retry attempts to make; the default is 3 if you don't specify it.
  • The @Recover method is called once all the retry attempts are exhausted and the service still throws the exception (SQLException in this case). The recover method should implement the fallback behaviour for those requests.
@Service
public class BillingService {
    private static final Logger LOGGER = LoggerFactory.getLogger(BillingService.class);
    private int counter = 0;

    @Retryable(value = { SQLException.class }, maxAttempts = 3)
    public String simpleRetry() throws SQLException {
        counter++;
        LOGGER.info("Billing Service Failed " + counter);
        throw new SQLException();
    }

    @Recover
    public String recover(SQLException t) {
        LOGGER.info("Service recovering");
        return "Service recovered from billing service failure.";
    }
}
Create REST Endpoint To Test
This controller is created just to hit the BillingService simpleRetry() method.
@RestController
@RequestMapping(value="/billing")
public class BillingClientService {

    @Autowired
    private BillingService billingService;
    @GetMapping
    public String callRetryService() throws SQLException {
        return billingService.simpleRetry();
    }
}
Launch the URL And See the Logs
http://localhost:8080/billing
2018-11-16 09:59:51.399  INFO 17288 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 1
2018-11-16 09:59:52.401  INFO 17288 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 2
2018-11-16 09:59:53.401  INFO 17288 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 3
2018-11-16 09:59:53.402  INFO 17288 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Service recovering
The logs show that it tried the simpleRetry method 3 times and then routed to the recover method.
Apply BackOff Policy
Now, as we discussed above, back-to-back retries can cause resource locking, so we should add a backoff policy to create a gap between retries. Change the BillingService simpleRetry method as below:
@Retryable(value = { SQLException.class }, maxAttempts = 3, backoff = @Backoff(delay = 5000))
public String simpleRetry() throws SQLException {
    counter++;
    LOGGER.info("Billing Service Failed " + counter);
    throw new SQLException();
}

The logs below show a 5-second gap between retries.
2018-11-17 23:02:12.491  INFO 53392 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 1
2018-11-17 23:02:17.494  INFO 53392 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 2
2018-11-17 23:02:22.497  INFO 53392 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Billing Service Failed 3
2018-11-17 23:02:22.497  INFO 53392 --- [nio-8080-exec-1] c.e.springretrydemo.BillingService       : Service recovering
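Note that `@Backoff` also has a `multiplier` attribute for an exponential policy, e.g. `@Backoff(delay = 1000, multiplier = 2)`. As a rough sketch of how the waits grow under such a policy (illustrative arithmetic only, not Spring Retry's internal code):

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffDelays {
    // Wait before each retry for an exponential policy like
    // @Backoff(delay = initial, multiplier = m), capped at maxDelay.
    public static List<Long> delays(long initial, double multiplier, long maxDelay, int retries) {
        List<Long> result = new ArrayList<>();
        long d = initial;
        for (int i = 0; i < retries; i++) {
            result.add(d);
            d = (long) Math.min(d * multiplier, maxDelay);  // grow, but respect the cap
        }
        return result;
    }
}
```

So `delays(1000, 2.0, 10000, 4)` yields waits of 1, 2, 4, and 8 seconds between attempts.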

So this is how Spring Retry works.

To access the full code, visit https://github.com/RajeshBhojwani/spring-retry

There are many other options for applying the Retry pattern in Java. Some are as follows:
  • AWS SDK - applicable only if you are using AWS services and AWS API Gateway through the AWS SDK APIs.
  • Failsafe - a lightweight, zero-dependency library for handling failures, designed to be as easy to use as possible, with a concise API for everyday use cases and the flexibility to handle everything else.
  • Java 8 functional interfaces
That's all for this blog. Let me know which libraries you are using in your microservices to handle failures and retries.

Design Patterns for Microservice-To-Microservice Communication


In my last blog, I talked about design patterns for microservices. Now, I want to dive deeper into the most important pattern in microservice architecture: inter-communication between microservices. I still remember that when we developed monolithic applications, communication was a tough task; we had to carefully design the relationships between database tables and map them to object models. In the microservices world, we have broken applications down into separate services, which creates a mesh of communication between them. Let's talk about the communication styles and patterns that have evolved so far to address this.
Many architects divide inter-communication between microservices into synchronous and asynchronous interaction. Let's take them one by one.

Synchronous -

When we say synchronous, it means the client makes a request to the server and waits for its response; the thread is blocked until the communication comes back. The most common protocol for implementing synchronous communication is HTTP, which can be implemented with REST or SOAP. Recently, REST has been picking up rapidly for microservices and winning over SOAP, but to me both are fine to use. Now let's talk about different flows/use cases in the synchronous style, the issues we would face, and how to resolve them.
  1. Let's start with a simple one: Service A calls Service B and waits for a response with live data. This is a good candidate for the synchronous style, as there are not many downstream services involved. You would not need any complex design pattern for this use case, except load balancing if multiple instances are used.

  2. Now, let's make it a little more complicated: Service A makes calls to multiple downstream services, like Service B, Service C, and Service D, for live data.
  • Services B, C, and D have to be called sequentially - this happens when the services depend on each other to retrieve data, or when the functionality involves a sequence of events executed through these services.
  • Services B, C, and D can be called in parallel - this happens when the services are independent of each other, or when Service A plays an orchestrator role.
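As a minimal sketch of the parallel case, with stubbed suppliers standing in for the real HTTP calls to Services B, C, and D:

```java
import java.util.concurrent.CompletableFuture;

public class Orchestrator {
    // Service A fans out to B, C, and D in parallel and combines the results.
    public static String callAll() {
        CompletableFuture<String> b = CompletableFuture.supplyAsync(() -> "B-data");
        CompletableFuture<String> c = CompletableFuture.supplyAsync(() -> "C-data");
        CompletableFuture<String> d = CompletableFuture.supplyAsync(() -> "D-data");
        // join() waits for each call; total latency is roughly the slowest
        // call, not the sum, which is the point of calling in parallel.
        return CompletableFuture.allOf(b, c, d)
                .thenApply(v -> b.join() + "|" + c.join() + "|" + d.join())
                .join();
    }
}
```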
These scenarios introduce complexity into the communication. Let's discuss the issues one by one.
  a. Tight coupling - Service A is tightly coupled to Services B, C, and D: it has to know each service's endpoint and credentials.
     Solution - The Service Discovery pattern solves this kind of issue. It decouples consumers and producers by providing a lookup feature: Services B, C, and D register themselves as services. Service discovery can be implemented on the server side or the client side. On the server side, tools like AWS ALB and NGINX accept requests from the client, discover the service, and route the request to the identified location. On the client side, there is the Spring Eureka discovery service. The real benefit I see in Eureka is that it caches the available service information on the client side, so even if the Eureka server goes down for some time, it does not become a single point of failure. Besides Eureka, other service discovery tools like etcd and Consul are also widely used.
  b. Distributed systems - if Services B, C, and D have multiple instances, Service A needs to know how to load balance across them.
     Solution - Load balancing generally goes hand in hand with service discovery. On the server side, AWS ALB can be used; on the client side, Ribbon along with Eureka can do the same.
  c. Authentication/filtering/handling protocols - Services B, C, and D may need to be secured and require authentication, may need to let through only certain requests, or may speak different protocols than Service A and the other services.
     Solution - The API Gateway pattern helps resolve these issues. It can handle authentication and filtering, and can convert protocols, for example from AMQP to HTTP. It can also help enable observability, with features like distributed logging, monitoring, and distributed tracing. Apigee, Zuul, and Kong are some well-known tools for this. Note that I would suggest this pattern only if Services B, C, and D are part of managed APIs; otherwise, an API Gateway is overkill. Read further down about Service Mesh for an alternative solution.
  d. Handling failures - if any of Services B, C, or D is down and Service A can still serve the client's request with a subset of features, it has to be designed accordingly. Another problem: suppose Service B is down but all requests keep calling it, exhausting resources because it is not responding. That can bring the whole system down, and Service A will not be able to send requests to C and D either.
     Solution - The Circuit Breaker and Bulkhead patterns address these concerns. The Circuit Breaker pattern detects when a downstream service has been down for a certain time and trips the circuit to avoid sending calls to it; after a defined period it checks again whether the service is back up, and if so, closes the circuit to resume the calls. This really helps avoid network clogging and resource exhaustion. The Bulkhead pattern isolates the resources used for each service, which helps prevent cascading failures. Spring Cloud Hystrix does this job: it applies both the Circuit Breaker and Bulkhead patterns.
  e. Microservice-to-microservice network communication - an API Gateway is generally used for managed APIs, where it handles requests from UIs or other consumers, makes downstream calls to multiple microservices, and responds back. But when a microservice wants to call another microservice in the same group, an API Gateway is overkill and not meant for that purpose. So each individual microservice ends up taking responsibility for network communication, security authentication, handling timeouts, handling failures, load balancing, service discovery, monitoring, and logging. That is too much overhead for one microservice.
     Solution - The Service Mesh pattern handles these kinds of NFRs. It can offload all the network functions we discussed above: a microservice no longer calls another microservice directly but goes through the service mesh, which handles the communication with all these features. The beauty of this pattern is that you can concentrate on writing business logic in any language - Java, NodeJS, Python - without worrying about whether the language has support for all the network functions. Istio and Linkerd are picking up to address these requirements. The only thing I don't like about Istio is that it is limited to Kubernetes as of now.
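The circuit-breaker behaviour described in point (d) boils down to a small state machine: closed, open, and half-open. A minimal, illustrative sketch (real libraries like Hystrix add thread isolation, metrics, and configuration on top):

```java
public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int failures = 0;
    private long openedAt = -1;   // -1 means the circuit is closed

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // True if a call may go out: the circuit is closed, or it has been open
    // long enough that a trial ("half-open") call is allowed through.
    public boolean allowCall(long now) {
        if (openedAt < 0) return true;
        return now - openedAt >= openMillis;
    }

    public void recordFailure(long now) {
        failures++;
        if (failures >= failureThreshold) openedAt = now;  // trip the circuit
    }

    public void recordSuccess() {
        failures = 0;
        openedAt = -1;                                     // close it again
    }
}
```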

Asynchronous -

When we talk about asynchronous, it means the client makes a request to the server, receives an acknowledgment that the request was received, and forgets about it; the server processes the request and completes it. Let's now talk about when you would need the asynchronous style. If you have a read-heavy application, the synchronous style might be a good fit, especially when it needs live data. However, when you have write-heavy transactions and you can't afford to lose data records, you may want to choose asynchronous, because if a downstream system is down and you keep sending synchronous calls to it, you will lose requests and business transactions. The rule of thumb: never use async for live data reads, and never use sync for business-critical write transactions unless you need the data immediately after the write. You need to choose between availability of the data records and strong consistency of the data.
There are different ways to implement the asynchronous style:
  1. Messaging - in this approach, the producer sends messages to a message broker, and the consumer listens to the broker to receive messages and process them accordingly. Here too, there are two patterns: one-to-one and one-to-many. Some of the complexity that the synchronous style brings is eliminated by default in the messaging style. For example, service discovery becomes irrelevant, as both consumer and producer talk only to the message broker. Load balancing is handled by scaling up the messaging system, and failure handling is mostly built into the message broker. RabbitMQ, ActiveMQ, and Kafka are the best-known messaging solutions on cloud platforms.
  2. Event-driven - event-driven looks similar to messaging, but it serves a different purpose. Instead of plain messages, the producer sends event details to the message broker along with a payload. Consumers identify what the event is and decide how to react to it. This gives looser coupling. Different types of payload can be passed:
  • Full payload - contains all the data related to the event that the consumer needs in order to take further action, but this makes the coupling tighter.
  • Resource URL - just a URL to a resource that represents the event.
  • Event only - no payload is sent; the consumer knows, based on the event name, how to retrieve the relevant data from other sources such as a database or queues.
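On the consumer side, reacting by event name can be sketched as a simple dispatch table (the event names and handler wiring here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class EventDispatcher {
    private final Map<String, Consumer<String>> handlers = new HashMap<>();

    // Register a handler for an event name.
    public void on(String eventName, Consumer<String> handler) {
        handlers.put(eventName, handler);
    }

    // The consumer reacts based on the event name; the payload may be full
    // data, a resource URL, or empty, per the variants above.
    public boolean dispatch(String eventName, String payload) {
        Consumer<String> h = handlers.get(eventName);
        if (h == null) return false;   // unknown event: ignore it
        h.accept(payload);
        return true;
    }
}
```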
There are other styles, like choreography, which is used to roll back transactions in case of failures. Personally, I don't like it: it is too complicated to implement, and I have only seen it work well alongside the synchronous style.
That's all for this blog. Let me know your experience on Microservice-To-Microservice communication.

Deployment Challenges and Solutions For PCF Platform




Many users have adopted PCF as the platform to host their applications and microservices. The more you use the platform, the more issues and challenges you uncover. So today, I am going to talk about microservice deployment issues on the PCF platform.
I have been working with many teams that started breaking their monolithic applications into microservices. One team now has 20+ microservices to handle, and they need to deploy these microservices to multiple environments as part of the SDLC, e.g. Dev, System, Perf, QA, Prod, etc. To deploy a microservice to PCF, you need a manifest.yml file, and it is recommended to keep a separate manifest file for each environment. So for these 20+ microservices, they would need to maintain 20 x 5 = 100+ manifest files. Can you imagine how hard it would be to handle so many files? One manual mistake can cause a lot of damage. Let's first talk about why you need a separate manifest.yml file for each environment.

1.     In each environment, your microservice requires a different resource configuration. For example, in Dev you might need only 2 instances of the app with 512 MB of memory, but in Prod you might need 4 instances with 1 GB.

2.     Each environment might bind to a backing service with a different plan. For example, in Dev you might need only the basic/free plan of the DB service, but in Prod you would need an advanced plan with more resources, so the service instance might be different. You might want to name the service instances accordingly, so the team understands which is free and which is chargeable.

3.     There might be a few environment variables that you want to set while deploying the application, and they may vary by environment, e.g. spring.profiles.active=dev.

There might be many more reasons as well. But let's get back to how to stop the number of manifest files from growing.

PCF used to have an Inheritance feature for manifest files, where you could create a parent manifest yml file and inherit common properties from it. But this has been deprecated, and in any case it didn't reduce the number of manifest files you create, though it did reduce the content of each file significantly.

Inheritance has been replaced by Variable Substitution. This feature lets you create a template manifest file, put placeholders in it for variables, and replace the values dynamically from an external file.

1.     Let's start with a simple example. Here is a sample manifest file for a Go app.

//manifest.yml
---
applications:
- name: sample-app
  instances: ((noofinstances))
  memory: ((memory))
  env:
    GOPACKAGENAME: go_calls_ruby
  command: go_calls_ruby



2. This template file has placeholders for the instances and memory variables. Let's create another file, data.yml, which holds those variable values.
  


//data.yml
noofinstances: 2
memory: 1G





3.  Now, push the application to PCF using the cf push command, with the --vars-file argument to pass the data.yml file.

cf push -f ~/workspace/manifest.yml --vars-file ~/workspace/data.yml


The values of 'noofinstances' and 'memory' from the data.yml file will be substituted into the manifest file. That shows how to use the variable substitution feature. Now we can use this solution to replace multiple manifest files with one.
  
Step 1.
Create a Template manifest file


//manifest.yml
---
applications:
- name: sample-app
  instances: ((noofinstances))
  memory: ((memory))
  services:
    - mysql
    - newrelic
  env:
    spring.profiles.active: ((env))



Step 2.
Create a data.yml file that holds the data for every environment. I would recommend keeping all non-prod environment data in one file; for Prod, you should always have separate manifest.yml and data.yml files.


//dev env
dev_noofinstances: 2
dev_memory: 512M
dev_env: dev

//system env
system_noofinstances: 3
system_memory: 1G
system_env: system

//perf env
perf_noofinstances: 4
perf_memory: 2G
perf_env: perf

//qa env
qa_noofinstances: 3
qa_memory: 1G
qa_env: qa


Step 3.
In a CI/CD pipeline such as Bamboo or Jenkins, write a script (most PCF deployments are configured through a pipeline anyway). The script takes the environment value as input per your pipeline standards and, based on the env value ('dev', 'system', 'perf', or 'qa'), reads data.yml and retrieves all the values for that environment. (Notice that the data.yml variables use the env value as a prefix; you need to follow that convention.) The script then creates a temporary <env>_data.yml file.

e.g.

//dev_data.yml
noofinstances: 2
memory: 512M
env: dev



//system_data.yml
noofinstances: 3
memory: 1G
env: system
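The core of that script is just filtering the combined data.yml lines by the environment prefix and stripping it off. A sketch of that logic (shown in Java for consistency with the rest of this blog; in a real pipeline it would typically be a few lines of shell):

```java
import java.util.ArrayList;
import java.util.List;

public class EnvDataFilter {
    // From the combined data.yml lines, keep only the lines for the given
    // environment and strip the "<env>_" prefix, yielding <env>_data.yml content.
    public static List<String> forEnv(List<String> dataYmlLines, String env) {
        String prefix = env + "_";
        List<String> out = new ArrayList<>();
        for (String line : dataYmlLines) {
            if (line.startsWith(prefix)) {
                out.add(line.substring(prefix.length()));
            }
        }
        return out;
    }
}
```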

Step 4.
Now, as in the earlier steps, just push the application to PCF using the cf push command, passing the data file dynamically through the pipeline for each environment.


cf push -f manifest.yml --vars-file=<env>_data.yml

This solution replaces 4 manifest files with 1. You can even go one step further and club all related microservices' manifest data into one data.yml file, reducing the count further. But please note: keep the Prod data.yml file separate, so that it is not touched so frequently and manual mistakes are avoided.

That's all for this blog. To understand more about the PCF and Microservice concepts, read these blogs.


Design Patterns for Microservices




Microservice architecture has become the de facto choice for modern application development. Though it solves certain problems, it is not a silver bullet. It has several drawbacks, and when using this architecture, there are numerous issues that must be addressed. This brings about the need to learn common patterns in these problems and solve them with reusable solutions. Thus, design patterns for microservices need to be discussed. Before we dive into the design patterns, we need to understand the principles on which microservice architecture has been built:
  1. Scalability
  2. Availability
  3. Resiliency
  4. Independent, autonomous
  5. Decentralized governance
  6. Failure isolation
  7. Auto-Provisioning
  8. Continuous delivery through DevOps
Applying all these principles brings several challenges and issues. Let's discuss those problems and their solutions.

1. Decomposition Patterns

a. Decompose by Business Capability

Problem

Microservices is all about making services loosely coupled, applying the single responsibility principle. However, breaking an application into smaller pieces has to be done logically. How do we decompose an application into small services?

Solution

One strategy is to decompose by business capability. A business capability is something that a business does in order to generate value. The set of capabilities for a given business depend on the type of business. For example, the capabilities of an insurance company typically include sales, marketing, underwriting, claims processing, billing, compliance, etc. Each business capability can be thought of as a service, except it’s business-oriented rather than technical.

b. Decompose by Subdomain

Problem

Decomposing an application using business capabilities might be a good start, but you will come across so-called "God Classes" which will not be easy to decompose. These classes will be common among multiple services. For example, the Order class will be used in Order Management, Order Taking, Order Delivery, etc. How do we decompose them?

Solution

For the "God Classes" issue, DDD (Domain-Driven Design) comes to the rescue. It uses subdomains and bounded context concepts to solve this problem. DDD breaks the whole domain model created for the enterprise into subdomains. Each subdomain will have a model, and the scope of that model will be called the bounded context. Each microservice will be developed around the bounded context.
Note: Identifying subdomains is not an easy task. It requires an understanding of the business. Like business capabilities, subdomains are identified by analyzing the business and its organizational structure and identifying the different areas of expertise.

c. Strangler Pattern

Problem

So far, the design patterns we talked about were decomposing applications for greenfield, but 80% of the work we do is with brownfield applications, which are big, monolithic applications. Applying all the above design patterns to them will be difficult because breaking them into smaller pieces at the same time it's being used live is a big task.

Solution

The Strangler pattern comes to the rescue. The Strangler pattern is based on an analogy to a vine that strangles a tree that it’s wrapped around. This solution works well with web applications, where a call goes back and forth, and for each URI call, a service can be broken into different domains and hosted as separate services. The idea is to do it one domain at a time. This creates two separate applications that live side by side in the same URI space. Eventually, the newly refactored application “strangles” or replaces the original application until finally you can shut off the monolithic application.

2. Integration Patterns

a. API Gateway Pattern

Problem

When an application is broken down into smaller microservices, there are a few concerns that need to be addressed:
  1. How to call multiple microservices abstracting producer information.
  2. On different channels (like desktop, mobile, and tablets), apps need different data to respond for the same backend service, as the UI might be different.
  3. Different consumers might need a different format of the responses from reusable microservices. Who will do the data transformation or field manipulation?
  4. How to handle different types of protocols, some of which might not be supported by the producer microservice.

Solution

An API Gateway helps to address many concerns raised by microservice implementation, not limited to the ones above.
  1. An API Gateway is the single point of entry for any microservice call.
  2. It can work as a proxy service to route a request to the concerned microservice, abstracting the producer details.
  3. It can fan out a request to multiple services and aggregate the results to send back to the consumer.
  4. One-size-fits-all APIs cannot solve all the consumer's requirements; this solution can create a fine-grained API for each specific type of client.
  5. It can also convert the protocol request (e.g. AMQP) to another protocol (e.g. HTTP) and vice versa so that the producer and consumer can handle it.
  6. It can also offload the authentication/authorization responsibility of the microservice.

b. Aggregator Pattern

Problem

We have talked about resolving the aggregating data problem in the API Gateway Pattern. However, we will talk about it here holistically. When breaking the business functionality into several smaller logical pieces of code, it becomes necessary to think about how to collaborate the data returned by each service. This responsibility cannot be left with the consumer, as then it might need to understand the internal implementation of the producer application.

Solution

The Aggregator pattern helps to address this. It talks about how we can aggregate the data from different services and then send the final response to the consumer. This can be done in two ways:
1. A composite microservice will make calls to all the required microservices, consolidate the data, and transform the data before sending back.
2. An API Gateway can also partition the request to multiple microservices and aggregate the data before sending it to the consumer.
If any business logic is to be applied, a composite microservice is recommended; otherwise, the API Gateway is the established solution.

c. Client-Side UI Composition Pattern

Problem

When services are developed by decomposing business capabilities/subdomains, the services responsible for user experience have to pull data from several microservices. In the monolithic world, there used to be only one call from the UI to a backend service to retrieve all data and refresh/submit the UI page. However, now it won't be the same. We need to understand how to do it.

Solution

With microservices, the UI has to be designed as a skeleton with multiple sections/regions of the screen/page. Each section will make a call to an individual backend microservice to pull the data. That is called composing UI components specific to service. Frameworks like AngularJS and ReactJS help to do that easily. These screens are known as Single Page Applications (SPA). This enables the app to refresh a particular region of the screen instead of the whole page.

3. Database Patterns

a. Database per Service

Problem

There is a problem of how to define database architecture for microservices. Following are the concerns to be addressed:
1. Services must be loosely coupled. They can be developed, deployed, and scaled independently.
2. Business transactions may enforce invariants that span multiple services.
3. Some business transactions need to query data that is owned by multiple services.
4. Databases must sometimes be replicated and sharded in order to scale.
5. Different services have different data storage requirements.

Solution

To solve the above concerns, one database per microservice must be designed; it must be private to that service only. It should be accessed by the microservice API only. It cannot be accessed by other services directly. For example, for relational databases, we can use private-tables-per-service, schema-per-service, or database-server-per-service. Each microservice should have a separate database id so that separate access can be given to put up a barrier and prevent it from using other service tables.

b. Shared Database per Service

Problem

We have talked about one database per service being ideal for microservices, but that is possible when the application is greenfield and to be developed with DDD. But if the application is a monolith and trying to break into microservices, denormalization is not that easy. What is the suitable architecture in that case?

Solution

A shared database per service is not ideal, but that is the working solution for the above scenario. Most people consider this an anti-pattern for microservices, but for brownfield applications, this is a good start to break the application into smaller logical pieces. This should not be applied for greenfield applications. In this pattern, one database can be aligned with more than one microservice, but it has to be restricted to 2-3 maximum, otherwise scaling, autonomy, and independence will be challenging to execute.

c. Command Query Responsibility Segregation (CQRS)

Problem

Once we implement database-per-service, queries that require joined data from multiple services are no longer straightforward, as each database is private to its service. So how do we implement queries in a microservice architecture?

Solution

CQRS suggests splitting the application into two parts — the command side and the query side. The command side handles the Create, Update, and Delete requests. The query side handles the query part by using the materialized views. The event sourcing pattern is generally used along with it to create events for any data change. Materialized views are kept updated by subscribing to the stream of events.
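As a toy sketch of the split (in-memory and illustrative only; a real system would use a message broker and a durable event store):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class CqrsSketch {
    // Command side: handles writes and emits an event for every change
    // instead of updating the read model directly (event sourcing).
    public static class CommandSide {
        private final List<Consumer<String[]>> subscribers = new ArrayList<>();
        public void subscribe(Consumer<String[]> s) { subscribers.add(s); }
        public void createOrder(String id, String status) {
            String[] event = {id, status};              // the change, as an event
            subscribers.forEach(s -> s.accept(event));
        }
    }

    // Query side: a materialized view kept up to date by subscribing
    // to the stream of events.
    public static class OrderView {
        public final Map<String, String> statusById = new HashMap<>();
        public void apply(String[] event) { statusById.put(event[0], event[1]); }
    }
}
```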

d. Saga Pattern

Problem

When each service has its own database and a business transaction spans multiple services, how do we ensure data consistency across services? For example, for an e-commerce application where customers have a credit limit, the application must ensure that a new order will not exceed the customer’s credit limit. Since Orders and Customers are in different databases, the application cannot simply use a local ACID transaction.

Solution

A Saga represents a high-level business process that consists of several sub-requests, each of which updates data within a single service. Each sub-request has a compensating request that is executed when a later step fails, undoing the work already done. A saga can be implemented in two ways:
  1. Choreography — When there is no central coordination, each service produces and listens to another service’s events and decides if an action should be taken or not.
  2. Orchestration — An orchestrator (object) takes responsibility for a saga’s decision making and sequencing business logic.
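The orchestration variant can be sketched in plain Java. This is a minimal in-memory orchestrator for illustration only; the `Step` interface and names are hypothetical. Each step carries a compensating action, and the orchestrator undoes completed steps in reverse order when a later step fails.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal orchestrated-saga sketch: run steps in order; on failure,
// compensate everything already completed, newest first.
class SagaOrchestrator {
    interface Step { boolean execute(); void compensate(); }

    static Step step(java.util.function.BooleanSupplier action, Runnable undo) {
        return new Step() {
            public boolean execute() { return action.getAsBoolean(); }
            public void compensate() { undo.run(); }
        };
    }

    private final Deque<Step> completed = new ArrayDeque<>();

    boolean run(Step... steps) {
        for (Step step : steps) {
            if (step.execute()) {
                completed.push(step);      // remember for possible rollback
            } else {
                // A step failed: undo completed steps in reverse order.
                while (!completed.isEmpty()) completed.pop().compensate();
                return false;
            }
        }
        return true;
    }
}
```

In a real saga the steps would be remote calls to the Order and Customer services, and compensations would themselves need to be retried until they succeed.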

4. Observability Patterns

a. Log Aggregation

Problem

Consider a use case where an application consists of multiple service instances that are running on multiple machines. Requests often span multiple service instances. Each service instance generates a log file in a standardized format. How can we understand the application behavior through logs for a particular request?

Solution

We need a centralized logging service that aggregates logs from each service instance. Users can search and analyze the logs, and configure alerts that are triggered when certain messages appear. For example, PCF has Loggregator, which collects logs from each component of the PCF platform (router, controller, Diego, etc.) along with the applications themselves. AWS CloudWatch provides similar capabilities.

b. Performance Metrics

Problem

When the service portfolio increases due to microservice architecture, it becomes critical to keep a watch on the transactions so that patterns can be monitored and alerts sent when an issue happens. How should we collect metrics to monitor application performance?

Solution

A metrics service is required to gather statistics about individual operations. It should aggregate the metrics of an application service, which provides reporting and alerting. There are two models for aggregating metrics:
  • Push — the service pushes metrics to the metrics service (e.g., New Relic, AppDynamics)
  • Pull — the metrics service pulls metrics from the service (e.g., Prometheus)
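The pull model can be sketched in plain Java. Names are illustrative; a real service would use a library such as Micrometer rather than hand-rolled counters. The service keeps counters in memory and exposes them through a method representing what a /metrics scrape endpoint would return.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal pull-model sketch: in-memory counters plus a text scrape output.
class MetricsRegistry {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    long value(String name) {
        LongAdder c = counters.get(name);
        return c == null ? 0 : c.sum();
    }

    // What a GET /metrics handler would return for the scraper to pull.
    String scrape() {
        StringBuilder sb = new StringBuilder();
        counters.forEach((k, v) -> sb.append(k).append(' ').append(v.sum()).append('\n'));
        return sb.toString();
    }
}
```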

c. Distributed Tracing

Problem

In microservice architecture, requests often span multiple services. Each service handles a request by performing one or more operations across multiple services. Then, how do we trace a request end-to-end to troubleshoot the problem?

Solution

We need a service which
  • Assigns each external request a unique external request id.
  • Passes the external request id to all services.
  • Includes the external request id in all log messages.
  • Records information (e.g. start time, end time) about the requests and operations performed when handling an external request in a centralized service.
Spring Cloud Sleuth, along with a Zipkin server, is a common implementation.
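The four requirements above can be sketched in plain Java. This is a toy model for illustration; in practice the id travels in an HTTP header (as Sleuth does) and the log lines end up in a log aggregator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Minimal trace-id propagation sketch: the edge assigns a request id once,
// every downstream call carries it, and every log line includes it.
class Tracing {
    static final List<String> logs = new ArrayList<>();

    static void log(String traceId, String service, String message) {
        logs.add("[traceId=" + traceId + "] " + service + ": " + message);
    }

    // Edge service: assign the id once per external request.
    static String handleExternalRequest() {
        String traceId = UUID.randomUUID().toString();
        log(traceId, "gateway", "received request");
        callOrderService(traceId);
        return traceId;
    }

    // Downstream service receives the same id (via an HTTP header in practice).
    static void callOrderService(String traceId) {
        log(traceId, "order-service", "creating order");
    }
}
```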

d. Health Check

Problem

When microservice architecture has been implemented, there is a chance that a service might be up but not able to handle transactions. In that case, how do you ensure a request doesn't go to those failed instances? 

Solution

With a load balancing pattern implementation, each service needs to expose an endpoint, such as /health, that can be used to check the health of the application. This API should check the status of the host, its connections to other services/infrastructure, and any service-specific logic.
Spring Boot Actuator implements a /health endpoint, and the implementation can be customized as well.
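A minimal sketch of such an endpoint in plain Java (illustrative only; Spring Boot Actuator provides this out of the box): the overall status is UP only when every registered dependency check passes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Minimal /health sketch: aggregate the host's dependency checks into one status.
class HealthCheck {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    void register(String name, BooleanSupplier check) { checks.put(name, check); }

    // What a GET /health handler would serialize as its response body.
    Map<String, String> health() {
        Map<String, String> result = new LinkedHashMap<>();
        boolean allUp = true;
        for (Map.Entry<String, BooleanSupplier> e : checks.entrySet()) {
            boolean up = e.getValue().getAsBoolean();   // e.g. ping the database
            result.put(e.getKey(), up ? "UP" : "DOWN");
            allUp &= up;
        }
        result.put("status", allUp ? "UP" : "DOWN");
        return result;
    }
}
```

The load balancer or registry then only routes traffic to instances whose overall status is UP.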

5. Cross-Cutting Concern Patterns

a. External Configuration

Problem

A service typically calls other services and databases as well. For each environment like dev, QA, UAT, prod, the endpoint URL or some configuration properties might be different. A change in any of those properties might require a re-build and re-deploy of the service. How do we avoid code modification for configuration changes?

Solution

Externalize all the configuration, including endpoint URLs and credentials. The application should load them either at startup or on the fly.
Spring Cloud Config Server provides the option to externalize properties to a Git repository (e.g., GitHub) and load them as environment properties. The application can read them on startup, or refresh them without a server restart.
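As a sketch, assuming Spring Boot 2.4+ with the spring-cloud-starter-config dependency on the classpath (the service name and config server URL below are placeholders), a client service might point at the config server like this:

```yaml
# application.yml in the client service (illustrative values)
spring:
  application:
    name: order-service            # used to look up order-service.yml in the Git repo
  config:
    import: "optional:configserver:http://localhost:8888"
management:
  endpoints:
    web:
      exposure:
        include: refresh           # allows POST /actuator/refresh to reload properties
```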

b. Service Discovery Pattern

Problem

When microservices come into the picture, we need to address a few issues in terms of calling services:
  1. With container technology, IP addresses are dynamically allocated to service instances. Every time the address changes, a consumer service can break and require manual changes.
  2. Each service URL has to be remembered by the consumer, tightly coupling the two.
So how does the consumer or router know all the available service instances and locations?

Solution

A service registry needs to be created which will keep the metadata of each producer service. A service instance should register to the registry when starting and de-register when shutting down. The consumer or router queries the registry to find the location of the service. The registry also needs to health-check the producer services to ensure that only working instances are available to be consumed through it. There are two types of service discovery: client-side and server-side. Netflix Eureka is an example of client-side discovery, and AWS ALB is an example of server-side discovery.
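The registry's core behavior can be sketched in plain Java. This is an in-memory toy; real registries like Eureka add heartbeats, health checks, and replication.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal service-registry sketch: instances register on startup,
// de-register on shutdown, and consumers look up live locations by name.
class ServiceRegistry {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    void register(String serviceName, String location) {
        instances.computeIfAbsent(serviceName, k -> new ArrayList<>()).add(location);
    }

    void deregister(String serviceName, String location) {
        List<String> list = instances.get(serviceName);
        if (list != null) list.remove(location);
    }

    // Client-side discovery: the consumer picks one of the live instances.
    List<String> lookup(String serviceName) {
        return instances.getOrDefault(serviceName, List.of());
    }
}
```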

c. Circuit Breaker Pattern

Problem

A service generally calls other services to retrieve data, and there is the chance that the downstream service may be down. There are two problems with this: first, the request will keep going to the down service, exhausting network resources and slowing performance. Second, the user experience will be bad and unpredictable. How do we avoid cascading service failures and handle failures gracefully?

Solution

The consumer should invoke a remote service via a proxy that behaves in a similar fashion to an electrical circuit breaker. When the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period, all attempts to invoke the remote service will fail immediately. After the timeout expires the circuit breaker allows a limited number of test requests to pass through. If those requests succeed, the circuit breaker resumes normal operation. Otherwise, if there is a failure, the timeout period begins again.
Netflix Hystrix is a good implementation of the circuit breaker pattern. It also helps you define a fallback mechanism that can be used when the circuit breaker trips, which provides a better user experience.
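The state machine described above can be sketched in plain Java. This is a minimal, single-threaded illustration of the pattern itself, not how Hystrix implements it; the threshold and timeout values are arbitrary.

```java
// Circuit-breaker sketch: CLOSED until consecutive failures cross a threshold,
// then OPEN (fail fast) for a timeout, then HALF_OPEN to let a test request through.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    CircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;   // timeout expired: allow a test request
                return true;
            }
            return false;                  // fail fast while the breaker is open
        }
        return true;
    }

    void recordSuccess() {                 // test request succeeded: resume normal operation
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;            // trip (or re-trip) and restart the timeout
            openedAt = System.currentTimeMillis();
        }
    }

    State state() { return state; }
}
```

A caller wraps each remote invocation: call `allowRequest()` first, return the fallback immediately if it is false, and otherwise report the outcome via `recordSuccess()` or `recordFailure()`.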

d. Blue-Green Deployment Pattern

Problem

With microservice architecture, one application can have many microservices. If we stop all the services then deploy an enhanced version, the downtime will be huge and can impact the business. Also, the rollback will be a nightmare. How do we avoid or reduce downtime of the services during deployment?

Solution

The blue-green deployment strategy can be implemented to reduce or remove downtime. It achieves this by running two identical production environments, Blue and Green. Let's assume Green is the existing live instance and Blue is the new version of the application. At any time, only one of the environments is live, with the live environment serving all production traffic. All cloud platforms provide options for implementing a blue-green deployment. For more details on this topic, check out this article.
There are many other patterns used with microservice architecture, like Sidecar, Chained Microservice, Branch Microservice, Event Sourcing Pattern, Continuous Delivery Patterns, and more. The list keeps growing as we get more experience with microservices. I am stopping now to hear back from you on what microservice patterns you are using.
