Friday, August 30, 2019

Most modern applications today are built using either serverless or container technology. However, it is often difficult to choose the one best suited to a particular requirement.

In this article, we will try to understand how these two differ from each other and in which scenarios we can use one or the other.

Let us first start with understanding the basics of Serverless and Container technology.

What is Serverless?

Serverless is a development approach that replaces long-running virtual machines with computing power that comes into existence on demand and disappears immediately after use.
Despite the name, there certainly are servers involved in running your application. It’s just that your cloud service provider, whether it’s AWS, Azure, or Google Cloud Platform, manages these servers, and they’re not always running.
Serverless aims to resolve the following issues:
  • Unnecessary charges for keeping the server up even when we are not consuming any resources.
  • Overall responsibility for maintenance and uptime of the server.
  • Responsibility for applying the appropriate security updates to the server.
  • As usage grows, we need to scale the server up, and correspondingly scale it down when usage drops.
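To make this concrete, the unit of serverless compute is just a small handler that exists only for the duration of an invocation. A minimal sketch in Python (the function name and event shape here are illustrative, not tied to any specific application):

```python
# Minimal AWS Lambda-style handler: compute exists only for the
# duration of this invocation; there is no server for us to manage.
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload; 'context' carries runtime info.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

The cloud provider spins up the environment on demand, runs this function, and tears it down afterwards; scaling, patching, and uptime are no longer our concern.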

What are Containers?

A container is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings.
Containers solve the problem of running software reliably when it is moved from one computing environment to another by essentially isolating it from its environment. For instance, containers allow you to move software from development to staging and from staging to production and have it run reliably regardless of any differences between those environments.

Comparison between Serverless vs Containers

To start with, it's worth saying that both Serverless and Containers describe architectures designed for future change and for leveraging the latest innovations in cloud computing. While many people talk about Serverless Computing vs Docker Containers, the two have very little in common: they are not the same thing and serve different purposes. First, let's go over some common points:
  1. Less overhead
  2. High performance
  3. Less interaction at the infrastructure level for provisioning.
Although Serverless is the newer of the two technologies, both have their disadvantages and, of course, benefits that make them useful and relevant. So let's review the two.

Lambda functions are short-lived: once executed, they spin down. Lambda has a timeout threshold of 15 minutes, so long-running workloads cannot run on it. However, Step Functions can be used to break long-running application logic into smaller steps (functions) and run it that way, though this might not apply to every kind of long-running application.
ECS containers, in contrast, are long-running: they can run as long as you want.
If an application has high throughput, say 1 million requests per day, Lambda will cost more than a container solution. The first reason is resource usage: such a workload needs more memory and has high execution time, and since Lambda charges based on memory and execution time, the cost multiplies. The second reason is that a single function can have a maximum of 3 GB of memory, which might not be enough to handle the throughput; it would then need concurrent executions, which may introduce latency due to cold start time.
ECS uses EC2 instances to host applications. EC2 can handle high throughput more effectively than serverless functions because it offers a variety of instance types that can be chosen to match the throughput requirement. Its cost will be comparatively lower, and latency will also be better if a single EC2 instance can handle that kind of load.
For lower throughput, Lambda is a good choice in terms of cost, performance, and time to deploy.
For lower throughput, EC2 also works very well; when comparing it with Lambda, consider the other factors described below.
Lambda has auto-scaling as a built-in feature:
  • It scales functions through concurrent executions.
  • However, there is a maximum limit (1,000 concurrent executions) at the account level.
  • Lambda horizontal scaling is very fast; however, there will be minimal latency due to cold start time.
Containers don't have any hard constraints on scaling. However:
  • We need to forecast the scaling requirements.
  • Scaling has to be designed and configured manually, or automated through scripts.
  • Scaling containers is slower than scaling Lambda.
  • The more worker nodes we have, the more maintenance problems they add, such as handling latency and throttling issues.
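As a sketch of what "designed and configured manually" looks like in practice, here is the kind of configuration an ECS service needs for target-tracking auto-scaling. The cluster and service names are hypothetical; in a real setup these dicts would be passed to boto3's "application-autoscaling" client (register_scalable_target and put_scaling_policy):

```python
# Sketch: the configuration containers need before they can auto-scale.
# With Lambda this whole step simply doesn't exist.
def ecs_scaling_config(cluster, service, min_tasks, max_tasks, target_cpu):
    resource_id = f"service/{cluster}/{service}"
    # Register the service's desired count as a scalable dimension.
    target = {
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }
    # Track average CPU toward a target value, scaling tasks in and out.
    policy = {
        "PolicyName": f"{service}-cpu-target",
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    }
    return target, policy
```

Notice that the min/max capacity values embed exactly the forecasting decision described above.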
Time to Deploy
Lambda functions are smaller in size and take significantly less time to deploy than containers: milliseconds, compared to seconds in the container case.
Containers take significant time to configure and set up initially, as they require system settings and libraries. However, once configured, they take seconds to deploy.
In a serverless architecture, infrastructure is not used unless the application is invoked, so you are charged only for the server capacity your application uses during its uptime. This can be cost-effective in scenarios such as:
  • The application is used rarely (once or twice a day).
  • The application scales up and down frequently because the request throughput changes often.
  • The application needs few resources to run. Since Lambda cost depends on memory and execution time, it always wins when compared with the cost of a container running 24 hours a day.
Containers are constantly running, and therefore cloud providers have to charge for the server space even if no one is using the application at the time.

If throughput is high, containers are more cost-effective than Lambda.

Also, while EKS charges for the cluster itself, an ECS cluster is free.
For Lambda, system security is taken care of by AWS itself; we only need to handle application-level security using IAM roles and policies. However, if Lambda has to run in a VPC, then VPC-level security has to be applied as well.
For containers, we are also responsible for applying the appropriate security updates to the server. This includes patching the OS and upgrading software and libraries.
ECS supports IAM Roles for Tasks, which is great for granting containers access to AWS resources, for example to allow containers to access S3, DynamoDB, SQS, or SES at runtime. EKS doesn't provide IAM-level security at the pod level.
Vendor Lock-in
Serverless functions bring vendor lock-in: if you need to move a Lambda function to an Azure Function, it would require significant changes at the code and configuration level.
Containers are designed to run on any cloud platform that supports container technologies, so they bring the benefit of build once, run anywhere. However, the services used for security (IAM, KMS, Security Groups, and others) are tightly coupled to AWS, so moving the workload to another platform would still need some rework.
Infrastructure Control
If a team doesn't have infrastructure skills, Lambda will be a good option. The team can concentrate on business logic development and let AWS handle the infrastructure.
With containers, we get full control of the server, OS, and network components, and can define and configure them within the limits set by the cloud provider. So, if an application or system needs fine-grained control of its infrastructure, this solution works better.
Lambda doesn't need any maintenance work, as everything at the server level is taken care of by AWS.
Containers need maintenance such as patching and upgrades, which also requires skilled resources. Keep this in mind while choosing this architecture for deployment.
State persistence
Lambda is designed to be serverless, so it will not maintain any state; it is short-lived. For this reason, we cannot rely on in-memory caching, and that may cause latency problems.
Containers can leverage the benefits of caching.
Latency & Startup Time
For Lambda, cold start and warm start times are key factors to consider, as they may cause latency and add to the cost of executing functions.
Containers, being always running, have no cold/warm start time; latency can also be reduced by using caching.
Compared with EKS, ECS doesn't have any proxy concept at the node level: load balancing is simply between the ALB and the EC2 instances, so there is no extra hop of latency.
If Lambda is deployed in a VPC, its concurrent executions are limited by the ENI capacity of the subnets.
The number of ENIs per EC2 instance ranges from 2 to 15, depending on the instance type.
In ECS, each task is assigned a single ENI, so we can have a maximum of 15 tasks per EC2 instance.
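AWS's sizing guidance for VPC-enabled Lambda (as documented at the time) estimated ENI demand as peak concurrent executions multiplied by the function's memory divided by 3 GB. A quick sketch of that arithmetic, with illustrative numbers:

```python
# ENI demand for a VPC-enabled Lambda, per AWS's pre-2019 sizing guidance:
# ENIs needed ~= peak concurrent executions * (memory_gb / 3).
def enis_needed(peak_concurrency, memory_gb):
    return peak_concurrency * (memory_gb / 3.0)

# e.g. 300 concurrent executions at 1.5 GB each:
print(f"~{enis_needed(300, 1.5):.0f} ENIs (plus free IPs in the subnets)")
```

This is why a memory-heavy function in a small subnet can hit its scaling ceiling well before the 1,000-concurrency account limit.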
Monolith Applications
Lambda is not a fit for monolithic applications; it cannot run complex applications of that type.
ECS can be used to run a monolithic application.
Testing is difficult in serverless-based web applications, as it often becomes hard for developers to replicate the backend environment locally.
Since containers run on the same platform where they are deployed, it’s relatively simple to test a container-based application before deploying it to the production.
Lambda monitoring can be done through CloudWatch and X-Ray; we need to rely on the cloud vendor to provide monitoring capabilities. However, infrastructure-level monitoring is not required in this case.
Container monitoring requires capturing availability, system error, performance, and capacity metrics in order to configure HA for the container applications.

When to use Serverless

Serverless Computing is a perfect fit for the following use-cases: 
  1. If the application team doesn't want to spend much time thinking about where its code is running and how.
  2. If the team doesn't have skilled infrastructure resources and is worried about the cost of maintaining servers and the resources the application consumes, serverless will be a great fit.
  3. If the application's traffic pattern changes frequently, serverless will handle it automatically. It will even shut down when there is no traffic at all.
  4. Serverless websites and applications can be written and deployed without the work of setting up infrastructure. As such, it is possible to launch a fully functional app or website in days using serverless.
  5. If a team needs a small batch job that can finish within the Lambda limits, it's a good fit.

When to use Containers

Containers are best to use for Application deployment in the following use cases:
  • If the team wants to use the operating system of its own choice and have full control over the installed programming language and runtime version.
  • If the team wants to use software with specific version requirements, containers are a great place to start.
  • If the team is okay with bearing the cost of big yet traditional servers for things such as Web APIs, machine-learning computations, and long-running processes, it might also want to try containers (they will cost less than full servers anyway).
  • If the team wants to develop new container-native applications.
  • If the team needs to refactor a very large and complicated monolithic application, it's better to use containers, as they suit complex applications.


In a nutshell, we learned that both technologies are good and can complement each other rather than compete. They solve different problems and should be chosen wisely. If you need help designing and architecting your application, reach out to me.

Saturday, August 17, 2019


AWS Lambda has gained good traction for building applications on AWS. But, is it really the best fit for all use cases? 
Since its introduction in 2014, Lambda has seen enthusiastic adoption - by startups and enterprises alike. There is no doubt that it marks a significant evolution in cloud computing, leveraging the possibilities of the cloud to offer distinct advantages over a more traditional model such as EC2.  
In this article, we are going to conduct a fair comparison between EC2 and Lambda covering various aspects of cloud-native features. Let's begin with a quick reminder of what these two services offer, and how they differ.

What is AWS EC2?

Amazon Elastic Compute Cloud (EC2) service was introduced to ease the provision of computing resources for developers. Its primary job is to provide self-service, on-demand, and resilient infrastructure.
- It reduces the time required to spin up a new server to minutes, from the days or weeks of work it might have taken in the on-premises world.
- It can scale up and down instantly based on the computing requirement.
- It provides an interface to configure capacity with minimal effort.
- It allows complete admin access to the servers, making infrastructure management straightforward.
- It also enables monitoring, security, and support for multiple instance types (a wide variety of operating systems, memory, and CPUs).

What is AWS Lambda?

AWS Lambda was launched to eliminate infrastructure management of computing. It enables developers to concentrate on writing the function code without having to worry about provisioning infrastructure. We don't need to do any forecasting of the resources (CPU, Memory, Storage, etc.). It can scale resources up and down automatically. It is the epitome of Serverless Architecture.
Before we start comparing the different features of both of these services, let's understand a few key things about Lambda:
- Lambda was designed to be an event-based service which gets triggered by events like a new file being added to an S3 bucket, a new record added in a DynamoDB table, and so on. However, it can also be invoked through API Gateway to expose the function code as a REST API.
- It was introduced to reduce the idle time of computing resources when the application is not being used.
- Lambda logs can be monitored the same way as EC2, through CloudWatch.
- Lambda local development is generally done using AWS SAM or Serverless Framework. They use CloudFormation for deployment.
- Unlike EC2, it is charged based on the execution time and memory used.
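To illustrate the event-based model mentioned above, here is a sketch of a handler that processes an S3 object-created notification. The bucket and key names are made up for the example; the event layout follows AWS's documented S3 notification structure:

```python
# Sketch of an event-driven Lambda: triggered when a file lands in S3.
def handle_s3_event(event, context=None):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch and process the object here (e.g. via boto3).
        processed.append(f"s3://{bucket}/{key}")
    return processed

# A trimmed-down S3 notification event, as delivered to Lambda:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "report.csv"}}}
    ]
}
```

The same handler shape applies to DynamoDB streams or API Gateway triggers; only the contents of `event` change.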

Now, let’s take a deeper look at how Lambda and EC2 differ from each other in terms of performance, cost, security, and other aspects:

Setup & Management

For setting up a simple application on EC2, first, we need to forecast how much capacity the application would need. Then, we have to configure it to spin up the Virtual Machine. 
After that, one needs to set up a bastion server to securely SSH into the VM and install the required software, web server, and so on. You'll need to manage scaling as well by setting up an Auto Scaling group. And that's not all: an ALB also needs to be set up for load balancing in case the application is installed across multiple EC2 instances.
In the case of Lambda, you won't need to worry about provisioning VMs, software, scaling, or load balancing. It is all handled by the Lambda service. We just need to compile the code and deploy it to the Lambda service. Scaling is automated; we just configure the maximum number of concurrent executions we want to allow for a function. Load balancing is handled by Lambda itself.
So here, we can see Lambda is a clear winner.

On-Demand vs. Always Available

For EC2, we essentially pay for the amount of time EC2 instances are up and running; for Lambda, it's the amount of time functions are up and running. The reason is that Lambda is spun up and down automatically based on event sources and triggers, which is something we don't get out of the box with EC2. So, while an EC2 instance is always available, Lambda is available based on request invocation. The advantage goes to Lambda functions, since we no longer pay for the idle time between invocations, which can save a lot of money in the long run.


Performance

There are various aspects to cover when we take performance into consideration. Let's discuss them one by one.

1. Concurrency and Scaling

With EC2, we have full control in implementing the concurrency and scaling. We can use EC2 Auto Scaling groups to define the policies for scaling up and down. These policies involve defining conditions (avg. threshold limits) and actions (# of instances to be added or deleted). 
However, it requires a lot of effort to identify the threshold limits and accurately forecast the # of instances required. It can only be done by carefully monitoring the metrics (CloudWatch, NewRelic, etc.).
However, with Lambda, concurrency and scaling are handled as a built-in feature. We simply define the maximum number of concurrent executions a function is restricted to. There are a few limitations, though. A function can have a maximum of 3 GB of memory, so a program that needs to scale vertically cannot go beyond that. For horizontal scaling, the maximum limit is 1,000 concurrent executions. If your Lambda is deployed in a VPC, it is restricted even further, based on the number of IP addresses available in the allocated subnets.
So, EC2 gives you more flexibility but requires manual configuration and forecasting. Lambda is designed to do all of that by itself but has a few limitations.
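A quick way to sanity-check whether the 1,000-concurrency ceiling matters for a given workload is Little's law: required concurrency is roughly the arrival rate times the average execution time. A sketch, with illustrative traffic numbers:

```python
# Estimate required Lambda concurrency from traffic (Little's law):
# concurrency ~= arrival rate (req/s) * average execution time (s).
def required_concurrency(requests_per_second, avg_duration_s):
    return requests_per_second * avg_duration_s

# e.g. 2,000 req/s at 300 ms each:
needed = required_concurrency(2000, 0.3)
print(f"~{needed:.0f} concurrent executions needed")
print("within default limit" if needed <= 1000 else "limit increase required")
```

Shortening the function's duration reduces required concurrency proportionally, which is one more reason lean functions scale better.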

2. Dependencies

It is nearly impossible to run an application without external libraries. With EC2, there is no constraint limiting the number of dependencies an application can have. However, the more dependencies an application has, the more time it will take to start, and the more of a burden it will place on the CPU.
However, with Lambda, there are constraints in terms of the maximum size of a package - 50 MB (zipped, for direct upload) and 250 MB (unzipped, including layers). Sometimes, these sizes are not sufficient, especially for ML programs where we need a lot of third-party libraries. 
AWS recommends using the /tmp directory to install and download dependencies during function runtime. However, it can take significant time to download all the dependencies from scratch when a new container is created. So, this option is good when your Lambda container is up most of the time; otherwise, it may cause a long cold start on each invocation. Also, the /tmp folder can hold a maximum of only 512 MB, so it is again restricted to limited use.
EC2 is a clear winner here.

3. Latency

Comparing the latency between EC2 and Lambda is not straightforward. It depends on the use cases. So, let’s try to get a clearer picture by going through a few examples.
Let's take a first example, where the application is used only a few times a day, at intervals of 2-3 hours.
Now, if we use EC2, it will run for the whole day; latency for the first request will be high, but for all subsequent requests it will be comparatively low. The reason is that when the EC2 instance is provisioned, all the scripts run to set up the OS, software, EBS, and other things.
If we use Lambda, the application doesn't need to run all day; a Lambda container can be spun up for each request. This involves cold start time, which is insignificant compared to EC2 instance setup time. So, for this use case, Lambda will have more latency per request than a warm EC2 instance, but significantly less than EC2's first request. To reduce this time, some teams create a Lambda function that periodically calls the application's Lambda functions to keep them warm. However, that increases the bill, so you need to strike a balance.
Take another example, where the application needs to scale up and down frequently. In this case, EC2 will need to scale up to handle the increased volume of requests, which will impact request latency, while with Lambda, scaling will be comparatively fast and latency will be lower.
So, latency will ultimately depend on the use cases and other local factors (cold start time for Lambda and resources setup time for EC2).
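The keep-warm trick mentioned above is usually implemented as a scheduled event carrying a marker field that the handler short-circuits on. A sketch (the "warmup" field is a convention used for illustration, not an AWS feature):

```python
# Keep-warm pattern: a scheduled rule pings the function periodically;
# the handler detects the ping and returns immediately without doing work.
def lambda_handler(event, context=None):
    if event.get("warmup"):
        return {"warmed": True}   # short-circuit: container stays warm
    # ... normal request handling below ...
    return {"statusCode": 200, "body": "real work done"}
```

Each ping is billed as an invocation, which is the bill/latency trade-off described above.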

4. Timeout

Lambda has more timeout constraints than EC2. If we have long-running workloads, Lambda may not be the right choice, as it has a hard timeout limit of 15 minutes; EC2 doesn't have that kind of restriction.
AWS has introduced Step Functions to overcome the Lambda timeout constraint. Also, if a Lambda function is exposed as a REST API through API Gateway, there is an additional timeout limit of 29 seconds at the gateway.
Timeouts don’t occur only due to these limits, but the downstream system's application integration as well. And, that can happen to both EC2 and Lambda functions. 
One more thing to note in the EC2 case is that if we don't configure security groups appropriately, it may also cause timeout errors.


Cost

To understand how EC2 and Lambda compare on cost, let's run through a couple of examples:
1. Let's assume an application has 5,000 hits per day, with each execution taking 100 ms at 512 MB of memory. The cost for the Lambda function will be $0.16 per month.

Now, for the same requirement, I believe I can use t2.nano EC2 instance. And, if we look at the cost for this instance, it will be $4.25.

We can see, the Lambda cost ($0.16) is just ~4% of the EC2 price ($4.25).
2. Let's take a second scenario, where an application has a lot of hits, say 5 million per month, and each execution takes 200 ms with 1 GB of memory. If we use Lambda, it will cost us $17.67 per month.

However, if we use EC2 and, I believe, t3.micros should be able to handle this load, then it will cost us only $7.62. 
So, in this case, EC2 is a cheaper solution than Lambda due to the high memory requirement, request count, and execution time.
3. Now, take an example where multiple EC2 instances are required to handle the requests. In that case, EC2 would be costlier for two reasons. First, we need an ALB (Application Load Balancer) to balance the load between those instances, which adds to the cost. Second, EC2 itself eats up some of the allocated memory, and traffic is not always evenly distributed, so we would need more EC2 instances than anticipated. Lambda handles load balancing internally, so no extra cost is added while scaling.
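Both Lambda figures above follow directly from AWS's published pricing formula: requests at $0.20 per million plus compute at roughly $0.0000166667 per GB-second, ignoring the free tier. A sketch that reproduces them:

```python
# Reproduce the Lambda costs from the two examples above using AWS's
# published pricing (free tier ignored).
GB_SECOND = 0.0000166667     # USD per GB-second of compute
PER_MILLION_REQ = 0.20       # USD per 1M requests

def lambda_monthly_cost(requests, duration_s, memory_gb):
    compute = requests * duration_s * memory_gb * GB_SECOND
    return compute + requests / 1e6 * PER_MILLION_REQ

# Example 1: 5,000 hits/day (~150,000/month), 100 ms at 512 MB:
print(f"${lambda_monthly_cost(150_000, 0.1, 0.5):.2f}")    # ~ $0.16

# Example 2: 5 million hits/month, 200 ms at 1 GB:
print(f"${lambda_monthly_cost(5_000_000, 0.2, 1.0):.2f}")  # ~ $17.67
```

Plugging your own traffic numbers into this formula and comparing against the monthly price of a suitable instance is the quickest way to find the break-even point for your workload.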


Security

When we talk about security for Lambda, most of the onus is on the AWS side, which includes OS patching, upgrades, and other infrastructure-level security concerns. Typically, malware sits idle on a server for a long time and then starts growing slowly; that is not possible in Lambda, as it is stateless in nature.
On the other hand, for EC2, we have full control to define system-level security. We need to configure the security groups, Network ACLs, and VPC subnet route tables to control the traffic in and out for an instance. However, it's a tiring job to ensure the system is fully secure. Security groups can grow to take care of business needs but can become overlapping and confusing sometimes.


Monitoring

Despite EC2's resiliency and elasticity, there are many ongoing objectives that require close tracking: capacity, predictability, and interdependence with other services. Let's talk about some of the important metrics that need to be monitored for EC2.
1. Availability - To avoid an outage in production, we need to know if each of the EC2 instances running for the applications is healthy or not. EC2 has "Instance State" which can be used to track it.
2. Status Checks - AWS performs status checks on all running EC2 servers. System status checks monitor conditions like loss of network connectivity and hardware issues, which require AWS's involvement to fix. Instance status checks monitor conditions like exhausted memory and a corrupt file system, which require our involvement to fix. The best practice is to set a status check alarm to notify you when a status check fails.
3. Resources Capacity - CPU and Memory utilization are directly related to application responsiveness. If it gets exhausted, we will not even have sufficient memory to do SSH on the instance. The only option would be to reboot the instance which can cause downtime and state loss. A good monitoring system will store metrics from the instance and can show us an increase in its resource usage until eventually hitting a ceiling and becoming unavailable.
4. System Errors - System errors can be found in the system log file like /var/log/syslog. We can aggregate these logs to Amazon CloudWatch Logs by installing their agent, or we can use syslog to forward the logs to some other central location like Splunk or ELK.
5. Human Errors - EC2 needs a lot of manual configuration and sometimes it may go wrong. So we need tracking of such activities, which can be done through CloudTrail audit logs.
6. Performance Metrics - Through CloudWatch, we can monitor CPU usage and disk usage. However, it doesn't provide any metrics for application performance monitoring, and that's where we would need APM tools.
7. Cost Monitoring - The EC2 instance count, EBS volume usage, and network usage are very important to monitor, as auto-scaling can pose a serious risk to the overall AWS bill. CloudWatch gives some information about network usage for an instance, but doesn't show how many instances are in use overall. We would also need storage and network usage at the account level, and that is something missing in CloudWatch.
So, most of the metrics can be tracked using CloudWatch, CloudTrail, and X-Ray, but there are still a few gaps to be filled.
Now, let's talk about Lambda monitoring. CloudWatch provides all the basic telemetry about the health of a Lambda function out of the box:
- Invocation Count
- Execution Duration
- Error Count
- Throttled Count

In addition to these built-in metrics, we can also record custom metrics and publish them to CloudWatch Metrics. 
However, there are a few limitations to these metrics. They don't cover concurrent execution metrics, even though concurrency is one of Lambda's most important features. For cold start counts and downstream integration metrics, we have to rely on X-Ray. There are also no built-in metrics for memory usage, but that can be handled with custom metrics.
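As a sketch of publishing such a custom metric (memory used, reported from inside the handler), here is a payload built in the shape CloudWatch's PutMetricData call expects; the namespace and dimension names are hypothetical, and in a real function the dict would go to boto3's cloudwatch client:

```python
# Build a CloudWatch PutMetricData payload for a custom Lambda metric.
# In a real function this dict would be sent via:
#   boto3.client("cloudwatch").put_metric_data(**payload)
def memory_metric(function_name, memory_used_mb):
    return {
        "Namespace": "Custom/Lambda",           # hypothetical namespace
        "MetricData": [{
            "MetricName": "MemoryUsedMB",
            "Dimensions": [
                {"Name": "FunctionName", "Value": function_name}
            ],
            "Value": memory_used_mb,
            "Unit": "Megabytes",
        }],
    }
```

Once published, this metric can be graphed and alarmed on alongside the built-in Lambda metrics.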
Now, we have a much better understanding of the major differences between EC2 and Lambda in various aspects. So, let's talk about which one to use for a given use case.

Use Cases 

There are certain use cases where there is no competition between the two. For example, If we need to set up a DB like Couchbase or MongoDB, we have to go for EC2 only. We can't do that in Lambda. Another example would be if we need a hosting environment for Backup & Disaster Recovery. Again, EC2 would be the only choice. 
However, there are certain use cases where developers might be a little uncertain about which one to use. 
1. High-Performance Computing Applications - Although Lambda claims it can perform real-time stream processing, it cannot handle processes that need high compute. Remember, Lambda has only 3 GB of memory, and heavy processing can mean high execution time, leading to either timeout issues or a higher bill. EC2 has no such restriction and is an ideal fit for these kinds of requirements.
2. Event-Based Applications - Lambda has been primarily designed for handling event-based functions and it does that best. So if a new record is added to DynamoDB or a new file added to an S3 bucket needs processing, Lambda is the best fit. It's very easy to set up and saves cost as well.
3. Security-Sensitive Applications - AWS claims that it takes good care of Lambda security. But remember one thing: Lambda functions run in a shared VPC that may be shared with other customers in a multi-tenancy setup. So, if an application has highly sensitive data and security is your primary concern, Lambda functions should run in a dedicated VPC, or you should use EC2 only. And don't forget, running Lambda in a VPC has its own challenges, like increased cold start time and limited concurrent executions.
4. Less Accessed Applications or Scheduled Jobs - If the application is used very rarely, or should be invoked based on schedule, Lambda is the right fit for it. It will save money as there is no need to run the server all the time.
5. DevOps and Local Testing - DevOps tooling for EC2 has been developed for years and has reached a good level of maturity, but Lambda is still on that journey. AWS SAM and the Serverless Framework are addressing those concerns. Local testing is another aspect to consider when using Lambda, as it has a few limitations in regard to what can be done.


In this article, we have seen that if a team doesn't want to deal with infrastructure, Lambda will be the right choice. However, it comes with limitations in terms of the use cases it can run. Also, constant monitoring is a must to ensure a good balance between the ease it provides and the cost.
EC2 has always been a standard choice for hosting applications and gives us full flexibility to configure our infrastructure. However, it is not best suited to every need, and that's where Lambda comes into the picture.
Keep using both services based on the considerations I have shared, and do let me know your feedback.

Tuesday, July 16, 2019


Developers who have been using PCF often ask how to know which version of an application is running on PCF; on-premises, we used to check this by going to the VM and checking the jar version. They also ask how to roll back PCF deployments smoothly, since doing it at the SCM (GitHub) or CI/CD level is painful and risk-prone, involving verification, approval, and other deployment processes.
With PCF 2.6, Pivotal has come up with a feature called "App Revisions". Using this feature, we can rollback the deployment for an application very easily. In this article, we will try to understand this feature and walk through the steps for rollback.

What are App Revisions -

A revision represents the code and configuration used by an app at a specific time. In PCF terms, it is an object containing references to a droplet, environment variables, and a custom start command. Have a look at a simple example below:

In the sample above, we have retrieved all the revisions for an application based on its GUID using the PCF CAPI. In line 15, there is a "resources" object, which contains all the revisions for an application. Resources holds an array of objects, and each object has a guid representing a revision. The "description" node provides a PCF-configured description of the revision. For example, in this case, the first revision description at line 30 tells us it is the "Initial version" that was deployed for this application.
Now, whenever there is a change in the application deployment, a new revision with a new GUID is created. The following events trigger a new revision:
  • A new droplet is created which happens when you do "cf push" after making a change to the code.
  • An environment variable is added or changed for the application.
  • A custom start command is added or changed in the manifest file.
  • An app rolls back to a previous version.

Steps to Follow to Enable, View, and Rollback Revisions -

Step 1 - Take an application that can run on PCF. Deploy it on PCF using cf push command.
Step 2 - Now, get the guid of the running app by executing this command:

Step 3 - Run this command to enable the revision for this app:

Step 4 - Let's check the current revision for this deployed application. You will notice "Initial revision" as the description of the first revision after enabling it:

Step 5 - Now, we will use the environment-variable event to trigger a new revision for this example. Add an environment variable to the application using the below command, and then restage the application to reflect the change:

Step 6 - Let's check whether any new revision has been added for the application. You will notice a new guid added under the resources node with the description "New droplet deployed. New environment variables deployed.":

That's the way it keeps creating new revisions for each of the events mentioned earlier in the article.
Step 7 - Notice that the description node of a revision is predefined text configured by PCF; we don't have control to change it. Suppose we do multiple "New droplet deployed" type events. How would we be able to recognize a particular revision? To overcome this, we can tag each revision with a proper release version and description using the "metadata" object. The PCF v3 CAPI allows updating a revision's metadata object using a CAPI PATCH request. See the example below:
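The original example was a screenshot; a sketch of the PATCH call (the revision GUID, label, and annotation values are placeholders of my choosing) would be:

```shell
# Tag a revision with a release label and a description via a CAPI PATCH.
# REVISION_GUID comes from the listing in Step 4/6; note that label
# values must not contain spaces.
cf curl "/v3/revisions/$REVISION_GUID" -X PATCH -d '{
  "metadata": {
    "labels": { "release": "v1.0.1" },
    "annotations": { "note": "hotfix_login_issue" }
  }
}'
```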

The response will echo the revision object, now including the updated metadata block.

It has been observed that label values under metadata do not accept spaces; they should contain alphanumeric characters only, although empty values are allowed.

Step 8 - Let's set one more environment variable to create another revision, so that we have something to roll back from.

Step 9 - Let's now see how to roll back the changes and return to a previous revision. Going back to the previous version is very easy and can be done with one command:
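The rollback command was an image in the original; a sketch of the CAPI call that creates a deployment pointing at an earlier revision (both GUIDs captured in earlier steps) would be:

```shell
# Roll back by creating a deployment that references a previous revision.
cf curl "/v3/deployments" -X POST -d '{
  "revision": { "guid": "'"$REVISION_GUID"'" },
  "relationships": { "app": { "data": { "guid": "'"$APP_GUID"'" } } }
}'
```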

We should now see a new revision, with the application pointing to the revision configured in the command. You can run the below command to check the currently deployed revision:
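A sketch of that check (APP_GUID as before):

```shell
# Show the revision(s) currently deployed for the app.
cf curl "/v3/apps/$APP_GUID/revisions/deployed"
```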

Findings -

  1. If an environment variable is added in the manifest file and pushed to PCF using "cf push", it will create 2 revisions: one with the "New environment variables deployed" description and a second with "New droplet deployed". This may create confusion while doing a rollback.
    To avoid this, set the environment variable using the "cf set-env" command and then restage the application. This will create only one revision, with the description "New droplet deployed. New environment variables deployed.".

  2. If you try to roll back to the current version itself, it will not throw any error and will create a new revision with a new GUID. This should be raised as a bug to PCF.

  3. When you roll back to a previous version, PCF ensures that the currently configured number of instances keeps running. For example, if your application has 2 instances, it will first spin up 1 new instance with the rolled-back revision and then destroy the existing instance containers, ensuring there is no downtime during the rollback. However, we need to ensure there is capacity for 1 extra instance before doing this activity. Also, users may momentarily experience different behavior, as both the old and new versions of the app will be live for a very short period.

  4. While doing a rollback, keep in mind that by default PCF retains only the 5 most recent staged droplets in its droplets bucket, so you cannot roll back further than that unless this value is increased. Note that not every revision implies a change in the droplet; PCF retains a maximum of 100 revisions by default.


In this article, we have seen how the App Revisions feature works and how we can use it to do a rollback. I believe it will further smooth the process of deployments in PCF. I would recommend putting all the above commands in your CI/CD pipeline and using this feature to tag your deployments in PCF.

Saturday, July 13, 2019


Many articles have been written about AWS Step Functions since it was first introduced in 2016.
Most of them create the impression that the service is simply an extension of the Lambda function
that allows us to stitch together multiple Lambda functions to call each other.

But actually, it is much more than that. Step Functions allows us to design and build the flow of execution
of modules in our application in a simplistic manner. This enables a developer to focus solely on ensuring
that each module performs its intended task, without having to worry about connecting each module with others.

In this article, we will explore the what, the how and the why of Step Functions, before walking through
some use cases, limitations and best practices around using the service.

What is AWS Step Functions?

AWS Step Functions is an orchestrator that helps to design and implement complex workflows. When we
need to build a workflow or have multiple tasks that need orchestration, Step Functions coordinates between
those tasks. This makes it simple to build multi-step systems. 

Step Functions is built on two main concepts: tasks and state machines.

All work in the state machine is done by tasks. A task performs work by using an activity or an AWS Lambda function or passing parameters to the API actions of other services.

A state machine is defined using the JSON-based Amazon States Language. When an AWS Step Functions
state machine is created, it stitches the components together and shows the developers their system and
how it is being configured. Have a look at a simple example:
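The example in the original post was an image; a minimal state machine definition in Amazon States Language (the state names and Lambda ARN below are placeholders, not from the post) looks like:

```json
{
  "Comment": "Minimal two-state workflow sketch",
  "StartAt": "ProcessOrder",
  "States": {
    "ProcessOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
      "Next": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```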

Can you imagine if you had to do it yourself using a Messaging Queue, Istio or App Mesh? It would be a big task,
and that’s without considering the overhead of actually maintaining that component.

It's really great to see what features it provides out of the box. However, it would have been even better if AWS had
added the ability to design it visually rather than through JSON.

How Step Functions works

As discussed earlier, the state machine is a core component of the AWS Step Functions service. It defines communication
between states and how data is passed from one state to another.


A state is referred to by its name, which can be any string but must be unique within the scope of the entire state machine.
A state performs one of the following functions:
  • Performs some work in the state machine (a Task state).
  • Makes a choice between branches of execution (a Choice state).
  • Stops execution with failure or success (a Fail or Succeed state).
  • Simply passes its input to its output or injects some fixed data (a Pass state).
  • Provides a delay for a certain amount of time or until a specified time/date (a Wait state).
  • Begins parallel branches of execution (a Parallel state).

 Here is an example of a state definition for Task type:

"States": {
    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-2:710187714096:function:DivideNumbers",
      "Next": "ChoiceState"
    }
}

Input and Output Processing

For Step Functions, the input is always passed as JSON to the first state. However, it has
to pass through InputPath, ResultPath, and OutputPath before the final output is generated.
The JSON output is then passed to the next state. 

source: AWS

InputPath - selects which parts of the JSON input to pass to the task of the Task state (for example, an AWS Lambda function).

ResultPath then selects what combination of the state input and the task result to pass to the output.

OutputPath can filter the JSON output to further limit the information that's passed to the output.

Let's take a look at an example to better understand this in detail:
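The example image is missing from this post; the following reconstruction is consistent with the description below but the function name and JSONPath values are assumptions. Suppose the execution input is {"lambda": {"math": 80, "eng": 93}} and a hypothetical CalculateTotal function returns its input plus a computed "total" field. The state could be defined as:

```json
"ExamResults": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CalculateTotal",
  "InputPath": "$.lambda",
  "ResultPath": "$.results",
  "OutputPath": "$.results",
  "End": true
}
```

Here InputPath sends only the "lambda" node to the function, ResultPath writes the function's result under "results", and OutputPath passes only that node on to the next state.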

For a Lambda execution, the input is described as JSON. That input is bound to the symbol $ and passed
on as the input to the first state in the state machine.

By default, the output of each state is bound to $ and becomes the input of the next state. For each state,
the InputPath, ResultPath, and OutputPath attributes filter the input and shape the final output.
In the above scenario, the "ExamResults" state selects the "lambda" node, appends the result of the state's execution to
the "results" node, and the final output is just the "results" node rather than the whole JSON object: 

Hence, the final output will be:
{
  "math": 80,
  "eng": 93,
  "total": 173
}

Step Functions can be triggered in four ways: from the AWS Management Console, via the StartExecution API call (SDK/CLI), through Amazon API Gateway, or via CloudWatch Events on a schedule or event.

As I mentioned earlier, Step Functions is not only about Lambda Functions. It has support for several other
Integration Patterns like SQS, DynamoDB, SNS, ECS and many others.

Step Functions use cases

There are many use cases that can be resolved using Step Functions. However, we’ll restrict ourselves to a
few major ones here:

1. Sequence Batch Processing Job

If you have many Batch Jobs to be processed sequentially and need to coordinate the data between those,
this is the best solution. For example, an e-commerce website can first read the product data and the next job
will find out which products are running out of stock soon and then, the third job can send a notification to all the
vendors to expedite the supply process. 
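The sequential batch flow described above might be sketched in Amazon States Language like this (the function names and ARNs are placeholders I've chosen for illustration):

```json
{
  "StartAt": "ReadProductData",
  "States": {
    "ReadProductData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReadProductData",
      "Next": "FindLowStock"
    },
    "FindLowStock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FindLowStock",
      "Next": "NotifyVendors"
    },
    "NotifyVendors": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:NotifyVendors",
      "End": true
    }
  }
}
```

The output of each state becomes the input of the next, which is how data is coordinated between the jobs.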

2. Easy Integration with Manual Tasks

If a workflow needs manual approval/intervention, Step Functions is a good way to coordinate it.
For example, take an employee promotion process, which needs approval from a manager. The Step Function can
send an email using the AWS SES service with Approve and Reject links, and once it receives the response, it can trigger
the next action using Lambda or ECS jobs.

3. Coordinate Container Task in Serverless Workflow

A Step Function can help decide how best to process data. For example, based on the file size, you can choose
either Lambda, ECS, or an on-premise activity, optimizing both cost and runtime.

Step Functions benefits

  • Retry: Before Step Functions, there was no easy way to retry in the event of a timeout error, runtime error, or
any other type of failure. It also provides an exponential backoff feature.

"Retry": [ {
    "ErrorEquals": [ "States.Timeout" ],
    "IntervalSeconds": 3,
    "MaxAttempts": 2,
    "BackoffRate": 1.5
 } ]

  • Error Handling: It provides an easy way of error handling at each state. It can handle the several types of errors
a state can throw, such as:

  • States.Timeout - When a Task state cannot finish the job within the TimeoutSeconds or does not send
heartbeat using SendTaskHeartbeat within HeartbeatSeconds value
  • States.TaskFailed - When a Task state fails for any reason.
  • States.Permissions - When a Task state does not have sufficient privileges to call the Lambda/Activity code.
  • States.All - It captures any known error name. 

It can also catch Lambda service exceptions (Lambda.ServiceException) and even unhandled errors (Lambda.Unknown).
A typical example of error handling: 
"Catch": [ {
    "ErrorEquals": [ "States.TaskFailed", "States.Permissions" ],
    "Next": "state x"
 } ]
You can bet that it was never this easy to implement error handling like this with any other workflow solution.

  • Parallelization: You can parallelize the work declaratively. A state machine can have a state calling multiple states in
parallel. This will make the workflow complete faster.
  • High Execution Time: Step Functions allows a maximum execution time of one year, so if some tasks of the workflow are long-running (more than the 15-minute Lambda limit), they can be run on ECS or EC2, or as an Activity hosted outside of AWS.
  • Pricing: Step Functions counts a state transition each time a step of a workflow is executed. You are charged for the total number of state transitions across all your state machines. Charges are a little on the higher side, but it would almost certainly be costlier to come up with your own solution for orchestrating the different services, error handling, retries, and the many other features Step Functions provides.


Despite all the powerful features Step Functions offers, there are still a few things missing:
  • Shorter Execution History: The maximum limit for keeping execution history logs is 90 days. It cannot be extended and
that may preclude the use of Step Functions for businesses that have longer retention requirements.
  • Missing Triggers: Some Event Sources and Triggers are still missing, such as DynamoDB and Kinesis.
  • State machine Execution name: Each Execution name for a state machine has to be unique (not used in the last 90 days).
This can be very tricky.
  • AWS Step Functions does not horizontally scale to dynamically launch AWS Lambda functions in parallel.
For example, if my state 1 generates 10 messages, it cannot spin up 10 AWS Lambda invocations at state
2 to process those messages individually (This feature is available if you use Lambda function directly with concurrent
execution). State 2 is statically defined in the state machine and provisions the same number of task
executors each time. Therefore, it cannot scale dynamically. Hopefully, this feature may be added in the future.

Step Function Best Practices

  • In a workflow sometimes, we would like to resume the process from the fail state as opposed to re-running it from the
beginning. This isn’t provided as a built-in feature, but there is a workaround to achieve this.

  • Beware: State Machine can run infinitely. It has a max execution time of one year. That itself is too high. On top of that,
it provides a feature "Continue as new Execution". This allows you to start a new execution before terminating your current
running execution. This opens up the possibility of it running infinitely by mistake. Monitoring Execution metrics is a good way
to identify and fix those mistakes.

  • AWS Step Functions has a hard limit of 25,000 event entries in the execution history (with a retention of 90 days),
which is going to fill very quickly for a long-running execution. To overcome this limitation, we can implement
“Continue as new Execution” pattern, where we can spin up the new execution from an existing running execution.
For example, if a long-running execution has 10 steps and you're expecting 40,000 event entries in your execution
history, you may configure it to start a new execution at step 5. This will create a total of two executions of the state
machine, distributing the entries between steps 1-5 and steps 6-10.

  • By default, the Amazon States Language doesn't set timeouts in state machine definitions. In a scenario where a
Lambda function or Activity has a problem and keeps running without responding back to Step Functions, the execution
will keep waiting for up to a year (the max timeout). To prevent this, set a timeout using TimeoutSeconds, like this:
"ExamResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloFunction",
      "TimeoutSeconds": 200,
      "HeartbeatSeconds": 30,
      "End": true
}

  • Using TimeoutSeconds & HeartbeatSeconds, we can keep a long-running workflow alive.

HeartbeatSeconds value should be less than TimeoutSeconds. And, we need to use
SendTaskHeartbeat periodically within the time we set in HeartbeatSeconds in our state machine
task definition to keep the task from timing out.

Let’s take an example where HeartbeatSeconds value is 30 seconds and TimeoutSeconds is
400 seconds for a long activity worker process. When the state machine and activity worker
process starts, the execution pauses at the activity task state and waits for your activity worker to
poll for a task. Once a taskToken is provided to your activity worker, your workflow will wait for
SendTaskSuccess or SendTaskFailure to provide a status.  If the execution doesn't receive either
of these or a SendTaskHeartbeat call before the time configured in TimeoutSeconds, the execution
will fail and the execution history will contain an ExecutionTimedOut event. So, by configuring these,
we can design a long-running workflow effectively.

  • For longer workflows, keep in mind that API Gateway calls the state machine asynchronously and
responds back with only the execution ARN. You can define another method in API Gateway
that calls the "DescribeExecution" API. This method will have to poll that API
periodically to get the output, until the returned status is no longer "RUNNING".
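The polling pattern described above can be sketched with the AWS CLI (the state machine ARN and input are placeholders):

```shell
# Start the execution; the call returns immediately with the execution ARN.
EXEC_ARN=$(aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:MyWorkflow \
    --input '{"orderId": "1234"}' \
    --query executionArn --output text)

# Poll DescribeExecution until the status is no longer RUNNING.
while [ "$(aws stepfunctions describe-execution --execution-arn "$EXEC_ARN" \
        --query status --output text)" = "RUNNING" ]; do
    sleep 5
done

# Read the final output of the state machine.
aws stepfunctions describe-execution --execution-arn "$EXEC_ARN" --query output
```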

  • Some developers use Step Functions for their API calls and want to send the state machine's
output to a UI or another system. Developers should generally avoid using Step Functions
for microservice API calls: microservices are supposed to be small and respond back
in 2-3 seconds, if not milliseconds.
  • Like any other application code, handle exceptions. Use the Retry feature to handle occasional transient service errors.

Logging and Monitoring Step Functions

Similar to Lambda functions, Step Functions also sends logs to CloudWatch and it generates several metrics around it.
For example, Execution metrics, Activity metrics, Lambda metrics, etc. Below is an example of Execution Metrics:


The Visual Workflow panel shows the current status of the execution. We can see the details of any state by
clicking on it in the workflow diagram; the panel shows the input, output, and
exception (if any) for the state.


It also logs the execution history and status for each state. AWS Console does provide a nice
visual of the states from start to end. We can also click on CloudWatch Logs to go to LogGroups
and see detail logs.

One recommendation is to create a Unique Trace ID which should be passed to all the integration
services these states connect to. It will help to track the transactions easily.
It also has integration with CloudTrail to log the events.


In this article, we explored the basic concepts of Step Functions and how it works. We also talked
about how, with the Visualization panel, Error Handling, and Retry features, it makes the workflow
creation process much smoother. Step Functions could properly be described as state-as-a-service.
Without it, we would not be able to maintain the state of each execution across multiple Lambda functions.

Just keep in mind that you need to keep a watch on your bills as it can burn a hole in your pocket very fast.
And the best way to do that is to ensure that proper monitoring and metrics are in place.
