AWS Step Functions Lesson Learned

Background

Many articles have been written about AWS Step Functions since it was first introduced in 2016.
Most of them give the impression that the service is simply an extension of AWS Lambda
that lets us stitch together multiple Lambda functions to call each other.


But actually, it is much more than that. Step Functions allows us to design and build the flow of execution
of modules in our application in a simple manner. This lets a developer focus solely on ensuring
that each module performs its intended task, without having to worry about wiring the modules together.


In this article, we will explore the what, the how and the why of Step Functions, before walking through
some use cases, limitations and best practices around using the service.

What is AWS Step Functions?

AWS Step Functions is an orchestrator that helps to design and implement complex workflows. When we
need to build a workflow or have multiple tasks that need orchestration, Step Functions coordinates between
those tasks. This makes it simple to build multi-step systems. 


Step Functions is built on two main concepts: tasks and state machines.


All work in the state machine is done by tasks. A task performs work by using an activity, an AWS Lambda function, or by passing parameters to the API actions of other services.

A state machine is defined using the JSON-based Amazon States Language. When an AWS Step Functions
state machine is created, it stitches the components together and shows developers their system and
how it is configured. Have a look at a simple example:
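A minimal sketch of such a definition in the Amazon States Language (the state names, function names and ARNs here are illustrative, not from the original example):

{
  "Comment": "A minimal two-step workflow (illustrative names)",
  "StartAt": "ProcessOrder",
  "States": {
    "ProcessOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
      "Next": "NotifyCustomer"
    },
    "NotifyCustomer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:NotifyCustomer",
      "End": true
    }
  }
}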



Can you imagine if you had to do it yourself using a Messaging Queue, Istio or App Mesh? It would be a big task,
and that’s without considering the overhead of actually maintaining that component.


It's really great to see what features it provides out of the box. However, it would have been even better if AWS had
added the ability to design it visually rather than through JSON.


How Step Functions works

As discussed earlier, the state machine is a core component of the AWS Step Functions service. It defines communication
between states and how data is passed from one state to another.

State 

A state is referred to by its name, which can be any string but must be unique within the scope of the entire state machine.
A state can do one of the following:
  • Performs some work in the state machine (a Task state).
  • Makes a choice between branches of execution (a Choice state).
  • Stops execution with failure or success (a Fail or Succeed state).
  • Simply passes its input to its output or injects some fixed data (a Pass state).
  • Provides a delay for a certain amount of time or until a specified time/date (a Wait state).
  • Begins parallel branches of execution (a Parallel state).



Here is an example of a state definition for the Task type:

"States": {
    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-2:710187714096:function:DivideNumbers",
      "Next": "ChoiceState"
    }
}


Input and Output Processing

For Step Functions, the input is always passed as a JSON document to the first state. It then flows
through the InputPath, ResultPath and OutputPath filters before the final output is generated.
The JSON output is then passed to the next state.


 
[Diagram: input and output processing through InputPath, ResultPath and OutputPath - source: AWS]


InputPath - selects which parts of the JSON input to pass to the task of the Task state (for example, an AWS Lambda function).

ResultPath - selects what combination of the state input and the task result to pass to the output.

OutputPath - filters the JSON output to further limit the information that's passed on.

Let's take a look at an example to better understand this in detail:





The input to an execution is described as a JSON document. That input is bound to the symbol $ and passed
as the input to the first state in the state machine.


By default, the output of each state is bound to $ and becomes the input of the next state. For each state,
the InputPath, ResultPath and OutputPath attributes filter the input and shape the final output.
In the scenario below, the “ExamResults” state filters the input down to the “lambda” node, appends the result
of the state execution to the “results” node, and emits just the “results” node rather than the whole JSON object:
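A reconstruction of that scenario (assuming a TotalMarks Lambda that receives the marks and returns them together with their total; the names and ARN are illustrative):

Input to the execution:

{
  "lambda": {
    "math": 80,
    "eng": 93
  },
  "other": "data the state does not need"
}

State definition:

"ExamResults": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:TotalMarks",
  "InputPath": "$.lambda",
  "ResultPath": "$.results",
  "OutputPath": "$.results",
  "End": true
}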




Hence, the final output will be:
{
  "math": 80,
  "eng": 93,
  "total": 173
}


Step Functions can be triggered in four main ways: from the AWS Management Console, through the
StartExecution API call (via an SDK or the CLI), from Amazon API Gateway, or by a CloudWatch Events rule.



As I mentioned earlier, Step Functions is not only about Lambda functions. It supports service integrations
with several other services, like SQS, DynamoDB, SNS, ECS and many others.


Step Functions use cases

There are many use cases that can be resolved using Step Functions. However, we’ll restrict ourselves to a
few major ones here:

1. Sequential Batch Processing Jobs

If you have many batch jobs to be processed sequentially and need to coordinate data between them,
this is a natural fit. For example, an e-commerce site can have a first job read the product data, a second job
find out which products will soon run out of stock, and a third job notify all the vendors to expedite the
supply process, as sketched below.
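As a sketch, such a pipeline is just a chain of Task states connected with Next (the state and function names here are illustrative):

"States": {
  "ReadProductData": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReadProductData",
    "Next": "FindLowStockProducts"
  },
  "FindLowStockProducts": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FindLowStockProducts",
    "Next": "NotifyVendors"
  },
  "NotifyVendors": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:NotifyVendors",
    "End": true
  }
}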

2. Easy Integration with Manual Tasks

If a workflow needs manual approval or intervention, Step Functions is a good way to coordinate it.
For example, an employee promotion process needs approval from a manager. Step Functions can send an
email using the AWS SES service with Approve and Reject links and, once it receives the response, trigger the
next action using Lambda or ECS jobs, as in the sketch below.
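One way to implement the approval step is the task-token callback pattern; a minimal sketch (the function, state and field names are illustrative) pauses the workflow until SendTaskSuccess or SendTaskFailure is called with the token embedded in the email links:

"ManagerApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "SendApprovalEmail",
    "Payload": {
      "employee.$": "$.employee",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "Next": "ApplyPromotion"
}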

3. Coordinate Container Tasks in a Serverless Workflow

A Step Function can help decide how best to process data. Based on the file size, you can route the work to
Lambda, ECS or an on-premises activity to optimize both cost and runtime, as in the sketch below.
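A minimal Choice-state sketch of this routing, assuming an earlier state puts a fileSizeMB field in its output (the field and state names are illustrative):

"CheckFileSize": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.fileSizeMB",
      "NumericLessThan": 100,
      "Next": "ProcessWithLambda"
    }
  ],
  "Default": "ProcessWithECS"
}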


Step Functions benefits

  • Retry: Before Step Functions, there was no easy way to retry in the event of a timeout error, a runtime error or
any other type of failure. Step Functions also provides an exponential backoff feature:

"Retry": [ {
    "ErrorEquals": [ "States.Timeout" ],
    "IntervalSeconds": 3,
    "MaxAttempts": 2,
    "BackoffRate": 1.5
 } ]

  • Error Handling: It provides an easy way of handling errors at each state. It can handle the several types of
error a state can throw, such as:

  • States.Timeout - when a Task state cannot finish the job within TimeoutSeconds, or does not send a
heartbeat using SendTaskHeartbeat within the HeartbeatSeconds value.
  • States.TaskFailed - when a Task state fails for any reason.
  • States.Permissions - when a Task state does not have sufficient privileges to call the Lambda/Activity code.
  • States.ALL - a wildcard that matches any known error name.

It can also catch Lambda service exceptions (Lambda.ServiceException) and even unhandled errors (Lambda.Unknown).
A typical example of error handling:

"Catch": [ {
    "ErrorEquals": [ "States.TaskFailed", "States.Permissions" ],
    "Next": "state x"
 } ]

It was never this easy to implement error handling with any other workflow solution.


  • Parallelization: You can parallelize work declaratively. A state machine can have a state that calls multiple
states in parallel, which makes the workflow complete faster, as in the sketch below.
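A minimal Parallel-state sketch (the branch names and ARNs are illustrative); both branches run concurrently and their outputs are combined into an array for the next state:

"ProcessInParallel": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "ResizeImage",
      "States": {
        "ResizeImage": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ResizeImage",
          "End": true
        }
      }
    },
    {
      "StartAt": "ExtractMetadata",
      "States": {
        "ExtractMetadata": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ExtractMetadata",
          "End": true
        }
      }
    }
  ],
  "Next": "CombineResults"
}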
  • High Execution Time: Step Functions has a maximum execution time of one year, so if some tasks of the workflow run long (more than Lambda's 15-minute limit), they can be run on ECS or EC2, or as an Activity hosted outside of AWS.
  • Pricing: Step Functions counts a state transition each time a step of a workflow is executed. You are charged for the total number of state transitions across all your state machines. Charges are a little on the high side, but it would almost certainly be costlier to build your own solution for orchestrating the different services, error handling, retries, and the many other features Step Functions provides.

Limitations

Despite all the powerful features Step Functions offers, there are still a few things missing:
  • Shorter Execution History: execution history logs are kept for a maximum of 90 days. This cannot be extended, which
may preclude the use of Step Functions for businesses with longer retention requirements.
  • Missing Triggers: Some Event Sources and Triggers are still missing, such as DynamoDB and Kinesis.
  • State Machine Execution Name: each execution name for a state machine has to be unique (not used in the last 90 days),
which can be very tricky.
  • AWS Step Functions does not scale horizontally to launch AWS Lambda functions dynamically in parallel.
For example, if state 1 generates 10 messages, it cannot spin up 10 AWS Lambda invocations at state
2 to process those messages individually (this is available if you use a Lambda function directly with concurrent
execution). State 2 is statically defined in the state machine and provisions the same number of task
executors each time, so it cannot scale dynamically. Hopefully, this feature will be added in the future.

Step Function Best Practices

  • In a workflow, we sometimes want to resume the process from the failed state rather than re-running it from the
beginning. This isn't provided as a built-in feature, but there is a workaround to achieve it.

  • Beware: a state machine can run practically forever. It has a maximum execution time of one year, which is already very
high. On top of that, it provides a "Continue as new Execution" feature that allows you to start a new execution before
terminating the currently running one. This opens up the possibility of it running indefinitely by mistake. Monitoring
execution metrics is a good way to identify and fix such mistakes.

  • AWS Step Functions has a hard limit of 25,000 event entries in the execution history (with a retention of 90 days),
which fills up very quickly for a long-running execution. To overcome this limitation, we can implement the
“Continue as new Execution” pattern, where we spin up a new execution from an existing running execution,
as in the sketch below. For example, if a long-running execution has 10 steps and you expect 40,000 event entries
in your execution history, you can configure the machine to start a new execution at step 5. That creates a total of
two executions of the state machine and distributes the entries between steps 1-5 and steps 6-10.
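One way to implement this pattern is the Step Functions service integration that starts a fresh execution from within a state; a sketch with an illustrative state machine name and ARN:

"StartNewExecution": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:MyLongRunningMachine",
    "Input.$": "$"
  },
  "End": true
}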

  • By default, the Amazon States Language doesn't set timeouts in state machine definitions. If a
Lambda function or activity has a problem and keeps running without responding to Step Functions, the execution
will keep waiting for up to a year (the maximum timeout). To prevent this, set a timeout using TimeoutSeconds, like this:

"ExamResults": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloFunction",
    "TimeoutSeconds": 200,
    "HeartbeatSeconds": 30,
    "End": true
}

  • Using TimeoutSeconds and HeartbeatSeconds together, we can keep a long-running workflow alive.

The HeartbeatSeconds value should be less than TimeoutSeconds, and the activity worker needs to call
SendTaskHeartbeat periodically, within the interval set in HeartbeatSeconds in the state machine
task definition, to keep the task from timing out.

Let's take an example where the HeartbeatSeconds value is 30 seconds and TimeoutSeconds is
400 seconds for a long-running activity worker process. When the state machine and the activity worker
process start, the execution pauses at the activity task state and waits for the activity worker to
poll for a task. Once a taskToken is provided to the activity worker, the workflow waits for
SendTaskSuccess or SendTaskFailure to provide a status. If the execution doesn't receive either
of these, or a SendTaskHeartbeat call, before the time configured in TimeoutSeconds, the execution
fails and the execution history contains an ExecutionTimedOut event. By configuring these,
we can design a long-running workflow effectively.
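A minimal sketch of such an activity worker in Node.js (AWS SDK for JavaScript v2; doLongRunningWork and the other names are illustrative assumptions):

const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions();

async function runWorker(activityArn) {
  // Poll for a task; the response carries a taskToken and the task input
  const task = await stepfunctions.getActivityTask({ activityArn }).promise();
  if (!task.taskToken) return; // no work available

  // Send a heartbeat every 20s, comfortably within HeartbeatSeconds = 30
  const heartbeat = setInterval(() => {
    stepfunctions.sendTaskHeartbeat({ taskToken: task.taskToken })
      .promise()
      .catch(console.error);
  }, 20000);

  try {
    const output = await doLongRunningWork(JSON.parse(task.input)); // your work here
    await stepfunctions.sendTaskSuccess({
      taskToken: task.taskToken,
      output: JSON.stringify(output)
    }).promise();
  } catch (err) {
    await stepfunctions.sendTaskFailure({
      taskToken: task.taskToken,
      error: 'WorkerError',
      cause: String(err)
    }).promise();
  } finally {
    clearInterval(heartbeat);
  }
}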

  • Some developers use Step Functions for their API calls and want to send the state machine's
output to a UI or another system. Ideally, avoid using Step Functions for microservice API calls:
microservices are supposed to be small and respond within 2-3 seconds, if not milliseconds. For longer
workflows, keep in mind that API Gateway calls the state machine asynchronously and responds only
with the execution ARN. You can define another method in API Gateway that calls the "DescribeExecution"
API, and poll it until the returned status is no longer "RUNNING" (see the sketch below).


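A minimal polling sketch in Node.js (AWS SDK for JavaScript v2; the 2-second polling interval is an arbitrary choice):

const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions();

async function waitForResult(executionArn) {
  for (;;) {
    const res = await stepfunctions.describeExecution({ executionArn }).promise();
    if (res.status !== 'RUNNING') {
      // SUCCEEDED, FAILED, TIMED_OUT or ABORTED; on success the result is in res.output
      return res;
    }
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
  }
}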
  • Like any other application code, handle exceptions. Use the Retry feature to handle occasional transient service errors.

Logging and Monitoring Step Functions

Similar to Lambda functions, Step Functions sends logs to CloudWatch and generates several groups of metrics
around them: execution metrics, activity metrics, Lambda metrics, and so on. Below is an example of execution metrics:


 


The Visual Workflow panel shows the current status of the execution (see the right side of the panel in the picture below).
We can see the details of any state by clicking on it in the workflow diagram: it shows the input, output and
exception (if any) for that state.


 


It also logs the execution history and status of each state. The AWS Console provides a nice
visual of the states from start to end. We can also click through to CloudWatch Logs to open the log groups
and see detailed logs.





One recommendation is to create a unique trace ID and pass it to all the services these
states integrate with; it makes transactions much easier to track.
Step Functions also integrates with CloudTrail to log events.

Conclusion

In this article, we explored the basic concepts of Step Functions and how the service works. We also discussed
how its visualization panel, error handling and retry features make the workflow
creation process much smoother. Step Functions could fairly be described as state-as-a-service.
Without it, we would not be able to maintain the state of each execution across multiple Lambda
functions/activities.


Just keep in mind that you need to keep a watch on your bill, as the service can burn a hole in your pocket very fast.
The best way to do that is to ensure that proper monitoring and metrics are in place.

AWS Lambda Timeout Best Practices

1. Overview
More and more developers and architects are using AWS Lambda to deploy serverless functions, to control costs and reduce the burden of server management. Given the unpredictable nature of end-user demand in today's digital-first world, Lambda functions can help resolve unplanned scaling issues.
However, when I talk to these Lambda developers and architects, I hear that they struggle with many different kinds of problems, like throttling, timeouts and slow performance. That's why there is so much emphasis these days on applying AWS Lambda best practices.

In this article we're going to take a look at Lambda timeout errors, how to monitor them, and the best practices for handling them effectively.
A Lambda serverless application is made up of three major components: the event source, the Lambda function, and services (DynamoDB, S3 and third-party services).


Let's first understand what limits AWS defines around these components:

2. AWS Defined Limits
a. Event Source
AWS API Gateway has a maximum timeout of 29 seconds for all integration types, including Lambda.


This means that any API call coming through API Gateway cannot exceed 29 seconds, which is reasonable for most APIs except a few highly computational ones.

DynamoDB Streams - DynamoDB allows 40,000 write capacity units per table (in most regions).


That means that if DynamoDB is used as a trigger for a Lambda function, the table cannot take more than 40,000 writes per second. This limit is very high, and if it is used at the maximum configuration, most downstream systems will not be able to handle that load and could easily time out.
b. Lambda Function
A Lambda Function Timeout Limit can be set to a maximum of 15 minutes (900 seconds).


It used to be capped at 5 minutes, as Lambda was intended only for small, simple functions with event-driven executions. However, that proved to be a roadblock for batch applications and high-computation functions which needed more execution time.

Thankfully AWS increased the limit, although some purists dislike the change, arguing it defeats the purpose of FaaS as a home for small pieces of logic. Still, 15 minutes is a very high value for regular HTTP APIs and, if timeouts are not configured at the code level, may cause timeout errors or high cost. We'll dig deeper into some of those scenarios shortly.
Concurrent Execution - Lambda has a default limit of 1,000 concurrent executions at the account level.

That means it can spin up 1,000 containers across all the functions in an account. If concurrency is not capped at the function level, one busy function can cause throttling for all the others (see the example below).
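Per-function concurrency can be capped with reserved concurrency, for example (the function name and value here are illustrative):

aws lambda put-function-concurrency \
    --function-name my-function \
    --reserved-concurrent-executions 100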

c. Services
DynamoDB - allows 40,000 read/write capacity units per table (in most regions).

S3 - unlimited objects.

Third-Party Services - limits depend on the downstream system.

These limits also need to be handled for each function; otherwise they can cause frequent timeouts.

Now, let's take a few scenarios and understand how these limits might cause the timeout errors in a serverless application.
Scenario 1
Problem: A REST API implemented through a Lambda Function is exposed through API Gateway.

This API calls a third-party service to retrieve data, but for some reason the third-party service is not responding. The function has a timeout of 15 minutes, so the thread will be kept waiting for the response.

However, the threshold for API Gateway is 29 seconds, so the user will receive a timeout error after 29 seconds. Not only is this a poor experience for the user, it also results in additional cost.
Solution: For APIs, it's always better to define your own timeouts at the function level, and they should be very short - around 3-6 seconds. Setting a short timeout ensures that we don't wait an unreasonable time for a downstream response and cause a timeout, as in the sketch below.
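As a sketch of this in Node.js (the URL and limits are illustrative assumptions), you can abort a slow third-party call yourself instead of relying on the function timeout:

const https = require('https');

function fetchWithTimeout(url, timeoutMs) {
  return new Promise((resolve, reject) => {
    const req = https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body));
    });
    // Give up if the third-party service does not respond in time
    req.setTimeout(timeoutMs, () => req.destroy(new Error('Downstream timeout')));
    req.on('error', reject);
  });
}

// allow at most 3 seconds instead of the full function timeout
// fetchWithTimeout('https://third-party.example.com/data', 3000);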
Scenario 2
Problem: A REST API calls multiple services: it reads data from a DynamoDB table, calls an API, and then stores the data back in the DynamoDB table.

If the API is not responding, the function will wait for the response until it reaches the timeout set at the function level (let's assume 6s), and then time out. Here a single integration point causes the whole function to time out.
Solution: A timeout needs to be set for each integration point, so that the function can handle the timeout error, process the request with the available data, and not waste execution time. So here, the timeout has to be defined for all three integrations to handle the response effectively.
Scenario 3
Problem: To solve the above two problems, most developers use a fixed timeout limit at the function and integration level, hardcoded in the code/config. However, this doesn't make full use of the execution time and can cause problems.
  • If it is too short, it doesn't give the request the opportunity to succeed. For example, there are 6s left in the invocation, but we set the timeout to 3s at the integration level.
  • If it is too long, the request will cause the function itself to time out. For example, there are only 5s left in the invocation, but we set the timeout to 6s at the integration level.
Let's talk about the two general approaches to setting timeout values.
In the first approach, the function timeout is set to 6s and each integration call's timeout is set to 2s. Even though the whole invocation (including all three calls) could finish within 6s, the API integration call times out because it cannot complete within 2s. It has not been given the best chance to complete the request.


Similarly, in the second approach, the timeout is set too high for each call, causing the function itself to time out without a chance for recovery: the function has a 6s timeout while each integration call has a 5s timeout, so the three calls could take up to 15s + 1s (1s for handling the response at the function level). Requests are allowed too much time to execute, and the function times out.

Solution: To utilize the invocation time better, set each timeout based on the amount of invocation time left. It must also account for the time required to perform recovery steps, like returning a meaningful error or returning a fallback result based on the circuit breaker pattern.

Let’s take an example of one programming language to understand better how to do this:

If Node.js is your function's programming language, the Lambda handler provides a context object as an input. This object has a method, context.getRemainingTimeInMillis(), which returns the approximate remaining execution time of the Lambda function that is currently executing.

To set the timeout for the currently running function, we can use this code:

var server = app.listen();
server.setTimeout(6000);

And to set the timeout for each API call, we can use this code:
app.post('/xxx', function (req, res) {
  req.setTimeout(context.getRemainingTimeInMillis() - 500); // 500ms reserved for recovery steps
  // ... handle the request ...
});


3. Monitoring Timeout Errors
There are two AWS-native solutions for monitoring Lambda logs: CloudWatch and X-Ray.
Lambda automatically monitors the logs of a function and provides metrics through CloudWatch. CloudWatch provides a Duration metric, which tells us how much time a function takes over a particular period, as well as the Average Duration, which can be used to baseline the function timeout limit.

However, CloudWatch doesn't go deep enough to tell us how much time each downstream call takes. That information is required to set a timeout limit at the integration level. That's where X-Ray comes in: it shows the execution time taken by all downstream systems.

In the example below, it shows the execution time of S3 GET (171ms) and S3 PUT (178ms) requests.  

4. Best Practices To Handle Timeout Errors
  a. Always use short timeout limits. Set them at 3-6 seconds for API calls; for Kinesis, DynamoDB Streams or SQS, adjust the limits based on the batch size.
  b. Put monitoring in place using CloudWatch and X-Ray, and fine-tune the timeout values as applicable.
  c. If timeouts are unavoidable, either return a response with an error code and description, or use fallback methods. Fallback methods can use cached data or get data from an alternative source. Spring Cloud provides Hystrix for fallback methods and the Spring Retry library for retry logic. Node.js provides the oibackoff library for retries.
  d. DynamoDB has a defined write capacity. If a Lambda function's concurrent executions increase, writes will start timing out once they cross that capacity. The node-rate-limiter library for Node.js can be used to control the number of invocations hitting DynamoDB.
  e. Find the right balance between performance and cost. To increase performance, Lambda gives only a single option - increase memory. More memory equals more CPU.
  • If a function's logic is CPU-intensive, add memory to reduce the execution time. It not only saves cost but reduces timeout errors.
  • If a function spends most of its time on DB operations, there is no point in increasing memory; it doesn't help.
  • AWS charges for Lambda usage in increments of 100ms. So, if the average execution time of your Lambda function is 110ms, consider increasing the memory to bring it below 100ms; otherwise you'll be charged for 200ms.
  • If your Lambda function takes longer than the timeout value you want to set, pay attention to the steps it performs. It might be doing too many things in one function, so consider Step Functions to break it into smaller pieces.
5. Conclusion
In this article, we discussed various scenarios in which timeouts can lead to a bad user experience, not to mention added cost to your account. So apply common sense: if a function takes more time than allotted, there could well be a problem that needs proper attention, rather than simply increasing the timeout limit. Monitoring is the best way to identify these gaps and fine-tune timeout configuration.

