How I efficiently processed millions of records using AWS Services
We had a scenario where we needed to process Loans and convert them into Leads. We had around a million records in our database, associated with different Users.
We have a feature where we process each loan one by one, apply some magic numbers to it based on its values, check whether it is eligible to become a lead, and finally send an email notification to the Users.
This feature initially took around 3 hours to finish. That was acceptable neither to our Clients nor to our Manager. Clients used to complain that it was a very lengthy process and that they had to wait an eternity just to get the results. Our Manager raised the same concern.
Who helped us with this?
AWS Step Functions, with its power to orchestrate Lambda functions, came to our rescue.
What is AWS Step Functions?
It is a simple tool to manage different Lambdas and make them work as a single unit. It is a state management tool that performs tasks in sequence. It supports conditions (if/else), loops with configurable concurrency, transforming the input and output of each Lambda, reading a JSON file from S3 and looping over its data, placeholder steps in the flow, and much more. A minimal sketch of such a state machine is shown below.
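I can't share our actual definition, but here is a minimal, hypothetical sketch of a state machine in Amazon States Language, written as a Python dict: one Task state followed by a Map state that loops over a list with bounded concurrency. All state names and Lambda ARNs here are made up for illustration.

```python
import json

# A minimal Amazon States Language sketch (names and ARNs are hypothetical):
# one Task state feeding a Map state that loops over items with bounded concurrency.
definition = {
    "StartAt": "GetItems",
    "States": {
        "GetItems": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-items",
            "Next": "ItemLoop",
        },
        "ItemLoop": {
            "Type": "Map",
            "ItemsPath": "$.items",   # loop over the array in the state input
            "MaxConcurrency": 5,      # how many iterations run in parallel
            "Iterator": {
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                        "End": True,
                    }
                },
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```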
How did it save us?
We considered all the features of Step Functions and decided to implement it for the above process.
(The diagram shown here is only illustrative, as I am not authorized to share the actual one.)
The Complete Process
Step 1: Get Banks Lambda
It is a Lambda function that queries all the available banks. Once queried, we store the results in an AWS S3 bucket as a JSON file and return the key of the S3 file as the Lambda's response. We took this approach because there can be N number of banks going forward, and looping over a file's data puts no limit on the number of items.
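The post doesn't include our code, but a minimal sketch of such a handler could look like the following. The bucket name and the fetch_all_banks() query helper are assumptions, not our actual implementation.

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "loan-processing-artifacts"  # hypothetical bucket name


def fetch_all_banks():
    # Placeholder for the real DB query (the post does not show it).
    return [{"bank_id": 1, "name": "Example Bank"}]


def handler(event, context):
    banks = fetch_all_banks()
    key = f"banks/{uuid.uuid4()}.json"
    # Store the full list in S3 so payload size never limits how many banks we handle.
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(banks))
    # Return only the S3 key; the next state reads the file itself.
    return {"bucket": BUCKET, "key": key}
```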
Step 2: Bank Loop
The Step Function reads the key from the response of Step 1, uses it to read the file data from S3, and loops over the data present in the file; in this case, the list of Bank details. We can set a concurrency limit here, which helps with Lambda resource management.
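The post doesn't say exactly how we configured this, but one way to loop directly over a JSON file in S3 is a Distributed Map state with an ItemReader. A hedged sketch, again as a Python dict with assumed field values:

```python
# Sketch of a Distributed Map state that reads its items from the JSON file
# written in Step 1, instead of from the state input (values are hypothetical).
bank_loop_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "JSON"},  # the file is a JSON array of banks
        "Parameters": {
            "Bucket.$": "$.bucket",  # taken from Step 1's response
            "Key.$": "$.key",
        },
    },
    "MaxConcurrency": 5,  # tune this to protect downstream resources
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "GetBankUsers",
        "States": {
            "GetBankUsers": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:get-bank-users",
                "End": True,
            }
        },
    },
    "End": True,
}
```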
Step 3: Get Bank Users Lambda
It is a Lambda that receives a single Bank detail at a time. It works the same way as Step 1; the only difference is that we fetch Users instead of Banks. We fetch Users based on the Bank detail we received, store the User list in an AWS S3 bucket as a JSON file, and return the key of the S3 file as the Lambda's response.
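A compact sketch of this handler, under the same assumptions as Step 1 (the bucket name, field names, and fetch_users_for_bank() helper are hypothetical):

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "loan-processing-artifacts"  # hypothetical bucket name


def fetch_users_for_bank(bank_id):
    # Placeholder for the real DB query (the post does not show it).
    return [{"user_id": 42, "bank_id": bank_id}]


def handler(event, context):
    bank = event  # the Bank loop passes one bank detail per invocation
    users = fetch_users_for_bank(bank["bank_id"])
    key = f"users/{bank['bank_id']}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(users))
    return {"bucket": BUCKET, "key": key}
```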
Step 4: Users Loop
It works the same way as Step 2: the Step Function reads the data from S3 using the file key received from the previous step and then loops over the User list present in the file. If the file has no data, the Map state handles that case automatically and simply runs zero iterations for that particular Bank.
Step 5: Identify Data
It is a Lambda function that is the main step of the whole process. The input received here is a User detail. We fetch all the loans of that User from the DB, apply our magic numbers to the loans one by one using an algorithm, and boom: the leads are generated. The lead data is stored in an S3 file, since the data can be huge and we don't want to rely on the Lambda output payload limit. This Lambda again returns the S3 file key as the response for the next step.
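A minimal sketch of this step, with the scoring algorithm and DB query stubbed out since the post does not reveal them (all names and fields are assumptions):

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "loan-processing-artifacts"  # hypothetical bucket name


def fetch_loans_for_user(user_id):
    # Placeholder for the real DB query (the post does not show it).
    return [{"loan_id": 101, "amount": 50000}]


def score_loan(loan):
    # Placeholder for the post's "magic numbers" algorithm (an assumption).
    return {"loan_id": loan["loan_id"], "score": 0.9, "eligible": True}


def handler(event, context):
    user = event  # the Users loop passes one User detail per invocation
    loans = fetch_loans_for_user(user["user_id"])
    leads = [score_loan(loan) for loan in loans]
    leads = [lead for lead in leads if lead["eligible"]]
    key = f"leads/{user['user_id']}/{uuid.uuid4()}.json"
    # Leads can be large, so write them to S3 instead of returning them directly.
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(leads))
    return {"bucket": BUCKET, "key": key}
```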
Step 6: Send Email Lambda
It is a Lambda function that receives an S3 file key as input. We read the data from the S3 file and check whether it is eligible for an email notification to the client. If it is, we combine all the data per user and send each User a single collective email with the required information. That's it; the process is now finished.
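The post doesn't say which mail channel we used; assuming Amazon SES and hypothetical lead fields (user_email, loan_id), a sketch of the grouping-and-sending logic could look like this:

```python
import json
from collections import defaultdict

import boto3

s3 = boto3.client("s3")
ses = boto3.client("ses")
SENDER = "noreply@example.com"  # hypothetical verified sender address


def handler(event, context):
    # Read the leads file produced by the previous step.
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    leads = json.loads(obj["Body"].read())

    # Group leads per user so each user receives one collective email.
    per_user = defaultdict(list)
    for lead in leads:
        per_user[lead["user_email"]].append(lead)

    for email, user_leads in per_user.items():
        body = "\n".join(f"Lead generated for loan {l['loan_id']}" for l in user_leads)
        ses.send_email(
            Source=SENDER,
            Destination={"ToAddresses": [email]},
            Message={
                "Subject": {"Data": "Your new leads"},
                "Body": {"Text": {"Data": body}},
            },
        )
```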
What was the result?
Good news: the decision turned out to be correct, and the implementation was a success for us and for the Clients too. After the implementation, the process took less than an hour, compared to 3 hours before. It was a big relief for us. Everyone was happy, and I even got words of appreciation from my manager for it.
Note: We could increase the concurrency of the Step Function and reduce the time even further, but if we did that, we would face database open-connection errors, since the Clients query the same database at the same time that the process is using it.