Posted Sep 2021

Not on MY Servers

Written by Jerrel Fielder


Picture this: you run a development organization for a small startup.  Or a large corporation.  You’ve had to make some decisions that, while expedient, aren’t necessarily the most elegant.  Time is money and money is time.  You need to ship product fast.  You have solid architectural standards but the tyranny of the urgent has upended some of your good intentions.  Such is my life at Ponga (https://ponga.com), too…and as the person who runs the development organization and is also a co-founder, I find myself wearing the hat of the tyrant and the victim simultaneously at times.  This is our story.  In considering the telling of it, I wanted to write what I would have found helpful before we started…


First, let me set the scene:  80% of Ponga software and infrastructure is running as asynchronous serverless “microservices”.  The other 20%, not so much: EC2 autoscaling server groups running monolithic Spring Boot Java applications fronted by application load balancers, VPCs, EFS, EBS, and a few other three-letter acronyms that all cost money while waiting for a job to do.  Spring Boot was the fastest way to start, but it felt like we could do better.

Having now made the decision to take a serious look at migrating from monolithic to “micro” and from expensive servers to serverless, I was fully prepared for the eventuality of recoding the logic behind each endpoint as a Lambda and fronting it with an API Gateway.  Figuring at least two weeks per service and about 160 services (some 320 developer-weeks), I realized I needed to hire more developers.  Or find a better way.  I had read about AWS Serverless Java Container (https://aws.amazon.com/blogs/opensource/java-apis-aws-lambda/) and gave it a look.  Around the same time, as we were also exploring serverless monitoring tools for our 80%, I found Thundra.io.  As I dug into their Thundra APM product, it appeared they had made the AWS Serverless Java Container much easier to deploy than the AWS Git project I had found.  After a brief online chat with the Thundra CTO, I was convinced I was on the right path, but it was time to meet the Devil in the details.

Devil #1:  Will our fat jar fit in AWS Lambda?

AWS Lambda has a 50MB limit on the zipped deployment package and a 250MB limit unzipped, including layers.  The jar we began with was just over 300MB…no one deploys Spring Boot, apparently, for its efficient jar sizes.  This meant we had to review all our dependencies and remove those that were not necessary.  We also had to deal with Spring Boot reflection to ensure it wasn’t pulling in dependencies unexpectedly.  And we realized we could strip out the servlet container, as that would now be handled by AWS Serverless Java Container.  The final size was 235MB, which was hardly small but small enough to be considered a win.  Having now cast that devil aside, we were promptly introduced to…
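A quick way to keep an eye on the 250MB ceiling while pruning dependencies is to measure the unzipped size of the build artifact. What follows is only a minimal sketch and is not taken from the Ponga build: the class name `LambdaSizeCheck` is hypothetical, the limit is assumed to be counted as 250 × 1024 × 1024 bytes, and the combined size of any layers has to be supplied by hand.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: rough check of an artifact against the unzipped limit. */
final class LambdaSizeCheck {
    static final long MB = 1024L * 1024L;
    // Assumption: the 250MB unzipped limit is counted in units of 1024 * 1024 bytes.
    static final long UNZIPPED_LIMIT = 250L * MB;

    /** Sums the uncompressed size of every entry in a zip or jar file. */
    static long unzippedBytes(Path archive) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                long size = entries.nextElement().getSize(); // -1 when unknown
                if (size > 0) {
                    total += size;
                }
            }
        }
        return total;
    }

    /** True when the artifact plus its layers stays within the unzipped limit. */
    static boolean fitsUnzipped(long artifactBytes, long layerBytes) {
        return artifactBytes + layerBytes <= UNZIPPED_LIMIT;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No jar given: replay the two sizes quoted in the text.
            System.out.println("300MB fits: " + fitsUnzipped(300L * MB, 0L));
            System.out.println("235MB fits: " + fitsUnzipped(235L * MB, 0L));
            return;
        }
        long layers = args.length > 1 ? Long.parseLong(args[1]) : 0L;
        long unzipped = unzippedBytes(Path.of(args[0]));
        System.out.println("unzipped bytes: " + unzipped
                + ", fits: " + fitsUnzipped(unzipped, layers));
    }
}
```

Run with the path to a jar (and optionally the layer size in bytes) as arguments. A check like this could sit in a build step so that a new dependency that pushes the artifact back over the limit is caught before deployment.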

Devil #2:  Did we use any incompatible code patterns?

When a service runs continuously and may serve multiple clients at the same time, it becomes important to deploy some traffic management or locks so that operations are constrained to data that belongs together and the interleaving of other users’ data is well handled.  We had employed locking semaphores and polling in a few critical processes, a pattern that is incompatible with Lambda, as a Lambda function only runs when invoked and only until the job is complete.  In our case, this was the only pattern we had to rearchitect, but there are others you will want to watch out for, such as cron jobs and file-based control transfer.  Services that cascade calls to another set of services also need to be evaluated carefully, as each call will likely instantiate another Lambda and that could rapidly change the economics at scale.

The services that employed the semaphore/polling pattern would take n records as input and then check for the semaphore to be “unlocked”.  If “unlocked”, the service would lock it and then submit each record, sequentially, to a series of external REST APIs and unlock when complete, guaranteeing no interleaving of data from other user sessions.  We decided to implement an SQS queue for the request payload which would invoke a Step Functions Map state that in turn invokes a Lambda to make the external REST calls, allowing massively parallel execution across both records and users.  SQS provides the mechanisms for failure handling and Step Functions has exponential backoff built in for retries so the external service isn’t swamped.  Not only did this approach allow us to dance past this devil but we also realized a significant boost in performance in that part of the application.  Another win.
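To make the shape of that change concrete, here is a small in-process analogy in plain Java. It is a sketch under stated assumptions and not the production design: in the real system SQS, the Step Functions Map state, and Lambda play these roles, whereas here a thread pool stands in for the Map state and a simple loop stands in for the built-in retry policy. The class name `FanOutSketch`, the retry count of 3, and the 10 ms base delay are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical in-process stand-in for "Map state + per-record worker + retries". */
final class FanOutSketch {

    /** Calls fn; on RuntimeException waits base, base*2, base*4, ... ms and retries. */
    static <T, R> R withBackoff(Function<T, R> fn, T input, int maxRetries, long baseMillis)
            throws InterruptedException {
        long wait = baseMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return fn.apply(input);
            } catch (RuntimeException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the failure
                }
                Thread.sleep(wait);
                wait *= 2; // exponential backoff so the external API is not swamped
            }
        }
    }

    /** Processes each record independently and in parallel; no shared lock is involved. */
    static <T, R> List<R> mapAll(List<T> records, Function<T, R> fn, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T record : records) {
                Callable<R> task = () -> withBackoff(fn, record, 3, 10L);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // results come back in input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // The lambda stands in for the call to an external REST API.
        List<String> results = mapAll(List.of("rec-1", "rec-2", "rec-3"), r -> r + ":sent", 3);
        System.out.println(results);
    }
}
```

The point of the sketch is the absence of a semaphore: because each record is handled as its own unit of work with its own retries, nothing has to wait for a lock to clear, which is where the parallelism comes from.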

Oh, hello…have we met?  You are….

Devil #3:  All those Environment variables

We all learned a long time ago to put our secrets and configuration parameters for the various environments (Dev, Test, Stage, and Prod) into environment variables where we can fish them out at run time instead of having configuration and, gasp, secrets in the code.  Well, we certainly make very liberal use of environment variables in Spring Boot, and we promptly hit another Lambda limitation: Lambda environment variables cannot total more than 4 KB, and that quota cannot be increased by AWS.  So, while I was researching what it would take to move them to AWS Parameter Store, one of our developers got the bright idea to keep the secrets in AWS Lambda environment variables, compile the environment file with the configuration variables into the jar, and try it.  I’m very happy to report that technique worked just fine.  No need for a new AWS service and another successful juke around the devil.

Having now refactored our Spring Boot jar to remove all the stuff we didn’t need, squeezed it to fit into Lambda, replaced our incompatible code patterns with a better queue-based asynchronous pattern, and adjusted our environment variable management, we were ready to smoke test the whole thing with help from the Thundra.io team.  This part was very easy: we only had to add the Thundra Spring Boot Container layer to our Lambda (praying it wouldn’t send us past the 250MB limit; it didn’t), change the Lambda handler to the Thundra Spring Boot handler, and set a couple of environment variables to leverage Thundra’s Spring Boot fast start capability.  As many of the new Lambda’s APIs serve user experience elements of the application, we also deployed Thundra’s “keep warm” Lambda to ensure cold-starts were minimized.  This just left one more introduction to…you guessed it…
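Returning to the environment-variable split for a moment, the idea can be sketched in plain Java as follows. This is an illustration under assumptions and not the Ponga code: Spring Boot has its own property-source mechanism for this, and the class name `SplitConfig`, the file name pattern `/config-<stage>.properties`, and the keys used below are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Properties;

/** Hypothetical lookup: secrets from the environment, everything else from the jar. */
final class SplitConfig {
    private final Properties bundled = new Properties();
    private final Map<String, String> env;

    /** bundledProperties may be null when no file was packaged. */
    SplitConfig(InputStream bundledProperties, Map<String, String> env) throws IOException {
        if (bundledProperties != null) {
            bundled.load(bundledProperties);
        }
        this.env = env;
    }

    /** Reads /config-<stage>.properties from the classpath, i.e. from inside the jar. */
    static SplitConfig fromClasspath(String stage) throws IOException {
        String resource = "/config-" + stage + ".properties"; // naming is an assumption
        try (InputStream in = SplitConfig.class.getResourceAsStream(resource)) {
            return new SplitConfig(in, System.getenv());
        }
    }

    /** The environment wins (secrets, overrides); otherwise use the bundled value. */
    String get(String key) {
        String fromEnv = env.get(key);
        return fromEnv != null ? fromEnv : bundled.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        SplitConfig cfg = fromClasspath("dev");
        // Prints null unless a config-dev.properties with this key is on the classpath.
        System.out.println("api.url = " + cfg.get("api.url"));
    }
}
```

With this split, only the handful of true secrets count against the 4 KB environment quota, while the bulk of the per-environment configuration travels inside the deployment artifact.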

Devil #4:  How to call this giant lambda?

We knew the theory: API Gateway would convert an HTTP request to an event JSON and invoke the Lambda, where the serverless container would transform the event into an HttpServletRequest, hit the correct entry point, capture the output stream, and convert it back to an HTTP response for API Gateway to return to the client.  What we were not sure of was whether we would have to craft an API Gateway for each of the services encapsulated in the Lambda, or for each method (GET, POST, etc.), or if we could just create one and let API Gateway handle the details.  After some experimentation, we discovered the API Gateway {proxy+} method worked amazingly well.  We had only to create one API Gateway endpoint that could forward any request to the Lambda to start the chain of events.  Fantastic!  A slam dunk on another devil.  We were ready to test!!
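The reason a single {proxy+} endpoint is enough is that the routing decision moves inside the function: the event carries the HTTP method and path, and the code behind the handler picks the right entry point. The sketch below shows that idea in a heavily simplified form. It is not the serverless container's implementation (which builds a servlet request and hands it to Spring); the class name `ProxyDispatcher` and the routes are hypothetical, and only the `httpMethod`, `path`, `body`, and `statusCode` fields of the proxy event and response are modeled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified stand-in for "one proxy endpoint, many routes". */
final class ProxyDispatcher {
    // Route table keyed by "METHOD /path"; stands in for the framework's controller mappings.
    private final Map<String, Function<String, String>> routes = new HashMap<>();

    ProxyDispatcher route(String method, String path, Function<String, String> handler) {
        routes.put(method + " " + path, handler);
        return this;
    }

    /** Takes a proxy-style event map and returns a proxy-style response map. */
    Map<String, Object> handle(Map<String, Object> event) {
        String key = event.get("httpMethod") + " " + event.get("path");
        Function<String, String> handler = routes.get(key);
        Map<String, Object> response = new HashMap<>();
        if (handler == null) {
            response.put("statusCode", 404);
            response.put("body", "no route for " + key);
        } else {
            response.put("statusCode", 200);
            response.put("body", handler.apply((String) event.get("body")));
        }
        return response;
    }

    public static void main(String[] args) {
        ProxyDispatcher dispatcher = new ProxyDispatcher()
                .route("GET", "/health", body -> "ok")
                .route("POST", "/echo", body -> body);

        Map<String, Object> event = new HashMap<>();
        event.put("httpMethod", "POST");
        event.put("path", "/echo");
        event.put("body", "hello");
        System.out.println(dispatcher.handle(event));
    }
}
```

Every request, whatever its path or method, arrives through the same `handle` call, which mirrors how one gateway endpoint can front all of the services packaged in the jar.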

Devil #5:  How in the world do you test hundreds of services without curling yourself to death?

To fully test the new environment, we wanted to submit the same payload to both versions of each endpoint, Spring Boot on EC2 and Spring Boot as Lambda, and compare the results.  We also wanted to submit the same variations of malformed requests to ensure we got the same error response.  This means approximately 500 calls to 160+ endpoints and the requirement to compare every response against its peer.  To keep the comparison fair, we replicated the backing services and databases to a new environment for the Spring Boot Lambda to map to, so that the data layer remained in sync across both versions as well.

Using curl, at this scale, would be completely mind-numbing and there would be no possibility of reusing the effort later…so we rejected this method.  We looked at a few other API testing platforms and were equally underwhelmed by the functionality and the lack of reuse potential.  As it happens, well before this adventure began, I was evaluating automated testing platforms and came across testRigor.com.  Their AI engine creates test cases automatically, using actual user session data to seed the algorithm and then allowing the ML / AI to deviate off those paths to increase coverage.  We were a bit early for them but I left convinced that would be our testing platform going forward…and then it hit me:  perhaps their AI engine could be helpful here.  I reached out to Artem, their CEO, and was informed the testRigor platform CAN make the API calls and automatically compare the results and flag any discrepancies…and since we would be creating re-usable test cases, we can also use them as part of our CI/CD pipeline to ensure no regressions.  Win-Win!
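Whatever tool drives it, the underlying check is the same: send an identical request to both deployments and flag any difference in the response. The following sketch illustrates only that comparison step and says nothing about how testRigor implements it. The class name `ParityCheck` is hypothetical, only GET requests are covered, and status code plus body are assumed to be the fields worth comparing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical side-by-side comparison of two deployments of the same API. */
final class ParityCheck {

    /** Minimal reply shape: status code plus body text. */
    static final class Reply {
        final int status;
        final String body;

        Reply(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Returns one line per path whose replies differ between the two sides. */
    static List<String> mismatches(List<String> paths,
                                   Function<String, Reply> oldSide,
                                   Function<String, Reply> newSide) {
        List<String> problems = new ArrayList<>();
        for (String path : paths) {
            Reply a = oldSide.apply(path);
            Reply b = newSide.apply(path);
            if (a.status != b.status) {
                problems.add(path + ": status " + a.status + " vs " + b.status);
            } else if (!a.body.equals(b.body)) {
                problems.add(path + ": bodies differ");
            }
        }
        return problems;
    }

    /** Builds a GET-only caller for one base URL using the JDK HTTP client (Java 11+). */
    static Function<String, Reply> httpGet(String baseUrl) {
        HttpClient client = HttpClient.newHttpClient();
        return path -> {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + path)).GET().build();
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                return new Reply(response.statusCode(), response.body());
            } catch (IOException e) {
                throw new IllegalStateException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        };
    }

    public static void main(String[] args) {
        if (args.length < 3) {
            System.out.println("usage: ParityCheck <oldBaseUrl> <newBaseUrl> <path>...");
            return;
        }
        List<String> paths = Arrays.asList(args).subList(2, args.length);
        mismatches(paths, httpGet(args[0]), httpGet(args[1])).forEach(System.out::println);
    }
}
```

Because the two sides are passed in as plain functions, the same comparison can be pointed at stubs in a unit test or at the two real base URLs, and the list of paths can be kept and rerun later as a regression check.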

We just completed our migration and I’m very happy to report that my list of “big worries” has been fully addressed, to wit:

Will the giant Spring Boot jar take too long to initialize in a Lambda function?
No. Thundra.io’s asynchronous fast start handles this very well.

Will cold-starts have a negative impact on performance vs the Spring Boot service running on EC2?
No, not with the Thundra.io “keep warm” function enabled.

Will we get into an API Gateway nightmare trying to re-implement endpoints to invoke the Lambda?
No. One new API Gateway endpoint using {proxy+} is all it took.

Will there be any unpredicted behavior when a Spring Boot service invokes another Spring Boot service also running in the Lambda?
Not in our case…so far. As mentioned previously, we avoid cascading services so your mileage may vary. I would be interested to hear from others on this topic.

Total time for this adventure, not including building the testRigor test cases, was 11 man-days and breaks down as:

  • Optimizing the Spring Boot build to get under the 250MB limit – 3 days
  • Implementing SQS-lambda pattern to replace our semaphore lock / poll – 6 days
  • Solving the ENV variable problem – 1 day
  • Building and testing the API Gateway {proxy+} method – 1 day

Easily half, or more, of that time was figuring out our approach at each “devil”.  I hope this document is helpful in your migration to serverless and serves to save you time and effort as well.