This is the story of how we cut our AWS costs by 80% in just under two weeks.
AWS is a candy shop for developers
I need to begin with some introduction. We have been using AWS since 2018 for all our projects, and it has worked miracles for us. We are a fully distributed team, and running our own data center somewhere in the world would be problematic. It is much easier to rent resources from AWS and skip all the capital expenses.
The problem with AWS is that developers can create basically any resource without having to approve it with our finance department. With a traditional data center this is not the case: buying an additional server means getting an invoice from the vendor and asking the finance department to pay it.
The root of the problem, then, is that with AWS, developers can buy resources in whatever amounts they want, whenever they want.
What did we do to cut AWS costs?
We are not a huge company, and our AWS costs are just over $7k per month across all AWS accounts. It is also worth mentioning that we host only DEV and QA environments, as production is paid for by our customers. Our resources are mostly individual dev machines, test databases, and various custom resources for research projects, such as Kinesis Firehose, SageMaker, etc. So we have a lot of random resources that are hard to categorize, structure, predict, and control.
So, how did we tackle lowering our AWS costs?
First, we started looking into Cost Explorer and identified the most expensive items:
- We found a Bitcoin node that had been running for the last four months and was costing us $600/month, as it required a large SSD with additional provisioned throughput. We had done a small research project into Bitcoin Ordinals and never removed the machine.
  Resolution: we archived the volume (which costs $6/month) and terminated the VM.
  Savings: $594/month
- We found an Nvidia Tesla GPU machine that costs $531/month. We use it to this day for generative AI experiments; we are thinking of building our own text-to-video app, so we need this machine.
  Resolution: moved the volume to a spot instance.
  Savings: $360/month
- Not the most expensive, but the most remarkable finding: we had forgotten to remove a demo production environment in one of the unused regions, where we had deployed our Terraform scripts to test rolling out production "from scratch".
  Savings: $340/month
- Many more smaller items.
  Resolutions: vary.
  Savings: $1,700/month
Second, we started moving everything possible to spot instances. For an individual machine, the procedure is simple:
1. Shut the machine down, detach its volume (remember to write down the mount path), and terminate the machine.
2. Create a new spot instance (any AMI will do, as long as the CPU architecture is compatible with your previous volume).
3. Once the spot instance is running, detach (and don't forget to delete!) its new volume and attach the old volume at the same mount path it had on the original machine.
For Beanstalk environments, it's even simpler: we just changed the capacity settings to use only spot instances.
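To make the "create a new spot instance" step concrete, here is a minimal sketch of the launch request in boto3 terms. The AMI ID, instance type, and other values are placeholders, not our actual configuration; the essential part is `InstanceMarketOptions`, which is what turns a regular `run_instances` call into a spot request.

```python
# Sketch of parameters for launching a spot instance that will receive
# the old volume. ami-12345678 and t3.medium are placeholders.
spot_launch_params = {
    "ImageId": "ami-12345678",   # any AMI with a compatible CPU architecture
    "InstanceType": "t3.medium",
    "MinCount": 1,
    "MaxCount": 1,
    "InstanceMarketOptions": {   # this is what makes it a spot instance
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "persistent",        # survives stop/start
            "InstanceInterruptionBehavior": "stop",  # stop instead of terminate
        },
    },
}

# With boto3, this dict would be passed as:
#   boto3.client("ec2").run_instances(**spot_launch_params)
```

We use a persistent spot request with `stop` interruption behavior so the attached volume survives spot interruptions.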
Savings: $1000/month
Third, we cleaned out unused S3 buckets (we had run some auto-trading bots that accumulated a lot of streaming data). We also set up automatic expiration of data in several S3 buckets, so that we don't store trading data for more than a year, after which it becomes completely obsolete.
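As a sketch, the one-year expiration rule looks roughly like this in boto3 terms (the rule ID, prefix, and bucket name are made up for illustration):

```python
# S3 lifecycle rule: delete objects under trading-data/ after one year.
# The ID and prefix are hypothetical placeholders.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-trading-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "trading-data/"},
            "Expiration": {"Days": 365},  # objects older than a year are removed
        }
    ]
}

# Applied with boto3 as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-trading-bucket",
#       LifecycleConfiguration=lifecycle_config,
#   )
```

Once the rule is in place, S3 deletes expired objects on its own, so there is nothing to babysit.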
Savings: $300/month
Fourth, we shrank some resources. It's a matter of checking consumed CPU and RAM: if we see constant utilization below 50%, we lower the instance tier.
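The "below 50% constant use" check can be captured in a small helper. This is a simplified sketch: in practice the samples would come from the CloudWatch `CPUUtilization` metric, which is not shown here.

```python
def should_downsize(cpu_samples, threshold=50.0):
    """Return True if the instance looks like a downsizing candidate.

    cpu_samples: CPU utilization percentages, e.g. hourly averages over
    the last couple of weeks. We require that *every* sample stays below
    the threshold, so a single load spike is enough to keep the current
    instance size.
    """
    return bool(cpu_samples) and max(cpu_samples) < threshold

# An instance idling at 10-20% CPU is a clear candidate:
print(should_downsize([12.5, 18.0, 9.3, 15.1]))  # True
# One spike above 50% keeps it on its current tier:
print(should_downsize([12.5, 80.0, 9.3]))        # False
```

The same idea extends to RAM, except that memory metrics require the CloudWatch agent on the instance, since EC2 does not report them by default.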
Savings: $300/month (would be 3x more on on-demand instances)
Fifth, we set up auto-shutdown for individual machines. We created several Lambda functions for different types of tasks: shutting down a SageMaker Jupyter VM after one hour of inactivity, and shutting down individual VMs and the DEV and QA environments overnight when nobody is working. These Lambda functions are triggered daily by CloudWatch Events. There are also Lambdas to bring the DEV and QA environments back up, to make the process painless.
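The core of such a Lambda is just a timestamp comparison. Here is a sketch of the idle check only; the actual stop call (e.g. via boto3's `stop_notebook_instance` or `stop_instances`) is left out.

```python
from datetime import datetime, timedelta, timezone

def should_shut_down(last_activity, now=None, idle_limit=timedelta(hours=1)):
    """Return True if the machine has been idle longer than idle_limit.

    last_activity: timestamp of the last detected activity (for a
    SageMaker notebook, the last Jupyter kernel activity). The Lambda
    that calls this would then issue the actual boto3 stop call.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_activity > idle_limit

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Idle for 1.5 hours -> shut it down:
print(should_shut_down(datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), now))  # True
# Active 30 minutes ago -> leave it running:
print(should_shut_down(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now))  # False
```

The scheduled CloudWatch Events rule simply invokes this check for each tracked machine and stops the ones that qualify.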
Savings: $500/month
Also, we implemented some smaller solutions for further savings, but they are not covered in this article.
So far, we have saved about $5,500 of our $7,000 monthly bill, which is around 80% of all costs! I knew we were overspending on AWS, but I never realized it was THAT much. Over the course of a year, that means about $66,000 in savings.
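For the record, the arithmetic behind those figures (the percentage rounds to 79, hence "around 80%"):

```python
monthly_bill = 7000
monthly_savings = 5500

# Share of the bill saved, as a rounded percentage:
print(round(monthly_savings / monthly_bill * 100))  # 79
# Annualized savings:
print(monthly_savings * 12)  # 66000
```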
How do organizations approach cloud cost optimization?
After going through our own cloud cost optimization, I understood how important it is to track cloud costs carefully. Cloud cost optimization can save enough to boost the business if you put the saved money into marketing. Or you could take it out as dividends and buy a new car. The sum is substantial, and there are many things that can be done with it.
Since it is beyond question that cloud cost optimization is a necessary endeavor, how do companies approach it? Let's walk through the ways of implementing cloud waste management, from the simplest to the most advanced.
1. Buying just virtual machines
You could approach the problem in the most traditional way possible: forgo the countless managed services AWS provides and restrict your developers to buying EC2 machines.
SQS? No. DynamoDB? No. Just use EC2 virtual machines and install everything on them.
Pros:
- You can predict the spending very well, as there is a flat rate for each type of EC2 VM
- The developers will pack the available machines with the software they need, just like in a traditional physical on-premise data center, making the spending more efficient
Cons:
- You miss out on the benefits of auto-scaling
- Your developers waste time implementing things that already exist as managed services
- You miss the automatic software updates that managed services would apply for you
All in all, it is not a good strategy to treat the cloud as if you had just rented hosting on GoDaddy.
2. Review every request
What if you allow the developers to use and scale any resources, but they have to clear each one with a dedicated department that controls costs? The developers cannot buy or scale resources themselves, but they can ask a designated person to do it for them.
Let's say a developer needs a Kinesis Firehose endpoint (yes, a service you may well never have heard of). Would it be easy for the developer to explain to the controller what they need? The developer would then also have to justify the scaling, and probably even prove that the architectural choice is sound and not wasteful in terms of cost.
Even this one concrete example shows that it just does not work this way. It could work only if the cost management team consists of experts.
And that’s just the tip of the iceberg. Now consider:
- A resource becoming unneeded due to the architecture change
- A developer leaving the job and not removing the resources they used for their individual development purposes
- An emergency when a resource needs to be scaled quickly to avoid business trouble
Pros:
- The developers are allowed to utilize the maximum benefits of AWS managed resources
- The spending is well-controlled
Cons:
- Cloud waste can still come from unneeded resources that are never removed
- The cost management team needs a high level of AWS knowledge
- The level of bureaucracy can damage the business
3. Hire a FinOps team
A more advanced way is to actually find and hire AWS experts to control the spending. They can use the tools that AWS provides out of the box:
- a cost explorer
- a tagging subsystem
- reserved instances
- savings plans
- cost anomaly detection
- much more
These tools are not user-friendly and require well-trained personnel who know what to do with them. With them, though, you can actually start controlling your cloud costs. This approach requires not only tools and highly skilled workers, but also a framework for the team to operate in: periodic check-ups of underutilized resources, shrink-and-clean procedures, and so on.
A team that is basically DevOps with a finance-conscious mindset is called FinOps.
Pros:
- The developers have the full power of AWS
- Small bureaucracy overhead for the developers
- The financial team has full control over the spending in various aspects: per-project, per-team, etc.
- The developers consume resources in a conscious manner
Cons:
- Requires highly skilled staff that barely exists on the market yet, so you may need to train your own
- Vulnerable to human error
- Reaction time is limited by the interval between check-ups: an unused EC2 machine can stay running for one to two weeks or more
4. Use cloud waste management software
Once you think seriously about hiring (or growing) a FinOps team, you should also consider third-party cloud cost optimization software, such as infinops.com. It is an automatic FinOps team member that works 24/7 and is not susceptible to human error. Such software automatically scans your cloud for underused resources and other known ways of saving, such as:
- Using spot instances
- Using reserved instances
- Reducing the number of OpenSearch clusters in QA environments
- Disabling personal VMs for the night
- Auto-shutting off expensive SageMaker VMs with Jupyter
- etc
All these tips come automatically, as your system is constantly scanned for changes. Such advice can save you up to 80% of your monthly bill, which usually means saving at least tens of thousands of dollars over the course of a year.
Pros:
- Great tool for the FinOps team
- Helps beginner FinOps with optimization techniques
- Reduces the human factor
- Enforces periodic reviews of resource consumption
- Enforces tags, lifecycle management, etc
- Allows tracking multiple AWS accounts at once
Cons:
- Has its own cost (usually much less than it saves)