As you move your estate onto the cloud, the cost genie inevitably escapes the bottle: engineers and ops personnel of all flavours spin up test and development environments, and the number of machines escalates beyond all your estimates and predictions.
1) Get your AWS tagging correct from day 1.
This is an essential step that allows you to slice and dice your costs and see where the money is going.
Tagging needs to be baked into your DevOps process from the start so it cannot be circumvented and is 100% consistent across your estate.
Tagging metadata on instances allows you to see what’s going on with billing: see http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
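One way to make tagging hard to circumvent is a check in the deployment pipeline that rejects anything missing the agreed tags. The sketch below is only an illustration: the required tag keys (`team`, `env`, `app`) are assumptions rather than anything AWS mandates, and the input is the list-of-Key/Value shape that EC2 uses when describing an instance.

```python
# Hypothetical tagging policy: these keys are illustrative, pick your own.
REQUIRED_TAGS = ("team", "env", "app")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or have an empty value.

    `tags` is a list of {"Key": ..., "Value": ...} dicts.
    """
    present = {t["Key"] for t in tags if t.get("Value", "").strip()}
    return [key for key in required if key not in present]


if __name__ == "__main__":
    example = [{"Key": "team", "Value": "payments"}, {"Key": "env", "Value": ""}]
    problems = missing_tags(example)
    if problems:
        print("Refusing to deploy, missing tags:", ", ".join(problems))
```

A pipeline step could fail the build whenever `missing_tags` returns a non-empty list.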
2) Make the Teams accountable
Centralised cost control is tricky as problems get seen too late in the day, and not by the DevOps folks in the teams who made those technically nuanced and expensive decisions.
Making the teams accountable for their spend is the best solution to this but necessitates
i) their engagement
ii) Visible reporting ( see ‘Show me your numbers’ below)
Ideally this can be slightly gamified by making costs a metric for teams, but fair comparison is tricky, so aim for a ratio-based metric, perhaps something like
AWS Cost in Cents / Number of Customer Sign ups
with, say, 50 cents per customer as the target for all teams.
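As a rough sketch, the ratio above could be computed per team like this. The function names and the figures in the usage example are made up for illustration; only the 50 cent target comes from the text.

```python
def cents_per_signup(aws_cost_dollars, signups):
    """AWS cost in cents divided by customer sign-ups; None if there were no sign-ups."""
    if signups <= 0:
        return None
    return aws_cost_dollars * 100 / signups


def meets_target(aws_cost_dollars, signups, target_cents=50):
    """True when the team's ratio is at or below the target."""
    ratio = cents_per_signup(aws_cost_dollars, signups)
    return ratio is not None and ratio <= target_cents


if __name__ == "__main__":
    # Illustrative numbers only: $1,200 of AWS spend and 3,000 sign-ups in a month.
    print(cents_per_signup(1200, 3000), meets_target(1200, 3000))
```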
3) Show me your numbers
To allow your team to be accountable, and also for an overall view of costs you need to make those costs visible.
Your AWS console has good visualisation tools that allow you to slice and dice based on tags, but making that global cost data accessible to everyone may not be politic, so other tools like
Netflix Ice (https://github.com/Netflix/ice)
Splunk AWS costs plugins (https://splunkbase.splunk.com/app/1577/)
allow a degree more reporting flexibility.
4) Get the basics right
Use the AWS billing alarms.
They are not very fine-grained, and need frequent maintenance as your estate grows so as not to mis-alert, but an email warning that you’ve spent 75% of your budget on the 3rd of the month is very worthwhile, especially in the early days of your use.
Budgets for teams are also supported
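A billing alarm of this kind can be created through the CloudWatch API. The sketch below only builds the parameters; the budget, the fraction and the SNS topic ARN are placeholders, and it assumes billing alerts are enabled on the account and that charges are reported in USD. Billing metrics live in the us-east-1 region.

```python
def billing_alarm_params(monthly_budget_usd, fraction, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm on estimated charges."""
    return {
        "AlarmName": f"billing-{round(fraction * 100)}pct-of-budget",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": monthly_budget_usd * fraction,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that emails the team
    }


# Usage (needs boto3 and credentials, so it is left as a comment):
#   import boto3
#   params = billing_alarm_params(10000, 0.75, "arn:aws:sns:us-east-1:123456789012:billing")
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

The threshold is a fixed dollar amount, which is why these alarms need revisiting as the estate and the budget grow.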
5) What are the main causes of waste and possible solutions?
The main culprits for wasted money are
i) Dev / test environments left on when unused
- Solution: See step 8) below on a script to turn off dev environments.
ii) Defunct resources not being terminated on production, load test, staging etc.
- Solution: See step 9) below on occasional manual sweeps.
iii) You’re not using reserved instances
- Solution: Use them, but be careful. See step 7) below.
iv) Low utilisation (tends to be the hardest to solve; see step 6) below)
.a) Machines the wrong size for the work they do
- Solution: Devs need to re-evaluate the instance size.
.b) Instance just runs crons twice a week
- Solution: Perhaps use an AWS Lambda based utility, which will spin up resources only while it runs rather than lying idle waiting to work.
.c) Only one application per instance
- Solution: Perhaps you need better scheduling on your PaaS to stack multiple apps on bigger instances?
But bear in mind an average utilisation above 5% is pretty good across a data-centre
6) Tips on Low Utilisation
Very often the biggest culprit for wasted cash looks like ‘low utilisation’, meaning the machines aren’t doing much: perhaps they are just running intermittent batch jobs, or the machine chosen is over-specced.
Unfortunately, getting to the bottom of each case of under-utilisation is hard because
i) Each case needs individual investigation (and the reasons why things are as they are may be hard to find or long forgotten)
ii) Fixing it after the event often seems more expensive than what it’s costing
Tooling can help
The AWS ‘Trusted Advisor’ tool on your AWS console is a great and free way to get clues as to where the money is going.
Third-party services like Cloudyn, Cloudability etc. can also help here, as they automatically trawl the AWS CloudWatch logs with some intelligence to recognise common anti-patterns and then make recommendations. This intelligence is something you need to apply yourself if you work on the raw Trusted Advisor data, and there’s quite a bit of judgement involved.
These services typically charge a percentage of your spend, though, so get your cost house in reasonable order first before giving them 2% of the total.
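If you do want a first pass of your own before paying for a service, a sketch like the following can shortlist candidates from CPU figures you have already exported. The input shape (instance id mapped to a list of CPU percentage samples) and the default threshold are assumptions; the peak is reported alongside the average because an intermittent batch box looks idle on average but is not.

```python
from statistics import mean


def low_utilisation(cpu_samples, threshold_pct=5.0):
    """Return (instance_id, average_cpu, peak_cpu) for instances averaging below the threshold.

    `cpu_samples` maps an instance id to a list of CPU utilisation percentages.
    Results are sorted with the idlest instance first.
    """
    report = []
    for instance_id, samples in cpu_samples.items():
        if not samples:
            continue  # no data is a separate question, not proof of idleness
        average = mean(samples)
        if average < threshold_pct:
            report.append((instance_id, round(average, 1), max(samples)))
    return sorted(report, key=lambda row: row[1])
```

Each entry on the shortlist still needs the individual investigation described above; the code only tells you where to start looking.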
7) Tips on Reserved instances
Reserving instances means committing to them for longer periods of time, but savings can be significant : 30-50%.
It used to be the case that AWS reservations had to be paid for up-front, which was a huge outlay, but that’s been addressed now with ‘no upfront’ reservations. So if you’re sure an app will stay on a specific machine class for a year, reservations are the answer, with a few caveats:
i) Rightsize, then reserve. Check your utilisation is good for the instances / applications you’re going to reserve, otherwise you’ll lock in over-specced instances.
ii) Start slow. Reserve a few instances and check over the next month that utilisation of those reservations is good and that your process of identifying good reservation candidates is working.
Identifying candidates needs expertise on the estate and the applications, so it might best sit with, or at least get reviewed by, some centralised DevOps or architecture function.
iii) The reserving interface is a nightmare and Amazon are wary of take-backs on reservations, even those made in error, so to avoid incredibly expensive mistakes:
.a) Two heads are better than one, so pair when using it.
.b) Submit the reservations form in small chunks, e.g. reserve 2 line items and submit rather than 20 and submit. This makes it easier to double-check.
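The arithmetic behind ‘rightsize, then reserve’ is worth spelling out: a reservation is paid for every hour of the term whether or not the instance runs, so it only wins if the instance would otherwise run for more than (1 - discount) of the term. The sketch below assumes a one-year term and a simple flat discount; the hourly price used in the comments is hypothetical.

```python
HOURS_PER_YEAR = 8760


def break_even_hours(discount, hours_in_term=HOURS_PER_YEAR):
    """Hours of actual use above which the reservation is cheaper than on-demand."""
    return (1 - discount) * hours_in_term


def reservation_saving(on_demand_hourly, discount, hours_needed, hours_in_term=HOURS_PER_YEAR):
    """Money saved by reserving (negative means the reservation costs more)."""
    on_demand_cost = on_demand_hourly * hours_needed
    reserved_cost = on_demand_hourly * (1 - discount) * hours_in_term
    return on_demand_cost - reserved_cost


# With a hypothetical $0.10/hour instance and a 40% discount:
#   always-on box:            reservation_saving(0.10, 0.4, 8760) is positive
#   dev box used 3,120 h/yr:  reservation_saving(0.10, 0.4, 3120) is negative
```

This is why an environment that is switched off overnight and at weekends is usually a poor reservation candidate.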
8) Make a destroy script to tear down all test / dev environments overnight and weekends
AWS Lambda is a good fit for this.
If your tagging is correct and un-circumventable, then you should have the confidence to run such a script, which basically uses the AWS APIs to list resources and then turns off everything with e.g. the ‘env’ tag set to ‘dev’ or ‘test’.
As you design such a script, make sure you include a ‘white list’ feature where exceptional kit can be configured not to be turned off. Also make sure you honour any ‘Termination Prevention’ tags.
The harder bit can be restarting stuff in the morning, which probably also needs automation, especially since getting permissions right so that teams can easily do this themselves is tricky.
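A minimal sketch of such a script as a Lambda function is shown below. The whitelist contents and the `termination-prevention` tag key are hypothetical, and the sketch stops instances rather than terminating them so they can be started again in the morning. Instances that belong to an Auto Scaling Group would simply be replaced when stopped, so a real script would probably scale those groups to zero instead.

```python
STOP_ENVS = {"dev", "test"}
WHITELIST = {"i-0123456789abcdef0"}     # hypothetical ids that must stay up
PROTECT_TAG = "termination-prevention"  # hypothetical tag key


def tags_of(instance):
    """Flatten the EC2 Key/Value tag list into a dict."""
    return {t["Key"]: t["Value"] for t in instance.get("Tags", [])}


def instances_to_stop(instances, whitelist=WHITELIST):
    """Pick running dev/test instances that are neither whitelisted nor protected."""
    selected = []
    for inst in instances:
        tags = tags_of(inst)
        if inst["State"]["Name"] != "running":
            continue
        if tags.get("env") not in STOP_ENVS:
            continue
        if inst["InstanceId"] in whitelist:
            continue
        if tags.get(PROTECT_TAG, "").lower() == "true":
            continue
        selected.append(inst["InstanceId"])
    return selected


def handler(event, context):
    """Lambda entry point: list instances, then stop the selected ones."""
    import boto3  # imported here so the selection logic can be tested offline

    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate()
    instances = [i for page in pages for r in page["Reservations"] for i in r["Instances"]]
    ids = instances_to_stop(instances)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

Keeping the selection logic in a pure function makes it easy to dry-run against a saved instance listing before letting the script loose.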
9) Occasional Manual Cleanup Sweeps
Clear up any orphaned Auto Scaling Groups, load balancers etc. Sometimes these don’t get cleared up fully when the pipeline ‘destroy’ step fails for whatever reason.
Remember to tear down load testing environments promptly.
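A sweep can start from a simple listing of obviously empty resources. The sketch below works on data already fetched from the API: it assumes the shapes returned for classic load balancers (`LoadBalancerName`, `Instances`) and Auto Scaling Groups (`AutoScalingGroupName`, `Instances`, `DesiredCapacity`). It only lists candidates and leaves the decision to delete with a human.

```python
def empty_load_balancers(load_balancers):
    """Names of load balancers with no registered instances."""
    return [lb["LoadBalancerName"] for lb in load_balancers if not lb.get("Instances")]


def empty_auto_scaling_groups(groups):
    """Names of groups with no instances and a desired capacity of zero."""
    return [
        g["AutoScalingGroupName"]
        for g in groups
        if not g.get("Instances") and g.get("DesiredCapacity", 0) == 0
    ]


# Usage (needs boto3 and credentials, so it is left as a comment):
#   import boto3
#   lbs = boto3.client("elb").describe_load_balancers()["LoadBalancerDescriptions"]
#   asgs = boto3.client("autoscaling").describe_auto_scaling_groups()["AutoScalingGroups"]
#   print(empty_load_balancers(lbs), empty_auto_scaling_groups(asgs))
```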
10) Be careful with health-check rules on Auto Scaling Groups (ASGs).
Prefer the EC2 status checks to the ELB-based ones. Any flaw in the ELB rules can lead to machines spinning up in the ASG, immediately being flagged unhealthy and then torn down, over and over. Even though each machine is up for only a few seconds, you get billed for the minimum charge period each time (a full hour under hourly billing), so this fluttering can get expensive.
Part of your reporting in step 3) above could be the number of instances used in a day; that will show fluttery instance starts like this.
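Counting launches per day is straightforward once you have the launch timestamps, however you collect them. In the sketch below the `max_expected` limit is an assumption you would tune to your own estate.

```python
from collections import Counter
from datetime import datetime


def launches_per_day(launch_times):
    """Count instance launches per calendar day (keys are ISO date strings)."""
    return Counter(t.date().isoformat() for t in launch_times)


def fluttery_days(launch_times, max_expected):
    """Days on which more instances were launched than you would expect."""
    counts = launches_per_day(launch_times)
    return sorted(day for day, n in counts.items() if n > max_expected)


if __name__ == "__main__":
    # Illustrative data: 30 launches within half an hour, then a normal day.
    times = [datetime(2016, 3, 1, 10, m) for m in range(30)] + [datetime(2016, 3, 2, 9, 0)]
    print(fluttery_days(times, max_expected=10))
```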
11) Prefer AWS Lambda for cron type tasks rather than a dedicated machine.
It’s much cheaper (and cooler), and a dedicated machine for cron often has very low utilisation.
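For illustration, a cron-style Lambda function can be as small as the sketch below. `run_job` is a hypothetical placeholder for whatever the cron box used to do, and the schedule itself lives outside the code in a scheduled rule such as `cron(0 6 ? * TUE,FRI *)`. Scheduled events carry an ISO 8601 `time` field, which the handler passes on to the job.

```python
from datetime import datetime


def run_job(now):
    """Hypothetical placeholder for the work the cron machine used to do."""
    return f"job ran at {now.isoformat()}"


def handler(event, context):
    """Lambda entry point invoked by a scheduled rule."""
    fired_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    return run_job(fired_at)
```

You pay only for the time the function runs, which is the whole attraction for a job that fires twice a week.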
12) AWS can bill in numerous currencies….
so if you’re outside the US but paying in Dollars there might be some small savings to be had in paying in local currency and tax treatments. See