Adopting Chaos Engineering In Your Organization In 4 Steps

Emrick Donadei
May 21, 2021


Your product is used by a lot of customers, and reliability has become critical: each minute lost for a customer means a potential loss of revenue. You’ve got the best engineers, and the infrastructure is strong and automated, so what could really happen? In fact, everything will fail at some point. Hardware, software, even licenses: many things can fail, often where you least expect it.

But can I really blame you? Our systems are getting more fragmented every day, with so many dependencies on external services such as SaaS offerings, or on internal microservices. No human can really anticipate everything that could happen.

The “death stars” of microservices at Netflix and Amazon

Chaos Engineering is one way of being proactive about your reliability. Controlled failure injections in your infrastructure give you information on how an outage could happen and how it would impact your business. Maybe you already tried to implement it on your own, but it’s not that easy to get your engineers to see the value of the activity. Let me lay out a plan for gradually adopting chaos engineering.

Define a clear objective

Chaos engineering is sexy. We have all heard about Netflix’s experiments, and we all want the same for our organization. But we’re not Netflix, and we don’t have the same objectives they have. Before even starting to adopt chaos engineering, understand what your main objective is.

Focus on one problem you already have. Do you often have outages in unusual components? Are customers complaining about issues you never experienced during development? Do not follow the hype, and don’t wait to discover your objectives while experimenting. Building a chaos engineering practice is a time-consuming process, and it can become really expensive if you don’t know why you are doing it.

Having a clear objective will also help you convince other people to follow you on this path. If you share the same objective, it will be far easier to make it a part of the culture.

Trust your infrastructure

Before starting to run experiments on your infrastructure, are all your basics covered? You should know which services run on your infrastructure and monitor them at all times. Ideally, you already have some auto-healing and alerting in place.

If you don’t trust your infrastructure, or its ability to recover from simple experiments, you should hold off: you won’t get any value out of it, and you will just add more stress and work for your team.
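
One way to make that question concrete is a small pre-flight gate that refuses to start an experiment unless every monitored service reports healthy. The sketch below is a minimal illustration in Python, not a prescribed implementation: the `preflight` and `safe_to_experiment` names and the example service names are hypothetical, and each probe stands in for whatever your monitoring already exposes.

```python
from typing import Callable, Dict

# Hypothetical type: a probe returns True when the service looks healthy.
Probe = Callable[[], bool]


def preflight(probes: Dict[str, Probe]) -> Dict[str, bool]:
    """Run every probe once and record whether each service looks healthy."""
    results: Dict[str, bool] = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that crashes counts as an unhealthy service.
            results[name] = False
    return results


def safe_to_experiment(results: Dict[str, bool]) -> bool:
    """Proceed only when at least one service is known and all are healthy."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    def failing_probe() -> bool:
        raise RuntimeError("metrics endpoint unreachable")

    # Assumed example inventory; replace with real checks against your monitoring.
    results = preflight({
        "api": lambda: True,
        "dns": lambda: True,
        "billing": failing_probe,
    })
    print(results)
    print("safe to experiment:", safe_to_experiment(results))
```

An empty inventory deliberately counts as unsafe: if you cannot list your services, you are not ready to break them on purpose.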

Chaos engineering is only really meaningful when you apply it in production, because a placeholder environment that imitates production will never replace the true chaos of reality. So to bring as much value as possible, you should strive to run experiments in production, which means you need to be sure of the environment’s minimum reliability before starting.

Gradually increase the sophistication of your experiments

There are many really good tools on the market to get started with chaos engineering. I really like Gremlin, Chaos Toolkit, and Litmus, as they all have advanced features for running experiments with full confidence. But you should increase the level of your experiments gradually; you don’t really need a huge tool when you start. Chaos can start with something as simple as running docker kill dns.
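
To show how small a first experiment can be, here is a minimal sketch in Python that wraps the `docker kill` command in a check, inject, observe sequence. It assumes a container named `dns`, as in the command above, and a hypothetical `steady_state` callable that you supply (for example, a name lookup that should succeed). The function names, the report fields, and the `dry_run` default are illustrative choices, not part of any tool.

```python
import subprocess
from typing import Callable, Dict, List


def docker_kill_command(container: str) -> List[str]:
    """Build the argument list for `docker kill <container>`."""
    return ["docker", "kill", container]


def run_experiment(
    steady_state: Callable[[], bool],
    container: str = "dns",
    dry_run: bool = True,
    runner: Callable[..., object] = subprocess.run,
) -> Dict[str, object]:
    """Check health, kill one container, then check health again."""
    cmd = docker_kill_command(container)
    report: Dict[str, object] = {
        "command": cmd,
        "injected": False,
        "healthy_before": steady_state(),
        "healthy_after": None,
    }
    # Never inject a failure into a system that is already unhealthy.
    if not report["healthy_before"]:
        return report
    if not dry_run:
        # check=True raises if the docker command itself fails.
        runner(cmd, check=True)
        report["injected"] = True
    report["healthy_after"] = steady_state()
    return report


if __name__ == "__main__":
    # Dry run with a placeholder probe: prints the plan without killing anything.
    print(run_experiment(steady_state=lambda: True))
```

With `dry_run=True` nothing is killed, so you can walk the team through the flow before pointing it at anything real. The interesting result is a run where `healthy_after` comes back `False`: that is a finding to discuss, not a failed experiment.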

When you’re developing a product, you always try to avoid features that are useless to customers, so do the same for yourself! Don’t overwhelm yourself with too many features, and start with really simple scenarios. See how the team reacts: are they interested? Do they understand the value of experimenting on the infrastructure? Sometimes it won’t click with the team, and that’s OK! At least when it does not work, you will not have spent too much time and effort on it.


Advocate to grow adoption

Chaos engineering is not a set of tools, and it’s not set in stone either. It’s more of a framework that allows you to slowly but surely improve your reliability. If you’re the only one promoting it, you won’t get any value from the activity: the findings from your experiments will rot in backlogs, because nobody else will really understand how critical it is to fix them.

That’s why you really need to make it appealing and always talk about it. Slowly but surely, make it a part of your culture. Get help from management to give it visibility among your peers, but remember that, like any culture, it can’t be forced. If a workshop fails to spark people’s interest, experiment with other things, and step off the paved road to chaos engineering that you find in the books.

I’m personally a fan of gamedays, with an emphasis on the word game. When you experiment on infrastructure, it should remain a fun activity. And chaos engineering, when done right, is damn fun! Pit a team of villains who break things in the infrastructure against a team of good guys who need to detect the problems! This way, you will create early adopters who understand the value of chaos engineering and will advocate on your behalf!

Adopting chaos engineering is a long process that will evolve with your team over time. Don’t expect to follow a simple road to success; be prepared for failures, and learn from them. But in the end, it will be worth it and will massively improve the engineering culture of your company. I wish you luck!


Emrick Donadei

Sometimes too metric-driven. I love new products made with boring tech. Currently working at SAP to drive chaos engineering adoption.