How Netflix Became A Master of DevOps? An Exclusive Case Study

Even though Netflix is an entertainment company, it has left many top tech companies behind in terms of tech innovation. With its single video-streaming application, Netflix has significantly influenced the technology world with its world-class engineering efforts, culture, and product development over the years.

One such practice that Netflix is a fantastic example of is DevOps. Their DevOps culture has enabled them to innovate faster, leading to many business benefits. It also helped them achieve near-perfect uptime, push new features faster to the users, and increase their subscribers and streaming hours.

With nearly 214 million subscribers worldwide and streaming in over 190 countries, Netflix is the most used streaming service in the world today. Much of this success is owed to its ability to adopt newer technologies and to a DevOps culture that lets it innovate quickly to meet consumer demands and enhance user experiences. Yet Netflix doesn’t think in terms of DevOps.

So how did they become the poster child of DevOps? In this case study, you’ll learn about how Netflix organically developed a DevOps culture with out-of-the-box ideas and how it benefited them.

Simform is a leading DevOps consulting and implementation company, helping businesses build innovative products that meet dynamic user demands efficiently. To grow your business with DevOps, contact us today!

Netflix’s move to the cloud

It all began with the worst outage in Netflix’s history when they faced a major database corruption in 2008 and couldn’t ship DVDs to their members for three days. At the time, Netflix had roughly 8.4 million customers and one-third of them were affected by the outage. It prompted Netflix to move to the cloud and give their infrastructure a complete makeover. Netflix chose AWS as its cloud partner and took nearly seven years to complete its cloud migration.

Netflix didn’t just forklift the systems and dump them into AWS. Instead, it chose to rewrite the entire application in the cloud to become truly cloud-native, which fundamentally changed the way the company operated. In the words of Yury Izrailevsky, Vice President, Cloud and Platform Engineering at Netflix:

“We realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud.”

As a significant part of their transformation, Netflix converted its monolithic, data center-based Java application into a cloud-based Java microservices architecture. It brought about the following changes:

Denormalized data model using NoSQL databases
Enabled teams at Netflix to be loosely coupled
Allowed teams to build and push changes at the speed that they were comfortable with
Centralized release coordination and multi-week hardware provisioning cycles gave way to continuous delivery
Engineering teams made independent decisions using self-service tools

As a result, Netflix accelerated innovation and stumbled upon the DevOps culture. It also gained eight times as many subscribers as it had in 2008, and its monthly streaming hours grew a thousandfold from Dec 2007 to Dec 2015.

[Figure: Netflix streaming hours graph]

After completing their cloud migration to AWS by 2016, Netflix had:

[Figure: Netflix after cloud migration]

And it handled all of the above with zero Network Ops Centers and some 70 operations engineers, all of whom were software engineers writing tools that let other software developers focus on the things they were good at.

Netflix’s Chaos Monkey and the Simian Army

Migrating to the cloud made Netflix resilient to the kind of outages it faced in 2008. But they wanted to be prepared for any unseen errors that could cause them equivalent or worse damage in the future.

Engineers at Netflix reasoned that the best way to avoid failure was to fail constantly. So they set out to make their cloud infrastructure safer, more secure, and more available the DevOps way: by automating failure and testing continuously.

Chaos Monkey

Netflix created Chaos Monkey, a tool to constantly test its ability to survive unexpected outages without impacting the consumers. Chaos Monkey is a script that runs continuously in all Netflix environments, randomly killing production instances and services in the architecture. It helped developers:

Identify weaknesses in the system
Build automatic recovery mechanisms to deal with the weaknesses
Test their code in unexpected failure conditions
Build fault-tolerant systems on a day-to-day basis
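The idea of random instance termination can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the `Instance` class, the `chaos_monkey` and `healthy_services` functions, the kill probability, and the in-memory fleet are all hypothetical, and a real tool would call a cloud provider's API instead of flipping a flag.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a production VM."""
    instance_id: str
    service: str
    running: bool = True


def chaos_monkey(fleet, rng, kill_probability=0.2):
    """Randomly terminate running instances and return the ones killed."""
    killed = []
    for instance in fleet:
        if instance.running and rng.random() < kill_probability:
            instance.running = False  # a real tool would call the cloud API here
            killed.append(instance)
    return killed


def healthy_services(fleet):
    """A service counts as up while at least one of its instances is running."""
    return {instance.service for instance in fleet if instance.running}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that a run can be reproduced
    services = ["api", "api", "api", "billing", "billing", "search"]
    fleet = [Instance(f"i-{n}", service) for n, service in enumerate(services)]
    killed = chaos_monkey(fleet, rng)
    print("terminated:", [instance.instance_id for instance in killed])
    print("services still up:", sorted(healthy_services(fleet)))
```

In this toy model, a service that runs on a single instance (like `search`) is the one most likely to disappear from the "still up" list, which is exactly the kind of weakness such a tool is meant to expose.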

The Simian Army

After their success with Chaos Monkey, Netflix engineers wanted to test their resilience to all sorts of inevitable failures and to detect abnormal conditions. So, they built the Simian Army, a virtual army of tools discussed below.

Latency Monkey

It introduces artificial delays in the RESTful client-server communication layer to simulate service degradation and checks whether the upstream services respond correctly. Moreover, very large delays can simulate the downtime of an entire service without physically bringing it down, which tests the system's ability to survive such an outage. The tool was particularly useful for testing new services by simulating the failure of their dependencies without affecting the rest of the system.
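A rough way to picture latency injection is a wrapper that delays a dependency, plus a caller that must cope with the delay. The sketch below is only an illustration under assumptions: `recommendations`, `with_injected_latency`, and `call_with_fallback` are hypothetical names, and the timeout-with-fallback behavior is one possible definition of "responding correctly", not Netflix's actual logic.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def recommendations(user_id):
    """Hypothetical downstream dependency."""
    return [f"title-for-{user_id}"]


def with_injected_latency(call, delay_seconds):
    """Wrap a service call so that every request is delayed artificially."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return call(*args, **kwargs)
    return delayed


def call_with_fallback(call, arg, timeout_seconds, fallback):
    """Caller under test: return a fallback if the dependency is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call, arg)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


if __name__ == "__main__":
    degraded = with_injected_latency(recommendations, delay_seconds=0.3)
    # The caller gives up after 50 ms and serves a generic list instead.
    print(call_with_fallback(degraded, "u1", 0.05, ["popular-title"]))
```

Raising `delay_seconds` far beyond the caller's timeout makes the dependency look completely down to its callers, even though it is still running.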

Conformity Monkey

It looks for instances that do not adhere to the best practices and shuts them down, giving the service owner a chance to re-launch them properly.

Doctor Monkey

It detects unhealthy instances by tapping into health checks running on each instance and also monitors other external health signs (such as CPU load). The unhealthy instances are removed from service and terminated after service owners identify the root cause of the problem.
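As a small sketch of that flow, the following Python combines an instance's own health check with an external signal and takes unhealthy instances out of service while leaving them running for investigation. The `InstanceHealth` class, the `quarantine_unhealthy` function, and the 90% CPU threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Hypothetical health snapshot for one instance."""
    instance_id: str
    health_check_ok: bool  # result of the instance's own health check
    cpu_load: float        # external signal, from 0.0 to 1.0
    in_service: bool = True


def quarantine_unhealthy(instances, cpu_threshold=0.9):
    """Take unhealthy instances out of service and return their ids.

    Instances are not terminated here, so that service owners can still
    inspect them and find the root cause first.
    """
    quarantined = []
    for instance in instances:
        unhealthy = (not instance.health_check_ok
                     or instance.cpu_load > cpu_threshold)
        if instance.in_service and unhealthy:
            instance.in_service = False
            quarantined.append(instance.instance_id)
    return quarantined


if __name__ == "__main__":
    fleet = [
        InstanceHealth("i-1", health_check_ok=True, cpu_load=0.35),
        InstanceHealth("i-2", health_check_ok=False, cpu_load=0.20),
        InstanceHealth("i-3", health_check_ok=True, cpu_load=0.97),
    ]
    print("out of service:", quarantine_unhealthy(fleet))
```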

Janitor Monkey

It ensures the cloud environment runs without clutter and waste. It also searches for unused resources and discards them.

Security Monkey

An extension of Conformity Monkey, it identifies security violations or vulnerabilities (e.g., improperly configured AWS security groups) and terminates the offending instances. It also ensures that the SSL (Secure Sockets Layer) and DRM (Digital Rights Management) certificates are valid and not due for renewal.
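The certificate part of that job boils down to comparing expiry dates against a renewal window. Here is a minimal sketch, assuming the expiry dates are already known; the function name, the 30-day window, and the certificate names are made up for illustration, and a real check would read the dates from the certificates themselves.

```python
from datetime import datetime, timedelta, timezone


def certificates_needing_attention(expiry_by_name, now,
                                   renewal_window=timedelta(days=30)):
    """Return the names of certificates that have expired or will expire
    within `renewal_window` of `now`, sorted alphabetically."""
    return sorted(
        name
        for name, expires_at in expiry_by_name.items()
        if expires_at - now <= renewal_window
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    certificates = {  # made-up names and dates
        "api.example.com": now + timedelta(days=200),
        "cdn.example.com": now + timedelta(days=10),
        "drm-license": now - timedelta(days=1),
    }
    print(certificates_needing_attention(certificates, now))
```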

10-18 Monkey

Short for Localization-Internationalization (l10n-i18n), it identifies configuration and runtime issues in instances serving users in multiple geographic locations with different languages and character sets.

Chaos Gorilla

Chaos Gorilla works like Chaos Monkey but on a larger scale: it simulates an outage of a whole Amazon availability zone to verify that the services automatically re-balance to the functional availability zones without manual intervention or any visible impact on users.
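To make the re-balancing idea concrete, here is a toy Python model in which the capacity of a failed zone is spread across the surviving ones. The `rebalance` function and the zone names and counts are illustrative assumptions; in a real deployment this is handled by load balancers and auto scaling rather than a single function.

```python
def rebalance(capacity_by_zone, failed_zone):
    """Spread the failed zone's capacity evenly across the surviving zones."""
    survivors = sorted(zone for zone in capacity_by_zone if zone != failed_zone)
    if not survivors:
        raise RuntimeError("no healthy availability zone left")
    lost = capacity_by_zone.get(failed_zone, 0)
    share, remainder = divmod(lost, len(survivors))
    rebalanced = {}
    for index, zone in enumerate(survivors):
        # The first `remainder` zones take one extra instance each.
        extra = share + (1 if index < remainder else 0)
        rebalanced[zone] = capacity_by_zone[zone] + extra
    return rebalanced


if __name__ == "__main__":
    before = {"zone-a": 4, "zone-b": 4, "zone-c": 4}
    after = rebalance(before, failed_zone="zone-a")
    print(after)
    # Total capacity is unchanged, so users should not notice the outage.
    assert sum(after.values()) == sum(before.values())
```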

Today, Netflix still uses Chaos Engineering and has a dedicated team for chaos experiments called the Resilience Engineering team (earlier called the Chaos team).

In a way, the Simian Army incorporated the DevOps principles of automation, quality assurance, and business needs prioritization. As a result, it helped Netflix develop the ability to deal with unexpected failures and minimize their impact on users.

On 21st April 2011, AWS experienced a large outage in the US East region, but Netflix’s streaming ran without any interruption. And on 24th December 2012, AWS faced problems in its Elastic Load Balancer (ELB) services, but Netflix didn’t experience an immediate blackout. Netflix’s website was up throughout the outage, supporting most of their services and streaming, although with higher latency on some devices.

Netflix’s container journey

Netflix had a cloud-native, microservices-driven VM architecture that was amazingly resilient, CI/CD enabled, and elastically scalable. It was more reliable, with no SPoFs (single points of failure) and small manageable software components. So why did they adopt container technology? The major factors that prompted Netflix’s investment in containers are:

Container images used in local development are very similar to those run in production. This end-to-end packaging allows developers to build and test applications easily in production-like environments, reducing development overhead.
Containers make it easy to build application-specific images.
Containers are lightweight, allowing building and deploying them faster than VM infrastructure.
Containers only have what a single application needs, are smaller and densely packed, which reduces overall infrastructure cost and footprint.
Containers improve developer productivity, allowing them to develop, deploy, and innovate faster.

Moreover, Netflix teams had already started using containers and seen tangible benefits. But they faced some challenges such as migrating to containers without refactoring, ensuring seamless connectivity between VMs and containers, and more. As a result, Netflix designed a container management platform called Titus to meet its unique requirements.

Titus provided a scalable and reliable container execution solution to Netflix and integrated seamlessly with AWS. In addition, it enabled easy deployment of containerized batch and service applications.

[Figure: Netflix Titus]

Titus served as a standard deployment unit and a generic batch job scheduling system. It helped Netflix expand support to growing batch use cases.

Batch users could also put together sophisticated infrastructure quickly and pack larger instances across many workloads efficiently. Batch users could immediately schedule locally developed code for scaled execution on Titus.
Beyond batch, service users benefited from Titus with simpler resource management and local test environments consistent with production deployment.
Developers could also push new versions of applications faster than before.
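Packing many workloads onto larger instances is, at its core, a bin-packing problem. The sketch below uses the classic first-fit decreasing heuristic to show the idea. It is not Titus's scheduler, and the `pack_jobs` function, job names, and CPU numbers are hypothetical.

```python
def pack_jobs(job_cpus, instance_cpus):
    """First-fit decreasing: sort jobs by size, then place each one on the
    first instance that still has room, adding an instance when none does."""
    free = []       # remaining CPUs per instance
    placement = []  # job names per instance
    by_size = sorted(job_cpus.items(), key=lambda item: item[1], reverse=True)
    for job, cpus in by_size:
        if cpus > instance_cpus:
            raise ValueError(f"{job} needs more CPUs than one instance has")
        for index, remaining in enumerate(free):
            if cpus <= remaining:
                free[index] -= cpus
                placement[index].append(job)
                break
        else:
            # No existing instance had room, so start a new one.
            free.append(instance_cpus - cpus)
            placement.append([job])
    return placement


if __name__ == "__main__":
    jobs = {"encode": 8, "report": 4, "thumbnails": 4, "etl": 6, "train": 10}
    for number, packed in enumerate(pack_jobs(jobs, instance_cpus=16)):
        print(f"instance {number}: {packed}")
```

With these made-up numbers, five jobs that total 32 CPUs fit on two 16-CPU instances, which is the kind of dense packing that keeps the infrastructure footprint small.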

Overall, deployments on Titus took one or two minutes, down from tens of minutes earlier. As a result, both batch and service users could experiment locally, test quickly, and deploy with greater confidence than before.

“The theme that underlies all these improvements is developer innovation velocity.”

-Netflix tech blog

This velocity enabled Netflix to deliver features to its customers fast, making containers extremely important for its business.

Netflix’s “Operate what you build” culture

Netflix invests and experiments significantly in improving development and operations for the engineering teams. But before Netflix adopted the “Operate what you build” model, it had siloed teams. The ops teams focused on the deploy, operate, and support parts of the software life cycle, and developers handed off their code to the ops team for deployment and operation. So each stage in the SDLC was owned by a different person and looked like this:

[Figure: Specialized roles at Netflix]

The specialized roles created efficiencies within each segment but created inefficiencies across the entire SDLC. The issues that they faced were:

Individual silos that slowed down end-to-end progress
Added communication overhead and bottlenecks, which hampered the effectiveness of feedback loops
Knowledge transfers between developers and ops/SREs were lossy
Higher time-to-detect and time-to-resolve for deployment problems
Longer gaps between code complete and deployment, with releases taking weeks

Operate what you build

To deal with the above challenges and drawing inspiration from DevOps principles, Netflix encouraged shared ownership of the full SDLC and broke down silos. The teams developing a system were responsible for operating and supporting it. Each team owned its own deployment issues, performance bugs, alerting gaps, capacity planning, partner support, and so on.

[Figure: Operate what you build at Netflix]

Moreover, they also introduced centralized tooling to simplify and automate the development problems that teams had in common. When additional tooling needs arise, the central team assesses whether they are common across multiple development teams and, if so, builds tools for them. When a problem is too team-specific, the development team decides whether the need is important enough to solve on its own.

[Figure: Centralized tooling at Netflix]

Full Cycle Developers

Combining the above ideas, Netflix built an even better model where dev teams are equipped with amazing productivity tools and are responsible for the entire SDLC, as shown below.

[Figure: Full cycle developers at Netflix]

Netflix provided ongoing training and support in different forms (e.g., dev boot camps) to help new developers build up these skills. Easy-to-use deployment pipeline tools also helped the developers. One example is Spinnaker, a Continuous Delivery platform for releasing software changes with high velocity and confidence.

However, such a model requires a significant shift in the mindset of teams and developers. To apply this model outside Netflix, start by evaluating what you need, counting the costs, and being mindful of bringing in only as much complexity as necessary. Then attempt the mindset shift.

Lessons we can learn from Netflix’s DevOps strategy

Netflix practices are unique to their work environment and needs and might not suit all organizations. But here are a few lessons to learn from their DevOps strategy and apply:

Don’t build systems that say no to your developers

Netflix has no push schedules, push windows, or crucibles that developers must go through to push their code into production. Instead, every engineer at Netflix has full access to the production environment. And there are neither strict policies nor procedures that prevent them from accessing the production environment.

Focus on giving freedom and responsibility to the engineers

Netflix aims to hire intelligent people and give them the freedom to solve problems in the way they see as best. So it doesn’t have to create artificial constraints and guardrails to predict what its developers need to do. Instead, it hires people who can balance freedom and responsibility.

Don’t think about uptime at all costs

Netflix serves its millions of users with near-perfect uptime. But it didn’t think about uptime when it started chaos testing its environment to deal with unexpected failures.

Prize the velocity of innovation

Netflix wants its engineers to do fun, exciting things and develop new features to delight its customers with reduced time-to-market.

Eliminate a lot of processes and procedures

They keep an organization from moving fast. So instead, Netflix focuses on hiring people it can trust and who can make decisions independently.

Practice context over control

Netflix doesn’t control and contain too much. What it does focus on is context. Managers at Netflix ensure that their teams get a constant, high-quality flow of context about the business rather than controlling them.

Don’t do a lot of required standards, but focus on enablement

Teams at Netflix can work with their choice of programming languages, libraries, frameworks, or IDEs as they see best. In addition, they don’t have to go through any research or approval processes to rewrite a portion of the system.

Don’t do silos, walls, and fences

Netflix teams know where they fit in the ecosystem, how they work with other teams, and what their dependents and dependencies are. There are no operational fences over which developers can throw the code for production.

Adopt “you build it, you run it” culture

Netflix focuses on making ownership easy. So it has the “operate what you build” culture but with the enablement idea that we learned about earlier.

Focus on data

Netflix is a data-driven, decision-driven company. It doesn’t guess or fall victim to gut instincts and traditional thinking. It invests in algorithms and systems that comb through enormous amounts of data quickly and notify teams when there’s an issue.
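A very small example of "notify when there’s an issue" is flagging a metric that strays too far from its recent history. The sketch below uses a plain z-score check. It is an assumption made for illustration, not a description of Netflix's systems, and the function name, threshold, and numbers are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to judge
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != average
    return abs(latest - average) / spread > threshold


if __name__ == "__main__":
    streams_per_second = [100, 102, 98, 101, 99]  # made-up metric values
    for value in (101, 60):
        if is_anomalous(streams_per_second, value):
            print(f"ALERT: {value} is far from recent values")
        else:
            print(f"ok: {value}")
```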

Always put customer satisfaction first

The end goal of DevOps is to be customer-driven and to focus on enhancing the user experience with every release.

Don’t do DevOps, but focus on the culture

At Netflix, DevOps emerged as the wonderful result of their healthy culture, thinking and practices.

How Simform can help

Netflix has been a gold standard in the DevOps world for years, but copy-pasting their culture might not work for every organization. DevOps is a mindset that requires molding your processes and organizational structure to continuously improve the software quality and increase your business value. DevOps can be approached through many practices such as automation, continuous integration, delivery, deployment, continuous testing, monitoring, and more.

At Simform, our engineering teams will help you streamline the delivery and deployment pipelines with the right DevOps toolchain and skills. Our DevOps managed services will help accelerate the product life cycle, innovate faster and achieve maximum business efficiency by delivering high-quality software with reduced time-to-market.
