In late October 2021, Roblox’s global online gaming network went down in an outage that lasted three days. The site is used daily by some 50 million players. Engineers at Roblox and its main technology provider, HashiCorp, went to great lengths to find and fix the faults behind the outage. At the end of January, Roblox finally published a remarkably detailed postmortem on its blog. It shows that the failure resulted from several concurrent events. For any business running infrastructure as code at scale, or making heavy use of containers and microservices, the process Roblox and HashiCorp followed to diagnose and ultimately fix the issues is instructive. Here are some of the lessons to be learned from the Roblox outage.
Roblox bet it all on HashiCorp’s software stack
Roblox’s massively multiplayer online games are distributed around the world to keep network latency as low as possible and to ensure fairness between players who may connect from widely separated locations. To manage a fleet of more than 18,000 servers and 170,000 containers spread around the world, Roblox uses Consul, Nomad, and Vault from HashiCorp. The HashiCorp software is used to discover and schedule workloads and to store and rotate encryption keys. At the 2020 HashiCorp user conference, Rob Cameron, CTO of Roblox Infrastructure, explained how the company uses these technologies and why they are critical to its business model (a transcript and video recording of the talk are available here). Cameron said that “if a player was in the United States and wanted to play with a player located in France,” it was entirely possible. “We will find a solution and we will give them the means to connect. We will make sure to give them the best gaming experience possible by placing the compute servers as close to the players as possible,” he said.
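The placement goal Cameron describes, one server location that works for a US player and a French player alike, can be illustrated with a minimal sketch. The function, region names, and latency figures below are hypothetical, not Roblox’s actual matchmaking logic:

```python
def pick_region(latencies_by_player):
    """Pick the region minimizing the worst-case latency across the
    party, so widely separated players get a comparable experience.

    latencies_by_player: {player: {region: latency_ms}}
    (Illustrative sketch only.)
    """
    # Only consider regions every player can reach.
    regions = set.intersection(
        *(set(r) for r in latencies_by_player.values())
    )
    # Minimize the latency of the worst-off player (minimax).
    return min(
        regions,
        key=lambda region: max(
            lat[region] for lat in latencies_by_player.values()
        ),
    )


latencies = {
    "us_player": {"us-east": 20, "us-central": 40, "eu-west": 110},
    "fr_player": {"us-east": 120, "us-central": 90, "eu-west": 25},
}
# Neither player's home region wins; the fairest compromise does.
print(pick_region(latencies))
```

Minimizing the worst-off player’s latency, rather than the average, is one simple way to express the “fairness between players” goal the article mentions.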
Roblox engineers followed false leads
While investigating the cause of the failure, engineers first noticed a performance issue. They initially assumed that a hardware cluster was faulty and replaced it with new hardware. When performance continued to degrade, they suspected another cause: heavy traffic. The entire Consul cluster was upgraded with twice the CPU cores (from 64 to 128) and faster SSD storage. Other remedies were tried, including restoring from an earlier healthy snapshot, reverting to the 64-core servers, and various configuration changes. All of them failed.
Lesson 1: While hardware issues aren’t uncommon at the scale Roblox operates, blaming hardware from the start can turn out to be a red herring. As we will see, the failure was caused by several software errors in combination.
Roblox and HashiCorp identify two root causes
Roblox and HashiCorp engineers eventually identified two root causes. First, they found a bug in BoltDB, an open-source database used within Consul to store certain log data: it wasn’t properly reclaiming its disk space. The problem was exacerbated by unusually high load on a Consul streaming feature Roblox had recently put into use.
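The BoltDB half of the problem follows a classic storage pattern: pages freed by deletes go onto a freelist for reuse, but the file itself never shrinks. A toy sketch (illustrative only, not BoltDB’s actual code) of why disk usage stays high even after the data is gone:

```python
class PageStore:
    """Toy page store: freed pages are kept on a freelist for
    reuse, and the backing "file" only ever grows. Illustrative
    sketch of the pattern, not BoltDB's implementation."""

    def __init__(self):
        self.pages = []     # backing file: grows, never truncated
        self.freelist = []  # ids of freed pages awaiting reuse

    def alloc(self, data):
        if self.freelist:            # reuse a freed page first
            page_id = self.freelist.pop()
            self.pages[page_id] = data
            return page_id
        self.pages.append(data)      # otherwise grow the file
        return len(self.pages) - 1

    def free(self, page_id):
        self.pages[page_id] = None
        self.freelist.append(page_id)

    def disk_pages(self):
        # On-disk footprint: reflects peak usage, not live data.
        return len(self.pages)


store = PageStore()
ids = [store.alloc(b"log-entry") for _ in range(1000)]
for page_id in ids:
    store.free(page_id)              # delete everything...
print(store.disk_pages())            # ...footprint is unchanged
```

A bug in the bookkeeping of such a freelist, under heavy churn, is exactly the kind of low-level resource issue the postmortem describes.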
Lesson 2: Old problems hide in new systems. What’s interesting about these causes is that they involve the same kinds of low-level resource-management issues that have haunted system designers since the earliest days of computing. BoltDB failed to free disk space as old log data was deleted. Consul’s streaming feature suffered from write contention under very high load. Finding the root cause of these issues meant understanding how BoltDB manages free pages in its data file and how Consul’s streaming uses Go’s concurrency primitives. Scale also changes the game: when running thousands of servers and containers, manual management and monitoring are simply not possible. Monitoring the health of such a complex, large-scale network means deciphering dashboards like this:
Monitoring Roblox resource activity was not enough to prevent the three-day outage. (Roblox)
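The write-contention half of the root cause is just as generic as the freelist bug. One common mitigation is to shard a single global lock into finer-grained ones, so that only writers touching the same data contend. A minimal sketch of that pattern, not Consul’s actual streaming implementation:

```python
import threading


class ShardedBus:
    """Fan out events to subscribers with one lock per topic.

    A single global lock would serialize every publish in the
    process; per-topic locks mean only writers to the *same*
    topic contend. General pattern only, not Consul's code.
    """

    def __init__(self, topics):
        # Locks are created up front to keep the sketch simple.
        self.locks = {t: threading.Lock() for t in topics}
        self.subscribers = {t: [] for t in topics}

    def subscribe(self, topic, callback):
        with self.locks[topic]:
            self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Publishers to other topics proceed in parallel.
        with self.locks[topic]:
            for callback in self.subscribers[topic]:
                callback(event)


bus = ShardedBus(["latency", "errors"])
seen = []
bus.subscribe("latency", seen.append)
workers = [
    threading.Thread(target=bus.publish, args=("latency", i))
    for i in range(50)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(len(seen))  # every concurrent publish was delivered
```

Under very high load, the difference between one hot lock and many cold ones is exactly the kind of behavior that only shows up at scale.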
Lesson 3: Any large-scale service provider must build automation and orchestration routines that can spot failures or abnormal values quickly, before they bring down the entire network. For Roblox, latency variations of a few milliseconds matter, which is why it uses the HashiCorp software stack. But how services are segmented is also key. Roblox was running all of its back-end services on a single Consul cluster, which ended up being a single point of failure in its infrastructure. Roblox has since added a second site and begun creating multiple availability zones to add redundancy to its Consul cluster.
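The kind of automated check Lesson 3 calls for can start as simply as flagging samples that break out of a rolling baseline. A minimal sketch; the window size and sigma threshold are illustrative assumptions, not values from the Roblox postmortem:

```python
from collections import deque
from statistics import mean, stdev


def make_latency_monitor(window=60, sigma=3.0):
    """Return an observer that flags a latency sample as anomalous
    when it sits more than `sigma` standard deviations above the
    rolling mean of the last `window` samples. Illustrative sketch
    of an automated health check, not Roblox's tooling."""
    samples = deque(maxlen=window)

    def observe(latency_ms):
        anomalous = (
            len(samples) >= 2            # need a baseline first
            and stdev(samples) > 0
            and latency_ms > mean(samples) + sigma * stdev(samples)
        )
        samples.append(latency_ms)
        return anomalous

    return observe


observe = make_latency_monitor(window=10, sigma=3.0)
for v in [4.0, 5.0, 6.0, 4.0, 5.0, 6.0]:
    observe(v)          # normal traffic builds the baseline
print(observe(50.0))    # a spike well outside the baseline
```

In production such a check would feed an alerting pipeline rather than a print statement, but the principle is the same: the machine, not a human on a dashboard, notices the abnormal value first.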
Roblox uses the HashiStack to better control its costs
“We build and manage our own foundational on-premises infrastructure because, at the scale we operate as our platform grows, we can significantly control costs compared to using the public cloud, and manage our network latency,” Roblox wrote in its blog post. The “HashiStack” lets Roblox efficiently manage a global network of services and move quickly, in particular by standing up multi-node sites in a matter of days. “With HashiStack, we have a repeatable design pattern to run our workloads wherever we go,” Cameron said in his 2020 presentation. However, too much depended on a single Consul cluster: not just Roblox’s entire infrastructure, but also the monitoring and telemetry needed to understand the state of that infrastructure.
Lesson 4: Network debugging skills are paramount. If you don’t know what’s going on in your network infrastructure, you’re flying blind. But debugging thousands of microservices takes more than checking router logs; you also need a deep understanding of how the different elements fit together. The task was all the harder for Roblox because it built its entire infrastructure on its own custom server hardware. There was also a circular dependency between Roblox’s monitoring systems and Consul. Roblox has since removed that dependency and extended its telemetry to provide better visibility into Consul and BoltDB performance, as well as into traffic patterns between Roblox services and Consul.
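A circular dependency like the one between Roblox’s monitoring and Consul is easy to detect mechanically once the service dependency graph is written down. A sketch using depth-first search; the service names are hypothetical:

```python
def find_cycle(deps):
    """Return one dependency cycle in a service graph, or None.

    deps: {service: [services it depends on]}
    Illustrative sketch of a graph check, not Roblox's tooling.
    """
    visited, in_stack = set(), set()

    def visit(node, path):
        visited.add(node)
        in_stack.add(node)
        path.append(node)
        for dep in deps.get(node, ()):
            if dep in in_stack:      # back edge: we found a loop
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = visit(dep, path)
                if cycle:
                    return cycle
        in_stack.discard(node)
        path.pop()
        return None

    for node in deps:
        if node not in visited:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical graph with the shape of the problem described above:
# monitoring depends on Consul, and Consul's health is reported
# through monitoring.
deps = {
    "monitoring": ["consul"],
    "consul": ["monitoring"],
    "games": ["consul"],
}
print(find_cycle(deps))
```

Running such a check against a declared service graph in CI is one way to catch a monitoring loop before an outage makes it visible the hard way.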
Transparency with customers in the event of breakdowns
It’s not enough to say “we were down, we’re back online.” It is important to explain precisely what happened. Yes, it took Roblox over two months to deliver its postmortem. But the document it produced, detailing the issues, acknowledging the false starts, and describing how the Roblox and HashiCorp engineering teams worked together to resolve the problems, is truly invaluable. It inspires confidence in Roblox, in HashiCorp, and in their engineering teams. When I emailed HashiCorp’s public relations team, they replied: “Because of the critical role our software plays in our customers’ environments, we actively work with them to provide recommended best practices and proactive advice for the architecture of their environments.” Let’s hope your critical infrastructure provider is as well prepared when your next outage occurs.
Clearly, Roblox pushed past the limits of what the HashiStack could deliver, but the good news is that its engineers understood the issues and eventually fixed them. Resolving an outage in three days may not sound exceptional, but given the size and complexity of Roblox’s infrastructure, it is a remarkable accomplishment. And there are lessons to be learned from this experience even in less complex environments, where a software library can still hide a low-level bug that suddenly comes to light later.