Gaming

The fall guys: why big multiplayer games almost always collapse at launch


On 4 August, British game studio Mediatonic launched a colourful and self-consciously silly game entitled Fall Guys: Ultimate Knockout, in which 60 players compete in zany challenges straight out of Takeshi’s Castle. After beta testing well, it attracted some online buzz, so the developer was prepared for a modestly successful opening day with maybe a couple of hundred thousand participants. Within 24 hours, more than 1.5m people attempted to play.

What happened next has become a familiar story in the world of online multiplayer games: the servers collapsed, the game stopped working and Mediatonic was inundated with furious complaints. Pretty soon, Fall Guys was being review-bombed on Metacritic by petulant players accusing the team of laziness and cynicism. Didn’t they prepare properly for launch? Why didn’t they see this coming?

As with most social media blow-ups, the answer is far too nuanced for Twitter to cope with, but it comes down to this: running a global large-scale multiplayer online game is an expensive, technologically complex endeavour, even in 2020, even after weeks of beta testing and data analysis. Jon Shiring, co-founder of new studio Gravity Well and previously a lead engineer on Apex Legends and Call of Duty 4: Modern Warfare, puts it very simply: “Each game relies on a lot of semi-independent services, and each one is its own scale problem. On top of that, sometimes they interact in complex ways.”

Many players experienced outages and crashes during the launch of sci-fi shooter Destiny 2, despite developer Bungie’s experience with online games. Photograph: Activision Blizzard

One key thing to understand is that game developers usually don’t own or operate the servers that online games run on. Instead, they are rented. A multiplayer game may rely on servers housed in dozens of data centres spread across the world, and there are hundreds of different companies running such centres. Alternatively, a developer may use a large cloud-based service such as AWS, Google Compute Engine, or Microsoft Azure, which run games on virtual machines that share server space among lots of different users. The former option, commonly using “bare metal” servers, can lead to better online performance but is complicated to manage; the latter is easier to manage, and to scale up and down depending on player demand, but can be much more expensive.
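To make that trade-off concrete, here is a rough back-of-envelope comparison in Python. Every number in it – the prices, the demand curve, the assumption that one machine handles 1,000 players – is invented for illustration; real figures vary enormously by provider, region and game.

```python
# Illustrative only: invented prices and demand curve, not real provider rates.
# Compares a bare-metal fleet sized for peak demand with cloud VMs scaled hourly.

PLAYERS_PER_SERVER = 1_000        # assumed capacity of one game-server machine
BARE_METAL_MONTHLY = 300.0        # assumed flat monthly rental per machine
CLOUD_HOURLY = 1.50               # assumed per-hour price per virtual machine

# A crude 24-hour demand curve (players online each hour): quiet overnight,
# peaking in the evening.
hourly_players = [40_000, 30_000, 20_000, 15_000, 15_000, 20_000,
                  40_000, 60_000, 80_000, 90_000, 100_000, 110_000,
                  120_000, 130_000, 140_000, 160_000, 200_000, 260_000,
                  320_000, 350_000, 330_000, 260_000, 160_000, 80_000]

def servers_needed(players: int) -> int:
    return -(-players // PLAYERS_PER_SERVER)   # ceiling division

peak_servers = servers_needed(max(hourly_players))
bare_metal_cost = peak_servers * BARE_METAL_MONTHLY
cloud_cost = sum(servers_needed(p) * CLOUD_HOURLY for p in hourly_players) * 30

print(f"Bare metal, sized for peak: {peak_servers} machines, ~${bare_metal_cost:,.0f}/month")
print(f"Cloud, scaled hourly:       ~${cloud_cost:,.0f}/month")
```

With these made-up figures the cloud fleet comes out dearer; flatten the evening peak, or haggle the hourly rate down, and the balance tips the other way. The point is not which option wins but that the choice is a financial model as much as an engineering one.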

On top of this, developers sometimes employ a middleware service – such as Multiplay or Zeuz – which handles basic outages, monitors data centres and predicts demand. Studios may also use external web development services to manage the game’s databases, and these may be owned by the publisher, the platform or the middleware provider – but the developer will also need to add custom components for their particular game, so there’s a mix of external and internal applications. “This is where a lot of problems lie,” says Shiring. “The sheer complexity of multiple services being called upon by millions of game clients all over the world.”

In fact, the problems of managing an online game begin way before launch, when operations managers require a lot of technical information that the game designers struggle to provide, because they’re still designing the game.

As Shiring explains:

“You’ll have a bunch of questions … How long is the average match? How long will most players play every day? What is your player population split between NA, EU, Asia, South America and Oceania? What percentage of players will use a mic? Session length is important for modelling how many total players will be online at once; the longer each person plays, the more total players will be online simultaneously across different time zones. And bandwidth costs more in certain regions, and each data centre can have its own independent outage. Voice bandwidth can be significant, and can trigger third-party services like speech-to-text.

“A lot of times, your launch outage is a result of these guesses being wildly off.”
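Those guesses feed a fairly simple concurrency calculation. Here is a hedged sketch of the arithmetic in Python – all of the inputs are invented, and a real model would be far more detailed, but it shows why session length matters so much.

```python
# Rough concurrency estimate from the kind of guesses Shiring describes.
# All inputs are invented for illustration.

daily_players = 1_500_000          # players who log in across a 24-hour day
avg_session_hours = 1.5            # average time each player stays online
regional_split = {"NA": 0.35, "EU": 0.35, "Asia": 0.20, "SA": 0.06, "OCE": 0.04}

# Little's-law-style estimate: average concurrency = arrival rate x time spent online.
avg_concurrent = (daily_players / 24) * avg_session_hours

# Evening peaks sit well above the daily average; 2.5x is the multiplier assumed here.
peak_concurrent = avg_concurrent * 2.5

print(f"Average concurrent players: ~{avg_concurrent:,.0f}")
print(f"Assumed peak concurrency:   ~{peak_concurrent:,.0f}")
for region, share in regional_split.items():
    print(f"  {region}: ~{peak_concurrent * share:,.0f} players at peak")
```

Nudge the average session from 1.5 hours to three and every downstream number – game servers, matchmaker throughput, voice bandwidth – doubles with it, which is exactly how a launch-day guess ends up ‘wildly off’.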

So what about beta tests? Most major online games tend to run a small closed beta test with a controlled number of players and then a larger open test that everyone can join in. Surely this provides a lot of the data the studio needs to estimate demand and iron out problems? The answer from all the tech leads I spoke to was “kind of”. One thing to note is that just because you may play a beta test a couple of weeks before a game launches, it does not mean you’re playing an almost finished build – it’ll be a stable build that might be months old. Any work done on the game code after the beta can add new bugs, and new bugs mean new opportunities for unforeseen problems.

Beta tests also can’t account for the utter unpredictability of human behaviour. “Even lengthy playtesting with a large number of testers pales into almost insignificance when it comes to launching for real,” says Rocco Loscalzo, CTO at specialist studio The Multiplayer Guys. “During Beta, a lot of people will have played ‘nice’. At launch, the gloves come off and you attract not only genuine players but also hackers, cheats, and trolls. The more successful your game becomes, the greater the exposure to a wider variety of people, behaviours, and problems.”

As more and more people attempt to access the game, the problems expand and travel up the delivery pipeline, triggering fresh issues along the way, which is perhaps what happened with Fall Guys. “Often a small outage turns into a giant outage,” says Shiring. “What if your game servers start getting an error, and they immediately drop players back to the main menu with an error message? Next you get players searching for matches frantically. Now your matchmaker gets flooded and you have two major issues to fix. Once you get the game servers fixed so they stop getting errors, your users still can’t play until you figure out how to drip-feed players back into the matchmaker – it may take another hour to slowly add players back into matches again.”
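The “drip-feed” Shiring describes is essentially admission control in front of the matchmaker. Here is a minimal sketch of the idea in Python – the queue, the rate and the matchmaker_enqueue callback are all invented for illustration, not anyone’s real code.

```python
import time
from collections import deque

# Minimal admission-control sketch: after an outage, release waiting players
# back into matchmaking at a fixed rate instead of all at once.

REENTRY_PER_SECOND = 500   # assumed rate the recovered matchmaker can absorb

def drain_backlog(waiting_players: deque, matchmaker_enqueue) -> None:
    """Slowly feed a backlog of players back into the matchmaker."""
    while waiting_players:
        batch_size = min(REENTRY_PER_SECOND, len(waiting_players))
        for _ in range(batch_size):
            matchmaker_enqueue(waiting_players.popleft())  # hand over one player
        time.sleep(1.0)   # hold everyone else back until the next tick
```

At 500 players a second, a backlog of 1.5 million takes roughly 50 minutes to clear, which is why a service that is technically “fixed” can still feel broken for the best part of an hour.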

Then there are the outages that are completely beyond the control of the developer, such as a hardware failure in one of the data centres, or a disruption in the cloud server network or with an internet service provider – or hackers. “In our case, a massive DDoS-for-hire service once targeted an entire data centre, causing an outage for everything running inside of it,” recalls Shiring.

So the familiar player refrain of “just add more servers!” often isn’t the solution, because the problem might not be with the server networks at all. It might be with matchmaking or calculating player data (game progress, character set-up, etc) – functions that are centralised, running on just a few machines, and therefore heavily impacted by scale issues. The system may be able to deal with 100,000 simultaneous players having their stats updated many times a minute, but one million? Adding servers won’t help. It’ll just multiply the number of players hitting the choke point.
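The arithmetic behind that choke point is simple enough to sketch. The figures below are invented – the assumed write rate per player and the database ceiling are illustrations, not measurements from any real game.

```python
# Why "just add more game servers" doesn't help a centralised stats service.
# All figures are invented for illustration.

updates_per_player_per_minute = 5         # assumed stat writes per player
central_db_writes_per_second = 12_000     # assumed ceiling of the shared database

def write_load(concurrent_players: int) -> float:
    return concurrent_players * updates_per_player_per_minute / 60

for players in (100_000, 1_000_000):
    load = write_load(players)
    verdict = "within capacity" if load <= central_db_writes_per_second else "over capacity"
    print(f"{players:>9,} players -> ~{load:,.0f} writes/sec ({verdict})")
```

Ten times the players means ten times the write rate arriving at a single service, and extra game servers only deliver that traffic faster. The fixes live at the service itself: batching updates, sharding the database, caching, or queueing writes until the spike passes.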

For the battle royale game Apex Legends, developer Respawn Entertainment took a different approach, bypassing beta tests and launching the game with no pre-publicity. It still attracted 2.5m players within 24 hours. Photograph: Electronic Arts

“Very, very few outages are caused by companies ‘running out of servers,’ as that’s just so easily fixed,” confirms Shiring. “You can just go on Amazon and spin up more servers if you run out – it can be an expensive solution, but nobody wants their game to have headlines about outages so it is probably worth it. But every service basically has a number of bottlenecks – CPUs, threads, RAM, network, partner services – and one of them will suddenly stop scaling. Everyone wants to feel like they are prepared for launches, but the truth is that you just aren’t because you have too many knowledge gaps.”

The final thing most players tend not to think about is cost. Running a server infrastructure for an online game can cost millions of dollars a month. “If a server is being under or over utilised this can often have a negative impact on stability or financials,” says Andrew Prime, lead programmer on Romans: Age of Caesar and Stronghold Kingdoms at Firefly. “It’s a regular balancing act making sure we’re using server architecture that can support the fluctuations we see in our player-base, but that also isn’t so insanely overpowered that it costs us the Earth.”
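One common way of striking that balance is a scaling rule with a comfort band: only resize the fleet when utilisation drifts well away from a target. The sketch below is illustrative – the thresholds and the 1,000-players-per-server figure are assumptions, not Firefly’s actual policy.

```python
# Toy scaling rule: keep enough headroom for stability without paying for idle
# machines. All thresholds are invented for illustration.

TARGET_UTILISATION = 0.70   # aim to keep the fleet roughly 70% busy
SCALE_DOWN_BELOW = 0.40     # don't shrink until it is clearly over-provisioned
MIN_SERVERS = 20            # floor, so a sudden spike has somewhere to land

def desired_fleet_size(current_servers: int, current_players: int,
                       players_per_server: int = 1_000) -> int:
    utilisation = current_players / (current_servers * players_per_server)
    if SCALE_DOWN_BELOW <= utilisation <= TARGET_UTILISATION:
        return current_servers   # inside the comfort band: change nothing
    # Outside the band, resize so the fleet sits back at the target utilisation.
    return max(MIN_SERVERS,
               round(current_players / (players_per_server * TARGET_UTILISATION)))
```

The gap between the two thresholds stops the fleet thrashing up and down every few minutes, which is its own source of instability – and of cost.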

Launching an online game is fraught and complicated. Beta tests can only ever provide so much information, every fix can lead to multiple new problems, and every decision needs to be weighed against the resources available. This is why major studios such as Bungie and Rare have vast control rooms, staffed 24 hours a day, with the very best software engineers, netcode programmers and operations analysts. And still things go wrong, even with all of that in place. There will be a lot of exhausted staff at Mediatonic who now know a hell of a lot more about launching an online multiplayer game than they ever did before. And maybe the most important lesson they’ll learn is that this will probably all happen again.

Mediatonic was approached for this article but declined to comment.


