Monday, September 6, 2021

No Downtime For Tranquility On September 9th

With the war in Delve over and each side on opposite ends of the New Eden cluster, the EVE Online developers are free to experiment with technical changes. One such experiment will occur on Thursday when, for the second time, CCP will not perform the daily server reboot. CCP first skipped downtime on 4 December 2019 in order to see what broke.

CCP Explorer posted a dev blog outlining the objectives of this week's test.

  • Verify the fixes made for the issues discovered in the previous experiment in the live production environment
  • Verify that no other code/features have regressed since last time and in general look for further issues
  • Observe memory usage
  • Verify that our technology platform (which you will hear more about later) is not making any downtime assumptions

In 2019, CCP noted three categories of issues associated with skipping downtime. The first centered on systems built around the existence of downtime: structures not finishing 24-hour timers, corporations not joining Faction Warfare, and asteroid belts not respawning.

The second issue was time desynchronization of the servers. I'll quote CCP Explorer's explanation.

The time desynchronization was a known issue, but last time we were observing whether players noticed at the end of day #2. The target for time desynchronization is a maximum of ±0.5 seconds. But with newer hardware, we had been observing an end-of-run desynchronization of 2.25 seconds and - predictably - 4.5 seconds at the end of day #2 in the first no-downtime experiment in 2019.

Players started to notice once the desynchronization was above 3 seconds, mostly by noting what felt like module lag or delay when their client and the node hosting their solar system disagreed significantly about when modules were cycling. Time desynchronization is now normally within ±1/100 of a second, well within the maximum of ±0.5 seconds.
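The figures in the quote are consistent with a clock drift that grows linearly between reboots: 2.25 seconds after one day and 4.5 seconds after two. The short Python sketch below works through that arithmetic. The linear model and the function names are my own assumptions for illustration, not anything CCP has published.

```python
# Rough sketch: assumes the desynchronization grows linearly with uptime,
# which is only inferred from the two quoted data points (2.25 s, 4.5 s).
DRIFT_PER_DAY_S = 2.25  # seconds of drift per 24 hours, from the dev blog


def desync_after(hours, drift_per_day=DRIFT_PER_DAY_S):
    """Estimated desynchronization in seconds after `hours` of uptime."""
    return drift_per_day * hours / 24.0


def hours_until(threshold_s, drift_per_day=DRIFT_PER_DAY_S):
    """Estimated hours of uptime before the drift reaches `threshold_s`."""
    return threshold_s / drift_per_day * 24.0


if __name__ == "__main__":
    print(f"Drift after 48 h: {desync_after(48):.2f} s")
    print(f"Hours until the 0.5 s target is exceeded: {hours_until(0.5):.1f}")
    print(f"Hours until the 3 s 'players notice' level: {hours_until(3.0):.1f}")
```

Under that assumption, the old drift would have crossed the 3-second level players noticed about 32 hours after a reboot, which fits with it only becoming visible on day #2.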

The final issue involves memory usage on the servers. Once again, CCP Explorer explained.

Tranquility has always been memory hungry. For better performance, then, we have always opted for pre-computing values & processing data and storing the results for later reference rather than re-computing those values again later. As an example, the Brain in a Box and Dogma Rewrite projects in 2015 were all about computing and storing skills and their effects (i.e., the characters' brains) and transferring the computed results between solar systems instead of re-computing the brains on each entry to a new solar system. We also never clean up any memory, as the cluster node memory is reset every day anyway, which is a reliance on a daily reboot (note: we of course don't clear our DB cache memory or our Redis cache memory, but the main simulation cache memory is cleared in the reboot in each downtime).

The most memory-hungry nodes in the Tranquility cluster, the Character Services nodes that store those brains I mentioned above (among other things), were at 75% memory pressure at the end of day #2 last time, which is just below our operating tolerance of 80%. We might be able to run Tranquility for 3 days (and perhaps 7 hours more) if we were to run the cluster to a "first-node-at 100% memory usage" state, given those 2019 numbers. In 2019, the day #1 memory pressure was at 55%, but these days it is around 35% and so we want to rebase our observations.
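To see roughly where an estimate like "3 days and a bit" comes from, here is a small Python sketch that extrapolates the two 2019 data points (55% at the end of day #1, 75% at the end of day #2) in a straight line. The linear growth model and the function name are assumptions of mine; CCP's own estimate of 3 days and perhaps 7 hours presumably rests on more detailed data.

```python
# Rough sketch: straight-line extrapolation of memory pressure from two
# end-of-day readings. The linear model is an assumption for illustration.


def days_to_reach(limit_pct, day1_pct, day2_pct):
    """Days of uptime until memory pressure reaches `limit_pct`,
    given the readings at the end of day 1 and day 2."""
    growth_per_day = day2_pct - day1_pct
    if growth_per_day <= 0:
        raise ValueError("memory pressure is not growing; no limit is reached")
    return 1 + (limit_pct - day1_pct) / growth_per_day


if __name__ == "__main__":
    # 2019 figures for the Character Services nodes, from the dev blog.
    day1, day2 = 55.0, 75.0
    print(f"80% operating tolerance after {days_to_reach(80, day1, day2):.2f} days")
    print(f"100% memory usage after {days_to_reach(100, day1, day2):.2f} days")
```

With the 2019 numbers, the straight line hits the 80% tolerance at about 2.25 days and 100% at about 3.25 days, in the same neighborhood as the figure CCP Explorer gives. Since day #1 memory pressure is now around 35%, the old numbers no longer apply, which is why CCP wants fresh observations.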

Now we get to the exciting part. Skipping downtime on Thursday is a test of new technology, not just a check on bug fixes.

No-downtime is a long-term goal and all our technological advances aim towards that. We have been working for a few years now on a micro-service and message bus technology platform for EVE, and started using that platform for a number of features. We now want to observe how that ecosystem holds up with no downtime of the primary game cluster, making sure no assumptions have been made about a daily downtime. [emphasis mine]

When I started playing EVE in 2009, the daily downtime ran for one hour. On 1 November 2010, the official daily downtime was reduced to 30 minutes. The official downtime period was halved again on 11 May 2016. And as CCP Explorer noted in today's dev blog, the current average daily downtime has been under 5 minutes since December 2019. Maybe the dream of no daily downtime will soon come true.
