Meddling with time can have unintended consequences. In the data center world, it only takes a second to cause an outage or even corrupt data.

But since 1972 we have done just that 27 times, adding a leap second every few years in an effort to reconcile two different ways of tracking what time it is.

Now, many in the industry are calling for a rethink of how we approach time, and demanding an end to leap seconds.

This feature appeared in the latest issue of the DCD Magazine. Read it for free today.

Issue 46 - The Last Data Center

Long-term data storage enters a new epoch

07 Nov 2022

Our species has always been in search of precision, with complex mechanical watches replacing centuries of hourglasses, astrolabes, water clocks, and sundials.

But it was in the last century that we made a breakthrough in precision: The atomic clock, which measures time by monitoring the resonant frequency of atoms.

Hundreds of these clocks around the world were used to create International Atomic Time (TAI), a weighted average of their readings in which the second is a constant duration defined by cesium radiation.

But scientists soon realized that something was amiss - the time did not match up to that of Universal Time, or UT1, which is based on observed solar time, where noon is when the Sun is at its apex.

The Earth is not a perfect sphere, nor is its orbit a perfect circle. Complicating matters further is the fact that the Earth's rotation has been slowing due to tidal deceleration and other factors, changing the length of a day.

In the latter half of the last century, this change meant that UT1 fell behind TAI by an average of 1.3ms every day.
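As a rough check on that figure, the arithmetic below shows how long a 1.3ms daily drift would take to add up to a full second. It is only a back-of-the-envelope sketch in Python: the constant name is illustrative, and the real drift varies from year to year.

```python
# Back-of-the-envelope: how long does a 1.3 ms/day drift take to reach 1 second?
DRIFT_MS_PER_DAY = 1.3  # average figure quoted in the article

days_to_one_second = 1000 / DRIFT_MS_PER_DAY          # 1 s = 1000 ms
years_to_one_second = days_to_one_second / 365.25

print(f"{days_to_one_second:.0f} days, about {years_to_one_second:.1f} years")
# prints "769 days, about 2.1 years"
```

That works out to roughly one second every couple of years, which is in line with leap seconds being added every few years.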

In 1972, the international reference time scale Coordinated Universal Time (UTC) was launched in an effort to combine these two competing visions of time.

It began based on TAI (with an initial difference of 10 seconds), but periodically has whole seconds added to bring it closer to the slower time tracked by UT1. Outside of the leap second adjustments, UTC is mapped to atomic time by a constant offset.
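The relationship between the two time scales can be sketched in a few lines of Python. The numbers come from the figures above (a 10-second initial difference plus 27 leap seconds); the function and constant names are illustrative assumptions, and the conversion only holds for instants after the most recent leap second.

```python
from datetime import datetime, timedelta

INITIAL_OFFSET_S = 10    # TAI - UTC when UTC launched in 1972 (from the article)
LEAP_SECONDS_ADDED = 27  # leap seconds inserted since 1972 (from the article)


def tai_minus_utc() -> int:
    """Current constant offset between TAI and UTC, in seconds."""
    return INITIAL_OFFSET_S + LEAP_SECONDS_ADDED


def utc_to_tai(utc: datetime) -> datetime:
    """Map a UTC instant to TAI. Valid only after the most recent leap second."""
    return utc + timedelta(seconds=tai_minus_utc())


print(tai_minus_utc())                              # 37
print(utc_to_tai(datetime(2022, 11, 7, 12, 0, 0)))  # 2022-11-07 12:00:37
```

Note that Python's `datetime` cannot represent a second value of 60, so the leap second itself has no direct representation in this sketch.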

This is the time that the networked world relies on. Computers need to know the precise time to communicate with each other, with accurate time stamps required for billing systems, database sorting, network diagnostics, transactions, and more. If they get the time wrong, things can crash.

Computers come with their own clocks, of course. But quartz oscillators drift, slowly going out of sync with time, causing havoc when multiple systems have a different concept of ‘now.’

So enterprises turn to time servers to tell their systems the time. These use Network Time Protocol (NTP) - a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks - to get within a few milliseconds of UTC.
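To give a sense of how a client can estimate its clock error over a variable-latency network, here is a minimal Python sketch of the standard four-timestamp offset and delay calculation used by NTP. The function name and the example timestamps are illustrative assumptions, and a real client also filters many samples and disciplines the clock gradually.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Estimate clock offset and round-trip delay from one exchange.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    All values are in seconds.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the client is behind the server
    delay = (t3 - t0) - (t2 - t1)         # network round trip, minus server processing
    return offset, delay


# Hypothetical exchange: client is 0.5 s behind, 20 ms each way, 1 ms server processing.
offset, delay = ntp_offset_and_delay(100.0, 100.52, 100.521, 100.041)
print(f"offset={offset:.3f}s delay={delay:.3f}s")
```

The offset estimate assumes the outbound and return paths take equally long; asymmetric routes show up as error in the result.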

A big virtual cluster of timeservers is known as a pool, where a large number of computers volunteer to provide highly accurate time via NTP based on their own source of time from a DCF77 receiver, WWVB receiver, or a GPS receiver, among others.

Both Meta and Google offer their own NTP services, based on their own atomic clocks. "Every pool defines its own rules, engineers have strong opinions," Oleg Obleukhov, the creator of Facebook's Public NTP and cofounder of the company’s internal time card, said.

It’s all a careful balance, where if one mistake happens, everything can come crashing down.

That's why periodically adding a new second can be a significant threat to uptime.

When data centers receive the satellite signal announcing the leap second, they either show the impossible time of 23:59:60, or they miss a second.

That could cause "a negative number, which of course blows up everything in your code," Obleukhov explained.
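A small Python sketch shows how such a negative number can arise when a duration is computed from wall-clock readings that straddle a one-second step backwards. The wall-clock values are simulated and hypothetical; the second half uses `time.monotonic()`, which Python guarantees never goes backwards, as one common way to measure elapsed time safely.

```python
import time


def elapsed(start: float, end: float) -> float:
    """Naive duration: difference between two clock readings, in seconds."""
    return end - start


# Simulated wall-clock readings (seconds since midnight) on a system that
# steps its clock back by one second instead of showing 23:59:60.
start_wall = 86399.9  # 23:59:59.9, work begins
end_wall = 86399.4    # 0.5 s of real time later, after the 1 s step back

wall_elapsed = elapsed(start_wall, end_wall)
print(f"wall-clock duration: {wall_elapsed:.1f}s")  # negative

# A monotonic clock is unaffected by wall-clock steps.
t_start = time.monotonic()
t_end = time.monotonic()
mono_elapsed = t_end - t_start
print(mono_elapsed >= 0)  # True
```

Code that feeds the negative value into a timeout, a sort key, or an unsigned counter is where the trouble typically starts.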

"There are outages all across the industry all around the world when leap seconds hit, where CPUs spin at 100 percent because of such events, where the only remediation was to go and physically reboot devices. This has happened again and again every leap second."

Each one has caused problems, taking down platforms like Reddit, Cloudflare, Foursquare, LinkedIn, and Yelp, among others.

"Throughout my career, I went through multiple leap seconds, and everywhere it was a disaster, and everything was falling apart every time," Obleukhov said.

A report by the National Institute of Standards and Technology (NIST) and the France-based Bureau International des Poids et Mesures (BIPM) found that “contrary to our expectations, the number of problems reported has increased with time.”

In an effort to mitigate such a risk of a sudden change in seconds, Meta has begun 'smearing,' a concept first proposed by Google in 2011, instead of ‘stepping’ a whole second in one go.

Smearing adds a couple of milliseconds every now and then over a longer period of time, reaching a full second just as the new leap second comes in.

There are numerous ways to smear, either by adding equal amounts of milliseconds, or doing so in different amounts at varying intervals.

Google does it over 24 hours, while Meta goes for 17. Alibaba is believed to smear for 12 hours on either side of the leap second.
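The equal-amounts approach can be illustrated with a short Python sketch of a linear smear. It assumes, as described above, a window that ends exactly at the leap second; the function name and parameters are hypothetical, and real deployments differ in where they place the window and in the shape of the curve.

```python
def smear_offset(seconds_until_leap: float, window_s: float = 24 * 3600) -> float:
    """Portion of the leap second (in seconds) already applied.

    Assumes a linear smear over `window_s` seconds that finishes at the
    leap second itself. Returns a value between 0.0 and 1.0.
    """
    if seconds_until_leap >= window_s:
        return 0.0  # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0  # full second has been absorbed
    return 1.0 - seconds_until_leap / window_s


# Halfway through a 24-hour window, half a second has been absorbed.
print(smear_offset(12 * 3600))  # 0.5

# Adjustment per elapsed second, in microseconds, for two window lengths.
for hours in (24, 17):
    print(hours, round(1e6 / (hours * 3600), 2))
```

A shorter window means a larger adjustment each second: roughly 11.6 microseconds per second over 24 hours, against about 16.3 over 17 hours.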

"There are many different techniques for smearing - all of them are bad," Obleukhov said.

"You have to do it over many machines, and this introduces errors between machines,” he explained.

“Depending on how sensitive your systems are you might have a problem. And when you’re smearing, if appliances get rebooted or if something else goes wrong, then the chances of a fatal issue rise drastically.”

Smearing is the best option we have at the moment, "but you still may get negative time," Obleukhov said.

Meanwhile, public pools like NTP.org do not smear. “What you will end up doing if you join them is stepping, which is just dangerous,” he added.

This is not the only problem. After decades of leap seconds being added as the Earth's spin slowed, the planet began to accelerate in 2016, reaching its fastest spin since the change in August.

Why this is happening is not clear, but scientists have several theories.

Seismic activity such as the 2011 earthquake in Japan shifted the planet's axis by 6.7 inches, which sped up the rotation.

Another potential reason is known as the 'Chandler Wobble,' where the movement of the geographic north and south poles causes the planet to wobble, slowing it down - but in recent years it has wobbled less.

Finally, there is our own impact on the planet. Mountain ice caps have historically melted and refrozen, affecting our rotation like the arms of a spinning figure skater - with arms out, the skater spins more slowly; with arms in, faster.

Now, due to anthropogenic climate change, that great mass of ice has melted and is not returning, instead staying at a lower altitude.

Whatever the cause, we now face the first time our rotation has sped up since UTC began, potentially leading to a completely new challenge: The negative leap second.

Instead of adding an extra second, UTC could remove one.

This could also theoretically be smeared, but that introduces its own risks, most notably that the networked world has never tried this.

"These events have never happened, so that it is almost a certainty that there will be widespread errors in realizing the event, if it happens," the NIST and BIPM report states.

As it stands, if the current rate of change between UTC and UT1 continues, then a negative leap second is expected to be required by 2030.

Given all this, Meta - along with Google, Microsoft, and Amazon - suggest killing off the leap second entirely. They are joined in this recommendation by the NIST and BIPM, although the time-tracking bodies have a slightly different approach.

There are still those that wish to keep the status quo, arguing that scientists and astronomers observing celestial bodies rely on UTC. Were it to move out of sync with UT1, then legacy equipment would need to be adjusted, and there could be a period of inaccurate astronomical observations and celestial measurements as a UTC-based infrastructure has to be painstakingly shifted to UT1.

But Ahmad Byagowi, time appliance project lead at the Open Compute Project and research scientist at Meta, argues that ultimately they will benefit from such a move.

At the moment UTC and UT1 are already out of sync, he reasoned, as their times are only normalized when a whole second is added. Between those leap seconds "you have an error," he said.

"To those scientists that want to observe the sky, we're suggesting that they will always be able to go to a website that says 'this is the offset between UTC and UT1.' It's much more granular, you can go into milliseconds, and you can actually see things much much better. That's what we're proposing."

They want December 31, 2016, to mark the date of the last leap second, with computers no longer having to worry about interruptions to constant time.

NIST and BIPM aren't quite as aggressive, just yet. Researchers at the time institutes suggested that perhaps a temporary answer lies in increasing the maximum difference between UT1 and UTC. That could mean a leap minute, or even a leap hour.

The benefit would be that it would occur much less regularly, making risky events less common. But there's a danger to that, they admitted.

By making it a once-in-a-generation or more event, whole systems would be birthed and die between leap events. Knowledge and preparation could be lacking, making the time change all the more dangerous.

"Therefore, it will be necessary to place an increased emphasis on education and awareness ahead of such a step," they said.

Such a move would add a huge risk, but reduce the number of times we face that gamble. "We do not consider that there is a 'perfect' solution to the problem," the national bodies said.

"Defining a time scale that satisfies the needs of time and frequency users and is also in agreement with astronomical phenomena is not straightforward and a series of trade-offs are necessary. We consider that enlarging the tolerance [between UTC and UT1] is a wise provisional solution, which should be re-considered when new discoveries and deeper understanding could result in a better solution."

It is not clear whether either of these calls will lead to immediate action. The quest to kill the leap second faces an obstacle almost as inevitable as the passage of time: Inertia.

Time has no single ruler. The decision over the leap second will have to be agreed on by multiple governmental, research, and non-governmental bodies, with its detractors having to navigate complex politics and a natural unwillingness to change.

Such an effort kicks off this month, with a vote on the future of the leap second. The decision at the Consultative Committee for Time and Frequency could help decide when the next major outage hits.
