Outage earlier this morning

Kevin

Hi,

So, as some of you may have noticed, we had a global outage of our web-related services today from around 8:54 until around 10:58. Since then everything has stabilised again and works as it should.

There was simply a loss of network connectivity to the upstream POPs, caused by a bug in the software running on the fiber optic equipment.

Official statement:

Hello,
This morning we had an incident on the optical network that interconnects our Roubaix (RBX) site with 6 of the 33 points of presence (POPs) of our network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).

The RBX site is connected through 6 optical fibers to these 6 POPs: 2x RBX <> BRU, 2x RBX <> LDN, 2x RBX <> Paris (1x RBX <> TH2 and 1x RBX <> GSW). These 6 optical fibers are connected to optical node systems that allow 80 wavelengths of 100Gbps to be carried on each optical fiber.

For each 100G connected to the routers, we use 2 geographically distinct optical paths. In the event of a fiber cut (the infamous digger cut), the system reconfigures itself within 50ms and all links stay up. To connect RBX to the POPs we have 4.4Tbps of capacity, 44x 100G: 12x 100G to Paris, 8x 100G to London, 2x 100G to Brussels, 8x 100G to Amsterdam, 10x 100G to Frankfurt, 2x 100G to DC GRA and 2x 100G to DC SBG.
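A quick back-of-the-envelope check of those figures in Python (not part of the official statement, just an illustrative sketch; the per-destination link counts are taken from the paragraph above):

# Sanity check of the RBX uplink capacity quoted above: 44x 100G = 4.4 Tbps.
links_per_destination = {
    "Paris (TH2 + GSW)": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "DC GRA": 2,
    "DC SBG": 2,
}

total_links = sum(links_per_destination.values())    # 44 links of 100G each
total_capacity_tbps = total_links * 100 / 1000       # 4.4 Tbps

print(f"{total_links}x 100G = {total_capacity_tbps} Tbps")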

At 8:01, all 44x 100G links were lost at once. Given the redundancy we have in place, the root cause could not be a simultaneous physical cut of 6 optical fibers. We could not run remote diagnostics on the chassis because the management interfaces were frozen, so we had to intervene directly in the routing rooms to handle the chassis: disconnect the cables between the chassis, restart the system, and only then run the diagnostics together with the equipment manufacturer. Attempts to reboot the system took a long time because each chassis needs 10 to 12 minutes to boot. This is the main reason for the duration of the incident.

Diagnosis: all the transponder cards we use, ncs2k-400g-lk9 and ncs2k-200g-cklc, were in the "standby" state. One possible cause of such a state is a loss of configuration, so we retrieved the backup and restored the configuration, which allowed the system to reconfigure all the transponder cards. The 100G links in the routers came back on their own and the connection of RBX to the 6 POPs was restored at 10:34.

This is clearly a software bug in the optical equipment. The configuration database is saved 3 times and copied to 2 supervision cards. Despite all these safeguards, the database disappeared. We will work with the equipment manufacturer to find the source of the problem and help fix the bug. We are not questioning our trust in the equipment manufacturer, even if this type of bug is particularly critical. Uptime is a matter of design that takes every case into account, including when nothing else works. Paranoid mode at OVH has to be pushed even further in all of our designs.
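To make that safeguard a bit more concrete (one configuration database kept as several copies on separate supervision cards), here is a rough, purely hypothetical Python sketch of how missing or diverging copies could be flagged by comparing checksums. This is not part of the official statement, and the file paths and layout are invented for the illustration, not the vendor's actual scheme:

# Hypothetical illustration: flag a missing or diverging configuration copy
# by hashing each replica and comparing it against the majority value.
import hashlib
from collections import Counter
from pathlib import Path

# Invented example paths: three copies spread over two supervision cards.
replica_paths = [
    Path("/card0/config/db.bak1"),
    Path("/card0/config/db.bak2"),
    Path("/card1/config/db.bak1"),
]

def digest(path):
    """Return the SHA-256 of a replica, or None if it is missing/unreadable."""
    try:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    except OSError:
        return None

digests = {path: digest(path) for path in replica_paths}
valid = [d for d in digests.values() if d is not None]
majority, count = Counter(valid).most_common(1)[0] if valid else (None, 0)

for path, d in digests.items():
    if d is None:
        print(f"ALERT: replica missing or unreadable: {path}")
    elif d != majority:
        print(f"ALERT: replica diverges from the majority: {path}")

if count < 2:
    print("ALERT: no healthy majority of configuration copies left")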

Bugs can exist, but incidents that impact our customers must not. There is necessarily a mistake on OVH's side, since despite all the investments in the network, in the fibers and in the technologies, we have just had 2 hours of downtime on all of our infrastructure in Roubaix.

One of the solutions is to create 2 optical node systems instead of one. Two systems means two databases, so in case of a configuration loss only one system goes down. If 50% of the links pass through each system, today we would have lost 50% of the capacity but not 100% of the links. This is one of the projects we started a month ago; the chassis have been ordered and we will receive them in the coming days. We can start the configuration and migration work in 2 weeks. Given today's incident, this project is becoming a priority for all of our infrastructures, all DCs, all POPs.
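The failure-domain arithmetic behind that plan, assuming the 44 links really are split evenly across the two systems (again not part of the statement, just an illustrative sketch):

# Capacity left at RBX if one of two optical node systems loses its
# configuration, assuming the 44x 100G links are split 50/50 between them.
total_links = 44
systems = 2
links_per_system = total_links // systems          # 22 links per system

surviving_links = total_links - links_per_system   # one system down
surviving_capacity_tbps = surviving_links * 100 / 1000

print(f"One system down: {surviving_links}x 100G = {surviving_capacity_tbps} Tbps left")
print("Today's single-system design: 0 Tbps left")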

In the business of providing cloud infrastructures, only the paranoid last. Quality of service is the result of 2 elements: all the incidents anticipated "by design", and the incidents we have learned from through our mistakes. This incident leads us to raise the bar even higher to get closer to zero risk.

We are sincerely sorry for the 2 hours and 33 minutes of downtime on the RBX site. In the coming days, impacted customers will receive an email so that the SLA commitments can be triggered.

Regards
Octave


Some gameservers may still experience issues because their database connections were never re-established. I'll perform a global restart later this evening, as I'm applying some global updates as well.
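For anyone curious what that restart looks like, here's a minimal sketch of the kind of loop I'll run, assuming systemd-managed gameserver services; the service names are placeholders for our actual units:

# Minimal sketch of the planned global restart: cycle each gameserver
# service so it reopens its database connections after the outage.
# Service names are placeholders; assumes systemd-managed servers.
import subprocess
import time

GAMESERVER_SERVICES = [
    "gameserver@1.service",
    "gameserver@2.service",
    "gameserver@3.service",
]

for service in GAMESERVER_SERVICES:
    print(f"Restarting {service} ...")
    subprocess.run(["systemctl", "restart", service], check=False)
    time.sleep(30)  # stagger restarts so not everyone gets kicked at once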

There was no data loss, just a cut-off from the internet on our 4 dedicated hosts.

Just wanted to share this here as a bit of information.
 
