On 2020/07/12 at 17:52 UTC, Hologram’s internal network monitors detected an unplanned outage at one of our global packet gateway providers. As a result, cellular internet traffic was unroutable for affected Hologram customers and for other providers worldwide, and affected SIMs were unable to attach to towers or open new sessions during the outage. The upstream provider resolved the outage, the subsequent congestion, and the core router repairs as of 2020/07/14 10:00 UTC.
At approximately 2020/07/12 17:45 UTC, one of Hologram’s global gateway providers experienced a system-wide network incident that impacted all cellular services through that provider. Due to a time synchronization failure in network vendor hardware (possibly due to physical hardware clock failure), faulty synchronization values propagated over links to multiple data centers. These faulty synchronization values caused redundant core IP routers and links in multiple data centers to become unusable simultaneously. As a result, all data (i.e. GRX/IPX) paths and all signaling traffic (i.e. SIM connection/authentication) paths were disrupted for Hologram’s affected customers and for other providers worldwide.
Additionally, full recovery for 4G/LTE and 2G/3G devices was not immediate, even after initial fixes and, later, a resolution of the root cause were in place. The delay was caused by signaling traffic congestion: as SIMs from Hologram and other providers worldwide attempted to reconnect, some device connection/authentication requests timed out before they could succeed on the network.
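The reconnection congestion described above is the kind of load that device-side retry logic commonly tries to smooth out with exponential backoff and random jitter. The following is a minimal sketch of that general technique only. It is not the behavior of Hologram SIMs, the gateway provider, or any particular modem firmware, and the function names `backoff_delay` and `reconnect_schedule`, along with the 5-second base and 600-second cap, are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=5.0, cap=600.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt` (0-indexed).

    Hypothetical helper: the ceiling doubles with each attempt up to `cap`,
    and the actual wait is drawn uniformly between 0 and that ceiling
    ("full jitter"), so devices that failed together do not retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def reconnect_schedule(max_attempts, base=5.0, cap=600.0, seed=None):
    """Return the list of waits a single device would use across its retries.

    `seed` exists only to make the sketch reproducible; a real device would
    not share a seed with other devices.
    """
    rng = random.Random(seed)
    return [backoff_delay(n, base, cap, rng) for n in range(max_attempts)]


if __name__ == "__main__":
    for n, delay in enumerate(reconnect_schedule(8, seed=42)):
        print(f"attempt {n}: wait up to {min(600.0, 5.0 * 2 ** n):.0f}s, chose {delay:.1f}s")
```

Spreading retries over a growing, randomized window trades a slower reconnect for any single device against a lower peak in signaling requests across the whole fleet.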
Automated monitoring systems at both Hologram and the upstream provider reported network service health failures at and around 2020/07/12 17:52 UTC. These reports of service disruption were escalated to Hologram’s on-call Engineering, CTO, and Connectivity Product Lead. Hologram escalated the issue to the affected global gateway provider for priority investigation. The hardware vendor’s 24/7 response team was engaged. Because the faulty synchronization values had bricked the vendor’s hardware, manual resolution required a physical visit to each data center location, in four European cities across multiple countries.
By 2020/07/13 04:57 UTC, 2G/3G services were restored and 4G/LTE services were partially restored (already restored in the U.S.). At 06:53 UTC, 2G/3G services began to stabilize but 4G/LTE services continued to experience congestion, and the global gateway provider submitted a request to increase signaling link capacity with their upstream GRX/IPX and signaling providers. At 08:20 UTC, this capacity increase was completed and the global gateway provider reported an improvement in signaling traffic. At 15:44 UTC, all redundant data centers and links were confirmed to be online with normal redundancy in place, and devices were continuing to reconnect.
From approximately 2020/07/13 07:00 UTC onward, and into the morning of 2020/07/14, connection/authentication success rates on both 2G/3G and 4G/LTE continued to climb, with occasional spikes in signaling congestion when large numbers of devices attempted reconnection. As of 2020/07/14 10:00 UTC, the global gateway provider reported monitoring near-100% success rates for connection/authentication (within the normal baseline rate) as well as normal signaling and data load on the network throughout the night, marking the incident as fully resolved internally (while continuing to monitor). Hologram Engineering and Customer Success continued to monitor customer devices as they reconnected and then marked the incident as resolved after no further congestion or network issues were observed.
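The recovery criterion described above, success rates returning to the normal baseline and staying there, can be expressed as a simple check over successive monitoring windows. This is a rough sketch under stated assumptions and not the monitoring used by Hologram or the gateway provider: the function names, the 0.99 baseline, and the requirement of three consecutive healthy windows are all hypothetical values chosen for illustration.

```python
def success_rate(succeeded, attempted):
    """Return the fraction of connection/authentication attempts that succeeded.

    Returns None for a window with no attempts, since no rate can be computed.
    """
    if attempted == 0:
        return None
    return succeeded / attempted


def is_recovered(windows, baseline=0.99, consecutive=3):
    """Return True if the most recent `consecutive` windows all meet `baseline`.

    `windows` is a list of (succeeded, attempted) tuples, oldest first.
    The baseline and window count are assumed values, not figures from the incident.
    """
    streak = 0
    for succeeded, attempted in windows:
        rate = success_rate(succeeded, attempted)
        if rate is not None and rate >= baseline:
            streak += 1
        else:
            # A congestion spike or an empty window resets the streak.
            streak = 0
    return streak >= consecutive
```

Requiring several consecutive healthy windows, rather than a single one, keeps an occasional spike in signaling congestion from being mistaken for either recovery or relapse.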
Although initial root cause analysis suggests that this was a one-off event that the hardware vendor had not seen before and that is unlikely to recur, the global gateway provider and their hardware vendor are implementing a plan to mitigate the possibility of this type of failure in the future and are communicating this plan to Hologram. Although our existing escalation, communication, and coordination processes did not negatively affect time to recovery in this case, Hologram and this provider are also working together on process improvements that could help in an incident of similar scale.
Hologram will update this postmortem as needed following completion of root cause analysis and mitigation planning.