Cellular Connection Issues
Incident Report for Hologram
Postmortem

Summary

On 2020/07/12 at 17:52 UTC, Hologram's internal network monitors detected an unplanned outage at one of our global packet gateway providers. As a result, cellular internet traffic was unroutable for affected Hologram customers and for other providers worldwide, and affected SIMs were unable to attach to towers or open new sessions during that time. The upstream provider resolved the outage and the subsequent congestion, and completed core router repairs, as of 2020/07/14 10:00 UTC.

Cause of Failure

At approximately 2020/07/12 17:45 UTC, one of Hologram’s global gateway providers experienced a system-wide network incident that impacted all cellular services through that provider. Due to a time synchronization failure in network vendor hardware (possibly due to physical hardware clock failure), faulty synchronization values propagated over links to multiple data centers. These faulty synchronization values caused redundant core IP routers and links in multiple data centers to become unusable simultaneously. As a result, all data (i.e. GRX/IPX) paths and all signaling traffic (i.e. SIM connection/authentication) paths were disrupted for Hologram’s affected customers and for other providers worldwide.

Additionally, after initial fixes and, later, a resolution of the root cause were in place, full recovery for 4G/LTE and 2G/3G devices was not immediate. The delay was caused by signaling traffic congestion as SIMs from Hologram and other providers worldwide attempted to reconnect at the same time, so some device connection/authentication requests timed out before they could succeed on the network.
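The reconnection congestion described above is the kind of load that device-side retry logic can soften by spreading attempts out over time. As a minimal sketch (not Hologram's or the gateway provider's implementation), the following Python shows capped exponential backoff with full jitter. The function names, the 5-second base delay, and the 900-second cap are hypothetical values chosen only for illustration.

```python
import random


def backoff_delay(attempt, base=5.0, cap=900.0, rng=random):
    """Return a randomized wait (seconds) before reconnect attempt `attempt`.

    Hypothetical parameters: `base` is the ceiling for the first retry and
    `cap` is the largest ceiling allowed. The ceiling doubles on each attempt
    up to `cap`, and the delay is drawn uniformly from [0, ceiling]
    ("full jitter").
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)


def reconnect_schedule(max_attempts, base=5.0, cap=900.0, seed=None):
    """List the delays one device would wait across successive retries."""
    rng = random.Random(seed)
    return [backoff_delay(n, base, cap, rng) for n in range(max_attempts)]


if __name__ == "__main__":
    # Print an example schedule for a single (hypothetical) device.
    for n, delay in enumerate(reconnect_schedule(8, seed=1)):
        print(f"attempt {n}: wait {delay:.1f}s")
```

Because each device draws its delay independently, retries from a large fleet spread across the window instead of arriving together, at the cost of slower recovery for any individual device.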

Resolution and Recovery

Automated monitoring systems at Hologram and at the upstream provider both reported network service health failures at around 2020/07/12 17:52 UTC. These reports of service disruption were escalated to Hologram's on-call Engineering team, CTO, and Connectivity Product Lead. Hologram escalated the issue to the affected global gateway provider for priority investigation. The hardware vendor’s 24/7 response team was engaged. Because the faulty synchronization values had bricked the vendor’s hardware, manual resolution required a physical visit to each data center location, in four European cities across multiple countries.

By 2020/07/13 04:57 UTC, 2G/3G services were restored and 4G/LTE services were partially restored (already restored in the U.S.). At 06:53 UTC, 2G/3G services began to stabilize but 4G/LTE services continued to experience congestion, and the global gateway provider submitted a request to increase signaling link capacity with their upstream GRX/IPX and signaling providers. At 08:20 UTC, this capacity increase was completed and the global gateway provider reported an improvement in signaling traffic. At 15:44 UTC, all redundant data centers and links were confirmed to be online with normal redundancy in place, and devices were continuing to reconnect.

From approximately 2020/07/13 07:00 UTC onward, and into the morning of 2020/07/14, connection/authentication success rates on both 2G/3G and 4G/LTE continued to climb, with occasional spikes in signaling congestion when large numbers of devices attempted reconnection. As of 2020/07/14 10:00 UTC, the global gateway provider reported monitoring near-100% success rates for connection/authentication (within the normal baseline rate) as well as normal signaling and data load on the network throughout the night, marking the incident as fully resolved internally (while continuing to monitor). Hologram Engineering and Customer Success continued to monitor customer devices as they reconnected and then marked the incident as resolved after no further congestion or network issues were observed.

Conclusion

Although initial root cause analysis suggests the root cause is a one-off event not previously seen by the hardware vendor and unlikely to recur, the global gateway provider and their hardware vendor are implementing a plan to mitigate the possibility of this type of failure in the future and are communicating this plan to Hologram. Although our existing escalation and communication/coordination processes did not have any negative impact on time to recovery in this case, Hologram and this provider are also working together on process improvements that could help in a similar-scale incident.

Hologram will update this postmortem as needed following completion of root cause analysis and mitigation planning.

Posted Jul 15, 2020 - 02:49 UTC

Resolved
Cellular network signaling load has remained at pre-incident levels since early this morning, with connection/authentication success rates also at near-100%. All links and systems for the global gateway provider are functioning at pre-incident levels, and the global gateway provider is continuing to monitor. Hologram Engineering and Customer Success are continuing to monitor.

Some customers may experience residual effects while waiting for their devices' modules or firmware to reconnect. A majority of such devices have reconnected. A post-mortem will be posted by end of day (and will be updated as necessary once root cause analysis is complete).
Posted Jul 14, 2020 - 20:19 UTC
Update
Hologram Engineering and Customer Success are monitoring 2G/3G connection/authentication delays due to an increase in reconnection load on those radio access technologies (RATs). Our affected global gateway partner is continuing work to increase signaling capacity for 2G/3G connection/authentication.
Posted Jul 13, 2020 - 22:21 UTC
Update
All core nodes and redundant links have been restored to normal service levels, including both London and Paris data centers, and devices are continuing to reconnect (with higher than normal system load). A unique system failure in vendor hardware triggered a state issue that propagated across redundant data centers and affected redundant links. Root cause analysis is still under way and a post-mortem will follow with additional information.

Customer devices may still see intermittent cellular network connection/authentication issues as device modules attempt to reconnect following the initial outage, but this is now improving. Because of excessive signaling load during restoration efforts, which occurred as waiting devices attempted to reconnect simultaneously, GRX/IPX link capacity remains elevated to mitigate cellular network connection/authentication delays.

Users of Hologram's "Indigo" connectivity offering remain unaffected. Hologram Engineering and Customer Success teams are continuing to monitor as service returns to standard levels.
Posted Jul 13, 2020 - 17:34 UTC
Update
Hardware links have been successfully restored by an upstream cellular gateway provider. Customer devices may see intermittent cellular network connection/authentication issues as device modules attempt to reconnect following the initial outage. Because of excessive signaling load during restoration efforts, which occurred as waiting devices attempted to reconnect simultaneously, GRX/IPX link capacity remains elevated to mitigate cellular network connection/authentication delays.

Users of Hologram’s “Indigo” connectivity offering remain unaffected.
Posted Jul 13, 2020 - 16:29 UTC
Update
Affected devices have continued to reconnect following the restoration of network services at our global gateway partner. Due to excessive signaling load during restoration efforts, which occurred as waiting devices attempted to reconnect simultaneously, our global gateway partner also increased GRX/IPX link capacity to mitigate cellular network connection/authentication delays. Some devices reconnecting during the period of excessive signaling load timed out while attempting a connection/authentication to the cellular network, and some of those devices have reconnected.

Users of Hologram's "Indigo" connectivity offering remain unaffected, and all other services are functioning normally. Hologram Engineering and Customer Success teams are continuing to monitor as service returns to standard levels.
Posted Jul 13, 2020 - 14:29 UTC
Monitoring
Network services are beginning to come back online, including for Hologram devices. Hologram Engineering and Customer Success teams are continuing to monitor as service returns toward standard levels.
Posted Jul 12, 2020 - 23:58 UTC
Update
Network partner engineers are beginning to reroute traffic following the systemwide IP failures that caused the disruption of service. We will continue to update as we receive more information on progress toward full service restoration.
Posted Jul 12, 2020 - 22:56 UTC
Update
Hologram is continuing to monitor as our partner works to restore service. We are awaiting an estimated time to resolution and will update when it is available.
Posted Jul 12, 2020 - 20:19 UTC
Update
The issue has been confirmed to be wider in scope, also affecting the gateway provider's non-Hologram service subscribers.
Posted Jul 12, 2020 - 19:08 UTC
Identified
An issue has been identified with a cellular gateway provider affecting all network connectivity routed out of their global point of presence. Hologram has escalated the issue and is actively monitoring for service recovery. We will further update here. Hologram customers using Indigo SIM connectivity are unaffected.
Posted Jul 12, 2020 - 18:47 UTC
Investigating
We are investigating possible issues connecting to the cellular network for some customers.
Posted Jul 12, 2020 - 18:27 UTC
This incident affected: Cellular Networking (Global Cellular Data Network).