Partial Outage - G-1 SIMs/profiles
Incident Report for Hologram
Postmortem

On July 29, 2024, Hologram experienced an outage with our G-1 connectivity partner that prevented some devices using G-1 SIMs/profiles from establishing new cellular data sessions. Devices with sessions that were already open and that did not close during the incident were unaffected. During recovery, a shorter second period of unavailability affected new connection attempts, followed by degraded performance for establishing new connections.

Incident Summary and Impact

Shortly after 23:00 UTC on July 29, 2024, one of Hologram's connectivity partners began experiencing degraded performance, followed by an outage affecting new connections. The incident impacted some devices that use the G-1 SIM/profile: attempts to create new data sessions failed at times, and later attempts succeeded only with degraded performance. Because the outage was isolated to that single connectivity partner, customers using other global profiles (G-2, G-3) or non-global native profiles were unaffected. Many customers, however, rely on the affected partner and had devices that could not open new cellular data sessions during the incident; customers with many devices attempting to connect during the incident were most affected. Devices with data sessions that were already open before the incident and remained open throughout continued to function normally.

Multiple actions were taken to restore the ability to make new connections and to address subsequent network congestion. The two most impactful were: (1) the ability to establish new connections was initially restored by manually rerouting connection requests to a secondary (backup) network authorization resource at 00:30 UTC on July 30, and (2) the subsequent resource exhaustion on that secondary resource was resolved at 04:45 UTC on July 30, when connection requests were rerouted back to the restored primary resource. From 04:45 UTC, the authorization service processed new connection requests successfully, and most of Hologram's affected customers saw a steady recovery. By 06:00 UTC, both primary and secondary network authorization resources were processing new connection requests normally.

Some devices reconnected immediately during the first or second stage of recovery, others reconnected more slowly due to congestion, and others did not attempt to reconnect until device-specific retry timers expired. Complete unavailability for new connections totaled approximately 115 minutes during this incident, comprising an 85-minute period and a 30-minute period. At the end of the second unavailability period, congestion from a backlog of requests caused degraded performance for new connections; this congestion was mostly resolved within 1 hour. With performance no longer degraded and device retry timers progressively expiring, the volume of new connection requests returned to pre-incident levels within approximately 6.5 hours.
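
The retry behavior described above is common in cellular devices. As a sketch only (the actual retry logic is device- and vendor-specific; attempt_connection, the timer values, and the attempt limit below are hypothetical), a typical pattern is exponential backoff with jitter, which spreads reconnection attempts out in time and reduces the kind of post-recovery congestion a backlog of simultaneous retries can cause:

    import random
    import time

    def reconnect_with_backoff(attempt_connection, base_s=5.0, cap_s=900.0, max_attempts=10):
        # Retry with exponential backoff and "full jitter": each wait is a
        # random duration up to an exponentially growing cap. Randomizing
        # the waits prevents a thundering herd of devices all retrying at
        # the same instant when service is restored.
        for attempt in range(max_attempts):
            if attempt_connection():
                return True
            delay = random.uniform(0, min(cap_s, base_s * (2 ** attempt)))
            time.sleep(delay)
        return False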

Root Cause and Technical Description

To implement network authorization and metering of usage, the G-1 connectivity partner operates redundant authorization services in different data centers. Redundancy for these authorization services is implemented in an active-standby configuration, in part to reserve sufficient capacity on the secondary systems to absorb the primary systems' full load immediately after failover. In the event of a failure of the primary authorization resources, the secondary resources are intended to begin processing new connection attempts after only a minimal failover period.
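
As a rough sketch of that arrangement (class and attribute names here are illustrative, not the partner's implementation), an active-standby router sends all traffic to the primary while it is considered healthy, keeping the standby idle so its full capacity is available the moment failover occurs:

    from dataclasses import dataclass

    @dataclass
    class AuthResource:
        name: str
        healthy: bool = True

    class ActiveStandbyRouter:
        def __init__(self, primary, standby):
            self.primary = primary
            self.standby = standby

        def route(self):
            # All requests go to the primary while it is healthy; the
            # standby stays idle so it retains full capacity headroom.
            if self.primary.healthy:
                return self.primary
            # Failover path: the standby must be provisioned to absorb
            # 100% of the primary's load, or it too will be exhausted.
            return self.standby

This pattern depends on two things holding true: the health signal must actually flip when the primary degrades, and the standby must be sized for the full load. As described below, both assumptions broke during this incident.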

During this incident, the primary service began to exhaust some of its system resources, and alarms were triggered as resources approached exhaustion. While the connectivity partner was investigating the alarms, further resource exhaustion caused new connection attempts to fail and the success rate for processing new connection requests to drop. Unfortunately, as implemented, the authorization service's redundancy failed to handle this failure mode for two reasons: (1) the primary resources did not fail completely and were still responding to some requests, which resulted in a longer-than-typical failover period, and (2) even after manual failover, the secondary resources were insufficient and were exhausted shortly afterward, just as the primary resources had been.

Sharing load across the primary and secondary resources once both were restored increased the speed of recovery, but some devices experienced degraded performance while attempting to establish new data connections following service restoration, and others waited some time before attempting to reconnect, according to device-specific retry timers. Root cause analysis identified two root causes of this incident: (1) insufficient system resource management by the G-1 connectivity partner, and (2) insufficient automated failover criteria for the G-1 connectivity partner's network authorization resources.

Timeline

July 29, 2024

  • 23:00 UTC: G-1 connectivity partner observed primary authorization resources nearing exhaustion (alarm)
  • 23:05 UTC: G-1 connectivity partner observed primary authorization processing success rate of <100%
  • 23:05 UTC: Hologram observed the number of new sessions reported on G-1 SIMs/profiles beginning to decrease, while some new sessions succeeded after delays or multiple attempts, indicating possible degraded performance in session reporting or in the new-connection success rate

July 30, 2024

  • Around 00:00 UTC: Hologram observed a further decrease in new sessions reported on G-1 SIMs/profiles, with successful connections reported at a significantly lower rate than typical
  • 00:10 UTC: G-1 connectivity partner disabled lower-priority services to try to mitigate
  • Around 00:15 UTC: Hologram confirmed complete unavailability of new connections on G-1 SIMs/profiles
  • Around 00:30 UTC: G-1 connectivity partner confirmed complete unavailability of new connections
  • 00:30 UTC: G-1 connectivity partner manually rerouted new connection attempts to secondary resources
  • 00:30 UTC: G-1 connectivity partner observed a secondary authorization processing success rate of 100%; Hologram new connection requests began succeeding and the first recovery began
  • 02:35 UTC: G-1 connectivity partner re-enabled lower-priority services
  • 04:00 UTC: G-1 connectivity partner restored primary authorization resources
  • 04:10 UTC: G-1 connectivity partner observed secondary authorization resources nearing exhaustion (alarm), reproducing the trigger that had occurred earlier on the primary authorization resources
  • 04:15 UTC: G-1 connectivity partner observed secondary authorization processing success rate of <100%
  • 04:20 UTC: G-1 connectivity partner disabled lower-priority services to try to mitigate
  • 04:45 UTC: G-1 connectivity partner rerouted new connection attempts to primary resources
  • 04:45 UTC: G-1 connectivity partner observed a primary authorization processing success rate of 100%; Hologram new connection requests began succeeding and the second recovery began
  • Around 05:45 UTC: G-1 connectivity partner observed significant drop in congestion (which had been caused by backlog of new connection requests)
  • 06:00 UTC: G-1 connectivity partner restored secondary authorization resources
  • 07:00 UTC: G-1 connectivity partner re-enabled lower-priority services
  • 10:45 UTC: G-1 connectivity partner observed signaling (including new connection requests) at pre-incident levels

Remediation and Next Steps

This incident had a widespread impact, and we take reliability very seriously. While Hologram and its partners implement systems and safeguards to maximize reliability, we have identified several areas of improvement and will continue to work on uncovering any other gaps to prevent a recurrence.

Insufficient system resource management: Root cause analysis revealed that the G-1 connectivity partner did not budget sufficient headroom and did not alarm soon enough to prevent the rapid exhaustion of resources on their authorization service. The G-1 connectivity partner is addressing system resource management and alarm configuration to prevent this from recurring.
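
One common approach to this problem, offered here as a sketch under assumed thresholds rather than the partner's actual design, is to alarm on projected time-to-exhaustion as well as absolute utilization, so that operators are paged while there is still room to act:

    def resource_alarm(utilization, prev_utilization, interval_s,
                       warn_at=0.70, page_at=0.85):
        # Alarm on both the utilization level and its rate of growth.
        # Thresholds are illustrative. Alarming only near 100% leaves
        # little time to act when usage is climbing quickly.
        growth_per_s = (utilization - prev_utilization) / interval_s
        if growth_per_s > 0:
            seconds_to_full = (1.0 - utilization) / growth_per_s
            if seconds_to_full < 1800:  # projected exhaustion within 30 minutes
                return "page"
        if utilization >= page_at:
            return "page"
        if utilization >= warn_at:
            return "warn"
        return None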

Insufficient automated failover criteria: Root cause analysis also revealed that the G-1 connectivity partner's automated failover checks did not positively identify the emerging unhealthiness of the primary authorization resources, because the service endpoint remained partially available. The G-1 connectivity partner is investigating a more holistic health check for initiating failover to secondary resources.
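
Because the degraded primary still answered some requests, a liveness-only check ("does the endpoint respond?") kept reporting it healthy. A more holistic check along the lines being investigated might combine endpoint reachability with the observed success rate over a sliding window; the following sketch uses hypothetical class names and thresholds:

    from collections import deque

    class HolisticHealthCheck:
        def __init__(self, window=200, min_success_rate=0.95):
            # Ring buffer of recent request outcomes (True = succeeded).
            self.results = deque(maxlen=window)
            self.min_success_rate = min_success_rate

        def record(self, succeeded):
            self.results.append(succeeded)

        def healthy(self, endpoint_reachable):
            # A resource that answers pings but fails a large share of
            # real requests is treated as unhealthy, triggering failover.
            if not endpoint_reachable:
                return False
            if not self.results:
                return True  # no data yet; assume healthy
            return sum(self.results) / len(self.results) >= self.min_success_rate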

While the incident is resolved on our status page and new cellular data connections are succeeding, we are continuing our efforts to maximize reliability and address these areas of improvement.

Conclusion

Our aim at Hologram is to provide the most reliable cellular connectivity product available, and we clearly fell short of our customers' expectations with this incident affecting new connection requests on G-1 SIMs/profiles. Although the root cause was with an upstream partner, we are ultimately accountable and are deeply sorry for the disruption to our customers. We have started working on the changes outlined above and will continue our diligence to prevent this from recurring.

Customers using other global profiles (G-2, G-3) and non-global native profiles were not affected. Additionally, users of Hologram Hyper+ SIMs benefit from multi-core technology that provides outage protection and guaranteed uptime SLAs. Customers should reach out if interested in exploring these solutions.

Posted Aug 05, 2024 - 23:20 UTC

Resolved
Signaling congestion and connectivity are now back to pre-incident levels.
Posted Jul 30, 2024 - 13:06 UTC
Identified
The issue has been identified and traffic has been rerouted.

Connectivity is still impacted by congestion but as this eases the devices should naturally come back online.
Posted Jul 30, 2024 - 02:27 UTC
Investigating
We're currently experiencing a partial outage on our cellular network specific to our G-1 profiles (SIMs/profiles with ICCID prefix 89445*).
Posted Jul 30, 2024 - 00:38 UTC
This incident affected: Cellular Networking (Global Cellular Data Network, SMS over cellular - Device Terminated, SMS over cellular - Device Originated).