On July 29, 2024, Hologram experienced an outage with our G-1 connectivity partner that prevented some devices using G-1 SIMs/profiles from establishing new cellular data sessions. Devices with data sessions that were already open, and that stayed open for the duration of the incident, were unaffected. During recovery, a shorter second period of unavailability affected new connection attempts, followed by a period of degraded performance for establishing new connections.
Shortly after 23:00 UTC on July 29, 2024, one of Hologram's connectivity partners began experiencing degraded performance, followed by an outage affecting new connections. The incident impacted some devices using the G-1 SIM/profile: attempts to open new data sessions failed during the outage windows, and later attempts experienced degraded performance. Because the outage was isolated to this single connectivity partner, customers using other global profiles (G-2, G-3) or non-global native profiles were unaffected. Unfortunately, many customers rely on the affected partner and had devices that could not open new cellular data sessions during the incident; customers with many devices attempting to connect during this time were most affected. Devices with data sessions that were already open before the incident, and that remained open throughout, continued to function normally.
Multiple actions were taken to restore the ability to make new connections and to address the subsequent network congestion. The two most impactful were: (1) at 00:30 UTC on July 30, the ability to establish new connections was initially restored by manually rerouting connection requests to a secondary (backup) network authorization resource, and (2) at 04:45 UTC on July 30, the resulting resource exhaustion on the secondary authorization resource was resolved when the primary resource was restored. From 04:45 UTC onward, the authorization service was able to process new connection requests successfully, and most of Hologram's affected customers saw a steady recovery. By 06:00 UTC, both the primary and secondary network authorization resources were processing new connection requests normally.
Some devices reconnected immediately during the first or second stage of recovery, others reconnected more slowly due to congestion, and the remainder reconnected only after device-specific retry timers expired. Complete unavailability for new connections totaled approximately 115 minutes during this incident, comprising an 85-minute and a 30-minute period of unavailability. At the end of the second unavailability period, congestion from the backlog of queued requests caused degraded performance for new connections; this congestion was mostly resolved within 1 hour. With performance no longer degraded and device retry timers progressively expiring, the volume of new connection requests returned to pre-incident levels within approximately 6.5 hours.
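Device retry behavior varies by modem and firmware and is outside Hologram's control, but the staggered return to pre-incident request volumes is easier to picture with a small example. The Python sketch below is illustrative only; the function names and delay values are hypothetical, not any specific device's implementation. It shows an exponential backoff with jitter, which spreads reconnection attempts out over time rather than retrying in lockstep.

    import random
    import time

    def reconnect_with_backoff(try_connect, base_delay_s=30,
                               max_delay_s=3600, max_attempts=20):
        """Retry a connection attempt with exponential backoff and jitter.

        Illustrative only: real cellular modems use their own vendor- and
        firmware-specific retry timers, which Hologram does not control.
        """
        for attempt in range(max_attempts):
            if try_connect():
                return True
            # Exponential backoff, capped, with random jitter so a fleet of
            # devices does not retry in lockstep and re-congest the network.
            delay = min(base_delay_s * (2 ** attempt), max_delay_s)
            time.sleep(delay * random.uniform(0.5, 1.5))
        return False

With jittered backoff like this, devices that lost connectivity at the same moment return gradually rather than all at once, which is consistent with the gradual, multi-hour return to pre-incident request volumes described above.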
In order to implement network authorization and metering of usage, the G-1 connectivity partner operates redundant authorization services in different data centers. Redundancy for these authorization services is implemented in an active-standby configuration, in part so that the secondary systems reserve sufficient resources to absorb the primary systems' full load immediately after failover. In the event of a failure of the primary authorization resources, the secondary resources are intended to begin processing new connection attempts after only a minimal failover period.
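For illustration, the following Python sketch shows the general shape of an active-standby failover loop. This is not the partner's actual implementation; the health_probe and route_traffic_to functions are hypothetical placeholders for the partner's health checks and traffic routing.

    import time

    # Minimal sketch of an active-standby failover loop, for illustration
    # only. The G-1 partner's actual implementation is not known to us; the
    # names below (health_probe, route_traffic_to) are hypothetical.

    def failover_loop(primary, standby, health_probe, route_traffic_to,
                      check_interval_s=5):
        """Send new connection requests to the primary while it is healthy,
        and fail over to the standby when the health probe says it is not."""
        active = primary
        route_traffic_to(active)
        while True:
            primary_ok = health_probe(primary)
            if active is primary and not primary_ok:
                active = standby            # fail over to the standby resource
                route_traffic_to(active)
            elif active is standby and primary_ok:
                active = primary            # fail back once the primary recovers
                route_traffic_to(active)
            time.sleep(check_interval_s)

In this pattern, the standby stays idle and reserves its full capacity until the health probe marks the primary unhealthy, which is why the quality of that probe determines how quickly failover happens.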
During this incident, the primary service began to exhaust some of its system resources, and alarms were triggered as those resources approached exhaustion. While the connectivity partner was investigating the alarms, further resource exhaustion caused new connection attempts to fail and the success rate for processing new connection requests to drop. Unfortunately, as implemented, the authorization service's redundancy did not properly handle this failure mode for two reasons: (1) the primary resources did not fail completely and were still responding to some requests, which resulted in a longer-than-typical failover period, and (2) even after manual failover, the secondary resources were insufficient and were exhausted shortly afterward, just as the primary resources had been.
Sharing load across the primary and secondary resources, once both were restored, sped up recovery, but some devices experienced degraded performance while attempting to establish new data connections following service restoration, and others waited some time before attempting to reconnect, according to device-specific retry timers. Root cause analysis identified two root causes of this incident: (1) insufficient system resource management by the G-1 connectivity partner, and (2) insufficient automated failover criteria for the G-1 connectivity partner's network authorization resources.
This incident had a widespread impact, and we take reliability very seriously. While Hologram and its partners implement systems and safeguards to maximize reliability, we have identified several areas for improvement and will continue working to uncover any other gaps in order to prevent a recurrence.
Insufficient system resource management: Root cause analysis revealed that the G-1 connectivity partner did not budget sufficient resource headroom and did not alarm early enough to prevent the rapid exhaustion of resources on their authorization service. The G-1 connectivity partner is addressing system resource management and alarm configuration to prevent this from recurring.
Insufficient automated failover criteria: Root cause analysis also revealed that the G-1 connectivity partner's automated failover checks did not detect the deteriorating health of the primary authorization resources, because the service endpoint remained available, albeit in a degraded state. The G-1 connectivity partner is investigating a more holistic health check to use when deciding whether to fail over to secondary resources.
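To illustrate the difference between the two kinds of checks (again, not the partner's actual implementation; the metric names and thresholds below are hypothetical), the Python sketch below contrasts a simple reachability check with a more holistic one that also considers error rate, latency, and resource headroom.

    from dataclasses import dataclass

    # Illustrative sketch only: the metric names and thresholds below are
    # hypothetical, not the G-1 partner's actual health-check implementation.

    @dataclass
    class AuthMetrics:
        error_rate: float              # fraction of failed authorization requests
        p99_latency_ms: float          # 99th-percentile response time
        free_resource_fraction: float  # remaining system resource headroom

    def endpoint_reachable(last_response_ok: bool) -> bool:
        """Naive check: the endpoint answered a probe. A degraded primary
        that still answers some requests passes this check, so automatic
        failover never fires."""
        return last_response_ok

    def holistic_health_check(m: AuthMetrics,
                              max_error_rate: float = 0.05,
                              max_p99_latency_ms: float = 500.0,
                              min_headroom: float = 0.20) -> bool:
        """Holistic check: treat the service as unhealthy when success rate,
        latency, or resource headroom degrades, even if the endpoint still
        responds."""
        return (m.error_rate <= max_error_rate
                and m.p99_latency_ms <= max_p99_latency_ms
                and m.free_resource_fraction >= min_headroom)

With a check along the lines of holistic_health_check, a primary that still answers some requests but has lost most of its resource headroom would be marked unhealthy, allowing failover to begin automatically rather than requiring manual intervention.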
While the incident is resolved on our status page and new cellular data connections are succeeding, we are continuing our efforts to maximize reliability and to address these areas for improvement.
Our aim at Hologram is to provide the most reliable cellular connectivity product available, and with this incident affecting new connection requests on G-1 SIMs/profiles we clearly fell short of our customers' expectations. Although the root cause lay with an upstream partner, we are ultimately accountable, and we are deeply sorry for the disruption to our customers. We have started working on the changes outlined above and will continue our diligence to prevent this from happening again.
Customers using other global profiles (G-2, G-3) and non-global native profiles were not affected. Additionally, users of Hologram Hyper+ SIMs benefit from multi-core technology that provides outage protection and guaranteed uptime SLAs. Customers should reach out if interested in exploring these solutions.