On August 28, 2023, Hologram experienced an outage with one of our connectivity partners. This incident impacted all customers using SIMs/profiles with ICCIDs beginning with the 8944* prefix. During the incident, devices using these SIMs/profiles experienced a cellular data outage and degraded cellular data connectivity.
Shortly after 19:00 UTC on August 28, 2023, one of Hologram's connectivity partners began experiencing degraded performance, followed by an outage. This incident impacted all customers using SIMs/profiles with ICCIDs beginning with the 8944* prefix. During the incident, devices using these SIMs/profiles were unable to establish cellular data sessions at times, and experienced degraded cellular data connectivity at other times. As the outage was isolated to that single connectivity partner, customers using other profiles were unaffected. Unfortunately, many customers rely upon the affected partner and had devices whose cellular data connectivity was disrupted.
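As an illustration of the scope described above, the following is a minimal sketch of how a fleet's ICCIDs could be checked against the affected prefix. Only the `8944` prefix comes from this report; the function names and the sample ICCIDs are hypothetical and are not part of any Hologram tooling.

```python
# Prefix named in this report; everything else here is illustrative.
AFFECTED_PREFIX = "8944"


def is_affected(iccid: str) -> bool:
    """Return True if the ICCID begins with the affected prefix."""
    # Tolerate surrounding whitespace and spaces inside the number.
    digits = iccid.strip().replace(" ", "")
    return digits.startswith(AFFECTED_PREFIX)


def split_fleet(iccids):
    """Partition ICCIDs into (affected, unaffected) lists, preserving order."""
    affected = [i for i in iccids if is_affected(i)]
    unaffected = [i for i in iccids if not is_affected(i)]
    return affected, unaffected


if __name__ == "__main__":
    # Made-up ICCIDs, used only to exercise the functions.
    fleet = ["8944500000000000001", "8901260000000000002"]
    affected, unaffected = split_fleet(fleet)
    print("affected:", affected)
    print("unaffected:", unaffected)
```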
Multiple actions were attempted to restore connectivity and address subsequent network congestion. The most impactful action occurred around 05:00 UTC on August 30, 2023, when traffic was rerouted around an identified fault, after which Hologram's affected customers saw a steady recovery in service. Some devices reconnected right away, others reconnected more slowly due to congestion, and others waited to reconnect until device-specific retry timers reset. By 00:23 UTC on August 31, 2023, 95% of impacted devices had reconnected, and 99% had reconnected by September 4, 2023. A subset of devices experienced shorter-than-expected data sessions after the outage.
The root cause was identified as an erroneous configuration change introduced by our upstream partner's upstream interconnect provider, which occurred during planned but unannounced work in one of the interconnect provider's data centers. The interconnect provider is a critical element that allows traffic to flow between multiple operators' networks, and is also a critical element in implementing routing redundancy to and from operators. Unfortunately, this configuration change resulted in abnormal routing across redundant components that undermined the system reliability and safeguards that are in place.
The root problem did not cause any of the component services to fail. Because of this, the location of the fault was not immediately apparent from the end-to-end tests that both Hologram and our upstream partner were running. Because individual components were functioning and health checks were (correctly) reporting that nodes across the networks were healthy, it took substantial time to identify and fix the root cause. After restarting several hot standby nodes in their network and temporarily rerouting traffic, our upstream partner discovered that traffic routed via a specific set of data centers operated by their interconnect provider worked as expected. This allowed them to identify the data center that was failing silently and promptly remove it from their network as a workaround. This discovery led to locating the specific configuration that the interconnect provider had applied erroneously during planned works at that data center.
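The isolation step described above, comparing how traffic fares depending on which data center it is routed through, can be sketched as follows. This is an illustrative example under stated assumptions, not the tooling used by Hologram or its partners: the probe format, the data center labels, and the 50% threshold are all hypothetical.

```python
from collections import defaultdict


def failure_rate_by_path(probes):
    """Compute the failure rate per data center.

    probes: iterable of (data_center, succeeded) tuples, one per
    end-to-end test (hypothetical format).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for data_center, succeeded in probes:
        totals[data_center] += 1
        if not succeeded:
            failures[data_center] += 1
    return {dc: failures[dc] / totals[dc] for dc in totals}


def suspect_paths(probes, threshold=0.5):
    """Return data centers whose failure rate is at or above the threshold.

    The threshold is an arbitrary illustrative value.
    """
    rates = failure_rate_by_path(probes)
    return sorted(dc for dc, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    # Made-up probe results: every node reports healthy, but traffic
    # routed through "dc-b" mostly fails end to end.
    probes = [
        ("dc-a", True), ("dc-a", True),
        ("dc-b", False), ("dc-b", False), ("dc-b", True),
        ("dc-c", True),
    ]
    print(suspect_paths(probes))
```

Grouping end-to-end results by path matters here because per-node health checks alone would not reveal a fault of this kind: each component can be healthy while one route through them fails.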
August 28, 2023
August 29, 2023
August 30, 2023
August 31, 2023
September 5, 2023
This incident had a widespread impact, and we take reliability very seriously. While Hologram and its partners implement systems and safeguards to maximize reliability, we have identified several areas of improvement and will continue to work on uncovering any other gaps to prevent a recurrence.
Process: We are stressing that all planned works, no matter how small, should be announced across the entire connectivity supply chain. This allows Hologram to leverage our reporting and alerting to gauge impact (or non-impact) by correlating with known changes. Hologram has already stressed the importance of transparency with its direct partners, but it is evident that there's more work to do to ensure that transparency is extended to all critical upstream providers. Additionally, Hologram is proactively performing a fresh assessment of incident management and disaster recovery plans/procedures up the entire escalation path to ensure accountability for quick identification and resolution when issues arise.
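To illustrate why announced work helps, here is a minimal sketch of correlating an alert with known change windows. The change names, the timestamps, and the two-hour lookback are hypothetical values chosen for the example; this does not describe Hologram's actual alerting system.

```python
from datetime import datetime, timedelta, timezone


def overlapping_changes(alert_time, changes, lookback=timedelta(hours=2)):
    """Return names of announced changes active near an alert.

    changes: iterable of (name, start, end) tuples with timezone-aware
    datetimes (hypothetical format). A change matches when its window
    overlaps the interval [alert_time - lookback, alert_time].
    """
    window_start = alert_time - lookback
    return [
        name
        for name, start, end in changes
        if start <= alert_time and end >= window_start
    ]


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative alert time and change windows only.
    alert = datetime(2023, 8, 28, 19, 5, tzinfo=utc)
    changes = [
        ("hypothetical-dc-maintenance",
         datetime(2023, 8, 28, 18, 30, tzinfo=utc),
         datetime(2023, 8, 28, 19, 30, tzinfo=utc)),
        ("hypothetical-earlier-change",
         datetime(2023, 8, 28, 10, 0, tzinfo=utc),
         datetime(2023, 8, 28, 11, 0, tzinfo=utc)),
    ]
    print(overlapping_changes(alert, changes))
```

A check like this can only surface work that was announced, which is why unannounced changes leave nothing to correlate against.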
Monitoring: Hologram already heavily relies upon automated monitoring and alerting, and is committed to continuing to invest and innovate in this area. As a part of our fresh assessment of incident management and disaster recovery of critical providers, we will also assess areas of monitoring improvements for faster identification of issues.
Product: In addition to the above, which we hope will minimize the frequency and duration of incidents, Hologram is investing in product enhancements that will prevent or minimize downtime when an incident does occur.
While the incident is resolved on our status page and cellular data services are restored, we are continuing our efforts to maximize reliability and address these areas of improvement.
Our aim at Hologram is to provide the most reliable cellular connectivity product available, and we clearly fell short of our customers' expectations with this painful incident. Although the root cause was with an upstream partner, we are ultimately accountable and are deeply sorry for the disruption to our customers who were unable to use cellular data as expected due to the outage. We have started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.