Connectivity Interruption on some SIM cards
Incident Report for Hologram
Postmortem

On August 28, 2023, Hologram experienced an outage with one of our connectivity partners. This incident impacted all customers using SIMs/profiles with ICCIDs beginning with the 8944* prefix. During the incident, devices using these SIMs/profiles experienced a cellular data outage and degraded cellular data connectivity.

Incident Summary and Impact

Shortly after 19:00 UTC on August 28, 2023, one of Hologram's connectivity partners began experiencing degraded performance, followed by an outage. This incident impacted all customers using SIMs/profiles with ICCIDs beginning with the 8944* prefix. During the incident, devices using these SIMs/profiles were unable to establish cellular data sessions at times, and experienced degraded cellular data connectivity at other times. As the outage was isolated to that single connectivity partner, customers using other profiles were unaffected. Unfortunately, many customers rely upon the affected partner and had devices whose cellular data connectivity was disrupted.
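To check whether a given SIM falls in the affected range, the ICCID prefix test described above can be sketched as follows. The helper name `is_affected_iccid` is hypothetical and not part of any Hologram API, and the sample ICCIDs are made up for illustration.

```python
AFFECTED_PREFIX = "8944"  # prefix named in this report


def is_affected_iccid(iccid: str) -> bool:
    """Return True if the ICCID starts with the affected 8944 prefix.

    Hypothetical helper: it only inspects the leading digits and does
    not validate the ICCID length or check digit.
    """
    return iccid.strip().startswith(AFFECTED_PREFIX)


# Made-up ICCIDs, for illustration only.
sample_iccids = ["8944500000000000001", "8901260000000000002"]
affected = [iccid for iccid in sample_iccids if is_affected_iccid(iccid)]
```

Running a fleet's ICCID list through a filter like this gives a quick estimate of how many devices were exposed to the outage.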

Multiple actions were attempted to restore connectivity and address subsequent network congestion. The most impactful action occurred around 05:00 UTC on August 30, 2023, when traffic was rerouted around an identified fault, after which Hologram's affected customers saw a steady recovery in service. Some devices reconnected right away, others reconnected more slowly due to congestion, and others waited to reconnect until device-specific retry timers reset. By 00:23 UTC on August 31, 2023, 95% of impacted devices had reconnected, and over 99% had reconnected by September 5, 2023. A subset of devices experienced shorter-than-expected data sessions after the outage.
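The recovery pattern above, where devices reconnect at different times depending on their retry timers and on congestion, is one reason device firmware often spaces out reconnection attempts. The following is a minimal sketch of capped exponential backoff with full jitter, a common general technique. It is not a description of how any device in this incident behaved, and the function name and default values are assumptions chosen for illustration.

```python
import random


def backoff_delays(base_s=30.0, cap_s=3600.0, attempts=6, rng=None):
    """Return a list of randomized wait times (seconds) between retries.

    Each delay is drawn uniformly from [0, min(cap_s, base_s * 2**attempt)],
    so retries spread out over time instead of arriving in synchronized
    bursts when a network recovers.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Example: wait times for six consecutive failed connection attempts.
delays = backoff_delays(rng=random.Random(0))
```

Randomizing each delay means a large fleet does not retry in lockstep, which reduces the signalling load on a network that is still recovering.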

Root Cause and Technical Description

The root cause was identified as an erroneous configuration change introduced by our upstream partner's upstream interconnect provider, which occurred during planned but unannounced work in one of the interconnect provider's data centers. The interconnect provider is a critical element that allows traffic to flow between multiple operators' networks, and is also a critical element in implementing routing redundancy to and from operators. Unfortunately, this configuration change resulted in abnormal routing across redundant components, which undermined the system reliability and safeguards that are in place.

The root problem did not cause any of the component services to fail. Because of this, the location of the fault was not immediately apparent to the end-to-end tests that both Hologram and our upstream partner were running. Because individual components were functioning and health checks were (correctly) reporting that nodes across the networks were healthy, it took substantial time to identify and fix the root cause. After restarting several hot standby nodes in their network, and temporarily rerouting traffic, our upstream partner discovered that traffic routed via a specific set of data centers operated by their interconnect provider worked as expected. This allowed them to identify the data center that was failing silently and promptly remove it from their network as a workaround. This discovery led to locating the specific configuration that the interconnect provider had applied erroneously during planned works at that data center.
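The isolation step described above, comparing how traffic behaves over different routes when every node reports healthy, can be illustrated with a small sketch. This is not the partner's actual tooling. The function name, the route labels, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def failing_routes(attempts, min_success_rate=0.9):
    """Return routes whose session success rate falls below the threshold.

    `attempts` is an iterable of (route, succeeded) pairs. Node health
    checks are deliberately ignored: only end-to-end outcomes per route
    are compared.
    """
    totals = defaultdict(lambda: [0, 0])  # route -> [successes, total]
    for route, ok in attempts:
        totals[route][1] += 1
        if ok:
            totals[route][0] += 1
    return sorted(
        route
        for route, (successes, total) in totals.items()
        if successes / total < min_success_rate
    )


# Made-up data: "dc-a" carries sessions normally, "dc-b" fails silently.
sample = [("dc-a", True)] * 10 + [("dc-b", False)] * 8 + [("dc-b", True)] * 2
suspects = failing_routes(sample)
```

Grouping outcomes by route rather than by node is what lets a path-level fault stand out even when every individual health check passes.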

Timeline

August 28, 2023

  • Around 19:00 UTC: Interconnect provider introduces erroneous configuration change to one of its data centers. This initially results in degradation and eventually an outage of cellular data for SIMs/profiles with ICCIDs beginning with the 8944* prefix.
  • 19:39 UTC: Status page posted following Hologram detecting a probable outage as network conditions deteriorate. Hologram also escalates to our upstream connectivity partner.
  • 19:51 UTC: Upstream connectivity partner's NOC pages on-call engineering support. Multiple component services in partner's network are cleared of fault, and additional engineering resources are mobilized.

August 29, 2023

  • 00:25 UTC: Upstream partner commences attempts to adjust/restart various nodes that are reporting abnormal traffic patterns related to signalling (control plane for connection and session management).
  • 05:36 UTC: Following failed attempts to restore service, including replacing signalling hardware links that were alerting and implementing temporary workarounds to increase signalling performance and capacity, partner executes their crisis management plan.
  • 10:00 UTC: Partner performed various traffic management actions, including blocking specific incoming traffic, shutting down specific links, and increasing capacity to allow network elements time to recover and catch up. Partner's network monitoring now indicates that their mitigating actions are having a positive effect.
  • 15:00 UTC: With network congestion stabilizing, and after implementing a shunt to further increase signalling performance, partner begins introducing traffic that was previously temporarily blocked.
  • 20:00 UTC: Partner observes traffic levels approaching normal, and restores all interconnect partner link configurations to their pre-incident state.
  • 23:59 UTC: Partner's network monitoring alerts that there is once again abnormal signalling traffic. Partner escalates to interconnect provider, who begins an immediate investigation.

August 30, 2023

  • 05:00 UTC: After partner implemented a repeat of prior workarounds and traffic mitigations with the help of the interconnect provider, the interconnect provider now reverts the erroneous configuration change in their affected data center.
  • 07:30 UTC: Hologram continues to observe devices having issues connecting due to congestion during recovery.
  • 10:00 UTC: Partner observes signalling traffic levels approaching normal. Interconnect provider has commenced an investigation into their network configuration and planned works activities.
  • 19:38 UTC: Hologram observes 85% of affected devices have reconnected.
  • 21:11 UTC: Hologram observes 90% of affected devices have reconnected.

August 31, 2023

  • 00:23 UTC: Hologram observes 95% of affected devices have reconnected. Hologram observes a small subset of devices with unstable data connections, with sessions dropping more often than expected. Hologram continues to escalate these findings.
  • 16:59 UTC: Hologram observes 97% of affected devices have reconnected. Hologram observes a small subset of devices with unstable data connections, with sessions dropping more often than expected. Hologram continues to escalate these findings.

September 5, 2023

  • 17:14 UTC: Following the weekend, Hologram observes over 99% of affected devices have reconnected. Hologram continues to investigate the small subset of devices with sessions dropping more often than expected.

September 6, 2023

  • 16:40 UTC: With the vast majority of affected devices connected and exhibiting normal behavior, Hologram closes the incident on our status page.

Remediation and Next Steps

This incident had a widespread impact, and we take reliability very seriously. While Hologram and its partners implement systems and safeguards to maximize reliability, we have identified several areas of improvement and will continue to work on uncovering any other gaps to prevent a recurrence.

Process: We are stressing that all planned works, no matter how small, should be announced for the entire connectivity supply chain. This allows Hologram to leverage our reporting and alerting to gauge impact (or non-impact) by correlating with known changes. Hologram has already stressed the importance of transparency with its direct partners, but it is evident that there's more work to do to ensure that transparency is extended to all critical upstream providers. Additionally, Hologram is proactively performing a fresh assessment of incident management and disaster recovery plans/procedures up the entire escalation path to ensure accountability for quick identification and resolution when issues arise.
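The correlation idea in the Process item, matching an alert against announced change windows, could look roughly like the sketch below. It assumes planned works are available as (name, start, end) records, which is an assumption about data format rather than a description of Hologram's systems. The function name, the 30-minute slack, and the sample windows are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def changes_near_alert(alert_time, change_windows, slack=timedelta(minutes=30)):
    """Return names of announced changes whose window, widened by `slack`
    on both sides, contains the alert time."""
    return [
        name
        for name, start, end in change_windows
        if start - slack <= alert_time <= end + slack
    ]


# Made-up change windows, for illustration only.
utc = timezone.utc
windows = [
    ("routing update at data center X",
     datetime(2024, 1, 1, 18, 45, tzinfo=utc),
     datetime(2024, 1, 1, 19, 15, tzinfo=utc)),
    ("certificate rotation",
     datetime(2024, 1, 3, 2, 0, tzinfo=utc),
     datetime(2024, 1, 3, 3, 0, tzinfo=utc)),
]
alert = datetime(2024, 1, 1, 19, 39, tzinfo=utc)
candidates = changes_near_alert(alert, windows)
```

A lookup like this only works when the change was announced in the first place, which is why unannounced planned work removes a useful diagnostic signal.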

Monitoring: Hologram already heavily relies upon automated monitoring and alerting, and is committed to continuing to invest and innovate in this area. As a part of our fresh assessment of incident management and disaster recovery of critical providers, we will also assess areas of monitoring improvements for faster identification of issues.

Product: In addition to the above, which we hope will minimize the frequency and duration of incidents, Hologram is investing in product enhancements that will prevent or minimize downtime when an incident does occur.

While the incident is resolved on our status page and cellular data services are restored, we are continuing our efforts to maximize reliability and address these areas of improvement.

Conclusion

Our aim at Hologram is to provide the most reliable cellular connectivity product available, and we clearly fell short of our customers' expectations with this painful incident. Although the root cause was with an upstream partner, we are ultimately accountable and are deeply sorry for the disruption to our customers who were unable to use cellular data as expected due to the outage. We have started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.

Posted Oct 03, 2023 - 15:03 UTC

Resolved
This incident has been resolved. Working with our upstream partner we have identified and addressed the root causes. Device connections have returned to normal levels and we are seeing consistent performance across our network. A very small number of devices could still be experiencing unusual behavior. If you have such a device, please contact support for device-specific diagnostics.
Posted Sep 06, 2023 - 16:40 UTC
Update
Our data shows that 99% of devices are connecting normally now after the weekend.

We're still troubleshooting a small subset of devices that have unstable data connections (sessions dropping more often than expected).
Posted Sep 05, 2023 - 17:14 UTC
Update
Our data shows that 97% of devices are connecting normally. We have observed a small subset of devices with unstable data connections, with sessions dropping more often than expected. We are diagnosing root causes.
Posted Aug 31, 2023 - 16:56 UTC
Update
Over 95% of affected devices have restored connectivity, with more devices coming back online each hour. We're still working with our upstream partners to ensure the same success on remaining devices.
We have observed a small subset of devices with unstable data connections, with sessions dropping more often than expected. We are diagnosing root causes.
Posted Aug 31, 2023 - 00:23 UTC
Update
Based on our monitoring, connectivity has been restored for 90% of our affected SIM cards. We're still working with our upstream partners to ensure the same success on the remaining ones.
Posted Aug 30, 2023 - 21:11 UTC
Update
Based on our monitoring, connectivity has been restored for 85% of our affected SIM cards. We're still working with our upstream partners to ensure the same success on the remaining ones.
Posted Aug 30, 2023 - 19:38 UTC
Update
Our upstream partner is continuing to make improvements, and Hologram is starting to see devices come online. To reduce signalling overload on the network, the number of new devices is being rate limited to prevent congestion, and our upstream partner is continuing to incrementally increase capacity. Both Hologram and our partner are closely monitoring the network recovery to ensure network stability and to work through the recovery in a controlled manner.
Posted Aug 30, 2023 - 13:09 UTC
Monitoring
Our upstream partner is continuing to make improvements, and Hologram is starting to see devices come online. To reduce signalling overload on the network, the number of new devices is being rate limited to prevent congestion, and our upstream partner is continuing to incrementally increase capacity. Both Hologram and our partner are closely monitoring the network recovery to ensure network stability and to work through the recovery in a controlled manner.
Posted Aug 30, 2023 - 09:28 UTC
Update
Our upstream partner is continuing to implement changes to bring devices back online, and is currently working to mitigate signaling congestion. With the current congestion, Hologram continues to see devices failing to register with carrier networks. We are continuing to monitor the situation.
Posted Aug 30, 2023 - 07:42 UTC
Update
Our upstream partner has shifted traffic to a new node and is beginning to see improvements in IoT device connectivity. We are continuing to monitor the situation. We will continue to post updates here as we learn more.
Posted Aug 30, 2023 - 05:19 UTC
Identified
The previously identified fix for the failures in our upstream partner's network interfaces was not ultimately viable and was not implemented. Our partner believes they have now identified a common problem with the interface failures and are working on a fix.
Posted Aug 30, 2023 - 02:01 UTC
Update
The likely root cause of the regression has been identified on a node in our upstream partner's network interface. They engaged with their vendors and a solution has been identified, which will be executed in the next 30 minutes. The updated node will come back online and traffic will be gradually increased to it to ensure a stable recovery.
The expected recovery time depends on the depth of the backlog of connection requests. It will be a slow release to avoid overloading signalling, and they're estimating recovery by 0700 to 0800 UTC.
Posted Aug 30, 2023 - 01:06 UTC
Update
Our upstream partner has identified the root cause of the regression. One of the provider's network interfaces was unable to support the amount of traffic being released and was compromised.
The solution is currently being assessed, in which they'll migrate the signaling services to a different node that is operating normally and has bandwidth to support the incremental traffic.
Latest estimate for full resolution is 0700 to 0800 UTC. This estimate is based on the assumption that the applied solutions work as intended.
Posted Aug 29, 2023 - 22:59 UTC
Update
After trending upwards close to a resolution, we noticed a regression on resolved SIMs. We escalated with our upstream partners and they confirmed there has been a major regression on the solution that was implemented. We're monitoring the impact and working with our partners to identify the root cause of this regression.
Posted Aug 29, 2023 - 20:59 UTC
Update
We've been monitoring progress and it's been steadily improving. We're already past 65% of resolution, and we will keep monitoring and updating.
Posted Aug 29, 2023 - 19:01 UTC
Monitoring
The latest update received from our upstream partner indicates signalling has been stabilized and congestion is now overcome with continuous monitoring ongoing. The network is processing traffic as per normal standard, though affected devices will take some time to all reattach.
From timelines of previous similar incidents, it is expected that full recovery will likely take until well past midnight UTC.
All teams are focusing on improving these timelines wherever possible.
Posted Aug 29, 2023 - 15:58 UTC
Update
There are still issues with the connectivity on our upstream partner's network.

The cause of this incident has now been rectified and stability has been confirmed.

They're now facing a signaling storm due to congestion. They're restricting the traffic and slowly increasing throughput to resolve this.

We're still unable to provide an ETA but we started seeing improvements.
Posted Aug 29, 2023 - 12:39 UTC
Update
Our upstream partner continues to work on the remaining network instability issues adversely affecting subscriber attachments. There is no ETA yet on resolution. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 09:23 UTC
Update
Our upstream partner continues to work on the remaining network instability issues adversely affecting subscriber attachments. There is no ETA yet on resolution. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 08:11 UTC
Update
Our upstream partners have stabilized the replacement hardware and continue to work on the remaining network instability issues that are adversely affecting subscriber attachments. There is no ETA yet on resolution. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 07:26 UTC
Update
We are continuing to work on a fix for this issue.
Posted Aug 29, 2023 - 07:25 UTC
Update
We are continuing to work on a fix for this issue.
Posted Aug 29, 2023 - 06:58 UTC
Update
Our upstream partners have replaced faulty hardware, have begun bringing interconnect links back online, and are continuing to work to resolve the issues that remain. Unfortunately, bringing back the interconnect links has not yet had the expected effect on connectivity. Subscriber attachments are still adversely affected. There is no ETA yet on resolution. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 05:39 UTC
Update
Our upstream partners have replaced faulty hardware, have begun bringing interconnect links back online, and are continuing to work to resolve the issues that remain. Subscriber attachments are still adversely affected. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 04:31 UTC
Update
Our upstream partners have replaced faulty hardware and are continuing to work to resolve the issues that remain. We will continue to post updates here as we learn more.
Posted Aug 29, 2023 - 03:25 UTC
Update
Unfortunately the fix attempted by our upstream partners did not resolve the issue. They are continuing to investigate. We will post updates here as we learn more.
Posted Aug 29, 2023 - 02:35 UTC
Update
Upstream partners have identified the issue and are implementing a fix.
Posted Aug 29, 2023 - 00:55 UTC
Identified
The issue has been identified. We will post updates here as we learn more.
Posted Aug 28, 2023 - 23:45 UTC
Update
This issue is affecting SIMs starting with the 8944* prefix. It appears to be preventing any attachment to the cellular network right now. This has been escalated to the highest level.
Posted Aug 28, 2023 - 19:47 UTC
Investigating
We are seeing an issue with a network partner causing some SIM cards to be unable to pass data on the network. The issue has been escalated and we are investigating along with our partner. We will post updates here as we learn more.
Posted Aug 28, 2023 - 19:39 UTC
This incident affected: Cellular Networking (Global Cellular Data Network).