Temporary Data Outage for Planned Maintenance
Scheduled Maintenance Report for Hologram
Postmortem

Summary

On April 22nd, an upstream network partner began routine, planned maintenance to upgrade core packet networking infrastructure, with a planned outage of up to 30 minutes for Hologram customers around the world, from 01:00 UTC through 01:30 UTC on April 23rd. This maintenance was necessary to improve preventative security measures and overall network operation.

While monitoring the maintenance, our internal network monitors detected a lower-than-expected number of devices reconnecting to our global network. We alerted our carrier providers, continued to monitor the situation, and posted an update to our status page.

Following the completion of the planned outage during the maintenance window, a larger-than-anticipated number of devices attempted to connect to the network simultaneously, creating signaling congestion. As a result, cellular internet traffic was unroutable for Hologram customers and other providers worldwide, and the affected SIMs were unable to attach to towers and open new sessions.

Following the maintenance window and subsequent network delays, our partners were able to add capacity to better balance the load of devices, and ultimately restored service globally by the early hours of April 24 UTC.

We apologize for the inconvenience many customers experienced as a result of the disruption of service. Hologram is currently working with our carrier partners to find solutions to mitigate issues like this in the future. If you have any questions or concerns, please reach out to support@hologram.io.

Below is a brief timeline of the outage:

  • Apr 23, 01:00 - 03:00 UTC Planned outage window
  • Apr 23, 01:50 UTC Hologram detected issues with connection recovery and escalated the issue
  • Apr 23, 02:54 UTC Status page updated to reflect outage status
  • Apr 23, 04:58 UTC US and 4G LTE connections began to recover, but 2G/3G and non-US connections were still delayed
  • Apr 23, 07:01 UTC Signaling traffic congestion had been eliminated and network functionality began to recover worldwide
  • Apr 23, 14:29 UTC 4G and LTE connections resumed normal connectivity levels, while 2G/3G continued to be congested
  • Apr 23, 19:00 UTC 2G/3G connections worldwide began to recover
  • Apr 24, 03:16 UTC Issue resolved. Network signaling congestion had been resolved, and 2G/3G and LTE network functionality returned to normal worldwide.
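
To make the elapsed times in the timeline above easier to read, the short sketch below computes how long after initial detection each milestone occurred. The timestamps are taken from the timeline; the labels are abbreviated, and the names MILESTONES and elapsed_since_detection are ours, used only for illustration.

```python
from datetime import datetime, timezone

# Timestamps (UTC) copied from the timeline above; labels are abbreviated.
MILESTONES = [
    ("2021-04-23 01:50", "issue detected and escalated"),
    ("2021-04-23 02:54", "status page updated"),
    ("2021-04-23 04:58", "US and 4G LTE began to recover"),
    ("2021-04-23 07:01", "signaling congestion eliminated"),
    ("2021-04-23 14:29", "4G/LTE back to normal levels"),
    ("2021-04-23 19:00", "2G/3G began to recover worldwide"),
    ("2021-04-24 03:16", "issue resolved"),
]


def elapsed_since_detection(milestones):
    """Return (label, timedelta since the first milestone) for each entry."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc), label)
        for ts, label in milestones
    ]
    start = parsed[0][0]
    return [(label, when - start) for when, label in parsed]


for label, delta in elapsed_since_detection(MILESTONES):
    minutes = int(delta.total_seconds() // 60)
    print(f"+{minutes // 60:02d}h{minutes % 60:02d}m  {label}")
```

By this calculation, resolution came 25 hours and 26 minutes after the issue was first detected.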

Cause of Failure

On April 22nd, our upstream network partners began routine, planned maintenance to upgrade core packet networking infrastructure. This maintenance was necessary to improve preventative security measures and overall network operation.

The maintenance window began at 22:55 UTC on schedule, and was expected to end no later than 03:00 UTC on April 23rd. We expected brief losses of data routing and packet deliverability between 01:00 UTC and 01:30 UTC on April 23rd, resulting in connections being dropped or destination hosts becoming temporarily unreachable during that time.

However, following the planned outage, as devices began attempting to reconnect to networks, the volume of devices was larger than our upstream carrier partners had anticipated, and the resulting spike in signaling traffic created congestion that prevented devices from reattaching. The congestion caused signaling requests to time out, which led devices to retry and send even more requests. The result was a self-sustaining "storm" of traffic that overwhelmed our carrier partners' systems.
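
For illustration, one common way for device firmware to avoid feeding a reconnect storm like the one described above is exponential backoff with random jitter, so that retries spread out over time instead of arriving together. The sketch below is not Hologram's or any carrier's implementation; the function names and the base and cap values are hypothetical.

```python
import random


def reconnect_delay(attempt, base=5.0, cap=3600.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the actual
    wait is drawn uniformly from [0, bound] ("full jitter") so that devices
    that failed at the same moment do not all retry at the same moment.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    bound = min(cap, base * 2 ** min(attempt, 32))
    return rng.uniform(0.0, bound)


def backoff_schedule(max_attempts, base=5.0, cap=3600.0, seed=None):
    """Return the list of waits for attempts 0 .. max_attempts - 1."""
    rng = random.Random(seed)
    return [reconnect_delay(n, base, cap, rng) for n in range(max_attempts)]


# Example: waits (in seconds) for the first six attempts of one device.
print([round(d, 1) for d in backoff_schedule(6, seed=42)])
```

Without the random component, every device that lost its session at the same time would retry on the same schedule, and each retry wave would hit the signaling network at once.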

The initial cause appeared to be a limit on the maximum number of transactions per second in our carrier partner's systems, and the issue was then isolated to a single signaling provider. At that time 4G connectivity was restored in the US, and the engineering teams worked with the signaling provider to make routing changes in an effort to decrease congestion. The signaling provider's systems were unable to process the new connections, and the backlog resulted in long queues of devices waiting to come back online.
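
A rough way to see why a transactions-per-second limit produces long queues: once a backlog forms, it shrinks only at the rate by which processing capacity exceeds new arrivals. The sketch below works through that arithmetic. All figures are hypothetical and are not measurements from this incident.

```python
def drain_time_seconds(backlog, capacity_tps, arrival_tps):
    """Seconds until a queue of `backlog` pending requests is empty.

    capacity_tps: requests the system can process per second.
    arrival_tps:  new requests (including retries) arriving per second.
    """
    if backlog <= 0:
        return 0.0
    if capacity_tps <= arrival_tps:
        # Arrivals match or exceed capacity, so the backlog never shrinks.
        return float("inf")
    return backlog / (capacity_tps - arrival_tps)


# Hypothetical figures: 1,000,000 queued requests, 300 new requests per second.
print(drain_time_seconds(1_000_000, capacity_tps=500, arrival_tps=300))  # 5000.0
print(drain_time_seconds(1_000_000, capacity_tps=700, arrival_tps=300))  # 2500.0
print(drain_time_seconds(1_000_000, capacity_tps=300, arrival_tps=300))  # inf
```

In this simplified model, adding capacity and reducing retry traffic both widen the gap between capacity and arrivals, which is what shortens the queue.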

Resolution and Recovery

Automated monitoring systems at Hologram and at our upstream provider both reported network service health failures around 01:50 UTC on April 23, 2021. We updated our status page at 02:54 UTC to notify customers of the delays.

These reports of service disruption were escalated to Hologram's on-call engineering team, CTO, and Connectivity Product Lead. Hologram then escalated the issue to the affected global gateway provider for priority investigation.

When this issue was identified, our global gateway provider submitted a request to increase signaling link capacity with their upstream signaling providers. Resources were dedicated to balancing the traffic load, and an emphasis was placed on addressing domestic (US) and LTE traffic because that covers the bulk of our connections.

Following the successful return of 4G network functionality, 2G/3G network functionality took another 12 hours to be fully restored as the network partners were forced to re-route traffic in order to regain enough capacity to process incoming messages. Regular routing patterns were restored once the situation was resolved.

Recovery timeline:

  • Apr 23, 01:50 UTC Hologram detected issues with connection recovery and escalated the issue.
  • Apr 23, 04:30 UTC Connection/authentication success rates on both 2G/3G and 4G/LTE began to climb.
  • Apr 23, 04:58 UTC US and 4G LTE connections were partially recovered, while 2G/3G and non-US connections were still delayed.
  • Apr 23, 13:00 UTC 4G connections were in a stable position, and US destinations were performing as expected. Where possible, customers were advised to force devices to connect to 4G services as signaling providers were experiencing SS7 signaling congestion impacting 2G, 3G and API lookups.
  • Apr 23, 19:01 UTC Signaling traffic congestion had been eliminated and network functionality began to recover worldwide.
  • Apr 24, 03:16 UTC Global gateway provider reported monitoring near-100% success rates for connection/authentication (within the normal baseline rate) as well as normal signaling and data load on the network. Hologram Engineering and Customer Success continued to monitor customer devices as they reconnected and then marked the incident as resolved after no further congestion or network issues were observed.
Posted Apr 30, 2021 - 15:16 UTC

Completed
Network signaling congestion has been resolved, and 2G/3G and LTE network functionality is currently normal worldwide.
Devices should continue to reattach on their own after any failed requests have timed out or any device back-off periods have expired.
As devices reattach, our team continues to work with affected customers. A post-mortem will be published at status.hologram.io.
Posted Apr 24, 2021 - 03:16 UTC
Update
Our upstream carrier partners and signaling providers have fixed network congestion issues, and 2G/3G and LTE network functionality is currently normal worldwide. We are continuing to monitor along with our partners.
Devices should continue to reattach on their own after any failed requests have timed out or any device back-off periods have expired.
As devices reattach, our team continues to work with affected customers.
Posted Apr 23, 2021 - 22:34 UTC
Update
We are continuing to monitor progress.
Posted Apr 23, 2021 - 21:12 UTC
Update
Our upstream carrier partners continue to improve on network signaling congestion.
Global LTE networks are operating near normal levels.
Global 2G/3G is still sub-nominal but we have been seeing continuous improvement. Customers in the UK specifically may see issues attaching 2G/3G devices to the network.
US 2G/3G networks are operating near normal levels.
Posted Apr 23, 2021 - 19:54 UTC
Update
We are continuing to see more devices reattach as network congestion decreases, and we continue to monitor the situation with our upstream carrier partners.
Posted Apr 23, 2021 - 17:02 UTC
Update
Our team is monitoring a small number of network connection requests delayed by signaling network congestion, primarily affecting 2G/3G connection requests. Impacted devices continue to reconnect and overall network connection counts and activity are near normal levels. Our team is actively working with network partners through resolution for remaining devices.
Posted Apr 23, 2021 - 14:29 UTC
Update
Signaling congestion has been eliminated and network performance has been restored to prior levels. We will continue to monitor as remaining devices reconnect.
Posted Apr 23, 2021 - 07:01 UTC
Verifying
We are seeing 2G and 3G sessions recovering. We will monitor the situation to confirm resolution.
Posted Apr 23, 2021 - 06:40 UTC
Update
We are seeing service mostly restored and working well for LTE devices. Our upstream carrier partners are still working to manage signaling traffic to bring the remaining 2G and 3G devices online.
Posted Apr 23, 2021 - 04:58 UTC
Update
The issue has been identified as signaling congestion from a large number of devices reconnecting. Carriers are working to increase capacity on their systems.
Posted Apr 23, 2021 - 03:26 UTC
Update
Devices are coming back online from the maintenance window; however, this process is moving slower than expected. We are working with our carrier partners to speed this up.
Posted Apr 23, 2021 - 02:54 UTC
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Apr 22, 2021 - 22:55 UTC
Scheduled
One of our carrier partners will be performing planned network maintenance on core packet networking infrastructure on April 22nd, beginning at 22:55 UTC and running until 03:00 UTC April 23rd. This maintenance is necessary for improvements to preventative security measures as well as making overall improvements to network operation. Indigo SIMs will not be impacted.

Impacts: During the active work, from 01:00 UTC through 01:30 UTC on April 23rd, there may be brief losses of data routing and packet deliverability, resulting in connections being dropped or destination hosts becoming temporarily unreachable. PDP contexts (packet-switched data sessions) and automatic IP address assignment requests may time out. SIM operations (activation, deactivation, pausing, and unpausing) should not be affected. Tower authentication and circuit-switched SMS should not be affected. Dashboard and API will not be affected.

We recommend against performing critical maintenance or operations on devices during this maintenance window. If you have any questions, please feel free to reach out to support@hologram.io. The Hologram team will be standing by during the maintenance window to answer any questions.
Posted Apr 09, 2021 - 15:14 UTC
This scheduled maintenance affected: Cellular Networking (Global Cellular Data Network) and Data Services (SMS over IP, SpaceBridge Tunneling).