On April 22nd, an upstream network partner performed routine, planned maintenance to upgrade core packet networking infrastructure. The plan included an outage window of up to 30 minutes for Hologram customers around the world, from 01:00 UTC to 01:30 UTC on April 23rd. The maintenance was needed to improve preventative security measures and overall network operation.
While monitoring the maintenance progress, our internal network monitors detected that fewer devices than expected were reconnecting to our global network. We alerted our carrier providers, posted an update to our status page, and continued to monitor the situation.
After the planned outage ended, a larger-than-anticipated number of devices attempted to connect to the network simultaneously, which created signaling congestion. As a result, cellular internet traffic was unroutable for Hologram customers and other providers worldwide, and the affected SIMs were unable to attach to towers and open new sessions.
Following the maintenance window and the subsequent network delays, our partners added capacity to better balance the load of devices, and ultimately restored service globally by the early hours of April 24 UTC.
We apologize for the inconvenience many customers experienced as a result of the disruption of service. Hologram is currently working with our carrier partners to find solutions to mitigate issues like this in the future. If you have any questions or concerns, please reach out to support@hologram.io.
Below is a brief synopsis of the time markers throughout the outage:
On April 22nd our upstream network partners engaged in routine, planned maintenance to upgrade core packet networking infrastructure. This maintenance was necessary for improvements to preventative security measures as well as for making overall improvements to network operation.
The maintenance window began on schedule at 22:55 UTC on April 22nd, and was expected to end no later than 03:00 UTC on April 23rd. We expected brief losses of data routing and packet deliverability between 01:00 UTC and 01:30 UTC on April 23rd, resulting in dropped connections or temporarily unreachable destination hosts during that time.
However, once the planned outage ended and devices began attempting to reconnect, the volume of devices was larger than our upstream carrier partners had anticipated, and the resulting spike in signaling traffic created congestion that prevented devices from reattaching. Signaling requests began to time out, and each timeout prompted the device to send more requests. The result was a self-sustaining "storm" of traffic that did not abate and that overwhelmed our carrier partners' systems.
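To illustrate why retry timing matters in a reconnection storm like the one described above, here is a minimal sketch of exponential backoff with full jitter, a common way for a client to spread its retries out over time instead of retrying in lockstep with every other device. This is not a description of how any particular modem or carrier system behaves. The function name `backoff_delay` and the default values for `base` and `cap` are assumptions chosen for illustration only.

```python
import random


def backoff_delay(attempt, base=5.0, cap=3600.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt`.

    `attempt` is 0 for the first retry. `base` and `cap` are illustrative
    values, not recommendations for any specific device or network.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    exponent = min(attempt, 32)
    # The ceiling doubles with each failed attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick uniformly between 0 and the ceiling, so devices
    # that failed at the same moment do not all retry at the same moment.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(f"retry {attempt}: wait {backoff_delay(attempt):.1f}s")
```

With a scheme along these lines, each failed attempt roughly doubles the window in which the next retry may fall, so the aggregate request rate from a large fleet tends to decrease over time rather than stay pinned at the level that caused the timeouts.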
The initial cause appeared to be limitations on the maximum number of transactions per second in our carrier partner's systems, and the issue was then isolated to a single signaling provider. At that time, 4G connectivity was restored in the US, and the engineering teams worked with the signaling provider to make routing changes in an effort to decrease congestion. The signaling provider's systems were unable to process the new connections, and the backlog resulted in long queues of devices waiting to come back online.
Automated monitoring systems at Hologram and at our upstream provider both reported network service health failures around 01:50 UTC on April 23, 2021. We updated our status page at 02:54 UTC to notify customers of the delays.
These reports of service disruption were escalated to Hologram's on-call Engineering team, CTO, and Connectivity Product Lead. Hologram then escalated the issue to the affected global gateway provider for priority investigation.
When this issue was identified, our global gateway provider submitted a request to increase signaling link capacity with their upstream signaling providers. Resources were dedicated to balancing the traffic load, with an emphasis on domestic (US) and LTE traffic, as that covers the bulk of our connections.
Following the successful return of 4G network functionality, 2G/3G network functionality took another 12 hours to be fully restored as the network partners were forced to re-route traffic in order to regain enough capacity to process incoming messages. Regular routing patterns were restored once the situation was resolved.
Recovery timeline: