During routine maintenance activity conducted by Amazon Web Services, an AWS service critical to our platform function was degraded and did not recover. This service is responsible for enabling communication among the TeleTracking components that power the applications you use, and in its degraded state, our applications could not function properly.
Our monitoring detected the issue at approximately 5:20 a.m., at which time our support team began assessing the situation. We declared an incident at 5:56 a.m., at which point all on-call engineering teams were paged. Our public Status Page (https://us.status.teletracking.com) was updated at 6:02 a.m.
Because the affected service was degraded and not completely unavailable, we initially attempted to recover it. During this period, we also engaged AWS support. As our investigation progressed, we escalated further with AWS at 8:24 a.m. In parallel, we began building a replacement for the affected service and prepared for a transition. At 11:55 a.m., we deployed the replacement, and customers began recovering within two minutes.
Over the following hours, we closely monitored recovery. Most clients’ functionality restored automatically. For some, our support team proactively reached out and restarted on-premise services to fully restore communication.
While our review of the incident will continue over the next few days, we have already identified opportunities to improve our response. Specifically, we are evaluating the time between incident identification and initial engagement with AWS Support, as well as our decision timing to transition to a replacement service. Both intervals were longer than our expectations.
We regret this incident and the disruption to your operations.