Service Issue in Platform - SMS API, Verify API, Messages API
Incident Report for Vonage API
Postmortem

Date/Time Impacted

2022-05-19 15:31 UTC - 2022-05-19 15:34 UTC

Services Impacted

SMS API

Verify API

Messages API

Summary

Our backend authentication service experienced a configuration error, which caused customer requests to the impacted services to receive HTTP 403 responses.

At 15:34 UTC our engineering team was alerted to an increase in failed customer API requests. Further investigation showed a rise in HTTP 403 responses, which we immediately attributed to a software update whose deployment had started at 15:28 UTC. The update completed at 15:34 UTC, and no further impact was experienced after that time.

Root Cause

On May 19th at 15:28 UTC we deployed a routine software update to our platform; no customer impact was expected from this update. At the start of the deployment process a manual step was missed, and its effect went undetected while the deployment ran. On completion, the update appeared to be successful. However, because that initial step had been skipped, the authentication service became unstable. Specifically, requests requiring authentication were incorrectly routed to service instances that were restarting to apply the update. Those instances could not process requests, so some customer API requests temporarily could not be authenticated and were rejected with HTTP 403 responses.

Restoration Actions

No direct restorative action was required; customer impact stopped once the software update completed.

Next Steps

  • Implement additional automation in our deployment process to reduce manual intervention
  • Implement configuration to gracefully shut down internal services that process authentication requests
  • Expand our monitoring scope by alerting on elevated 4xx errors, per region
  • Review the frequency of our automated testing
  • Configure services to respond with a specific status code during a period of unavailability
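The graceful-shutdown and status-code items above can be illustrated with a minimal sketch. It is not the production implementation: the class `DrainingService`, its methods, and the choice of HTTP 503 with a `Retry-After` header as the "specific status code" are hypothetical assumptions made for illustration. The idea shown is that an instance about to restart first reports itself as not ready, answers new requests with an explicit "temporarily unavailable" status instead of an authentication failure, and waits for in-flight requests to finish before it is restarted.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)


class DrainingService:
    """Hypothetical wrapper around one authentication service instance."""

    def __init__(self, retry_after_s: int = 1) -> None:
        self._lock = threading.Lock()
        # Notified whenever the number of in-flight requests drops to zero.
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._draining = False
        self.retry_after_s = retry_after_s

    def ready(self) -> bool:
        """Readiness check: a router should stop sending traffic once this is False."""
        with self._lock:
            return not self._draining

    def handle(self, authenticate: Callable[[], bool]) -> Response:
        """Run one authentication request unless the instance is draining."""
        with self._lock:
            if self._draining:
                # Assumed choice: an explicit "temporarily unavailable" status, not a 403.
                return Response(503, {"Retry-After": str(self.retry_after_s)})
            self._in_flight += 1
        try:
            return Response(200 if authenticate() else 403)
        finally:
            with self._lock:
                self._in_flight -= 1
                if self._in_flight == 0:
                    self._idle.notify_all()

    def begin_shutdown(self, timeout_s: float = 30.0) -> bool:
        """Stop accepting work, then wait for in-flight requests to finish.

        Returns True if the instance drained within the timeout and is safe to restart.
        """
        with self._lock:
            self._draining = True
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)


if __name__ == "__main__":
    svc = DrainingService()
    print(svc.handle(lambda: True).status)    # 200
    print(svc.begin_shutdown(timeout_s=1.0))  # True: nothing was in flight
    print(svc.handle(lambda: True).status)    # 503
```

In this sketch, a router that polls `ready()` would stop sending traffic to the instance as soon as draining begins. A client that still reaches it receives a 503 it can retry after the indicated delay, whereas a 403 tells the client its request was refused, which is usually not worth retrying.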
Posted May 23, 2022 - 14:00 UTC

Resolved
This incident has been resolved.
Posted May 19, 2022 - 16:53 UTC
Update
We are continuing to monitor for any further issues.
Posted May 19, 2022 - 15:55 UTC
Monitoring
We have identified the root cause of this issue and implemented a fix to restore the full functionality of the following services:

SMS API
Verify API
Messages API

From 15:31 UTC until 15:34 UTC on 2022-05-19, customers submitting API requests to the impacted services may have received HTTP 403 responses.

We will continue to monitor the service during the next few hours and post updates should anything change.
Posted May 19, 2022 - 15:55 UTC
Investigating
Our monitoring has alerted us to a potential service issue with our platform. Although it is not yet certain whether there is any customer impact, we are alerting customers immediately while our engineering teams investigate.

If there is a customer impact, it is likely to be affecting the following products:

SMS API
Verify API
Messages API
Dispatch API

We aim to provide an update on our investigation within 1 hour.

Please see this article - https://help.nexmo.com/hc/en-us/articles/360015693092-Nexmo-Incident-Handling - for an explanation of our approach to publishing incidents.
Posted May 19, 2022 - 15:45 UTC
This incident affected: SMS API (Outbound SMS) and Verify API and Verify SDK, [Beta] Messages API.