Summary

Between 4:00 pm and approximately 5:36 pm PT on Tuesday, November 23rd, we experienced an outage across most Coinbase production systems. During this outage, users could not access Coinbase through our websites and apps, and therefore could not use our products. This post describes what happened, what caused it, and how we plan to avoid similar problems in the future.

The Incident

On November 23rd, 2021, at 4:00 pm PT (Nov 24, 2021 00:00 UTC), an SSL certificate for an internal hostname in one of our Amazon Web Services (AWS) accounts expired. Many of our internal load balancers used this certificate, so a majority of inter-service communications failed. Because our API routing layer connects to backend services via subdomains of this internal hostname, about 90% of incoming API traffic returned errors.

Error rates returned to normal once we were able to migrate all load balancers to a valid certificate.
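To make the failure mode concrete, here is a minimal sketch of how the remaining validity of a served certificate can be computed with Python's standard library. It is an illustration only, not the tooling we use; the function names `days_until_expiry` and `fetch_not_after` are hypothetical, and the host you pass in is an assumption of the example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a certificate's notAfter time.

    `not_after` uses the format returned by ssl.getpeercert(),
    for example "Nov 24 00:00:00 2021 GMT". A negative result
    means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read notAfter from the certificate a server presents.

    The default context verifies the chain, so an already-expired
    certificate raises ssl.SSLCertVerificationError instead of
    returning a value.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_until_expiry("Nov 24 00:00:00 2021 GMT", datetime(2021, 11, 23, tzinfo=timezone.utc))` returns `1.0`, meaning the certificate has one day of validity left. A scheduled check along these lines could raise an alert once the value drops below a chosen threshold.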

Chart: overall 90% error rate at our API routing layer for the duration of the incident.

Context: Certificates at Coinbase

Some background on how we manage SSL certificates at Coinbase is helpful here. For the most part, certificates for public hostnames like coinbase.com are managed and provisioned by Cloudflare. For internal hostnames used to route traffic between backend services, we historically relied on AWS IAM Server Certificates.
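IAM server certificates carry an expiration timestamp in their metadata, so an account's inventory can be scanned for certificates nearing expiry. The sketch below assumes the `boto3` SDK and configured AWS credentials for the listing step; `expiring_soon`, `list_iam_server_certificates`, and the 30-day window are hypothetical choices made for illustration, not a description of our internal tooling.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(cert_metadata, now, within=timedelta(days=30)):
    """Return (name, expiration) pairs for certificates that expire
    within `within` of `now`, soonest first. Certificates that have
    already expired are included.

    `cert_metadata` is a list of dicts shaped like the entries of
    ServerCertificateMetadataList in the IAM ListServerCertificates
    response (timezone-aware `Expiration` datetimes).
    """
    flagged = [
        (c["ServerCertificateName"], c["Expiration"])
        for c in cert_metadata
        if c["Expiration"] - now <= within
    ]
    return sorted(flagged, key=lambda pair: pair[1])


def list_iam_server_certificates():
    """Collect metadata for every IAM server certificate in the account.

    Requires boto3 and AWS credentials; imported lazily so the
    filter above can be used and tested without either.
    """
    import boto3  # third-party dependency, assumed to be installed

    iam = boto3.client("iam")
    certs = []
    for page in iam.get_paginator("list_server_certificates").paginate():
        certs.extend(page["ServerCertificateMetadataList"])
    return certs
```

Calling `expiring_soon(list_iam_server_certificates(), datetime.now(timezone.utc))` would list the certificates due within the next 30 days. Keeping the filter separate from the AWS call makes the expiry logic easy to test with sample data.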

One of the downsides of IAM Server Certificates is that certificates must be generated outside of AWS and uploaded via an…

