Yesterday, many of you, our valued customers, encountered problems accessing our products. On behalf of the entire team, I want to sincerely apologise: there is nothing we take more seriously than disruption to your service.
I also want to be completely transparent about what happened and to assure you that we will do everything possible to ensure this never happens again.
At 4.20pm GMT we performed what we would normally classify as a routine update to our primary database: we needed to add a new field to the table that holds information about each account. To minimise the impact on the live platform while such an update runs, we use a tool that applies the change to a copy of the table and, once the change is complete, replaces the original table with the modified copy. Unfortunately, a step failed while the original table was being replaced, a failure scenario we had not planned for.
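For readers who would like more technical detail, the copy-and-swap pattern described above can be sketched roughly as follows. This is a simplified illustration only, not our production tooling: it uses SQLite from Python's standard library, and the `copy_alter_swap` function, the `accounts` table and its columns are all hypothetical. What it shows is the general shape of the technique: build and fill a modified copy, swap it in, and verify before discarding the old table. In this sketch the swap runs inside a single transaction, so a failure part-way through rolls back and leaves the original table in place.

```python
import sqlite3


def copy_alter_swap(conn: sqlite3.Connection, table: str, shadow_ddl: str,
                    simulate_swap_failure: bool = False) -> None:
    """Hypothetical copy-then-swap schema change (illustrative sketch only).

    Assumptions: `conn` was opened with isolation_level=None so the explicit
    BEGIN/COMMIT below control the transaction; `shadow_ddl` creates a table
    named f"{table}_new" with every column of `table` plus the new field;
    table names are trusted; no foreign keys, triggers or views reference
    `table`; and nothing writes to `table` while the copy is being built.
    """
    shadow, old = f"{table}_new", f"{table}_old"
    cols = ", ".join(row[1] for row in conn.execute(f"PRAGMA table_info({table})"))

    # Step 1: build and fill the modified copy; the live table is not touched.
    conn.execute(shadow_ddl)
    conn.execute(f"INSERT INTO {shadow} ({cols}) SELECT {cols} FROM {table}")

    # Step 2: swap the tables in one transaction, so a failure part-way
    # through rolls back and the original table stays where it was.
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
        if simulate_swap_failure:  # hypothetical test hook: mimic a failed step
            raise RuntimeError("simulated failure during swap")
        conn.execute(f"ALTER TABLE {shadow} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

    # Step 3: check row counts before discarding the old table.
    new_rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    old_rows = conn.execute(f"SELECT COUNT(*) FROM {old}").fetchone()[0]
    if new_rows != old_rows:
        raise RuntimeError(f"row count mismatch; keeping {old} for recovery")
    conn.execute(f"DROP TABLE {old}")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO accounts (email) VALUES ('a@example.com'), ('b@example.com')")
    copy_alter_swap(conn, "accounts",
                    "CREATE TABLE accounts_new (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
    print([row[1] for row in conn.execute("PRAGMA table_info(accounts)")])
    # ['id', 'email', 'plan']
```

Production online schema-change tools follow the same broad idea but must also keep the copy in sync with writes made while it is being built, which this sketch deliberately ignores.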
The affected table is used by our User Login service, so users trying to log in were shown an "account not found" message. As soon as we identified the problem, we began restoring the table to get you back into your accounts.
Our initial analysis indicates that a combination of human and process error was to blame. The majority of our customers were disrupted for 45 minutes, but, more seriously, many customers were unable to access their sites for up to 3 hours while we restored the records for all accounts from our RDS backup.
We have already updated some of our internal processes to prevent this from happening again, and we will carry out a full audit of the events to make sure we take every possible lesson from this incident. The messages we received from you brought home once again how critical it is that we deliver maximum uptime to support your projects, and we will redouble our efforts to meet and exceed your availability expectations.
To all our customers who were affected by this incident, please accept my most sincere apology.
Daniel Mackey, CTO