Cooling System Incident at Data Centre
Incident Report for Crucial Hosting
Postmortem

Introduction

On Friday, November 16, 2018, controlled emergency shutdown procedures were executed when temperatures at the Pyrmont Data Centre exceeded acceptable thresholds due to a cooling system failure.

No customer data was lost, and this timely response protected all hardware from thermal failure. Once the cooling system was restored, we were able to bring all customer services back online.

We have received a full post-incident review (PIR) from the data centre. It includes the remedial action taken and confirmation of the upgrades and testing carried out to ensure that the risk of a similar incident recurring has been satisfactorily mitigated.

Summary

At 12:32pm, our monitoring systems indicated that temperatures in the Pyrmont data centre had started to increase. Our system administrators conducted an immediate audit of all servers in the facility to determine whether this was isolated to some areas or was a facility-wide event. It was clear that this was a facility-related event, and we immediately contacted our data centre provider for more information.

At 12:44pm we received confirmation that the facility had seamlessly moved to UPS power after an area-wide Ausgrid power outage at 12:22pm; however, there was a cooling issue that the data centre was actively working on.

As temperatures continued to rise, the decision was made at 1:02pm to execute emergency shutdown procedures to ensure data integrity across our platforms. This involved a graceful shutdown of all services.

At 1:31pm our on-site engineers noted that the cooling systems were functioning once again.

At 1:42pm our monitoring systems indicated that temperatures in the data centre facility were approaching normal levels. Our engineers monitored the situation closely to ensure thermal stability prior to powering servers on.

At 1:45pm our engineers started restoring power to servers, and customer services started coming back online.

The majority of services were restored by 2:40pm. All remaining services were restored by 4:30pm.
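
The intervals between the events in this timeline can be worked out with a short script. The following is a minimal sketch only: the timestamps are copied from the summary above and converted to 24-hour time, while the event labels and the minutes_between helper are names introduced here for illustration and are not part of any Crucial Hosting tooling.

```python
from datetime import datetime

# Timestamps (AEDT, November 16, 2018) taken from the summary above.
# The event labels are shorthand invented for this sketch.
TIMELINE = [
    ("power_outage", "12:22"),
    ("temperature_alert", "12:32"),
    ("provider_confirmation", "12:44"),
    ("shutdown_decision", "13:02"),
    ("cooling_functioning", "13:31"),
    ("temperatures_near_normal", "13:42"),
    ("power_on_started", "13:45"),
    ("majority_restored", "14:40"),
    ("all_restored", "16:30"),
]


def minutes_between(start_label, end_label, timeline=TIMELINE):
    """Return the whole minutes elapsed between two labelled events."""
    times = {label: datetime.strptime(clock, "%H:%M") for label, clock in timeline}
    delta = times[end_label] - times[start_label]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("temperature_alert", "shutdown_decision"),
        ("shutdown_decision", "power_on_started"),
        ("power_on_started", "majority_restored"),
        ("power_on_started", "all_restored"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {minutes_between(start, end)} min")
```

On these figures, 30 minutes passed between the first temperature alert and the shutdown decision, 43 minutes between the shutdown decision and the start of power restoration, and 165 minutes between the start of power restoration and the last services returning.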

Root Cause

The root cause of the incident was an extended cooling system interruption at the data centre in Pyrmont.

Corrective and Preventative Measures

We have followed up with the data centre, and they have advised that independent engineers determined the required remedial action: replacing some components in the cooling system and installing an override facility on the auxiliary emergency heat extraction system. The faulty components have been replaced, and a number of full failover tests have been completed successfully.

Posted Dec 03, 2018 - 11:34 AEDT

Resolved
All services have remained stable.
Our engineers will continue to monitor the situation closely.
A post-incident review will be conducted and a postmortem will be provided.

If you are experiencing any ongoing issues with your hosting service, please contact our support team via https://support.crucial.com.au/
Posted Nov 19, 2018 - 11:36 AEDT
Monitoring
All services have been restored.
Our engineers will continue to monitor the situation closely.
A post-incident review will be conducted and a postmortem will be provided.

If you are experiencing any ongoing issues with your hosting service, please contact our support team via https://support.crucial.com.au/
Posted Nov 16, 2018 - 16:30 AEDT
Update
All Web Hosting and Reseller Hosting services are now online.

ControlPanel VPS, BareBones VPS and Managed VPS are progressively starting up.

An update on the situation will be provided in 90 minutes or when there is a significant change in the situation.
Posted Nov 16, 2018 - 16:17 AEDT
Update
Blaze Cloud VPSs are now online. If you are experiencing any ongoing issues with your Blaze Cloud VPS, please contact our support team via https://support.crucial.com.au/

ControlPanel VPS, BareBones VPS and Managed VPS are progressively starting up.

Web Hosting servers sh15 and sh24 were just brought back online.

Reseller Hosting server rs13 has just completed integrity checks and is now starting up.

An update on the situation will be provided in 90 minutes or when there is a significant change in the situation.
Posted Nov 16, 2018 - 15:55 AEDT
Update
Our engineers are continuing to restore services as quickly as possible. Blaze Cloud VPSs, ControlPanel VPS, BareBones VPS and Managed VPS are progressively starting up.

A number of 'Web Hosting' and 'Reseller Hosting' services are offline and will be brought back online as soon as possible.

An update on the situation will be provided in 60 minutes or if there is a significant change in the situation.
Posted Nov 16, 2018 - 14:40 AEDT
Update
Our engineers are proceeding to restore services and are continuing to monitor the thermal stability of the environment. An update on the situation will be provided in 30 minutes or if there is a significant change in the situation.
Posted Nov 16, 2018 - 13:59 AEDT
Update
The thermal issue at the data centre has been rectified. Our engineers are monitoring to ensure thermal stability before they begin to restore services. An update on the situation will be provided in 10 minutes.
Posted Nov 16, 2018 - 13:43 AEDT
Identified
Our Pyrmont Data Centre is currently experiencing a cooling system incident. Some customer services are being shut down as a precaution to avoid data loss. Your patience is appreciated.

As further information becomes available, we will provide an update here.
Posted Nov 16, 2018 - 13:06 AEDT
This incident affected: Blaze Cloud Platform, Virtual Server Platform, Web Hosting Services, Reseller Hosting Services, Crucial Client Area, Help Centre, and Phone Support.