Blaze VPS - Storage Platform Outage
Incident Report for Crucial Hosting
Resolved
This incident has now concluded.
Posted Nov 25, 2019 - 17:20 AEDT
Monitoring
Bringing the replacement storage cluster hardware online has allowed us to recover further VMs, and we continue to work with the storage vendor.

Another update will be provided following any significant changes.
Posted Nov 20, 2019 - 17:01 AEDT
Update
Working with the storage vendor, we have successfully brought the replacement hardware online and joined it to the storage cluster. We are currently performing verification checks as directed by the vendor.

In the meantime, we have recovered 97% of customer VMs to our contingency hardware. All remaining VMs are either being recovered or belong to customers who have elected to wait for recovery of the storage cluster. We are waiting for these checks to complete before we can proceed with further recovery work on the storage cluster.

Another update will be provided on 20/11/2019 by 17:00 AEST, or sooner if there are any significant changes.
Posted Nov 20, 2019 - 14:22 AEDT
Update
We've received the replacement hardware from the storage vendor.

The replacement hardware is in the process of being racked up within our datacentre.

The storage vendor has scheduled senior staff to work directly with our team on 20/NOV/2019 from 03:00 AEST.

Our technicians are still continuing to restore the last few remaining VMs onto new infrastructure.

Another update will be provided on 20/NOV/2019 by 11:00 AEST, or sooner if there are any significant changes.
Posted Nov 19, 2019 - 16:39 AEDT
Update
Our technicians are continuing to transfer and restore the remaining VMs onto the new infrastructure.

For customers who are still offline and do not have backups, we are waiting for replacement hardware to arrive from the storage vendor, which we hope will assist in the recovery of the remaining data. Once the replacement hardware arrives, we will work with the storage vendor engineers to bring the remaining volumes online. The current ETA on the replacement hardware arrival is by Tuesday 17:00 AEDT.

If we can assist with your self-restore efforts by provisioning a new service, please let us know. For example, we are able to provision you a new Virtual Machine on the new infrastructure with your existing IP addresses in order to speed up your recovery efforts.

As we do not expect the situation to change significantly in the next 24 hours, our next status update is expected to occur no earlier than Tuesday 17:00 AEDT (Sydney time).

In the event there is a significant change we will update here accordingly.
Posted Nov 18, 2019 - 17:01 AEDT
Update
We have restored a large number of customers’ services so far. If your service has been restored or if you have any further issues, please let our support team know.

Our technicians are continuing to transfer and restore the remaining VMs onto the new infrastructure. Additional infrastructure is being built to enable us to execute this process more quickly.

For customers who are still offline and do not have backups, we are waiting for replacement hardware to arrive from the storage vendor, which we hope will assist in the recovery of the remaining data. Once the replacement hardware arrives, we will work with the storage vendor engineers to bring the remaining volumes online. The current ETA on the replacement hardware arrival is Tuesday 17:00 AEDT.

If we can assist with your self-restore efforts by provisioning a new service, please let us know. For example, we are able to provision you a new Virtual Machine on the new infrastructure with your existing IP addresses in order to speed up your recovery efforts.

We will post another status update at 17:00 AEDT (Sydney Time), or when there is a significant change in the situation.
Posted Nov 18, 2019 - 13:35 AEDT
Update
Our technicians have identified a method to successfully transfer and restore VMs from the storage cluster onto the new infrastructure and are prioritising this method to bring services back online. They are working through the list as fast as possible.

Those VMs which are not able to be restored in this way will need to be restored from backups. If you have already requested a restore from backup, we will first attempt to recover your data from the storage cluster as this has proven to be the best approach.

We have scheduled System Administrators to be available to work on this continually until this issue is resolved.

We continue to wait for replacement hardware to arrive from the storage vendor, which we hope will also assist in the recovery of data. This is most likely to happen on Tuesday.

Our next scheduled status update will likely occur no earlier than Monday 09:00 AEDT (Sydney time).
Posted Nov 16, 2019 - 21:43 AEDT
Update
The storage vendor has not been able to recover the second node with the spare hardware we have onsite. A shipment of spare parts has been organised; however, it is coming from interstate and will arrive on Monday morning at the earliest.

Once the hardware arrives, we will perform basic verification and then work with the storage vendor engineers to develop a plan to introduce and further verify the new hardware.

In the event that the Storage Cluster recovers correctly, we expect to be able to start verifying data by Tuesday morning at the earliest. Our engineers are already working on extracting available data from volumes on the Storage Cluster.

In the meantime, alternative infrastructure is ready to go for customers with backups available. Please contact our support team if you would like to be restored from the latest backup available.

Please note that after restoring a service from backups we will not be able to restore any data for the service from the Storage Cluster when it is back online.

When requesting a restore, please contact our support team at support@crucial.com.au, so that our System Administrators can work on the restore as quickly as possible.

We have scheduled System Administrators to be available to work on this continually until this issue is resolved.

As we do not expect the situation to change significantly in the next 40 hours, our next status update is expected to occur no earlier than Monday 09:00 AEDT (Sydney time).

In the event there is a significant change we will update here accordingly.
Posted Nov 16, 2019 - 17:16 AEDT
Update
The storage vendor has not been able to recover the second node with on-site spares. A shipment of spare parts has been organised, but it is coming from interstate and will arrive on Monday morning at the earliest.

We are also discussing other options with the storage vendor to see if any further action can be taken with the hardware we have on hand.

In the meantime, alternative infrastructure is ready to go for customers with backup services. Please contact our support team if you would like to be restored from the latest available backup as soon as possible.

When requesting a restore, please contact our support team at support@crucial.com.au, so that our System Administrators can work on the restore as quickly as possible.

Please note that after restoring a service from backups we will not be able to restore any data for the service from the Storage Cluster when it is back online.

We will post a status update at 17:00 AEDT (Sydney Time), or when there is a significant change in the situation.
Posted Nov 16, 2019 - 13:11 AEDT
Update
We are continuing to work closely with the storage vendor, and have been carefully executing a recovery procedure. This is taking longer than anticipated because we are trying to recover as much data as possible.

We are still looking at options to source replacement hardware as we have exhausted our spares for a particular hardware component.

At this stage we are still not able to provide an ETA on Virtual Machine Recovery.

We will post a status update at 14:00 AEDT (Sydney Time), or when there is a significant change in the situation.
Posted Nov 16, 2019 - 11:40 AEDT
Update
We are continuing to work closely with the storage vendor, and have been carefully executing a recovery procedure.

This is taking longer than anticipated because we are trying to recover as much data as possible.

At this stage we are still not able to provide an ETA on Virtual Machine Recovery.

We will post a status update at 11:00 AEDT (Sydney Time), or when there is a significant change in the situation.
Posted Nov 16, 2019 - 09:37 AEDT
Update
The storage vendor's engineers have provided a recovery procedure, which we are carefully executing.
We will provide further updates on this process and an ETA on virtual machine recovery at 9:30 AEDT (Sydney time), or when there is a significant change in the situation.
Posted Nov 16, 2019 - 07:49 AEDT
Update
The storage vendor's engineers are testing the recovery procedure in their U.S. test lab to find the safest way to recover the storage cluster with minimal data loss.

This process is taking longer than anticipated.

We expect further news at 07:00 AEDT (Sydney time), and will update the virtual machine recovery ETA at that time.

We will post a status update at 07:00, or when there is a significant change in the situation.
Posted Nov 16, 2019 - 05:12 AEDT
Update
Recovery efforts are continuing.

We still anticipate that we'll be able to start bringing Virtual Machines online between 04:00 and 05:00 AEDT (Sydney time).

We will continue to post further status updates here every 3 hours, or when there is a significant change in the situation.
Posted Nov 16, 2019 - 03:07 AEDT
Update
We have suffered a two-node failure in our storage cluster. Although the cluster is designed to handle a multi-node failure, the automated failover and recovery process has not worked as expected and has required manual intervention, including repairing the failed nodes.

We have been actively working with the storage vendor to restore the storage cluster. Progress has been slow and careful, as the priority is to minimise data loss.

So far, one node has been repaired and reintroduced to the cluster, and at this stage we've been able to recover almost all volumes. The storage vendor's staff have been working with their engineering team to bring the remaining volumes online.

To minimise the risk of any unrecoverable volumes, the storage vendor has requested that we do not bring active workloads onto the storage cluster until their senior engineering team gives the go-ahead.

We anticipate that we'll be able to start bringing Virtual Machines online between 04:00 and 05:00 AEDT (Sydney time).

We will continue to post further status updates here every 3 hours, or when there is a significant change in the situation.
Posted Nov 15, 2019 - 22:30 AEDT
Update
Recovery efforts are continuing.

We are continuing to work with the storage vendor to recover the cluster. Engineers are currently performing verification procedures. We are unable to provide an ETA at this stage.

Alternate infrastructure has been provisioned and we are planning a process to allow the option to recover customer services onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here every 3 hours, or when there is a significant change in the situation.
Posted Nov 15, 2019 - 19:44 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster. We have made some progress, bringing one of the failed storage members back online. Engineers are currently performing verification procedures. We are unable to provide an ETA at this stage.

Alternate infrastructure has been provisioned and we are planning a process to allow the option to recover customer services onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 18:36 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster, and are executing an action plan provided by the vendor.

Alternate infrastructure has been provisioned and we are planning a process to allow the option to recover customer services onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 17:31 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster, and are executing an action plan provided by the vendor.

Alternate infrastructure has been provisioned and we are still verifying a process to allow the option to recover customer services onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 16:09 AEDT
Update
Recovery efforts are still continuing.

Progress has been made on the recovery of the cluster. We are continuing to work with the storage vendor to recover the cluster, and are executing an action plan provided by the vendor.

Alternate infrastructure has been provisioned and we are verifying a process to allow the option to recover customer services onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 15:09 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster, and are still executing an action plan provided by the vendor. We are still unable to provide an ETA on this process at this stage.

Alternate infrastructure has been provisioned and configuration is underway to allow us the option to recover customer VPS onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 14:00 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster, and are still executing an action plan provided by the vendor. We are still unable to provide an ETA on this process at this stage.

Alternate infrastructure has been provisioned and configuration is underway to allow us to recover customer VPS onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 13:00 AEDT
Update
Recovery efforts are still continuing.

We are continuing to work with the storage vendor to recover the cluster, and are currently executing an action plan provided by the vendor. We are still unable to provide an ETA on this process.

Alternate infrastructure has been provisioned and configuration is underway to allow us to recover customer VPS onto new hardware as a contingency plan.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 12:04 AEDT
Update
Recovery efforts continue.

We are continuing to work with the storage vendor to recover the cluster.

As a contingency plan, alternate infrastructure is still being stood up so that customer VPS can be recovered onto it.

We have been and will continue to work on resolving this issue with the highest priority. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 11:19 AEDT
Update
Recovery efforts continue.

We are continuing to work with the storage vendor to recover the cluster.

As a contingency plan, alternate infrastructure is still being stood up so that customer VPS can be recovered onto it.

We have been and will continue to work on resolving this issue with the highest priority.
Posted Nov 15, 2019 - 10:06 AEDT
Update
Recovery efforts continue.

We are continuing to work with the storage vendor to recover the cluster.

As a contingency plan, alternate infrastructure is still being stood up so that customer VPS can be recovered onto it.

We have been and will continue to work on resolving this issue with the highest priority.
Posted Nov 15, 2019 - 08:14 AEDT
Update
Recovery efforts continue.

As a contingency plan, alternate infrastructure is still being stood up so that customer VPS can be recovered onto it.

We are continuing to work with the storage vendor to recover the cluster; progress has been made, but great care is being taken to avoid any data loss.
Posted Nov 15, 2019 - 06:20 AEDT
Update
Recovery efforts continue.

As a contingency plan, alternate infrastructure is being stood up so that customer VPS can be recovered onto it.

We are also continuing to work with the storage vendor to recover the cluster.
Posted Nov 15, 2019 - 03:43 AEDT
Update
Yesterday afternoon, at approximately 18:00 Sydney time, we experienced a hardware failure within the storage infrastructure that supports the Blaze platform.

Recovery efforts were started immediately and have been ongoing since.

Alternate storage and hypervisor infrastructure is being built in parallel as a contingency while we engage the storage vendor to recover the cluster.

Some virtual machines will require bare-metal restores, while others can be moved to the new infrastructure.
Posted Nov 15, 2019 - 01:42 AEDT
Update
We are continuing to work on our recovery efforts for this issue. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 01:19 AEDT
Update
We are continuing to work on our recovery efforts for this issue. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 15, 2019 - 00:21 AEDT
Update
We are continuing to work on our recovery efforts for this issue. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 14, 2019 - 23:25 AEDT
Update
We are continuing to make progress on our recovery efforts for this issue. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 14, 2019 - 22:20 AEDT
Update
We are making progress on our recovery efforts for this issue. We will continue to post further status updates here within 60 minutes, or sooner if there is a significant change in the situation.
Posted Nov 14, 2019 - 21:59 AEDT
Update
We are continuing to troubleshoot the storage platform. We will endeavour to post a further status update within the next 30-60 minutes.
Posted Nov 14, 2019 - 21:56 AEDT
Update
We are continuing to troubleshoot the storage platform. We will endeavour to post a further status update within the next 30-60 minutes.
Posted Nov 14, 2019 - 20:47 AEDT
Update
We are continuing to troubleshoot the storage platform. No restoration progress to report as of yet. However, we are slowly but surely working our way through a list of potential outage causes.
Posted Nov 14, 2019 - 19:29 AEDT
Update
Multiple personnel are on site and are manually assessing the health of the storage platform. We are continuing to look for the root cause of the issue, but as of yet have no further progress to report.
Posted Nov 14, 2019 - 18:41 AEDT
Identified
We have identified a major outage affecting the underlying storage layer for the Blaze VPS platform.

An incident response team has been formed and multiple personnel are currently working the problem as quickly as possible.

We will endeavour to post a further status update within the next 30-60 minutes.
Posted Nov 14, 2019 - 18:07 AEDT
Investigating
We are currently investigating an outage affecting multiple VPS hypervisor nodes.

We are working to resolve this with top priority and will provide an update in 60 minutes, or if there is a significant change in the situation.
Posted Nov 14, 2019 - 18:04 AEDT
This incident affected: Blaze Cloud Platform.