Resolved -
Our investigation determined that the incident was caused by IOPS saturation on a small number of compute nodes within one node group in the us-east-1 region. This saturation affected a subset of our infrastructure and contributed to intermittent downstream service errors for a small number of customers in that region.
As part of remediation, we will increase the provisioned IOPS capacity for the impacted node group to provide additional I/O headroom and improve overall resilience. We are also working closely with AWS to further analyse the underlying increase in disk usage and to validate long-term corrective measures.
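For illustration only, the sketch below shows one way such an IOPS increase can be applied to a gp3 EBS volume and tracked online using boto3; the volume ID, target IOPS value, and region are hypothetical placeholders, not the actual remediation parameters:

```python
# Illustrative sketch only. The volume ID, target IOPS, and region are
# hypothetical placeholders, not the actual remediation values.
import boto3

REGION = "us-east-1"
VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical impacted volume
TARGET_IOPS = 10000                  # hypothetical new provisioned IOPS

ec2 = boto3.client("ec2", region_name=REGION)

# Raise provisioned IOPS on a gp3 volume; this is an online operation
# and does not require detaching the volume.
ec2.modify_volume(VolumeId=VOLUME_ID, VolumeType="gp3", Iops=TARGET_IOPS)

# Track the modification until it reaches the "optimizing" or "completed" state.
resp = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
for mod in resp["VolumesModifications"]:
    print(mod["VolumeId"], mod["ModificationState"], mod.get("TargetIops"))
```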
Services are currently stable. We are continuing to monitor closely to ensure sustained reliability.
We apologise for the inconvenience and will follow up with a detailed root cause analysis.
Feb 19, 20:42 UTC
Update -
All services are currently operational. Some customers may still experience intermittent slowness while remediation efforts continue.
We are actively working with AWS to address the underlying infrastructure instability, and our internal engineering teams are simultaneously reviewing application functionality to ensure stability and performance. We remain focused on driving the environment to full recovery and are closely monitoring service health.
Further updates will be shared as progress continues.
Feb 19, 17:37 UTC
Update -
We continue to observe intermittent instability related to the underlying cloud infrastructure.
We are actively working with the cloud provider to investigate the cause of this intermittent behaviour and to implement corrective measures. Our teams remain engaged and are closely monitoring the environment.
We will provide further updates as more information becomes available.
Feb 19, 15:22 UTC
Monitoring -
Services have fully recovered, and we are no longer observing customer impact.
The issue was related to underlying infrastructure instability within our cloud environment, which led to downstream service disruption. We are actively working with our cloud provider to complete the root cause analysis and ensure preventive measures are implemented.
Feb 19, 14:48 UTC
Identified -
We are currently investigating a recurrence of the issue affecting Alation services. A downstream service is intermittently returning HTTP 500 errors, resulting in service failures and renewed impact.
Our team is actively analysing the issue and working to mitigate the impact. We will provide further updates as additional information becomes available.
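As an illustration of how this kind of intermittent failure can be detected externally, the sketch below polls a health endpoint and counts HTTP 500 responses over a rolling window; the endpoint URL, probe interval, and window size are hypothetical placeholders:

```python
# Minimal external probe for intermittent HTTP 500s.
# The endpoint URL, interval, and window size are hypothetical.
import time
import requests

ENDPOINT = "https://example.alationcloud.com/health"  # hypothetical endpoint
INTERVAL_S = 10   # seconds between probes
WINDOW = 30       # number of recent probes to keep

results = []
while True:
    try:
        status = requests.get(ENDPOINT, timeout=5).status_code
    except requests.RequestException:
        status = None  # treat network errors the same as failures
    results.append(status)
    results = results[-WINDOW:]
    failures = sum(1 for s in results if s is None or s >= 500)
    print(f"last {len(results)} probes: {failures} failures (latest={status})")
    time.sleep(INTERVAL_S)
```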
Feb 19, 13:49 UTC
Monitoring -
The issue originating from the hosted control plane has been mitigated. All impacted Alation services have been fully restored and are operating normally.
We will continue to monitor system stability and will share a detailed RCA once the investigation is complete.
Feb 19, 11:51 UTC
Identified -
We are actively investigating the incident; preliminary analysis suggests a fault within the hosted control plane that caused downstream impact across multiple Alation services. Service recovery is now underway and metrics indicate stabilisation.
Feb 19, 11:39 UTC
Investigating -
We are investigating reports of degraded performance affecting the Alation service. We will provide an update as soon as more information is available.
Feb 19, 11:01 UTC