On Monday, October 20, 2025, multiple KnowBe4 products, including Defend, KSAT, and PhishER, experienced major service degradation following a regional AWS outage in US-EAST-1.
According to AWS’s official post-event summary, the root cause was a race condition in the control-plane DNS infrastructure for Amazon DynamoDB, which prevented proper DNS resolution for DynamoDB endpoints. This failure cascaded into dependency failures across several AWS services that KnowBe4 relies on to deliver our products, including ECS, Lambda, SQS, and networking components, all of which depend on DynamoDB for state management and metadata coordination.
The disruption led to widespread queue accumulation, container launch failures, and elevated latency across KnowBe4’s systems. Basic platform functionality (web navigation, login, and API access) was restored for most services by 18:15 (UTC), and full service restoration was completed by 22:00 (UTC), with no data loss reported.
The following terms are provided for clarity and reference.
ECS (Elastic Container Service) – An AWS service for hosting containers, the lightweight “servers” that KnowBe4 uses to process website traffic, background jobs, and campaign tasks.
Lambda – AWS’s serverless compute platform that runs small, event-driven functions without needing a full-blown server. KnowBe4 uses this service for short-lived jobs like message classification and workflow automation.
DynamoDB – AWS’s fully managed NoSQL database. Many AWS services (and some KnowBe4 components) depend on it for storing state and metadata used to coordinate system activity.
SQS (Simple Queue Service) – A message-queuing system that temporarily stores jobs or events between services. KnowBe4 uses it to handle spikes in workload safely and ensure reliable event delivery.
Control Plane – The set of management systems that create, update, and coordinate resources (e.g., starting containers or functions). If the control plane fails, new resources can’t launch even if old ones keep running.
DNS (Domain Name System) – The internet’s “address book”, translating names like training.knowbe4.com into network addresses. A DNS failure can prevent services from finding each other inside AWS.
Retry Storm – A sudden surge of repeated attempts to reach a failed service. During outages, these retries can flood systems and slow recovery; see the backoff sketch after this glossary.
Queue Backlog – A buildup of jobs waiting to be processed. Defend, KSAT, and PhishER all accumulated backlogs during the outage.
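The retry-storm and queue-backlog terms above describe the main amplification risk during an outage: every waiting client hammering a recovering dependency at the same moment. The snippet below is a minimal, hypothetical sketch of capped exponential backoff with full jitter, the standard client-side technique for avoiding that pattern; it is illustrative only and is not KnowBe4 client code.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky operation with capped exponential backoff and full jitter.

    Without the growing, randomized delay, every client retries at the same
    moment and the combined traffic becomes a retry storm that slows recovery.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # Full jitter spreads retries out so clients don't stampede together.
            time.sleep(random.uniform(0, delay))

if __name__ == "__main__":
    # Stand-in for a call to an impaired dependency; fails about half the time.
    def flaky():
        if random.random() < 0.5:
            raise ConnectionError("dependency unavailable")
        return "ok"

    print(call_with_backoff(flaky, base_delay=0.2))
```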
At 06:49 (UTC) on October 20, 2025, AWS observed elevated error rates across multiple services within the US-EAST-1 region. By 07:26 (UTC), AWS identified the issue as DNS resolution failures affecting Amazon DynamoDB endpoints, caused by a race condition in the DynamoDB control-plane DNS infrastructure. The fault disrupted DynamoDB’s ability to resolve internal service addresses, leading to widespread dependency failures across AWS services that rely on it for metadata and state management.
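For context on how a failure like this surfaces to an application, the sketch below (Python standard library only; the endpoint name is shown purely for illustration) performs the DNS lookup that precedes any request to a regional DynamoDB endpoint and reports the resolver error a client would have seen while resolution was failing.

```python
import socket

# Illustrative only: the regional DynamoDB endpoint hostname that clients
# in US-EAST-1 would normally resolve before opening a connection.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Step 1 of any HTTPS request: translate the hostname into IP addresses.
    addresses = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    print(f"{ENDPOINT} resolves to {len(addresses)} address(es)")
except socket.gaierror as exc:
    # During the incident this lookup failed, so no request could even be
    # attempted; SDKs surface this as an endpoint/connection error.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```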
AWS mitigated the underlying DNS race condition by 09:24 (UTC), restoring core DynamoDB functionality. However, extensive retry traffic and queued requests from dependent services (including ECS, Lambda, and SQS) continued to overload control-plane capacity for several hours. AWS applied rate-limiting and phased service restarts to stabilize the region and prevent further saturation.
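Rate limiting of this kind is conceptually a token bucket: requests are admitted only while capacity remains, and excess load is shed rather than allowed to pile onto a recovering service. The sketch below is a generic illustration of that idea and says nothing about AWS’s internal implementation.

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter (illustrative, not AWS internals).

    Tokens refill at a fixed rate; each admitted request consumes one token.
    When the bucket is empty, excess load is shed instead of saturating a
    recovering service.
    """

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request; the caller should back off and retry later

# Example: admit at most ~10 requests/second with a burst of 20.
limiter = TokenBucket(rate_per_sec=10, burst=20)
admitted = sum(1 for _ in range(100) if limiter.allow())
print(f"admitted {admitted} of 100 immediate requests")
```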
Because many of KnowBe4’s workloads depend on ECS and Lambda orchestration, the outage manifested as container and task launch failures across Defend, KSAT, and PhishER. Existing workloads largely remained functional but were unable to scale or replace failed tasks. As AWS restored regional control-plane stability, KnowBe4 systems began recovering, with basic functionality (web navigation, login, API access) available again by 18:15 (UTC) and full service restoration completed by 22:00 (UTC), with no data loss reported.
During the AWS service disruption, KnowBe4’s phishing campaign mail senders were affected by the same control-plane limitations that prevented new ECS task launches.
As a result of the temporary alternate delivery configuration described later in this report, some recipients may have experienced false-positive click activity or mail-filtering inconsistencies. The issue was identified and resolved the day after the primary outage.
Customers are advised not to whitelist the backup IP and to contact support if they notice discrepancies in campaign metrics.
The service disruptions on October 20, 2025, originated from a race condition in the control-plane DNS infrastructure for Amazon DynamoDB in the US-EAST-1 region. The fault prevented proper DNS resolution for DynamoDB endpoints and disrupted multiple dependent AWS services, most notably ECS, Lambda, and SQS, which rely on DynamoDB for metadata and state management.
As a result, new container and function launches failed across US-EAST-1, while existing workloads generally remained operational but could not scale or replace failed tasks. AWS mitigated the core DNS race condition by 09:24 (UTC), but cascading retry storms and queued requests from dependent services prolonged instability through the morning. AWS applied rate-limiting and phased restarts to gradually restore control-plane capacity.
For KnowBe4, this manifested differently across products and regions:
The underlying cause of the incident was entirely within AWS’s managed infrastructure. AWS has since implemented changes to the DynamoDB DNS subsystem to remove the race condition and improve isolation of control-plane workloads.
On KnowBe4’s side, the event exposed dependencies between AWS regional metrics, DNS health checks, and traffic routing logic that can complicate failover when control-plane telemetry becomes unavailable. Internal mitigations have been applied to improve observability, region-aware failover logic, and operational readiness during third-party service disruptions.
| Time (UTC) | Event |
|---|---|
| 06:49 | AWS observes elevated error rates in US-EAST-1 as the DynamoDB control-plane DNS race condition takes effect; service failures begin in Defend, KSAT, and PhishER. |
| 07:30 | Defend US and UK regions detect DNS health-check anomalies; automatic failover partially executes. AWS metrics unavailable. |
| 08:30 - 09:30 | Engineers identify and remove a legacy DNS record after AWS console access is restored; Microsoft retries failed SMTP deliveries. |
| 09:24 | AWS mitigates the underlying DynamoDB DNS issue but dependent services (ECS, Lambda, SQS, ECR, ENI) remain impaired due to retry traffic and backlogs. |
| 10:00 | Defend US routing fully shifts to US-WEST-2; residual queued messages remain in US-EAST-1 awaiting processing. |
| 11:56 | All queued Defend US messages processed; Defend UK and US regions fully stable. |
| 18:00 - 19:00 | AWS control-plane stability largely restored; ECS/Lambda task launches resume across KSAT and PhishER; backlogged jobs begin processing normally. |
| 20:55 | PhishER message-flow normalization completed; PhishER reports healthy. |
| 21:45 | KSAT queues are fully processed; all campaigns and enrollments complete successfully; KSAT reports healthy. |
| 22:00 | All KnowBe4 services confirmed operational across regions; full recovery achieved with no data loss. |
A race condition in the DynamoDB control-plane DNS infrastructure in US-EAST-1 caused widespread dependency failures across AWS services (ECS, Lambda, SQS, ECR, and ENI provisioning). The failure prevented new resource launches, degraded visibility into AWS metrics, and disrupted hosted applications globally.
Mitigations (AWS):
Mitigations (KnowBe4):
During the outage, AWS metric unavailability caused DNS health checks for Defend to fail open. This resulted in incomplete automatic failover between US-EAST-1 and US-WEST-2. In the UK region, a legacy DNS record was incorrectly evaluated as healthy, directing SMTP traffic to inactive endpoints.
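The fail-open behavior is easiest to see side by side with its fail-closed counterpart. The sketch below uses hypothetical helper and parameter names (it is not Defend’s actual health-check logic): when the metrics needed to evaluate an endpoint are missing, a fail-open policy keeps routing traffic to it, while a fail-closed policy treats it as unhealthy and forces failover.

```python
from typing import Optional

def endpoint_is_healthy(error_rate: Optional[float], fail_open: bool,
                        threshold: float = 0.05) -> bool:
    """Decide whether to keep routing traffic to an endpoint.

    error_rate is None when the underlying metrics are unavailable, which is
    the situation that arose when AWS telemetry went dark during the outage.
    """
    if error_rate is None:
        # Fail open: assume healthy with no data (keeps stale/legacy records live).
        # Fail closed: assume unhealthy with no data (forces failover/escalation).
        return fail_open
    return error_rate < threshold

# With metrics missing, the two policies route traffic differently.
print(endpoint_is_healthy(None, fail_open=True))    # True  -> traffic keeps flowing
print(endpoint_is_healthy(None, fail_open=False))   # False -> failover is triggered
print(endpoint_is_healthy(0.12, fail_open=True))    # False -> unhealthy either way
```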
Mitigations:
Outage conditions caused millions of delayed jobs across KSAT and PhishER queues. ECS and Lambda provisioning failures prevented new workers from being launched, and retry behavior risked additional load on recovering infrastructure.
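One way to picture the job-aware queue scaling referenced among the preventive measures: read the visible backlog, size the worker fleet proportionally, and cap the result so launch requests cannot flood a recovering control plane. The boto3 sketch below is hypothetical; the queue URL, cluster and service names, and scaling ratios are placeholders, not KnowBe4’s production configuration.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
ecs = boto3.client("ecs", region_name="us-east-1")

# Placeholders: substitute real resource names in an actual deployment.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-jobs"
CLUSTER, SERVICE = "example-cluster", "example-worker-service"
JOBS_PER_WORKER = 500      # target backlog each worker should drain
MAX_WORKERS = 40           # hard cap so launches never flood a recovering control plane
MIN_WORKERS = 2

def desired_worker_count() -> int:
    """Size the worker fleet from the visible queue backlog, within safe bounds."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    # Ceiling division, clamped between the minimum and maximum fleet sizes.
    return max(MIN_WORKERS, min(MAX_WORKERS, -(-backlog // JOBS_PER_WORKER)))

def scale_workers() -> None:
    """Ask ECS for the computed number of worker tasks."""
    ecs.update_service(cluster=CLUSTER, service=SERVICE,
                       desiredCount=desired_worker_count())

if __name__ == "__main__":
    scale_workers()
```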
Mitigations:
To restore phishing-campaign delivery during the outage, a temporary alternate network configuration was used, resulting in approximately 80,000 messages being sent from an unlisted IP address (34.199.159.209). This created the potential for false-positive clicks and mail-filtering inconsistencies.
Mitigations:
The service disruptions on October 20, 2025, demonstrated how a single upstream control-plane failure within AWS can cascade across dependent systems and regions. The race condition in the DynamoDB DNS infrastructure in US-EAST-1 triggered widespread instability in services fundamental to KnowBe4’s hosting architecture (particularly ECS, Lambda, and SQS), which in turn affected Defend, KSAT, and PhishER.
Despite the scale of the AWS outage, KnowBe4’s platforms remained data-intact, and core functionality recovered within hours of AWS mitigation. Automatic failover mechanisms limited customer impact in the US region, while targeted engineering intervention restored full operation in the UK region once AWS console access returned. Queue-based systems such as KSAT and PhishER processed all backlogged jobs successfully, and no data loss occurred.
The incident validated our systems’ recovery processes following severe third-party degradation while also revealing opportunities to improve scaling control and retry behavior during upstream cloud instability. The preventive measures arising from this incident (enhanced failover logic, automated DNS validation, job-aware queue scaling, and mail infrastructure hardening) directly address those gaps and strengthen our ability to withstand future regional cloud events.
KnowBe4 continues to monitor AWS’s remediation efforts related to this incident and will integrate any further best practices as they are published. Through these combined improvements, we are better positioned to maintain continuity, minimize customer impact, and ensure operational transparency during large-scale cloud disruptions.