KnowBe4 Service Operational Issues

Incident Report for KnowBe4

Postmortem

External Technical Root Cause Analysis: Service Disruptions, October 20, 2025

Summary

On Monday, October 20, 2025, multiple KnowBe4 products, including Defend, KSAT, and PhishER, experienced major service degradation following a regional AWS outage in US-EAST-1.

According to AWS’s official post-event summary, the root cause was a race condition in the control-plane DNS infrastructure for Amazon DynamoDB, which prevented proper DNS resolution for DynamoDB endpoints. This failure caused a cascade of dependency failures across several AWS services that KnowBe4 leverages for delivery of our products, including ECS, Lambda, SQS, and networking components, all of which depend on DynamoDB for state management and metadata coordination.

The disruption led to widespread queue accumulation, container launch failures, and elevated latency across KnowBe4’s systems. Basic platform functionality (web navigation, login, and API access) was restored for most services by 18:15 (UTC), and full service restoration was completed by 22:00 (UTC), with no data loss reported.

Glossary of Technical Terms

The following terms are provided for clarity and reference.

ECS (Elastic Container Service) – An AWS service for hosting containers, the lightweight “servers” that KnowBe4 uses to process website traffic, background jobs, and campaign tasks.

Lambda – AWS’s serverless compute platform that runs small, event-driven functions without needing a full-blown server. KnowBe4 uses this service for short-lived jobs like message classification and workflow automation.

DynamoDB – AWS’s fully managed NoSQL database. Many AWS services (and some KnowBe4 components) depend on it for storing state and metadata used to coordinate system activity.

SQS (Simple Queue Service) – A message-queuing system that temporarily stores jobs or events between services. KnowBe4 uses it to handle spikes in workload safely and ensure reliable event delivery.

Control Plane – The set of management systems that create, update, and coordinate resources (e.g., starting containers or functions). If the control plane fails, new resources can’t launch even if old ones keep running.

DNS (Domain Name System) – The internet’s “address book”, translating names like training.knowbe4.com into network addresses. A DNS failure can prevent services from finding each other inside AWS.

Retry Storm – A sudden surge of repeated attempts to reach a failed service. During outages, these retries can flood systems and make recovery slower.

Queue Backlog – A buildup of jobs waiting to be processed. This happened in Defend, KSAT, and PhishER during the outage.

WHAT HAPPENED

AWS Infrastructure Outage

At 06:49 (UTC) on October 20, 2025, AWS observed elevated error rates across multiple services within the US-EAST-1 region. By 07:26 (UTC), AWS identified the issue as DNS resolution failures affecting Amazon DynamoDB endpoints, caused by a race condition in the DynamoDB control-plane DNS infrastructure. The fault disrupted DynamoDB’s ability to resolve internal service addresses, leading to widespread dependency failures across AWS services that rely on it for metadata and state management.

AWS mitigated the underlying DNS race condition by 09:24 (UTC), restoring core DynamoDB functionality. However, extensive retry traffic and queued requests from dependent services (including ECS, Lambda, and SQS) continued to overload control-plane capacity for several hours. AWS applied rate-limiting and phased service restarts to stabilize the region and prevent further saturation.

Because many of KnowBe4’s workloads depend on ECS and Lambda orchestration, the outage manifested as container and task launch failures across Defend, KSAT, and PhishER. Existing workloads largely remained functional but were unable to scale or replace failed tasks. As AWS restored regional control-plane stability, KnowBe4 systems began recovering, with basic functionality (web navigation, login, API access) available again by 18:15 (UTC) and full service restoration completed by 22:00 (UTC), with no data loss reported.

Platform-Level Effects

Defend

US Region

  • At 07:30 (UTC), Defend received alerts indicating service degradation. DNS health checks failed open due to unavailable AWS metrics affecting both US-EAST-1 and US-WEST-2. Although the automatic failover mechanism triggered as designed, the absence of health-check metrics caused traffic to remain evenly split between regions rather than fully shifting west (a simplified illustration of this fail-open behavior follows this list).
  • By 10:00 (UTC), AWS metrics had recovered, allowing new traffic to route exclusively to US-WEST-2. A small number of messages that had already passed the load balancer in US-EAST-1 were temporarily queued and unable to complete processing while the regional AWS issues persisted.
  • All queued messages were successfully processed by 11:56 (UTC), and message throughput returned to normal. A limited subset of customers experienced delayed message processing, while the majority were unaffected because automatic failover partially succeeded.
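
For readers unfamiliar with “fail open” behavior, the hypothetical sketch below (not Defend’s actual routing code; the region names, error-rate threshold, and weighting logic are illustrative assumptions) shows how a health evaluator that defaults to “healthy” when its metric source is unreachable can leave a weighted routing policy split across two regions instead of shifting traffic entirely to the healthy one.

    # Hypothetical illustration of "fail-open" health evaluation during a metrics outage.
    # Region names, the error-rate threshold, and the weighting logic are illustrative,
    # not KnowBe4's actual routing configuration.
    from typing import Optional

    def region_is_healthy(error_rate: Optional[float], threshold: float = 0.05) -> bool:
        """Return True if the region should receive traffic.

        When the metric source is unavailable (error_rate is None), the check
        "fails open" and reports the region as healthy rather than draining
        traffic on the basis of missing data.
        """
        if error_rate is None:      # AWS metrics unavailable during the outage
            return True             # fail open: assume healthy
        return error_rate < threshold

    def route_weights(east_error: Optional[float], west_error: Optional[float]) -> dict:
        """Split traffic evenly across all regions that evaluate as healthy."""
        regions = (("US-EAST-1", east_error), ("US-WEST-2", west_error))
        healthy = [name for name, err in regions if region_is_healthy(err)]
        if not healthy:             # nothing looks healthy: keep serving from everywhere
            healthy = [name for name, _ in regions]
        return {name: 1.0 / len(healthy) for name in healthy}

    # With metrics missing for both regions, both "fail open" and traffic stays split 50/50:
    print(route_weights(None, None))   # {'US-EAST-1': 0.5, 'US-WEST-2': 0.5}
    # Once metrics return and US-EAST-1 shows elevated errors, routing shifts west:
    print(route_weights(0.40, 0.01))   # {'US-WEST-2': 1.0}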

UK Region

  • At 07:30 (UTC), AWS metric outages caused DNS health checks in the UK region to fail open. Engineering teams identified a legacy DNS record that was incorrectly evaluated as the healthiest endpoint, returning inactive IP addresses to clients. These addresses no longer pointed to live infrastructure, causing SMTP connection failures across all mail traffic.
  • Once AWS restored console access around 09:30 (UTC), engineers deployed a configuration change to remove the legacy record. Microsoft’s mail infrastructure automatically retried previously failed deliveries between 09:30 and 10:30 (UTC), and traffic volumes had normalized by 10:30.
  • All SMTP customers in the UK region experienced temporary service disruption, while customers using Microsoft Graph API for mail delivery were unaffected.

KSAT

  • KSAT web services, campaign dispatchers, and enrollment workers failed to provision new tasks from 06:50 through 18:00 (UTC), resulting in delayed campaign launches, notification emails, and user enrollments.
  • KSAT APIs and the web console experienced intermittent timeouts and latency until the DynamoDB and ECS control-plane operations recovered.
  • Recovery began around 18:00 (UTC), with basic product functionality restored by 18:15 (UTC) and backlog queues clearing completely around 21:45 (UTC).
  • All training and phishing campaigns resumed automatically, with no queued data lost.

PhishER

  • PhishER’s email-categorization and YARA rule-evaluation workers, which rely on Lambda and SQS orchestration, began failing at 06:50 (UTC), producing ingestion and processing backlogs.
  • Message classification latency increased significantly throughout the morning as Lambda provisioning errors persisted.
  • AWS service recovery between 18:00 and 19:00 (UTC) allowed task provisioning to resume and queues to drain.
  • Full message flow was restored by 20:55 (UTC), with zero data loss.

Secondary Issue: Email Delivery

  • During the AWS service disruption, KnowBe4’s phishing campaign mail senders were affected by the same control-plane limitations that prevented new ECS task launches.

    • To restore outbound campaign capacity, the mail service was temporarily reconfigured to operate using an alternate networking path that bypassed normal IP-restriction controls.
    • On October 20, from 18:06 UTC until October 21 at 16:33 UTC, approximately 80,000 phishing simulation emails were delivered from a KnowBe4-owned backup IP address (34.199.159.209) that is not published on the official allowlist for phishing mail delivery. This IP does appear in other internal documentation (such as the Syslog whitelist) but is not intended for use with phishing campaigns.
  • As a result, some recipients may have experienced false-positive click activity or mail filtering inconsistencies. The issue was identified and resolved the day after the primary outage.

  • Customers are advised not to add the backup IP to their allowlists and to contact support if they observe discrepancies in campaign metrics.

ROOT CAUSE ANALYSIS

The service disruptions on October 20, 2025 originated from a race condition in the control-plane DNS infrastructure for Amazon DynamoDB in the US-EAST-1 region, preventing proper DNS resolution for DynamoDB endpoints and disrupting multiple dependent AWS services — most notably ECS, Lambda, and SQS — that rely on DynamoDB for metadata and state management.

As a result, new container and function launches failed across US-EAST-1, while existing workloads generally remained operational but could not scale or replace failed tasks. AWS mitigated the core DNS race condition by 09:24 (UTC), but cascading retry storms and queued requests from dependent services prolonged instability through the morning. AWS applied rate-limiting and phased restarts to gradually restore control-plane capacity.

For KnowBe4, this manifested differently across products and regions:

  • In Defend (US region), DNS health checks failed open due to unavailable AWS service metrics, causing the global traffic manager to split requests roughly 50/50 between US-EAST-1 and US-WEST-2 instead of fully failing over. The automatic failover mechanism triggered as designed, but missing AWS metrics prevented it from completing cleanly. Once AWS metrics returned (approximately 10:00 UTC), traffic shifted fully to US-WEST-2, and the remaining messages queued in US-EAST-1 were processed by 11:56 (UTC).
  • In Defend (UK region), AWS console and metric unavailability led to a legacy DNS record being erroneously evaluated as healthy. This caused mail traffic to resolve to inactive IPs, resulting in SMTP connection failures from 07:30 to 10:30 (UTC). Once AWS restored console access, engineers removed the legacy record, and Microsoft retried failed deliveries automatically.
  • In KSAT and PhishER, failures in ECS, Lambda, and SQS prevented new task provisioning from 06:50 (UTC) onward, leading to significant backlog accumulation. Basic functionality across most products returned by 18:15 (UTC), and all queued workloads cleared successfully by 22:00 (UTC), with no data loss.

The underlying cause of the incident was entirely within AWS’s managed infrastructure. AWS has since implemented changes to the DynamoDB DNS subsystem to remove the race condition and improve isolation of control-plane workloads.

On KnowBe4’s side, the event exposed dependencies between AWS regional metrics, DNS health checks, and traffic routing logic that can complicate failover when control-plane telemetry becomes unavailable. Internal mitigations have been applied to improve observability, region-aware failover logic, and operational readiness during third-party service disruptions.

Detailed Timeline (UTC)

Time (UTC) – Event
06:49 – AWS identifies elevated error rates in US-EAST-1; the DynamoDB control-plane DNS race condition begins, triggering service failures in Defend, KSAT, and PhishER.
07:30 – Defend US and UK regions detect DNS health-check anomalies; automatic failover partially executes. AWS metrics remain unavailable.
08:30 - 09:30 – Engineers identify the legacy UK DNS record; once AWS console access is restored (around 09:30), a configuration change removing it is deployed. Microsoft begins retrying failed SMTP deliveries.
09:24 – AWS mitigates the underlying DynamoDB DNS issue, but dependent services (ECS, Lambda, SQS, ECR, ENI) remain impaired due to retry traffic and backlogs.
10:00 – Defend US routing shifts fully to US-WEST-2; residual queued messages remain in US-EAST-1 awaiting processing.
11:56 – All queued Defend US messages are processed; Defend US and UK regions fully stable.
18:00 - 19:00 – AWS control-plane stability largely restored; ECS/Lambda task launches resume across KSAT and PhishER; backlogged jobs begin processing normally.
20:55 – PhishER message flow fully restored; PhishER confirmed healthy.
21:45 – KSAT queues fully processed; all campaigns and enrollments complete successfully; KSAT confirmed healthy.
22:00 – All KnowBe4 services confirmed operational across regions; full recovery achieved with no data loss.

FINDINGS AND MITIGATIONS

1. AWS Control-Plane Instability

A race condition in the DynamoDB control-plane DNS infrastructure in US-EAST-1 caused widespread dependency failures across AWS services (ECS, Lambda, SQS, ECR, ENI). The failure prevented new resource launches, degraded AWS metrics visibility, and disrupted hosted applications globally.

Mitigations (AWS):

  • AWS resolved the DNS race condition by 09:24 (UTC) and implemented architectural changes to eliminate single-threaded DNS dependencies in DynamoDB’s control plane.
  • Introduced improved isolation boundaries between control-plane workloads and enhanced retry and back-pressure logic to prevent saturation from cascading retry traffic.
  • Committed to improved monitoring and internal alerting for cross-service dependency failures.

Mitigations (KnowBe4):

  • Adjusted internal service scaling and SQS polling intervals to avoid excessive retries during third-party degradation (a simplified polling sketch follows this list).
  • Updated dependency tracking to recognize AWS service health events earlier in our observability stack.
  • Enhanced escalation procedures for AWS-originating regional incidents.
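
As a rough sketch of the polling adjustment above, the loop below uses SQS long polling and widens its polling interval when receive calls fail, rather than retrying a degraded endpoint aggressively. The queue URL, timing values, and boto3-based structure are illustrative assumptions, not KnowBe4 production code.

    # Hypothetical SQS consumer that backs off while the upstream service is degraded.
    import time
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

    def handle(msg: dict) -> None:
        """Placeholder for application-specific message processing."""
        print("processing", msg.get("MessageId"))

    def poll_forever(max_backoff: float = 300.0) -> None:
        backoff = 1.0
        while True:
            try:
                resp = sqs.receive_message(
                    QueueUrl=QUEUE_URL,
                    MaxNumberOfMessages=10,
                    WaitTimeSeconds=20,      # long polling reduces empty receives
                )
            except (ClientError, EndpointConnectionError):
                # Upstream is degraded: widen the polling interval instead of retrying hot.
                time.sleep(backoff)
                backoff = min(backoff * 2, max_backoff)
                continue

            backoff = 1.0                    # a healthy response resets the backoff
            for msg in resp.get("Messages", []):
                handle(msg)
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # poll_forever()  # would run indefinitely; shown here only for illustration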

2. Defend Regional Failover and DNS Dependencies

During the outage, AWS metric unavailability caused DNS health checks for Defend to fail open. This resulted in incomplete automatic failover between US-EAST-1 and US-WEST-2, and in the UK region, a legacy DNS record was incorrectly evaluated as healthy, directing SMTP traffic to inactive endpoints.

Mitigations:

  • Updated failover logic to include fallback routing when AWS health metrics are unavailable.
  • Implemented automated alerts for DNS records that reference inactive infrastructure (see the sketch after this list).
  • Strengthened the DNS validation process to prevent reintroduction of deprecated records.
  • Verified that Graph-based mail routing remains isolated from DNS-based health metrics.
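
One possible shape for the automated DNS-record alerting mentioned above is sketched below: resolve each managed record and flag any answer that is not backed by live infrastructure. The record names, the live-address inventory, and the alerting hook are placeholders, not actual KnowBe4 configuration.

    # Hypothetical check that flags DNS records resolving to addresses no longer
    # backed by live infrastructure. Record names and the live-address inventory
    # are placeholders, not actual KnowBe4 configuration.
    import socket

    MANAGED_RECORDS = ["mail.eu.example-defend.test", "mail.us.example-defend.test"]
    LIVE_ADDRESSES = {"203.0.113.10", "203.0.113.11", "198.51.100.20"}  # from an inventory source

    def resolve_all(name: str) -> set:
        try:
            return {info[4][0] for info in socket.getaddrinfo(name, None, family=socket.AF_INET)}
        except socket.gaierror:
            return set()

    def stale_answers() -> dict:
        """Return, per record, any resolved addresses not present in the live inventory."""
        findings = {}
        for record in MANAGED_RECORDS:
            dead = resolve_all(record) - LIVE_ADDRESSES
            if dead:
                findings[record] = dead
        return findings

    if __name__ == "__main__":
        for record, addresses in stale_answers().items():
            # In practice this would page on-call or open a ticket; here it just prints.
            print(f"ALERT: {record} resolves to inactive addresses: {sorted(addresses)}")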

3. Queue Backlog and Job Prioritization

Outage conditions caused millions of delayed jobs across KSAT and PhishER queues. ECS and Lambda provisioning failures prevented new workers from being launched, and retry behavior risked additional load on recovering infrastructure.

Mitigations:

  • Implemented priority-aware scaling logic to allow controlled queue draining once AWS service capacity resumes (a simplified sketch follows this list).
  • Introduced additional visibility metrics for queue length, processing throughput, and per-region lag.
  • Added safeguards to prevent retry amplification during partial recoveries.
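
The priority-aware draining described above could, in simplified form, look like the following: higher-priority queues are drained first, and a fixed concurrency cap keeps the drain from saturating recovering infrastructure. Queue names, priorities, the cap, and the worker function are illustrative assumptions rather than the production scaler.

    # Hypothetical priority-aware backlog drainer with a bounded worker pool.
    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue

    MAX_CONCURRENT_WORKERS = 8   # cap chosen for controlled recovery, not peak throughput

    def process_job(queue_name: str, job: dict) -> None:
        """Placeholder for the real job handler."""
        print(f"[{queue_name}] processed job {job.get('id')}")

    def drain(queues_by_priority: list) -> None:
        """Drain queues in priority order with bounded concurrency."""
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_WORKERS) as pool:
            for name, backlog in queues_by_priority:     # highest priority first
                futures = []
                while not backlog.empty():
                    futures.append(pool.submit(process_job, name, backlog.get()))
                for future in futures:
                    future.result()                      # finish this tier before the next

    # Example: campaign sends drain before lower-priority reporting refreshes.
    campaign_q, reporting_q = Queue(), Queue()
    for i in range(3):
        campaign_q.put({"id": f"campaign-{i}"})
        reporting_q.put({"id": f"report-{i}"})
    drain([("campaign-sends", campaign_q), ("reporting-refresh", reporting_q)])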

4. Temporary Email Delivery Workaround

To restore phishing-campaign delivery during the outage, a temporary alternate network configuration was used, resulting in approximately 80,000 messages being sent from an unlisted IP address (34.199.159.209). This created potential for false-positive clicks and mail-filtering inconsistencies.

Mitigations:

  • Issue identified and configuration reverted by October 21, 2025, at 16:33 (UTC).
  • Customers notified not to add the backup IP to their allowlists; monitoring added to flag non-approved sender routes (illustrated in the sketch after this list).
  • Engineering has initiated a project to establish pre-approved, resilient failover paths for outbound mail that maintain customer-visible IP consistency.
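
One way the sender-route monitoring noted above could work, shown here purely as an illustration, is to compare the source IPs recorded for outbound simulation mail against the published allowlist and flag any mismatch. The allowlist networks and log format below are synthetic placeholders, not the actual KnowBe4 allowlist.

    # Hypothetical monitor that flags simulation mail sent from a non-approved IP.
    # Allowlist networks and log entries are synthetic placeholders.
    from ipaddress import ip_address, ip_network

    APPROVED_SENDER_NETWORKS = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

    def is_approved(sender_ip: str) -> bool:
        addr = ip_address(sender_ip)
        return any(addr in net for net in APPROVED_SENDER_NETWORKS)

    def audit(send_log: list) -> list:
        """Return log entries whose source IP is not on the approved sender allowlist."""
        return [entry for entry in send_log if not is_approved(entry["source_ip"])]

    # Synthetic example log:
    log = [
        {"message_id": "m-1", "source_ip": "192.0.2.15"},
        {"message_id": "m-2", "source_ip": "203.0.113.99"},   # not on the allowlist
    ]
    for entry in audit(log):
        print(f"ALERT: message {entry['message_id']} sent from non-approved IP {entry['source_ip']}")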

CUSTOMER IMPACT

Defend - US Region

  • A subset of customers experienced delayed message processing between 07:30 and 11:56 (UTC) due to queued traffic in US-EAST-1.
  • Automatic failover to US-WEST-2 executed successfully but operated at a partial 50/50 traffic split until AWS metrics recovered at 10:00 (UTC).
  • The majority of customers experienced no outage, and all queued messages were successfully processed with no loss of telemetry or data.

Defend - UK Region

  • From 07:30 to 10:30 (UTC), customers using SMTP-based mail delivery experienced a complete service interruption caused by DNS mis-routing to inactive infrastructure.
  • Customers using Microsoft Graph API for email delivery were not affected.
  • Microsoft’s automated retry mechanisms successfully delivered all previously failed messages once the DNS configuration was corrected.

KSAT

  • Customers experienced delayed phishing campaign launches, training enrollments, and notification deliveries from 06:50 through 21:45 (UTC).
  • Platform login, dashboard, and API access were intermittently degraded until 18:15 (UTC), when basic functionality was restored.
  • All queued campaigns resumed automatically; no user data or campaign content was lost.

PhishER

  • Customers observed slower email classification, rule evaluation, and workflow execution from 06:50 through 20:55 (UTC), corresponding to Lambda provisioning failures and SQS backlog buildup.
  • All queued messages were processed successfully, with zero data loss or permanent rule failures.

Email Delivery (Secondary Issue)

  • On October 20, from 18:06 (UTC) until October 21 at 16:33 (UTC), approximately 80,000 phishing simulation emails were sent from a KnowBe4-owned backup IP address (34.199.159.209) not listed on the standard allowlist.
  • This temporary measure allowed campaigns to continue sending but introduced a risk of false-positive clicks and mail-filtering inconsistencies for some customers.
  • The configuration was reverted on October 21, and impacted customers were notified and advised not to add the backup IP to their allowlists.

PREVENTIVE MEASURES

  • Region-Aware Failover Improvements – Updated Defend’s failover logic to better handle conditions where AWS metrics or health checks are unavailable. DNS failover can now evaluate multiple signals, ensuring full traffic diversion when metric sources fail open.
  • Cross-Region Deployment Validation – Added automated verification of DNS and health-check configurations in all active Defend regions to prevent legacy or inactive endpoints from being marked as healthy.
  • Priority-Aware Scaling and Queue Draining – Implemented job-aware scaling logic for ECS and Lambda workers to allow controlled queue recovery after service interruptions without overloading dependent systems.
  • Retry Behavior Controls – Introduced stronger client-side exponential back-off and retry controls in KnowBe4 systems to prevent retry amplification during upstream cloud instability (a generic example follows this list).
  • Mail Infrastructure Hardening – We are initiating a project to expand mail delivery routes so that we can maintain consistent IP reputation and eliminate the need for manual overrides during future regional events.
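
To make the retry-behavior item concrete, the generic helper below applies exponential back-off with full jitter to any call against a degraded upstream service. The delay bounds, retry cap, and the idea of wrapping an arbitrary callable are illustrative choices, not a description of KnowBe4’s internal client libraries.

    # Generic client-side retry helper with exponential back-off and full jitter.
    # Delay bounds and the attempt cap are illustrative, not production values.
    import random
    import time
    from typing import Callable, TypeVar

    T = TypeVar("T")

    def call_with_backoff(operation: Callable[[], T],
                          max_attempts: int = 6,
                          base_delay: float = 0.5,
                          max_delay: float = 60.0) -> T:
        """Invoke `operation`, retrying transient failures with capped, jittered delays."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise                                # give up after the final attempt
                # Full jitter spreads retries out and avoids synchronized "retry storms".
                cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(random.uniform(0, cap))
        raise RuntimeError("unreachable")

    # Example: call_with_backoff(lambda: upstream_client.fetch())  # upstream_client is a placeholder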

CONCLUSION

The service disruptions on October 20, 2025 demonstrated how a single upstream control-plane failure within AWS can cascade across dependent systems and regions. The race condition in the DynamoDB DNS infrastructure in US-EAST-1 triggered widespread instability in services fundamental to KnowBe4’s hosting architecture — particularly ECS, Lambda, and SQS — which in turn affected Defend, KSAT, and PhishER.

Despite the scale of the AWS outage, KnowBe4’s platforms remained data-intact, and core functionality recovered within hours of AWS mitigation. Automatic failover mechanisms limited customer impact in the US region, while targeted engineering intervention restored full operation in the UK region once AWS console access returned. Queue-based systems such as KSAT and PhishER processed all backlogged jobs successfully, and no data loss occurred.

The incident validated the recovery processes of our systems following severe third-party degradation while also revealing opportunities to improve scaling control and retry behavior during upstream cloud instability. The preventive measures coming out of this incident — enhanced failover logic, automated DNS validation, job-aware queue scaling, and mail infrastructure hardening — directly address those gaps and strengthen our ability to withstand future regional cloud events.

KnowBe4 continues to monitor AWS’s remediation efforts related to this incident and will integrate any further best practices as they are published. Through these combined improvements, we are better positioned to maintain continuity, minimize customer impact, and ensure operational transparency during large-scale cloud disruptions.

Posted Oct 27, 2025 - 15:33 UTC

Resolved

The performance issues impacting our platform have been addressed and this incident has been resolved. Our team will be able to provide a root cause analysis once we have received full details on the service event incident from AWS.
Posted Oct 21, 2025 - 13:57 UTC

Update

We continue to experience intermittent degraded performance across parts of the platform due to an AWS service incident in the US-EAST-1 region. Service levels continue to improve, and our teams continue to monitor the situation closely.
Posted Oct 20, 2025 - 20:06 UTC

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 12:06 UTC

Update

Phish Alert, Secure Workspace, and Defend Mail Flow are operational again. We are continuing to monitor PhishER and KSAT.
Posted Oct 20, 2025 - 12:00 UTC

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 10:19 UTC

Update

We are continuing to monitor for any further issues.
Posted Oct 20, 2025 - 10:15 UTC

Monitoring

We experienced degraded performance across parts of the platform earlier today due to an AWS service incident in the US-EAST-1 region. Service levels are improving, and our teams continue to monitor the situation closely.
Posted Oct 20, 2025 - 09:47 UTC
This incident affected: PhishER (Console, Notification Service, Inbox, Integrations, PhishRIP, PhishML, API / Webhooks), KnowBe4 Security Awareness Training (KSAT) (Console, Phishing, Training, Learner Experience (LX), Email Delivery, User Provisioning, Reporting, APIs), SecurityCoach (Security Vendor Integration Service, Real-Time Coaching Campaigns, Real-Time Delivery Methods), Defend (Mail Flow), Secure Workspace, and Phish Alert Button.