This report outlines the detailed findings and mitigations associated with a long-running system degradation in the KnowBe4 Security Awareness Training (KSAT) platform. The degradation was caused by an interaction between a newly enabled Datadog auto-instrumentation feature and internal job-processing components. The impact resulted in significant delays to scheduled and background jobs across KSAT, including phishing campaigns, training notifications, and data synchronization workers.
Multiple customers experienced delays in expected notifications and campaign execution, as initially reported on June 30, 2025. The issue persisted across several weeks, escalating through multiple support tickets and internal investigations. It was ultimately resolved on August 4, 2025, with the deactivation of the problematic Datadog auto-instrumentation feature.
The KSAT system is designed to manage and schedule security awareness campaigns and training notifications at scale. It employs a microservice architecture with distributed workers (AWS Lambda functions) that enqueue, execute, and synchronize jobs across services such as SQS and Valkey, as well as internal systems.
On or around June 30, 2025, customers began reporting significant delays in campaign and notification execution. Jobs (listed below) exhibited degraded performance. Initial investigation efforts were inconclusive, leading to multiple iterative mitigation attempts.
- PhishingCampaignsProcessor
- TrainingNotificationsProcessor
- PhishingCampaignRecipientAggregator
The root issue was eventually traced to a change made on June 17, when the Datadog auto-instrumentation feature was enabled for KSAT; the change initially appeared benign. The feature began collecting additional traces and spans, placing increasing pressure on Lambda invocations that interfaced heavily with external APIs. Under load, this instrumentation introduced overhead that led to longer execution times and retry loops across workers.
In parallel, a second issue was discovered: inefficient SQL queries in phishing campaign workflows that further exacerbated delays when large batches of translation UUIDs were retrieved from the ModStore. This specific query attempted to retrieve all localized versions of campaign content, meaning that at scale, it was pulling hundreds of thousands of templates concurrently. As a result, execution time ballooned to over 22 minutes, well beyond the AWS Lambda timeout threshold, contributing significantly to system-wide delays.
Mitigations:

Lambda concurrency and timeout settings were suboptimal during times of high throughput.

July 29: Reduced batch size in workers. Smaller batches help in two ways:
1. **Concurrency**: More workers handling smaller chunks means tasks can be processed in parallel, leading to better throughput.
2. **AWS Lambda Timeout Protection**: When batches are too large, there's a risk that a worker will hit the maximum execution time allowed by AWS Lambda. Smaller batches reduce the chance of this happening, improving reliability and avoiding dropped or stalled jobs. This was a mitigation for inefficient query execution that risked pushing the worker beyond its timeout when processing a large data set.
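The batch-size reduction above can be sketched as a chunking step applied before work is enqueued. This is an illustrative sketch, not KSAT code: `BATCH_SIZE`, `chunked`, and `enqueue_in_batches` are hypothetical names, and the actual batch size used is not stated in this report.

```python
from itertools import islice

BATCH_SIZE = 100  # illustrative value; the actual post-mitigation size is not stated here

def chunked(iterable, size):
    """Yield successive fixed-size chunks from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def enqueue_in_batches(recipient_ids, enqueue):
    """Split a large recipient list into small batches so each worker
    invocation stays well under the Lambda execution time limit."""
    for batch in chunked(recipient_ids, BATCH_SIZE):
        enqueue(batch)
```

Smaller units of work also fail independently: a retry re-processes one small batch rather than the entire job.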
The PhishingCampaignsProcessor and TrainingNotificationsProcessor both rely heavily on inter-service communication and data lookup chains. When the Datadog instrumentation was enabled, it wrapped outbound HTTP, Valkey, and SQS calls, significantly increasing execution latency.
During peak load, this additional overhead extended job execution past Lambda timeout thresholds, triggering SQS retries. This cascading retry pattern overwhelmed the system’s processing capacity, even with increased concurrency.
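The cascade above is easiest to see as arithmetic: a small per-call tracing overhead, multiplied across many outbound calls, can push a job past its timeout, at which point the SQS message is redelivered and the work repeats. All numbers below are hypothetical for illustration; none are measured values from this incident.

```python
# Illustrative arithmetic only -- these figures are NOT from the incident.
LAMBDA_TIMEOUT_S = 900      # example Lambda timeout; actual KSAT setting not stated
CALLS_PER_JOB = 20_000      # hypothetical outbound HTTP/Valkey/SQS calls per job
BASE_CALL_S = 0.040         # hypothetical baseline latency per call
TRACE_OVERHEAD_S = 0.010    # hypothetical added latency per call from span creation

def job_duration(calls, base_latency, overhead):
    """Total job runtime when each outbound call pays a fixed overhead."""
    return calls * (base_latency + overhead)

before = job_duration(CALLS_PER_JOB, BASE_CALL_S, 0.0)              # fits in the timeout
after = job_duration(CALLS_PER_JOB, BASE_CALL_S, TRACE_OVERHEAD_S)  # exceeds the timeout
```

Once `after` exceeds the timeout, the job never completes: the message returns to the queue, is retried, and the retries themselves consume concurrency, which matches the cascading pattern described above.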
Concurrently, long-running SQL queries were responsible for campaign creation delays. These queries retrieved all template translations in one large query, which failed to scale with large campaigns. The eventual fix was batching and aggregating results across multiple smaller queries.
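The batching fix can be sketched as below, using a parameterized `IN` clause per batch and aggregating the results in application code. The table name, column names, and batch size are illustrative assumptions, not the actual KSAT schema; `sqlite3` stands in for the real database driver.

```python
import sqlite3  # stand-in for the actual database driver

BATCH = 500  # illustrative batch size

def fetch_translations(conn, uuids):
    """Fetch template translations in small batches instead of one huge query,
    aggregating results so each query stays well under the worker timeout."""
    rows = []
    for i in range(0, len(uuids), BATCH):
        batch = uuids[i:i + BATCH]
        placeholders = ",".join("?" * len(batch))
        cur = conn.execute(
            f"SELECT uuid, body FROM translations WHERE uuid IN ({placeholders})",
            batch,
        )
        rows.extend(cur.fetchall())
    return rows
```

Each query now touches a bounded number of rows regardless of campaign size, so execution time scales with batch size rather than with the total number of templates.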
Multiple customers reported delays in phishing campaign launches and training notification delivery between June 30 and August 4, 2025. Following the August 4 instrumentation deactivation, system performance stabilized, and queued jobs began processing within expected thresholds. No data loss occurred, but there was a significant delay in time-sensitive events. Full operational recovery was confirmed on August 5, 2025.
This incident highlighted the risks of enabling system-wide instrumentation without performance load testing, especially on time-sensitive distributed jobs. It also exposed insufficient monitoring and test coverage for queue-based workers under stress.
As a result of this RCA, we are committed to ensuring high availability, transparency, and performance for all KSAT customers, and will continue to invest in system reliability and proactive monitoring.