KSAT - Phishing and Training Campaign Delays (US)

Incident Report for KnowBe4

Postmortem

External Technical Root Cause Analysis - KSAT Job Queue Delays

This report outlines the detailed findings and mitigations associated with a long-running system degradation in the KnowBe4 Security Awareness Training (KSAT) platform. The degradation was caused by an interaction between a newly enabled Datadog auto-instrumentation feature and internal job-processing components. The impact resulted in significant delays to scheduled and background jobs across KSAT, including phishing campaigns, training notifications, and data synchronization workers.

Multiple customers experienced delays in expected notifications and campaign execution, as initially reported on June 30, 2025. The issue persisted across several weeks, escalating through multiple support tickets and internal investigations. It was ultimately resolved on August 4, 2025, with the deactivation of the problematic Datadog auto-instrumentation feature.

WHAT HAPPENED

The KSAT system is designed to manage and schedule security awareness campaigns and training notifications at scale. It employs a microservice architecture and distributed workers (AWS Lambdas) that enqueue, execute, and synchronize jobs across key systems such as SQS and Valkey, as well as other internal services.

On or around June 30, 2025, customers began reporting significant delays in campaign and notification execution. The affected jobs (listed below) exhibited degraded performance. Initial investigation efforts were inconclusive, leading to multiple iterative mitigation attempts.

  • PhishingCampaignsProcessor
  • TrainingNotificationsProcessor
  • PhishingCampaignRecipientAggregator

The root issue was eventually traced to a seemingly benign update made on June 17, when the Datadog auto-instrumentation feature was enabled for KSAT. The feature began collecting additional traces and spans, placing increasing pressure on Lambda invocations that interfaced heavily with external APIs. Under load, this instrumentation introduced enough overhead to lengthen execution times and trigger retry loops across workers.

In parallel, a second issue was discovered: inefficient SQL queries in phishing campaign workflows further exacerbated delays when large batches of translation UUIDs were retrieved from the ModStore. This query attempted to retrieve all localized versions of campaign content, meaning that at scale it pulled hundreds of thousands of templates in a single query. As a result, execution time ballooned to over 22 minutes, well beyond the AWS Lambda timeout threshold, contributing significantly to system-wide delays.

ROOT CAUSE

Primary Root Cause: Datadog Auto-Instrumentation Overhead

  1. Instrumentation Overload: Enabling Datadog auto-instrumentation added runtime overhead to each Lambda's interaction with external systems (SQS, Valkey, Echelon, etc.). During periods of high job throughput, this instrumentation caused job execution to exceed Lambda timeout thresholds, leading to job retries and concurrency bottlenecks.
  2. Hidden Bug in Datadog Layer: A suspected bug within the Datadog instrumentation libraries introduced performance degradation under stress, which compounded the issue.
  3. Missing Monitoring: At the time, there was no dedicated Datadog monitor for job enqueue/dequeue time, making the degradation harder to detect at onset.
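A monitor of the kind described in point 3 can be derived directly from SQS message attributes: when `SentTimestamp` is requested on `ReceiveMessage`, the message's enqueue time (epoch milliseconds) is returned, so message age at dequeue is a simple subtraction. The sketch below is illustrative, not KSAT's actual monitoring code; the helper name is an assumption:

```python
import time

def queue_delay_ms(message, now_ms=None):
    """Age of an SQS message at dequeue time, derived from the
    SentTimestamp attribute (epoch milliseconds, returned as a string).
    The result can be emitted as a custom enqueue-to-dequeue metric."""
    sent_ms = int(message["Attributes"]["SentTimestamp"])
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - sent_ms
```

Emitting this value per message (e.g., as a Datadog distribution metric) would have surfaced the growing enqueue/dequeue gap at onset.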

Secondary Root Cause: Long-Running SQL Query

  1. Query Inefficiency: The template translation logic for phishing campaigns used a single large query containing up to hundreds of thousands of UUIDs.
  2. Execution Latency: These queries occasionally crossed slow-query thresholds, leading to worker bottlenecks during campaign assembly.
  3. Lack of Query Batching: No batching mechanism was implemented until after the issue was identified, compounding the performance problem.
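The batching approach eventually adopted can be sketched as follows. This is a minimal illustration, not the production code; the table, column, and helper names are assumptions:

```python
def chunked(items, size=1000):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_translations(execute_query, uuids, batch_size=1000):
    """Fetch template translations in bounded batches instead of one
    query containing every UUID (illustrative; execute_query stands in
    for the database client, and the table name is hypothetical)."""
    results = []
    for batch in chunked(list(uuids), batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = (
            "SELECT id, locale, body FROM template_translations "
            f"WHERE id IN ({placeholders})"
        )
        results.extend(execute_query(sql, batch))
    return results
```

Each query now touches at most `batch_size` rows, keeping individual statements under slow-query thresholds and well inside Lambda execution limits.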

FINDINGS AND MITIGATIONS

1. Datadog Auto-Instrumentation Overhead

  • Finding: Auto-instrumentation was enabled on June 17 and, under scale, began causing system-wide latency by June 30.
  • Mitigations:

    • Aug 4: Disabled auto-instrumentation
    • Introduced benchmark instrumentation
    • Investigation note: the latency initially appeared to be caused by our new caching platform, Valkey. After opening a case with AWS and thoroughly validating the Valkey infrastructure, including its latency metrics, we ruled out Valkey as a bottleneck; the platform was functioning as expected, with no performance degradation attributable to it.

2. SQL Query Optimization

  • Finding: Inefficient batching of template UUIDs in campaign preparation caused slow queries.
  • Mitigations:

    • Split UUID lists into manageable 1,000-record batches for the workers to process.
    • Tuned Lambda concurrency and timeout settings, which had been suboptimal during periods of high throughput:

      • July 1: Increased Lambda concurrency temporarily
      • July 24: Disabled SQS retry on high-impact workers

3. Observability Improvements

  • Finding: Lack of granularity in metrics made early detection difficult. Specifically, we lacked global catch-all Datadog monitors for identifying long-running queries across all services. Existing dashboards and alerts were scoped too narrowly to specific known patterns, which meant novel or unexpected query slowdowns, such as those observed during this incident, were not flagged early. This gap delayed our ability to correlate performance regressions with specific application-layer operations or query paths, increasing time-to-diagnosis.
  • Mitigations:

    • July 15: Added logging for recipient query performance
    • July 21: Expanded logs for notifications
    • July 29: Reduced batch size in workers

      • By decreasing the number of items each worker processes at once (i.e., smaller batches), the system allows more workers to run in parallel. This change achieves two goals:
      1. Concurrency: more workers handling smaller chunks means tasks can be processed in parallel, leading to better throughput.
      2. AWS Lambda timeout protection: when batches are too large, a worker risks hitting the maximum execution time allowed by AWS Lambda. Smaller batches reduce that risk, improving reliability and avoiding dropped or stalled jobs; this mitigated the inefficient query execution that could push a worker past its timeout on large data sets.
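A common complementary guard against the timeout risk described above uses the Lambda runtime's `context.get_remaining_time_in_millis()` to stop before the deadline and hand back unfinished work. The sketch below is illustrative; the safety margin and function names are assumptions, not KSAT's implementation:

```python
SAFETY_MARGIN_MS = 30_000  # stop this far before the Lambda deadline

def process_until_deadline(records, handle, context, margin_ms=SAFETY_MARGIN_MS):
    """Process records until the Lambda nears its timeout, returning
    the unprocessed remainder so it can be re-enqueued instead of
    being lost to a hard timeout (illustrative sketch)."""
    for i, record in enumerate(records):
        if context.get_remaining_time_in_millis() < margin_ms:
            return records[i:]
        handle(record)
    return []
```

Returning the remainder for re-enqueueing converts a hard timeout (and its SQS retry of the whole batch) into an orderly partial completion.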
      

TECHNICAL DETAILS

The PhishingCampaignsProcessor and TrainingNotificationsProcessor both rely heavily on inter-service communication and data-lookup chains. When the Datadog instrumentation was enabled, it wrapped outbound HTTP, Valkey, and SQS calls, significantly increasing execution latency.

During peak load, this additional overhead extended job execution past Lambda timeout thresholds, triggering SQS retries. This cascading retry pattern overwhelmed the system’s processing capacity, even with increased concurrency.
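Breaking this retry cascade on the high-impact workers (the July 24 mitigation) is typically done by attaching an SQS redrive policy with a low `maxReceiveCount`, so a failed message goes to a dead-letter queue rather than back onto the work queue. A minimal sketch, with the queue and DLQ identifiers as placeholders:

```python
import json

def redrive_policy(dlq_arn, max_receive_count=1):
    """Build the SQS RedrivePolicy attribute that routes a message to a
    dead-letter queue after max_receive_count receives; a count of 1
    effectively disables queue-driven retries (illustrative sketch)."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }
```

The returned dict can be passed as the `Attributes` argument of boto3's `sqs.set_queue_attributes(QueueUrl=..., Attributes=...)`.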

Concurrently, long-running SQL queries were responsible for campaign creation delays. These queries retrieved all template translations in one large query, which failed to scale with large campaigns. The eventual fix was batching and aggregating results across multiple smaller queries.

PREVENTIVE MEASURES

  • Disable Unvetted Instrumentation: Datadog auto-instrumentation will remain disabled pending further analysis with Datadog support.
  • Dedicated Monitors: Implement Datadog monitors that track job throughput and processing delays at scale, plus global catch-all monitors for abnormal system behavior.
  • Improved Batching Logic: All query-intensive workers will be reviewed for batching improvements.
  • Expanded Test Coverage: Edge cases for worker throughput and API-integration timing will be covered by automated test scenarios that execute throughout the day, serving as an additional alerting mechanism.
  • Observability by Default: Workers must now log queue, processing, and exit times.
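The "observability by default" requirement could be enforced with a small wrapper around every worker entry point. This is an illustrative sketch; the log field names are assumptions, not KSAT's actual log format:

```python
import functools
import logging
import time

log = logging.getLogger("worker")

def timed_worker(fn):
    """Log status and wall-clock duration for every worker invocation,
    whether it returns normally or raises (illustrative sketch)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            log.info(
                "worker=%s status=%s duration_ms=%d",
                fn.__name__, status, (time.monotonic() - start) * 1000,
            )
    return wrapper
```

Applying the decorator to each worker guarantees a timing record per invocation, so processing-time regressions surface in logs even before dedicated monitors fire.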

CUSTOMER IMPACT AND RECOVERY

Multiple customers reported delays in phishing campaign launches and training notification delivery between June 30 and August 4, 2025. Following the August 4 instrumentation deactivation, system performance stabilized, and queued jobs began processing within expected thresholds. No data loss occurred, but there was a significant delay in time-sensitive events. Full operational recovery was confirmed on August 5, 2025.

CONCLUSION

This incident highlighted the risks of enabling system-wide instrumentation without performance load testing, especially on time-sensitive distributed jobs. It also exposed insufficient monitoring and test coverage for queue-based workers under stress.

As a result of this RCA:

  • The Datadog auto-instrumentation has been disabled and will remain off pending further analysis with Datadog support.
  • Enhanced observability and benchmarking systems are in place.
  • Query optimizations and batching logic improvements have been implemented.

We are committed to ensuring high availability, transparency, and performance for all KSAT customers, and will continue to invest in system reliability and proactive monitoring.

Posted Aug 11, 2025 - 13:39 UTC

Resolved

This incident has been resolved, and we are no longer seeing delays with campaigns. An RCA will be posted when available.
Posted Aug 06, 2025 - 15:15 UTC

Update

We have implemented several fixes to help improve performance on Phishing and Training campaigns. We are continuing to monitor the results and will continue to make improvements as edge-cases come up.
Posted Aug 05, 2025 - 17:03 UTC

Update

We are continuing to monitor for any further issues.
Posted Aug 01, 2025 - 17:57 UTC

Update

A fix has been implemented that has resolved a majority of cases. We are continuing to monitor the results while we make additional improvements and investigate edge-cases.
Posted Jul 31, 2025 - 16:59 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 29, 2025 - 20:57 UTC

Identified

We've identified the cause of delays in phishing campaigns and are working on implementing a fix. We'll continue to post on our status page with any new information or updates.
Posted Jul 29, 2025 - 17:08 UTC
This incident affected: KnowBe4 Security Awareness Training (KSAT) (Phishing, Training).