This report outlines the findings and mitigations related to a login outage affecting the KnowBe4 Security Awareness Training (KSAT) and PhishER platforms within the Canada (CA) environment.
The issue originated from a recently deployed data processing workflow that interacted unexpectedly with internal database components responsible for authentication and enrollment. The resulting database locks caused complete login failures for all users in the Canada region between 7:30 p.m. and 9:58 p.m. UTC on October 13, 2025.
The incident was fully resolved following the removal of the data processing workflow, which immediately restored normal authentication operations. No data loss occurred.
On October 13, 2025 at 7:30 p.m. UTC, engineers observed login errors in the KSAT Canada environment, quickly followed by similar failures in PhishER. Automated monitoring confirmed that authentication requests were timing out at the database layer.
Initial hypotheses focused on campaign update backlogs and database saturation. However, further analysis revealed the issue coincided with the deployment of a new data export workflow earlier that same day. This workflow initiated intensive queries against shared authentication tables, resulting in locks that blocked concurrent login and enrollment operations.
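The step from "logins are timing out" to "this workflow holds the lock" can be sketched as a walk along a wait-for chain. The example below is illustrative only: the session names and the `waits` mapping are hypothetical, and the report does not describe the tooling engineers actually used.

```python
from collections import defaultdict


def root_blockers(waits):
    """Group waiting sessions under the session at the head of their chain.

    `waits` maps a waiting session to the session it is blocked by
    (a hypothetical, simplified view of database lock-wait data).
    """
    grouped = defaultdict(list)
    for waiter in waits:
        seen = {waiter}
        current = waits[waiter]
        # Follow the chain until reaching a session that is not itself
        # waiting; stop early if the chain loops back (a deadlock cycle).
        while current in waits and current not in seen:
            seen.add(current)
            current = waits[current]
        grouped[current].append(waiter)
    return {root: sorted(stuck) for root, stuck in grouped.items()}


# Hypothetical snapshot: two logins and one enrollment queued behind the export.
waits = {
    "login-1": "export-workflow",
    "login-2": "login-1",
    "enroll-1": "export-workflow",
}
blockers = root_blockers(waits)
print(blockers)
```

In this toy snapshot every stalled session traces back to a single entry, which is the kind of signal that points at one blocking component rather than general database saturation.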
A secondary issue compounded the impact: a configuration mismatch in the workflow caused its exclusion filters to fail. These filters were intended to skip large records to reduce load, but due to a naming discrepancy in the production database schemas, they were inadvertently ignored. As a result, the workflow attempted to process all records, including oversized ones, triggering crashes and automatic restarts. This “crash loop” continuously re-acquired locks on critical tables, preventing recovery until the workflow was removed.
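The failure mode of a filter that is silently ignored can be illustrated with a minimal sketch. All names here (`prod_auth.attachments`, `select_targets_lenient`, `select_targets_strict`) are hypothetical and are not taken from the actual workflow; the point is only to contrast a filter that quietly matches nothing with one that rejects unknown names.

```python
def select_targets_lenient(available, excluded):
    """Skip excluded names; an exclusion that matches nothing is ignored."""
    return [name for name in available if name not in excluded]


def select_targets_strict(available, excluded):
    """Same filtering, but fail fast if an exclusion matches nothing."""
    unknown = sorted(set(excluded) - set(available))
    if unknown:
        raise ValueError(f"exclusions do not match any known name: {unknown}")
    return [name for name in available if name not in excluded]


# Hypothetical names: the exclusion was written without the production prefix.
available = ["prod_auth.users", "prod_auth.attachments"]
excluded = {"auth.attachments"}

lenient_result = select_targets_lenient(available, excluded)
print(lenient_result)  # the oversized target is still selected

try:
    select_targets_strict(available, excluded)
    strict_rejected = False
except ValueError as err:
    strict_rejected = True
    print(err)
```

Under these assumptions, the lenient variant processes everything, which mirrors the behavior described above, while the strict variant surfaces the naming discrepancy before any query runs.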
Primary Root Cause: Data Processing Workflow Deployment
Secondary Root Cause: Configuration Mismatch and Crash Loop
The data processing workflow introduced locking contention on production authentication tables. Mitigation:
Authentication timeouts persisted intermittently during investigation due to residual database contention. Mitigation:
The lack of targeted database lock monitoring delayed the identification of the blocking component. Mitigation:
The deployed data export workflow executed read-heavy queries on shared user authentication and enrollment tables. These operations held shared locks incompatible with concurrent updates, blocking normal authentication processes.
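The interaction between a long-running read and a concurrent update can be reproduced in miniature. The sketch below uses SQLite purely as a stand-in, because the report does not name the production database engine and its locking details will differ; the table and column names are hypothetical.

```python
import os
import sqlite3
import tempfile


def demo_shared_lock_blocks_update():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "auth.db")
        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_login TEXT)")
        setup.execute("INSERT INTO users VALUES (1, NULL)")
        setup.commit()
        setup.close()

        # isolation_level=None gives manual transaction control;
        # timeout=0 makes the writer fail immediately instead of waiting.
        reader = sqlite3.connect(path, isolation_level=None)
        writer = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            # The "export" opens a read transaction and keeps it open,
            # holding a shared lock on the database.
            reader.execute("BEGIN")
            reader.execute("SELECT * FROM users").fetchall()

            # The "login" tries to update while the shared lock is held.
            try:
                writer.execute("UPDATE users SET last_login = 'now' WHERE id = 1")
                blocked = False
            except sqlite3.OperationalError:
                blocked = True

            # Once the reader finishes, the same update goes through.
            reader.execute("COMMIT")
            writer.execute("UPDATE users SET last_login = 'now' WHERE id = 1")
            row = writer.execute("SELECT last_login FROM users WHERE id = 1").fetchone()
        finally:
            reader.close()
            writer.close()
        return blocked, row[0]


blocked, final_value = demo_shared_lock_blocks_update()
print(blocked, final_value)
```

The update fails only while the read transaction is open and succeeds as soon as it ends, which matches the pattern described above: authentication writes could not proceed until the export's locks were released.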
Due to the configuration mismatch, exclusion filters did not apply, allowing the workflow to process large binary data columns. Processing these oversized records crashed the workflow, and each crash triggered an automatic restart that retried the same operations, reacquiring locks and creating a repeating outage loop.
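Why the restarts could not recover on their own can be shown with a toy model: the input that causes the crash does not change between attempts, so every restart takes the lock again and fails at the same point. Everything in this sketch is hypothetical, including the `max_restarts` cap, which exists only so the example terminates; the report does not say whether the real restarts were bounded.

```python
class OversizedRecord(Exception):
    """Raised when a record exceeds what the job can handle."""


def run_export(records, max_bytes, lock_log):
    # Each attempt takes the (simulated) table lock before reading.
    lock_log.append("acquire")
    try:
        for record in records:
            if len(record) > max_bytes:
                raise OversizedRecord(f"record of {len(record)} bytes")
    finally:
        # The lock is released on crash, then retaken by the next attempt.
        lock_log.append("release")


def supervise(job, max_restarts):
    """Restart the job after each crash, up to max_restarts restarts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            job()
            return attempts, True
        except OversizedRecord:
            if attempts > max_restarts:
                return attempts, False


lock_log = []
records = [b"a" * 10, b"b" * 5000]  # the second record is oversized
attempts, succeeded = supervise(
    lambda: run_export(records, max_bytes=1024, lock_log=lock_log),
    max_restarts=3,
)
print(attempts, succeeded, lock_log.count("acquire"))
```

In this model the number of lock acquisitions equals the number of attempts and the job never succeeds, so without a cap or a change to the input the loop would continue until the job is removed, as happened in the incident.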
Removal of the workflow immediately stopped lock contention, allowing the database to clear queued transactions and restore login functionality.
Between 7:30 p.m. and 9:58 p.m. UTC on October 13, 2025, customers in the Canada (CA) region were unable to log in to KSAT and PhishER. All other regions remained unaffected.
Following workflow removal, services recovered immediately, and all users regained access. No data loss or corruption occurred.
This incident was caused by a combination of resource contention and a configuration mismatch in a newly deployed workflow. The interplay between database locking and restart behavior created a self-sustaining failure loop that fully blocked authentication.
As a result of this incident:
KnowBe4 remains committed to ensuring high availability, transparency, and reliability across all KSAT and PhishER environments.