KSAT & PhishER - Login Issues

Incident Report for KnowBe4

Postmortem

External Technical Root Cause Analysis: KSAT & PhishER Login Failures – Canada Environment

This report outlines the findings and mitigations related to a login outage affecting the KnowBe4 Security Awareness Training (KSAT) and PhishER platforms within the Canada (CA) environment.

The issue originated from a recently deployed data processing workflow that interacted unexpectedly with internal database components responsible for authentication and enrollment. The resulting database locks caused complete login failures for all users in the Canada region between 7:30 p.m. and 9:58 p.m. UTC on October 13, 2025.

The incident was fully resolved following the removal of the data processing workflow, which immediately restored normal authentication operations. No data loss occurred.

WHAT HAPPENED

On October 13, 2025 at 7:30 p.m. UTC, engineers observed login errors in the KSAT Canada environment, quickly followed by similar failures in PhishER. Automated monitoring confirmed that authentication requests were timing out at the database layer.

Initial hypotheses focused on campaign update backlogs and database saturation. However, further analysis revealed the issue coincided with the deployment of a new data export workflow earlier that same day. This workflow initiated intensive queries against shared authentication tables, resulting in locks that blocked concurrent login and enrollment operations.

A secondary issue compounded the impact: a configuration mismatch in the workflow caused its exclusion filters to fail. These filters were intended to skip large records to reduce load, but due to a naming discrepancy in the production database schemas, they were inadvertently ignored. As a result, the workflow attempted to process all records, including oversized ones, triggering crashes and automatic restarts. This “crash loop” continuously re-acquired locks on critical tables, preventing recovery until the workflow was removed.

ROOT CAUSE

Primary Root Cause: Data Processing Workflow Deployment

  • Database Lock Contention: The new workflow issued heavy queries against authentication tables, creating locks that prevented user sessions and enrollment updates from completing.
  • Timeout Cascade: Under sustained load, these locks led to cascading authentication timeouts and full login outages.

Secondary Root Cause: Configuration Mismatch and Crash Loop

  • Failed Exclusions: The workflow’s exclusion settings targeted outdated schema names, invalidating safeguards designed to prevent oversized record processing (a simplified sketch of this mismatch follows this list).
  • Crash and Restart Behavior: When encountering unprocessable records, the workflow automatically restarted, repeatedly locking the same tables and prolonging the outage.
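
As a simplified picture of this mismatch (hypothetical schema, table, and column names; not the actual workflow configuration), an exclusion rule keyed to a test-environment schema name silently matches nothing in production:

    # Illustrative sketch only; all names are hypothetical, not the real configuration.
    # Exclusion rules keyed by the schema names used in the test environment:
    EXCLUSION_RULES = {
        "ksat_test.user_attachments": "size_bytes > 1000000",  # skip oversized records
    }

    def exclusion_filter_for(table):
        # Production tables live under a different schema name, so the lookup
        # finds nothing and the safeguard is silently skipped.
        return EXCLUSION_RULES.get(table)

    print(exclusion_filter_for("ksat_prod.user_attachments"))  # -> None: filter never applies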

FINDINGS AND MITIGATIONS

  1. Workflow Removal and Database Recovery

Finding: The data processing workflow introduced lock contention on production authentication tables.

Mitigation:

  • Oct 13, 9:58 p.m. UTC: Workflow fully removed from CA environment.
  • Database locks released immediately; login operations restored.
  • Post-removal monitoring confirmed that there was no residual impact on enrollment or authentication services.
  2. Application Reconfiguration and Verification

Finding: Authentication timeouts persisted intermittently during the investigation due to residual database contention.

Mitigation:

  • Redeployed affected services to reinitialize connections.
  • Refactored export connector logic to prevent interaction with high-traffic authentication tables.
  3. Observability and Isolation Testing

Finding: The lack of targeted database lock monitoring delayed identification of the blocking component.

Mitigation:

  • Introduced component isolation tests to identify lock sources more quickly.
  • Added diagnostic telemetry for query lock patterns and workflow restart behavior.

TECHNICAL DETAILS

The deployed data export workflow executed read-heavy queries on shared user authentication and enrollment tables. These operations held shared locks incompatible with concurrent updates, blocking normal authentication processes.
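
As a simplified illustration of this failure mode (SQLite standing in for the production database engine, which is not described here, and hypothetical table names), a long-running read transaction holds a lock that blocks a concurrent write playing the role of a login update:

    import sqlite3
    import threading
    import time

    # Simplified, self-contained sketch; not the production schema or engine.
    setup = sqlite3.connect("demo.db")
    setup.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, last_login TEXT)")
    setup.execute("INSERT INTO users (last_login) VALUES ('never')")
    setup.commit()
    setup.close()

    def export_workflow():
        # The export job: opens a read transaction and keeps it open while it works.
        reader = sqlite3.connect("demo.db", isolation_level=None)
        reader.execute("BEGIN")
        reader.execute("SELECT * FROM users").fetchall()
        time.sleep(5)                     # the long-running export keeps its shared lock
        reader.execute("COMMIT")
        reader.close()

    def login_update():
        # An authentication write that must wait for the export's lock.
        writer = sqlite3.connect("demo.db", timeout=1)
        try:
            writer.execute("UPDATE users SET last_login = 'now' WHERE id = 1")
            writer.commit()
            print("login update succeeded")
        except sqlite3.OperationalError as exc:
            print(f"login update blocked: {exc}")   # 'database is locked'
        finally:
            writer.close()

    t = threading.Thread(target=export_workflow)
    t.start()
    time.sleep(0.5)                       # let the export acquire its lock first
    login_update()
    t.join()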

Due to the configuration mismatch, exclusion filters did not apply, allowing the workflow to process large binary data columns. Each crash triggered automatic restarts, which in turn retried the same operations—reacquiring locks and creating a repeating outage loop.
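
The restart behavior can be pictured with the following sketch (hypothetical record sizes, thresholds, and restart policy; not the actual workflow code): every attempt repeats the same lock-holding work before failing on the same oversized record.

    import time

    OVERSIZED_LIMIT = 1_000_000   # hypothetical threshold the broken filters should have enforced

    def process(record):
        if record["size"] > OVERSIZED_LIMIT:
            # The oversized record that the misconfigured exclusion filter failed to skip.
            raise MemoryError(f"record {record['id']} too large to export")

    def run_export(records):
        # Each attempt re-acquires locks on the shared tables (represented here
        # by the print statements) before crashing on the same record.
        print("acquiring locks on authentication tables ...")
        for record in records:
            process(record)
        print("export finished, locks released")

    records = [{"id": 1, "size": 10}, {"id": 2, "size": 5_000_000}, {"id": 3, "size": 20}]

    for attempt in range(3):          # a supervisor restarts the crashed job immediately
        try:
            run_export(records)
            break
        except MemoryError as exc:
            print(f"attempt {attempt + 1} crashed: {exc}; restarting ...")
            time.sleep(0.1)           # no backoff and no skip logic, so the loop repeats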

Removal of the workflow immediately stopped lock contention, allowing the database to clear queued transactions and restore login functionality.

PREVENTIVE MEASURES

  • Environment Validation: Align database naming conventions between test and production environments to ensure exclusion filters function as intended.
  • Database Lock Monitoring: Implement proactive monitoring and alerting for lock contention on critical tables (a minimal sketch of such a check follows this list).
  • Workflow Design Safeguards: Introduce alternative extraction mechanisms that avoid holding locks on live transactional tables.
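
A minimal sketch of the kind of check such monitoring can run is shown below. It assumes a PostgreSQL-compatible engine and its pg_blocking_pids() helper purely for illustration; the production database engine, connection details, and alert thresholds are not described in this report.

    import psycopg2  # assumes a PostgreSQL-compatible engine; adjust for the actual database

    # Finds sessions that have been waiting on another session's locks.
    BLOCKING_QUERY = """
    SELECT blocked.pid,
           blocked.query,
           blocking.pid,
           blocking.query,
           now() - blocked.query_start AS waiting_for
    FROM pg_stat_activity AS blocked
    JOIN pg_stat_activity AS blocking
      ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
    """

    def check_lock_contention(dsn, max_wait_seconds=30):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(BLOCKING_QUERY)
            for blocked_pid, blocked_q, blocking_pid, blocking_q, waiting in cur.fetchall():
                if waiting.total_seconds() > max_wait_seconds:
                    # In production this would raise an alert rather than print.
                    print(f"ALERT: pid {blocked_pid} blocked for {waiting} by pid {blocking_pid}")
                    print(f"  blocked:  {blocked_q}")
                    print(f"  blocking: {blocking_q}")

    # Example call with a placeholder connection string:
    # check_lock_contention("dbname=auth host=db.internal user=monitor")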

CUSTOMER IMPACT AND RECOVERY

Between 7:30 p.m. and 9:58 p.m. UTC, customers in the Canada (CA) region experienced issues logging in to KSAT and PhishER. All other regions remained unaffected.

Following workflow removal, services recovered immediately, and all users regained access. No data loss or corruption occurred.

CONCLUSION

This incident was caused by a combination of resource contention and configuration mismatch in a newly deployed workflow. The interplay between database locking and restart behavior created a self-sustaining failure loop that fully blocked authentication.

As a result of this incident:

  • The problematic workflow has been removed from production.
  • Enhanced deployment validation and observability have been implemented.
  • Database lock monitoring and alternative extraction approaches are being deployed to prevent recurrence.

KnowBe4 remains committed to ensuring high availability, transparency, and reliability across all KSAT and PhishER environments.

Posted Oct 27, 2025 - 14:20 UTC

Resolved

This incident has been resolved.
Posted Oct 14, 2025 - 11:37 UTC

Update

We are continuing to monitor for any further issues.
Posted Oct 13, 2025 - 22:07 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 13, 2025 - 22:00 UTC

Update

We have received reports that this is intermittently impacting all instances of KSAT and PhishER. You may have issues logging in and accessing your console. Our team is continuing to investigate this issue, and we will update this page as soon as we have more information.
Posted Oct 13, 2025 - 20:30 UTC

Investigating

Customers may be unable to log in to their KSAT and PhishER console in the CA instance. We are investigating this issue and will update this page when we have more information.
Posted Oct 13, 2025 - 19:50 UTC
This incident affected: PhishER (Console) and KnowBe4 Security Awareness Training (KSAT) (Console).