KSAT, SCH, & PER - Login Issues

Incident Report for KnowBe4

Postmortem

External Technical Root Cause Analysis: KSAT, SCH, & PER - Login Issues

This report outlines the findings and mitigations related to a service degradation that affected the KnowBe4 Security Awareness Training (KSAT), Security Coach (SCH), and PhishER (PER) platforms on October 14, 2025.

The degradation was caused by an interaction between a recently deployed software update and the system’s internal connection management layer, which handles active session routing and authentication requests. The update introduced an inefficient database operation that caused high connection usage and resource exhaustion. Combined with inconsistent configuration parameters across service components, this led to widespread delays and login failures in multiple regions.

The issue was reported beginning at 1:00 p.m. UTC, escalated through multiple channels, and was fully resolved by 6:45 p.m. UTC after the software update was rolled back and the affected connection manager was replaced.

WHAT HAPPENED

The KSAT platform manages user authentication and campaign scheduling across a distributed architecture that balances thousands of concurrent sessions in the impacted regions (U.S. and E.U.).

On October 14 at 1:00 p.m. UTC, customers began to experience login delays and timeouts. Internal monitoring also alerted that session handling services were under heavy load and that response times had degraded significantly.

Initial investigations pointed to abnormal resource consumption in database operations. The team later traced the problem to a software update deployed earlier in the day that contained an inefficient database operation executed at high frequency. This change sharply increased the number of active connections and caused spikes in backend resource utilization, resulting in authentication timeouts and intermittent login failures.

ROOT CAUSE

Primary Root Cause: Inefficient Database Operation

  • Resource Overload: The newly deployed update introduced a repetitive operation that consumed disproportionate database and connection resources under normal user load (a hypothetical illustration follows this list).
  • Insufficient Load Testing: The operation passed standard validation but had not been exercised under full production-scale concurrency, so the degradation did not surface until after deployment.
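
The specific operation is not disclosed in this report. Purely as a hypothetical illustration of this class of problem, the sketch below (Python, using an in-memory SQLite database; every table and identifier is invented) contrasts a per-row query issued in a loop, which multiplies query volume and connection hold time under concurrency, with a single batched query that returns the same data.

    # Hypothetical illustration only; the actual operation in the October 14
    # update is not described in this report.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO users (id, email) VALUES (?, ?)",
                     [(i, f"user{i}@example.com") for i in range(500)])
    user_ids = list(range(500))

    # Inefficient: one round trip per user id, repeated at high frequency.
    emails_slow = [
        conn.execute("SELECT email FROM users WHERE id = ?", (uid,)).fetchone()[0]
        for uid in user_ids
    ]

    # Batched alternative: the same rows retrieved in a single round trip.
    placeholders = ",".join("?" * len(user_ids))
    emails_fast = [row[0] for row in conn.execute(
        f"SELECT email FROM users WHERE id IN ({placeholders})", user_ids)]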

Secondary Root Cause: Connection Pool Fragmentation

  • Configuration Inconsistency: Varying connection parameters across services caused the connection manager to open independent resource pools, reducing efficiency and preventing resource reuse.
  • Stale Resource Retention: Locked pools held unused connections beyond their intended lifespan, contributing to system-wide slowdowns.
  • Limited Observability: Lack of detailed logging and automated recovery mechanisms delayed detection and response.

FINDINGS AND MITIGATIONS

1. Inefficient Database Operation

Finding: The operation deployed on October 14 led to rapid degradation within minutes under production traffic.

Mitigations:

  • 2:20 p.m. UTC: Rolled back the software update, immediately stabilizing performance in the E.U. region.
  • Documented and removed the problematic query path, and flagged it for future validation and regression testing.

2. Connection Manager Fragmentation

Finding: Inconsistent configuration parameters caused isolated pools and prevented efficient resource reuse.

Mitigations:

  • 6:45 p.m. UTC: Deployed a replacement connection management service, restoring performance in the U.S. region.
  • Enabled enhanced diagnostic logging for connection lifecycle events.
  • Established a verified escalation path for recycling resource pools if similar symptoms appear (a simplified sketch of such a procedure follows this list).
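
The sketch below is a simplified, hypothetical illustration of such an escalation step, not the production connection manager: a tiny pool that logs connection lifecycle events and exposes a recycle() routine that closes idle connections and rebuilds the pool on demand.

    # Hypothetical sketch; class and method names are invented for illustration.
    import logging
    import queue
    import sqlite3

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pool")

    class RecyclablePool:
        def __init__(self, dsn=":memory:", size=5):
            self.dsn, self.size = dsn, size
            self.idle = queue.Queue()
            for _ in range(size):
                self.idle.put(sqlite3.connect(dsn, check_same_thread=False))

        def acquire(self):
            conn = self.idle.get()
            log.info("checkout: %d idle connections remain", self.idle.qsize())
            return conn

        def release(self, conn):
            self.idle.put(conn)
            log.info("checkin: %d idle connections available", self.idle.qsize())

        def recycle(self):
            # Escalation path: drop every idle connection and rebuild the pool.
            closed = 0
            while not self.idle.empty():
                self.idle.get().close()
                closed += 1
            for _ in range(self.size):
                self.idle.put(sqlite3.connect(self.dsn, check_same_thread=False))
            log.info("recycled pool: closed %d stale connections", closed)

    pool = RecyclablePool()
    conn = pool.acquire()
    pool.release(conn)
    pool.recycle()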

3. Configuration Standardization

Finding: Non-standardized connection settings across components led to fragmentation.

Mitigations:

  • Initiated engineering tasks to standardize connection parameters across all services.
  • Introduced a single configuration pattern per authentication context, with clear exceptions only for read/write separation (see the illustrative example after this list).
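
As a hypothetical example of what a single approved pattern per authentication context can look like (the parameter names, values, and hostnames below are invented for illustration), the only sanctioned variation is the read/write split, which changes the target host rather than the pool shape:

    # Hypothetical standardized connection configuration; all values are illustrative.
    BASE_POOL_CONFIG = {
        "pool_size": 20,        # maximum pooled connections per node
        "max_overflow": 5,      # short-lived burst connections
        "pool_timeout_s": 10,   # seconds to wait before failing a checkout
        "pool_recycle_s": 1800, # retire connections before they go stale
    }

    def connection_config(read_only: bool = False) -> dict:
        # One parameter set per authentication context; read/write separation
        # selects a different host but keeps the pool shape identical.
        cfg = dict(BASE_POOL_CONFIG)
        cfg["host"] = "db-replica.internal" if read_only else "db-primary.internal"
        return cfg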

4. Observability Improvements

Finding: Limited insight into connection pool utilization delayed identification of the root cause.

Mitigations:

  • Enhanced monitoring to surface anomalies in active connection counts and pool usage (a hypothetical example follows this list).
  • Documented the connection management architecture and incident playbook for on-call teams.
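
A hypothetical example of the kind of check this enables (thresholds, field names, and pool names are invented for illustration) is shown below: it flags both an unexpectedly high number of pools, which is the fragmentation signature described in this report, and pools whose utilization approaches capacity.

    # Hypothetical anomaly check over connection pool metrics.
    from dataclasses import dataclass

    @dataclass
    class PoolStats:
        name: str
        active: int    # connections currently checked out
        capacity: int  # configured maximum for the pool

    def find_anomalies(pools, utilization_threshold=0.85, expected_pool_count=4):
        alerts = []
        if len(pools) > expected_pool_count:
            alerts.append(f"possible fragmentation: {len(pools)} pools, "
                          f"expected at most {expected_pool_count}")
        for p in pools:
            if p.capacity and p.active / p.capacity >= utilization_threshold:
                alerts.append(f"{p.name}: {p.active}/{p.capacity} connections in use")
        return alerts

    print(find_anomalies([PoolStats("auth-rw", 19, 20), PoolStats("auth-ro", 3, 20)]))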

TECHNICAL DETAILS

The database connection layer distributes incoming requests through multiple service nodes, each maintaining a limited set of reusable connection pools. When the inefficient operation was deployed, it introduced a unique connection pattern inconsistent with the rest of the application. This caused the manager to allocate new, isolated pools across each node, instead of reusing existing ones.

As a result, thousands of connections remained tied up within these fragmented pools and could not be recycled by other services. Although sufficient overall database capacity remained, the platform’s session routing logic was unable to access it efficiently, resulting in high latency and login failures.
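
A minimal sketch of this failure mode, assuming a manager that reuses a pool only when every connection parameter matches exactly, is shown below; the class and parameter names are invented for illustration and do not reflect the actual connection manager.

    # Illustrative only: pools keyed by the full parameter set fragment when
    # any parameter differs between services.
    class PoolManager:
        def __init__(self):
            self._pools = {}

        def get_pool(self, **params):
            key = tuple(sorted(params.items()))   # exact-match reuse
            if key not in self._pools:
                self._pools[key] = f"pool-{len(self._pools) + 1}"
            return self._pools[key]

    manager = PoolManager()
    manager.get_pool(host="db-primary", timeout=10)                      # existing service
    manager.get_pool(host="db-primary", timeout=30)                      # differs only in timeout
    manager.get_pool(host="db-primary", timeout=10, app="new-feature")   # the new code path
    print(len(manager._pools))  # 3 isolated pools where 1 would have sufficed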

Replacing the connection manager and reverting the software update ultimately allowed the release of resources and restored normal throughput.

PREVENTIVE MEASURES

  • Configuration Standardization: All services will adhere to a unified connection configuration model to prevent fragmentation.
  • Expanded Load Testing: New performance tests simulate realistic traffic and concurrency to detect inefficiencies before deployment (a simplified sketch follows this list).
  • Improved Connection Lifecycle Controls: Standardized procedures now exist for resetting connection pools during active incidents.
  • Documentation and Training: Engineering teams have completed updated documentation for connection handling and troubleshooting.
  • Enhanced Monitoring: System metrics now include detailed connection pool utilization and latency correlation across nodes.
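
As a simplified sketch of the expanded load testing described above (the login stub, concurrency level, and latency threshold are all hypothetical), a pre-deployment check can drive the login path at production-like concurrency and fail if tail latency regresses:

    # Hypothetical pre-deployment concurrency test.
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    def login_once() -> float:
        start = time.perf_counter()
        time.sleep(0.01)  # stand-in for a call to a staging authentication endpoint
        return time.perf_counter() - start

    def p95_login_latency(concurrency=200, requests=2000) -> float:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(lambda _: login_once(), range(requests)))
        return statistics.quantiles(latencies, n=100)[94]  # 95th percentile

    p95 = p95_login_latency()
    assert p95 < 0.25, f"p95 login latency regressed: {p95:.3f}s"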

CUSTOMER IMPACT AND RECOVERY

Between 1:00 p.m. and 6:45 p.m. UTC, some customers in the U.S. and E.U. regions experienced login failures and performance delays when accessing KSAT, SCH, or PER.

Following the rollback at 2:20 p.m. UTC and the connection manager replacement at 6:45 p.m. UTC, all regions stabilized and users regained normal access. No data loss occurred, though time-sensitive operations such as authentication, campaign delivery, and training notifications were delayed during the incident window.

CONCLUSION

This incident underscored the importance of consistent configuration management and comprehensive load testing for high-traffic authentication systems. The combination of an inefficient operation and fragmented connection handling led to resource exhaustion and widespread login disruptions.

As a result of this RCA:

  • The faulty update has been permanently reverted.
  • The connection management service has been modernized and redeployed.
  • Standardization and observability improvements are underway across engineering teams.

KnowBe4 remains committed to maintaining high availability, transparency, and reliability across all KSAT, SCH, and PER environments, and continues to invest in proactive safeguards to prevent recurrence.

Posted Oct 27, 2025 - 14:20 UTC

Resolved

This incident has been resolved.
Posted Oct 15, 2025 - 13:43 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 14, 2025 - 19:10 UTC

Update

We continue to work to resolve this issue. We have a few mitigating items in place and the console is loading successfully. We are beginning to process a backlog of campaigns.
Posted Oct 14, 2025 - 18:25 UTC

Identified

We've identified the cause of the login issues and are working on implementing a fix for the US instance. We've applied a fix for the EU instance and are monitoring the results to make sure no further issues occur. We'll continue to post on our status page with any new information or updates.
Posted Oct 14, 2025 - 15:22 UTC

Investigating

Customers may be unable to log in to their KSAT, SecurityCoach, or PhishER consoles. Additionally, this is causing initialization issues with the Phish Alert Button. We are investigating this issue and will update this page when we have more information.
Posted Oct 14, 2025 - 13:41 UTC
This incident affected: PhishER (Console), KnowBe4 Security Awareness Training (KSAT) (Console), and Phish Alert Button.