This report outlines the findings and mitigations related to a service degradation that affected the KnowBe4 Security Awareness Training (KSAT), Security Coach (SCH), and PhishER (PER) platforms on October 14, 2025.
The degradation was caused by an interaction between a recently deployed software update and the system's internal connection management layer, which handles active session routing and authentication requests. The update introduced an inefficient database operation that drove high connection usage and resource exhaustion. Combined with inconsistent configuration parameters across service components, this caused widespread delays and login failures in multiple regions.
The issue was first reported at 1:00 p.m. UTC, escalated through multiple channels, and fully resolved by 6:45 p.m. UTC after the software update was rolled back and the affected connection manager was replaced.
The KSAT platform manages user authentication and campaign scheduling across a distributed architecture that balances thousands of concurrent sessions in the impacted regions (U.S. and E.U.).
On October 14 at 1:00 p.m. UTC, customers began to experience login delays and timeouts. Internal monitoring also raised alerts that session-handling services were under heavy load and that response times had degraded significantly.
Initial investigations pointed to abnormal resource consumption in database operations. The team later traced the problem to a software update deployed earlier in the day that contained an inefficient database operation executed at high frequency. This change sharply increased the number of active connections and caused spikes in backend resource utilization, resulting in authentication timeouts and intermittent login failures.
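To make the failure mode concrete, the following is a minimal sketch of the pattern described above: a high-frequency operation that opens a new database connection on every call instead of drawing from a shared pool. The driver (psycopg2), table, and function names are illustrative assumptions, not KnowBe4's actual code.

```python
import psycopg2
from psycopg2 import pool

# Problematic pattern: each high-frequency call opens its own connection,
# so connection counts scale with request rate until the backend is exhausted.
def lookup_session_unpooled(dsn: str, session_id: str):
    conn = psycopg2.connect(dsn)          # new TCP connection + auth handshake per call
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT user_id FROM sessions WHERE id = %s", (session_id,))
            return cur.fetchone()
    finally:
        conn.close()

# Remediated pattern: a shared pool caps total connections and recycles them.
shared_pool = pool.ThreadedConnectionPool(minconn=2, maxconn=20, dsn="postgresql://...")

def lookup_session_pooled(session_id: str):
    conn = shared_pool.getconn()          # reuse an idle connection from the pool
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT user_id FROM sessions WHERE id = %s", (session_id,))
            return cur.fetchone()
    finally:
        shared_pool.putconn(conn)         # return the connection for reuse
```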
Finding: The inefficient operation deployed on October 14 led to rapid degradation within minutes under production traffic.
Mitigations: The software update was rolled back, and pre-deployment load testing now exercises high-frequency database operations at production scale.

Finding: Inconsistent configuration parameters caused isolated pools and prevented efficient resource reuse.
Mitigations: Configuration parameters are being audited and aligned so that services share existing pools rather than spawning new ones.

Finding: Non-standardized connection settings across components led to fragmentation.
Mitigations: Connection settings are being standardized across all components (a configuration sketch follows this list).

Finding: Limited insight into connection pool utilization delayed identification of the root cause.
Mitigations: Monitoring and alerting on connection pool utilization have been expanded to surface exhaustion before logins fail.
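The following is a minimal sketch of the standardization and monitoring mitigations above, assuming a SQLAlchemy-style pool; the module layout, connection string, and metric names are illustrative assumptions rather than KnowBe4's internals.

```python
from sqlalchemy import create_engine

# One canonical set of pool parameters, imported by every component, so no
# service drifts into its own incompatible (and therefore isolated) settings.
POOL_SETTINGS = dict(pool_size=10, max_overflow=5, pool_timeout=30, pool_recycle=1800)

engine = create_engine("postgresql://user:pass@db-host/app", **POOL_SETTINGS)

def pool_utilization_metrics() -> dict:
    """Expose pool utilization so exhaustion is visible before logins fail."""
    p = engine.pool
    return {
        "checked_out": p.checkedout(),   # connections currently in use
        "pool_size": p.size(),           # configured base pool size
        "overflow": p.overflow(),        # connections opened beyond pool_size
    }
```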
The database connection layer distributes incoming requests through multiple service nodes, each maintaining a limited set of reusable connection pools. When the inefficient operation was deployed, it introduced a unique connection pattern inconsistent with the rest of the application. This caused the manager to allocate new, isolated pools on each node instead of reusing existing ones.
As a result, thousands of connections remained tied up within these fragmented pools and could not be recycled by other services. Although sufficient overall database capacity remained, the platform’s session routing logic was unable to access it efficiently, resulting in high latency and login failures.
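The fragmentation mechanism can be illustrated generically: many connection managers key pools on the full set of connection parameters, so an operation with a unique parameter combination receives its own isolated pool on every node. The sketch below is a simplified stand-in, not the vendor's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PoolManager:
    pools: dict = field(default_factory=dict)

    def get_pool(self, dsn: str, **params):
        # Pools are keyed by (dsn, sorted parameters); any parameter that
        # differs from the rest of the application creates a new key.
        key = (dsn, tuple(sorted(params.items())))
        if key not in self.pools:
            self.pools[key] = []          # stand-in for a real pool object
        return self.pools[key]

mgr = PoolManager()
mgr.get_pool("db-host", timeout=30)       # shared application pool
mgr.get_pool("db-host", timeout=30)       # reuses the same pool
mgr.get_pool("db-host", timeout=31)       # one changed parameter: isolated pool
assert len(mgr.pools) == 2                # capacity now split across fragmented pools
```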
Replacing the connection manager and reverting the software update ultimately released the stranded connections and restored normal throughput.
Between 1:00 p.m. and 6:45 p.m. UTC, some customers in the U.S. and E.U. regions experienced login failures and performance delays when accessing KSAT, SCH, or PER.
Following the rollback at 2:20 p.m. and connection manager replacement at 6:45 p.m., all regions stabilized and users regained normal access. No data loss occurred, though time-sensitive operations such as authentication, campaign delivery, and training notifications were delayed during the incident window.
This incident underscored the importance of consistent configuration management and comprehensive load testing for high-traffic authentication systems. The combination of an inefficient operation and fragmented connection handling led to resource exhaustion and widespread login disruptions.
As a result of this RCA:
- The October 14 software update was rolled back and the inefficient database operation corrected.
- Connection settings and configuration parameters are being standardized across all service components.
- Monitoring and alerting on connection pool utilization have been expanded.
- Load testing of high-traffic authentication paths has been added to the pre-deployment process.
KnowBe4 remains committed to maintaining high availability, transparency, and reliability across all KSAT, SCH, and PER environments, and continues to invest in proactive safeguards to prevent recurrence.