Multiple Products - Azure related incident

Incident Report for KnowBe4

Postmortem

External Technical Root Cause Analysis: Microsoft Azure Front Door Outage Impact on KnowBe4 Products

October 29, 2025

Summary

This report details the findings and mitigations related to a global Microsoft Azure Front Door (AFD) outage that occurred between 15:45 UTC on October 29 and 00:05 UTC on October 30, 2025.

The incident disrupted multiple Microsoft services, including Azure Active Directory, Azure Portal, Microsoft Entra ID, and Office 365 add-in infrastructure. This in turn directly affected several KnowBe4 products that depend on these services for routing, authentication, and Office add-in initialization.

While user interfaces and administrative consoles experienced intermittent unavailability, all backend mail-processing and delivery systems within KnowBe4 remained operational and continued to function as designed.

The outage originated entirely within Microsoft’s Azure Front Door global network, following an inadvertent tenant-configuration deployment. The faulty configuration propagated inconsistently across AFD nodes worldwide, causing widespread routing errors and service timeouts until Microsoft rolled back to a last-known-good state.

Glossary of Technical Terms

Azure Front Door (AFD): Microsoft’s global, scalable content delivery network (CDN) and web application firewall (WAF) service used to route internet traffic securely and efficiently to Azure-hosted applications.

Azure Front Door Configuration: A set of routing and security rules that define how traffic flows through Azure Front Door to customer applications. A configuration error or invalid deployment can cause global routing disruptions.

Azure Active Directory / Microsoft Entra ID: Microsoft’s identity and access management service that provides user authentication, single sign-on, and authorization across Azure and Microsoft 365 applications.

Office Web Add-in: A lightweight web application that extends Microsoft Outlook, Excel, Word, and other Office products. KnowBe4’s add-ins use this framework to provide in-app phishing-reporting and email-classification functionality.

appsforoffice.microsoft.com: A Microsoft-controlled endpoint that serves the Office.js library, which is required to initialize all Office web add-ins. If unavailable, no add-in code can execute.

Office.js: The client-side JavaScript framework that provides APIs for Office web add-ins. It must load successfully from Microsoft’s servers before an add-in can function.

Last Known Good Configuration: The most recent stable configuration version used by Microsoft to restore service after an invalid deployment or system failure.

Web Application Firewall (WAF): A network security layer that filters and monitors HTTP(S) traffic between a web application and the Internet, protecting against common web exploits and attacks.

Mailflow: The process by which emails are received, analyzed, classified, and delivered within KnowBe4’s mail-security products. This component was unaffected during the Azure outage.

Magic Link: A secure, single-use authentication link that allows KnowBe4 administrators or new customers to access a portal or begin deployment without entering credentials.

Telemetry: Automated data collection used for monitoring system performance and availability, enabling detection of latency, errors, or external dependency issues.

Fail-Open: A system design approach in which, during an outage or dependency failure, limited functionality remains available instead of the service fully blocking operations.

WHAT HAPPENED

Azure Infrastructure Outage

At 15:45 UTC on October 29, 2025, Microsoft Azure Front Door (AFD) experienced a global service disruption caused by an inadvertent tenant configuration deployment that introduced an invalid state across the AFD network.

The invalid configuration caused a significant number of nodes to fail to load properly, resulting in widespread latencies, timeouts, and connection errors for both Microsoft services and customer applications.

As unhealthy nodes were dropped from service, traffic was automatically rerouted to the remaining healthy nodes, creating further imbalances and intermittent regional outages.

Microsoft responded by blocking all configuration changes, deploying a “last known good” configuration, and manually recovering the affected nodes.
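
The containment sequence described here (freeze changes, then restore the last validated configuration) follows a common operational pattern. The sketch below is a generic illustration of that pattern, not Microsoft's implementation; the `ConfigStore` class, its method names, and the dictionary-based configuration are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Config = dict  # hypothetical: a configuration is modeled as a plain dict


@dataclass
class ConfigStore:
    """Tracks the active configuration and the last one confirmed healthy."""

    validate: Callable[[Config], bool]
    active: Optional[Config] = None
    last_known_good: Optional[Config] = None
    frozen: bool = False

    def deploy(self, candidate: Config) -> bool:
        """Activate a candidate if changes are allowed and validation passes."""
        if self.frozen:
            raise RuntimeError("configuration changes are blocked")
        if not self.validate(candidate):
            return False
        self.active = dict(candidate)
        return True

    def mark_healthy(self) -> None:
        """Promote the active configuration once it is confirmed healthy."""
        if self.active is not None:
            self.last_known_good = dict(self.active)

    def freeze(self) -> None:
        """Block further deployments as a first containment step."""
        self.frozen = True

    def rollback(self) -> Config:
        """Restore the last known good configuration."""
        if self.last_known_good is None:
            raise RuntimeError("no known-good configuration recorded")
        self.active = dict(self.last_known_good)
        return self.active
```

In this model, a validation function that is too permissive lets a bad candidate become active, which is why a recorded known-good state and a freeze switch are kept separate from validation.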

Full mitigation was confirmed by Microsoft at 00:05 UTC on October 30, 2025, following the gradual rebalancing of global traffic.

During the outage window, dependent services such as Microsoft Entra ID, Office 365 add-in frameworks (appsforoffice.microsoft.com), and the Azure Portal were also impacted. These dependencies are critical to KnowBe4’s authentication, routing, and Office-based integrations, and their unavailability directly contributed to product-level interruptions.

Platform-Level Effects

Prevent & Protect

  • Users of Microsoft Outlook add-ins for Prevent and Protect experienced hanging send processes and error messages such as “Outlook can’t send this message because there’s a problem with the add-in.”
  • Because the Microsoft-hosted Office.js library could not initialize, the task pane failed to load entirely, leaving users unable to send or classify emails.
  • Gateway Moderation features and backend mailflow continued to function normally.

Defend

  • Administrators were unable to log in to the Defend console, which relies on the KnowBe4 Security Center (KSC) for authentication.
  • KSC, in turn, uses Azure Front Door and Microsoft Entra ID for routing and identity validation.
  • Although the Defend console was unavailable, core email processing, analysis, and delivery pipelines remained fully operational.

Phish Alert Button (PAB – Hybrid & MSR)

  • Users attempting to launch the Phish Alert Button (PAB) encountered indefinite loading and initialization failures.
  • Because the add-in could not reach appsforoffice.microsoft.com, the Office.js script never executed, preventing the add-in from rendering or reporting emails.
  • Users of PAB for Gmail were unaffected.

KnowBe4 Security Center (KSC)

  • The KSC administrative dashboard became temporarily inaccessible due to its reliance on Azure Front Door and Microsoft Entra ID for authentication and session management.
  • All backend data collection and product reporting pipelines remained intact.

Webforms

  • Customer-facing webforms hosted through Azure were inaccessible during the incident.
  • Because Azure Front Door manages TLS certificate termination and inbound routing, form submissions and UI access were blocked until Microsoft restored AFD.

KnowBe4 Deployment Center

  • Customers were unable to receive magic links or access onboarding interfaces during the outage.
  • In-progress deployments may have failed to complete due to API call disruptions with Microsoft Graph and Exchange Online.
  • Once service was restored, administrators could retry affected deployments with no lasting customer impact.

Workspace

  • No confirmed customer impact. Active sessions remained stable, and no user reports indicated degradation of service during the outage window.

ROOT CAUSE ANALYSIS

The Azure Front Door outage was caused by an inadvertent tenant configuration change within AFD that introduced an invalid state on AFD nodes globally.

A software defect in Microsoft’s deployment validation process allowed the faulty configuration to bypass safety checks and propagate to production, causing nodes to fail to load properly and drop from the network.

As nodes failed, traffic was rerouted to remaining healthy regions, overloading them and amplifying global latency and timeout rates. Microsoft blocked further configuration changes, rolled back to the previous known-good state, and manually rebalanced traffic to restore stability.
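
A rough calculation shows why rerouting can amplify an outage. The sketch below assumes traffic is spread evenly across the surviving nodes, which is a simplification; the node counts used in the example are illustrative and do not come from Microsoft.

```python
def amplification(total_nodes: int, failed_nodes: int) -> float:
    """Factor by which per-node load grows when failed nodes shed their traffic.

    Assumes total traffic stays constant and is spread evenly across the
    remaining healthy nodes (a simplifying assumption).
    """
    healthy = total_nodes - failed_nodes
    if healthy <= 0:
        raise ValueError("no healthy nodes remain")
    return total_nodes / healthy


def load_per_healthy_node(total_nodes: int, failed_nodes: int, total_load: float) -> float:
    """Load carried by each surviving node under the same even-spread assumption."""
    healthy = total_nodes - failed_nodes
    if healthy <= 0:
        raise ValueError("no healthy nodes remain")
    return total_load / healthy
```

For example, if half of a hypothetical 100-node fleet drops out, `amplification(100, 50)` is 2.0: each surviving node carries twice its normal load, which can push otherwise healthy nodes toward timeouts.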

Because some of KnowBe4’s services depend on Azure Front Door for routing and Microsoft’s Office 365 endpoints (e.g., appsforoffice.microsoft.com) for add-in initialization, no failover mechanism could circumvent this Microsoft-controlled dependency during the event.

Scope of Impact

The incident affected systems globally that use these foundational services, and it followed Microsoft’s global impact window (15:45 UTC, Oct 29 to 00:05 UTC, Oct 30):

Prevent & Protect: Outlook web add-ins failed to load; email send operations hung with error messages. Gateway Moderation unaffected.

Defend: Admin console login unavailable due to Front Door authentication dependency; email processing unaffected.

PAB – Hybrid & MSR: Add-in failed to initialize via appsforoffice.microsoft.com; Gmail extension unaffected.

KSC: Reporting UI and login unavailable; no mailflow impact.

Webforms: External forms temporarily inaccessible as certificates and routing were managed via Front Door.

KnowBe4 Deployment Center: Unable to request new magic links or access onboarding UI; some in-progress deployments failed but were recoverable after service restoration.

Workspace: No confirmed customer impact; active sessions remained stable.

No customer data loss occurred, and normal operation resumed once Microsoft completed its rollback.

Detailed Timeline (UTC)

Oct 29 15:45

Microsoft Azure Front Door Impact Begins

Global service disruption begins after an inadvertent tenant configuration change within Azure Front Door (AFD). Customers and Microsoft services begin to experience latency, timeouts, and connectivity errors.

16:04

Microsoft Investigation Initiated

Azure monitoring systems trigger alerts. Microsoft engineers begin reviewing recent AFD configuration changes.

16:15

Root-Cause Isolation Begins

Microsoft identifies the likely source as a tenant configuration deployment that entered an invalid state across global nodes.

16:18

First Public Communication Posted by Microsoft

Initial Microsoft status update published to Azure Status page. Internal KnowBe4 monitoring detects elevated timeouts across Office web add-ins and administrative consoles.

16:20

Targeted Notifications Issued

Microsoft sends targeted impact notifications through Azure Service Health. KnowBe4 teams begin correlation of customer reports with AFD dependency failures.

16:22

KnowBe4 Publishes StatusPage Alerts Acknowledging Outage

“We are investigating elevated error rates and loading issues across multiple products. Initial investigations point to a widespread Azure incident. More info will be added after our engineering teams are able to assess the scope of the issues.” (Status Page)

17:26

Azure Portal Fails Away from Front Door

Microsoft routes Azure Portal traffic off AFD as part of mitigation. KnowBe4 observes continuing errors across products using AFD endpoints.

17:30

Configuration Changes Blocked Globally

Microsoft freezes all customer and internal configuration updates to stop further propagation of the faulty state.

17:40

‘Last Known Good’ Configuration Deployed

Microsoft initiates rollout of the most recent validated configuration across AFD infrastructure. KnowBe4 confirms stabilization in limited regions.

18:30

Global Configuration Push Begins

Fixed configuration distributed worldwide. Gradual traffic recovery observed. KnowBe4 services begin to auto-recover region by region.

18:45

Manual Node Recovery and Rebalancing

Microsoft engineers start manually recovering affected AFD nodes and gradually restore routing to healthy nodes.

19:00

KnowBe4 Internal Status Update

Engineering publishes initial internal assessment: backend mailflow unaffected, user-facing add-ins and consoles unavailable.

20:30

Partial Recovery Confirmed

Microsoft reports substantial improvement in latency and error rates. KnowBe4 verifies restored access for portions of US and EU customer traffic.

23:15

Downstream Dependencies Stabilize

Microsoft confirms mitigation for PowerApps and related services. KnowBe4 validates near-normal operation across most regions.

Oct 30 00:05

Global Mitigation Complete

Microsoft confirms full restoration of Azure Front Door and Office 365 add-in endpoints. KnowBe4 services fully recovered with no remaining customer impact.

FINDINGS AND MITIGATIONS

1. Azure Front Door Dependency

Finding: All impacted KnowBe4 applications use Azure Front Door for secure global routing, TLS termination, and authentication relay.

Mitigations:

  • Continued engagement with Microsoft for further clarity on resilience plans and post-mortem recommendations.

2. Microsoft Office Add-In Initialization

Finding: Office web add-ins are hard-coded to initialize through appsforoffice.microsoft.com before executing customer code.

Mitigations:

  • Request Microsoft to consider manifest-level fail-open or offline initialization mechanisms.
  • Internal research into safe caching of Office.js to improve resilience testing (acknowledging Microsoft does not currently support self-hosting).
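
To make the fail-open idea concrete, the following sketch contrasts fail-open and fail-closed handling when a pre-send check depends on an unreachable service. It is a generic illustration only: the `DependencyUnavailable` exception, the `send_message` function, and its parameters are hypothetical and do not describe how Office add-ins or KnowBe4 products behave today.

```python
from typing import Callable


class DependencyUnavailable(Exception):
    """Raised when a required external dependency cannot be reached."""


def send_message(
    message: str,
    check: Callable[[str], bool],
    deliver: Callable[[str], None],
    fail_open: bool,
) -> str:
    """Run a pre-send check, then deliver; choose behavior if the check cannot run."""
    try:
        allowed = check(message)
    except DependencyUnavailable:
        if not fail_open:
            # Fail-closed: block the operation while the dependency is down.
            return "blocked: dependency unavailable"
        # Fail-open: let the operation proceed without the check.
        deliver(message)
        return "sent: checks skipped (fail-open)"
    if not allowed:
        return "blocked: policy"
    deliver(message)
    return "sent"
```

The trade-off is that fail-open keeps users working during a dependency outage at the cost of skipping the check, so it suits cases where a backend control still applies afterwards.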

3. Monitoring and Communications

Finding: The outage was initiated within Microsoft’s control plane and communicated via Azure Status and Service Health dashboards.

Mitigations:

  • Enhanced internal telemetry to detect third-party latency patterns earlier.
  • Streamlined incident communication templates to accelerate customer notifications during vendor outages.
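
One simple way to surface third-party degradation early is a sliding-window error-rate check over recent dependency calls. The sketch below is a minimal illustration under assumed parameters; the `ErrorRateDetector` class, the window size, and the threshold are hypothetical and are not KnowBe4's actual telemetry design.

```python
from collections import deque


class ErrorRateDetector:
    """Flags a dependency as degraded when recent failures exceed a threshold."""

    def __init__(self, window_size: int = 20, threshold: float = 0.5) -> None:
        # Keep only the most recent `window_size` outcomes.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record the outcome of one call to the dependency."""
        self.results.append(ok)

    def error_rate(self) -> float:
        """Fraction of failed calls in the current window (0.0 when empty)."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def degraded(self) -> bool:
        """True once the window is full and the error rate meets the threshold."""
        full = len(self.results) == self.results.maxlen
        return full and self.error_rate() >= self.threshold
```

Requiring a full window before alerting avoids firing on a single early failure; a production detector would likely also consider latency, not only errors.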

CUSTOMER IMPACT

Between 15:45 UTC on October 29 and 00:05 UTC on October 30, 2025, customers experienced periodic errors, timeouts, and login failures across Microsoft-integrated KnowBe4 interfaces.

Administrative consoles, add-ins, and forms were unavailable for segments of this window, depending on regional propagation of the faulty AFD configuration.

Backend systems—including Prevent, Protect, Defend, and PhishER pipelines—continued to deliver mail security and training events without interruption.

Following Microsoft’s rollback and global node rebalancing, service availability gradually returned to normal with no customer action required.

PREVENTIVE MEASURES

Improved Internal Monitoring and Observability

To accelerate detection and correlation of third-party dependency failures, new synthetic monitoring tests have been added to continuously validate access to appsforoffice.microsoft.com, Azure Front Door endpoints, and related Microsoft authentication paths.
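
A synthetic check of this kind can be as small as a timed HTTP request with a pass/fail verdict. The sketch below is an assumption-laden illustration, not KnowBe4's monitoring code: `ProbeResult`, `probe`, `http_fetch`, and the success criterion (any 2xx or 3xx status) are hypothetical choices. The fetch function is injectable so the logic can be exercised without network access.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    url: str
    ok: bool
    latency_s: float
    detail: str


def http_fetch(url: str, timeout_s: float) -> int:
    """Fetch a URL with the standard library and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_fetch,
    timeout_s: float = 5.0,
) -> ProbeResult:
    """Time one request and classify it as healthy or unhealthy."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
        ok = 200 <= status < 400
        detail = f"HTTP {status}"
    except Exception as exc:  # timeouts, DNS failures, and HTTP errors all count as failures
        ok = False
        detail = f"{type(exc).__name__}: {exc}"
    return ProbeResult(url, ok, time.monotonic() - start, detail)
```

A scheduler could run `probe` at a fixed interval against the Office.js endpoint on appsforoffice.microsoft.com and against AFD-fronted URLs; the exact URLs and interval are deployment-specific and are not stated in this report.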

Add-in and Client Resilience Testing

Although Office web add-ins cannot currently bypass Microsoft’s initialization endpoint, KnowBe4 is:

  • Conducting research into safe caching of Office.js libraries for limited resilience testing.
  • Exploring options to detect and message users when Microsoft dependencies are unavailable, improving transparency during third-party outages.

Customer Communication Enhancements

KnowBe4 has updated its incident-response and communication templates to provide faster, clearer updates when external infrastructure failures occur.

  • This includes standardized language to distinguish between core service availability (mailflow and backend processing) and third-party integration impacts (e.g., Office add-ins, cloud providers, or consoles).

CONCLUSION

This incident was caused by a Microsoft-initiated tenant configuration error within Azure Front Door, which propagated globally and temporarily disrupted routing and authentication services for Microsoft and dependent applications.

KnowBe4’s platform maintained operational integrity and security throughout the event; however, user-facing Microsoft-dependent interfaces were interrupted until Azure Front Door was fully restored.

KnowBe4 continues to work with Microsoft to understand root-cause remediation and to evaluate resilience enhancements for critical dependencies. We remain committed to transparency, reliability, and continuous improvement of our service availability for all customers.

Posted Nov 04, 2025 - 16:50 UTC

Resolved

This incident has been resolved.
Posted Oct 30, 2025 - 13:42 UTC

Monitoring

While our systems are operational again, we are continuing to monitor the health of the overall platform and all its subcomponents to ensure stability.

Further information can be found on Microsoft's status page: https://azure.status.microsoft/en-gb/status
Posted Oct 30, 2025 - 00:11 UTC

Update

We are continuing to work on a fix for this issue.
Posted Oct 29, 2025 - 21:49 UTC

Identified

Microsoft has performed a rollback to their "Last Known Good" configuration and we hope to see signs of recovery soon.

Note: the upstream Azure outage is also impacting a Microsoft-hosted office.js library, resulting in loading issues with Hybrid/Ribbon PAB and the KnowBe4 Email Security Add-in.

We'll continue to post on our status page with any new information or updates.

Further information can be found on Microsoft's status page: https://azure.status.microsoft/en-gb/status
Posted Oct 29, 2025 - 19:49 UTC

Update

We continue to investigate elevated error rates and loading issues across multiple products. This outage is due to a widespread Azure incident. More info will be added after our engineering teams are able to assess the scope of the issues.

Further information can be found on Microsoft's status page: https://azure.status.microsoft/en-gb/status
Posted Oct 29, 2025 - 17:28 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 17:18 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 17:17 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 17:06 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 16:46 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 16:37 UTC

Update

We are continuing to investigate this issue.
Posted Oct 29, 2025 - 16:35 UTC

Investigating

We are investigating elevated error rates and loading issues across multiple products. Initial investigations point to a widespread Azure incident. More info will be added after our engineering teams are able to assess the scope of the issues.
Posted Oct 29, 2025 - 16:22 UTC

This incident affected: Prevent (Console, Mail Flow, Web Add-In), Protect (Console, Mail Flow, Web Add-In), SecurityCoach (Security Vendor Integration Service), PhishER (Integrations, PhishRIP), Defend (Console, Mail Flow), KnowBe4 Security Awareness Training (KSAT) (User Provisioning), KnowBe4 Security Center, Secure Web Forms, and Phish Alert Button.