Mozilla explains the January 2022 Firefox outage that blocked connections
On January 13, 2022, Firefox users from all over the world started to report connection issues. The browser failed to connect to any site and users were reporting hangs and crashes.
Mozilla published a detailed technical explanation of the incident on the company's Mozilla Hacks website on February 2, 2022.
The organization received reports about Firefox hanging during connection attempts on January 13, 2022. At the time, it saw that crash reports were spiking but did not have much information about what was causing the issue.
Mozilla engineers discovered that a network request was causing the hangs for Firefox users. Engineers looked at recent changes or updates, but did not find any that could cause the issue that users experienced.
Mozilla suspected that the issue could have been caused by a recent "invisible" configuration change by one of the cloud providers that it uses for load balancing. The organization uses the infrastructure of several providers for services such as crash reporting, telemetry, updating or certificate management.
An inspection showed that no settings had been changed, but engineers noticed that the Telemetry service was serving HTTP/3 connections, which it had not done before. Mozilla disabled HTTP/3, and users could finally use Firefox again to connect to services. The HTTP/3 setting at the cloud provider had been left at its automatic value.
Mozilla investigated the issue in more detail after the most pressing issue had been taken care of. All HTTP/3 connections go through the networking stack Necko, but Rust components use a library called viaduct to call Necko.
Necko checks whether a content-length header is present and adds it if it is not. HTTP/3 relies on that header to determine the request size. Necko's presence check ignores case, but the HTTP/3 code looked the header up case-sensitively. Viaduct automatically converted the header names of every request that passed through it to lower case; this meant that any request through viaduct that added a content-length header passed the Necko check but then ran into trouble in the HTTP/3 code, which could not find the header.
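The mismatch can be sketched in a few lines of Python. This is only an illustrative model, not Firefox code: the function names are hypothetical, and it assumes that the presence check ignores case while the HTTP/3 lookup does not.

```python
# Hypothetical model of the header handling described above; none of these
# names come from Firefox's source code.

def viaduct_like_send(headers):
    """Lower-case every header name, as viaduct is described as doing."""
    return {name.lower(): value for name, value in headers.items()}

def ensure_content_length(headers, body):
    """Presence check that ignores case; adds the header only if it is missing."""
    if not any(name.lower() == "content-length" for name in headers):
        headers = {**headers, "Content-Length": str(len(body))}
    return headers

def http3_request_size(headers):
    """Case-sensitive lookup: only the canonical spelling is found."""
    return headers.get("Content-Length")

body = b"telemetry ping"

# Request without its own header: the check adds the canonical spelling.
direct = ensure_content_length({}, body)

# Request that set the header itself and went through the lower-casing layer.
via_rust = ensure_content_length(
    viaduct_like_send({"Content-Length": str(len(body))}), body
)

print(http3_request_size(direct))    # the size is found
print(http3_request_size(via_rust))  # None: header is present but not found
```

In this model the second request passes the presence check because a lower-case header exists, so nothing is added, and the later case-sensitive lookup comes back empty.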
The only Rust component that uses the network stack and adds a content-length header is the Telemetry component of the Firefox web browser. Mozilla notes that this is why disabling Telemetry in Firefox resolved the issue on the user side. Disabling HTTP/3 also resolved it.
The issue would cause an infinite loop, which blocked all further network communication because "all network requests go through one socket thread" according to Mozilla.
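Why one stuck request can stall everything else can be shown with a small Python sketch. It is a simplified model under stated assumptions, not Firefox's networking code: the task names are hypothetical, a single worker thread stands in for the socket thread, and the "endless" loop can be released so that the example terminates.

```python
import queue
import threading

completed = []
started = threading.Event()   # set once the stuck task begins running
release = threading.Event()   # lets the demo end the otherwise endless loop

def telemetry_upload():
    # Stand-in for the looping request: spins until the demo releases it.
    started.set()
    while not release.is_set():
        release.wait(0.01)
    completed.append("telemetry")

def page_load():
    completed.append("page load")

def socket_thread(tasks):
    # A single worker handles every request strictly in order.
    while True:
        task = tasks.get()
        if task is None:
            break
        task()

tasks = queue.Queue()
worker = threading.Thread(target=socket_thread, args=(tasks,))
worker.start()
tasks.put(telemetry_upload)
tasks.put(page_load)

started.wait()
blocked_snapshot = list(completed)  # still empty: the page load cannot run yet

release.set()      # a real infinite loop would never reach this point
tasks.put(None)
worker.join()
print(blocked_snapshot, completed)
```

While the first task spins, the queued page load never gets a turn; it only completes after the loop is released, which in the real incident never happened on its own.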
Mozilla states that it has learned several lessons from the issue. It is investigating all load balancers and reviewing their configurations so that similar issues can be avoided in the future. The deployment of HTTP/3 at Google, which was the cloud provider in question, was unannounced. Lastly, Mozilla plans to run more system tests in the future with "different HTTP versions".
Mozilla reacted quickly to the emergency and has resolved it. The incident may have damaged Firefox's reputation, and some users may have switched to a different browser in the process. Mozilla should ask itself whether it is a good idea to rely on cloud infrastructure that is operated by its biggest rival in the browser space. Some Firefox users may also suggest that the organization look at the browser's handling of requests to make sure that non-essential ones, e.g. Telemetry or crash reporting, will never again block connections that the user attempts to make.
Now You: what is your take on the incident?