GitHub's Reliability Journey: Addressing Rapid Scale and Ensuring Availability

Introduction

GitHub has experienced two recent incidents that fell short of the reliability standards we hold ourselves to. We sincerely apologize for the disruption caused to your workflows. In this article, we provide a transparent look at the challenges behind these incidents, the immediate fixes implemented, and the long-term changes underway to make GitHub more robust for everyone.

Source: github.blog

The Exponential Growth Challenge

The software development landscape is evolving at an unprecedented pace. In October 2025, GitHub began executing a plan to increase capacity by 10×, with the goal of substantially improving reliability and failover. By February 2026, it had become clear that we needed to design for 30× today’s scale.

What drove this shift? A dramatic acceleration in agentic development workflows since late December 2025. Almost every metric—repository creation, pull request activity, API usage, automation, and large-repository workloads—is climbing sharply. This isn’t a strain on one component; it’s a systemic challenge. A single pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. As scale multiplies, minor inefficiencies cascade: queues deepen, cache misses trigger database overload, indexes lag, retries amplify traffic, and one slow dependency can ripple across many features.
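To make "retries amplify traffic" concrete, here is a minimal arithmetic sketch. It assumes a hypothetical call chain in which every layer independently retries a failing downstream call. The function names and numbers are illustrative only and do not describe GitHub's actual services or retry policy. Capped exponential backoff with jitter is shown as one common mitigation, not as the fix GitHub applied.

```python
import random


def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case requests reaching the deepest dependency for ONE user
    request, when each layer makes `attempts_per_layer` attempts in total
    and every attempt at a layer triggers a full retry cycle below it."""
    return attempts_per_layer ** layers


def backoff_bounds(base: float, cap: float, attempts: int) -> list[float]:
    """Upper bounds (seconds) for capped exponential backoff:
    min(cap, base * 2**n) for attempt n = 0, 1, 2, ..."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def jittered_delay(bound: float, rng=random) -> float:
    """'Full jitter': sleep a random time in [0, bound] so that many
    clients retrying at once do not hit the dependency in lockstep."""
    return rng.uniform(0.0, bound)
```

With 3 attempts at each of 3 layers, a single failing request can reach the deepest dependency 3 ** 3 = 27 times, which is why a small slowdown in one place can turn into a flood somewhere else.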

Our Approach to Reliability

We’ve set clear priorities: availability first, capacity second, then new features. To achieve this, we are systematically reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and migrating performance-sensitive paths to systems designed for these workloads. At its core, this is distributed systems engineering: reducing hidden coupling, limiting blast radius, and ensuring graceful degradation when a subsystem is under pressure. Progress has been swift, but the recent incidents show there is still work to be done.
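One widely used way to get graceful degradation is a circuit breaker: after repeated failures, callers stop hitting the struggling dependency for a while and serve a fallback instead. The sketch below is a generic illustration of that pattern. The class name, thresholds, and fallback behavior are assumptions made for this example, not a description of GitHub's implementation.

```python
import time


class CircuitBreaker:
    """Minimal single-threaded circuit breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means "closed"

    def call(self, fn, fallback):
        # While open, skip the dependency entirely and degrade gracefully.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Otherwise we are "half-open": let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open)
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a non-critical lookup as `breaker.call(fetch_counts, lambda: None)` (both names hypothetical), so a page still renders without that data while the dependency recovers.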

Short-Term Actions

Bottlenecks appeared faster than anticipated. We resolved the most immediate ones by:

Moving webhooks to a different backend, away from MySQL, to reduce database load.
Redesigning the user session cache to improve efficiency.
Reworking authentication and authorization flows to substantially cut database queries.
Leveraging our migration to Azure to spin up significantly more compute resources.

These steps stabilized the platform in the short run while laying groundwork for deeper changes.
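As a rough illustration of how caching can cut database queries on hot paths such as session or permission checks, here is a minimal read-through cache with a time-to-live. The `loader` callback stands in for a database query. All names and the TTL value are hypothetical, and a real system would also need explicit invalidation so that revoked sessions or permissions are not served from cache.

```python
import time


class TTLCache:
    """Read-through cache: call `loader` only on a miss or after expiry."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader        # e.g. a function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no database work
        value = self._loader(key)    # miss or expired: one load, then cache
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key immediately, for example after a permission change."""
        self._entries.pop(key, None)
```

The trade-off is staleness: a longer TTL saves more queries but widens the window in which cached data can be out of date.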

Long-Term Strategies

Next, we focused on isolating critical services, especially Git and GitHub Actions, from other workloads. This limits the blast radius of any single failure. The process began with careful dependency analysis and traffic tiering to understand what must be separated and how to keep legitimate traffic flowing during attacks. We addressed risks in order of their impact. Concurrently, we accelerated the migration of performance- and scale-sensitive code from the Ruby monolith to Go, which supports higher concurrency at lower latency.
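The idea behind traffic tiering and isolation can be sketched with a bulkhead: each tier gets its own concurrency budget, so saturating one tier cannot consume the capacity reserved for another. The tier names and limits below are invented for illustration and do not reflect how GitHub classifies or limits traffic.

```python
import threading


class Bulkhead:
    """Per-tier concurrency limits (illustrative sketch)."""

    def __init__(self, limits: dict[str, int]):
        self._limits = dict(limits)                    # tier -> max in flight
        self._in_flight = {tier: 0 for tier in limits}
        self._lock = threading.Lock()

    def try_acquire(self, tier: str) -> bool:
        """Admit the request if its tier has spare capacity, else shed it."""
        with self._lock:
            if self._in_flight[tier] >= self._limits[tier]:
                return False
            self._in_flight[tier] += 1
            return True

    def release(self, tier: str) -> None:
        """Return a slot to the tier once the request has finished."""
        with self._lock:
            if self._in_flight[tier] > 0:
                self._in_flight[tier] -= 1
```

For example, with `Bulkhead({"git": 2, "anonymous": 1})`, a burst of anonymous traffic is rejected once its single slot is taken, while both `git` slots remain available.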

While the move from smaller custom data centers to the public cloud was already underway, we also started work on a multi-cloud architecture. This provides redundancy and flexibility, reducing the risk that a single provider outage affects GitHub’s availability.

Conclusion

GitHub is committed to transparency about our reliability challenges and the steps we’re taking. The exponential growth driven by agentic development workflows demands that we rethink every layer of our infrastructure. We are investing in reducing hidden dependencies, improving isolation, and moving to modern architectures. We apologize again for the recent incidents and thank you for your patience as we build a more resilient GitHub. For ongoing updates, please follow our status page.
