{"id":20776,"date":"2024-11-22T14:34:31","date_gmt":"2024-11-22T07:34:31","guid":{"rendered":"https:\/\/gcloudvn.com\/?p=20776"},"modified":"2024-11-25T08:39:25","modified_gmt":"2024-11-25T01:39:25","slug":"speed-scale-and-reliability-25-years-of-google-data-center-networking-evolution","status":"publish","type":"post","link":"https:\/\/gcloudvn.com\/en\/kienthuc\/speed-scale-and-reliability-25-years-of-google-data-center-networking-evolution\/","title":{"rendered":"Speed, scale and reliability: 25 years of Google data-center networking evolution"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Rome wasn\u2019t built in a day, and neither was Google\u2019s network. But 25 years in, we\u2019ve built out network infrastructure with scale and technical sophistication that\u2019s nothing short of remarkable.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20778\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134757.801.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134757.801.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134757.801-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">It\u2019s all the more impressive because in the beginning, Google\u2019s network infrastructure was relatively simple. But as our user base and the demand for our services grew exponentially, we realized that we needed a network that could handle an unprecedented scale of data and traffic, and that could adapt to dynamic traffic patterns as our workloads changed over time. 
This ignited a 25-year journey marked by numerous engineering innovations and milestones, ultimately leading to our current fifth-generation Jupiter data center network architecture, which now scales to 13 Petabits\/sec of bisection bandwidth. To put this data rate in perspective, this network could support a video call (at 1.5 Mb\/s) for all 8 billion people on Earth!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Today, we have hundreds of Jupiter fabrics deployed around the world, simultaneously supporting hundreds of services, billions of active daily users, all of our <a href=\"https:\/\/gcloudvn.com\/en\/google-cloud-platform\/\">Google Cloud<\/a> customers, and some of the largest ML training and serving infrastructures in the world. I would like to share more about our journey as we look ahead to the next generation of data center network infrastructure.<\/span><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewbox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewbox=\"0 0 24 24\" version=\"1.2\" baseprofile=\"tiny\"><path d=\"M18.2 
9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/speed-scale-and-reliability-25-years-of-google-data-center-networking-evolution\/#Nguyen_tac\" >Guiding principles<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/speed-scale-and-reliability-25-years-of-google-data-center-networking-evolution\/#Tai_lieu_tham_khao\" >Further resources<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Nguyen_tac\"><\/span>Guiding principles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Our network evolution has been guided by a few key principles:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Anything, anywhere:<\/b><span style=\"font-weight: 400;\"> Our data center networks support efficiency and simplicity by allowing large-scale jobs to be placed anywhere among 100k+ servers within the same network fabric, with high-speed access to needed storage and support services. 
This scale improves application performance for internal and external workloads and eliminates internal fragmentation.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Predictable, low latency:<\/b><span style=\"font-weight: 400;\"> We prioritize consistent performance and minimal tail latency by provisioning bandwidth headroom, maintaining 99.999% network availability, and proactively managing congestion through end-host and fabric cooperation.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Software-defined and systems-centric:<\/b><span style=\"font-weight: 400;\"> Leveraging software-defined networking (SDN) for flexibility and agility, we qualify and release dozens of new features every two weeks across our global network.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Incremental evolution and dynamic topology:<\/b><span style=\"font-weight: 400;\"> Incremental evolution helps us to refresh the network granularly (rather than bringing it down wholesale), while dynamic topology helps us to continuously adapt to changing workload demands. The combination of optical circuit switching and SDN enables in-place physical upgrades and an ever-evolving, heterogeneous network that supports multiple hardware generations in a single fabric.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Traffic engineering and application-centric QoS:<\/b><span style=\"font-weight: 400;\"> Optimizing traffic flows and ensuring Quality of Service help us tailor the network to each application's needs.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Integrating across these principles is the basis of our work. The network is the foundation of reliability for all other compute services, from storage to AI. As such, the network must fail last and fail least. 
To support this foundational responsibility, we rigorously define and monitor every bad minute across hundreds of clusters and millions of ports in our global network. Our progress on reliability is such that our in-house, software-defined Jupiter networks deliver 50x more reliability than prior versions of our data center networks.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20777\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134925.737.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134925.737.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/11\/Thang-72024-2024-11-22T134925.737-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p><b>2015 - Jupiter, the first Petabit network<\/b><\/p>\n<p><span style=\"font-weight: 400;\">In a seminal paper, we showed that Jupiter data center networks scaled to 1.3 Pb\/s of aggregate bandwidth by leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN). This generation of Jupiter was the culmination of five generations of data center networks developed in house by the Google networking team. At that time, this data rate, in one Google data center, was more than the estimated aggregate IP traffic data rate for the global internet.<\/span><\/p>\n<p><b>2022 - Enabling 6 Petabit per second<\/b><\/p>\n<p><span style=\"font-weight: 400;\">In 2022 we announced that our Jupiter networks scaled to over 6 Pb\/s, with deep integration of optical circuit switching (OCS), wavelength-division multiplexing (WDM), and a highly scalable Orion SDN controller. 
These technologies unlocked a range of advancements, including incremental network builds, enhanced performance, reduced costs, lower power consumption, dynamic traffic management, and seamless upgrades.<\/span><\/p>\n<p><b>2023 - 13 Petabit per second network<\/b><\/p>\n<p><span style=\"font-weight: 400;\">We have further enhanced Jupiter to support native 400 Gb\/s link speeds in the network core. The fundamental building block of Jupiter networks (called the aggregation block) now consists of 512 ports of 400 Gb\/s of connectivity both to end hosts and to the rest of the data center, for an aggregate of 204.8 Tb\/s of bidirectional non-blocking bandwidth per block. We support 64 such blocks for a total bisection bandwidth of 64*204.8 Tb\/s = 13.1 Pb\/s. This technology has been powering Google's production data centers for over a year, fueling the rapid advancement of artificial intelligence, machine learning, web search, and other data-intensive applications.<\/span><\/p>\n<p><b>2024 and beyond - Extreme networking in the age of AI<\/b><\/p>\n<p><span style=\"font-weight: 400;\">While celebrating over two decades of innovation in data center networking, we\u2019re already charting the course for the next generation of network infrastructure to support the age of AI. For example, our teams are busy working on networking infrastructure needs for our upcoming A3 Ultra VMs, which feature NVIDIA ConnectX-7 networking and support non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE (RDMA over Converged Ethernet), and for our future offerings based on NVIDIA GB200 NVL72.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Over the next few years, we will deliver significant advances in network scale and bandwidth, both per-port and network-wide. We will continue to push the boundaries of end-host integration, including the transport and congestion control stack, and streamline network stages to achieve even lower latency with tighter tails. 
Real-time topology engineering, deeper integration with the compute and storage stacks, and continued refinements to host-based load balancing techniques will further improve network reliability and latency. With these innovations, our network will remain a cornerstone for the transformative applications and services that enrich the lives of our users throughout the world while simultaneously supporting the groundbreaking AI capabilities that power both our internal services and Google Cloud products.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are excited to take on these challenges and opportunities to see what the next 25 years hold for Google networking!<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Tai_lieu_tham_khao\"><\/span><b>Further resources<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google\u2019s Datacenter Network<\/b><span style=\"font-weight: 400;\"> (SIGCOMM \u201915) [<\/span><a href=\"https:\/\/research.google\/pubs\/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Journey of the first Jupiter datacenter network leveraging merchant switch silicon, Clos topologies and Software Defined Networking (SDN).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">First deployed in production in 2012.<\/span><\/li>\n<\/ul>\n<\/li>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale<\/b><span style=\"font-weight: 400;\"> (arxiv.org, 2022) [<\/span><a href=\"https:\/\/arxiv.org\/abs\/2208.10041\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">First deployed in production in 2013.<\/span><\/li>\n<\/ul>\n<\/li>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Orion: Google's Software-Defined Networking Control Plane<\/b><span style=\"font-weight: 400;\"> (NSDI \u201921) [<\/span><a href=\"https:\/\/research.google\/pubs\/orion-googles-software-defined-networking-control-plane\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Google's high-performance, scalable, intent-based distributed SDN platform used in both datacenter and wide area networks.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">First deployed in production in 2016.<\/span><\/li>\n<\/ul>\n<\/li>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking<\/b><span style=\"font-weight: 400;\"> (SIGCOMM \u201922) [<\/span><a href=\"https:\/\/research.google\/pubs\/jupiter-evolving-transforming-googles-datacenter-network-via-optical-circuit-switches-and-software-defined-networking\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Enabling technologies: OCS (2013), Orion SDN (2016), 200Gbps networking (2020), direct-connect topology (2017), dynamic traffic engineering (2018), dynamic topology engineering (2021).<\/span><\/li>\n<\/ul>\n<\/li>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Swift: Delay is Simple and Effective for Congestion Control in the Datacenter<\/b><span style=\"font-weight: 400;\"> (SIGCOMM \u201920) [<\/span><a href=\"https:\/\/research.google\/pubs\/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Swift, a congestion control protocol using hardware timestamps and AIMD control with a delay target, delivers excellent performance in Google datacenters with low flow completion times for short RPCs and high throughput for long RPCs.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">First deployed in production in 2017.<\/span><\/li>\n<\/ul>\n<\/li>\n
<li style=\"font-weight: 400;\" aria-level=\"1\"><b>PLB: Congestion Signals are Simple and Effective for Network Load Balancing<\/b><span style=\"font-weight: 400;\"> (SIGCOMM \u201922) [<\/span><a href=\"https:\/\/research.google\/pubs\/plb-congestion-signals-are-simple-and-effective-for-network-load-balancing\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">link<\/span><\/a><span style=\"font-weight: 400;\">]<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Protective Load Balancing (PLB) is a simple, effective host-based load balancing design that reduces network congestion and improves performance by randomly changing paths for congested connections, preferring to repath after idle periods to minimize packet reordering.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">First deployed in production in 2020.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>"
,"protected":false},"excerpt":{"rendered":"<p>Rome wasn\u2019t built in a day, and neither was Google\u2019s network. However, in just 25 years, Google has built network infrastructure with a scale and sophistication in terms of&hellip;<\/p>","protected":false},"author":2,"featured_media":20778,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[1,135],"tags":[],"class_list":["post-20776","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kienthuc","category-google-cloud-platform","entry","has-media"],"_links":{"self":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/20776","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/comments?post=20776"}],"version-history":[{"count":0,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/20776\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media\/20778"}],"wp:attachment":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media?parent=20776"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/categories?post=20776"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/tags?post=20776"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}