{"id":9393,"date":"2022-08-17T09:48:57","date_gmt":"2022-08-17T02:48:57","guid":{"rendered":"https:\/\/gcloudvn.com\/?p=9393"},"modified":"2023-03-22T13:39:29","modified_gmt":"2023-03-22T06:39:29","slug":"sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money","status":"publish","type":"post","link":"https:\/\/gcloudvn.com\/en\/kienthuc\/sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money\/","title":{"rendered":"Sharing is caring: How NVIDIA GPU sharing on GKE saves you money"},"content":{"rendered":"<p style=\"text-align: justify;\">Developers and data scientists are increasingly turning to <a href=\"https:\/\/gcloudvn.com\/en\/google-kubernetes-engine-gke\/\">Google Kubernetes Engine<\/a> (GKE) to run demanding workloads such as machine learning, visualization\/rendering and high-performance computing, taking advantage of GKE\u2019s support for NVIDIA GPUs. In today\u2019s economic climate, customers are under pressure to do more with less, and cost savings are a top priority. To help, in July Google launched GPU time-sharing on GKE, which allows multiple containers to share a single physical GPU, thereby improving its utilization. 
In addition to GKE\u2019s existing support for multi-instance GPUs on NVIDIA A100 Tensor Core GPUs, this feature extends the benefits of GPU sharing to all GPU families on GKE.<\/p>\n<p style=\"text-align: justify;\">Contrast this with open source Kubernetes, which only allows for allocation of one full GPU per container. For workloads that only require a fraction of the GPU, this results in under-utilization of the GPU\u2019s massive computational power. Examples of such applications include notebooks and chat bots, which stay idle for prolonged periods, and when they are active, only consume a fraction of the GPU.<\/p>\n<p style=\"text-align: justify;\">Underutilized GPUs are an acute problem for many inference workloads such as real-time advertising and product recommendations. Since these applications are revenue-generating, business-critical and latency-sensitive, the underlying infrastructure needs to handle sudden load spikes gracefully. 
While GKE\u2019s autoscaling feature comes in handy, not being able to share a GPU across multiple containers often leads to over-provisioning and cost overruns.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of contents<\/p>\n<\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money\/#GPU_chia_se_thoi_gian_trong_GKE\" >Time-sharing GPUs in GKE<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money\/#NVIDIA_multi-instance_GPU_MIG_trong_GKE\" >NVIDIA multi-instance GPU (MIG) in GKE<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money\/#GPU_chia_se_thoi_gian_so_voi_multi-instance_GPU\" >Time-sharing GPUs vs. multi-instance GPUs<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/sharing-is-caring-how-nvdia-gpu-sharing-on-gke-saves-you-money\/#Bat_dau_tu_hom_nay\" >Get started today<\/a><\/li><\/ul><\/nav><\/div>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"GPU_chia_se_thoi_gian_trong_GKE\"><\/span>Time-sharing GPUs in GKE<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\">GPU time-sharing works by allocating time slices to containers sharing a physical GPU in a round-robin fashion. Under the hood, time-slicing works by context switching among all the processes that share the GPU. At any point in time, only one container can occupy the GPU. However, at a fixed time interval, the context switch ensures that each container gets a fair time-slice.<\/p>\n<p style=\"text-align: justify;\">The great thing about time-slicing is that if only one container is using the GPU, it gets the full capacity of the GPU. If another container is added to the same GPU, then each container gets 50% of the GPU\u2019s compute time. This means time-sharing is a great way to oversubscribe GPUs and improve their utilization. 
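As a concrete sketch of this setup (cluster, node pool, and zone names below are placeholders; the `gpu-sharing-strategy` and `max-shared-clients-per-gpu` accelerator options follow GKE\u2019s GPU time-sharing documentation), a time-shared GPU node pool might be created like this:

```shell
# Create a GKE node pool whose single T4 GPU per node is time-shared
# by up to 4 containers. "demo-cluster", "gpu-timeshare-pool" and the
# zone are placeholder values for illustration.
gcloud container node-pools create gpu-timeshare-pool \
  --cluster=demo-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=4
```

Each container sharing the GPU still requests `nvidia.com/gpu: 1` in its pod spec; with time-sharing enabled, the node advertises up to `max-shared-clients-per-gpu` allocatable GPU units per physical GPU.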
By combining GPU sharing with GKE\u2019s industry-leading auto-scaling and auto-provisioning capabilities, you can automatically scale GPUs up or down, offering superior performance at lower costs.<\/p>\n<p style=\"text-align: justify;\">Early adopters of time-sharing GPU nodes are using the technology to turbocharge their use of GKE for demanding workloads. San Diego Supercomputer Center (SDSC) benchmarked the performance of time-sharing GPUs on GKE and found that even for the low-end T4 GPUs, sharing increased job throughput by about 40%. For the high-end A100 GPUs, GPU sharing offered a 4.5x throughput increase, which is truly transformational.<\/p>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"NVIDIA_multi-instance_GPU_MIG_trong_GKE\"><\/span>NVIDIA multi-instance GPU (MIG) in GKE<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\">GKE\u2019s GPU time-sharing feature is complementary to multi-instance GPUs, which allow you to partition a single NVIDIA A100 GPU into up to seven instances, thus improving GPU utilization and reducing your costs. Each instance, with its own high-bandwidth memory, cache and compute cores, can be allocated to one container, for a maximum of seven containers per single NVIDIA A100 GPU. Multi-instance GPUs provide hardware isolation between workloads, and consistent and predictable QoS for all containers running on the GPU.<\/p>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"GPU_chia_se_thoi_gian_so_voi_multi-instance_GPU\"><\/span>Time-sharing GPUs vs. multi-instance GPUs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\">You can configure time-sharing GPUs on any NVIDIA GPU on GKE, including the A100. 
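For comparison, a multi-instance GPU node pool can be sketched as follows (cluster, node pool, and zone names are placeholders; the `gpu-partition-size` option follows GKE\u2019s multi-instance GPU documentation):

```shell
# Create a node pool whose A100 GPU is split into seven 1g.5gb MIG
# partitions, each allocatable to a separate container.
# "demo-cluster", "gpu-mig-pool" and the zone are placeholder values.
gcloud container node-pools create gpu-mig-pool \
  --cluster=demo-cluster \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb
```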
Multi-instance GPUs are only available on A100 accelerators.<\/p>\n<p style=\"text-align: justify;\">If your workloads require hardware isolation from other containers on the same physical GPU, you should use multi-instance GPUs. A container that uses a multi-instance GPU instance can only access the compute and memory resources available to that instance. As such, multi-instance GPUs are better suited for parallel workloads that need predictable throughput and latency. But if there are fewer containers running on a multi-instance GPU than there are available instances, the remaining instances go unused.<\/p>\n<p style=\"text-align: justify;\">On the other hand, in the case of time-sharing, context switching lets every container access the full power of the underlying physical GPU. Therefore, if only one container is running, it still gets the full capacity of the GPU. Time-shared GPUs are ideal for workloads that need only a fraction of the GPU\u2019s power, as well as for burstable workloads.<\/p>\n<p style=\"text-align: justify;\">Time-sharing allows a maximum of 48 containers to share a physical GPU, whereas multi-instance GPUs on the A100 allow a maximum of seven partitions.<\/p>\n<p style=\"text-align: justify;\">If you want to maximize your GPU utilization, you can configure time-sharing for each multi-instance GPU partition. You can then run multiple containers on each partition, with those containers sharing access to the resources on that partition.<\/p>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Bat_dau_tu_hom_nay\"><\/span>Get started today<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\">The combination of GPUs and GKE is proving to be a real game-changer. GKE brings auto-provisioning, autoscaling and management simplicity, while GPUs bring superior processing power. 
With the help of GKE, data scientists, developers and infrastructure teams can build, train and serve workloads without having to worry about underlying infrastructure, portability, compatibility, load balancing and scalability issues. And now, with GPU time-sharing, you can match your workload acceleration needs with right-sized GPU resources. Moreover, you can leverage the power of GKE to automatically scale the infrastructure to efficiently serve your acceleration needs while delivering a better user experience and minimizing operational costs. To get started with time-sharing GPUs in GKE, check out the documentation.<\/p>\n<p style=\"text-align: justify;\">If your business is interested in the <a href=\"https:\/\/gcloudvn.com\/en\/google-cloud-platform\/\">Google Cloud<\/a>\u00a0Platform, you can contact Gimasys \u2013 a Google Premier Partner \u2013 for solution consulting tailored to your business\u2019s unique needs. Contact now:<\/p>\n<ul style=\"text-align: justify;\">\n<li aria-level=\"1\"><b>Gimasys \u2013 Google Cloud Premier Partner<\/b><\/li>\n<li aria-level=\"1\"><b>Hotline:\u00a0<\/b>Ha Noi:\u00a00987 682 505\u00a0\u2013 Ho Chi Minh:\u00a00974 417 099<\/li>\n<li aria-level=\"1\"><b>Email:\u00a0<\/b>gcp@gimasys.com<\/li>\n<\/ul>\n<p style=\"text-align: right;\"><em><strong>Source: Gimasys<\/strong><\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>Developers and data scientists are increasingly turning to Google Kubernetes Engine (GKE) to run demanding workloads such as machine learning, visualization\/rendering&hellip;<\/p>","protected":false},"author":2,"featured_media":9394,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[1,135],"tags":[],"class_list":["post-9393","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kienthuc","category-google-cloud-platform","entry","has-media"],"_links":{"self":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/9393","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/comments?post=9393"}],"version-history":[{"count":0,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/9393\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media\/9394"}],"wp:attachment":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media?parent=9393"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/categories?post=9393"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/tags?post=9393"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}