{"id":11070,"date":"2023-03-16T11:47:04","date_gmt":"2023-03-16T04:47:04","guid":{"rendered":"https:\/\/gcloudvn.com\/?p=11070"},"modified":"2023-03-20T11:10:25","modified_gmt":"2023-03-20T04:10:25","slug":"building-streaming-data-pipelines-on-google-cloud","status":"publish","type":"post","link":"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/","title":{"rendered":"Building streaming data pipelines on Google Cloud"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Many customers build streaming data pipelines to ingest, process and then store data for later analysis. Google will focus on a common pipeline design shown below. It consists of three steps:<\/span><\/p>\n<ul style=\"text-align: justify;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Data sources send messages with data to a Pub\/Sub topic.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Pub\/Sub buffers the messages and forwards them to a processing component.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">After processing, the processing component stores the data in BigQuery.<\/span><\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">For the processing component, Google will consider 3 alternatives, from basic to advanced: registration <a href=\"https:\/\/gcloudvn.com\/en\/bigquery\/\">BigQuery<\/a>, d\u1ecbch v\u1ee5 Cloud Run v\u00e0 Dataflow pipeline.<\/span><\/p>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11072 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-8.png\" alt=\"How Google Cloud builds a streaming data pipeline (1)\" width=\"512\" height=\"191\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-8.png 512w, 
https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-8-300x112.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-8-18x7.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewbox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewbox=\"0 0 24 24\" version=\"1.2\" baseprofile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Truong_hop_su_dung\" >Example use cases<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Nhap_du_lieu_voi_PubSub\" >Ingesting data with Pub\/Sub<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Cac_lua_chon_thay_the_de_xu_ly\" >Three alternatives for processing<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Phuong_phap_1_Luu_tru_du_lieu_khong_thay_doi_bang_cach_su_dung_dang_ky_BigQuery\" >Approach 1: Storing data unchanged using a BigQuery subscription<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Phuong_phap_2_Xu_ly_thu_rieng_le_bang_Cloud_Run\" >Approach 2: Processing messages individually using Cloud Run<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Phuong_phap_3_Xu_ly_nang_cao_va_tong_hop_thong_bao_bang_Dataflow\" >Approach 3: Advanced processing and aggregation of messages using Dataflow<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/building-streaming-data-pipelines-on-google-cloud\/#Phuong_phap_nao_se_phu_hop_voi_ban\" >Which approach should you choose?<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Truong_hop_su_dung\"><\/span><span style=\"font-size: 16px;\"><b>Example use cases<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\"><span 
style=\"font-weight: 400;\">Before Googlee dive deeper into the implementation details, let\u2019s look at a few example use cases of streaming data pipelines:<\/span><\/p>\n<ul style=\"text-align: justify;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Processing ad clicks. Receiving ad clicks, running fraud prediction heuristics on a click-by-click basis, and discarding or storing them for further analysis.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Canonicalizing data formats. Receiving data from different sources, canonicalizing them into a single data model, and storing them for later analysis or further processing.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Capturing telemetry. Storing user interactions and displaying real-time statistics, such as active users, or the average session length grouped by device type.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Keeping a change data capture log. Logging all database updates from a database to BigQuery through Pub\/Sub.<\/span><\/li>\n<\/ul>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Nhap_du_lieu_voi_PubSub\"><\/span><span style=\"font-size: 16px;\"><b>Ingesting data with Pub\/Sub<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Let\u2019s start at the beginning. You have one or multiple data sources that publish messages to a Pub\/Sub topic. Pub\/Sub is a fully-managed messaging service. You publish messages, and Pub\/Sub takes care of delivering the messages to one or many subscribers. 
The most convenient way to publish messages to Pub\/Sub <\/span><span style=\"font-weight: 400;\">is to use the client library.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">To authenticate with Pub\/Sub you need to provide credentials. If your data producer runs on Google Cloud, the client libraries take care of this for you and use <\/span><span style=\"font-weight: 400;\">the built-in service identity<\/span><span style=\"font-weight: 400;\">. If your workload is not running on <a href=\"https:\/\/gcloudvn.com\/en\/google-cloud-platform\/\">Google Cloud<\/a>, <\/span><span style=\"font-weight: 400;\">you should use identity federation<\/span><span style=\"font-weight: 400;\">, or as a last resort, <\/span><span style=\"font-weight: 400;\">download a service account key<\/span><span style=\"font-weight: 400;\"> (but make sure to have a strategy to rotate these long-lived credentials).<\/span><\/p>\n<h2 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Cac_lua_chon_thay_the_de_xu_ly\"><\/span><span style=\"font-size: 16px;\"><b>Three alternatives for processing<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">It\u2019s important to realize that some pipelines are straightforward, and some are complex. Straightforward pipelines perform no (or only lightweight) processing before persisting the data. 
Advanced pipelines aggregate groups of data to reduce data storage requirements and can have multiple processing steps.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Google will cover how to do processing using one of the following three options:<\/span><\/p>\n<ul style=\"text-align: justify;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A BigQuery subscription, a no-code pass-through solution that stores messages unchanged in a BigQuery dataset.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A Cloud Run service, for lightweight processing of individual messages without aggregation.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A Dataflow pipeline, for advanced processing (more on that later).<\/span><\/li>\n<\/ul>\n<h3 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Phuong_phap_1_Luu_tru_du_lieu_khong_thay_doi_bang_cach_su_dung_dang_ky_BigQuery\"><\/span><strong><span style=\"font-size: 16px;\"><i>Approach 1: Storing data unchanged using a BigQuery subscription<\/i><\/span><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11073 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-9.png\" alt=\"How Google Cloud builds a streaming data pipeline (2)\" width=\"512\" height=\"141\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-9.png 512w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-9-300x83.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-9-18x5.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The first approach is the most straightforward one. 
You can stream messages from a Pub\/Sub topic directly into a BigQuery dataset using a <\/span><a href=\"https:\/\/cloud.google.com\/pubsub\/docs\/bigquery\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">BigQuery subscription<\/span><\/a><span style=\"font-weight: 400;\">. Use it when you\u2019re ingesting messages and don\u2019t need to perform any processing before storing the data.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">When setting up a new subscription to a topic, you select the Write to BigQuery option, as shown here:<\/span><\/p>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11074 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-10.png\" alt=\"How Google Cloud builds a streaming data pipeline (3)\" width=\"512\" height=\"144\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-10.png 512w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-10-300x84.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-10-18x5.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The details of how this subscription is implemented are completely abstracted away from users. That means there is no way to execute any code on the incoming data. In essence, it is a no-code solution. As a result, you can\u2019t apply filtering to the data before storing it.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">You can also use this pattern if you want to first store, and perform processing later in BigQuery. 
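As a sketch, creating such a subscription from the command line could look like this (the project, topic, subscription, and table names below are placeholders):

```shell
# Create a subscription that writes messages straight to a BigQuery table.
# --use-topic-schema maps the topic's schema onto the table's columns.
gcloud pubsub subscriptions create clicks-to-bq \
    --topic=clicks \
    --bigquery-table=my-project:my_dataset.clicks \
    --use-topic-schema
```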
This is commonly referred to as ELT (extract, load, transform).<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Tip: One thing to keep in mind is that there are no guarantees that messages are written to BigQuery exactly once, so make sure to deduplicate the data when you\u2019re querying it later.<\/span><\/p>\n<h3 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Phuong_phap_2_Xu_ly_thu_rieng_le_bang_Cloud_Run\"><\/span><strong><span style=\"font-size: 16px;\"><i>Approach 2: Processing messages individually using Cloud Run<\/i><\/span><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Use Cloud Run if you do need to perform some lightweight processing on the individual messages before storing them. A good example of a lightweight transformation is canonicalizing data formats - where every data source uses its own format and fields, but you want to store the data in one data format.<\/span><\/p>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11075 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-11.png\" alt=\"How Google Cloud builds a streaming data pipeline (4)\" width=\"512\" height=\"126\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-11.png 512w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-11-300x74.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-11-18x4.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cloud Run lets you run your code as a web service directly on top of Google\u2019s infrastructure. 
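As an illustration, the core of such a service could be a pair of small functions like the sketch below: one that unwraps the Pub/Sub push envelope, and one that canonicalizes the event (the field names here are assumptions, not a prescribed schema):

```python
import base64
import json

def decode_push(body: dict) -> dict:
    # Pub/Sub push requests wrap each message in an envelope; the
    # payload arrives base64-encoded in message.data.
    raw = base64.b64decode(body["message"]["data"])
    return json.loads(raw)

def canonicalize(event: dict) -> dict:
    # Lightweight per-message processing: map a source-specific event
    # onto one canonical schema.
    return {
        "ad_id": event.get("ad_id") or event.get("adId"),
        "event_type": event.get("event", "unknown"),
    }

# On Cloud Run you would expose these behind an HTTP endpoint (for example
# with Flask), write each canonical row to BigQuery, and return a 2xx
# status so Pub/Sub does not redeliver the message.
```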
You can configure Pub\/Sub to <\/span><span style=\"font-weight: 400;\">send every message as an HTTP request using a push subscription<\/span><span style=\"font-weight: 400;\"> to the Cloud Run service\u2019s HTTPS endpoint. When a request comes in, your code does its processing and calls the <\/span><span style=\"font-weight: 400;\">BigQuery Storage Write API<\/span><span style=\"font-weight: 400;\"> to insert data into a BigQuery table. You can use any programming language and framework you want on Cloud Run.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">As of February 2022, push subscriptions are the recommended way to integrate Pub\/Sub with Cloud Run. A push subscription automatically retries requests if they fail and you can set a dead-letter topic to receive messages that failed all delivery attempts. Refer to <\/span><a href=\"https:\/\/cloud.google.com\/pubsub\/docs\/handling-failures\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">handling message failures<\/span><\/a><span style=\"font-weight: 400;\"> to learn more.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">There might be moments when no data is submitted to your pipeline. In this case, Cloud Run automatically scales the number of instances to zero. Conversely, it scales all the way up to 1,000 container instances to handle peak load. If you\u2019re concerned about costs, you can set a maximum number of instances.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">It\u2019s easier to evolve the data schema with Cloud Run. You can use established tools to define and manage data schema migrations like Liquibase. 
Read more on <\/span><a href=\"https:\/\/docs.liquibase.com\/start\/install\/tutorials\/bigquery.html\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">using Liquibase with BigQuery.<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">For added security, set the ingress policy on your Cloud Run microservices to be <\/span><span style=\"font-weight: 400;\">internal<\/span><span style=\"font-weight: 400;\"> so that they can only be reached from Pub\/Sub (and other internal services), create a service account for the subscription, and only give that service account access to the Cloud Run service. Read more about <\/span><a href=\"https:\/\/cloud.google.com\/run\/docs\/triggering\/pubsub-push\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">setting up push subscriptions in a secure way.<\/span><\/a><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Consider using Cloud Run as the processing component in your pipeline in these cases:<\/span><\/p>\n<ul style=\"text-align: justify;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">You can process messages individually, without requiring grouping and aggregating messages.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">You prefer using a general programming model over using a specialized SDK.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">You\u2019re already using Cloud Run to serve web applications and prefer simplicity and consistency in your solution architecture.<\/span><\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Tip: The <\/span><span style=\"font-weight: 400;\">BigQuery Storage Write API<\/span><span style=\"font-weight: 400;\"> is more efficient than the older <\/span><span style=\"font-weight: 400;\">insertAll method<\/span><span style=\"font-weight: 400;\"> because it uses gRPC streaming rather than REST over HTTP.<\/span><\/p>\n<h3 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Phuong_phap_3_Xu_ly_nang_cao_va_tong_hop_thong_bao_bang_Dataflow\"><\/span><strong><span style=\"font-size: 16px;\"><i>Approach 3: Advanced processing and aggregation of messages using Dataflow<\/i><\/span><\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cloud Dataflow, a fully managed service for executing <\/span><span style=\"font-weight: 400;\">Apache Beam<\/span><span style=\"font-weight: 400;\"> pipelines on Google Cloud, has long been the bedrock of building streaming pipelines on Google Cloud. It is a good choice for pipelines that aggregate groups of data to reduce data volume and those that have multiple processing steps.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">In a data stream, grouping is done using windowing. Windowing functions group unbounded collections by their timestamps. There are multiple windowing strategies available, including fixed, sliding, and session windows. Dataflow has built-in support to handle late data. Late data comes in when a window has already closed, and you might want to discard that data or trigger a recalculation. 
Refer to the <\/span><a href=\"https:\/\/cloud.google.com\/dataflow\/docs\/concepts\/streaming-pipelines#windows\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">documentation on windowing in streaming pipelines<\/span><\/a><span style=\"font-weight: 400;\"> to learn more.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cloud Dataflow can also be leveraged for AI\/ML workloads and is suited for users who want to preprocess, train, and make predictions on a machine learning model using TensorFlow. Here\u2019s a <\/span><span style=\"font-weight: 400;\">list of great tutorials that integrate Dataflow into end-to-end machine learning workflows.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">When dealing with a complex pipeline in production - or even a simple one - you want to have visibility into the state and performance of your pipeline. Cloud Dataflow has a UI that makes it easier to troubleshoot issues in multi-step pipelines. Through its integration with Cloud Monitoring, Dataflow provides tailored metrics, logs, and alerting. If you want to learn more, refer to this excellent <\/span><span style=\"font-weight: 400;\">overview of all the observability features in Dataflow.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cloud Dataflow is geared toward massive-scale data processing. Spotify notably uses it to compute its yearly personalized Wrapped playlists. 
Read this <\/span><a href=\"https:\/\/engineering.atspotify.com\/2021\/02\/how-spotify-optimized-the-largest-dataflow-job-ever-for-wrapped-2020\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">insightful blog post about the 2020 Wrapped pipeline<\/span><\/a><span style=\"font-weight: 400;\"> on the Spotify engineering blog.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Dataflow can autoscale its clusters both vertically and horizontally. Users can even go as far as using GPU-powered instances in their clusters, and Cloud Dataflow will take care of bringing new workers into the cluster to meet demand and destroying them when they are no longer needed.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">If you decide that Dataflow is the right match for your workload, look at the <\/span><a href=\"https:\/\/cloud.google.com\/dataflow\/docs\/guides\/templates\/provided-templates\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">provided templates<\/span><\/a><span style=\"font-weight: 400;\"> that solve common scenarios. These will help you get started faster. You can deploy the templates as pre-packaged pipelines. 
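For example, launching the provided Pub/Sub-to-BigQuery streaming template from the command line might look like this sketch (the job name, region, topic, and table spec are placeholders):

```shell
# Run the Google-provided Pub/Sub-to-BigQuery streaming template.
gcloud dataflow jobs run pubsub-to-bq \
    --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --region=us-central1 \
    --parameters=inputTopic=projects/my-project/topics/clicks,outputTableSpec=my-project:my_dataset.clicks
```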
To adapt the templates to your needs, explore <\/span><span style=\"font-weight: 400;\">the source code on GitHub.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Tip: Cap the maximum number of workers in the cluster to reduce cost and set up billing alerts.<\/span><\/p>\n<h3 style=\"text-align: justify;\"><span class=\"ez-toc-section\" id=\"Phuong_phap_nao_se_phu_hop_voi_ban\"><\/span><span style=\"font-size: 16px;\"><b>Which approach should you choose?<\/b><\/span><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11076 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-12.png\" alt=\"How Google Cloud builds a streaming data pipeline (5)\" width=\"512\" height=\"181\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-12.png 512w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-12-300x106.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-12-18x6.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The three tools have different capabilities and levels of complexity. Dataflow is the most powerful option and the most complex, requiring users to use a specialized SDK (Apache Beam) to build their pipelines. On the other end, a BigQuery subscription doesn\u2019t allow any processing logic and can be configured using the web console. Choosing the tool that best suits your needs will help you get better results faster.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">For massive (Spotify-scale) pipelines, or when you need to reduce data using windowing, or have a complex multi-step pipeline, choose Dataflow. 
In all other cases, starting with Cloud Run is best, unless you\u2019re looking for a no-code solution to connect Pub\/Sub to BigQuery. In that case, choose the BigQuery subscription.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cost is another factor to consider. Cloud Dataflow does apply automatic scaling, but won\u2019t scale to zero instances when there is no incoming data. For some teams, this is a reason to choose Cloud Run over Dataflow.<\/span><\/p>\n<p style=\"text-align: justify;\"><i><span style=\"font-weight: 400;\">This comparison table summarizes the key differences.<\/span><\/i><\/p>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11077 size-full\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-13.png\" alt=\"How Google Cloud builds a streaming data pipeline (6)\" width=\"512\" height=\"133\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-13.png 512w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-13-300x78.png 300w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2023\/03\/unnamed-13-18x5.png 18w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Cloud has been, and continues to be, an inevitable trend in how enterprises develop and optimize their technology systems. Gimasys - a Premier Partner of Google in Vietnam - consults on, designs, and delivers the optimal Cloud solution for you. 
For technical support, you can contact Gimasys - Premier Partner of Google in Vietnam at the following information:<\/span><\/p>\n<ul style=\"text-align: justify;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Hotline: <\/b><span style=\"font-weight: 400;\">0974 417 099 (HCM) | 0987 682 505 (HN)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Email: <\/b><a href=\"mailto:gcp@gimasys.com\"><span style=\"font-weight: 400;\">gcp@gimasys.com<\/span><\/a><\/li>\n<\/ul>\n<p style=\"text-align: right;\"><b>Source: <a href=\"https:\/\/gcloudvn.com\/en\/\">Gimasys<\/a><\/b><\/p>","protected":false},"excerpt":{"rendered":"<p>Many customers build streaming data pipelines to ingest, process, and then store data for later analysis. Google will focus on a popular pipeline design\u2026<\/p>","protected":false},"author":2,"featured_media":11071,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[135],"tags":[],"class_list":["post-11070","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-cloud-platform","entry","has-media"],"_links":{"self":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/11070","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/comments?post=11070"}],"version-history":[{"count":0,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/11070\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media\/11071"}],"wp:attachment":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media?parent=11070"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/categories?post=11070"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/tags?post=11070"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}