skip to Main Content
Welcome to Gimasys!
Hotline: +84 974 417 099 (HCM) | +84 987 682 505 (HN)

Introducing Firehose: An open source tool from Gojek for seamless data ingestion to BigQuery and Cloud Storage

Indonesia’s largest hyperlocal company, Gojek has evolved from a motorcycle ride-hailing service into an on-demand mobile platform, providing a range of services that include transportation, logistics, food delivery, and payments. A total of 2 million driver-partners collectively cover an average distance of 16.5 million kilometers each day, making Gojek Indonesia’s de-facto transportation partner.

To continue supporting this growth, Gojek runs hundreds of microservices that communicate across multiple data centers. Applications are based on an event-driven architecture and produce billions of events every day. To empower data-driven decision-making, Gojek uses these events across products and services for analytics, machine learning, and more.

Data warehouse ingestion challenges

Để hiểu được lượng lớn dữ liệu – và để hiểu rõ hơn về khách hàng cho mục đích phát triển ứng dụng, hỗ trợ khách hàng, tăng trưởng và mục đích tiếp thị – dữ liệu trước tiên phải được nhập vào kho dữ liệu. Gojek sử dụng BigQuery làm kho dữ liệu chính. Nhưng việc nhập các sự kiện ở quy mô của Gojek, với những thay đổi nhanh chóng, đặt ra những thách thức sau:

  • Với nhiều sản phẩm và microservices được cung cấp, Gojek phát hành các chủ đề Kafka mới hầu như mỗi ngày và chúng cần được nhập cho mục đích phân tích. Điều này có thể nhanh chóng dẫn đến chi phí hoạt động đáng kể cho nhóm kỹ thuật dữ liệu đang triển khai các công việc mới để tải dữ liệu vào BigQuery và Cloud Storage.
  • Frequent schema changes in Kafka topics require consumers of those topics to load the new schema to avoid data loss and capture more recent changes.
  • Data volumes can vary and grow exponentially as people start building new products and logging new activities on top of a new topic. Each topic can also have a different load during peak business hours. Customers need to handle the rising volume of data to quickly scale per their business needs.

Để giải quyết những thách thức này, Gojek sử dụng Firehose, một dịch vụ đám mây gốc để cung cấp dữ liệu phát trực tuyến theo thời gian thực đến các điểm đến như service endpoints, cơ sở dữ liệu được quản lý, data lakes và kho dữ liệu như Cloud Storage và BigQuery. Firehose là một phần của Open Data Ops Foundation (ODPF) và hoàn toàn là mã nguồn mở. Gojek là một trong những người đóng góp chính cho ODPF.

Here are Firehose’s key features:

  • Sinks - Firehose supports sinking stream data to the log console, HTTP, GRPC, PostgresDB (JDBC), InfluxDB, Elastic Search, Redis, Prometheus, MongoDB, GCS, and BigQuery.
  • Extensibility - Firehose allows users to add a custom sink with a clearly defined interface, or choose from existing sinks.
  • Scale - Firehose scales in an instant, both vertically and horizontally, for a high-performance streaming sink with zero data drops.
  • Runtime – Firehose có thể chạy bên trong các containers hoặc máy ảo trong môi trường thời gian chạy được quản lý hoàn toàn như Kubernetes.
  • Metrics - Firehose always lets you know what’s going on with your deployment, with built-in monitoring of throughput, response times, errors, and more.

Giới thiệu Firehose: công cụ mã nguồn mở của Gojek để nhập dữ liệu liền mạch vào BigQuery và Cloud Storage 1

Key advantages

Using Firehose for ingesting data in BigQuery and Cloud Storage has multiple advantages.


Firehose is battle-tested for large-scale data ingestion. At Gojek, Firehose streams 600 Kafka topics in BigQuery and 700 Kafka topics in Cloud Storage. On average, 6 billion events are ingested daily in BigQuery, resulting in more than 10 terabytes of daily data ingestion.

Streaming ingestion

A single Kafka topic can produce up to billions of records in a day. Depending on the nature of the business, scalability and data freshness are key to ensuring the usability of that data, regardless of the load. Firehose uses BigQuery streaming ingestion to load data in near-real-time. This allows analysts to query data within five minutes of it being produced.

Schema evolution

With multiple products and microservices offered, new Kafka topics are released almost every day, and the schema of Kafka topics constantly evolves as new data is produced. A common challenge is ensuring that as these topics evolve, their schema changes are adjusted in BigQuery tables and Cloud Storage. Firehose tracks schema changes by integrating with Stencil, a cloud-native schema registry, and automatically updates the schema of BigQuery tables without human intervention. This reduces data errors and saves developers hundreds of hours.

Elastic infrastructure

Firehose can be deployed on Kubernetes and runs as a stateless service. This allows Firehose to scale horizontally as data volumes vary.

Organizing data in cloud storage

Firehose GCS Sink provides capabilities to store data based on specific timestamp information, allowing users to customize how their data is partitioned in Cloud Storage.

Supporting a wide range of open source software

Built for flexibility and reliability, Google Cloud products like BigQuery and Cloud Storage are made to support a multi-cloud architecture. Open source software like Firehose is just one of many examples that can help developers and engineers optimize productivity. Taken together, these tools can deliver a seamless data ingestion process, with less maintenance and better automation.

If your business is interested in the Google Cloud Platform Platform then you can connect to Gimasys - Google Premier Partner - for consulting solutions according to the unique needs of your business. Contact now:

  • Gimasys – Google Cloud Premier Partner
  • Hotline: Hanoi: 0987 682 505 - Ho Chi Minh: 0974 417 099
  • Email:

Source: Gimasys

Back To Top
0974 417 099