Traveloka: Switch to Google Cloud Platform for Powerful Big Data Analytics

21/06/2021

SUMMARY

Business problem:

Debugging problems in Kafka clusters proved difficult and time-consuming
Adding more nodes to MongoDB required a lengthy rebalancing process, and the cluster quickly ran out of disk space
The business could store only 14 days of data in MemSQL because of memory limitations, and queries sometimes returned out-of-memory errors

Solution:

Google Cloud Platform
BigQuery
Cloud Pub/Sub
Cloud Dataflow
Kubernetes Engine
Cloud Storage
Cloud Composer
Cloud SQL

Result:

Frees engineers to spend their time delivering value to the business
Records more than 99.9% availability
Stores 400TB (about 500 billion rows) of data

With Google Cloud technologies such as BigQuery, Traveloka has established a data architecture that meets all of its performance and availability requirements. The architecture lets the business draw meaningful, actionable insights from large volumes of data, collecting and analyzing that data in real time for enterprise-wide decision making.

Founded in 2012, Traveloka is a unicorn business that provides reservations for travel, dining and other options. The organization has grown to establish a presence in six ASEAN countries and employs more than 2,000 people, including 400 engineers. Traveloka aims to be a one-stop travel and lifestyle platform for Indonesians and is diversifying and personalizing its services. The business introduced features like car rental bookings and travel destination guides in 2018, and added a host of extras to existing services, such as flight status notifications for its flight booking service. Notable last year was the launch of an online credit service.

Traveloka relies on data analytics to provide tailored, personalized services to consumers. This poses a major challenge to the enterprise data analytics team. The team must support the growing business need for actionable insights by collecting data from multiple sources, choosing the right framework for data analysis, managing multiple use cases, and delivering real-time data for streaming analytics and reporting. At the same time, the business must scale its infrastructure while reducing costs.

The analytics team's activities should support the business goals of increasing agility and shortening time to market for new features and apps. From a technology standpoint, this means speeding up development and delivery without compromising security.

“As part of the Google Cloud Platform streaming analytics solution, BigQuery's support for streaming data is a key advantage for us in supporting real-time analytics use cases.”

—Rendy Bambang, Data Engineering Lead, Traveloka—

Data analytics not keeping pace

However, as the business expanded, Traveloka's existing data analytics environment could not keep up. This affected the online data processing that supported a number of use cases, including fraud detection, personalization, ad optimization, cross-selling, A/B testing and ad eligibility calculation, and that enabled business analysts to track performance.

To run the data analytics pipeline, Traveloka relied on an architecture that included Apache Kafka for ingesting user events, sharded MongoDB to provide an operational datastore that spanned multiple machines, and sharded MemSQL for real-time analytics queries. Traveloka processed data from Kafka through a Java consumer and stored it in MongoDB with the user ID as the primary key. For analysis, Traveloka consumed event data from Kafka and stored it in MemSQL, where it was accessible to business intelligence tools.

“Cloud Pub/Sub is particularly convenient for us because, unlike the previous architecture, which required capacity planning for event ingestion, we can rely on its automation to handle volume and throughput changes without having to do anything.”

—Rendy Bambang, Data Engineering Lead, Traveloka—

Low latency and fully managed infrastructure

The business decided to explore the market and established that an alternative service needed to provide:

Low end-to-end data latency, backed by a guaranteed service level agreement
Fully managed infrastructure that frees engineers to solve business problems (and spend less time on maintenance and firefighting), including 99.9% end-to-end system availability and automatic scaling of storage and compute

These requirements distilled into a need for a fully managed technology with low end-to-end latency, high performance and availability, and minimal operational demands.

Google Cloud Platform as a platform

Traveloka conducted an assessment and concluded that Google Cloud Platform provided the services and performance to act as the foundation of the data architecture.

For its data pipeline project, Traveloka implemented a cross-cloud environment combining the Cloud Pub/Sub (https://cloud.google.com/pubsub) real-time messaging service for event data ingestion, Cloud Dataflow (https://cloud.google.com/dataflow) for processing streamed data, and the BigQuery analytics data warehouse for storing the historical and current data generated by customer operations, as well as processed data. Each Google Cloud Platform service has helped overcome a bottleneck in the previous pipeline.

The BigQuery analytics data warehouse is key to the new architecture. “As part of the Google Cloud Platform streaming analytics solution, BigQuery's support for streaming data is a key advantage for us in supporting real-time analytics use cases,” said Rendy Bambang, Data Engineering Lead, Traveloka. “Furthermore, we no longer have to worry about storing historical data for only 14 days, because BigQuery stores all that data for us, with computing resources that automatically scale as we need.”

“Cloud Dataflow's ability to create new pipelines and auto-scale without user intervention is a big plus for us, especially when we have to backfill a pipeline to process historical data.”

—Rendy Bambang, Data Engineering Lead, Traveloka—

“Cloud Pub/Sub is particularly convenient for us because, unlike our previous architecture, which required capacity planning for event ingestion, we can rely on its automated partitioning to handle volume and throughput changes without any work,” added Bambang. “Finally, Cloud Dataflow's ability to create new pipelines and automatically scale without user intervention was a big plus for us, especially when we had to backfill a pipeline to process historical data.”

Cloud Dataflow's unified programming model, based on Apache Beam, eases the transition between batch and stream data processing, while its windowing and triggering functions make it easy to handle late-arriving data.
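
To make the idea of windowing with late (slow-arriving) data concrete, here is a minimal plain-Python sketch. It does not use the Apache Beam or Cloud Dataflow APIs and is not Traveloka's pipeline. The window size, the allowed lateness, the function names and the simplified watermark (the largest event time seen so far) are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # hypothetical fixed window size
ALLOWED_LATENESS = 120   # hypothetical grace period for late events


def window_start(event_time: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_events(event_times):
    """Count events per fixed window and drop events that arrive too late.

    `event_times` lists event timestamps (in seconds) in arrival order.
    The watermark is simply the largest event time seen so far, which is
    a simplification of what a real stream processor estimates.
    """
    counts = defaultdict(int)
    dropped = 0
    watermark = 0
    for event_time in event_times:
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if watermark > window_end + ALLOWED_LATENESS:
            dropped += 1        # the window's grace period has expired
        else:
            counts[start] += 1  # on time, or late but still accepted
        watermark = max(watermark, event_time)
    return dict(counts), dropped


# The event at t=10 arrives late but within the grace period and is counted;
# the event at t=20 arrives after the watermark has reached 400 and is dropped.
counts, dropped = count_events([5, 70, 10, 400, 20])
print(counts, dropped)
```

In a real streaming engine the watermark, triggers and late-data policy are configured on the pipeline rather than hand-coded, which is the convenience the paragraph above refers to.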

400TB of data successfully stored

The Google Cloud Platform infrastructure now handles large data volumes quickly and reliably across the organization, with over 99.9% end-to-end availability. Over 4TB of data per day is published to Cloud Pub/Sub, while BigQuery stores around 400TB (about 500 billion rows) of data. Approximately 250TB of data resides in Cloud Storage, and about 60,000 jobs are executed per day. Cloud Dataflow processes about 2,500 jobs per day, while about 1,500 charts backed by BigQuery are generated with business intelligence tools.
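
As a rough sanity check on these figures, the short calculation below derives an average row size and an average ingest rate. It assumes decimal terabytes and treats the rounded numbers quoted above as exact, so the results are only order-of-magnitude estimates.

```python
TB = 10**12  # assumption: decimal terabytes

stored_bytes = 400 * TB        # data held in BigQuery
stored_rows = 500 * 10**9      # about 500 billion rows
daily_ingest_bytes = 4 * TB    # "over 4TB" per day, so this is a lower bound
seconds_per_day = 24 * 60 * 60

# Average size of one stored row, in bytes.
avg_row_bytes = stored_bytes / stored_rows

# Average sustained ingest rate, in megabytes per second.
avg_ingest_mb_per_s = daily_ingest_bytes / seconds_per_day / 10**6

print(f"average row size: {avg_row_bytes:.0f} bytes")
print(f"average ingest rate: {avg_ingest_mb_per_s:.1f} MB/s")
```

On these assumptions a row averages roughly 800 bytes and ingestion averages roughly 46 MB/s over a day; real traffic will peak well above the daily average.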

The BigQuery warehouse is also integral to changes in how Traveloka gives its product teams access to data. “In the past, when a product team requested data from our data warehouse, we simply gave them direct read access to the groups or the tables they needed,” said Imre Nagi, Software Engineer, Data Team, Traveloka.

However, this approach required the client system to be tightly coupled to the data storage technology and format, which meant that any change to the technology or format required updating the client system. Furthermore, because access was granted at the group level, the data team could not be sure that product teams were not accessing columns they were not authorized to read. Finally, the data team found it difficult to track and audit what users were doing with the data.

Standardized data delivery

“Based on issues across the enterprise, we decided to build a standardized way to serve our data, which would later become our data provisioning API,” said Nagi.

The API currently delivers millions of records, totaling several gigabytes, from the BigQuery warehouse to production systems on demand. Cloud Composer (https://cloud.google.com/composer) schedules BigQuery queries that transform raw data into summarized and redacted versions, which are written to intermediate and final processed tables.
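
Scheduling transformations from raw to intermediate to final tables amounts to running queries in dependency order. The sketch below illustrates that ordering with Python's standard `graphlib` module. It is not Cloud Composer or Airflow code, and the table names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tables; each one maps to the set of tables it is built from.
DEPENDENCIES = {
    "raw_events": set(),
    "intermediate_summary": {"raw_events"},
    "intermediate_restricted": {"intermediate_summary"},
    "final_table": {"intermediate_summary", "intermediate_restricted"},
}


def run_order(dependencies):
    """Return the tables in an order where every input is built first."""
    return list(TopologicalSorter(dependencies).static_order())


for table in run_order(DEPENDENCIES):
    # A real scheduler would submit the query that builds `table` here.
    print("build", table)
```

A workflow scheduler adds what this sketch leaves out: time-based triggers, retries and monitoring for each step.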

Cloud Storage provides temporary storage for query results and handles sending results to clients, and Cloud SQL keeps track of associations, state, and other metadata. The API itself is hosted in Kubernetes clusters managed by Google Kubernetes Engine. The Kubernetes clusters (https://cloud.google.com/kubernetes) communicate with Cloud Storage and Cloud SQL to store the results and job metadata of queries made by the requester.

Issues solved

“With Google Cloud Platform technology, our new data provisioning API has successfully solved a number of issues in data delivery,” said Nagi. “We now have a clear API contract that standardizes how product teams access our data warehouse.”

Using the API means that product teams no longer access the physical layer of Traveloka's infrastructure, which improves the data team's ability to audit data usage. The data team can also define column-level access controls, ensuring product teams use only the columns they need. Additionally, the API provides a standard yet flexible definition that other teams can use to query data. “We can now restrict how product teams access our data, while still allowing a wide variety of queries,” said Nagi. “Overall, we now have the flexibility, along with the security and control we need.”
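
One way to picture column-level access control with auditing is a check that runs before any query is issued. The following is a minimal sketch under stated assumptions, not Traveloka's API: the team names, column names, `AccessDenied` exception and in-memory audit log are all hypothetical.

```python
# Hypothetical policy: the columns each product team may read.
ALLOWED_COLUMNS = {
    "booking_team": {"booking_id", "booking_date", "route"},
    "marketing_team": {"booking_date", "route"},
}

# In-memory audit trail; a real service would persist this.
AUDIT_LOG = []


class AccessDenied(Exception):
    """Raised when a team requests a column it is not allowed to read."""


def authorize_columns(team, requested):
    """Return the requested columns if all are permitted, else raise."""
    allowed = ALLOWED_COLUMNS.get(team, set())
    denied = sorted(set(requested) - allowed)
    AUDIT_LOG.append((team, tuple(requested), not denied))
    if denied:
        raise AccessDenied(f"{team} may not read: {', '.join(denied)}")
    return list(requested)


print(authorize_columns("marketing_team", ["route", "booking_date"]))
try:
    authorize_columns("marketing_team", ["booking_id"])
except AccessDenied as error:
    print(error)
```

Because every request passes through one function, the audit log records both granted and refused requests, which is the kind of visibility direct table access did not offer.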
