Brazil is one of the world’s most promising renewables markets, in which Casa Dos Ventos is a leading pioneer and investor. With our innovation and investments we are leading the transition into a more competitive and sustainable future.
We rely on “big data” to support our most important business decisions. Most of that data is stored in BigQuery, a serverless enterprise data warehouse, and we continuously use Google Cloud tools and services to run our business operations more efficiently.
For example, in wind farm operations, the data is used to quantify energy production, losses, and efficiency. For meteorological masts (also known as metmasts), the sensor data and configurations are constantly ingested and analyzed to monitor their health. In newer and greenfield projects, we use data to make decisions on our investments.
We need trusted data to make these decisions and to stay on track with our goals for uptime, efficiency, and investment returns. However, controlling data quality has been a challenge for us, frequently pulling us into data firefighting.
Previously, we built homegrown solutions, such as setting rules and alerts in BI tools or writing custom Python scripts, but they did not work as well as we needed. These approaches were hard to scale and standardize, and were often costly.
To solve these problems, we turned to Dataplex, an intelligent data fabric that unifies distributed data, to achieve better data governance in our organization and build trust in the data. With Dataplex, we now have a streamlined way of organizing and securing our data and monitoring its quality.
We started Dataplex implementation with three key goals:
- Define a data governance framework for the organization.
- Create reports that routinely measure adherence to the framework.
- Create reports that routinely measure the quality of the data.
Define a data governance framework for the organization.
We started by organizing the data in alignment with the business and then used Dataplex to set policies for this organization.
Dataplex abstracts away the underlying data storage systems by using constructs such as lakes, data zones, and assets. We decided to map these constructs to our business with the following framework:
- Lake - One lake per department in the company
- Data Zone - Separates the data in each lake into subareas
  - Raw zone - Contains datasets used for raw tables or tables with few modifications / aggregations
  - Curated zone - Contains datasets with aggregate tables or prediction tables (used by ML models)
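The mapping above can be written down as a small data structure, which is handy when checking where a dataset belongs. This is only a minimal sketch: the department and dataset names are hypothetical, and the class is an illustration rather than Dataplex API code.

```python
from dataclasses import dataclass, field

# Zone types in the framework described above.
RAW = "raw"
CURATED = "curated"


@dataclass
class Lake:
    """One lake per department; each zone holds a list of dataset names."""
    department: str
    zones: dict = field(default_factory=lambda: {RAW: [], CURATED: []})

    def add_dataset(self, zone: str, dataset: str) -> None:
        """Attach a dataset to one of the lake's zones."""
        if zone not in self.zones:
            raise ValueError(f"unknown zone: {zone}")
        self.zones[zone].append(dataset)

    def zone_of(self, dataset: str):
        """Return the zone holding the dataset, or None if it is not attached."""
        for zone, datasets in self.zones.items():
            if dataset in datasets:
                return zone
        return None


# Hypothetical example: a lake for an "operations" department.
operations = Lake("operations")
operations.add_dataset(RAW, "scada_raw")
operations.add_dataset(CURATED, "energy_forecasts")
```

Keeping the zone type explicit makes it easy to ask, for any dataset, whether it holds raw or curated tables.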
Create reports for data assets and govern data assets
To monitor our data's governance posture, we have two reports that capture the current state.
The first report tracks the entire data estate. We used BigQuery APIs and developed Python scripts (scheduled by Composer) to extract the metadata of all BigQuery tables in the organization. The report also measures critical aspects such as the number of documented tables and views.
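A script of this kind might look roughly like the sketch below. It is not our production code: it covers a single project, treats a table as "documented" when its description is non-empty (an assumption), and the function names are hypothetical. The BigQuery client is imported lazily so the coverage calculation can be tried without credentials.

```python
def fetch_table_metadata(project_id):
    """List every table and view in one project with its description.

    Needs the google-cloud-bigquery package and valid credentials.
    """
    from google.cloud import bigquery  # lazy import: only needed here

    client = bigquery.Client(project=project_id)
    rows = []
    for dataset in client.list_datasets():
        for item in client.list_tables(dataset.reference):
            table = client.get_table(item.reference)  # full metadata
            rows.append({
                "dataset": item.dataset_id,
                "table": item.table_id,
                "type": table.table_type,
                "description": table.description,
            })
    return rows


def documentation_coverage(rows):
    """Percentage of tables/views whose description is non-empty."""
    if not rows:
        return 0.0
    documented = sum(1 for r in rows if (r["description"] or "").strip())
    return 100.0 * documented / len(rows)


# Hypothetical metadata, shaped like the output of fetch_table_metadata().
sample = [
    {"dataset": "ops", "table": "turbines", "type": "TABLE", "description": "Turbine master data"},
    {"dataset": "ops", "table": "v_losses", "type": "VIEW", "description": None},
]
coverage = documentation_coverage(sample)
```

The resulting rows can be written back to a BigQuery table on a schedule so the report always reflects the current estate.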
Secondly, we also track our progress in continuously bringing our data estate into Dataplex governance. We followed the same process (API + Python code) to build the following dashboard. Currently, the datasets under Dataplex stand at 71.6% per this dashboard. Our goal is to get to 100% and then maintain that.
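The percentage on that dashboard boils down to a set comparison between all datasets and those attached to Dataplex. The sketch below shows the arithmetic only, with hypothetical dataset names and a hypothetical function name; how the two lists are collected from the APIs is left out.

```python
def governance_coverage(all_datasets, governed_datasets):
    """Return (percent of datasets under Dataplex, datasets still missing)."""
    all_set = set(all_datasets)
    governed = all_set & set(governed_datasets)
    if not all_set:
        return 0.0, []
    percent = round(100.0 * len(governed) / len(all_set), 1)
    return percent, sorted(all_set - governed)


# Hypothetical inventory: four datasets, three already attached as assets.
percent, missing = governance_coverage(
    ["ops_raw", "ops_curated", "met_raw", "finance_raw"],
    ["ops_raw", "ops_curated", "met_raw"],
)
```

Returning the list of missing datasets alongside the percentage gives a concrete to-do list for reaching 100%.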
Create data quality scans and reports
Once data is under management in Dataplex, we build data quality reports and a dashboard in Dataplex with a few simple clicks.
Multiple data quality scans run within Dataplex, one for each critical table.
To create rules, we used the built-in rules but also created our own using custom SQL statements. For example, to make sure we never have any rows that match a particular condition, we created a SQL rule that returns FALSE when even a single row matches the condition.
(SELECT COUNT(*) AS count_values
 FROM `your_project.your_dataset.your_table`  -- placeholder: the table name is not shown in the original
 WHERE `columnX` IS NULL AND columnY <> "some string"
) = 0
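To see how a rule of this shape behaves, the sketch below runs the same kind of assertion against an in-memory SQLite database. SQLite only stands in for BigQuery here, and the `readings` table and its sample rows are hypothetical.

```python
import sqlite3

# The rule passes only when no row matches the forbidden condition.
RULE = """
SELECT (
    SELECT COUNT(*) FROM readings
    WHERE columnX IS NULL AND columnY <> 'some string'
) = 0
"""


def rule_passes(rows):
    """Load rows into a scratch table and evaluate the rule against them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (columnX REAL, columnY TEXT)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    (passed,) = conn.execute(RULE).fetchone()
    conn.close()
    return bool(passed)


# A NULL columnX is tolerated only when columnY equals 'some string'.
ok = rule_passes([(1.0, "other"), (None, "some string")])
bad = rule_passes([(1.0, "other"), (None, "other")])
```

Here `ok` is true because no row violates the condition, while a single offending row is enough to make `bad` false.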
When these checks fail, we rely on the query shown by Dataplex AutoDQ to find the rows that failed.
To build a dashboard for data quality, we use the logs in Cloud Logging and set up a sink to BigQuery. Once the data lands in BigQuery, we create a view with the following query:
SELECT
  jsonpayload_v1_datascanevent.type AS type_scan_event,
  SPLIT(jsonpayload_v1_datascanevent.datasource, '/')[OFFSET(1)] AS datasource_project,
  SPLIT(jsonpayload_v1_datascanevent.datasource, '/')[OFFSET(3)] AS datasource_location,
  SPLIT(jsonpayload_v1_datascanevent.datasource, '/')[OFFSET(5)] AS datasource_lake,
  SPLIT(jsonpayload_v1_datascanevent.datasource, '/')[OFFSET(7)] AS datasource_zone
FROM `datalake-cver.Analytics_Data_Quality_cdv.dataplex_googleapis_com_data_scan` AS DATA_SCAN
Creating this view enables separating data quality scan results by lake and zones.
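The `SPLIT(...)[OFFSET(n)]` expressions in the view pick every second segment of the data source path. The same logic in Python is sketched below; the path layout (`projects/.../locations/.../lakes/.../zones/...`) is inferred from the offsets used in the query, and the example path is hypothetical.

```python
def parse_datasource(datasource):
    """Split a data source path into the fields exposed by the view."""
    parts = datasource.split("/")
    return {
        "datasource_project": parts[1],   # after "projects"
        "datasource_location": parts[3],  # after "locations"
        "datasource_lake": parts[5],      # after "lakes"
        "datasource_zone": parts[7],      # after "zones"
    }


# Hypothetical path following the assumed layout.
fields = parse_datasource(
    "projects/my-project/locations/us-central1/lakes/operations/zones/raw/entities/turbines"
)
```

With the lake and zone available as separate fields, scan results can be grouped per department and per zone in the dashboard.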
We then use Tableau to:
- Create a dashboard and
- Send email notifications to the responsible users using alerts in Tableau
Here is our Tableau dashboard:
While we have achieved a much better governance posture, we also look forward to expanding our usage of Dataplex further. We are starting to use the Lineage feature for BigQuery tables and learning how to integrate Data Quality with Lineage. This will enable us to easily check which dashboards and views are impacted by data quality issues. We are also planning to manage SQL scripts in our GitHub account.