{"id":20968,"date":"2024-12-16T10:14:15","date_gmt":"2024-12-16T03:14:15","guid":{"rendered":"https:\/\/gcloudvn.com\/?p=20968"},"modified":"2024-12-18T13:50:07","modified_gmt":"2024-12-18T06:50:07","slug":"how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser","status":"publish","type":"post","link":"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/","title":{"rendered":"How to simplify building RAG pipelines in BigQuery with Document AI Layout Parser"},"content":{"rendered":"<section class=\"wpb-content-wrapper\"><div class=\"vc_row wpb_row vc_row-fluid\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\">\n\t<div class=\"wpb_text_column wpb_content_element\" >\n\t\t<div class=\"wpb_wrapper\">\n\t\t\t<p><span style=\"font-weight: 400;\">Document preprocessing is a common stumbling block when building retrieval-augmented generation (RAG) pipelines. It often requires Python skills and external libraries to parse documents like PDFs into manageable chunks that can be used to generate embeddings. 
In this article, we\u2019ll help you get the most value out of your massive document repositories and explore how Google can help you transform complex documents into structured data that\u2019s ready to feed AI models.<\/span><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewbox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewbox=\"0 0 24 24\" version=\"1.2\" baseprofile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Xu_ly_Streamline_Document_trong_BigQuery\" >Streamline document processing in BigQuery<\/a><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Document_preprocessing_cho_RAG\" >Document preprocessing for RAG<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Xay_dung_mot_RAG_pipeline_trong_BigQuery\" >Build a RAG pipeline in BigQuery<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Tao_Layout_Parser_processor\" >Create a Layout Parser processor<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Goi_processor_de_tao_chunks\" >Call the processor to create chunks<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Tao_vector_embeddings_cho_chunks\" >Create vector embeddings for the chunks<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Tao_vector_index_tren_embeddings\" >Create a vector index on the embeddings<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" 
href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Truy_xuat_cac_khoi_co_lien_quan_va_gui_den_LLM_de_tao_cau_tra_loi\" >Retrieve relevant chunks and send to LLM for answer generation\u00a0<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/gcloudvn.com\/en\/kienthuc\/how-to-simplify-building-rag-pipelines-in-bigquery-with-document-ai-layout-parser\/#Bat_dau_phan_tich_tai_lieu_trong_BigQuery\" >Get started with document parsing in BigQuery<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Xu_ly_Streamline_Document_trong_BigQuery\"><\/span><b>Streamline document processing in BigQuery<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>BigQuery now provides document preprocessing capabilities for RAG pipelines and other document-centric applications through tight integration with Document AI. The ML.PROCESS_DOCUMENT function gives you access to new processors, including Document AI\u2019s Layout Parser processor, which lets you parse PDF documents with SQL syntax.<\/p>\n<p><span style=\"font-weight: 400;\">The general availability (GA) of ML.PROCESS_DOCUMENT brings developers several new benefits:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Improved scalability: <\/b><span style=\"font-weight: 400;\">The ability to handle larger documents, up to 100 pages, and process them faster<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Simplified syntax: <\/b><span style=\"font-weight: 400;\">Streamlined SQL syntax for interacting with Document AI\u2019s Layout Parser processor<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Additional processors: <\/b><span style=\"font-weight: 400;\">Access to additional Document AI processors, such as Layout Parser, to generate the document chunks needed for RAG pipelines<\/span><\/li>\n<\/ul>\n<p><span 
style=\"font-weight: 400;\">In particular, document chunking is a critical \u2013 yet challenging \u2013 component of building a RAG pipeline, and Document AI\u2019s Layout Parser helps simplify this process.\u00a0<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Document_preprocessing_cho_RAG\"><\/span><b>Document preprocessing for RAG<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Breaking down large documents into smaller, semantically related units improves the relevance of the retrieved information, leading to more accurate answers from a large language model (LLM).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Generating metadata such as document source, chunk location, and structural information alongside chunks can further enhance your RAG pipeline, making it possible to filter and refine your search results and to debug your code.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following diagram provides a high-level overview of the preprocessing steps in a basic RAG pipeline:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20975\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090050.433.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090050.433.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090050.433-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Xay_dung_mot_RAG_pipeline_trong_BigQuery\"><\/span><b>Build a RAG pipeline in BigQuery<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Comparing financial documents like earnings statements can be challenging due to their complex structure and mix of text, figures, and tables. 
Let\u2019s demonstrate how to build a RAG pipeline in BigQuery using Document AI\u2019s Layout Parser to analyze the US Federal Reserve\u2019s 2023 <a href=\"https:\/\/www.federalreserve.gov\/publications\/files\/scf23.pdf\" target=\"_blank\" rel=\"noopener\">Survey of Consumer Finances (SCF)<\/a> report.<\/p>\n<p><span style=\"font-weight: 400;\">Dense financial documents like the Federal Reserve\u2019s SCF report present significant challenges for traditional parsing techniques. This document spans nearly 60 pages and contains a mix of text, detailed tables, and embedded charts, making it difficult to reliably extract information. Document AI\u2019s Layout Parser excels in these scenarios, effectively identifying and extracting key information from complex document layouts such as these.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Building a BigQuery RAG pipeline with Document AI\u2019s Layout Parser consists of the following broad steps.<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Tao_Layout_Parser_processor\"><\/span><b>Create a Layout Parser processor<\/b><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>In Document AI, create a new processor with the type LAYOUT_PARSER_PROCESSOR. Then create a remote model in BigQuery that points to this processor, allowing BigQuery to access and process the documents.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Goi_processor_de_tao_chunks\"><\/span><b> Call the processor to create chunks<\/b><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To access PDFs in Google Cloud Storage, begin by creating an object table over the bucket that contains the documents. Then, use the ML.PROCESS_DOCUMENT function to pass the objects through to Document AI and return results in BigQuery. Document AI analyzes the document and chunks the PDF. 
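<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a rough sketch of those two setup statements (the connection ID, bucket path, project, and processor ID below are placeholder assumptions, not values from this walkthrough), the object table and the remote model over the Layout Parser processor could be created like this:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">-- Object table over the PDFs in Cloud Storage (placeholder connection and bucket)<br \/>\nCREATE OR REPLACE EXTERNAL TABLE docai_demo.demo<br \/>\nWITH CONNECTION `us.docai_connection`<br \/>\nOPTIONS (object_metadata = 'SIMPLE',<br \/>\n\u00a0uris = ['gs:\/\/my-docs-bucket\/*.pdf']);<br \/>\n<br \/>\n-- Remote model pointing at the Layout Parser processor (placeholder processor path)<br \/>\nCREATE OR REPLACE MODEL docai_demo.layout_parser<br \/>\nREMOTE WITH CONNECTION `us.docai_connection`<br \/>\nOPTIONS (remote_service_type = 'CLOUD_AI_DOCUMENT_V1',<br \/>\n\u00a0document_processor = 'projects\/my-project\/locations\/us\/processors\/my-processor-id');<\/span><\/p>\n<p><span style=\"font-weight: 400;\">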
The results are returned as JSON objects and can easily be parsed to extract metadata like source URI and page number.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20974\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090106.361.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090106.361.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090106.361-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">SELECT * FROM ML.PROCESS_DOCUMENT(<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0MODEL docai_demo.layout_parser,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0TABLE docai_demo.demo,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0PROCESS_OPTIONS =&gt; (<br \/>\n<\/span><span style=\"font-weight: 400;\">JSON '{\"layout_config\": {\"chunking_config\": {\"chunk_size\": 300}}}')<br \/>\n<\/span>);<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20973\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090122.503.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090122.503.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090122.503-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Tao_vector_embeddings_cho_chunks\"><\/span><b> Create vector embeddings for the chunks<\/b><b><br \/>\n<\/b><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To enable semantic search and retrieval, we\u2019ll generate embeddings for each document chunk using the 
ML.GENERATE_EMBEDDING function and write them to a BigQuery table. This function takes two arguments:\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A remote model, which calls a Vertex AI embedding endpoint<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A column from a BigQuery table that contains the data to embed<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20972\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090136.464.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090136.464.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090136.464-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Tao_vector_index_tren_embeddings\"><\/span><b> Create a vector index on the embeddings<\/b><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To efficiently search through a large set of chunks based on semantic similarity, we\u2019ll create a vector index on the embeddings. Without a vector index, a search requires comparing each query embedding to every embedding in your dataset, which is computationally expensive and slow when dealing with a large number of chunks. 
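<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The embedding step above can be sketched as follows; note that docai_demo.parsed_chunks is a hypothetical table assumed to hold the content and uri values extracted from the JSON output of ML.PROCESS_DOCUMENT, and docai_demo.embedding_model is the remote embedding model used throughout this walkthrough:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">-- Materialize one embedding per chunk; input columns (content, uri) are passed through<br \/>\nCREATE OR REPLACE TABLE docai_demo.embeddings AS<br \/>\nSELECT *<br \/>\nFROM ML.GENERATE_EMBEDDING(<br \/>\n\u00a0MODEL `docai_demo.embedding_model`,<br \/>\n\u00a0(SELECT content, uri FROM docai_demo.parsed_chunks)<br \/>\n);<\/span><\/p>\n<p><span style=\"font-weight: 400;\">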
Vector indexes use techniques like approximate nearest neighbor search to speed up this process.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">CREATE VECTOR INDEX my_index<br \/>\n<\/span><span style=\"font-weight: 400;\">ON docai_demo.embeddings(ml_generate_embedding_result)<br \/>\n<\/span><span style=\"font-weight: 400;\">OPTIONS(index_type = \"TREE_AH\",<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0distance_type = \"EUCLIDEAN\"<br \/>\n<\/span><span style=\"font-weight: 400;\">);<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Truy_xuat_cac_khoi_co_lien_quan_va_gui_den_LLM_de_tao_cau_tra_loi\"><\/span><b> Retrieve relevant chunks and send to LLM for answer generation\u00a0<\/b><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Now we can perform a <\/span><a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-search-intro\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">vector search<\/span><\/a><span style=\"font-weight: 400;\"> to find chunks that are semantically similar to our input query. In this case, we ask how typical family net worth changed in the three years this report covers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">SELECT<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0ml_generate_text_llm_result AS generated,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0prompt<br \/>\n<\/span><span style=\"font-weight: 400;\">FROM<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0ML.GENERATE_TEXT( MODEL `docai_demo.gemini_flash`,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0(<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0SELECT<\/span><\/p>\n<p>CONCAT( 'Did the typical family net worth change? How does this compare to the SCF survey a decade earlier? 
Be concise and use the following context:',<br \/>\n<span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0STRING_AGG(FORMAT(\"context: %s and reference: %s\", base.content, base.uri), ',\\n')) AS prompt,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0FROM<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0VECTOR_SEARCH( TABLE<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0`docai_demo.embeddings`,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'ml_generate_embedding_result',<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0(<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SELECT<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ml_generate_embedding_result,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0content AS query<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 FROM<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0ML.GENERATE_EMBEDDING( MODEL `docai_demo.embedding_model`,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0(<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SELECT<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0'Did the typical family net worth increase? 
How does this compare to the SCF survey a decade earlier?' AS content<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0)<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0 \u00a0 \u00a0 \u00a0),<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0top_k =&gt; 10,<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0OPTIONS =&gt; '{\"fraction_lists_to_search\": 0.01}')<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0),<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0\u00a0\u00a0\u00a0\u00a0STRUCT(512 AS max_output_tokens, TRUE AS flatten_json_output)<br \/>\n<\/span><span style=\"font-weight: 400;\">\u00a0);<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The retrieved chunks are then sent through the <\/span><a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/bigqueryml-syntax-generate-text\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ML.GENERATE_TEXT<\/span><\/a><span style=\"font-weight: 400;\"> function, which calls the Gemini 1.5 Flash endpoint and generates a concise answer to our question.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20971\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090152.494.jpg\" alt=\"\" width=\"600\" height=\"375\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090152.494.jpg 600w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2024\/12\/Thang-72024-2024-12-16T090152.494-18x12.jpg 18w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p><span style=\"font-weight: 
400;\">And we get an answer: median family net worth increased 37% from 2019 to 2022, a significant change compared to the 2% decrease recorded over the period a decade earlier. Notice that in the original document this information is spread across text, tables, and footnotes \u2014 areas that are traditionally tricky to parse and reconcile!<\/span><\/p>\n<p>This example demonstrated a basic RAG flow, but real-world applications often require continuous updates. Imagine a scenario where new financial reports are added daily to a Cloud Storage bucket. To keep your RAG pipeline up to date, consider using BigQuery Workflows or Cloud Composer to incrementally process new documents and generate embeddings in BigQuery. Vector indexes are automatically refreshed when the underlying data changes, ensuring that you always query the most current information.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Bat_dau_phan_tich_tai_lieu_trong_BigQuery\"><\/span><b>Get started with document parsing in BigQuery<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The integration of Document AI\u2019s Layout Parser with BigQuery makes it easier for developers to build powerful RAG pipelines. By leveraging ML.PROCESS_DOCUMENT and other BigQuery machine learning functions, you can streamline document preprocessing, generate embeddings, and perform semantic search \u2014 all within BigQuery using SQL. 
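<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As one rough sketch of the incremental pattern mentioned earlier (docai_demo.parsed_chunks is a hypothetical table assumed to hold the content and uri values extracted from ML.PROCESS_DOCUMENT output; model and table names otherwise follow this walkthrough), a scheduled query could embed and append only the chunks from documents not yet in the embeddings table:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">-- Append embeddings only for newly parsed documents<br \/>\nINSERT INTO docai_demo.embeddings<br \/>\nSELECT *<br \/>\nFROM ML.GENERATE_EMBEDDING(<br \/>\n\u00a0MODEL `docai_demo.embedding_model`,<br \/>\n\u00a0(SELECT content, uri<br \/>\n\u00a0\u00a0FROM docai_demo.parsed_chunks<br \/>\n\u00a0\u00a0WHERE uri NOT IN (SELECT DISTINCT uri FROM docai_demo.embeddings))<br \/>\n);<\/span><\/p>\n<p><span style=\"font-weight: 400;\">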
<strong>Contact Gimasys today<\/strong> to discover more about this solution!<\/span><\/p>\n<p>&nbsp;<\/p>\n\n\t\t<\/div>\n\t<\/div>\n<div class=\"templatera_shortcode\"><div class=\"vc_row wpb_row vc_row-fluid\"><div class=\"wpb_column vc_column_container vc_col-sm-12\"><div class=\"vc_column-inner\"><div class=\"wpb_wrapper\"><div class=\"vc_message_box vc_message_box-standard vc_message_box-rounded vc_color-blue\" ><div class=\"vc_message_box-icon\"><i class=\"vc-mono vc-mono-technorati\"><\/i><\/div><p><a href=\"https:\/\/gcloudvn.com\/en\/main-logo-1\/\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-664\" src=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2021\/06\/main-logo-1.png\" alt=\"\" width=\"221\" height=\"72\" srcset=\"https:\/\/gcloudvn.com\/wp-content\/uploads\/2021\/06\/main-logo-1.png 214w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2021\/06\/main-logo-1-18x6.png 18w, https:\/\/gcloudvn.com\/wp-content\/uploads\/2021\/06\/main-logo-1-183x60.png 183w\" sizes=\"auto, (max-width: 221px) 100vw, 221px\" \/><\/a>As a senior partner of Google in Vietnam, Gimasys has more than 10 years of experience consulting on and implementing digital transformation for 2,000+ domestic corporations. 
Typical customers include Jetstar, Dien Quan Media, Heineken, Jollibee, Vietnam Airlines, HSC, SSI...<\/p>\n<p>Gimasys is currently a strategic partner of major global technology companies such as Salesforce, Oracle NetSuite, Tableau, and MuleSoft.<\/p>\n<p>Contact Gimasys - Google Cloud Premier Partner for advice on strategic solutions suited to the specific needs of your business:<\/p>\n<ul>\n<li>Email: gcp@gimasys.com<\/li>\n<li>Hotline: 0974 417 099<\/li>\n<\/ul>\n<\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div><\/div>\n<\/section>","protected":false},"excerpt":{"rendered":"Document preprocessing is a common stumbling block when building retrieval-augmented generation (RAG) pipelines. It often requires Python skills and external libraries to parse&hellip;","protected":false},"author":2,"featured_media":20970,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[1,135],"tags":[],"class_list":["post-20968","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kienthuc","category-google-cloud-platform","entry","has-media"],"_links":{"self":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/20968","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/comments?post=20968"}],"version-history":[{"count":0,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/posts\/20968\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media\/20970"}],"wp:attachment":[{"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/media?parent=20968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/categories?post=20968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gcloudvn.com\/en\/wp-json\/wp\/v2\/tags?post=20968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}