Automatically Masking PII in BigQuery

Table of Contents

  1. Introduction
  2. Background
  3. Hands-on Walkthrough
  4. Deeper Notes
  5. Closing Thoughts

1. Introduction

“I needed to mask PII.”

Query-time logic that automatically detects PII and masks it

While expanding our ELT pipelines recently, I had to implement logic that automatically detects PII in queries and masks it. A specific STRING-typed column contained a mix of PII and non-PII, and the core challenge was to handle this cleanly so that internal data utilization could go up while still fully complying with privacy regulations.

While brainstorming with my team lead, I came across GCP’s Sensitive Data Protection service for the first time, and we ended up adopting an environment that lets us call this service in the form of a BigQuery function. (Colleagues from the DevOps and security teams reviewed it with us.)

Sensitive Data Protection

Adopting this environment required not just data engineering skills, but also a fairly broad understanding of DevOps and cloud infrastructure. At first I felt overwhelmed — “this looks really hard, can I actually pull this off?” — but once I sat down and looked at it carefully, it turned out not to be that complicated.

I wrote this post so that anyone — including data analysts — can follow along and try this themselves. I’ll walk through, step by step, how I built out “automatic PII masking in BigQuery” using GCP Sensitive Data Protection.


2. Background

“There was a specific reason I chose Sensitive Data Protection.”

The company I work at runs its DW on GCP, and we strictly separate PII (personally identifiable information) from Non-PII. PII-only and Non-PII-only projects are split apart, and BigQuery follows the same boundary. In particular, the PII project is locked behind a closed network that’s cut off from the public internet, in line with our internal privacy policy and applicable regulations.

💡 [Author’s Note] What is PII?

PII stands for “Personally Identifiable Information” — any information that can identify an individual. Examples include name, address, phone number, national ID number, and so on.

A hypothetical PII dim__users table

In this environment, I was given a specific goal for the data team: build a batch pipeline that masks (de-identifies) the PII portion of a column that contains a mix of PII and non-PII in a PII table. In other words, a specific Transformation Task in an Airflow DAG needed to carry “BigQuery-side logic that de-identifies PII as part of the transformation.”

💡 [Author’s Note] Why bother with masking?

Because I wanted to keep PII fully protected, while still letting the rest of the organization use the non-PII content as natural-language data.

An Airflow DAG Transformation Task whose role is “BigQuery-side logic that de-identifies PII as part of the transformation”

Before landing on Sensitive Data Protection, I evaluated several other options. Each had blockers.

Option 1. OpenAI API

I thought about something like the snippet below — feeding the PII column to OpenAI and asking the LLM to mask it. But that means PII data leaves the company and goes to a third party (OpenAI), which is incompatible with our internal security policy.

prompt_system = "..."

masked_texts = []
for i in range(len(df)):

  text = df.loc[i, 'pii']
  prompt_individual = f"""
    Please output the following text with all PII masked.
    Use asterisks (*) for the mask.
    Text: {text}
  """

  result = openai_client.chat.completions.create(
    model='xxx',
    max_tokens=1234,
    n=1,
    temperature=0,
    stop=None,
    messages=[
       {"role": "system", "content": prompt_system},
       {"role": "user", "content": prompt_individual}
    ]
  )

  masked_text = result.choices[0].message.content.strip()
  masked_texts.append(masked_text)

df['pii_masked'] = masked_texts
df = df[['column_a', 'column_b', 'pii_masked']]

to_table = bigquery_client.get_table(to_table_id)
bigquery_client.load_table_from_dataframe(df, to_table)

Option 2. Gemini

Staying inside the Google ecosystem is better than OpenAI, but LLMs respond probabilistically, so output consistency isn’t guaranteed. A bad day where PII leaks through unmasked is exactly the kind of incident we want to avoid.

Option 3. Regex-based user-defined functions

I considered hand-defining regex patterns for the major PII types — national ID number, email, phone — and masking with UDFs. But any new, unseen PII pattern would be a “leak first, fix later” situation, and hand-maintaining a library of regexes is a poor use of effort.

Option 4. Python-based PII masking libraries

I looked into libraries like pii-anonymizer. They identify PII via NER (Named Entity Recognition) algorithms, which has the upside of dynamic maintenance.

pii-anonymizer

But running this kind of de-identification directly inside an Airflow DAG’s Python code means a single data engineer holds the keys to the logic — that’s not a great place to be from a corporate security strategy perspective. (Ideally only GCP administrators can change the logic.)

Conclusion. Sensitive Data Protection

After all of that, I went with GCP Sensitive Data Protection. It’s a managed service that automatically detects PII in text or images using NLP and OCR-based algorithms, and applies masking or encryption based on user-defined rules.

The killer feature: GCP publishes both official documentation and an open-source Terraform repo for invoking this service from BigQuery via a Remote Function. That lets you write something like the following to selectively mask only the PII portion and safely use the rest:

SELECT
    fns.dlp_freetext_mask(pii_column) AS pii_masked
FROM
    table

This approach has two big advantages:

  • It enforces internal security policy at the system level, with permission management via IAM.
  • It’s more flexible than regex, and safer than the probabilistic output of an LLM.

For these reasons I judged this to be the most realistic and security-appropriate choice. I wrote up a detailed proposal in Confluence, shared it with the DevOps and security teams, and asked for a review. With that proposal as the starting point, my colleagues could quickly evaluate the design and we moved into adoption fast.


3. Hands-on Walkthrough

“Let’s go through the actual build of the BigQuery DLP Remote Function together.”

Internal execution flow of the BigQuery DLP Remote Function

Below is the build guide based on my own experience. If you read the official material as a non-DevOps / non-backend engineer, it can feel pretty rough and unfriendly. I felt the same way, so I’ll lay it out more gently here.

3.0. Dev environment

  • macOS (Apple Silicon)
  • terraform CLI (v1.8.2) — install via this link.
  • gcloud SDK CLI (v525.0.0) — installation steps below.
  • A GCP project with billing history. (Without it, some services have usage caps that will get in your way during development.)

3.1. Clone the bigquery-dlp-remote-function repo

First, clone GCP’s open-source project locally:

git clone https://github.com/GoogleCloudPlatform/bigquery-dlp-remote-function.git
cd bigquery-dlp-remote-function/

This project is built on Terraform. In other words, it ships the entire environment for “calling Sensitive Data Protection from BigQuery as a Remote Function” as IaC (Infrastructure as Code), deployable in one shot.

💡 [Author’s Note] What is Terraform?

Standing up an environment usually means manually creating a huge number of resources in the right dependency order — tedious, error-prone, and easy to mess up. Terraform lets you describe that whole resource layout as a “blueprint,” and with a single command it’ll “construct” your cloud environment from the blueprint. That’s why we call it “IaC.”

3.2. Install and initialize the gcloud SDK CLI

Because each GCP service has to be called locally, you need to install the gcloud SDK CLI. On macOS, Homebrew makes this easy:

brew install --cask google-cloud-sdk
gcloud init
gcloud version

3.3. Understand what Terraform deploys for you

main.tf defines the deployment process. It’s complex, but in summary, the following environment gets built.

(1) Preparation

  • Create the required service accounts and permissions, plus the storage spaces for code and for build artifacts.

(2) Turn code into a runnable image

  • Use Cloud Build to package the Remote Function code into a Docker image, and store the image in Artifact Registry.

(3) Deploy the function

  • Deploy the Docker image to Cloud Run so the Remote Function becomes invocable as a web service.

(4) Define PII detection and de-identification rules

  • Define how Sensitive Data Protection should “Inspect” data for PII and how to “De-identify” it — these are templates.

(5) Wire it up to BigQuery

  • Create a Connection so BigQuery can invoke the Cloud Run-deployed Remote Function, and finally create the BigQuery function that calls it.

A diagram of the full deployment flow

💡 [Author’s Note] What is Cloud Build?

It’s a service that pulls in your source code, runs automated tests, and turns it into a runnable application (here, an image). If you’ve never touched Docker, this may feel abstract — but think of it as turning an “install manual” into a fully-packaged application bundle.

💡 [Author’s Note] What is Artifact Registry?

The images produced by Cloud Build have to live somewhere so they can be executed later. Artifact Registry is where those build artifacts are stored and managed.

💡 [Author’s Note] What is Cloud Run?

Heard of “serverless”? Cloud Run lets you run code without managing servers. It only runs when something actually invokes it, which is great for cost. When we later call the BigQuery DLP Remote Function, the call flows through Cloud Run and on to Sensitive Data Protection.

💡 [Author’s Note] What is a BigQuery Connection?

BigQuery can’t natively call external APIs. A BigQuery Connection plays the bridge role that lets BigQuery talk to Cloud Run.

3.4. Enable the required GCP service APIs

Terraform will deploy resources by hitting each service’s API, so head to GCP and run the following in Cloud Shell:

gcloud services enable \
artifactregistry.googleapis.com \
bigquery.googleapis.com \
bigqueryconnection.googleapis.com \
cloudbuild.googleapis.com \
cloudkms.googleapis.com \
containerregistry.googleapis.com \
dlp.googleapis.com \
run.googleapis.com \
secretmanager.googleapis.com

💡 [Author’s Note] Where is Cloud Shell?

This doc walks you through it. Be sure to pick the same project you’re building this environment in.

3.5. Understand the Sensitive Data Protection lifecycle

Sensitive Data Protection works in two steps internally — “Inspect” and “De-identify.”

Inspect & De-identify

  • Step 1: Identify which parts of the text are PII. (Inspect)
  • Step 2: Transform the identified parts according to a user-defined rule. (De-identify)

Open the Sensitive Data Protection console and you’ll see that both Inspect and De-identify behavior are declared as templates. You can configure them manually in the UI, but from a development perspective it’s much better to bake these templates into Terraform too. (Always express expected infrastructure state as code.)

The GCP Sensitive Data Protection console

If you look inside the Terraform project we cloned, you’ll find sample_dlp_deid_config.json — that’s the De-identify template config. It’s pretty bare, though, since it’s just a sample. And sample_dlp_inspect_config.json doesn’t exist at all. So the additional work is:

  • Improve sample_dlp_deid_config.json.
  • Add sample_dlp_inspect_config.json.

3.6. Updating the De-identify template config

Replace sample_dlp_deid_config.json with the JSON below.

{
  "deidentifyTemplate" : {
    "displayName" : "Global and Korea specific infoTypes Masker",
    "description":"De-identifies Global and Korea-specific infoTypes with Masking.",
    "deidentifyConfig": {
      "infoTypeTransformations": {
        "transformations": [
          {
            "infoTypes": [{"name": "ADVERTISING_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "AGE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "BLOOD_TYPE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "CREDIT_CARD_DATA"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "CREDIT_CARD_EXPIRATION_DATE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "CREDIT_CARD_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "CREDIT_CARD_TRACK_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "COUNTRY_DEMOGRAPHIC"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "CVV_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "DATE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "DATE_OF_BIRTH"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "DEMOGRAPHIC_DATA"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "DOMAIN_NAME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "DRIVERS_LICENSE_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "EMAIL_ADDRESS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "EMPLOYMENT_STATUS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "ETHNIC_GROUP"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "FINANCIAL_ACCOUNT_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "FINANCIAL_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "FIRST_NAME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "GENDER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "GENERIC_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "GEOGRAPHIC_DATA"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "GOVERNMENT_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "IBAN_CODE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "HTTP_COOKIE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "HTTP_USER_AGENT"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "ICCID_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "ICD9_CODE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "ICD10_CODE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "IMEI_HARDWARE_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "IMMIGRATION_STATUS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "IMSI_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "IP_ADDRESS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "LAST_NAME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "LOCATION"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "LOCATION_COORDINATES"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MAC_ADDRESS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MAC_ADDRESS_LOCAL"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MAC_ADDRESS_UNIVERSAL"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MARITAL_STATUS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MEDICAL_DATA"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MEDICAL_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MEDICAL_RECORD_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "MEDICAL_TERM"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "ORGANIZATION_NAME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "PASSPORT"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "PERSON_NAME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "PHONE_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "POLITICAL_TERM"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "RELIGIOUS_TERM"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "SEXUAL_ORIENTATION"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "STREET_ADDRESS"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "SWIFT_CODE"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "TECHNICAL_ID"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "TIME"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "TRADE_UNION"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "URL"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "VAT_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "VEHICLE_IDENTIFICATION_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "KOREA_ARN"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "KOREA_DRIVERS_LICENSE_NUMBER"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "KOREA_PASSPORT"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          },
          {
            "infoTypes": [{"name": "KOREA_RRN"}],
            "primitiveTransformation": {"characterMaskConfig": {"maskingCharacter": "*"}}
          }
        ]
      }
    }
  }
}

3.7. Adding the Inspect template config

Create a new file, sample_dlp_inspect_config.json, with the contents below.

💡 [Author’s Note] minLikelihood

I recommend minLikelihood: "UNLIKELY". This is the detection threshold for PII identification. From unit testing I observed:

  • Lower than “UNLIKELY” → too eager, masks non-PII content and hurts usability.
  • Higher than “UNLIKELY” → too lax, lets some PII through and violates the policy.
{
  "inspectTemplate": {
    "displayName" : "Global and Korea specific infoTypes Inspector",
    "description":"Inspects Global and Korea-specific infoTypes.",
    "inspectConfig": {
      "infoTypes": [
        {"name": "ADVERTISING_ID"},
        {"name": "AGE"},
        {"name": "BLOOD_TYPE"},
        {"name": "CREDIT_CARD_DATA"},
        {"name": "CREDIT_CARD_EXPIRATION_DATE"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "CREDIT_CARD_TRACK_NUMBER"},
        {"name": "COUNTRY_DEMOGRAPHIC"},
        {"name": "CVV_NUMBER"},
        {"name": "DATE"},
        {"name": "DATE_OF_BIRTH"},
        {"name": "DEMOGRAPHIC_DATA"},
        {"name": "DOMAIN_NAME"},
        {"name": "DRIVERS_LICENSE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "EMPLOYMENT_STATUS"},
        {"name": "ETHNIC_GROUP"},
        {"name": "FINANCIAL_ACCOUNT_NUMBER"},
        {"name": "FINANCIAL_ID"},
        {"name": "FIRST_NAME"},
        {"name": "GENDER"},
        {"name": "GENERIC_ID"},
        {"name": "GEOGRAPHIC_DATA"},
        {"name": "GOVERNMENT_ID"},
        {"name": "IBAN_CODE"},
        {"name": "HTTP_COOKIE"},
        {"name": "HTTP_USER_AGENT"},
        {"name": "ICCID_NUMBER"},
        {"name": "ICD9_CODE"},
        {"name": "ICD10_CODE"},
        {"name": "IMEI_HARDWARE_ID"},
        {"name": "IMMIGRATION_STATUS"},
        {"name": "IMSI_ID"},
        {"name": "IP_ADDRESS"},
        {"name": "LAST_NAME"},
        {"name": "LOCATION"},
        {"name": "LOCATION_COORDINATES"},
        {"name": "MAC_ADDRESS"},
        {"name": "MAC_ADDRESS_LOCAL"},
        {"name": "MAC_ADDRESS_UNIVERSAL"},
        {"name": "MARITAL_STATUS"},
        {"name": "MEDICAL_DATA"},
        {"name": "MEDICAL_ID"},
        {"name": "MEDICAL_RECORD_NUMBER"},
        {"name": "MEDICAL_TERM"},
        {"name": "ORGANIZATION_NAME"},
        {"name": "PASSPORT"},
        {"name": "PERSON_NAME"},
        {"name": "PHONE_NUMBER"},
        {"name": "POLITICAL_TERM"},
        {"name": "RELIGIOUS_TERM"},
        {"name": "SEXUAL_ORIENTATION"},
        {"name": "STREET_ADDRESS"},
        {"name": "SWIFT_CODE"},
        {"name": "TECHNICAL_ID"},
        {"name": "TIME"},
        {"name": "TRADE_UNION"},
        {"name": "URL"},
        {"name": "VAT_NUMBER"},
        {"name": "VEHICLE_IDENTIFICATION_NUMBER"},
        {"name": "KOREA_ARN"},
        {"name": "KOREA_DRIVERS_LICENSE_NUMBER"},
        {"name": "KOREA_PASSPORT"},
        {"name": "KOREA_RRN"}
      ],
      "minLikelihood": "UNLIKELY",
      "includeQuote": true,
      "excludeInfoTypes": false
    }
  }
}

3.8. Editing main.tf

The Terraform deployment manual needs a few changes:

  • The original blueprint didn’t create an Inspect template — we have to add that.
  • Originally two BigQuery Remote Functions were declared (an Encrypt function and a Decrypt function). We only need a single Mask function, so simplify accordingly.
  • GCS bucket names must be globally unique, so I made it more specific by including the project ID.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0.0"
    }
  }
  provider_meta "google" {
    module_name = "cloud-solutions/deploy-bigquery-dlp-remote-function-v0.1"
  }
}

provider "google" {
  billing_project = var.project_id
  project         = var.project_id
  region          = var.region
}

resource "google_service_account" "run_service_account" {
  account_id = "${var.service_name}-runner"
  project    = var.project_id
}

resource "google_project_iam_member" "grant_role_to_sa" {
  for_each = toset([
    "roles/dlp.reader",
    "roles/dlp.user",
    "roles/dlp.admin",
  ])
  project = var.project_id
  role    = each.key
  member  = "serviceAccount:${google_service_account.run_service_account.email}"
}

resource "google_artifact_registry_repository" "image_registry" {
  format        = "DOCKER"
  repository_id = var.artifact_registry_name
  project       = var.project_id
  location      = var.region
}

resource "google_service_account" "build_service_account" {
  account_id = "${var.service_name}-builder"
  project    = var.project_id
}

resource "google_project_iam_member" "grant_role_to_build_sa" {
  project = var.project_id
  role    = "roles/cloudbuild.builds.builder"
  member  = "serviceAccount:${google_service_account.build_service_account.email}"
}

resource "google_storage_bucket" "cloud_build_bucket" {
  project                     = var.project_id
  location                    = var.region
  name                        = "build_bucket_${var.service_name}_${var.project_id}"
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  force_destroy               = true
}

resource "google_storage_bucket_iam_member" "builder_iam_bucket" {
  for_each = toset([
    "roles/cloudbuild.builds.builder"
  ])
  bucket = google_storage_bucket.cloud_build_bucket.name
  member = "serviceAccount:${google_service_account.build_service_account.email}"
  role   = each.key
}

## Create Image using Cloud Build and store in artifact registry
resource "random_id" "build_version" {
  byte_length = 8

  keepers = {
    project_id = var.project_id
    region     = var.region
  }
}

resource "null_resource" "build_function_image" {
  depends_on = [
    google_artifact_registry_repository.image_registry,
    google_storage_bucket_iam_member.builder_iam_bucket
  ]

  triggers = {
    project_id      = var.project_id
    region          = var.region
    full_image_path = "${var.region}-docker.pkg.dev/${var.project_id}/${google_artifact_registry_repository.image_registry.name}/${var.service_name}:${random_id.build_version.hex}"
  }

  provisioner "local-exec" {
    when    = create
    command = <<EOF
gcloud builds submit \
--project ${var.project_id} \
--region ${var.region} \
--machine-type e2-highcpu-8 \
--substitutions _CONTAINER_IMAGE_NAME=${self.triggers.full_image_path} \
--service-account ${google_service_account.build_service_account.id} \
--default-buckets-behavior regional-user-owned-bucket \
--gcs-source-staging-dir gs://${google_storage_bucket.cloud_build_bucket.name}/source
EOF
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOF
gcloud artifacts docker images delete \
${self.triggers.full_image_path} \
--quiet
EOF
  }
}

resource "google_cloud_run_v2_service" "bq_function" {
  location            = var.region
  name                = var.service_name
  project             = var.project_id
  depends_on          = [null_resource.build_function_image]
  deletion_protection = false

  template {
    service_account       = google_service_account.run_service_account.email
    execution_environment = "EXECUTION_ENVIRONMENT_GEN2"

    containers {
      image = null_resource.build_function_image.triggers.full_image_path
      env {
        name  = "PROJECT_ID"
        value = var.project_id
      }
    }
  }
}

resource "google_bigquery_connection" "external_bq_fn_connection" {
  project       = var.project_id
  connection_id = "ext-${var.service_name}"
  location      = var.region
  description   = "External transformation function connection"
  cloud_resource {}
}

resource "google_project_iam_binding" "grant_bq_connection_run_invoker_role" {
  project = var.project_id
  role    = "roles/run.invoker"
  depends_on = [
    google_bigquery_connection.external_bq_fn_connection
  ]
  members = [
    "serviceAccount:${google_bigquery_connection.external_bq_fn_connection.cloud_resource[0].service_account_id}"
  ]
}

resource "google_bigquery_dataset" "routines_dataset" {
  project    = var.project_id
  location   = var.region
  dataset_id = var.bq_dataset
}

## Create DLP De-identify Template

resource "random_id" "random_de_id_template_id_random" {
  byte_length = 8
  prefix      = "dlpdeidfn_"
  keepers = {
    project_id = var.project_id
    region     = var.region
  }
}

locals {
  de_id_template_id         = random_id.random_de_id_template_id_random.hex
  de_identify_template_json = merge(jsondecode(file(var.dlp_deid_template_json_file)), { templateId = local.de_id_template_id })
}

resource "null_resource" "dlp_de_identify_template" {
  triggers = {
    project_id                   = var.project_id
    region                       = var.region
    dlp_de_id_template_id        = local.de_id_template_id
    dlp_de_id_template_full_path = "projects/${var.project_id}/locations/${var.region}/deidentifyTemplates/${local.de_id_template_id}"
    template_file_hash           = filesha256(var.dlp_deid_template_json_file)
  }

  provisioner "local-exec" {
    when    = create
    command = <<EOF
curl -s https://dlp.googleapis.com/v2/projects/${self.triggers.project_id}/locations/${self.triggers.region}/deidentifyTemplates \
--header "X-Goog-User-Project: ${var.project_id}" \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '${jsonencode(local.de_identify_template_json)}'
EOF
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOF
curl -s --request DELETE \
https://dlp.googleapis.com/v2/${self.triggers.dlp_de_id_template_full_path} \
--header "X-Goog-User-Project: ${self.triggers.project_id}" \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Accept: application/json' \
--header "Content-Type: application/json"
EOF
  }
}

## Create DLP Inspect Template

resource "random_id" "random_inspect_template_id_random" {
  byte_length = 8
  prefix      = "dlpinspectfn_"
  keepers = {
    project_id = var.project_id
    region     = var.region
  }
}

locals {
  inspect_template_id   = random_id.random_inspect_template_id_random.hex
  inspect_template_json = merge(jsondecode(file(var.dlp_inspect_template_json_file)), { templateId = local.inspect_template_id })
}

resource "null_resource" "dlp_inspect_template" {
  triggers = {
    project_id                     = var.project_id
    region                         = var.region
    dlp_inspect_template_id        = local.inspect_template_id
    dlp_inspect_template_full_path = "projects/${var.project_id}/locations/${var.region}/inspectTemplates/${local.inspect_template_id}"
    template_file_hash             = filesha256(var.dlp_inspect_template_json_file)
  }

  provisioner "local-exec" {
    when    = create
    command = <<EOF
curl -s https://dlp.googleapis.com/v2/projects/${self.triggers.project_id}/locations/${self.triggers.region}/inspectTemplates \
--header "X-Goog-User-Project: ${var.project_id}" \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '${jsonencode(local.inspect_template_json)}'
EOF
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOF
curl -s --request DELETE \
https://dlp.googleapis.com/v2/${self.triggers.dlp_inspect_template_full_path} \
--header "X-Goog-User-Project: ${self.triggers.project_id}" \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Accept: application/json' \
--header "Content-Type: application/json"
EOF
  }
}

## Create BigQuery remote functions
resource "random_id" "bq_job_random" {
  byte_length = 8
}

resource "null_resource" "bq_dlp_mask_function" {
  depends_on = [null_resource.dlp_de_identify_template, null_resource.dlp_inspect_template, google_cloud_run_v2_service.bq_function, google_bigquery_connection.external_bq_fn_connection, google_bigquery_dataset.routines_dataset]

  triggers = {
    project_id         = var.project_id
    region             = var.region
    dataset_id         = var.bq_dataset
    cloud_service_name = google_cloud_run_v2_service.bq_function.id
    cloud_run_uri      = google_cloud_run_v2_service.bq_function.uri
  }

  provisioner "local-exec" {
    when    = create
    command = <<EOF
bq query --project_id "${self.triggers.project_id}" \
--use_legacy_sql=false \
"CREATE OR REPLACE FUNCTION ${self.triggers.dataset_id}.dlp_freetext_mask(v STRING) RETURNS STRING \
REMOTE WITH CONNECTION \`${self.triggers.project_id}.${self.triggers.region}.${google_bigquery_connection.external_bq_fn_connection.connection_id}\` \
OPTIONS (endpoint = '${self.triggers.cloud_run_uri}', user_defined_context = [('mode', 'deidentify'),('algo','dlp'),('dlp-deid-template','${null_resource.dlp_de_identify_template.triggers.dlp_de_id_template_full_path}'),('dlp-inspect-template','${null_resource.dlp_inspect_template.triggers.dlp_inspect_template_full_path}')]);" \
EOF
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOF
bq query --project_id "${self.triggers.project_id}" \
--use_legacy_sql=false \
"DROP FUNCTION ${self.triggers.dataset_id}.dlp_freetext_mask" \
EOF
  }
}

3.9. Editing variables.tf

A few variables were updated and added:

  • Variables that can be treated as constants (project ID, region, etc.) got default values.
  • A new variable for the Inspect template config file path was added, so main.tf can reference it during Terraform runs.
variable "project_id" {
  type    = string
  default = "joshua-private"
}

variable "region" {
  type    = string
  default = "asia-northeast3"
}

variable "artifact_registry_name" {
  type    = string
  default = "bq-remote-functions"
}

variable "bq_dataset" {
  type    = string
  default = "remote_functions"
}

variable "dlp_deid_template_json_file" {
  type    = string
  default = "sample_dlp_deid_config.json"
}

variable "dlp_inspect_template_json_file" {
  type    = string
  default = "sample_dlp_inspect_config.json"
}

variable "service_name" {
  default = "bq-transform-fns"
}

variable "user_os" {
  type        = string
  default     = "linux"
  description = "The OS of the person running the Terraform script. Options: [linux, darwin]"
  validation {
    condition     = contains(["linux", "darwin"], var.user_os)
    error_message = "Supported OS Options: [linux, darwin]"
  }
}

3.10. Authenticating the gcloud SDK

Terraform will deploy resources via the gcloud SDK, so authenticate first:

gcloud auth application-default login

3.11. Previewing the project shape with terraform plan

Before applying, check what’s about to get deployed:

terraform plan

3.12. Deploying the GCP infra with terraform apply

Now deploy the GCP infrastructure for the PII masking function as IaC in a single shot:

terraform apply

💡 [Author’s Note] I got an error!

If you see ERROR: (gcloud.builds.submit) FAILED_PRECONDITION: due to quota restrictions, Cloud Build cannot run builds in this region, you’ll need GCP to allow Cloud Build in your region (in my case, asia-northeast3). Projects with little to no billing history sometimes have Cloud Build restricted for financial-stability reasons. Two options:

  • Quick workaround: set the Cloud Build deployment region to us-central1. (This only moves the application image to a US data center — PII data itself isn’t moved, so privacy regulations aren’t affected.)
  • Proper fix: file via GCP Support Hub or contact a Customer Engineer.
...
gcloud builds submit \
--project ${var.project_id} \
--region us-central1 \ # Override the region only here.
--machine-type e2-highcpu-8 \
--substitutions _CONTAINER_IMAGE_NAME=${self.triggers.full_image_path} \
--service-account ${google_service_account.build_service_account.id} \
--default-buckets-behavior regional-user-owned-bucket \
--gcs-source-staging-dir gs://${google_storage_bucket.cloud_build_bucket.name}/source
...

3.13. Running the function in the BigQuery console

After 5 – 10 minutes the infra should be fully set up, and you can check the masking function in action.

In the left Explorer panel, under the remote_functions dataset, you’ll see the dlp_freetext_mask function in the Routines section.

Try the function on a sample PII record and you’ll see the PII portion successfully masked.

3.14. Cleaning up after the hands-on

To avoid ongoing charges, tear everything down once you’re done:

terraform destroy

4. Deeper Notes

“A few things to keep in mind.”

4.1. Check the service cost

Take a look at Sensitive Data Protection pricing:

The BigQuery Remote Function reaches Sensitive Data Protection through a function on Cloud Run. As noted earlier, Cloud Run is serverless — you only pay when the code runs — which keeps the cost surprisingly low.

4.2. Check the service quotas and limits

Check the Sensitive Data Protection quotas and limits:

The service has fairly aggressive quota limits — if a real-time or batch pipeline tries to push too many records through it, you’ll hit quota errors quickly. So you need to introduce pagination to throttle usage. In my Airflow DAG I avoided quota errors like this:

...
PAGE_SIZE = 100 # Records per batch
TOTAL_ROWS = len(df) # Total record count
TOTAL_PAGES = (TOTAL_ROWS + PAGE_SIZE - 1) // PAGE_SIZE # Total pages

for PAGE in range(TOTAL_PAGES):
  OFFSET = PAGE * PAGE_SIZE

  query = f"""
    WITH
    CTE_batch AS (
      SELECT
        pii
      FROM
        table
      ORDER BY
        id
      LIMIT
        {PAGE_SIZE}
      OFFSET
        {OFFSET}
    )
    SELECT
      pii,
      remote_functions.dlp_freetext_mask(pii) as pii_masked
    FROM
      CTE_batch
  """

  result = client.query(query)
  result_df = result.to_dataframe()
  ...

4.3. How to read dlp_freetext_mask error logs

When dlp_freetext_mask fails in BigQuery, the BigQuery error message itself is not very informative — it’ll only tell you that the replies field is missing:

Returned JSON from https://bq-transform-fns-xxx-du.a.run.app is not a valid object or it does not have "replies" field set. jobId: bquxjob_xxx, requestId: xxx.

In that case, head over to the Cloud Run console — the precise error logs live there. Keep this in mind for debugging.

Cloud Run


5. Closing Thoughts

“Here is the value the BigQuery DLP Remote Function created.”

So that’s the detailed walkthrough of building a BigQuery function for automatic PII masking, using GCP Sensitive Data Protection. (I hope nothing was overly hard or under-explained.)

The BigQuery DLP Remote Function environment was, at its core, my attempt to resolve a dilemma: how do we keep up with internal security policy while still raising the organization’s ability to use data? This was actually the first time the company had stood up an environment like this, so it took a lot of cross-team discussion — DevOps, security, data, and the leaders of each — and that made it a non-trivial undertaking.

Once it shipped, alongside the satisfaction of contributing to the organization, I personally walked away with a few things:

  • A much broader understanding of cloud infrastructure. (I think I could spin up a Remote Function that calls Gemini straight from BigQuery pretty easily now.)
  • A real jump in Terraform fluency. (I’d only used it for personal learning before, but applying it to a production problem made it stick.)
  • A clearer sense of how to communicate with DevOps and security teammates while managing data pipelines. (DevOps cares deeply about tfstate management; security wants the PII logic itself declared explicitly.)

It was a hard project, but solving an unprecedented problem ended up letting the rest of the org use data at a higher level — and I grew a step in the process. The sense of accomplishment was unusually high. Whenever I face hard problems going forward, I want to remember this “sweet ending” and untangle them one by one. Thanks for reading.