Streaming from Google Cloud Pub/Sub to BigQuery without the Middlemen

Ravish Garg
Google Cloud - Community
4 min read · Aug 10, 2022


Photo by Michael Barnes on Unsplash

To ingest data into BigQuery from Pub/Sub in real time, the traditional approach is a Dataflow pipeline, together with a schema definition and any required transformations. That can be overkill, however, if all you need is to move raw, untransformed data from Pub/Sub to BigQuery.

To simplify such use cases, Google has launched a direct streaming capability from Pub/Sub called “BigQuery Subscription”. In addition, Pub/Sub topic schemas provide the option of writing Pub/Sub messages to BigQuery tables with compatible schemas. If no schema is attached to your topic, messages are written as bytes or strings to the given BigQuery table. You can always go through the details over here, but in this blog I would like to walk through a quick setup to showcase its simplicity and make you more comfortable with its implementation.

Let’s consider a scenario where you would like to build a simple, real-time CDC pipeline from a Cloud SQL for MySQL table “people” to BigQuery, leveraging Google Cloud native building blocks:

  • Use “Debezium Server” to capture the change logs of the MySQL table “people”.
  • Create the BigQuery table “people2” into which the CDC workload will be ingested.
  • Create a Pub/Sub Topic Schema in Avro format, compatible with the BigQuery table “people2”.
  • Create a Debezium Server-compliant Pub/Sub Topic attached to the above Topic Schema.
  • Create a BigQuery Subscription attached to the above Pub/Sub Topic.
  • Start Debezium Server so that it begins pushing change logs as messages to the associated Pub/Sub Topic.
  • Each Pub/Sub message is then decoded using the attached Pub/Sub Schema before being written directly to the target BigQuery table.
Pub/Sub to BigQuery Direct Streaming, By Author

Now, let’s roll up our sleeves and get to the implementation:

Set up & Configure Debezium Server:

  • Download & Install Debezium Server distribution.
  • Configure Debezium Server via “./conf/application.properties”, which contains all configuration related to the Source, Sink, Format, and any applicable Transformations.
  • The configuration file with a Cloud SQL for MySQL database as the source and Google Cloud Pub/Sub as the messaging infrastructure will look like this:
# Sink: publish change events to Google Cloud Pub/Sub
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id=<<$PROJECT_ID>>
# Source: Cloud SQL for MySQL connection details
debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
debezium.source.database.hostname=10.67.112.10
debezium.source.database.port=3306
debezium.source.database.user=root
debezium.source.database.password=mysqldb
debezium.source.database.server.id=373472625
debezium.source.database.server.name=mysql
# Capture only the test.people table
debezium.source.database.include.list=test
debezium.source.table.include.list=test.people
# File-based history and offset storage for this single-node setup
debezium.source.database.history=io.debezium.relational.history.FileDatabaseHistory
debezium.source.database.history.file.filename=data/history.dat
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
# Flatten the change-event envelope to the new row state ("__deleted" marks deletes)
debezium.source.transforms=unwrap
debezium.source.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.source.transforms.unwrap.delete.handling.mode=rewrite
debezium.source.key.converter.schemas.enable=false
debezium.source.value.converter.schemas.enable=false
BigQuery Table Definition, By Author
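The target table shown above can also be created from the command line. A minimal sketch with the `bq` CLI, assuming a dataset named `test` already exists (the dataset name is an assumption; the columns mirror the Avro schema defined next):

```shell
# Hypothetical dataset name "test"; column types mirror the Avro schema below.
bq query --use_legacy_sql=false '
CREATE TABLE test.people2 (
  id INT64,
  first_name STRING,
  last_name STRING,
  email STRING,
  zipcode INT64,
  city STRING,
  country STRING,
  __deleted STRING
)'
```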
  • Create the Pub/Sub Schema “mysql_bq” using the Apache Avro format.
{
  "type": "record",
  "name": "Avro",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "zipcode", "type": "int" },
    { "name": "city", "type": "string" },
    { "name": "country", "type": "string" },
    { "name": "__deleted", "type": "string" }
  ]
}
```
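The schema above can be registered with gcloud; this sketch assumes the JSON is saved locally as `mysql_bq.avsc` (the filename is an assumption):

```shell
# Hypothetical local filename for the Avro definition shown above.
gcloud pubsub schemas create mysql_bq \
  --type=AVRO \
  --definition-file=mysql_bq.avsc
```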
  • Create Pub/Sub Topic “mysql.test.people” associated with the above Schema.
Create Pub/Sub Topic connected to mysql_bq schema, By Author
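The same topic can be created via gcloud; since Debezium Server publishes JSON payloads, JSON message encoding is used here:

```shell
# Attach the topic to the "mysql_bq" schema with JSON encoding.
gcloud pubsub topics create mysql.test.people \
  --schema=mysql_bq \
  --message-encoding=JSON
```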
  • Add BigQuery Subscription to the Pub/Sub Topic.
Pub/Sub BigQuery Subscription, By Author
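A sketch of the equivalent gcloud command; the subscription name `mysql_bq_sub` and dataset `test` are assumptions, and `<<$PROJECT_ID>>` should be replaced with your project ID:

```shell
# Hypothetical subscription name; --use-topic-schema writes fields into
# matching BigQuery columns instead of raw bytes.
gcloud pubsub subscriptions create mysql_bq_sub \
  --topic=mysql.test.people \
  --bigquery-table=<<$PROJECT_ID>>:test.people2 \
  --use-topic-schema
```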
  • Initiate the Debezium Server.
<<Debezium_Server_Path/run.sh>>
Debezium-Server Output, By Author
  • And voilà… here we have the CDC data in tabular format in BigQuery.
Pub/Sub to BigQuery Direct Streaming, By Author
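To spot-check the ingested rows yourself, query the target table (the dataset name `test` is an assumption carried over from this walkthrough):

```shell
# Hypothetical dataset/table names from this walkthrough.
bq query --use_legacy_sql=false \
  'SELECT id, first_name, last_name, __deleted FROM test.people2 LIMIT 10'
```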


Customer Engineer, Data Specialist @ Google Cloud. I help customers transform and evolve their business via Google’s global network and software infrastructure.