My goal should be quite straight forward: I have a PostgreSQL running in Cloud SQL, and want to use Datastream to transfer data to Big Query, within the same project, preferably through private IP connectivity.
I have set up the database with private IP, created a private connectivity in Datastream with an available IP range on the database's VPC, and set up a connection profile towards the private IP with the correct credentials.
All I get for now is a timeout when I test the connection, being a bit hard to debug.
I have also tried enabling firewall rules to accept the traffic, with the same result. (Is it necessary to create firewall rules in this case?)
At first I thought the reverse proxy would only be needed when connecting from outside the GCP project or from other networks. Do I really need it in this case as well? Shouldn't they be accessible when they are within the same GCP project?
The amount of work and config needed to make this work leads me to believe that I am doing something the wrong way or not following best practice here. As this is assumably a central part of Datastream/GCP functionality, I assume there is a simpler and more easy to maintain way? I am setting up several databases to transfer analytics data to Big Query, so minimizing the overhead for each of them is a big advantage.
What would be the preferred way to acheive this?
Side question: Apart from Datastream, is there another preferred way / best practice to transfer data to Big Query from Cloud SQL? In the Cloud SQL config I see the option "Enable private path: Allows other Google Cloud services like BigQuery to access data and make queries over Private IP" but I can't find much documentation on what this is and how to use it.
Do I need a reverse proxy even if they are in the same project?
Yes unfortunately I do believe you need to setup a "reverse proxy" or intermediary VM even within the same project for the VPC network.
The reason being for this stated in the docs is:
When you configure a Cloud SQL for PostgreSQL instance to use private IP addresses, you use a VPC peering connection between your VPC network and the VPC network of the underlying Google services where your Cloud SQL instance resides. (Cloud SQL blue square in above digram)
Because Datastream's network (Datastream blue rectangle in diagram) can't be peered directly with Cloud SQL's private services network, and because VPC peering isn't transitive, a reverse proxy for Cloud SQL is required to bridge the connection from Datastream to your Cloud SQL instance."
Basically what this means is because Cloud SQL is a managed service and uses its own "Cloud SQL" protected VPC and Datastream is a managed service with its own "Datastream" VPC it results in a transitive VPC peering conflict. To resolve this transitivity issue, a middle client/proxy must be established to forward the traffic.
Apart from Datastream, is there another preferred way / best practice to transfer data to BigQuery from Cloud SQL?
This depends on the use-case or meaning of "transfer data" so let me try and showcase the options/products.
First ask yourself which of the below use-cases are you?
a) You want to transfer data in real-time from Cloud SQL Postgres to BigQuery so that the data also resides in BigQuery
b) You want to read Cloud SQL data into BigQuery and use it for analysis?
If you are Option a) than yes this is where Datastream comes in as the product/feature for you. Datastream is a change data capture (CDC) and replication service that lets you synchronize data reliably, and with minimal latency. Basically as data is added to Cloud SQL Postgres it will be synced and replicated into BigQuery in almost real-time. The benefit here is that BigQuery data will be consistently updated without you needing to do extra work.
If you are Option b) and you want to quickly read data from Cloud SQL into BigQuery for analysis and are fine with the data residing purely in Cloud SQL than Datastream might be too much overhead. Cloud SQL Federated Queries allow you to read/query Cloud SQL data into BigQuery and may be a possible option. Selecting the Private path for Google Cloud services checkbox is what enables these federated queries for Private IP Cloud SQL instances as discussed in Connect Cloud SQL to BigQuery docs.