Skip to main content

Version: 1.0

Best Practices

SQL

Small table instead or larger ones (data denormalization)
Select with fields names instead of SELECT *
Use INNER JOIN instead of WHERE (Where creates more combinaison => more work)
Do biggest join and filter per join
Filter early and big with WHERE
Partiton data (For BigQuery)
- Partition by ingest time
- Partition by specied data columns
LIMIT does not affect performance

Dataflow

Handles errors
- Gracefully catch errors using Try-Catch blocks
  - Output errors to PCollection and set the collector for the later analysis
  - Think about it as recyling the bad data
- How to update exiting code ?

SQL
Dataflow