Study Note 3.2.1: BigQuery Best Practices


Cost Reduction Strategies

  1. Column Selection
    • Avoid using SELECT *
    • Always specify required column names explicitly
    • Rationale: BigQuery uses columnar storage, so reading only the needed columns minimizes the data scanned
  2. Query Planning
    • Always check query pricing before execution
    • The estimated bytes to be processed appear in the top-right corner of the query editor before you run the query
    • Use clustering and partitioning effectively
  3. Data Operations
    • Use streaming inserts cautiously as they can significantly increase costs
    • Materialize query results in stages, especially when a CTE is referenced in multiple places
    • Leverage BigQuery's query result caching
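
A quick way to verify that caching is paying off: the INFORMATION_SCHEMA jobs view records a cache_hit flag per job (the region qualifier below assumes the US multi-region; adjust to yours):

-- Cache hits are free: total_bytes_billed is 0 for them
SELECT
  cache_hit,
  COUNT(*) AS query_count,
  SUM(total_bytes_billed) AS bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY cache_hit;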

Query Performance Optimization

  1. Data Structure and Filtering
    • Filter on partitioned or clustered columns
    • Denormalize data when dealing with complex structures
    • Utilize nested or repeated columns for denormalization
  2. Data Source Management
    • Use external data sources judiciously
    • Reading from Google Cloud Storage may incur additional costs
    • Reduce data volume before performing joins (see the sketch after this list)
  3. Query Pattern Optimization
    • Don't treat the WITH clause as a prepared statement: CTEs are not materialized and may be recomputed at each reference
    • Avoid over-sharding tables; prefer partitioning over many date-suffixed tables
    • Avoid JavaScript user-defined functions (UDFs); prefer native SQL functions
    • Use approximate aggregation functions (e.g., APPROX_COUNT_DISTINCT, backed by HyperLogLog++) instead of exact ones
  4. Join Optimization
    • Optimize join patterns by ordering tables by size:
      1. Place the largest table first
      2. Follow with the table that has the fewest rows
      3. Arrange the remaining tables in decreasing size order
    • Rationale: the largest table is distributed evenly across workers, while the smaller tables are broadcast to every node
  5. Query Structure
    • Use ORDER BY only in the outermost query; sorting intermediate results wastes compute
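
To illustrate the "reduce data volume before joins" point above, a minimal sketch (table and column names are hypothetical): aggregating the large table first means far fewer rows reach the join.

-- Aggregate the big fact table first, then join the compact result
WITH daily_revenue AS (
  SELECT user_id, SUM(amount) AS revenue
  FROM dataset.transactions
  WHERE DATE(timestamp) = '2024-02-10'  -- prune partitions early
  GROUP BY user_id
)
SELECT u.name, d.revenue
FROM daily_revenue d
JOIN dataset.users u
  ON u.user_id = d.user_id;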

Comprehensive BigQuery Best Practices Guide

1. Cost Reduction Strategies (Detailed)

A. Column Selection Best Practices

-- Bad Practice
SELECT * FROM transactions;

-- Good Practice
SELECT transaction_id, user_id, amount, timestamp
FROM transactions;


Detailed Explanation:

  • BigQuery charges based on the amount of data scanned
  • In columnar storage, each column is stored separately
  • When using SELECT *, BigQuery must:
    • Scan all columns in the table
    • Process and transfer all data
    • Use more compute resources
  • Cost Impact Example:
    • Table with 100 columns, only need 5
    • SELECT * scans 100% of data
    • Specific columns scan only 5% of data
    • Potential 95% cost reduction
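
When most but not all columns are needed, BigQuery's SELECT * EXCEPT keeps the query short while still skipping the expensive columns (the column names here are hypothetical):

-- Scan everything except the wide, rarely-needed columns
SELECT * EXCEPT (raw_payload, debug_metadata)
FROM transactions;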

B. Query Planning and Optimization

Price Checking:

  • Use the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to analyze query costs
  • Review the query validator's estimate in the UI before running
  • Set custom cost controls, e.g. cap scanned bytes per query with the maximum bytes billed setting:
# A query that would scan more than ~1 TB fails without being billed
bq query --maximum_bytes_billed=1000000000000 --use_legacy_sql=false 'SELECT ...'


Partitioning Strategies:

  1. Time-based Partitioning:
CREATE TABLE dataset.table
PARTITION BY DATE(timestamp_column)
AS SELECT * FROM source_table;

  2. Integer Range Partitioning:
CREATE TABLE dataset.table
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 100, 10))
AS SELECT * FROM source_table;


C. Data Operations Optimization

Streaming Inserts:

  • Cost: $0.01 per 200 MB inserted (legacy streaming API)
  • Better alternatives:
    1. Batch load jobs (free of charge; see the sketch below)
    2. The Storage Write API for lower-cost streaming
    3. Scheduled uploads where near-real-time is enough
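
For the batch-loading route, BigQuery's LOAD DATA statement pulls files straight from Cloud Storage as a free load job; a minimal sketch with placeholder bucket and table names:

-- Batch load from Cloud Storage: free, unlike streaming inserts
LOAD DATA INTO dataset.transactions
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/transactions/*.csv']
);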

Query Materialization:

-- Instead of referencing the same CTE repeatedly...
WITH base_cte AS (
  SELECT complex_calculation FROM huge_table
)
SELECT * FROM base_cte
UNION ALL
SELECT * FROM base_cte;  -- each reference may recompute the CTE

-- ...materialize the expensive result once into a temp table
CREATE TEMP TABLE materialized_results AS
SELECT complex_calculation FROM huge_table;

-- Reuse the materialized results as often as needed
SELECT * FROM materialized_results;


2. Query Performance Optimization (Detailed)

A. Data Structure and Filtering

Partitioning and Clustering Example:

-- Create partitioned and clustered table
CREATE TABLE sales_data
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id, product_id
AS SELECT * FROM source_sales;

-- Efficient query using partition filter
SELECT *
FROM sales_data
WHERE DATE(transaction_date) = '2024-02-10'
  AND customer_id = 12345;


Nested and Repeated Columns:

-- Using nested structures
CREATE TABLE users
(
  user_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<
    street STRING,
    city STRING,
    country STRING
  >>
);

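Querying the nested column then uses UNNEST, flattening the repeated addresses array only where it is needed:

-- Flatten the repeated addresses array per user
SELECT u.user_id, u.name, addr.city
FROM users AS u,
     UNNEST(u.addresses) AS addr
WHERE addr.country = 'Germany';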

B. Join Optimization Techniques

Optimal Join Order Example:

-- Good Practice (Largest to smallest)
SELECT *
FROM large_table a -- 1M rows
JOIN medium_table b -- 100K rows
  ON a.id = b.id
JOIN small_table c -- 10K rows
  ON b.id = c.id;

-- Poor Practice (Small to large)
SELECT *
FROM small_table c -- 10K rows
JOIN medium_table b -- 100K rows
  ON c.id = b.id
JOIN large_table a -- 1M rows
  ON b.id = a.id;


C. Advanced Performance Optimization

Using Approximation Functions:

-- Instead of COUNT(DISTINCT)
SELECT APPROX_COUNT_DISTINCT(user_id)
FROM events;

-- Instead of exact percentiles
SELECT APPROX_QUANTILES(value, 100)[OFFSET(90)]
FROM measurements;


Query Pattern Optimization:

-- Bad Practice (Multiple subqueries)
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT * FROM large_table
  )
);

-- Good Practice (Flattened query)
SELECT *
FROM large_table;


D. Monitoring and Maintenance

  1. Regular Performance Monitoring:
SELECT
  creation_time,
  user_email,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY total_bytes_processed DESC;

  2. Table Statistics Monitoring:
SELECT
  table_id,
  row_count,
  size_bytes,
  last_modified_time
FROM `project.dataset.__TABLES__`
ORDER BY size_bytes DESC;


3. Additional Best Practices

  1. Resource Management:
    • Set appropriate job priorities
    • Use reservation slots for predictable workloads
    • Implement cost controls and quotas
  2. Data Quality:
    • Implement column-level constraints
    • Use standardized data types
    • Regular data validation checks
  3. Security:
    • Implement column-level security
    • Use authorized views
    • Regular audit of access patterns
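
As a concrete sketch of the authorized-views item (dataset and column names are placeholders), the view lives in a separate dataset that readers can access and exposes only non-sensitive columns; the view is then authorized on the private source dataset via its access settings:

-- Expose only non-sensitive columns through a view in a shared dataset
CREATE VIEW shared_views.customer_orders AS
SELECT customer_id, order_date, amount  -- no PII columns
FROM private_data.orders;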