Replication lag

RDSReplicationLag #

Meaning #

Alert is triggered when an RDS instance has unacceptable replication lag.

Impact #

  • Applications that rely on up-to-date data may not work properly

Diagnosis #

  1. Check IOPS saturation on the RDS instance

    If IOPS are saturated, replication could have difficulties to be applied

  2. Check if a long-running transactions is executed on the replica instance

    Long-running transactions on a PostgreSQL replica could block the data replication process.

    SQL
    SELECT
        pid,
        leader_pid,
        datname as database,
        usename AS user,
        application_name,
        EXTRACT(EPOCH FROM now() - xact_start) as transaction_duration,
        CASE
            WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start)
            ELSE null
        END query_duration,
        state,
        wait_event_type,
        wait_event,
        CASE
            WHEN state = 'active' THEN query
            ELSE null
        END query,
        xact_start as transaction_start,
        CASE
            WHEN state = 'active' THEN query_start
            ELSE null
        END query_start,
        CASE
            WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start)
            ELSE null
        END last_query_duration,
        CASE
            WHEN state != 'active' THEN query
            ELSE null
        END last_query,
        pg_blocking_pids(pid) as blocked_by,
        client_addr,
        backend_start,
        ceil(EXTRACT(EPOCH FROM now() - backend_start)) as backend_duration
    FROM pg_stat_activity
    WHERE query_start is not null
        AND usename!='rdsrepladmin'
        AND query not like 'START_REPLICATION %'
        AND pid != pg_backend_pid();
    

Mitigation #

  1. Kill long-running queries on replica

  2. Increase IOPS if they are saturated

Additional resources #

n/a