RDSReplicationLag #
Meaning #
Alert is triggered when an RDS instance has unacceptable replication lag.
Impact #
- Applications that rely on up-to-date data may not work properly
Diagnosis #
Check IOPS saturation on the RDS instance
If IOPS are saturated, replication could have difficulties to be applied
Check if a long-running transactions is executed on the replica instance
Long-running transactions on a PostgreSQL replica could block the data replication process.
SQL
SELECT pid, leader_pid, datname as database, usename AS user, application_name, EXTRACT(EPOCH FROM now() - xact_start) as transaction_duration, CASE WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start) ELSE null END query_duration, state, wait_event_type, wait_event, CASE WHEN state = 'active' THEN query ELSE null END query, xact_start as transaction_start, CASE WHEN state = 'active' THEN query_start ELSE null END query_start, CASE WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start) ELSE null END last_query_duration, CASE WHEN state != 'active' THEN query ELSE null END last_query, pg_blocking_pids(pid) as blocked_by, client_addr, backend_start, ceil(EXTRACT(EPOCH FROM now() - backend_start)) as backend_duration FROM pg_stat_activity WHERE query_start is not null AND usename!='rdsrepladmin' AND query not like 'START_REPLICATION %' AND pid != pg_backend_pid();
Mitigation #
Kill long-running queries on replica
Increase IOPS if they are saturated
Additional resources #
n/a