RDSDiskSpaceLimit #
Meaning #
This alert is triggered when an RDS instance is running low on storage.
Impact #
If no more disk space is available, the PostgreSQL instance will stop to prevent data corruption.
Diagnosis #
Determine whether this is a long-term growth trend requiring a storage increase, or abnormal disk usage reflecting another problem:
- Check database size growth over the last 24 hours to identify abnormal disk usage growth
  - Look at the Storage usage panel in the RDS instance details Grafana dashboard
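If needed, current per-database sizes can also be checked directly in PostgreSQL, for example with a query like the following (it shows current sizes only, not growth over time):

-- Current size of each database, largest first
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;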
- Check if there are long-running transactions (or queries)
  - Look at the Postgresql live activity Grafana dashboard to find long-running queries
  - Why? A long-running transaction (or query) forces PostgreSQL to maintain visibility of all rows since the beginning of the transaction, so PostgreSQL accumulates WAL files until the transaction is closed.
In PostgreSQL:

SELECT pid,
       leader_pid,
       datname AS database,
       usename AS user,
       application_name,
       EXTRACT(EPOCH FROM now() - xact_start) AS transaction_duration,
       CASE WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start) ELSE null END query_duration,
       state,
       wait_event_type,
       wait_event,
       CASE WHEN state = 'active' THEN query ELSE null END query,
       xact_start AS transaction_start,
       CASE WHEN state = 'active' THEN query_start ELSE null END query_start,
       CASE WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start) ELSE null END last_query_duration,
       CASE WHEN state != 'active' THEN query ELSE null END last_query,
       pg_blocking_pids(pid) AS blocked_by,
       client_addr,
       backend_start,
       ceil(EXTRACT(EPOCH FROM now() - backend_start)) AS backend_duration
FROM pg_stat_activity
WHERE query_start IS NOT NULL
  AND usename != 'rdsrepladmin'
  AND query NOT LIKE 'START_REPLICATION %'
  AND pid != pg_backend_pid();
- Check the status of PostgreSQL replication slots
  - Why? PostgreSQL keeps WAL files on its disk until the replication slot client acknowledges it has consumed them.
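For example, the pg_replication_slots view shows each slot and how much WAL it is still retaining; an inactive slot with a large amount of retained WAL is a likely culprit. The wal_status column assumes PostgreSQL 13 or later:

-- Replication slots and the WAL they retain
SELECT slot_name,
       slot_type,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;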
- Check log file sizes
  - Why? PostgreSQL logs can consume a large amount of disk space, usually related to:
    - Connections/disconnections
    - Slow queries
    - PGaudit logs
    - Internal errors
    - Temporary files
  - PostgreSQL parameters: log_temp_files can be set to log temporary file creation above a size threshold, and temp_file_limit can be set to avoid over-usage: https://www.postgresql.org/docs/16/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-DISK
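As a sketch, log file sizes and the current values of the two parameters above can also be checked from PostgreSQL itself, assuming the role is allowed to call pg_ls_logdir() (superuser or pg_monitor membership):

-- Server log files and their sizes, largest first
SELECT name,
       pg_size_pretty(size) AS size,
       modification
FROM pg_ls_logdir()
ORDER BY size DESC;

-- Current temporary-file logging and limit settings
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('log_temp_files', 'temp_file_limit');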
Mitigation #
You must avoid reaching a situation where no disk space is left.
Fix whatever is blocking PostgreSQL from recycling its WAL files (examples below):
- If long-running transactions/queries: cancel or kill the transactions
- If a non-running (inactive) replication slot: delete the replication slot
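For example, from PostgreSQL, replacing <pid> and <slot_name> with the values found during diagnosis:

-- Cancel the current query of a backend (gentler), or terminate the whole
-- backend if cancelling is not enough
SELECT pg_cancel_backend(<pid>);
SELECT pg_terminate_backend(<pid>);

-- Drop an inactive replication slot so its retained WAL can be recycled
SELECT pg_drop_replication_slot('<slot_name>');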
Increase RDS disk space
AWS RDS limitations:
- The minimum storage increase is 10%
- You can't make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer
- Storage optimization can take several hours
- Storage can't be reclaimed (allocated storage can't be decreased)
- See the AWS documentation for more details
Set AWS_PROFILE
export AWS_PROFILE=<AWS account>
Determine the minimum storage for the increase. 💡 RDS requires a minimum storage increase of 10%.
INSTANCE_IDENTIFIER=<replace with the RDS instance identifier>

aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \
  | jq -r '{"Current IOPS": .DBInstances[0].Iops,
            "Current Storage Limit": .DBInstances[0].AllocatedStorage,
            "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}'
Increase storage:
NEW_ALLOCATED_STORAGE=<replace with new allocated storage in GB>

aws rds modify-db-instance --db-instance-identifier ${INSTANCE_IDENTIFIER} \
  --allocated-storage ${NEW_ALLOCATED_STORAGE} \
  --apply-immediately \
  | jq .DBInstance.PendingModifiedValues
❗ If the RDS instance has replica instances (replica or reporting), you must repeat the operation on all replicas to keep the same configuration between instances.
Backport changes in Terraform