Disk space limit

RDSDiskSpaceLimit #

Meaning #

Alert is triggered when RDS instance is low on storage.

Impact #

The PostgreSQL instance will stop to prevent data corruption if no more disk space is available.

Diagnosis #

Determine whether it’s a long-term growth trend requiring storage increase or abnormal disk usage reflecting another problem:

  1. Check database size growth over the last 24 hours to identify abnormal disk usage growth

    Look at Storage usage panel in RDS instance details Grafana dashboard

  2. Check if there are long-running transactions (or queries)

    Look at Postgresql live activity Grafana dashboard to find long-running query

    Why?

    A long running transaction (or query) forces PostgreSQL to maintain visibility of all rows since the beginning of the transaction. So PostgreSQL accumulate WAL files until the transaction is close.

    PostgreSQL
    SELECT
        pid,
        leader_pid,
        datname as database,
        usename AS user,
        application_name,
        EXTRACT(EPOCH FROM now() - xact_start) as transaction_duration,
        CASE
            WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start)
            ELSE null
        END query_duration,
        state,
        wait_event_type,
        wait_event,
        CASE
            WHEN state = 'active' THEN query
            ELSE null
        END query,
        xact_start as transaction_start,
        CASE
            WHEN state = 'active' THEN query_start
            ELSE null
        END query_start,
        CASE
            WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start)
            ELSE null
        END last_query_duration,
        CASE
            WHEN state != 'active' THEN query
            ELSE null
        END last_query,
        pg_blocking_pids(pid) as blocked_by,
        client_addr,
        backend_start,
        ceil(EXTRACT(EPOCH FROM now() - backend_start)) as backend_duration
    FROM pg_stat_activity
    WHERE query_start is not null
        AND usename!='rdsrepladmin'
        AND query not like 'START_REPLICATION %'
        AND pid != pg_backend_pid();
    
  3. Check the status of PostgreSQL replication slots

    Why?PostgreSQL keeps WAL files on its disk until the replication slot client acknowledges they consumed it.
  4. Check log file sizes

    Why?PostgreSQL logs could consume large disk space.

    Usually related to:

    • Connection/disconnections
    • Slow queries
    • PGaudit logs
    • Internal errors
    • Temporary files

    PostgreSQL parameters:

Mitigation #

You must avoid reaching no disk space left situation.

  • Fix the system that blocks PostgreSQL to recycle its WAL files

    • If long-running transactions/queries: Cancel or kill the transactions
    • If non-running replication slot: Delete replication slot
  • Increase RDS disk space

    AWS RDS limitations

    • Minimal disk increase is 10%

    • You can’t make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer.

    • Storage optimization can take several hours

    • Storage can’t be reclaimed

    See more on AWS documentation

    1. Determine the minimum storage for the increase

      INSTANCE_IDENTIFIER=<replace with the RDS instance identifier>
      
      aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \
      | jq -r '{"Current IOPS": .DBInstances[0].Iops, "Current Storage Limit": .DBInstances[0].AllocatedStorage, "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}'
      
    2. Increase storage:

      NEW_ALLOCATED_STORAGE=<replace with new allocated storage in GB>
      
      aws rds modify-db-instance --db-instance-identifier ${RDS_INSTANCE} --allocated-storage ${INSTANCE_IDENTIFIER} --apply-immediately \
      | jq .DBInstance.PendingModifiedValues
      

Additional resources #