Disk space limit

RDSDiskSpaceLimit #

Meaning #

Alert is triggered when RDS instance is low on storage.

Impact #

The PostgreSQL instance will stop to prevent data corruption if no more disk space is available.

Diagnosis #

Determine whether it’s a long-term growth trend requiring storage increase or abnormal disk usage reflecting another problem:

  1. Check database size growth over the last 24 hours to identify abnormal disk usage growth

    Look at Storage usage panel in RDS instance details Grafana dashboard

  2. Check if there are long-running transactions (or queries)

    Look at Postgresql live activity Grafana dashboard to find long-running query

    Why?

    A long running transaction (or query) forces PostgreSQL to maintain visibility of all rows since the beginning of the transaction. So PostgreSQL accumulate WAL files until the transaction is close.

    PostgreSQL
    SELECT
        pid,
        leader_pid,
        datname as database,
        usename AS user,
        application_name,
        EXTRACT(EPOCH FROM now() - xact_start) as transaction_duration,
        CASE
            WHEN state = 'active' THEN EXTRACT(EPOCH FROM now() - query_start)
            ELSE null
        END query_duration,
        state,
        wait_event_type,
        wait_event,
        CASE
            WHEN state = 'active' THEN query
            ELSE null
        END query,
        xact_start as transaction_start,
        CASE
            WHEN state = 'active' THEN query_start
            ELSE null
        END query_start,
        CASE
            WHEN state != 'active' THEN EXTRACT(EPOCH FROM state_change - query_start)
            ELSE null
        END last_query_duration,
        CASE
            WHEN state != 'active' THEN query
            ELSE null
        END last_query,
        pg_blocking_pids(pid) as blocked_by,
        client_addr,
        backend_start,
        ceil(EXTRACT(EPOCH FROM now() - backend_start)) as backend_duration
    FROM pg_stat_activity
    WHERE query_start is not null
        AND usename!='rdsrepladmin'
        AND query not like 'START_REPLICATION %'
        AND pid != pg_backend_pid();
    
  3. Check the status of PostgreSQL replication slots

    Why?PostgreSQL keeps WAL files on its disk until the replication slot client acknowledges they consumed it.
  4. Check log file sizes

    Why?PostgreSQL logs could consume large disk space.

    Usually related to:

    • Connection/disconnections
    • Slow queries
    • PGaudit logs
    • Internal errors
    • Temporary files

    PostgreSQL parameters:

Mitigation #

You must avoid reaching no disk space left situation.

  • Fix the system that blocks PostgreSQL to recycle its WAL files

    • If long-running transactions/queries: Cancel or kill the transactions
    • If non-running replication slot: Delete replication slot
  • Increase RDS disk space

    AWS RDS limitations

    • Minimal disk increase is 10%

    • You can’t make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer.

    • Storage optimization can take several hours

    • Storage can’t be reclaimed

    See more on AWS documentation

    1. Set AWS_PROFILE

      export AWS_PROFILE=<AWS account>
      
    2. Determine the minimum storage for the increase đź’ˇ RDS requires a minimal storage increase of 10%

      INSTANCE_IDENTIFIER=<replace with the RDS instance identifier>
      
      aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \
      | jq -r '{"Current IOPS": .DBInstances[0].Iops, "Current Storage Limit": .DBInstances[0].AllocatedStorage, "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}'
      
    3. Increase storage:

      NEW_ALLOCATED_STORAGE=<replace with new allocated storage in GB>
      
      aws rds modify-db-instance --db-instance-identifier ${INSTANCE_IDENTIFIER} --allocated-storage ${NEW_ALLOCATED_STORAGE} --apply-immediately \
      | jq .DBInstance.PendingModifiedValues
      

      âť— If the RDS instance has replicas instances (replica or reporting), you must repeat the operation for all replicas to keep the same configuration between instances

    4. Backport changes in Terraform

Additional resources #