Setting up Postgres with replication using Kubernetes

Wed, Jan 9, 2019

Full production example: https://github.com/arianitu/postgres-statefulset

We had been using Helm charts for Postgres, but the maintainers recently decided to break backwards compatibility, so I decided to create my own manifests for a Postgres cluster.

I’ve also run into issues with Helm charts where adding extra functionality was a bit of a pain. Besides, writing Kubernetes manifests is quite enjoyable and pretty easy, and it makes you more aware of your cluster setup!

The requirements are a Master server and one or more Replica servers kept in sync via streaming replication.

I’ve decided to create two StatefulSets, one for the Master server and one for the Replica servers. You may be able to build something out of a single StatefulSet where the Master and Replicas automatically discover their roles, but that seemed more complicated.

Having two StatefulSets lets you spin up N master servers and N replica servers.
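Each StatefulSet below references a serviceName, so a matching headless Service is assumed to exist for each of them (the full repo linked above includes these; the file name here is my choice). The Replica’s pg_basebackup later reaches the Master through the postgres Service:

services.yml:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  # Headless service: gives the Master pods stable DNS names
  clusterIP: None
  selector:
    app: postgres
  ports:
    - name: postgres
      port: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-replica
spec:
  clusterIP: None
  selector:
    app: postgres-replica
  ports:
    - name: postgres
      port: 5432

Apply it with kubectl apply -f services.yml.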

Configuration

Our configuration is based on the official documentation on setting up streaming replication on Postgres. You can find it here: https://wiki.postgresql.org/wiki/Streaming_Replication

The steps are:

    1. Create a replica user that is used by the Replica server to get data from the Master
    2. Configure the Master server
    3. Configure the Replica server
    4. Allow access to the Master server from the Replica (via pg_hba.conf)

Configuration files will be stored in a Kubernetes ConfigMap, while any database credentials will be stored in a Secret.

Create Replica User

So starting with the replica user: the official Postgres image runs any SQL files and shell scripts it finds under /docker-entrypoint-initdb.d on first initialization, and I decided to use that feature.

ConfigMap (create-replica-user.sh)

#!/bin/bash
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
	CREATE ROLE replication WITH REPLICATION PASSWORD '$REPLICATION_PASSWORD' LOGIN;
EOSQL

This file is mounted to /docker-entrypoint-initdb.d/create-replica-user.sh in our Master StatefulSet.
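To confirm the script ran on first initialization, you can check that the role exists once the Master pod (postgres-0, the first pod in the StatefulSet below) is up:

kubectl exec -it postgres-0 -- psql -U postgres -c '\du replication'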

Configuring postgres.conf

Both the Master StatefulSet and the Replica StatefulSet will be using a shared postgres.conf. You should copy over your postgres.conf (an example can be found here) and add this to the end of it. Note that relative include paths resolve against the directory containing postgres.conf, so master.conf and replica.conf must sit alongside it (we mount everything under /etc/):

ConfigMap (postgres.conf)

include_if_exists = 'master.conf'
include_if_exists = 'replica.conf'

Configure the Master server

ConfigMap (master.conf)

# To enable read-only queries on a standby server, wal_level must be set to
# "hot_standby" (since PostgreSQL 9.6 this is an alias for "replica"). You
# can choose "archive" if you never connect to the server in standby mode.
wal_level = hot_standby

# Set the maximum number of concurrent connections from the standby servers.
max_wal_senders = 5

# To prevent the primary server from removing the WAL segments required for
# the standby server before shipping them, set the minimum number of segments
# retained in the pg_wal directory (pg_xlog before PostgreSQL 10). At least
# wal_keep_segments should be larger than the number of segments generated
# between the beginning of online-backup and the startup of streaming
# replication. If you enable WAL archiving to an archive directory accessible
# from the standby, this may not be necessary.
wal_keep_segments = 32

# Enable WAL archiving on the primary to an archive directory accessible from
# the standby. If wal_keep_segments is a high enough number to retain the WAL
# segments required for the standby server, this is not necessary.
#archive_mode    = on
#archive_command = 'cp %p /path_to/archive/%f'
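Once the Master is deployed (below), you can verify these settings took effect. Note that since hot_standby is just an alias, SHOW reports replica:

kubectl exec -it postgres-0 -- psql -U postgres -c 'SHOW wal_level;'
kubectl exec -it postgres-0 -- psql -U postgres -c 'SHOW max_wal_senders;'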

Configure the Replica server

Configuration of the Replica server is very straightforward. We need to set hot_standby = on and we’re done.

ConfigMap (replica.conf)

hot_standby = on

Configure pg_hba.conf to allow Replica to connect to Master

ConfigMap (pg_hba.conf)

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust

host    replication     replication     all                     md5
host    all             all             all                     md5

This line allows the replication user to connect and start replicating:

host    replication     replication     all                     md5
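One way to smoke-test this rule before the Replica exists is to open a replication connection by hand from the Master pod. POD_IP and REPLICATION_PASSWORD are both exposed as environment variables in the Master StatefulSet below; connecting via the pod IP (rather than localhost) is what forces this md5 line to match, and IDENTIFY_SYSTEM is a replication-protocol command that psql forwards as-is:

kubectl exec -it postgres-0 -- sh -c 'PGPASSWORD="$REPLICATION_PASSWORD" psql "host=$POD_IP user=replication replication=true" -c "IDENTIFY_SYSTEM"'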

Creating all the above files in a ConfigMap

It’s up to you how you create your ConfigMap, but I usually put all the files in a config folder and create a small script that creates a ConfigMap based on the files in that folder.

The script looks like this:

create_configmap.sh:

kubectl create configmap postgres --from-file=postgres.conf --from-file=master.conf --from-file=replica.conf --from-file=pg_hba.conf --from-file=create-replica-user.sh

I then run the script ./create_configmap.sh and we have all the configuration ready to go!
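One caveat: kubectl create configmap fails if the ConfigMap already exists. If you expect to iterate on the config files, a handy variant renders the manifest and applies it instead, which also updates an existing ConfigMap (use --dry-run=client on newer kubectl):

kubectl create configmap postgres \
  --from-file=postgres.conf \
  --from-file=master.conf \
  --from-file=replica.conf \
  --from-file=pg_hba.conf \
  --from-file=create-replica-user.sh \
  --dry-run -o yaml | kubectl apply -f -

Keep in mind Postgres only reads these files at startup (or on reload), so updating the ConfigMap alone won’t change a running server.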

Configuring database user credentials

In Kubernetes, sensitive information is put in a Secret. This is where you put the Postgres user’s password as well as the replication user’s password. The config looks like this:

secret.yml:

apiVersion: v1
kind: Secret
metadata:
  name: postgres
type: Opaque
stringData:
  password: master-password
  replicaPassword: replica-password

Apply it:

kubectl apply -f secret.yml
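stringData lets you write the values in plain text; Kubernetes stores them base64-encoded under data. To read one back later:

kubectl get secret postgres -o jsonpath='{.data.password}' | base64 --decode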

Master StatefulSet

We now need to create the Master StatefulSet. It’s pretty standard: we use the postgres:10.5 image and volume mount all the configuration we created above.

You can also configure how much storage the database gets and how many cluster resources it may consume — look for volumeClaimTemplates and resources respectively. The manifest looks long and complicated, but it’s actually very simple.

The livenessProbe checks that the database is running properly and restarts the container when it isn’t; the readinessProbe determines when the pod is ready to receive connections.

That’s all there is to it really!

statefulset-master.yml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  updateStrategy:
    type: RollingUpdate
  
  selector:
    matchLabels:
      app: postgres

  serviceName: postgres
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres 
    spec:
      volumes:
        - name: postgres-config
          configMap:
            name: postgres
            
      terminationGracePeriodSeconds: 10

      containers:
        - name: postgres
          image: postgres:10.5
          args: ['-c', 'config_file=/etc/postgres.conf', '-c', 'hba_file=/etc/pg_hba.conf']
          
          imagePullPolicy: IfNotPresent
        
          ports:
            - name: postgres
              containerPort: 5432
              protocol: TCP
          
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
          
          env:
            - name: POSTGRES_USER
              value: postgres
          
            - name: PGUSER
              value: postgres
          
            - name: POSTGRES_DB
              value: postgres
            
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: postgres
                  
            - name: REPLICATION_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: replicaPassword
                  name: postgres
              
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
              
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - exec pg_isready --host $POD_IP
            failureThreshold: 6
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5

          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - exec pg_isready --host $POD_IP
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 3

          volumeMounts:
            - mountPath: /var/lib/postgresql/data/pgdata
              name: postgres
              subPath: postgres-db
              
            - name: postgres-config
              mountPath: /etc/postgres.conf
              subPath: postgres.conf
              
            - name: postgres-config
              mountPath: /etc/master.conf
              subPath: master.conf
              
            - name: postgres-config
              mountPath: /etc/pg_hba.conf
              subPath: pg_hba.conf
              
            - name: postgres-config
              mountPath: /docker-entrypoint-initdb.d/create-replica-user.sh
              subPath: create-replica-user.sh
          
  volumeClaimTemplates:
  - metadata:
      name: postgres
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "standard"
      resources:
        requests:
          storage: 1Gi

Apply the StatefulSet to the cluster:

kubectl apply -f statefulset-master.yml

Wait for the Master server to be ready before deploying the Replica.
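Two ways to do that (postgres-0 is the deterministic name of the first pod in the StatefulSet; kubectl wait requires a reasonably recent kubectl):

kubectl rollout status statefulset/postgres
kubectl wait --for=condition=Ready pod/postgres-0 --timeout=300s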

Replica StatefulSet

This file is very similar to the one above, with some key differences: we volume mount replica.conf instead of master.conf, and we don’t mount pg_hba.conf or create-replica-user.sh.

We also have an initContainer that clones the Master’s data using pg_basebackup.
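The -R flag makes pg_basebackup write a recovery.conf into the data directory, which is what makes Postgres 10 start as a standby. It comes out roughly like this (abbreviated; note the replication password lands in plain text on the volume):

standby_mode = 'on'
primary_conninfo = 'user=replication password=... host=postgres port=5432'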

statefulset-replica.yml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
spec:
  updateStrategy:
    type: RollingUpdate
  
  selector:
    matchLabels:
      app: postgres-replica

  serviceName: postgres-replica
  replicas: 1
  template:
    metadata:
      labels:
        app: postgres-replica 
    spec:
      volumes:
        - name: postgres-config
          configMap:
            name: postgres
            
      terminationGracePeriodSeconds: 10
      
      initContainers:
        - name: setup-replica-data-directory
          image: postgres:10.5
          
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  key: replicaPassword
                  name: postgres

          command:
          - sh
          - -c
          - |
            if [ -z "$(ls -A /var/lib/postgresql/data/pgdata)" ]; then
                echo "Running pg_basebackup to catch up replication server...";
                pg_basebackup -R -h postgres -D /var/lib/postgresql/data/pgdata -P -U replication;
                # PGDATA is not set in this initContainer's environment, so use the explicit path
                chown -R postgres:postgres /var/lib/postgresql/data/pgdata;
            else
                echo "Skipping pg_basebackup because directory is not empty";
            fi

          volumeMounts:
            - mountPath: /var/lib/postgresql/data/pgdata
              name: postgres-replica
              subPath: postgres-db

      containers:
        - name: postgres-replica
          image: postgres:10.5
          args: ['-c', 'config_file=/etc/postgres.conf']
          
          imagePullPolicy: IfNotPresent
        
          ports:
            - name: postgres-rep
              containerPort: 5432
              protocol: TCP
          
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
          
          env:
            - name: POSTGRES_USER
              value: postgres
          
            - name: PGUSER
              value: postgres
          
            - name: POSTGRES_DB
              value: postgres
            
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: postgres
              
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
              
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - exec pg_isready --host $POD_IP
            failureThreshold: 6
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5

          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - exec pg_isready --host $POD_IP
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 3

          volumeMounts:
            - mountPath: /var/lib/postgresql/data/pgdata
              name: postgres-replica
              subPath: postgres-db
            
            - name: postgres-config
              mountPath: /etc/postgres.conf
              subPath: postgres.conf

            - name: postgres-config
              mountPath: /etc/replica.conf
              subPath: replica.conf
            
      
          
  volumeClaimTemplates:
  - metadata:
      name: postgres-replica
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "ssd"
      resources:
        requests:
          storage: 1Gi

Apply the StatefulSet to the cluster:

kubectl apply -f statefulset-replica.yml

Getting your data in

Getting your data into the cluster is quite easy using kubectl cp. What you need to do is:

    1. kubectl cp dump.sql postgres-0:dump.sql
    2. psql < dump.sql (run from inside the pod; spelled out below)
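Spelled out end to end — copying to /tmp and using the default postgres database are both my choices here:

kubectl cp dump.sql postgres-0:/tmp/dump.sql
kubectl exec -it postgres-0 -- psql -U postgres -d postgres -f /tmp/dump.sql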

Now you can connect to the replica and verify it’s replicating all your data (a more direct check follows the list):

    1. kubectl exec -it postgres-replica-0 -- /bin/bash
    2. psql
    3. \dt
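For the direct check, ask each side about replication state: the Master should show one row in pg_stat_replication per connected standby, and the Replica should report that it is in recovery:

kubectl exec -it postgres-0 -- psql -U postgres -c 'SELECT client_addr, state, sync_state FROM pg_stat_replication;'
kubectl exec -it postgres-replica-0 -- psql -U postgres -c 'SELECT pg_is_in_recovery();'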

Notes

If you’re importing a database from pg_dumpall, make sure the replication user exists and that the replication user’s password matches the one set in the secret.yml file above.

This is because the Replica server uses that user to clone the data, so the credentials must match.
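If the import dropped the role or changed its password, you can bring it back in line with the Secret on the Master. A sketch — 'replica-password' is the value from secret.yml above; use CREATE ROLE ... WITH REPLICATION LOGIN instead if the role is missing entirely:

kubectl exec -it postgres-0 -- psql -U postgres -c "ALTER ROLE replication WITH PASSWORD 'replica-password';"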