pg流复制_切换


  • 主备判断方法

  • 看WAL进程是sender还是receiver
--主
[postgres@pg93 ~]$ ps -fu postgres|egrep 'walsender|walreceiver'|grep -v egrep
postgres   6065   6054  0 16:35 ?        00:00:00 postgres: walsender repuser 192.168.150.132(50722) streaming 2/10001C0
--备
[postgres@pg93s ~]$ ps -fu postgres|egrep 'walsender|walreceiver'|grep -v egrep
postgres   7628   6966  0 00:44 ?        00:00:03 postgres: walreceiver   streaming 2/10001C0
  • 查pg_stat_replication和pg_stat_wal_receiver视图
--主:pg_stat_replication有返回
postgres@postgres:select pid,state,sync_state,usename,client_addr,application_name from pg_stat_replication;
 pid  |   state   | sync_state | usename |   client_addr   | application_name 
------+-----------+------------+---------+-----------------+------------------
 6065 | streaming | sync       | repuser | 192.168.150.132 | pg93std
(1 row)

--备:pg_stat_wal_receiver有返回
postgres=# select pid,status,sender_host,sender_port,last_msg_send_time from pg_stat_wal_receiver;
 pid  |  status   |   sender_host   | sender_port |      last_msg_send_time       
------+-----------+-----------------+-------------+-------------------------------
 7628 | streaming | 192.168.150.130 |        6000 | 2020-12-20 17:31:53.520146+08
(1 row)
  • 查看是否处于恢复状态,是为备,否为主
    select pg_is_in_recovery(0;

  • 查看控制信息

--主
[postgres@pg93 ~]$ pg_controldata -D /pgdata |grep cluster
Database cluster state:               in production
--备
[postgres@pg93s ~]$ pg_controldata -D /pgdata |grep cluster
Database cluster state:               in archive recovery
  • 比对参数
    12以前版本查看PGDATA下有没recovery.conf,有就是备
    12和以上版本查看PGDATA下有没standby.signal,有就是备
     
  • 主备切换(文件法)

--备添加参数,并reload或重启生效
promote_trigger_file = '/pgdata/.pg_prompt_6000'
postgres=# show promote_trigger_file;
  promote_trigger_file   
-------------------------
 /pgdata/.pg_prompt_6000
(1 row)

--停主库
[postgres@pg93 pgdata]$ pg_ctl stop -D /pgdata
waiting for server to shut down.... done
server stopped

--备库连接主失败,创建trigger文件后promote为主
[postgres@pg93s pgdata]$ touch /pgdata/.pg_prompt_6000
--查看备日志:
2020-10-07 02:18:47.670 CST [7908] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.150.130" and accepting
                TCP/IP connections on port 6000?
2020-10-07 02:18:52.676 CST [7909] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.150.130" and accepting
                TCP/IP connections on port 6000?
2020-10-07 02:18:57.680 CST [6968] LOG:  promote trigger file found: /pgdata/.pg_prompt_6000
2020-10-07 02:18:57.682 CST [6968] LOG:  redo done at 2/2000028
2020-10-07 02:18:57.682 CST [6968] LOG:  last completed transaction was at log time 2020-12-20 13:34:46.657621+08
2020-10-07 02:18:57.722 CST [6968] LOG:  selected new timeline ID: 2
2020-10-07 02:18:57.912 CST [6968] LOG:  archive recovery complete
2020-10-07 02:18:57.973 CST [6966] LOG:  database system is ready to accept connections
--可见备先是连接不到主库,在创建了trigger文件后自动检测到trigger文件,结束了recovery,变成主模式
--检查还发现trigger文件和standby.signal都被删除了
[postgres@pg93s pgdata]$ ls -l /pgdata/.pg_prompt_6000
ls: cannot access /pgdata/.pg_prompt_6000: No such file or directory
[postgres@pg93s ~]$ ls -l /pgdata/standby.signal
ls: cannot access /pgdata/standby.signal: No such file or directory

注意:

备库在启动状态时,只要检测到存在ptomote_trigger_file指定的文件,就会立即中断与主的同步(如果在同步的话)并将自身提升为主库,而此时主库并不会自动切换为备库(此时两个都是主库),所以如果是要做主备切换的话,需要先手动干净关闭主库,再prompt备库。

  • 主备切换(pg_ctl prompt)

--停主
[postgres@pg93s ~]$ pg_controldata |grep cluster
Database cluster state:               in production
[postgres@pg93s ~]$ pg_ctl stop -D /pgdata
waiting for server to shut down.... done
server stopped
--提升备
[postgres@pg93 pgdata]$ pg_ctl promote -D /pgdata
waiting for server to promote.... done
server promoted
[postgres@pg93 pgdata]$ pg_con
pg_config       pg_controldata  
[postgres@pg93 pgdata]$ pg_controldata |grep cluster
Database cluster state:               in production
  • pg_rewind修复双主:

在主机器损坏等严重故障中,没有条件将主干净关闭,或者提升备时忘记了先关闭主了,后续将主设置为备时会报错,此种情况下通常需要重新搭建原主为备,但也可以通过使用pg_rewind来实现从新主同步过来达到修复的目的(原主未传到备的部分事务将丢失):

--主备同步状态中,将备直接提升为主,此时两边都是主
[postgres@pg93s pgdata]$ pg_ctl promote -D /pgdata/
waiting for server to promote.... done
server promoted
[postgres@pg93s pgdata]$ pg_controldata |grep cluster
Database cluster state:               in production
[postgres@pg93 ~]$ pg_controldata |grep clus
Database cluster state:               in production
--再将原主启动到备模式后,不能正常同步,报错:
2020-12-21 00:46:00.053 CST [8553] LOG:  fetching timeline history file for timeline 4 from primary server
2020-12-21 00:46:00.075 CST [8553] FATAL:  could not start WAL streaming: ERROR:  requested starting point 2/D000000 on timeline 3 is not in this server's history
        DETAIL:  This server's history forked from timeline 3 at 2/C0001F8.
--rewind:停掉库,执行rewind
[postgres@pg93 ~]$ pg_rewind --target-pgdata /pgdata --source-server='host=192.168.150.132 port=6000 user=postgres dbname=postgres' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 2/C0001F8 on timeline 3
pg_rewind: rewinding from last common checkpoint at 2/C000148 on timeline 3
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 104 MB (total source directory size is 3749 MB)
107316/107316 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

--重新启动到备模式,可正常同步