Postgres pg_dumpall with compression

Posted on Tuesday, May 15, 2012



To do this on an Ubuntu machine is relatively easy.
Just run the following commands.

First log in as the postgres user


       > sudo su postgres


Then run pg_dumpall and pipe it to a compression program, in my case I use bzip2



       >    pg_dumpall | bzip2 -vf > database.`date +"%Y.%m.%d"`.pgdumpall.bz2




I like to time mine so I add the time command


       >    time pg_dumpall | bzip2 -vf > database.`date +"%Y.%m.%d"`.pgdumpall.bz2





Replication Database


However, this does not work to well on my current database.  I am trying to take a dumpall from a Replication database that has yet to catch up on all its WAL files.  If I run the dumpall command I get the following error.


pg_dump: Error message from server: ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.



If I run the following command as the root user:


       >  ps -AF | grep post



Here you can see its still recovering a WAL file from the Master Database.  If I compare it to the WAL files in the Master database it is about 30-35 behind.   It seems to linger here not losing too much ground or gaining.   I would think it would constantly get better or get worse.  But for some reason it just remains in this range… so far.





postgresql.conf


Open up postgresql.conf for editing.



       >  vi /etc/postgresql/9.1/main/postgresql.conf



I updated these two settings that relate to query timeout in a replication database.

@line 212

max_standby_archive_delay = 600s


max_standby_streaming_delay = 600s


I upped them to 600 seconds to see if that is a high enough timetout to get a dumpall to work.

After updating that you need to restart postgres



       >  /etc/init.d/postgresql restart



This got a little further but still was not enough to finish my pg_dumpall.
So I upped it to 1200s and restarted the postgresql, that was not enough so I upped it to 2400s.

It looks like when you do this and the replication server has not caught up it puts it in a waiting mode


Which is causing it to fall further behind.

So I guess taking a pg_dumpall of a replication server that has not caught up yet is generally a bad idea?

In the end I upped the timeout to 8 hrs   (28800s).   That gave me enough time to create a backup and also after the backup was complete, even though it had fallen further behind it caught back up to 30-35 WAL files behind the live database.

This is another issue I need to figure out.





References

2 comments:

  1. I have the same error, any followup on this issue?

    ReplyDelete
    Replies
    1. Well there is 2 issues going on here. First my read replicate is always 30-35 WAL files behind. I admit its on a 100 Mbit network, later this year we may up it to 1,000 and that may make that problem go away (the being 30 or so behind)

      The second issue seems simple. When you take a dump from a postgresql replication database you need to set the timeouts high enough to give your dump enough time to complete.

      When the dump is happening the WAL files will back up and you will go for 30-35 to maybe 100 or more (depending on your set up. Then after the dump it will catch right up again. Of course you need to have your master postgres database set up to save enough WAL files to deal with the backup.

      For example on my postgresl.conf file I set @line 201 or so

      size = 128 #in logfile segments, 16MB each.

      This says to save 128 WAL files (each 16MiB each) on the master system. In my case this is more than enough to deal with the lag.

      Delete