Is it actually possible to extract SQL commands from WAL generated in “replica”
wal_level
?
The answer is usually no, the “logical” wal_level
exists for a reason after
all, and you shouldn’t expect some kind of miracle here.
But in this series of articles you will see that if some conditions are met you can still manage to extract some information, and how to do it. This first article focuses on the WAL records and how to extract the ones you want, while the next one will show how to try to extract the information contained in those records.
Some context
This article is based of some work I did a few months ago to help a customer recover some data after an incident. It’s not a perfect solution and mostly a set of quick hacks I did to come up with something able to retrieve data in a few hours of work only, but I hope sharing details about it and some methodology can be helpful if you ever get in a similar situation. You will probably need to adapt it to your needs, with yet other hacks, but it should give you a good start. It can otherwise be of some interest if you want to know a bit more about the WAL records internals and some associated infrastructure.
The incident
Due to a series of unfortunate events, one of their HA clusters ended in a
split-brain situation for a some time before being reinitialised, which
entirely removed one of the data directory. After that, only the WALs that
were were generated on that instance were available, those being in “replica”
wal_level
, and nothing else.
One possibility to try recover the data would be to restore a physical backup, if any, replay archived WALs until the last transaction before the removed node is promoted (assuming those are still available) and then replay the WALs generated on that newly promoted node. Once there you still need to look at each row of each table of each database and compare it to yet another instance restore from the same backup to approximately the same time as this one. That’s clearly not ideal as it will likely require many days or even weeks of tedious hard work to do so, and will consume a lot of resources along the way. Is there a way to do better?
After a quick discussion, it turned out that there were a few elements that made some recovery from the WALs themselves possible (more on why later):
- One of the data directories was still available
- The customer guaranteed that no DDL happened since the incident
- Only INSERTs happened during the split-brain
WALs & Physical replication
As you probably know, postgres physical replication works by sending an exact copy of the modified binary raw data to the various standby servers, in a continuous stream of WAL records. As a consequence, those records don’t really know much about the database objects they reference, and nothing about the SQL queries that generated them. So what do they really contain? Let’s see what’s inside the WAL records generated for an INSERT into a normal heap relation.
WAL records
First of all, you have to know that the WAL records are split into Resource
Managers (declared in
src/include/access/rmgrlist.h),
each being responsible for a specific part of postgres (heap tables, indexes,
vauum…). They’re identified by a numeric identifier and often referred to as
a rmid
, for //resource manager identifier//.
Each of those resource managers can handle various operations, which are
internally called opcodes. Here we’re interested in the WAL records
generated while operating on standard heap tables, and especially during
INSERTs. This resource manager is a bit particular as it’s split into 2
different rmid
: RM_HEAP_ID
and RM_HEAP2_ID
. This is only an
implementation details, as each resource manager can only handle a limited
number of opcodes, everything is the same otherwise.
If you’re curious, here’s the definition of the main WAL record in the source code and a bit of details on the exact layout in the files:
/*
* The overall layout of an XLOG record is:
* Fixed-size header (XLogRecord struct)
* XLogRecordBlockHeader struct
* XLogRecordBlockHeader struct
* ...
* XLogRecordDataHeader[Short|Long] struct
* block data
* block data
* ...
* main data
* [...]
*/
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
and a block data header:
/*
* Header info for block data appended to an XLOG record.
*
* 'data_length' is the length of the rmgr-specific payload data associated
* with this block. It does not include the possible full page image, nor
* XLogRecordBlockHeader struct itself.
*
* Note that we don't attempt to align the XLogRecordBlockHeader struct!
* So, the struct must be copied to aligned local storage before use.
*/
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including page
* image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;
Everything here is very generic as it’s used by all the resource managers. One important bit though is the mention of a RelFileLocator after the block header if the record contains information about a different relation from the previous block, whatever is was (which is the meaning of BKPBLOCK_SAME_REL). This is of course important information for us.
typedef struct RelFileLocator
{
Oid spcOid; /* tablespace */
Oid dbOid; /* database */
RelFileNumber relNumber; /* relation */
} RelFileLocator;
But here’s a first reason why you need a proper data directory to do anything with the WALs: this doesn’t contain the schema name and table name, or even the table oid, but the tablespace oid, database oid and relfilenode, which is what the WAL actually need to identify a physical relation file (which is itself split into multiple files, the exact fork and segment are deduced using other information). So any table rewrite happening since the WAL records were generated (e.g. a VACUUM FULL) and you won’t be able to identify which relation a record is about, unless of course you find a way to map the current relfilenode to the one before the table rewrite.
Heap INSERT WAL records
Now that we saw a bit of the general WAL structures, let’s focus on the data specific to an INSERT. If you’re not familiar really with the internals, one easy way to locate the code related to a specific command is to look at the functions associated to a resource manager. Let’s look at the RM_HEAP_ID information in src/include/access/rmgrlist.h:
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
We here have the name of the actual functions responsible for many operations (the exact list will vary depending on the postgres major version, I’m here using the list in postgres 17).
The redo function is the name of the function that applies an RM_HEAP_ID
record, the desc functions is the one that emits the info you see in
pg_waldump, the identify function returns a string describing the opcode
and so on. Let’s look at heap_identify()
:
const char *
heap_identify(uint8 info)
{
const char *id = NULL;
switch (info & ~XLR_INFO_MASK)
{
case XLOG_HEAP_INSERT:
id = "INSERT";
break;
[...]
}
return id;
}
We now know that the opcode we’re interested in is XLOG_HEAP_INSERT. A
quick git grep
in the tree will lead you to
src/backend/access/heap/heapam.c,
more precisely the heap_insert function. The interesting bit is located in
the “XLOG stuff” block. I will show here an extract focusing on the bit we
will need:
void
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
{
[...]
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
XLogRecPtr recptr;
Page page = BufferGetPage(buffer);
uint8 info = XLOG_HEAP_INSERT;
int bufflags = 0;
[...]
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
xlrec.flags = 0;
[...]
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
xlhdr.t_infomask = heaptup->t_data->t_infomask;
xlhdr.t_hoff = heaptup->t_data->t_hoff;
/*
* note we mark xlhdr as belonging to buffer; if XLogInsert decides to
* write the whole page to the xlog, we don't need to store
* xl_heap_header in the xlog.
*/
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
XLogRegisterBufData(0,
(char *) heaptup->t_data + SizeofHeapTupleHeader,
heaptup->t_len - SizeofHeapTupleHeader);
[...]
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
}
We see here that this function is as expected inserting an RM_HEAP_ID
record,
with an XLOG_HEAP_INSERT
opcode. There are 2 data parts associated with this
record: the header of the tuple that’s being inserted and the tuple itself.
That’s great! At this point we know how to identify what relation an INSERT is about and the content of that INSERT. Let’s see how to filter those records from the WALs.
Extracting and filtering WAL records
Parsing the postgres WALs isn’t that complicated but still requires to know quite a bit more than what I showed here. Writing such code is possible but wait, don’t we already have a tool shipped with postgres which is designed to do exactly that? Yes there sure is, it’s pg_waldump.
Rather that writing something similar, couldn’t we simply teach pg_waldump to filter the records we’re interested in and save them somewhere so that we can later process them and generate SQL queries? This way we can then also benefit from all options in pg_waldump like specifying the starting and/or ending LSN or filtering a specific resource manager, without the need to worry about most of the WAL implementation details and only focusing on the few functions provided by postgres necessary for our need. Let’s see how to implement that.
The main source file is src/bin/pg_waldump/pg_waldump.c. Skipping most of the unrelated code, we can see that there’s a main loop that takes care of reading each record one by one, optionally filter them and then do something with them depending on how the tool was executed. I will again show an extract to focus on the most relevant part only:
for (;;)
{
[...]
/* try to read the next record */
record = XLogReadRecord(xlogreader_state, &errormsg);
[...]
/* apply all specified filters */
if (config.filter_by_rmgr_enabled &&
!config.filter_by_rmgr[record->xl_rmid])
continue;
[...]
/* perform any per-record work */
if (!config.quiet)
{
if (config.stats == true)
{
XLogRecStoreStats(&stats, xlogreader_state);
stats.endptr = xlogreader_state->EndRecPtr;
}
else
XLogDumpDisplayRecord(&config, xlogreader_state);
}
/* save full pages if requested */
if (config.save_fullpage_path != NULL)
XLogRecordSaveFPWs(xlogreader_state, config.save_fullpage_path);
/* check whether we printed enough */
config.already_displayed_records++;
if (config.stop_after_records > 0 &&
config.already_displayed_records >= config.stop_after_records)
break;
}
That’s quite simple, pg_waldump read the records one by one until it needs to stop, ignore the records that the users asked to discard and then takes action on the remaining ones. We can see that there’s already an option to save full page images, it definitely looks like we could just add something similar there, but for all records.
First, we will need to provide a way to identify the relation the INSERT is
about. That’s the RelFileLocator
, and we already know that it can be found
just after the XLogRecordBlockHeader. Postgres provides a function to retrieve
this information, and a bit more, named
XLogRecGetBlockTagExtended()
.
Here is it’s description:
/*
* Returns information about the block that a block reference refers to,
* optionally including the buffer that the block may already be in.
*
* If the WAL record contains a block reference with the given ID, *rlocator,
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
* returns true. Otherwise returns false.
*/
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
RelFileLocator *rlocator,
ForkNumber *forknum,
BlockNumber *blknum,
Buffer *prefetch_buffer)
We need to provide the record - pg_waldump already retrieves it for us - and
the block_id
. The block_id
, or block reference, is simply an offset in the
array of data that the WAL records contains. If you look a bit above in this
article, you will see that we already know that heap_insert()
only uses a
hardcoded 0 block_id: this is the first argument in the various
XLogRegisterXXX()
function calls.
Next we need to retrieve the actual WAL record data, the tuple header and the
tuple itself. This one is a bit trickier, as the record can either be found in
a simple WAL record or in a full-page record. We need to check for a simple
WAL record first. The associated function is
XLogRecGetBlockData()
:
/*
* Returns the data associated with a block reference, or NULL if there is
* no data (e.g. because a full-page image was taken instead). The returned
* pointer points to a MAXALIGNed buffer.
*/
char *
XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
As noted in the comment, if the function returns NULL (and sets len to 0) then the data may be in a full-page image instead (or the data could be missing entirely). If that’s the case we need to retrieve the full-page image, and then locate the tuple the INSERT was about and extract it in the same format as a simple WAL record.
Postgres provides a function to extract the full-page image:
RestoreBlockImage()
:
/*
* Restore a full-page image from a backup block attached to an XLOG record.
*
* Returns true if a full-page image is restored, and false on failure with
* an error to be consumed by the caller.
*/
bool
RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
which is straightforward to use: just provide the record and the block identifier and you get the full-page image if found. However, there’s no function available to extract a tuple for a full-page image. Indeed postgres can simply overwrite the whole block with the full-page image as it contains the latest version of the block at the time it was generated, but in our case we definitely don’t want to emit an INSERT statement for every already existing tuple in the block!
Fortunately, even when we get a full-page image, our record still contains a
//main data area//. If you look up at the heap_insert()
function, that’s
the call to XLogRegisterData()
, and as you see here it contains an
xl_heap_insert
struct. And the first member of this struct, offnum, is
actually the position of the tuple in the page which is exactly what we need!
With all of that, it’s just a matter of accessing the tuple header and tuple at
the correct place among all the tuples present in the page, and save as we
would way it would be if it were a simple WAL record. If you’re wondering how
exactly it should be done, you can always look at how postgres itself does it
when it needs to return a specific tuple and adapt that code to your need. The
functions responsible for that are heapgetpage()
and heapgettup()
, located
in the
src/backend/access/heap/heapam.c
file we already mentioned.
We now have the information about the physical file location and the record itself that we will need to transmit to another program to decode it. The best way to do that is to simply save the record as-is in a binary file, and use the file name to transmit the metadata. I chose the following pattern to name the produced files:
LSN.TABLESPACE_OID.DATABASE_OID.RELFILENODE.FORKNAME
It will be trivial for the consumer to parse it and extract the required
metadata. One thing to note is that I don’t put the rmid
or the opcode
here as I’m only emitting the only one I’m interested in and discard everything
else. If that’s not your case you should definitely remember to add those in
the filename pattern.
Since this requires a bit of code to implement, I won’t detail it here but you can find the full result in the patch for pg_waldump that I’m attaching to this article, which implements this as a new –save-records option.
To conclude, let me also remind you that a compiled version of pg_waldump will only work for a single major postgres version. In my case, I had to work with postgres 11, so you can find the patch for this version here, but if needed I also rebased it again the current commit on the master branch, which can be found here.
What’s next?
This is the end of this first article. We saw some details on the postgres WAL infrastructure, with a full example for the case of a plain INSERT on a heap table. We also learned where to look to find where other WAL records are generated and to see more details about the implementation.
We also checked how pg_waldump is working and how to adapt it for our need, with a provided complete patch for both postgres 11 and the current dev version (postgres 17). Again, I’d like to remind you that all this work is only at a proof-of-concept stage, it’s definitely not polished and I’m sure that are many problems that would need to be fixed. One obvious example of such problem is that we’re saving all INSERT we find in the logs but we don’t check if the transaction they’re in eventually committed. It would be possible to fix that but it would require extraneous code, so as is it’s up to the users to double check that as needed. Overall it was enough to recover the needed data so I didn’t pursue any more work on it.
In the next article we will see some usage of this new –save-records option, and also how to read those records and decode them to generate plain INSERT queries. Stay tuned!