In the previous article of this series, we saw how to extract the WAL records related to the exact SQL commands we want (INSERTs on heap tables) and what the structure of those records was. In this article we will focus on the heap-specific information contained in those records, and how to extract SQL queries from them.
INSERT data
At the end of the previous article, we could locate the various xl_heap_insert records in the WAL stream. From there, we extracted some metadata about the file's physical location (tablespace oid, database oid and relation filenode among other things) and the inserted data itself.
As a reminder, here’s an extract of the code responsible for generating the WAL records for an INSERT, in the heap_insert() function, focusing on the interesting data:
void
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
int options, BulkInsertState bistate)
{
[...]
xl_heap_header xlhdr;
[...]
xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
xlhdr.t_infomask = heaptup->t_data->t_infomask;
xlhdr.t_hoff = heaptup->t_data->t_hoff;
[...]
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
XLogRegisterBufData(0,
(char *) heaptup->t_data + SizeofHeapTupleHeader,
heaptup->t_len - SizeofHeapTupleHeader);
[...]
Two pieces of data are registered: an xl_heap_header, which contains some metadata about the tuple extracted from the tuple header, and the data part of a HeapTuple. Let's look at those in detail.
Page layout
First of all, let's quickly see how postgres stores tables and indexes on disk. I will only cover the basics that will be helpful for the rest of the article. If you want to dig deeper into this topic, there are tons of resources available. You can refer to this entry point in the code, and I otherwise recommend looking at the section about it on "The internals of postgres" website.
A good general introduction is the documentation, which comes with a diagram of the layout that I include here:
Each tuple and index piece of data that postgres stores on disk is stored in a Page, which is 8kB by default. Each page starts with a header that contains some metadata about the page, and ends with an optional "special area", which can contain additional information specific to the component of postgres that uses this page.
In between is the actual data. The beginning of the data part is an array of ItemId, in ascending order, and the end of the data part holds the items themselves (the tuples, in the case of heap table pages), stored in reverse order from the ItemId array. Unless the page is totally full, there will be an empty space between the last ItemId and the first item (delimited by the pd_lower and pd_upper offsets in the page header).
Here's the ItemId definition:
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of line pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
As you can see it holds the location of the item in the page, minimal metadata and the length of the item.
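To make this a bit more concrete, here's a minimal sketch (not taken from the extension) showing how a heap page can be walked using the usual bufpage.h macros; the function and variable names are mine, and it assumes page points to a valid heap page:
static void
walk_page(Page page)
{
	OffsetNumber off;
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		ItemId		itemid = PageGetItemId(page, off);
		HeapTupleHeader tup;

		if (!ItemIdIsUsed(itemid))
			continue;

		/* lp_off / lp_len give the item's position and size in the page */
		tup = (HeapTupleHeader) PageGetItem(page, itemid);

		elog(DEBUG1, "item %u: offset %u, length %u",
			 off, ItemIdGetOffset(itemid), ItemIdGetLength(itemid));
		(void) tup;			/* a real caller would inspect the tuple here */
	}
}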
HeapTuple
The largest part stored in the record is the tuple itself. As the historic and default access method to store tuples is called heap, the struct that holds the tuple is called HeapTuple. Any custom table access method can use a different struct to store what it needs for its specific implementation, but it will then also use a custom resource manager to generate its own WAL records.
Here’s the definition of a HeapTuple:
typedef struct HeapTupleData
{
uint32 t_len; /* length of *t_data */
ItemPointerData t_self; /* SelfItemPointer */
Oid t_tableOid; /* table the tuple came from */
#define FIELDNO_HEAPTUPLEDATA_DATA 3
HeapTupleHeader t_data; /* -> tuple header and data */
} HeapTupleData;
It starts with some metadata, which isn't stored on disk but is generated or retrieved from somewhere else when the struct is read from disk. Indeed, there wouldn't be much value in storing the relation's oid for each tuple on disk. The length of the tuple is stored on disk, as it's a necessary piece of information, and is retrieved from the associated ItemId that we saw just before.
After that follows the "real" data, which is what is stored in the item part of the Page. It's again split in 2 parts: the tuple header, which I will cover a bit later, and the tuple data.
The tuple data is the physical on-disk representation of the tuple. It was designed to be as space efficient as possible, so accessing individual fields is a bit complex and CPU intensive. Let's look at the most important parts of this design. First, the on-disk tuple is defined like this:
struct HeapTupleHeaderData
{
[...]
/* ^ - 23 bytes - ^ */
#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
You probably know or heard that in postgres, NULL attributes don't use any storage. Indeed, if an attribute is NULL there won't be anything for it in the "data section", and the t_bits bitmap records which attributes are present: the bit for an attribute is only set when it is not NULL.
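As an illustration, here's a small sketch (my own helper, not from the extension) of how that bitmap can be tested, mirroring the backend's att_isnull() macro from access/tupmacs.h; attnum is 0-based:
static bool
attribute_is_null(HeapTupleHeader tup, int attnum)
{
	/* no bitmap at all means no NULL attribute in this tuple */
	if (!(tup->t_infomask & HEAP_HASNULL))
		return false;

	/* the bit is only set when the attribute is NOT NULL */
	return att_isnull(attnum, tup->t_bits);
}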
Then, a lot of data types have a variable size (internally referred to as varlena). So, to save space, postgres doesn't store the offset of each attribute in the HeapTuple and just stores them next to each other (according to the datatype alignment rules) in a big chunk of memory.
This is indeed efficient, but unless your tuple only contains non-NULL fixed-size attributes, the only way to access a specific attribute is to read all the previous ones, skipping the NULL attributes and computing the position of the next one from the length of each variable-size datum. This process is called tuple deforming: it takes a tuple as input and outputs two arrays, one with the datums and one with the NULL flags, both indexed by the attribute number (0-based). The opposite operation (turning an array of datums and an array of NULL flags into a tuple) is unsurprisingly called tuple forming. If you want to read a bit more about those operations, the underlying functions are called heap_deform_tuple() and heap_form_tuple().
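For reference, here's a minimal sketch (not from the extension) of a round trip through those two functions, assuming a TupleDesc for a two-column (integer, text) table is available:
static void
form_and_deform_example(TupleDesc tupdesc)
{
	Datum		values[2];
	bool		isnull[2];
	HeapTuple	tuple;

	values[0] = Int32GetDatum(42);
	values[1] = CStringGetTextDatum("hello");
	isnull[0] = false;
	isnull[1] = false;

	/* forming: build the compact on-disk representation */
	tuple = heap_form_tuple(tupdesc, values, isnull);

	/* deforming: get back one datum and one NULL flag per attribute */
	heap_deform_tuple(tuple, tupdesc, values, isnull);
}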
Note that tuple deforming is one of the operations that can be JIT-compiled, and there are some optimisations around it: postgres supports "partial" deforming and will avoid deforming the full tuple when possible, stopping at the last attribute the query references, and it caches the offset of the last attribute that has been deformed. But that can only help to some extent, so it's always a good idea to mark columns as NOT NULL when possible, to put all the fixed-length columns at the beginning of the table (with the NOT NULL ones first), ideally grouped by alignment size to avoid wasting a few bytes on padding, and to put the most frequently accessed variable-length columns next. All of that will help speed up tuple deforming as much as possible.
Tuple header
The first part of the stored data is an xl_heap_header struct. It's just a shorter version of the real tuple header that only contains some parts of it, the rest of the header being available elsewhere in the WAL record or simply not needed. Doing it this way saves a few bytes for each insert in the WAL, which is always a good thing. Its definition is:
typedef struct xl_heap_header
{
uint16 t_infomask2;
uint16 t_infomask;
uint8 t_hoff;
} xl_heap_header;
t_infomask2 and t_infomask are two bitmaps that contain information about the tuple. You may have heard about hint bits: those two fields contain the tuple-level hint bits.
Let’s look at their details htup_details.c
struct HeapTupleHeaderData
{
[...]
/* Fields below here must match MinimalTupleData! */
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
uint16 t_infomask2; /* number of attributes + various flags */
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
uint16 t_infomask; /* various flag bits, see below */
#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
uint8 t_hoff; /* sizeof header incl. bitmap, padding */
/* ^ - 23 bytes - ^ */
[...]
}
/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
/* bits 0x1800 are available */
#define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
* modified, or tuple deleted */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
[...]
/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL 0x0001 /* has null attribute(s) */
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
[...]
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
[...]
We can see a few bits useful for the tuple deforming. For instance, we see that 11 bits of t_infomask2 are used to store the actual number of attributes stored in this tuple. Adding a new column in a table doesn’t always require a full table rewrite, and in that case those bits are critical to know when to stop looking for additional attributes when accessing tuples stored before the column was added. There’s also information on whether the tuple contains any NULL or variable-length datatype attribute. The rest of the hint bits are a clever use of the available space to handle various SQL operations, MVCC rules, HOT updates and other low level optimisations.
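As a quick illustration, here's a sketch (my own helper, not from the extension) of how those flags can be read from a tuple header:
static void
inspect_header(HeapTupleHeader tup)
{
	/* number of attributes physically stored in this tuple */
	int		natts = HeapTupleHeaderGetNatts(tup);

	/* HEAP_HASNULL is set as soon as one attribute is NULL */
	bool	has_nulls = (tup->t_infomask & HEAP_HASNULL) != 0;

	elog(DEBUG1, "tuple has %d attributes, %s NULLs, data starts at offset %d",
		 natts, has_nulls ? "with" : "without", (int) tup->t_hoff);
}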
Tuple descriptors
Now that we have covered some internals of the HeapTuple, it seems much easier to reach our goal: transforming the INSERT WAL records into plain SQL statements. We know that we just have to deform the tuples to retrieve the values and the NULL attributes, and generating the SQL statements around them isn't hard. But here comes the second reason why we need a proper data directory to do so, and why the lack of DDL is important.
As you probably guessed by now, one critical piece of information needed for the tuple deforming operation is the table structure declaration. Indeed, the HeapTuple is just a big chunk of memory, and without the list of columns, their data types and the type details, it's impossible to interpret it. If your model doesn't change too much, it's probably possible to do without and instead generate some kind of mapping manually based on what you know about the history of the instance. Be careful if you go this way: any discrepancy between the original and generated data types can lead to bogus output in the best case, or crash your whole instance in the worst. But in my case I had the guarantee that no DDL happened since the incident, and I had the other data directory available, so I could just rely on it.
Postgres handles the table structure declaration using another struct called TupleDesc, for tuple descriptor. Its definition is:
typedef struct TupleDescData
{
int natts; /* number of attributes in the tuple */
Oid tdtypeid; /* composite type ID for tuple type */
int32 tdtypmod; /* typmod for tuple type */
int tdrefcount;/* reference count, or -1 if not counting */
TupleConstr *constr; /* constraints, or NULL if none */
/* attrs[N] is the description of Attribute Number N+1 */
FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
} TupleDescData;
In our case the most interesting members are the number of attributes (natts) and the array of pg_attribute records (attrs). Those are also useful for the SQL generation part, as we can retrieve the column names from them. Note also that postgres will generate a TupleDesc automatically when you internally open a relation.
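As an example, here's a small sketch (again not from the extension, with names of my own) of how a relation's TupleDesc can be obtained and its columns listed:
static void
list_columns(Oid relid)
{
	Relation	rel = table_open(relid, AccessShareLock);
	TupleDesc	tupdesc = RelationGetDescr(rel);

	for (int i = 0; i < tupdesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupdesc, i);

		elog(DEBUG1, "column %d: %s (type %s)",
			 att->attnum, NameStr(att->attname),
			 format_type_be(att->atttypid));
	}

	table_close(rel, AccessShareLock);
}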
Let’s recapitulate. We have the record data, the filename contains the physical file location information that we can use to retrieve the actual relation, we know how to get the tuple descriptor for this relation and we can use it to deform the tuple and get the values from it. We have almost everything we need to generate the SQL queries.
The only remaining detail is that the values we get from the tuple deforming operation are in their physical representation, and we need to emit their textual representation. Again, that's not a problem, as each data type has a dedicated function for that, called the type output function, available in pg_type.typoutput.
Extracting SQL from the INSERT records
Now it's time for the fun part, where we just need to put everything together to finish the project!
I chose to write it as an extension to be able to add and remove it easily from a production server. I also chose to minimize the amount of C code and rely on plpgsql functions when possible. It’s faster to write and plpgsql is also way safer.
I only wrote a single pg_decode_record() C function that takes as input a record as a bytea, the tablespace oid and the relation filenode, and emits the underlying SQL query. I wrote an extra pg_decode_all_records() function in plpgsql that uses the existing pg_ls_dir() and pg_read_binary_file() functions to retrieve the files and records, and split_part() to extract the metadata from the filenames.
I’m attaching the resulting extension to this article so you can see the whole implementation and adapt it if needed, and will just quickly describe the main parts here as we already covered the underlying elements. I’m also only showing here a simplified version to avoid too many implementation details.
First, I look for a matching relation oid in the pg_class catalog for the given tablespace and relfilenode, open the found relation with the weakest lock possible, make a copy of the tuple descriptor and start generating the SQL query with the qualified relation name. As with any normal application, you need to make sure that the identifiers are properly quoted to generate working queries:
PGDLLEXPORT Datum
pg_decode_record(PG_FUNCTION_ARGS)
{
bytea *record = PG_GETARG_BYTEA_PP(0);
Oid spc = PG_GETARG_OID(1);
Oid relfilenode = PG_GETARG_OID(2);
/* Get the relation oid from the tablespace oid and relfilenode */
relid = get_spc_relnumber_relid(spc, relfilenode);
relation = table_open(relid, AccessShareLock);
tupdesc = CreateTupleDescCopy(RelationGetDescr(relation));
/* Start generating the SQL query */
initStringInfo(buf);
appendStringInfo(buf, "INSERT INTO %s.%s",
quote_identifier(get_namespace_name(RelationGetNamespace(relation))),
quote_identifier(RelationGetRelationName(relation)));
The next part extracts the data from the record and generates a HeapTuple with just enough information to be correctly deformed:
/* mimic heap_xlog_insert */
data = VARDATA(record);
datalen = VARSIZE_ANY(record);
[...]
htup = &tbuf.hdr;
[...]
htup->t_hoff = xlhdr.t_hoff;
/* build a fake tuple with the bare minimum to deform it */
tuple = (HeapTuple) palloc0(HEAPTUPLESIZE + VARSIZE_ANY(record));
tuple->t_data = htup;
tuple->t_len = VARSIZE_ANY(record);
ItemPointerSetInvalid(&(tuple->t_self));
tuple->t_tableOid = relid;
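For reference, here's a rough sketch of what the elided parts above need to do, modeled on heap_xlog_insert(); the exact variable handling in the real extension may differ slightly:
/* copy the xl_heap_header stored at the beginning of the data */
memcpy(&xlhdr, data, SizeOfHeapHeader);
data += SizeOfHeapHeader;
datalen -= SizeOfHeapHeader;

/* then rebuild a tuple header followed by the NULL bitmap and the data */
MemSet(htup, 0, SizeofHeapTupleHeader);
memcpy((char *) htup + SizeofHeapTupleHeader, data, datalen);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;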
For the next step, we just need to allocate the 2 arrays needed for the deforming and call heap_deform_tuple():
values = palloc0(sizeof(Datum) * tupdesc->natts);
isnull = palloc0(sizeof(bool) * tupdesc->natts);
heap_deform_tuple(tuple, tupdesc, values, isnull);
Now that we have all the elements, we just need to iterate over the list of columns in the tuple descriptor, output a NULL if needed, otherwise find the type output function, call it for our value, and output it in the query after escaping it:
/* append the values */
appendStringInfoString(buf, " VALUES (");
for (i = 0; i < tupdesc->natts; i++)
{
char *value = NULL;
Oid typoutput;
bool typisvarlena;
if (i > 0)
appendStringInfoString(buf, ", ");
if (isnull[i])
{
appendStringInfoString(buf, "NULL");
continue;
}
getTypeOutputInfo(TupleDescAttr(tupdesc, i)->atttypid,
&typoutput, &typisvarlena);
value = OidOutputFunctionCall(typoutput, values[i]);
value = quote_literal_cstr(value);
appendStringInfo(buf, "%s", value);
pfree(value);
}
appendStringInfoString(buf, ");");
Once done, we just need to properly close the relation and return the generated query to the caller:
table_close(relation, NoLock);
PG_RETURN_TEXT_P(cstring_to_text(buf->data));
}
And that’s all you need for the basic scenario! The real implementation has a bit more code for various other cases, like very basic TOAST table support, but is still unlikely to correctly handle any weird corner cases that can happen in the wild.
Basic usage
We can finally see the result of all the hard work in this article and the previous one! I will be using a simple scenario, first saving the current WAL position to only keep the records generated afterwards, then removing all the data from the table (without changing its relfilenode) to make sure that we don’t read anything from the table itself.
-- Get the current WAL location
rjuju=# SELECT pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
F/46349E80
(1 row)
rjuju=# CREATE EXTENSION pg_decode_record;
CREATE EXTENSION
rjuju=# CREATE TABLE decode_record(id integer, val text storage external);
CREATE TABLE
rjuju=# INSERT INTO decode_record
SELECT 1, 'simple test';
INSERT 0 1
-- Force a full-page write
rjuju=# CHECKPOINT;
CHECKPOINT
rjuju=# INSERT INTO decode_record
SELECT 2, 'full-page write';
INSERT 0 1
rjuju=# INSERT INTO decode_record
SELECT 3, 'a bit big '||string_agg(random()::text, ' ') FROM generate_series(1, 10);
INSERT 0 1
rjuju=# INSERT INTO decode_record
SELECT 4, 'way bigger '||string_agg(random()::text, ' ') FROM generate_series(1, 120);
INSERT 0 1
-- Check the heap table size and underlying TOAST table size
rjuju=# SELECT oid::regclass::text, pg_size_pretty(pg_relation_size(oid)),
reltoastrelid::regclass::text, pg_size_pretty(pg_relation_size(reltoastrelid))
FROM pg_class
WHERE relname = 'decode_record';
oid | pg_size_pretty | reltoastrelid | pg_size_pretty
---------------+----------------+-------------------------+----------------
decode_record | 8192 bytes | pg_toast.pg_toast_66731 | 8192 bytes
(1 row)
rjuju=# DELETE FROM decode_record;
DELETE 4
-- Make sure we remove all records and physically empty the tables
rjuju=# VACUUM decode_record;
VACUUM
rjuju=# SELECT oid::regclass::text, pg_size_pretty(pg_relation_size(oid)),
reltoastrelid::regclass::text, pg_size_pretty(pg_relation_size(reltoastrelid))
FROM pg_class
WHERE relname = 'decode_record';
oid | pg_size_pretty | reltoastrelid | pg_size_pretty
---------------+----------------+-------------------------+----------------
decode_record | 0 bytes | pg_toast.pg_toast_66737 | 0 bytes
(1 row)
Ok, we should have a few records generated in the WAL corresponding to data we definitely lost in the table. Let’s extract the INSERT records using the custom pg_waldump we created in the previous article:
$ mkdir -p /tmp/pg_decode_record
$ pg_waldump --start "F/46349E80" --save-records /tmp/pg_decode_record
[...]
$ ls -l /tmp/pg_decode_record
0000000F-46367520.1663.16384.66743.0_main
0000000F-46367660.1663.16384.66743.0_main
0000000F-46367738.1663.16384.66743.0_main
0000000F-46367868.1663.16384.66746.0_main
0000000F-46368130.1663.16384.66746.0_main
0000000F-46368300.1663.16384.66743.0_main
You might wonder why there are 6 records extracted while we only inserted 4 rows. That's because the last row was big enough to be TOASTed into 2 chunks, and as far as the WAL is concerned those are 3 separate INSERTs in 2 different tables. Let's see that in detail using the extension to decode the records (truncating the output as some rows are quite big):
rjuju=# SELECT substr(v, 1, 95)
FROM pg_decode_all_records('/tmp/pg_decode_record') f(v);
substr
-------------------------------------------------------------------------------------------
INSERT INTO public.decode_record (id, val) VALUES ('1', 'simple test');
INSERT INTO public.decode_record (id, val) VALUES ('2', 'full-page write');
INSERT INTO public.decode_record (id, val) VALUES ('3', 'a bit big 0.5356172842583808 0.3...'
INSERT INTO pg_toast.pg_toast_66810 VALUES ('66815', '0', E'\\x7761792062696767657220302e...'
INSERT INTO pg_toast.pg_toast_66810 VALUES ('66815', '1', E'\\x3337383137353120302e303439...'
INSERT INTO public.decode_record (id, val) VALUES ('4', /* toast pointer 66815 */);
(6 rows)
(note: I slightly edited the output to make it smaller and have correct syntax highlighting; the real extension will emit the real table name in a comment in case of an INSERT in a TOAST table)
We see the first normal records properly decoded, whether they’re in a full-page image or not. The last record is indeed split into 3 different INSERTs, 2 in the TOAST table and 1 in the heap table.
As I mentioned earlier I only added very minimal support for TOAST tables, as I didn’t have any information about the customer tables and whether they would hit that case or not, or how often. The last insert isn’t a valid statement as the 2nd value is missing, but we can manually extract the value from the INSERT statements in the TOAST table and therefore fix the normal INSERT. For instance, using the first few bytes that we can see in the first chunk:
rjuju=# SELECT encode(E'\\x7761792062696767657220302e', 'escape');
-[ RECORD 1 ]---------
encode | way bigger 0.
The data is there, it just needs a bit of manual processing to get it.
To be totally fair, I also cheated a bit in that example by making sure that the data would be TOASTed but not compressed, so it's very easy to manually retrieve the raw value from the extra INSERTs in the TOAST tables. It wouldn't be very hard to have all of that working transparently, but I simply didn't have the need. If you're interested in that, I'd recommend looking at the detoast_attr() function in src/backend/access/common/detoast.c and all the underlying code to see how you can manually decompress data. You would then only need to store the detoasted (and potentially decompressed) value referenced by the toast's chunk_id locally, and emit it in the query instead of the currently emitted comment.
Conclusion
I hope you enjoyed those two articles and learned a bit about the WAL infrastructure and the way pages and tuples work internally.
If you missed it in the article, here is the link for the full extension.
I want to emphasize again that all the code I showed here is only a quick proof of concept designed for one narrow use case, and it should be used with care. My goal here wasn't to show state-of-the-art code but rather one possible way to quickly come up with a plan to salvage data in case of a production incident. If you're unfortunately confronted with a similar problem, or some other major incident, I hope you will find here some valuable resources and a starting point to come up with your own dedicated solution!