PostgreSQL: Does the order of columns in a table matter?

Aug 12, 2023 · 8 min read

Picture this: you're designing a database schema for a cool new project in PostgreSQL. As you lay out the blueprint of your tables, you might wonder: does the order in which I define the columns really make a difference? Few of us pay much attention to it early on (I certainly didn't).

The usual approach is to club related columns together during CREATE TABLE and add new columns at the end as the table's requirements grow. It looks like a harmless and straightforward strategy to follow. But once you learn how Postgres stores data, you can reduce a potential overhead that in turn saves storage space, and maybe tune query performance a tad bit. Read along...

Table row layout

In Postgres, a "page" is a fundamental unit of data storage within the database system. A page has a fixed size (typically 8 kB, unless configured otherwise when PostgreSQL is built), and table data is spread across multiple pages based on various factors. Every page and row has a "header" that holds meta information about the data it holds. A row header (often referred to as a tuple header) includes various pieces of information that help Postgres track the row's status, visibility, heap TID and structure. You can read more about it in the PostgreSQL docs on database page layout.
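
If you want to check the page size your server was built with, this should work on any recent PostgreSQL:

    SHOW block_size;
    -- returns the page size in bytes, typically 8192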

This header is 23 bytes long, followed by an optional null bitmap. Meaning, an empty row in Postgres occupies 23 bytes (padded to 24) in any table. This can be confirmed by getting the row size using

                            
    SELECT pg_column_size(row()) AS row_size;

The sizes of the columns of different data types are then added to this header size to obtain the total memory occupied by a row. The data types and their lengths can be obtained using

                            
    SELECT typname, typlen FROM pg_type;

Row data storage

Postgres has two kinds of columns: fixed-length and variable-length. In the results of the above query, you can see the storage size (in bytes) configured for each data type. For variable-length columns, the size is -1 since their length is managed at runtime.
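
For instance, narrowing that query down to a few common types makes the contrast obvious (note that smallint, int and bigint show up under their internal names int2, int4 and int8):

    SELECT typname, typlen
    FROM pg_type
    WHERE typname IN ('bool', 'int2', 'int4', 'int8', 'text', 'numeric')
    ORDER BY typlen DESC;
    -- int8 = 8, int4 = 4, int2 = 2, bool = 1, text = -1, numeric = -1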

Consider a table schema consisting of 3 columns: is_active (boolean), id (bigint) and age (int), in this order. This is just a made-up schema for the purpose of this blog. As per our understanding so far, we can say with confidence that the size of a row in this table will be: 24 (row header) + 1 (boolean) + 8 (bigint) + 4 (int) = 37 bytes
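
As a sketch, the DDL would look something like this (the table name is made up):

    CREATE TABLE IF NOT EXISTS user_profile(
        is_active BOOLEAN,
        id BIGINT,
        age INT
    );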

Let's test the hypothesis. I'll just pack the values as a row in the same order as in the table (for simplicity) and then check the total row size using

                            
    SELECT pg_column_size(
        row(
            't'::BOOLEAN,
            1234567890::BIGINT,
            100::INT
        )
    );

But the row size as returned by the query **drumroll** is 44 bytes. Where did the additional 7 bytes come from? Digging deeper into why such overhead exists, I found that Postgres does something called "data-type alignment" (or "CPU alignment"). The concept of data alignment is rooted in how computer memory is accessed by the CPU. Most modern CPUs access memory in aligned blocks, which means they prefer data to start at memory addresses that are multiples of a specific size (usually the size of the data type being accessed). Proper alignment is important for optimizing memory access and ensuring consistent data storage across various platforms and architectures.

In PostgreSQL, columns are padded in relation to other columns in the table based on their data type's alignment requirements. The goal of padding is to ensure that each column's data starts at an appropriate memory boundary, which helps optimize storage and access performance, particularly when dealing with fixed-length data types.
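
You can already see the effect with the same three values: simply putting the widest type first removes the padding entirely (a quick sanity check, not a benchmark):

    SELECT
        pg_column_size(row('t'::BOOLEAN, 1234567890::BIGINT, 100::INT)) AS boolean_first, -- 44 bytes
        pg_column_size(row(1234567890::BIGINT, 100::INT, 't'::BOOLEAN)) AS bigint_first;  -- 37 bytes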

Meaning, say the data alignment is 4 bytes and we have a table with 2 columns, a smallint (2 bytes) and an int (4 bytes); the data is stored as follows

    1000 -------------------  1004  ---- 1008
    |                           |         |
    | 2 bytes + 2 padding bytes | 4 bytes |
    |                           |         |
        --------------------------   -------
            ^                        ^
            |                        |
        smallint                    int
                        

The 1000, 1004, 1008, ... are memory addresses. Since the data alignment in our example is 4 bytes, the memory is aligned in multiples of 4 bytes. Postgres defines a parameter "maxalign" which sets the maximum alignment boundary. It is set to 8 bytes. Why 8 bytes? My guess is that it matches the word size of 64-bit platforms, so a multiple of 8 works well. Now, coming back to our initial table schema in the order of boolean, bigint and int, the column alignment would have been

    1000  --------------------   1008  ------------ 1016 -------- 1020 --... 1024
    |                           |                  |             |
    | 1 byte + 7 padding bytes  |     8 bytes      |   4 bytes   |
    |                           |                  |             |
        --------------------------   ----------------   -----------   ---....
            ^                         ^                 ^
            |                         |                 |
        boolean                     bigint             int
                      

As per the data-type alignment for the current order of columns, this is how the row would be stored internally. This brings us back to the initial query output, which gave us 44 bytes: 24 + 8 + 8 + 4 = 44 bytes. The math adds up.
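
Each type's alignment requirement is actually recorded in pg_type as typalign ('c' means no alignment needed, 's' means 2 bytes, 'i' means 4 bytes, 'd' means 8 bytes), so you can check it yourself:

    SELECT typname, typlen, typalign
    FROM pg_type
    WHERE typname IN ('bool', 'int2', 'int4', 'int8', 'timestamptz');
    -- bool: 1/'c', int2: 2/'s', int4: 4/'i', int8: 8/'d', timestamptz: 8/'d'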

Keep in mind that though the maxalign length is 8 bytes, the last column of type int is not padded out to 8 bytes; padding is only inserted so that the next column can start at a boundary its own type requires, and since there is no column after this one, Postgres just keeps the memory block at the data type's length. Likewise, if the next column to be added fits within the remaining space without violating its own alignment, no padding is added at all. WTH??? Why is this any different?? That is, the current last column is of type int (4 bytes); say the next column is of type smallint. Then 4 + 2 = 6 (<= 8), so both columns fit within a single 8-byte block and the row size amounts to 24 + 8 + 8 + 4 + 2 = 46 bytes instead of 24 + 8 + 8 + 8 + 2 = 50 bytes. Gotcha there, didn't it?
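
You can verify this with the same row() trick as before (the smallint value is arbitrary):

    SELECT pg_column_size(
        row(
            't'::BOOLEAN,
            1234567890::BIGINT,
            100::INT,
            5::SMALLINT
        )
    );
    -- 46 bytes: the smallint slots in right after the int, no extra padding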

Things look fine so far when we use fixed-length columns, but what about padding for variable-length columns? Such columns are taken care of by a different mechanism called "TOAST" (The Oversized-Attribute Storage Technique). Postgres uses a fixed-size page (8 kB) and does not allow tuples to span multiple pages, so large column values can't fit in place; they are TOASTed to be stored efficiently. TOAST is another blog for another time.

Real world scenario

Consider the following schema of an "order_summary" table from an ecommerce platform.

                            
    CREATE TABLE IF NOT EXISTS order_summary(
        id BIGSERIAL PRIMARY KEY,
        shipping_partner_id SMALLINT,
        order_id BIGINT,
        product_id INT,
        product_category_id INT,
        is_delayed BOOLEAN,
        expected_delivery_date TIMESTAMPTZ,
        country_code INT,
        customer_id BIGINT
    );

I'm sure this is not an "ideal" table schema, but I've seen tables like this in the wild. Let's insert a record and look at the size, as sketched below. Running pg_column_size returns 88 bytes. Notice in the breakdown that follows that the two int columns fit within a single 8-byte block (gotcha again)
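
A minimal way to check it, with made-up values (the id comes from the sequence, and all columns are non-null so there is no null bitmap):

    INSERT INTO order_summary
        (shipping_partner_id, order_id, product_id, product_category_id,
         is_delayed, expected_delivery_date, country_code, customer_id)
    VALUES
        (12, 9876543210, 1001, 42, false, now(), 91, 123456789);

    SELECT pg_column_size(o.*) AS row_size FROM order_summary o;
    -- 88 bytes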

    * ---------------------------------- + -------------- +
    |   id (bigint)                      |  8 bytes       | 
    |   shipping_partner_id (smallint)   |  2 + 6 padding |
    |   order_id (bigint)                |  8 bytes       |
    |   product_id (int)                 |  4 bytes       | 
    |   product_category_id (int)        |  4 bytes       | 
    |   is_delayed (boolean)             |  1 + 7 padding |
    |   exp_delivery_date (timestamptz)  |  8 bytes       |
    |   country_code (int)               |  4 + 4 padding |
    |   customer_id (bigint)             |  8 bytes       |
    |   row header size                  |  24 bytes      |
    * ---------------------------------- + -------------- +
    |        final row size              |  88 bytes      |
    * ---------------------------------- + -------------- +
                        

Fixing the alignment overhead

We have an overhead of 17 bytes that is added purely as alignment padding. The fix is to play around with column positions so as to fit them within 8-byte blocks. Take a stab at it yourself before looking at the rearranged column order.

    * ---------------------------------- + -------------- +
    |   id (bigint)                      |  8 bytes       | 
    |   order_id (bigint)                |  8 bytes       |
    |   exp_delivery_date (timestamptz)  |  8 bytes       |
    |   customer_id (bigint)             |  8 bytes       |
    |   product_id (int)                 |  4 bytes       | 
    |   product_category_id (int)        |  4 bytes       | 
    |   country_code (int)               |  4 bytes       |
    |   shipping_partner_id (smallint)   |  2 bytes       |
    |   is_delayed (boolean)             |  1 byte        |
    |   row header size                  |  24 bytes      |
    * ---------------------------------- + -------------- +
    |        final row size              |  71 bytes      |
    * ---------------------------------- + -------------- +
                        

There is a considerable reduction in row size, from 88 to 71 bytes (~19.3%). Imagine if this table had 100 million rows, which is a plausible scenario given this is an "order summary". The overhead by itself amounts to about 1.7 GB. That is a considerable amount of storage that cannot be ignored. And this is just a single table; imagine the possibility for a database with, say, 25-30 tables. A smaller row size also means more rows per page and fewer pages to scan, which means improved query performance (to an extent). Since this is a made-up scenario to talk about column alignment, we are able to remove the overhead entirely; in the real world, 100% overhead reduction may not be achievable.
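
For reference, here's a sketch of the reordered DDL that produces the layout above (same columns, just rearranged):

    CREATE TABLE IF NOT EXISTS order_summary(
        id BIGSERIAL PRIMARY KEY,
        order_id BIGINT,
        expected_delivery_date TIMESTAMPTZ,
        customer_id BIGINT,
        product_id INT,
        product_category_id INT,
        country_code INT,
        shipping_partner_id SMALLINT,
        is_delayed BOOLEAN
    );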

If you take a look at the column arrangement, one thing that is blatantly visible is that the columns are arranged in descending order of their data type length. This brings us to the handout of this entire blog: you can't mess up alignment padding if you arrange columns in decreasing order of their data type length. But what about variable-length columns? They are still widely used; as noted earlier, such columns are handled differently and don't carry the same fixed padding, which makes them suitable candidates to be placed at the end of the table, after all the fixed-length columns. Even then, not every row will have the same size, since variable-length columns vary in size with the data in them. Our main aim is to reduce alignment overhead, which we can achieve comfortably. The catalog query below is one way to eyeball an existing table against this rule.
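
This is a rough sketch using the system catalogs (order_summary is just the example table from above); it lists a table's columns with their type lengths, fixed-length widest first and variable-length (-1) last:

    SELECT a.attname, t.typname, t.typlen, t.typalign
    FROM pg_attribute a
    JOIN pg_type t ON t.oid = a.atttypid
    WHERE a.attrelid = 'order_summary'::regclass
      AND a.attnum > 0
      AND NOT a.attisdropped
    ORDER BY t.typlen DESC;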

Takeaways

At a high level, this looks like something that can (should) be handled during table creation. Such trivial things can bite us at any later point and shouldn't be left to the wits of the user; ideally Postgres would handle it by itself. Anyhow, we kind of have to reorder columns as the application evolves. But reordering columns of an existing table isn't straightforward, and doing it frequently may (or may not) mess with the system catalogs Postgres maintains for the existing table data (I guess, not certain of this).

This can be easily missed early on, but we can still turn around and save gigabytes of storage with such tweaks. There are tools and extensions available that can detect such scenarios and suggest mitigations.