Sometimes I hit a wall when optimizing a client’s query. Over the years, I realized that most cases fall into two categories: (1) unreasonable expectations based on the cloud bill, not on the infrastructure’s power; (2) the primary key design. This article explains how primary keys can put tables into walled gardens so that the database schema falls apart into disconnected pieces, effectively disabling SQL’s set-based powers. I also suggest an alternative approach to primary key design and discuss its pros and cons. I use the term Structured Primary Keys to put emphasis on how it differs from natural and surrogate keys.
Customers
I’ll will use a small schema for a simple web shop in this article. Let’s start with the core of the customers table:
CREATE TABLE customers (
name VARCHAR NOT NULL,
email VARCHAR NOT NULL,
/* more attributes */
/* TODO: PRIMARY KEY */
)I kept it short as this article is not about the business attributes.0 The two shown columns are just there to be able to ask the essential question: What is a good primary key for this table?
At first sight, it is the natural vs. surrogate key discussion. But as this is only a side aspect of this article, I’ll keep it short: The table, as shown, has no column that can be considered for inclusion in a primary key.1 The reason is that both columns are governed by externally defined semantics. That means we do not know the rules they follow. In particular, we do not know if, or when they are unique. Well, we know that names are not unique, so we can rule the name column out. Although less obvious, the same is true for email addresses and more generally everything of which the rules of uniqueness are externally defined.
Note that I don’t use the “primary key values should be immutable” argument. Neither do I say that it is a bad argument. When a primary key value changes, it can be difficult to apply this change to the database as it might affect many rows in many tables.2 Further, the old primary key value might have left traces outside the database (APIs, log files, print outs, etc.) so that changing the primary key value renders those traces meaningless. While both arguments are correct, they are both irrelevant from the perceptive of the relational theory. I prefer an even stronger argument that is strictly required for a relational model to work.
This argument is that the immutability of the uniqueness rules is required. Note the emphasis: immutable uniqueness rules are strictly required, not immutable values! Let’s take email addresses as example. The problem is that the uniqueness semantics of email addresses depends on the target mail server. In particular, the part before the “@” should be case-sensitive, but often isn’t, subaddressing (“+” addressing) may or may not be supported, dots might be ignored. Let that sink in. While a.b@gmail.com and ab@gmail.com denote the same mailbox, they might refer to different mailboxes at other domains.
Externally defined uniqueness semantics are a problem. We might not fully understand them, and they might change in the future. Such misunderstandings and changes would break the uniqueness of our primary keys. Therefore, we must not use such values in our primary keys. So far, the story is in line with “the always use a surrogate key” mantra, but the story goes on. The remaining article is thus based on the following definition of the customers table:
CREATE TABLE customers (
name VARCHAR NOT NULL,
email VARCHAR NOT NULL,
/* more attributes */
id BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
PRIMARY KEY (id)
)Orders
The second table in the shop is for orders:
CREATE TABLE orders (
customer_id BIGINT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers (id),
placed TIMESTAMP(6) NOT NULL,
/* more attributes */
/* TODO: PRIMARY KEY */
)Again, I just show the relevant parts. It starts with the reference to the customer, including the foreign key definition. The placed column follows to store time of order placement. Obviously, this table would have many more column, but they are irrelevant for the important question: What is a good primary key for this table?
The situation is slightly different as before. One might argue that the combination of (customer_id, placed) is worth considering. And I actually agree. While the uniqueness semantics of time is also externally defined—by our universe, in confusing ways—, we might consider timestamps to be well enough understood and immutable in context of a web shop. Timestamps of past events even have the nice-to-have property of immutable values, backed by the universe itself. If the timestamp has a sufficiently high resolution, we could also consider it unique. The consequence of a primary key that combines the customer_id and placed columns is that a single customer cannot place two orders at the same microsecond. Or the other way around: The primary key puts a hard limit of one million orders per client per second on the shop’s capacity. Whether or not this is an acceptable limit is an individual decision, which becomes irrelevant further down. If we accept these limits, there is no compelling argument against using the combination of (customer_id, placed) as primary key. Even the nice-to-have of immutable values is granted. The reasons against it are that timestamps consume quite some space on disk↓ and that they are awkward to handle for humans. Just imagine asking over the phone: “Which order would you like to cancel?”
Let’s consider alternatives as well. If we do not accept the placed column as part of the primary key, the table does not have a candidate key. So, we need to make one like before. Again, the “always use a surrogate key” mantra seems to apply—but this time it leads to a problematic primary key. Have a look at the primary key that is often taken for an orders table:
CREATE TABLE orders (
customer_id BIGINT NOT NULL,
FOREIGN KEY (customer_id)
REFERENCES customers (id),
placed TIMESTAMP(6) NOT NULL,
/* more attributes */
id BIGINT NOT NULL
GENERATED ALWAYS AS IDENTITY,
PRIMARY KEY (id)
)Can you see the problem? Let me show you how this breaks the schema into pieces…
Order Lines
The table that completes the first design of the shop is for the items of an order:
CREATE TABLE order_lines (
order_id BIGINT NOT NULL,
FOREIGN KEY (order_id)
REFERENCES orders (id),
product_id INTEGER NOT NULL,
qty INTEGER NOT NULL CHECK (qty > 0),
/* more attributes */
/* TODO: PRIMARY KEY */
)This table is effectively the shopping cart. There is one row for each product that was put into an order. The table has a foreign key to the orders table but no primary key yet. The primary key of the order_lines is not even relevant for this discussion.

To understand the problem this primary key causes, we must consider queries like this: When did a specific customer order a specific product the last time? This is the corresponding query:
SELECT placed
FROM order_lines ol
JOIN orders o
ON o.id = ol.order_id
WHERE customer_id = ?
AND product_id = ?
ORDER BY placed DESC
FETCH FIRST 1 ROW ONLYThis is a top-n query in standard SQL syntax. If you are not familiar with the fetch first clause, it works just like limit 1 or select top 1 in this case. As the query needs column from two different tables, it is no surprise that join required. Unless you remember the idea that (customer_id, placed) could be used as primary key on the orders table.
If the primary key on orders is (customer_id, placed), the definition of the order_lines table changes too:
CREATE TABLE order_lines (
customer_id BIGINT NOT NULL,
order_placed TIMESTAMP(6) NOT NULL,
FOREIGN KEY (customer_id, order_placed)
REFERENCES orders (customer_id, placed),
product_id INTEGER NOT NULL,
qty INTEGER NOT NULL CHECK (qty > 0),
/* more attributes */
/* TODO: PRIMARY KEY */
)As it has a foreign key to orders, the primary key columns of orders must be available it the order_lines table. As a consequence, the previous query does not need a join anymore:
SELECT order_placed
FROM order_lines ol
WHERE customer_id = ?
AND product_id = ?
ORDER BY order_placed DESC
FETCH FIRST 1 ROW ONLYThe customer_id as well as the time of the order placement are now readily available next to the product_id. The next chart shows the response time of the first query as the basis (100%). The lower two bars show the relative response time using the query without join. For one engine (E1) it is ten times faster, for the other sill four times faster. But that is not because of the removal of the join!
The next query still performs the join to prove my point. It uses the customer_id and placed columns from the orders table as though they were not available in order_lines. The only references to the order_lines table are for the join condition and for the search on product_id. In particular, the select clause as well as the order by clause refer to columns in the orders table.
SELECT o.placed
FROM order_lines ol
JOIN orders o
ON o.customer_id = ol.customer_id
AND o.placed = ol.order_placed
WHERE o.customer_id = ?
AND ol.product_id = ?
ORDER BY o.placed DESC
FETCH FIRST 1 ROW ONLYThe join increases the response time just by a few percent points. Even with join, the query is factors faster as compared to a schema that uses the single-column primary key on orders.id.
The join is not a big problem. The wall built by the primary key is. The following diagrams make it visible. First, the variant with the single column primary key for the orders table. The arrows show the foreign keys. Their ends are at the respective columns of the constraint.
The orders table separates the customers from the order_lines because “incoming” and “outgoing” foreign keys hit the table at different columns. Whenever a query needs customers and order_lines, a row-by-row mapping between customer_id and order_id is required through the orders table.
This picture changes when the orders table uses the structured primary key (customer_id, placed):
This schema maintains the cohesion among all rows that belong to the same customer. This is of immense value for performance as well as for consistency.
Isn’t that Denormalization?
While it is true that denormalization potentially3 brings the same performance benefits, it is important to understand that a structured primary key does not break normalization. The next picture shows a schema that breaks the second normal form for the sake of performance.
For the analyzed query, this schema brings the same performance-benefits as a structured primary key does. But this schema does not guarantee that the customer_id column in the order_lines table points to the same customer as the respective orders row does. Sooner or later there will be inconsistencies. Many believe that their application can prevent that, but Murphy’s law says the opposite.
id | customer_id |
|---|---|
| 1 | 10 |
| 2 | 20 |
order_id | customer_id |
|---|---|
| 1 | 20 |
| 2 | 20 |
| 2 | 10 |
Given these sample rows, the question “Who placed order 1?” has different answers in each table. Even worse, the order_lines table contradicts itself for order 2.
Of course, I’d like to have both, the performance of the denormalized schema as well as the correctness of a normalized one. This is what structured primary keys provide.
Space-Efficiency
I guess now is a good time to discuss the space-efficiency of structured primary keys. It is foreseeable that more tables can lead to primary keys with more columns. This also requires more typing and—more importantly—increases the chances of making mistakes in the on clause. That’s the way it is. For the time being I must accept this drawback—well aware that some SQL dialects can join along a foreign key without an explicit join condition.4 Schema-aware tools that generate proper queries can mitigate this drawback to some extent.
However, let’s have a look at the memory usage of structured primary keys. Generally, there is no clear answer to the question which primary key design needs more memory. With a structured primary key, the orders table does not need the id column at all. This often spares an index too.5 On the other hand, the order_lines table has more columns. In systems that use heap tables, this counts twice: once for the heap table, once for the index supporting the primary key. Which factor dominates—the savings in some tables or losses in others—also depends on the data distribution. If there is just one order_lines entry for each order, the savings might outweigh the losses. As said, there is no general answer. But there are generally applicable methods to push the odds in favor of a structured primary key design.
If you like this page, you might also like …
… to subscribe my mailing lists, get free stickers, buy my book or get training.
In the shop schema, the space consumed by the primary key on the orders table is the critical factor, because its values are copied to all tables that have a foreign key pointing back to it. Therefore, this primary key needs to be space optimized without losing its structured nature. This is a strong argument against putting the placed column in the primary key. After all, the space required by a single timestamp falls into the range of 5 to 11 bytes—depending on the engine and the sub-seconds resolution.
This is why I suggest introducing a small proxy column in cases like this. It is, however, important to keep the two-column structure of the primary key: customer_id and placed itself or a proxy for it. Further down I will use a per-customer order number (order_no) as proxy.
If uniqueness is the only concern, the proxy values can be randomly chosen. In some cases, it is beneficial to choose the proxy values so that they preserve the sort order of the values in the original column (order-preserving). While this is useful when searching for the last time a customer ordered a product, we will see that the performance gain can be very surprisingly small.
The nice thing about such proxy columns is that their value range can be rather small. While it might be required to use a bigint for the single column primary key order_id, the structured primary key (customer_id, order_no) might be fine with two integer values. In case that a bigint needs twice as many bytes as an integer, the tables require the same space in both variants.6 Ultimately, the structured key design is smaller because it needs fewer indexes.7
In this context I’d like to provide a little food for thought about UUIDs: (1) one UUID needs as much space as four integer values;8 (2) those praising UUIDv7 might have always wanted a timestamp in their keys.
Per-Parent Sequences
Some readers might wonder where they should get the just introduced order_no values from. As a short sidenote, I’d like to demonstrate the surprising efficiency of the following insert statement. It picks the next available order_no for the respective customer, inserts a new order row with that value and returns the primary key of the just created row back to the application.9
INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
, (SELECT COALESCE(MAX(order_no),0) + 1
FROM orders WHERE customer_id = t.customer_id)
, CURRENT_TIMESTAMP
FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_noThe returned primary key values can be used for the insert into the order_lines table.
The basis (100%) in the next chart is the same transaction in a model that uses a single-column primary key on the orders table. Including the search for the next per-customer order_no, the transaction takes about one-third longer when using structured primary keys. As for the space-efficiency, there is no general answer whether the positive or negative effects outweigh the other.
Obviously, concurrency can lead to primary key violations. That is also true when the primary key is (customer_id, placed) and a single customer creates two orders with the same timestamp. An automatic, gentle (sleep) and limited retry loop is required, which is anyway the generic approach to cope with concurrency caused problems, such as dead locks.10
The required idempotency can be established by a dedicated unique column. As this column does not become part of any foreign keys, its size is less relevant. However, the required index still doubles the space consumed by this column so that shorter types might be preferable in (narrow) tables with many rows.
Note that the returning clause is not standard SQL and not widely supported. Standard SQL does provide an even more powerful feature, T495,“Combined data change and retrieval”, which is also not widely supported✓✗. If your engine does not provide this or equivalent functionality, getting the next per-customer order_no requires an extra query. If there is no need to preserve the chronology, client-side assignment (from a pool, by random, etc.) might be an option too. This is beneficial if very high per-customer concurrency is expected.
For comparison, the following chart compares the response time of the search for the last time a customer ordered a product. The basis (100%) is still the schema with a single-column primary key. The relative time comes from a schema using structured primary keys that do not preserve the chronology. In both cases, the query is essentially the same. They just differ in the on clause (one vs. two columns). Yet the query is much faster when using a structured primary key. It’s a miracle, isn’t it?
Of course there is no miracle. When the customer_id is available in the order_lines table, the engine can pick up the order_lines for the relevant customer and product. Next, it fetches the corresponding orders rows and takes the greats timestamp.
If the order_lines table only has the order_id but no customer_id, the engine has to pick up all order_lines for the respective product as it cannot tell whether or not each row belongs to the relevant customer. It also fetches the corresponding orders row for each of these order_lines just to figure out that many of them belong to other customers. The structured primary key saves twice: First, by not picking up order_lines of other customers, then by not accessing their order details.11
Note that the just shown speedup is neither caused by getting rid of a join nor by exploiting order-preserving characteristics of the proxy column. The query is so much faster because the structured primary key in the orders table does not separate customers from products.
The factor between the response times for the two different primary key designs multiplies with each crossed mapping table.12 By mapping table I mean tables for which the left and right join conditions use different columns. Crossing two such tables in a query would not result in a factor between four and ten like above, but rather in a factor between 16 to 100. And so forth.
Cyclic Consistency
Another benefit of structured primary keys is that they can protect against inconsistency if there are multiple join paths between two tables. For demonstration, let’s add another table to the shop: addresses. I also added two columns to the orders table: ship_to and bill_to.
The primary key of the addresses table is structured, very much like those of the other tables. The column address_no is a dense proxy for the actual columns that would be needed to reach uniqueness. The highlighted foreign key between the addresses and orders table deserves special attention: It combines the customer_id column, which was in that table before, with the newly introduced ship_to column. They collectively refer to the addresses table. This ensures that the ship_to address of an order always belongs to the same customer as the order itself. This is a very reasonable business requirement for a web shop. The foreign key for the bill_to address works analogous, albeit not pictured.
If the addresses table had a single column primary key (addresses.id), every order could point to every address—no matter if it belongs to the same customer or not. Starting from a row in the orders table, one could reach the customers table in two different ways. Either directly or through the addresses table. In case of faulty data, we could end at two different customers. Again, “anything that can go wrong will go wrong.”
This type of consistency is fully optional. We can store the full primary key (also the customer_id) of the addresses table for the shipping and/or billing addresses. There would be three customer.id values in the orders table—e.g, customer_id, ship_to_customer, bill_to_customer. This makes it explicit that the respective addresses may belong to different customers.
If it is not known whether or not such cross-customer references are allowed, we can add those columns but introduce a check constraint to validate that all have the same value. If the assumption that they must point to the same customer turns out to be wrong, just dropping the check constraint opens the schema so that it effectively resembles the behavior of a single-column primary key for the addresses table.
The above considerations do not make a lot of sense in the shop example. In a multi-tenant system it can be of great importance to be able to allow or prevent cross-tenant references as needed.
The final topping I have for you is that the above-pictured schema can be very space-efficient. If 32k addresses per customers are enough, we can use a small type that just needs two bytes.13 The columns ship_to and bill_to would, together, just need four bytes. In contrast, a single-column primary key on addresses would need an integer value at least and thus occupy eight bytes in the orders table. On top of that, the structured primary key spares an index on the addresses Table.
Bottom Line
After all, structured primary keys are really just about the idea to base primary keys on a reasonable foreign key plus wherever further columns are required to reach uniqueness. Let’s refer to these additional columns as the plus-columns. The recipe is “foreign key columns plus plus-columns”. Those plus-columns are often candidates for shorter proxy columns. In the last diagram, I marked the parent’s primary key columns with “⇧” and the plus-columns (or their proxy) with “+”.
If you were looking very closely, you might have noticed that I use a different plus-column of the order_lines table in last diagram. Initially it was the product_id column, but later it was replaced by the proxy column position. The reason is that it leaves more doors open in case the design of the products table (not shown) changes. E.g. because variants are introduced.14 After all, there is a reason for the “always use surrogate keys” mantra, but I think it should be applied to the plus-columns only.
I’m not sure about a systematic approach to identify reasonable foreign keys. Obviously, not every foreign key is a candidate to base a primary key on it. It would be very odd to base the primary key of the orders table on the foreign key to the addresses table. Also, addresses might have a foreign key to a table for ISO 3166 country codes, which would be a very odd candidate for the primary key of the addresses table. The best advice I have for you is this: use common sense. Once you have understood the idea and benefits of structured primary keys, you will do fine.15

