Structured Primary Keys

Sometimes I hit a wall when optimizing a client’s query. Over the years, I realized that most cases fall into two categories: (1) unreasonable expectations based on the cloud bill, not on the infrastructure’s power; (2) the primary key design. This article explains how primary keys can put tables into walled gardens so that the database schema falls apart into disconnected pieces, effectively disabling SQL’s set-based powers. I also suggest an alternative approach to primary key design and discuss its pros and cons. I use the term Structured Primary Keys to put emphasis on how it differs from natural and surrogate keys.

Contents:

Customers
Order Lines
Isn’t that Denormalization?
Space-Efficiency
Per-Parent Sequences
Cyclic Consistency
Bottom Line

Customers

I’ll will use a small schema for a simple web shop in this article. Let’s start with the core of the customers table:

CREATE TABLE customers (
  name   VARCHAR  NOT NULL,
  email  VARCHAR  NOT NULL,
  /* more attributes */

  /* TODO: PRIMARY KEY */
)

I kept it short as this article is not about the business attributes.0 The two shown columns are just there to be able to ask the essential question: What is a good primary key for this table?

At first sight, it is the natural vs. surrogate key discussion. But as this is only a side aspect of this article, I’ll keep it short: The table, as shown, has no column that can be considered for inclusion in a primary key.1 The reason is that both columns are governed by externally defined semantics. That means we do not know the rules they follow. In particular, we do not know if, or when they are unique. Well, we know that names are not unique, so we can rule the name column out. Although less obvious, the same is true for email addresses and more generally everything of which the rules of uniqueness are externally defined.

Note that I don’t use the “primary key values should be immutable” argument. Neither do I say that it is a bad argument. When a primary key value changes, it can be difficult to apply this change to the database as it might affect many rows in many tables.2 Further, the old primary key value might have left traces outside the database (APIs, log files, print outs, etc.) so that changing the primary key value renders those traces meaningless. While both arguments are correct, they are both irrelevant from the perspective of the relational theory. I prefer an even stronger argument that is strictly required for a relational model to work.

This argument is that the immutability of the uniqueness rules is required. Note the emphasis: immutable uniqueness rules are strictly required, not immutable values! Let’s take email addresses as example. The problem is that the uniqueness semantics of email addresses depends on the target mail server. In particular, the part before the “@” should be case-sensitive, but often isn’t, subaddressing (“+” addressing) may or may not be supported, single dots might be ignored. Let that sink in. While a.b@gmail.com and ab@gmail.com denote the same mailbox, they might refer to different mailboxes at other domains.

Externally defined uniqueness semantics are a problem. We might not fully understand them, and they might change in the future. Such misunderstandings and changes would break the uniqueness of our primary keys. Therefore, we must not use such values in our primary keys. So far, the story is in line with “the always use a surrogate key” mantra, but the story goes on. The remaining article is thus based on the following definition of the customers table:

CREATE TABLE customers (
  name   VARCHAR  NOT NULL,
  email  VARCHAR  NOT NULL,
  /* more attributes */

  id     BIGINT   NOT NULL GENERATED ALWAYS AS IDENTITY,
  PRIMARY KEY (id)
)

Orders

The second table in the shop is for orders:

CREATE TABLE orders (
  customer_id  BIGINT        NOT NULL,
  FOREIGN KEY (customer_id)  REFERENCES customers (id),

  placed       TIMESTAMP(6)  NOT NULL,
  /* more attributes */

  /* TODO: PRIMARY KEY */
)

Again, I just show the relevant parts. It starts with the reference to the customer, including the foreign key definition. The placed column follows to store time of order placement. Obviously, this table would have many more column, but they are irrelevant for the important question: What is a good primary key for this table?

The situation is slightly different as before. One might argue that the combination of (customer_id, placed) is worth considering. And I actually agree. While the uniqueness semantics of time is also externally defined—by our universe, in confusing ways—, we might consider timestamps to be well enough understood and immutable in context of a web shop. Timestamps of past events even have the nice-to-have property of immutable values, backed by the universe itself. If the timestamp has a sufficiently high resolution, we could also consider it unique. The consequence of a primary key that combines the customer_id and placed columns is that a single customer cannot place two orders at the same microsecond. Or the other way around: The primary key puts a hard limit of one million orders per client per second on the shop’s capacity. Whether or not this is an acceptable limit is an individual decision, which becomes irrelevant further down. If we accept these limits, there is no compelling argument against using the combination of (customer_id, placed) as primary key. Even the nice-to-have of immutable values is granted. The reasons against it are that timestamps consume quite some space on disk^↓ and that they are awkward to handle for humans. Just imagine asking over the phone: “Which order would you like to cancel?”

Let’s consider alternatives as well. If we do not accept the placed column as part of the primary key, the table does not have a candidate key. So, we need to make one like before. Again, the “always use a surrogate key” mantra seems to apply—but this time it leads to a problematic primary key. Have a look at the primary key that is often taken for an orders table:

CREATE TABLE orders (
  customer_id   BIGINT        NOT NULL,
  FOREIGN KEY  (customer_id)
                REFERENCES customers (id),

  placed        TIMESTAMP(6)  NOT NULL,
  /* more attributes */

  id            BIGINT        NOT NULL
                GENERATED ALWAYS AS IDENTITY,
  PRIMARY KEY (id)
)

Can you see the problem? Let me show you how this breaks the schema into pieces…

Order Lines

The table that completes the first design of the shop is for the items of an order:

CREATE TABLE order_lines (
  order_id     BIGINT   NOT NULL,
  FOREIGN KEY  (order_id)
               REFERENCES orders (id),

  product_id   INTEGER  NOT NULL,
  qty          INTEGER  NOT NULL  CHECK (qty > 0),
  /* more attributes */

  /* TODO: PRIMARY KEY */
)

This table is effectively the shopping cart. There is one row for each product that was put into an order. The table has a foreign key to the orders table but no primary key yet. The primary key of the order_lines is not even relevant for this discussion.

Screen-shot from Amazon: You last purchased this item on 1 Jun 2026

To understand the problem this primary key causes, we must consider queries like this: When did a specific customer order a specific product the last time? This is the corresponding query:

SELECT placed
  FROM order_lines ol
  JOIN orders o 
    ON o.id = ol.order_id
 WHERE o.customer_id = ?
   AND ol.product_id  = ?
 ORDER BY placed DESC
 FETCH FIRST 1 ROW ONLY

This is a top-n query in standard SQL syntax. If you are not familiar with the fetch first clause, it works just like limit 1 or select top 1 in this case. As the query needs column from two different tables, it is no surprise that join required. Unless you remember the idea that (customer_id, placed) could be used as primary key on the orders table.

If the primary key on orders is (customer_id, placed), the definition of the order_lines table changes too:

CREATE TABLE order_lines (
  customer_id  BIGINT       NOT NULL,
  order_placed TIMESTAMP(6) NOT NULL,
  FOREIGN KEY  (customer_id, order_placed)
               REFERENCES orders (customer_id, placed),

  product_id   INTEGER  NOT NULL,
  qty          INTEGER  NOT NULL  CHECK (qty > 0),
  /* more attributes */

  /* TODO: PRIMARY KEY */
)

As it has a foreign key to orders, the primary key columns of orders must be available it the order_lines table. As a consequence, the previous query does not need a join anymore:

SELECT order_placed
  FROM order_lines ol
 WHERE customer_id = ?
   AND product_id  = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY

The customer_id as well as the time of the order placement are now readily available next to the product_id. The next chart shows the response time of the first query as the basis (100%). The lower two bars show the relative response time using the query without join. For one engine (E1) it is ten times faster, for the other sill four times faster. But that is not because of the removal of the join!

The next query still performs the join to prove my point. It uses the customer_id and placed columns from the orders table as though they were not available in order_lines. The only references to the order_lines table are for the join condition and for the search on product_id. In particular, the select clause as well as the order by clause refer to columns in the orders table.

SELECT o.placed
  FROM order_lines ol
  JOIN orders o
    ON o.customer_id = ol.customer_id
   AND o.placed      = ol.order_placed
 WHERE o.customer_id = ?
   AND ol.product_id = ?
 ORDER BY o.placed DESC
 FETCH FIRST 1 ROW ONLY

The join increases the response time just by a few percent points. Even with join, the query is factors faster as compared to a schema that uses the single-column primary key on orders.id.

The join is not a big problem. The wall built by the primary key is. The following diagrams make it visible. First, the variant with the single column primary key for the orders table. The arrows show the foreign keys. Their ends are at the respective columns of the constraint.

The orders table separates the customers from the order_lines because “incoming” and “outgoing” foreign keys hit the table at different columns. Whenever a query needs customers and order_lines, a row-by-row mapping between customer_id and order_id is required through the orders table.

This picture changes when the orders table uses the structured primary key (customer_id, placed):

This schema maintains the cohesion among all rows that belong to the same customer. This is of immense value for performance as well as for consistency.

Isn’t that Denormalization?

While it is true that denormalization potentially3 brings the same performance benefits, it is important to understand that a structured primary key does not break normalization. The next picture shows a schema that breaks the second normal form for the sake of performance.

For the analyzed query, this schema brings the same performance-benefits as a structured primary key does. But this schema does not guarantee that the customer_id column in the order_lines table points to the same customer as the respective orders row does. Sooner or later there will be inconsistencies. Many believe that their application can prevent that, but Murphy’s law says the opposite.

`orders`
`id`	`customer_id`
1	10
2	20

`order_lines`
`order_id`	`customer_id`
1	20
2	20
2	10

Given these sample rows, the question “Who placed order 1?” has different answers in each table. Even worse, the order_lines table contradicts itself for order 2.

Of course, I’d like to have both, the performance of the denormalized schema as well as the correctness of a normalized one. This is what structured primary keys provide.

Space-Efficiency

I guess now is a good time to discuss the space-efficiency of structured primary keys. It is foreseeable that more tables can lead to primary keys with more columns. This also requires more typing and—more importantly—increases the chances of making mistakes in the on clause. That’s the way it is. For the time being I must accept this drawback—well aware that some SQL dialects can join along a foreign key without an explicit join condition.4 Schema-aware tools that generate proper queries can mitigate this drawback to some extent.

However, let’s have a look at the memory usage of structured primary keys. Generally, there is no clear answer to the question which primary key design needs more memory. With a structured primary key, the orders table does not need the id column at all. This often spares an index too.5 On the other hand, the order_lines table has more columns. When using heap tables, this counts twice: once for the heap table, once for the index supporting the primary key. For clustered indexes, it counts n-fold as each non-clustered index has the full primary key in it. In systems with clustered indexes, Which factor dominates—the savings in some tables or losses in others—also depends on the data distribution. If there is just one order_lines entry for each order, the savings might outweigh the losses. As said, there is no general answer. But there are generally applicable methods to push the odds in favor of a structured primary key design.

If you like this page, you might also like …

… my mailing lists, free stickers, my book or my training.

In the shop schema, the space consumed by the primary key on the orders table is the critical factor, because its values are copied to all tables that have a foreign key pointing back to it. Therefore, this primary key needs to be space optimized without losing its structured nature. This is a strong argument against putting the placed column in the primary key. After all, the space required by a single timestamp falls into the range of 5 to 11 bytes—depending on the engine and the sub-seconds resolution.

This is why I suggest introducing a small proxy column in cases like this. It is, however, important to keep the two-column structure of the primary key: customer_id and placed itself or a proxy for it. Further down I will use a per-customer order number (order_no) as proxy.

If uniqueness is the only concern, the proxy values can be randomly chosen. In some cases, it is beneficial to choose the proxy values so that they preserve the sort order of the values in the original column (order-preserving). While this is useful when searching for the last time a customer ordered a product, we will see that the performance gain can be very surprisingly small.

The nice thing about such proxy columns is that their value range can be rather small. While it might be required to use a bigint for the single column primary key order_id, the structured primary key (customer_id, order_no) might be fine with two integer values. In case that a bigint needs twice as many bytes as an integer, the tables require the same space in both variants.6 Ultimately, the structured key design is smaller because it needs fewer indexes.7

In this context I’d like to provide a little food for thought about UUIDs: (1) one UUID needs as much space as four integer values;8 (2) those praising UUIDv7 might have always wanted a timestamp in their keys.

Per-Parent Sequences

Some readers might wonder where they should get the just introduced order_no values from. As a short sidenote, I’d like to demonstrate the surprising efficiency of the following insert statement. It picks the next available order_no for the respective customer, inserts a new order row with that value and returns the primary key of the just created row back to the application.9

INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
     , (SELECT COALESCE(MAX(order_no),0) + 1
          FROM orders WHERE customer_id = t.customer_id)
     , CURRENT_TIMESTAMP
  FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_no

The returned primary key values can be used for the insert into the order_lines table.

The basis (100%) in the next chart is the same transaction in a model that uses a single-column primary key on the orders table. Including the search for the next per-customer order_no, the transaction takes about one-third longer when using structured primary keys. As for the space-efficiency, there is no general answer whether the positive or negative effects outweigh the other.

Obviously, concurrency can lead to primary key violations. That is also true when the primary key is (customer_id, placed) and a single customer creates two orders with the same timestamp. An automatic, gentle (sleep) and limited retry loop is required, which is anyway the generic approach to cope with concurrency caused problems, such as dead locks.10

The required idempotency can be established by a dedicated unique column. As this column does not become part of any foreign keys, its size is less relevant. However, the required index still doubles the space consumed by this column so that shorter types might be preferable in (narrow) tables with many rows.

Note that the returning clause is not standard SQL and not widely supported. Standard SQL does provide an even more powerful feature, T495,“Combined data change and retrieval”, which is also not widely supported✓✗. If your engine does not provide this or equivalent functionality, getting the next per-customer order_no requires an extra query. If there is no need to preserve the chronology, client-side assignment (from a pool, by random, etc.) might be an option too. This is beneficial if very high per-customer concurrency is expected.

For comparison, the following chart compares the response time of the search for the last time a customer ordered a product. The basis (100%) is still the schema with a single-column primary key. The relative time comes from a schema using structured primary keys that do not preserve the chronology. In both cases, the query is essentially the same. They just differ in the on clause (one vs. two columns). Yet the query is much faster when using a structured primary key. It’s a miracle, isn’t it?

Of course there is no miracle. When the customer_id is available in the order_lines table, the engine can pick up the order_lines for the relevant customer and product. Next, it fetches the corresponding orders rows and takes the greats timestamp.

If the order_lines table only has the order_id but no customer_id, the engine has to pick up all order_lines for the respective product as it cannot tell whether or not each row belongs to the relevant customer. It also fetches the corresponding orders row for each of these order_lines just to figure out that many of them belong to other customers. The structured primary key saves twice: First, by not picking up order_lines of other customers, then by not accessing their order details.11

New posts per email

Note that the just shown speedup is neither caused by getting rid of a join nor by exploiting order-preserving characteristics of the proxy column. The query is so much faster because the structured primary key in the orders table does not separate customers from products.

The factor between the response times for the two different primary key designs multiplies with each crossed mapping table.12 By mapping table I mean tables for which the left and right join conditions use different columns. Crossing two such tables in a query would not result in a factor between four and ten like above, but rather in a factor between 16 to 100. And so forth.

Cyclic Consistency

Another benefit of structured primary keys is that they can protect against inconsistency if there are multiple join paths between two tables. For demonstration, let’s add another table to the shop: addresses. I also added two columns to the orders table: ship_to and bill_to.

The primary key of the addresses table is structured, very much like those of the other tables. The column address_no is a dense proxy for the actual columns that would be needed to reach uniqueness. The highlighted foreign key between the addresses and orders table deserves special attention: It combines the customer_id column, which was in that table before, with the newly introduced ship_to column. They collectively refer to the addresses table. This ensures that the ship_to address of an order always belongs to the same customer as the order itself. This is a very reasonable business requirement for a web shop. The foreign key for the bill_to address works analogous, albeit not pictured.

If the addresses table had a single column primary key (addresses.id), every order could point to every address—no matter if it belongs to the same customer or not. Starting from a row in the orders table, one could reach the customers table in two different ways. Either directly or through the addresses table. In case of faulty data, we could end at two different customers. Again, “anything that can go wrong will go wrong.”

This type of consistency is fully optional. We can store the full primary key (also the customer_id) of the addresses table for the shipping and/or billing addresses. There would be three customer.id values in the orders table—e.g, customer_id, ship_to_customer, bill_to_customer. This makes it explicit that the respective addresses may belong to different customers.

If it is not known whether or not such cross-customer references are allowed, we can add those columns but introduce a check constraint to validate that all have the same value. If the assumption that they must point to the same customer turns out to be wrong, just dropping the check constraint opens the schema so that it effectively resembles the behavior of a single-column primary key for the addresses table.

The above considerations do not make a lot of sense in the shop example. In a multi-tenant system it can be of great importance to be able to allow or prevent cross-tenant references as needed.

The final topping I have for you is that the above-pictured schema can be very space-efficient. If 32k addresses per customers are enough, we can use a small type that just needs two bytes.13 The columns ship_to and bill_to would, together, just need four bytes. In contrast, a single-column primary key on addresses would need an integer value at least and thus occupy eight bytes in the orders table. On top of that, the structured primary key spares an index on the addresses Table.

Bottom Line

After all, structured primary keys are really just about the idea to base primary keys on a reasonable foreign key plus wherever further columns are required to reach uniqueness. Let’s refer to these additional columns as the plus-columns. The recipe is “foreign key columns plus plus-columns”. Those plus-columns are often candidates for shorter proxy columns. In the last diagram, I marked the parent’s primary key columns with “⇧” and the plus-columns (or their proxy) with “+”.

If you were looking very closely, you might have noticed that I use a different plus-column of the order_lines table in last diagram. Initially it was the product_id column, but later it was replaced by the proxy column position. The reason is that it leaves more doors open in case the design of the products table (not shown) changes. E.g. because variants are introduced.14 After all, there is a reason for the “always use surrogate keys” mantra, but I think it should be applied to the plus-columns only.

I’m not sure about a systematic approach to identify reasonable foreign keys. Obviously, not every foreign key is a candidate to base a primary key on it. It would be very odd to base the primary key of the orders table on the foreign key to the addresses table. Also, addresses might have a foreign key to a table for ISO 3166 country codes, which would be a very odd candidate for the primary key of the addresses table. The best advice I have for you is this: use common sense. Once you have understood the idea and benefits of structured primary keys, you will do fine.15

You can’t catch up on 20 years of SQL evolution in one day. Subscribe the newsletter via E-Mail, Bluesky or RSS to gradually catch up and to keep modern-⁠sql.com on your radar.

About the Author

Markus Winand provides insights into SQL and shows how different systems support it at modern-sql.com. Previously he made use-the-index-luke.com, which is still actively maintained. Markus can be hired as trainer, speaker and consultant via winand.at.

Footnotes

0
In contrast to the traditional approach of normalization, which focuses on reducing a relation’s attributes into candidate keys.
1
In 1971, Codd wrote: “Obviously, there always exists at least one candidate key, …”. This seems to assume that the relation has enough attributes and that their semantics are fully known. Probably reasonable assumption in context of his paper.
2
While this is theoretically solved by SQL’s ON UPDATE CASCADE option of foreign keys, it quickly become practically prohibitive.
3
Not inevitably, because a denormalized schema often requires more indexes which increases the space consumed as well as time time for write operations.
4
Like JOIN TO ONE does in the Oracle database
5
Table that have only one index, to support the primary key, cannot spare any index.
6
Systems or data types that use a variable-length storage of integers automatically benefit from small oder_no values without limiting the total number of orders a single customer can have.
7
If an index on the leading columns of the structured primary key is anyway needed.
8
Typically, if the UUID is stored in binary form. If it stored as string, it typically takes the space of nine integer values. But everything depends on the engine.
9
Of course I took care that the primary key backing index is defined in the right column order: (customer_id, order_no).
The insert statement is actually more complex than needed. This is because I used it as a drop-in in replacement for other variants in my stress testing tool. All those variants just expect a single bind-paramter for the customer_id.
10
Book my data consistency training ;)
11
The described order of execution is just one of the possible execution plans. The positive effect is the same for all of them, however.
12
This is not fully correct as some operations can limit the exponential growth to some extend—e.g. hash joins.
13
Check smallint first.
14
In practice, it would not be a product_id, but an SKUs. But for the sake of the argument, let’s assume it could be the case.
15
Famous last words ;)

Structured Primary Keys

Customers

Orders

Order Lines

Isn’t that Denormalization?

Space-Efficiency

If you like this page, you might also like …

Per-Parent Sequences

New posts per email

Cyclic Consistency

Bottom Line

About the Author

Footnotes

Can I use…

News

The Standard

Concepts

Features

Use Cases

Connect with Markus Winand

Structured Primary Keys

Customers

Orders

Order Lines

Isn’t that Denormalization?

Space-Efficiency

If you like this page, you might also like …

Per-Parent Sequences

New posts per email

Cyclic Consistency

Bottom Line

About the Author

Buy the Book

Hire Markus

Footnotes

Can I use…

News

The Standard

Concepts

Features

Use Cases

Connect with Markus Winand