Normalisierung, Primärschlüssel und andere Albträume
Markus Winand https://winand.at • https://modern-sql.com • https://use-the-index-luke.com
Database normalization is the process of structuring a relational database in accordance with a series of normal forms…
The normal forms (from least normalized to most normalized) are:
| 1NF | First normal form |
| 2NF | Second normal form |
| 3NF | Third normal form |
| EKNF | Elementary key normal form |
| BCNF | Boyce–Codd normal form |
| 4NF | Fourth normal form |
| ETNF | Essential tuple normal form |
| 5NF | Fifth normal form |
| DKNF | Domain-key normal form |
| 6NF | Sixth normal form |
—https://en.wikipedia.org/wiki/Database_normalization
…to reduce data redundancy and improve data integrity.—https://en.wikipedia.org/wiki/Database_normalization
✅ 1. Redundancy and integrity ✅ 2. Increase the lifespan of applications ❓ 3. More informative to users ✅ 4. Neutral to query statistics
—Further Normalisation of the Data Base Relational Model E.F. Codd (1971)
Normalize until it hurts, denormalize until it works.—Ancient proverb
Can we beat that?
CREATE TABLE customers ( name VARCHAR NOT NULL, email VARCHAR NOT NULL, dob DATE, -- GDPR? /* more attributes */ PRIMARY KEY ( ? ) id INTEGER GENERATED ALWAYS AS IDENTITY, PRIMARY KEY (id) )
None of these columns is a good candidate key. While there is a functional dependency email ⇒ name, dob, ... the semantics of e-mail addresses are a show-stopper.
Further, there are more desirable properties of primary keys: • Space-efficiency • Immutability
In absences of a suitable candidate key, we have to make one Whether this is a number, UUID or anything else is a “detail” covered later
Also covered later: Should the DBMS or the App assign it?
CREATE TABLE orders ( customer_id INTEGER NOT NULL, FOREIGN KEY (customer_id) REFERENCES customers (id), placed TIMESTAMP NOT NULL, /* more attributes */ PRIMARY KEY ( ? ) )
Candidate Keys? Maybe (customer_id, placed)?
Surrogate Key? A new “id” column could be used like before
CREATE TABLE order_lines ( /* PK of orders */ NOT NULL, FOREIGN KEY (...) REFERENCES orders (...), product_id INTEGER NOT NULL, qty INTEGER NOT NULL CHECK (qty > 0), /* more attributes */ PRIMARY KEY ( ) )
Candidate Keys? Maybe (“PK of orders”, product_id)?
Dedicated Column? A new “id” column like before?
INSERT INTO orders
(id , …)
VALUES (DEFAULT, …)
RETURNING idINSERT INTO order_lines
(order_id, …)
VALUES (? , …)INSERT INTO orders
(customer_id, placed, …)
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placedINSERT INTO order_lines
(customer_id, order_placed, …)
VALUES (? , ? , …)
SELECT placed
FROM order_lines ol
JOIN orders o
ON o.id = ol.order_id
WHERE customer_id = ?
AND product_id = ?
ORDER BY placed DESC
FETCH FIRST 1 ROW ONLYSELECT order_placed FROM order_lines ol WHERE customer_id = ? AND product_id = ? ORDER BY order_placed DESC FETCH FIRST 1 ROW ONLY
SELECT order_placed FROM order_lines ol WHERE customer_id = ? AND product_id = ? ORDER BY order_placed DESC FETCH FIRST 1 ROW ONLY
SELECT placed
FROM order_lines ol
JOIN orders o
ON o.id = ol.order_id
WHERE customer_id = ?
AND product_id = ?
ORDER BY placed DESC
FETCH FIRST 1 ROW ONLYSELECT order_placed FROM order_lines ol WHERE customer_id = ? AND product_id = ? ORDER BY order_placed DESC FETCH FIRST 1 ROW ONLY
Limit -- PostgreSQL
-> Index Only Scan Backward
using … on order_linesLimit -- PostgreSQL
-> Sort orders.placed DESC
-> Nested Loop
-> Index Only Scan o
-> Index Only Scan olRETURN -- Db2 (LUW)
IXSCAN (REVERSE) ORDER_LI…SELECT o.placed
FROM order_lines ol
JOIN orders o /* inner join ! */
ON o.customer_id = ol.customer_id
AND o.placed = ol.order_placed
WHERE o.customer_id = ?
AND product_id = ?
ORDER BY o.placed DESC
FETCH FIRST 1 ROW ONLYLimit -- PostgreSQL
-> Nested Loop
-> Index Only Scan Backward ol
-> Index Only Scan oRETURN -- Db2 (LUW)
IXSCAN (REVERSE) ORDER_LINESorders: order_idPrimary key on orders: customer_id, placedNote: Best of Q1, Q2, Q3 depends of the popularity of the involved products!
Q1
SELECT customers.*
FROM (SELECT customer_id
FROM order_lines
JOIN orders ON orders.id = order_lines.order_id
WHERE product_id IN (?, ?)
GROUP BY customer_id
HAVING MAX(placed) FILTER (WHERE product_id = ?)
> COALESCE( MAX(placed) FILTER (WHERE product_id = ?)
, DATE'1900-01-01')
) t
JOIN customers ON customers.id = t.customer_idQ2
SELECT *
FROM (SELECT customer_id
FROM order_lines
JOIN orders ON orders.id = order_lines.order_id
WHERE product_id = ?
GROUP BY customer_id
HAVING NOT EXISTS (SELECT *
FROM order_lines ol
JOIN orders o ON o.id = ol.order_id
WHERE o.customer_id = orders.customer_id
AND o.placed >= MAX(orders.placed)
AND product_id = ?
)
) t
JOIN customers ON customers.id = t.customer_idQ3
WITH l AS (SELECT customer_id, MAX(placed) last_placed
FROM orders
JOIN order_lines ON order_id = orders.id
WHERE product_id = ?
GROUP BY customer_id )
SELECT *
FROM customers
JOIN l ON l.customer_id = customers.id
WHERE NOT EXISTS (SELECT *
FROM orders
JOIN order_lines ON order_id = orders.id
WHERE orders.customer_id = customers.id
AND product_id = ?
AND placed >= l.last_placed )Yet another primary key alternative
INSERT INTO orders
(customer_id, placed, …)
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placedINSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
, (SELECT COALESCE(MAX(order_no),0) + 1
FROM orders WHERE customer_id = t.customer_id)
, CURRENT_TIMESTAMP
FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_noorder_idSizeNew OrderLast OrderedA ⋠ B (best)
customer_id, placedSizeNew OrderLast OrderedA ⋠ B (best)
customer_id, order_noSizeNew OrderLast OrderedA ⋠ B (best)
The bottom line
⇧ Primary key columns of the parent table
+ Columns needed to allow multiple child records I use the term “SubID” to describe the role of this column
I refer to primary keys designed in the way as Structured Primary Keys
Externally Defined Semantics…may change at any time
↳ Collations change
↳ Issuers might suddenly
↳↳ re-use values
↳↳ change values (ISO 3166)…can be surprisingly complex
↳ e-mail addresses
↳↳ Mixed case-sensitivity
↳↳ before/after @
↳↳ Subaddressing: me+you@
↳ customers.email is not
↳ a suitable PK component
Internally Defined Semantics
…are also subject to change!
↳ Introducing product variants
↳ such as size, color, …
↳↳ would affect order_lines
↳↳ PK if it is (…, product_id)
↳ order_lines.product_id is
↳ not a suitable PK component
↳ Bonus: “SubID” position
↳ can preserve the cart order
Immutable values are not strictly required… … but very desirable.
If the structured primary key design leads to the use of mutable columns, there is (probably) another design mistake in the schema.
Example: Departments and employees
The employees PK should *not* be (department_id, employee_no)Not every FK is a candidate
In this example, departments and employees are actually both top-level tables. The are connected by a separate table to implement the (temporal) N:N connection.
“SubIDs” are inherited to all child tables and their rows Or the other way around:Each primary key is the set of its “SubID” and those of its “ancestor” tables Their memory footprint is more relevant than that of other columns
On the other hand, smaller types (e.g. INT) might suffice for “SubIDs”, whereas BIGINT would be required for a single-column primary key
Normalization requires candidate keys to be minimal in terms of columns. But this is in different context and does not contradict the above! In this context here, only the total memory demand matters.
UUIDs…are bulky ↳ 16 bytes (binary) ↳ 36 bytes (ASCII)1 UUID = 4 INTEGERsEcstatic about UUIDv7? You actually always wanted a timestamps in your keys!
Timestamps…are bulky
↳ 5-11 bytes, w/o TZ, varies by engine & depends on #fractions
↳ Awkward for humans to handle: Just imagine on the phone…
↳ orders.placed is very questionable as PK component
Character Strings…are often bulky…and complex
I think always opting for a numeric, incrementing “SubID” is the best practice
INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
, (SELECT COALESCE(MAX(order_no),0) + 1
FROM orders WHERE customer_id = t.customer_id)
, CURRENT_TIMESTAMP
FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_noConcurrent activity, for the same customer_id, causes primary key violations
An automatic, gentle and limited re-try loop is required (generally required to cope with concurrency induced errors like deadlocks)Re-try loops require idempotency
CREATE TABLE orders ( customer_id INTEGER NOT NULL, FOREIGN KEY (customer_id) REFERENCES customers (id), order_no INTEGER NOT NULL, -- SubId PRIMARY KEY (customer_id, order_no) placed TIMESTAMP NOT NULL, /* more attributes */ idem_key UUID NOT NULL UNIQUE )
A separate UNIQUE column for idempotency is required As it is not inherited to other tables, size is less relevant
It can be used in interfaces: does’t leak information, is not guessable
Cyclic Integrity
Unstructured PKs separate two sides from each other
Joining through such tables needs a
row-by-row re-mapping (id⇔customer_id)
to reach the opposite side
This is denormalization
This is a structured PK