Normalisierung, Primärschlüssel und andere Albträume

Markus Winand https://winand.at • https://modern-sql.com • https://use-the-index-luke.com

What is Normalization?

Database normalization is the process of structuring a relational database in accordance with a series of normal forms…

The normal forms (from least normalized to most normalized) are:

1NFFirst normal form
2NFSecond normal form
3NFThird normal form
EKNFElementary key normal form
BCNFBoyce–Codd normal form
4NFFourth normal form
ETNFEssential tuple normal form
5NFFifth normal form
DKNFDomain-key normal form
6NFSixth normal form

—https://en.wikipedia.org/wiki/Database_normalization

What is Normalization Good For?

…to reduce data redundancy and improve data integrity.—https://en.wikipedia.org/wiki/Database_normalization

✅ 1. Redundancy and integrity ✅ 2. Increase the lifespan of applications ❓ 3. More informative to users ✅ 4. Neutral to query statistics

—Further Normalisation of the Data Base Relational Model E.F. Codd (1971)

Normalization

Normalize until it hurts, denormalize until it works.—Ancient proverb

Can we beat that?

Example: Shop
CREATE TABLE customers (
   name   VARCHAR  NOT NULL,
   email  VARCHAR  NOT NULL,
   dob    DATE, -- GDPR?
   /* more attributes */
   PRIMARY KEY (     ?     )
   id     INTEGER  GENERATED ALWAYS AS IDENTITY,
   PRIMARY KEY (id)
)                                                

None of these columns is a good candidate key. While there is a functional dependency email ⇒ name, dob, ... the semantics of e-mail addresses are a show-stopper.

Further, there are more desirable properties of primary keys: • Space-efficiency     • Immutability

In absences of a suitable candidate key, we have to make one Whether this is a number, UUID or anything else is a “detail” covered later

Also covered later: Should the DBMS or the App assign it?

Example: Shop
CREATE TABLE orders (
   customer_id  INTEGER      NOT NULL,
   FOREIGN KEY (customer_id) REFERENCES customers (id),

   placed       TIMESTAMP    NOT NULL,
   /* more attributes */

   PRIMARY KEY (     ?     )
)

Candidate Keys? Maybe (customer_id, placed)?  

Surrogate Key? A new “id” column could be used like before

Example: Shop
CREATE TABLE order_lines (
   /*  PK of orders  */  NOT NULL,
   FOREIGN KEY (...) REFERENCES orders (...),

   product_id  INTEGER   NOT NULL,
   qty         INTEGER   NOT NULL  CHECK (qty > 0),
   /* more attributes */

   PRIMARY KEY (   )
)

Candidate Keys? Maybe (“PK of orders”, product_id)?

Dedicated Column? A new “id” column like before?

Example: Shop - Primary & Foreign Keys
customers🔑+idnameemaildoborders🔑idcustomer_idplacedorders🔑customer_id🔑+order_noplacedorders🔑customer_id🔑+order_noplacedship_tobill_toorder_lines🔑order_id🔑product_idqtyorder_lines🔑order_id🔑product_idqtycustomer_idorder_lines🔑customer_id🔑order_no🔑+positionproduct_idqty100%PG [GB]100%Db2 [GB]100%100%100%100%-32%-23%+33%+3%customers🔑idnameemaildoborders🔑customer_id🔑placedorder_lines🔑customer_id🔑order_placed🔑product_idqty100%PG [GB]100%Db2 [GB]-23%-17%+47%+22%addresses🔑customer_id🔑+address_nonote...Shipping addressOrder addressescannot point ataddress ofdifferent customerThis integrity is optional:store full PK of addressin order to allow crosscustomer addressesSaves space:(1) shorter types (smallint?)(2) often spares an index
Example: Shop - Table Sizes, Including Indexes
customers🔑+idnameemaildoborders🔑idcustomer_idplacedorders🔑customer_id🔑+order_noplacedorders🔑customer_id🔑+order_noplacedship_tobill_toorder_lines🔑order_id🔑product_idqtyorder_lines🔑order_id🔑product_idqtycustomer_idorder_lines🔑customer_id🔑order_no🔑+positionproduct_idqty100%PG [GB]100%Db2 [GB]100%100%100%100%-32%-23%+33%+3%customers🔑idnameemaildoborders🔑customer_id🔑placedorder_lines🔑customer_id🔑order_placed🔑product_idqty100%PG [GB]100%Db2 [GB]-23%-17%+47%+22%addresses🔑customer_id🔑+address_nonote...Shipping addressOrder addressescannot point ataddress ofdifferent customerThis integrity is optional:store full PK of addressin order to allow crosscustomer addressesSaves space:(1) shorter types (smallint?)(2) often spares an index
Example: Shop - New Order
INSERT INTO orders     
       (id     , …)
VALUES (DEFAULT, …)
RETURNING id
INSERT INTO order_lines
       (order_id, …)
VALUES (?       , …)
PG [ms]100%Db2 [ms]100%
INSERT INTO orders
       (customer_id, placed, …)      
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placed
INSERT INTO order_lines
       (customer_id, order_placed, …)
VALUES (?          , ?           , …)
PG [ms]+93% Db2 [ms]+38% 
Example: Shop - Last Ordered
SELECT placed
  FROM order_lines ol
  JOIN orders o 
    ON o.id = ol.order_id
 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY placed DESC
 FETCH FIRST 1 ROW ONLY
PG [ms]100%Db2 [ms]100%
SELECT order_placed
  FROM order_lines ol
               
                         
 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY
PG [ms]-90%Db2 [ms]-74%
Example: Shop - Last Ordered - DBMS Optimizations
SELECT order_placed
  FROM order_lines ol
               
                         

 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY
SELECT placed
  FROM order_lines ol
  JOIN orders o 
    ON o.id = ol.order_id  

 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY placed DESC
 FETCH FIRST 1 ROW ONLY
SELECT order_placed
  FROM order_lines ol
               
                         

 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY
PG [ms]-90%Db2 [ms]-74%PG [ms]-90%Db2 [ms]-74%PG [ms]100%Db2 [ms]100%
Limit         -- PostgreSQL
-> Index Only Scan Backward
   using … on order_lines
Limit         -- PostgreSQL
-> Sort orders.placed DESC
   -> Nested Loop
      -> Index Only Scan o
      -> Index Only Scan ol
RETURN         -- Db2 (LUW)
 IXSCAN (REVERSE) ORDER_LI…
SELECT o.placed
  FROM order_lines ol
  JOIN orders o /* inner join ! */
    ON o.customer_id = ol.customer_id
   AND o.placed = ol.order_placed
 WHERE o.customer_id = ?
   AND product_id = ?
 ORDER BY o.placed DESC
 FETCH FIRST 1 ROW ONLY
PG [ms]-88%Db2 [ms]-70%
Limit                   -- PostgreSQL
-> Nested Loop
   -> Index Only Scan Backward ol
   -> Index Only Scan o
RETURN                   -- Db2 (LUW)
 IXSCAN (REVERSE) ORDER_LINES
Example: Shop - Customers who bought A, but since then not B
Primary key on orders: order_idPG Q1 [ms]100%PG Q2 [ms]+251% PG Q2 [ms]+74% Db2 Q1 [ms]100%Db2 Q2 [ms]1550% Db2 Q2 [ms]1276% Primary key on orders: customer_id, placedPG Q1 [ms]-97%PG Q2 [ms]-97%PG Q2 [ms]-98%Db2 Q1 [ms]-95%Db2 Q2 [ms]-96%Db2 Q2 [ms]-95%

Note: Best of Q1, Q2, Q3 depends of the popularity of the involved products!

Example: Shop - Customers who bought A, but since then not B

Q1

SELECT customers.*
  FROM (SELECT customer_id
          FROM order_lines
          JOIN orders ON orders.id = order_lines.order_id
         WHERE product_id IN (?, ?)
         GROUP BY customer_id
        HAVING           MAX(placed) FILTER (WHERE product_id = ?)
             > COALESCE( MAX(placed) FILTER (WHERE product_id = ?)
                       , DATE'1900-01-01')
       ) t
  JOIN customers ON customers.id = t.customer_id
Example: Shop - Customers who bought A, but since then not B

Q2

SELECT *
  FROM (SELECT customer_id
          FROM order_lines
          JOIN orders ON orders.id = order_lines.order_id
         WHERE product_id = ?
         GROUP BY customer_id
        HAVING NOT EXISTS (SELECT *
                             FROM order_lines ol
                             JOIN orders o ON o.id = ol.order_id
                            WHERE o.customer_id = orders.customer_id
                              AND o.placed  >= MAX(orders.placed)
                              AND product_id = ?
                          )
       ) t
  JOIN customers ON customers.id = t.customer_id
Example: Shop - Customers who bought A, but since then not B

Q3

WITH l AS (SELECT customer_id, MAX(placed) last_placed
             FROM orders
             JOIN order_lines ON order_id = orders.id
            WHERE product_id = ?
            GROUP BY customer_id )
SELECT *
  FROM customers
  JOIN l ON l.customer_id = customers.id
 WHERE NOT EXISTS (SELECT *
                     FROM orders
                     JOIN order_lines ON order_id = orders.id
                    WHERE orders.customer_id = customers.id
                      AND product_id = ?
                      AND placed >= l.last_placed )

Yet another primary key alternative

Example: Shop - Table Sizes, Including Indexes
customers🔑+idnameemaildoborders🔑idcustomer_idplacedorders🔑customer_id🔑+order_noplacedorders🔑customer_id🔑+order_noplacedship_tobill_toorder_lines🔑order_id🔑product_idqtyorder_lines🔑order_id🔑product_idqtycustomer_idorder_lines🔑customer_id🔑order_no🔑+positionproduct_idqty100%PG [GB]100%Db2 [GB]100%100%100%100%-32%-23%+33%+3%customers🔑idnameemaildoborders🔑customer_id🔑placedorder_lines🔑customer_id🔑order_placed🔑product_idqty100%PG [GB]100%Db2 [GB]-23%-17%+47%+22%addresses🔑customer_id🔑+address_nonote...Shipping addressOrder addressescannot point ataddress ofdifferent customerThis integrity is optional:store full PK of addressin order to allow crosscustomer addressesSaves space:(1) shorter types (smallint?)(2) often spares an index
Example: Shop - New Order
INSERT INTO orders
       (customer_id, placed, …)      
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placed
PG [ms]+93% Db2 [ms]+38% 
INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
     , (SELECT COALESCE(MAX(order_no),0) + 1
          FROM orders WHERE customer_id = t.customer_id)
     , CURRENT_TIMESTAMP
  FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_no
PG [ms]+29% Db2 [ms]+35% 
Example: Shop - Benchmark Overview

order_idSizePG [GB]100%Db2 [GB]100%New OrderPG [ms]100%Db2 [ms]100%Last OrderedPG [ms]100%Db2 [ms]100%A ⋠ B (best)PG [ms]100%Db2 [ms]100%

customer_id, placedSizePG [GB]+8.2% Db2 [GB]-1.1% New OrderPG [ms]+93% Db2 [ms]+38% Last OrderedPG [ms]-90%Db2 [ms]-74%A ⋠ B (best)PG [ms]-98%Db2 [ms]-96%

customer_id, order_noSizePG [GB]-2.7% Db2 [GB]-13% New OrderPG [ms]+29% Db2 [ms]+35% Last OrderedPG [ms]-86%Db2 [ms]-72%A ⋠ B (best)PG [GB]-98%Db2 [GB]-95%

The bottom line

Structured Primary Key = Foreign Key Columns + SubID
customers🔑+idnameemaildoborders🔑idcustomer_idplacedorders🔑customer_id🔑+order_noplacedorders🔑customer_id🔑+order_noplacedship_tobill_toorder_lines🔑order_id🔑product_idqtyorder_lines🔑order_id🔑product_idqtycustomer_idorder_lines🔑customer_id🔑order_no🔑+positionproduct_idqty

⇧ Primary key columns of the parent table

+ Columns needed to allow multiple child records I use the term “SubID” to describe the role of this column

I refer to primary keys designed in the way as Structured Primary Keys

Choosing a “SubID” - Immutable Semantics

Externally Defined Semantics…may change at any time ↳ Collations change ↳ Issuers might suddenly↳ re-use values↳ change values (ISO 3166)…can be surprisingly complex ↳ e-mail addresses ↳ Mixed case-sensitivity ↳↳ before/after @↳ Subaddressing: me+you@ customers.email is not a suitable PK component

Internally Defined Semantics …are also subject to change! ↳ Introducing product variants  such as size, color, … ↳ would affect order_lines ↳↳ PK if it is (…, product_id) order_lines.product_id is not a suitable PK component Bonus: “SubID” position can preserve the cart order

Choosing a “SubID” - Immutable Values

Immutable values are not strictly required… … but very desirable.

If the structured primary key design leads to the use of mutable columns, there is (probably) another design mistake in the schema. Example: Departments and employees The employees PK should *not* be (department_id, employee_no)Not every FK is a candidate In this example, departments and employees are actually both top-level tables. The are connected by a separate table to implement the (temporal) N:N connection.

Choosing a “SubID” - Space-efficiency (bytes)

“SubIDs” are inherited to all child tables and their rows Or the other way around:Each primary key is the set of its “SubID” and those of its “ancestor” tables Their memory footprint is more relevant than that of other columns

On the other hand, smaller types (e.g. INT) might suffice for “SubIDs”, whereas BIGINT would be required for a single-column primary key

Normalization requires candidate keys to be minimal in terms of columns. But this is in different context and does not contradict the above! In this context here, only the total memory demand matters.

Choosing a “SubID” - Space-efficiency (bytes)

UUIDs…are bulky ↳ 16 bytes (binary) ↳ 36 bytes (ASCII)1 UUID = 4 INTEGERs1UUID PG [MB]100%1UUID Db2 [MB]100%4INTs PG [MB]-0.1% 4INTs Db2 [MB]+0.0% Ecstatic about UUIDv7? You actually always wanted a timestamps in your keys!

Choosing a “SubID” - Space-efficiency (bytes)

Timestamps…are bulky ↳ 5-11 bytes, w/o TZ, varies by engine & depends on #fractions ↳ Awkward for humans to handle: Just imagine on the phone… orders.placed is very questionable as PK component

Character Strings…are often bulky…and complex

I think always opting for a numeric, incrementing “SubID” is the best practice

Side Matter: Idempotency
INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
     , (SELECT COALESCE(MAX(order_no),0) + 1
          FROM orders WHERE customer_id = t.customer_id)
     , CURRENT_TIMESTAMP
  FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_no

Concurrent activity, for the same customer_id, causes primary key violations

An automatic, gentle and limited re-try loop is required (generally required to cope with concurrency induced errors like deadlocks)Re-try loops require idempotency

Side Matter: Idempotency
CREATE TABLE orders (
  customer_id  INTEGER      NOT NULL,
  FOREIGN KEY (customer_id) REFERENCES customers (id),
  order_no     INTEGER      NOT NULL, -- SubId
  PRIMARY KEY (customer_id, order_no)

  placed       TIMESTAMP    NOT NULL,
  /* more attributes */

  idem_key     UUID         NOT NULL UNIQUE
)

A separate UNIQUE column for idempotency is required As it is not inherited to other tables, size is less relevant

It can be used in interfaces: does’t leak information, is not guessable

Cyclic Integrity

Cyclic Integrity
customers🔑+idnameemaildoborders🔑idcustomer_idplacedorders🔑customer_id🔑+order_noplacedorders🔑customer_id🔑+order_noplacedship_tobill_toorder_lines🔑order_id🔑product_idqtyorder_lines🔑order_id🔑product_idqtycustomer_idorder_lines🔑customer_id🔑order_no🔑+positionproduct_idqty100%PG [GB]100%Db2 [GB]100%100%100%100%-32%-23%+33%+3%customers🔑idnameemaildoborders🔑customer_id🔑placedorder_lines🔑customer_id🔑order_placed🔑product_idqty100%PG [GB]100%Db2 [GB]-23%-17%+47%+22%addresses🔑customer_id🔑+address_nonote...Shipping addressOrder addressescannot point ataddress ofdifferent customerThis integrity is optional:store full PK of addressin order to allow crosscustomer addressesSaves space:(1) shorter types (smallint?)(2) often spares an index

Unstructured PKs separate two sides from each other

orders🔑 id   customer_id   placed

Joining through such tables needs a row-by-row re-mapping (idcustomer_id) to reach the opposite side

This is denormalizationorder_lines🔑 order_id🔑 position customer_id ...

This is a structured PKorder_lines🔑 customer_id🔑 order_no🔑 position ...