Slides

What is Normalization?

Database normalization is the process of structuring a relational database in accordance with a series of normal forms…

The normal forms (from least normalized to most normalized) are:

1NF	First normal form
2NF	Second normal form
3NF	Third normal form
EKNF	Elementary key normal form
BCNF	Boyce–Codd normal form
4NF	Fourth normal form
ETNF	Essential tuple normal form
5NF	Fifth normal form
DKNF	Domain-key normal form
6NF	Sixth normal form

—https://en.wikipedia.org/wiki/Database_normalization

What is Normalization Good For?

…to reduce data redundancy and improve data integrity.—https://en.wikipedia.org/wiki/Database_normalization

✅ 1. Redundancy and integrity ✅ 2. Increase the lifespan of applications ❓ 3. More informative to users ✅ 4. Neutral to query statistics

—Further Normalisation of the Data Base Relational Model E.F. Codd (1971)

Normalization

Normalize until it hurts, denormalize until it works.—Ancient proverb

Can we beat that?

Example: Shop

CREATE TABLE customers (
   name   VARCHAR  NOT NULL,
   email  VARCHAR  NOT NULL,
   dob    DATE, -- GDPR?
   /* more attributes */
   PRIMARY KEY (     ?     )
   id     INTEGER  GENERATED ALWAYS AS IDENTITY,
   PRIMARY KEY (id)
)

None of these columns is a good candidate key. While there is a functional dependency email ⇒ name, dob, ... the semantics of e-mail addresses are a show-stopper.

Further, there are more desirable properties of primary keys: • Space-efficiency • Immutability

In absences of a suitable candidate key, we have to make one Whether this is a number, UUID or anything else is a “detail” covered later

Also covered later: Should the DBMS or the App assign it?

Example: Shop

CREATE TABLE orders (
   customer_id  INTEGER      NOT NULL,
   FOREIGN KEY (customer_id) REFERENCES customers (id),

   placed       TIMESTAMP    NOT NULL,
   /* more attributes */

   PRIMARY KEY (     ?     )
)

Candidate Keys? Maybe (customer_id, placed)?

Surrogate Key? A new “id” column could be used like before

Example: Shop

CREATE TABLE order_lines (
   /*  PK of orders  */  NOT NULL,
   FOREIGN KEY (...) REFERENCES orders (...),

   product_id  INTEGER   NOT NULL,
   qty         INTEGER   NOT NULL  CHECK (qty > 0),
   /* more attributes */

   PRIMARY KEY (   )
)

Candidate Keys? Maybe (“PK of orders”, product_id)?

Dedicated Column? A new “id” column like before?

Example: Shop - Primary & Foreign Keys

Example: Shop - Table Sizes, Including Indexes

Example: Shop - New Order

INSERT INTO orders     
       (id     , …)
VALUES (DEFAULT, …)
RETURNING id

INSERT INTO order_lines
       (order_id, …)
VALUES (?       , …)

INSERT INTO orders
       (customer_id, placed, …)      
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placed

INSERT INTO order_lines
       (customer_id, order_placed, …)
VALUES (?          , ?           , …)

Example: Shop - Last Ordered

SELECT placed
  FROM order_lines ol
  JOIN orders o 
    ON o.id = ol.order_id
 WHERE o.customer_id = ?
   AND ol.product_id = ?
 ORDER BY placed DESC
 FETCH FIRST 1 ROW ONLY

SELECT order_placed
  FROM order_lines ol
               
                         
 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY

Example: Shop - Last Ordered - DBMS Optimizations

SELECT order_placed
  FROM order_lines ol
               
                         

 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY

SELECT placed
  FROM order_lines ol
  JOIN orders o 
    ON o.id = ol.order_id  

 WHERE o.customer_id = ?
   AND ol.product_id = ?
 ORDER BY placed DESC
 FETCH FIRST 1 ROW ONLY

SELECT order_placed
  FROM order_lines ol
               
                         

 WHERE customer_id = ?
   AND product_id = ?
 ORDER BY order_placed DESC
 FETCH FIRST 1 ROW ONLY

Limit         -- PostgreSQL
-> Index Only Scan Backward
   using … on order_lines

Limit         -- PostgreSQL
-> Sort orders.placed DESC
   -> Nested Loop
      -> Index Only Scan o
      -> Index Only Scan ol

RETURN         -- Db2 (LUW)
 IXSCAN (REVERSE) ORDER_LI…

SELECT o.placed
  FROM order_lines ol
  JOIN orders o /* inner join ! */
    ON o.customer_id = ol.customer_id
   AND o.placed = ol.order_placed
 WHERE o.customer_id = ?
   AND ol.product_id = ?
 ORDER BY o.placed DESC
 FETCH FIRST 1 ROW ONLY

Limit                   -- PostgreSQL
-> Nested Loop
   -> Index Only Scan Backward ol
   -> Index Only Scan o

RETURN                   -- Db2 (LUW)
 IXSCAN (REVERSE) ORDER_LINES

Example: Shop - Customers who bought A, but since then not B

Primary key on orders: order_idPrimary key on orders: customer_id, placed

Note: Best of Q1, Q2, Q3 depends of the popularity of the involved products!

Example: Shop - Customers who bought A, but since then not B

SELECT customers.*
  FROM (SELECT customer_id
          FROM order_lines
          JOIN orders ON orders.id = order_lines.order_id
         WHERE product_id IN (?, ?)
         GROUP BY customer_id
        HAVING           MAX(placed) FILTER (WHERE product_id = ?)
             > COALESCE( MAX(placed) FILTER (WHERE product_id = ?)
                       , DATE'1900-01-01')
       ) t
  JOIN customers ON customers.id = t.customer_id

Example: Shop - Customers who bought A, but since then not B

SELECT *
  FROM (SELECT customer_id
          FROM order_lines
          JOIN orders ON orders.id = order_lines.order_id
         WHERE product_id = ?
         GROUP BY customer_id
        HAVING NOT EXISTS (SELECT *
                             FROM order_lines ol
                             JOIN orders o ON o.id = ol.order_id
                            WHERE o.customer_id = orders.customer_id
                              AND o.placed  >= MAX(orders.placed)
                              AND product_id = ?
                          )
       ) t
  JOIN customers ON customers.id = t.customer_id

Example: Shop - Customers who bought A, but since then not B

WITH l AS (SELECT customer_id, MAX(placed) last_placed
             FROM orders
             JOIN order_lines ON order_id = orders.id
            WHERE product_id = ?
            GROUP BY customer_id )
SELECT *
  FROM customers
  JOIN l ON l.customer_id = customers.id
 WHERE NOT EXISTS (SELECT *
                     FROM orders
                     JOIN order_lines ON order_id = orders.id
                    WHERE orders.customer_id = customers.id
                      AND product_id = ?
                      AND placed >= l.last_placed )

Yet another primary key alternative

Example: Shop - Table Sizes, Including Indexes

Example: Shop - New Order

INSERT INTO orders
       (customer_id, placed, …)      
VALUES (?, CURRENT_TIMESTAMP,…)
RETURNING customer_id, placed

INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
     , (SELECT COALESCE(MAX(order_no),0) + 1
          FROM orders WHERE customer_id = t.customer_id)
     , CURRENT_TIMESTAMP
  FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_no

Example: Shop - Benchmark Overview

order_idSizeNew OrderLast OrderedA ⋠ B (best)

customer_id, placedSizeNew OrderLast OrderedA ⋠ B (best)

customer_id, order_noSizeNew OrderLast OrderedA ⋠ B (best)

Side Matter: Idempotency

INSERT INTO orders (customer_id, order_no, placed)
SELECT customer_id
     , (SELECT COALESCE(MAX(order_no),0) + 1
          FROM orders WHERE customer_id = t.customer_id)
     , CURRENT_TIMESTAMP
  FROM (VALUES (?)) t(customer_id)
RETURNING customer_id, order_no

Concurrent activity, for the same customer_id, causes primary key violations

An automatic, gentle and limited re-try loop is required (generally required to cope with concurrency induced errors like deadlocks)Re-try loops require idempotency

Side Matter: Idempotency

CREATE TABLE orders (
  customer_id  INTEGER      NOT NULL,
  FOREIGN KEY (customer_id) REFERENCES customers (id),
  order_no     INTEGER      NOT NULL, -- PlusColumn
  PRIMARY KEY (customer_id, order_no)

  placed       TIMESTAMP    NOT NULL,
  /* more attributes */

  idem_key     UUID         NOT NULL UNIQUE
)

A separate UNIQUE column for idempotency is required As it is not inherited to other tables, size is less relevant

It can be used in interfaces: does’t leak information, is not guessable

Structured Primary Key = Foreign Key Columns + Plus-Columns

⇧ Primary key columns of the parent table

+ Columns needed to allow multiple child records I use the term “Plus-Column” to describe the role of this column

I refer to primary keys designed in the way as Structured Primary Keys

Choosing a “Plus Column” - Immutable Semantics

Externally Defined Semantics…may change at any time ↳ Collations change ↳ Issuers might suddenly ↳↳ re-use values ↳↳ change values (ISO 3166)…can be surprisingly complex ↳ e-mail addresses ↳↳ Mixed case-sensitivity ↳↳ before/after @ ↳↳ Subaddressing: me+you@ ↳ customers.email is not ↳ a suitable PK component

Internally Defined Semantics …are also subject to change! ↳ Introducing product variants ↳ such as size, color, … ↳↳ would affect order_lines ↳↳ PK if it is (…, product_id) ↳ order_lines.product_id is ↳ not a suitable PK component ↳ Bonus: Plus Column position ↳ can preserve the cart order

Choosing a “Plus Column” - Immutable Values

Immutable values are not strictly required… … but very desirable.

If the structured primary key design leads to the use of mutable columns, there is (probably) another design mistake in the schema. Example: Departments and employees The employees PK should *not* be (department_id, employee_no)Not every FK is a candidate In this example, departments and employees are actually both top-level tables. The are connected by a separate table to implement the (temporal) N:N connection.

Choosing a “Plus Column” - Space-efficiency (bytes)

“Plus Columns” are inherited to all child tables and their rows Or the other way around:Each primary key is the set of its “Plus Columns” and those of its “ancestor” tables Their memory footprint is more relevant than that of other columns

On the other hand, smaller types might suffice for “Plus Columns”, whereas BIGINT would be required for a single-column primary key

Normalization requires candidate keys to be minimal in terms of columns. But this is in different context and does not contradict the above! In this context here, only the total memory demand matters.

Choosing a “Plus Columns” - Space-efficiency (bytes)

UUIDs…are bulky ↳ 16 bytes (binary) ↳ 36 bytes (ASCII)1 UUID = 4 INTEGERsEcstatic about UUIDv7? You actually always wanted a timestamps in your keys!

Choosing a “Plus Columns” - Space-efficiency (bytes)

Timestamps…are bulky ↳ 5-11 bytes, w/o TZ, varies by engine & depends on #fractions ↳ Awkward for humans to handle: Just imagine on the phone… ↳ orders.placed is very questionable as PK component

Character Strings…are often bulky…and complex

I think always opting for a numeric, incrementing “Plus Column” is the best practice

Unstructured PKs separate two sides from each other

Joining through such tables needs a row-by-row re-mapping (id⇔customer_id) to reach the opposite side