08. Lateral Joins in Postgres
March 16, 2020 / 2 minute read

Postgres is my favourite piece of tech.

I've been spending as much time as I can with it recently. Your database goes far beyond a storage engine. It's an API for your business logic and for manipulating data for your application to consume.

When I first started with Postgres, I used only the basic features - INSERT INTO and SELECT * FROM x WHERE ... this was the absolute limit of it. There's a whole host of uncommon functionality available. This is functionality that is tested, that's fast, and already available for your application to use.

The Postgres documentation can be a little wordy and technical at times. Given the subject matter, this can absolutely be forgiven. Nevertheless, it can make it hard to break in and get the most out of the database.

I want to spend some time explaining lateral joins. The wording on the website is not very clear, so here's some code we've all probably built.

In web analytics, you can do a "time on page" calculation. From a purely technical perspective, here's how this looks:

  1. An impression comes in at time A.

  2. A second impression comes in at time B.

  3. The time on page value for impression A is B.created - A.created.

Given a database table looking like:

CREATE TABLE impressions (user_id int, created timestamptz);

If you wanted to look at each of the "time on page" values in a dataset, you might find yourself writing something like below.

Pseudo-code to retrieve time on page for each impression might look like:

impressions = db.run("SELECT user_id, created FROM impressions")

for impression in impressions:

previous_impression = db.run("SELECT user_id, created FROM impressions WHERE user_id = {impression.id} and created < {impression.created}")

# assuming we get a match

time_on_page = impression.created - previous_impression.created


This is NOT good. If you have 1000 impressions, it means 1000 + 1 trips to the database. Every trip comes with network latency and execution overhead that slowly poisons your response time. At hobby-scale this will not be super noticeable, but anything beyond that, this is a HUGE code smell. In technical terms, this will run in O(N) time. For any non-static application, this is not acceptable.

So, lateral joins. They exist to simulate the for impression in impressions: ... block of code. You can execute this logic without needing to leave the database server, and with all the performance that Postgres provides.

Here's how you could write the item in pure SQL:

    latest.user_id as user_id,
    latest.created as created,
    previous.created AS previous_impression_at

-- We alias the outer "impressions" table as "latest"
FROM impressions latest LEFT JOIN LATERAL (
    SELECT created
    FROM impressions previous
    -- This WHERE clause is the key. This will be called for each
    -- row returned from the outer query, and then made available
    -- here as latest.*
    WHERE previous.user_id = latest.user_id AND previous.created < latest.created
    ORDER BY created DESC
    LIMIT 1
) previous ON TRUE;

This will spit back all rows with headings:

user_id, created,   previous_impression_at

0      , 2019-04.., 2019-01..

This is hugely powerful, and something worth leveraging to make your workflow simpler, and to help push data manipulation logic down to the database.

Here's the official PG docs. Search for "LATERAL SUBQUERIES".

And another article with a little more meat if you're interested