SQL

Write an awesome doc for SQL. A very nice and practical one, extracted from SQL official documentation.

View on GitHub

Pagination

[!TIP]

To have a more readable query you can also try to use CTEs. But here we just used subqueries whenever possible.

OFFSET implementation

Breaking the problem into two half

  1. How can I do it in two separate queries (super simple).
  2. How can I combine them into one query, I do not like the sound of a two I/O whereas I can do it in one.

Separate queries

Smashing and combining all of these

SELECT *, (total / 10::double precision)::int AS "totalPage"
FROM (
  SELECT
    (SELECT COUNT(id)
     FROM public.news_articles
    ) AS "total",
    (SELECT JSON_AGG(TO_JSONB(filtered_news_articles))
     FROM (SELECT *
           FROM public.news_articles
           WHERE title LIKE '%something%'
           ORDER BY created_at ASC
           OFFSET 0
           LIMIT 10
          ) as "filtered_news_articles"
    ) AS "data"
);

This query will return something like this if you wanted to see it in plain JSON:

{
  "total": 50,
  "data": [
    {
      "id": "9b050c4f-e0dc-4c19-9e02-844957a67522",
      "title": "A title with something inside it!"
      // ...
    },
    {
      "id": "b5c5c3c9-75c9-4495-908f-47e42abc92a9",
      "title": "Is something ready?"
      // ...
    },
    // ...
  ],
  "totalPage": 5
}

[!IMPORTANT]

  • In a real world app we usually tend to use dynamic values for limit & offset. That’s why I used a cast operator to convert limit into double precision. Otherwise it would performed an integer operation and that could lead to not seeing last page’s data.
  • Here we are converting the totalPage back to int again after it is calculated.
  • Calculating everything in SQL can become cumbersome if you over do it. Just look at how much harder it is to read it just because we wanted to have the totalPage calculate inside SQL. But instead we could do it in our codebase.

Why you should not probably use OFFSET for pagination

  1. The database fetches rows from the disk. Then it drop part of it and return the rest.
  2. OFFSET takes only one single parameter, thus big OFFSETs impose a lot of work on the database.
  3. If a new rows were to be inserted in between two separate request you might get duplicate data.

    OFFSET showing duplicate info if user inserts a record

    [!CAUTION]

    The part that is also contributing in this behavior is where we are sorting based on age since kasir in this case might end up higher in the fetched records, thus leading to this buggy behavior.

    But here we are sorting news based on created_at. So this situation won’t happen.

The root cause of this issue lies in the fact that we are only telling database how many records should be dropped. No more. And, well this is not very much context.

Thus after seeing how we can implement it with OFFSET we’re gonna discuss a better option.

Benchmark

OFFSET 0

Not indexed order by with 0 as offset

Here we’re:

  1. Bitmap index/heap scan: database engine scans the rows using a bitmap index scan based on the WHERE clause (what we have now is an unsorted result).
  2. Sorting results by:

    • published_at in descending order (those published will be returned first)
    • Then by id in descending order if there are ties.

    “top-N heapsort” is used since we only want a couple of sorted rows (learn more about it here).

    [!NOTE]

    BTW when I tried this query in my local system it was using quicksort algorithm and not top-N, :confused:.

    Learn more about this topic here.

  3. We are returning only the amount specified in the LIMIT clause.

OFFSET 10

Not indexed order by with 10 as offset

OFFSET 40

Not indexed order by with 40 as offset

linear relationship between dataset and response time

As you can see the bigger the dataset we have after WHERE clause the more time it takes.

Conclusion for these different OFFSETs

Pay attention to the shaded area between the two lines to see the differences between the two series.

Fetching last page take a lot more time

divider

WHERE Implementation – Cursor-Based Pagination – Seek Method

Jargons:

First variation

In this variation we do not construct the next page in our query/backend. So client has to make a request to the backend, and if we returned an empty array then our frontend realizes that that we were at the last page. This might be good in one scenario, when we are sure until user reaches the end of the current page we’re gonna have a new page for sure.

Here we are first starting without indexing published_at field.

SELECT * FROM public.news_articles
WHERE
  topic = 'huawei'
	AND (
    published_at,
    id
  ) < (
    previous_published_at,
    previous_id
  )
ORDER BY published_at DESC, id DESC
LIMIT 10;

[!CAUTION]

Here we’ve assumed that:

  • id is auto increment (an integer). If you’re like me (use UUID everywhere as your ID) it is not gonna pose any form of issue AMAIK.
  • A definite sort order is mandatory.
  • The second part of the WHERE clause is a tuple comparison and it MUST be implemented that way. If you declare them as two filters, implemented independently then it would result in most rows failing to show up at all. You need to implement the id filter only within ties of the created_at filter.

Seek method implementation visualization

Since we are no longer fetching lots of data just to sort (ORDER BY) and then skip (OFFSET and LIMIT) we have:

Line chart on how performant is seek method

[!TIP]

If we index the published_at too then we are gonna have a much lower response time.

Second variation

Here we compute and render the next page’s query before hand and send it back to our client. Something like this:

SELECT
  (SELECT count.count
   FROM (SELECT COUNT(id), created_at
	       FROM public.messages
	       WHERE sender_id = '' AND receiver_id = ''
	       GROUP BY created_at, id
	       ORDER BY created_at ASC LIMIT 10
        ) as count
  ),
  (SELECT JSON_AGG(TO_JSONB(messages))
   FROM (SELECT *
      	 FROM public.messages
      	 WHERE sender_id = '' AND receiver_id = ''
         ORDER BY created_at ASC
         LIMIT 10
        ) as messages
  ) as data
;

![CAUTION]

This query might return null as value for both columns. Thus you need to handle it gently. IDK, define default values or whatever suits your situation best.

divider

Seek method VS OFFSET

OFFSET vs seek method

There are a couple of limitations that seek method has:

References

Learn more