Managing Complex Data with JSON in PostgreSQL, Part 2

In Part 1 we explained why you absolutely need JSONB columns in your PostgreSQL tables, and how they simplify the expression of complex data, including complex references and relationships.

In this article, we’ll dive deeper into ways to use tables with JSONB columns, discussing constraints and triggers to guarantee data integrity, and showing how to get full JSON expressivity while keeping most of the performance.


Views and JSON

Views are essentially virtual tables that are populated dynamically from an expression given at view-creation time. They are typically read-only and roughly as performant as running the underlying expression directly. There are minor advantages in compiled-expression optimization, but mostly the value of views lies in simplifying the syntax of a hairy expression, particularly when it’s used in a join.

Views work with JSON columns, just like any other columns, and can be used to simplify queries involving JSON. This means that you could store your data in JSON format and make one or more views that break the JSON data out into columnar format. Alternatively, you could store columnar data (possibly across many tables) and create views that express the data in JSON format.

Here we’ll make a view that shows all enemies list changes in the characteristics column of the people table, according to the change_log table (note that the referenced tables are defined in Part 1):
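A sketch of such a view. Part 1’s exact change_log shape isn’t reproduced here, so this assumes it stores a person_id, a timestamp, and before/after jsonb snapshots of the characteristics column (all column names here are assumptions):

```sql
-- Sketch; assumes change_log(person_id uuid, changed_at timestamptz,
-- old_value jsonb, new_value jsonb) holding before/after snapshots of
-- people.characteristics.
CREATE VIEW enemies_changes AS
SELECT cl.person_id,
       cl.changed_at,
       cl.old_value->'enemies' AS old_enemies,
       cl.new_value->'enemies' AS new_enemies
FROM change_log AS cl
WHERE cl.old_value->'enemies' IS DISTINCT FROM cl.new_value->'enemies';
```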

Another option is to make a view that expresses the people table with the enemies list as an association table:
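A sketch of that view, assuming each entry in characteristics->'enemies' is an object shaped like {"id": <uuid>, "reason": <text>}:

```sql
-- Rows without an 'enemies' key simply produce no output rows, since
-- jsonb_array_elements(NULL) yields nothing.
CREATE VIEW enemy_association AS
SELECT p.id                 AS person_id,
       (enemy->>'id')::uuid AS enemy_id,
       enemy->>'reason'     AS reason
FROM people AS p,
     jsonb_array_elements(p.characteristics->'enemies') AS enemy;
```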

Now the enemy_association view can be used in joins without directly referring to the JSON content.

Although using this view is syntactically simpler than directly joining into the JSON, it will not be as performant or robust as an actual table that holds this information. Views are not indexed, and JSON column indexes are not quite as good as indexed scalar-valued columns. Further, with scalar-valued columns, foreign-key constraints can be used to ensure data integrity across data changes.

Note that Postgres does have the ability to update and insert through views, but only in very strict circumstances that generally rule out manipulation of data inside JSON columns. However, we can provide that functionality with INSTEAD OF triggers.

Trigger Functions to Get Performance and Expressivity

Careful application of indexing will improve the performance of tables that contain JSON columns (and the views that refer to them). That may be good enough for your application, depending on your usage patterns.

If you need better performance without changing the tables there are two main options:

  • Materialized Views – while still views populated by a given expression, these are “full tables” that use storage space and can be indexed. However, you can’t manipulate them directly. Instead, they are updated with the REFRESH MATERIALIZED VIEW command, which can be quite expensive, time- and resource-wise. If the view is only used for occasional reporting, or the source tables rarely change, then this may be the ideal option, as there’s no overhead per insert or update to the backing table.
  • Trigger-maintained tables – where triggers are used on tables or views to supplement or convert inserts and updates to changes to one or more other tables. Let’s explore this more.
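For comparison, the materialized-view route could look like the following sketch, assuming the same enemies entries of the form {"id": <uuid>, "reason": <text>} (names here are illustrative):

```sql
-- Materialized-view sketch: populated at creation, indexable, but only
-- updated by an explicit REFRESH.
CREATE MATERIALIZED VIEW enemy_association_mv AS
SELECT p.id                 AS person_id,
       (enemy->>'id')::uuid AS enemy_id,
       enemy->>'reason'     AS reason
FROM people AS p,
     jsonb_array_elements(p.characteristics->'enemies') AS enemy;

-- Unlike a plain view, this can be indexed directly.
CREATE INDEX ON enemy_association_mv (enemy_id);

-- Re-run whenever the source data has changed enough to matter:
REFRESH MATERIALIZED VIEW enemy_association_mv;
```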

Suppose we determined that our enemy_association view was not fast enough for our project, but we like how we’re manipulating the data in the people table with JSON and don’t want to change it. To provide this mix of performance and flexibility, we will instead create an enemy_association table and put a trigger function on the people table to maintain it. To do this, we will:

  1. Create the table.
  2. Make a trigger function that handles DELETE, UPDATE, and INSERT (see the docs for more information).
  3. Apply the trigger to execute BEFORE changes to the people table:
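A sketch of those three steps. Everything here leans on assumptions about Part 1’s schema: people(id uuid primary key, name text, characteristics jsonb), with each characteristics->'enemies' entry shaped like {"id": <uuid>, "reason": <text>}; table, function, and trigger names are illustrative:

```sql
-- 1. The table maintained by the trigger.
CREATE TABLE enemy_association (
    person_id uuid NOT NULL,
    enemy_id  uuid NOT NULL,
    reason    text,
    PRIMARY KEY (person_id, enemy_id)
);

-- 2. The trigger function: resync this person's rows on every change.
CREATE FUNCTION maintain_enemy_association() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        DELETE FROM enemy_association WHERE person_id = OLD.id;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE')
       AND NEW.characteristics ? 'enemies' THEN
        INSERT INTO enemy_association (person_id, enemy_id, reason)
        SELECT NEW.id, (enemy->>'id')::uuid, enemy->>'reason'
        FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy;
    END IF;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- 3. The trigger itself, firing BEFORE each row change on people.
CREATE TRIGGER people_enemy_association
    BEFORE INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();
```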

Note the use of the ? operator (as in NEW.characteristics ? 'enemies') to see if a JSON object has the enemies key in it. For completeness, let’s test it:
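A test insert along these lines (the name and ids are illustrative; look closely at the shape of enemies):

```sql
INSERT INTO people (id, name, characteristics)
VALUES ('ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d',
        'Newman',
        '{"enemies": {"id": "9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4",
                      "reason": "Occupational Rivalry"}}');
```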

Did you spot the mistake? Here’s the error: cannot extract elements from an object.

That’s not a very clear error, but the trigger function did prevent us from making a data-structure mistake: enemies is supposed to be an array! And because we created a BEFORE trigger, the error happened before the data was placed into either table, so nothing was committed.

Trying again, with a fixed query:
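The same insert, with enemies now properly an array (the name is again illustrative):

```sql
INSERT INTO people (id, name, characteristics)
VALUES ('ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d',
        'Newman',
        '{"enemies": [{"id": "9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4",
                       "reason": "Occupational Rivalry"}]}');
```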

That worked. Both people and enemy_association have been updated as expected. Here’s the enemy_association entry:

ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d | 9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4 | Occupational Rivalry

Let’s test an update. First, empty the people table then run the insert from Part 1 (here) on the empty people table, then run this query to update George’s enemy list to include Kramer as a “Frenemy”:
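A sketch of that update, looking both people up by name rather than hard-coding Part 1’s ids:

```sql
-- Append Kramer to George's enemies array (creating the array if it's
-- missing); '||' on an array and an object appends the object.
UPDATE people
SET characteristics = jsonb_set(
        characteristics,
        '{enemies}',
        coalesce(characteristics->'enemies', '[]'::jsonb)
          || jsonb_build_object(
                 'id',     (SELECT id FROM people WHERE name = 'Kramer'),
                 'reason', 'Frenemy'))
WHERE name = 'George';
```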


Trigger Functions for Data-Shape Checking

The trigger function provides some minimal data-structure safety, mostly by accident, due to how it processes the data structure. However, we can take that further and use the trigger function to explicitly ensure data-structure consistency. To start off, we can add a quick type check on NEW.characteristics->'enemies' before both the UPDATE and INSERT:
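For example, a check like this (using jsonb_typeof()) can sit at the top of the trigger function, before the branches that read the array:

```sql
-- Fragment for the trigger function body: reject a non-array 'enemies'.
-- A missing 'enemies' key is deliberately allowed.
IF NEW.characteristics ? 'enemies'
   AND jsonb_typeof(NEW.characteristics->'enemies') <> 'array' THEN
    RAISE EXCEPTION 'characteristics->enemies must be an array, not %',
        jsonb_typeof(NEW.characteristics->'enemies');
END IF;
```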

And now an entry will fail with a nice clear error message if enemies isn’t an array, unlike that cryptic error we got before. Note that it does not throw an exception if enemies is missing, as we’re considering that acceptable.

Foreign-Key Constraints

We should ensure that the referenced ids exist in the table. We can do that in at least two ways:

  1. In the trigger function, as “virtual constraints.” Since we’re already creating the trigger function, this is relatively easy.
  2. With standard foreign key constraints. This is also an option since we are creating an output table that has a UUID column that directly references a column in another table.

Since standard foreign-key constraints aren’t always an option, I’ll demonstrate what a quick check would look like if we were to put it in the function:
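A sketch of such a virtual constraint, again assuming enemies entries carry an "id" key:

```sql
-- Fragment for the trigger function body: every referenced id must
-- already exist in people.
IF EXISTS (
    SELECT 1
    FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy
    WHERE NOT EXISTS (
        SELECT 1 FROM people WHERE id = (enemy->>'id')::uuid)
) THEN
    RAISE EXCEPTION 'characteristics->enemies references an id not in people';
END IF;
```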

Note that the virtual constraint is not deferrable like proper foreign-key constraints are. We could use parameters and such to provide similar functionality, but that’s beyond the scope of this post.

Next we’ll demonstrate a real and deferrable foreign-key constraint:
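Assuming the enemy_association table sketched earlier, with enemy_id referencing people (the constraint name is arbitrary):

```sql
ALTER TABLE enemy_association
    ADD CONSTRAINT enemy_association_enemy_id_fkey
    FOREIGN KEY (enemy_id) REFERENCES people (id)
    ON DELETE CASCADE
    DEFERRABLE;
```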

Since we added DEFERRABLE we can now use SET CONSTRAINTS ALL DEFERRED (docs) in a transaction to allow insertion of data without checking the constraint until commit time.

In-JSON Referential Integrity

With the foreign key constraint set to ON DELETE CASCADE we can end up with invalid references from the JSON. The foreign-key constraint is on enemy_association and not on people. This means that if you DELETE from people then any references to that person will be removed from enemy_association but the characteristics->'enemies' entry that made those enemy_association rows, if not updated, will now be invalid.

Since we are already triggering on the DELETE of a people row, we can take that time to adjust any referencing characteristics->'enemies' while we are there.

We will need to update the trigger, but first we need to know how to remove an entry from the JSON array nested in the JSON structure. Here’s a reasonable first test using a SELECT to return the updated structure we want, with a few example ids hard-coded:
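A sketch of that first attempt, removing enemy 9eda3717… from the list belonging to person ac2b65eb… (the ids from the association row above):

```sql
SELECT people.characteristics
         || jsonb_build_object('enemies', jsonb_agg(enemy)) AS new_characteristics
FROM people,
     jsonb_array_elements(people.characteristics->'enemies') AS enemy
WHERE people.id = 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d'
  AND enemy->>'id' <> '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4';
```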

Sadly, that results in an error: column "people.characteristics" must appear in the GROUP BY clause or be used in an aggregate function

That error is coming from the characteristics || jsonb_build_object('enemies', jsonb_agg(enemy)) part of the query. Since jsonb_agg() aggregates multiple rows (in the same way max() and count() do), any column that doesn’t participate directly in that aggregation must appear in a GROUP BY clause, which tells PostgreSQL which rows to collapse together when computing the aggregate.

But we don’t have a GROUP BY clause and we don’t really need one if we use the implicit grouping of the parent row being made into multiple rows with the jsonb_array_elements() function.

This is solved by isolating the jsonb_agg()/jsonb_array_elements() into its own sub-select. (There are other ways, of course, but this one is fairly clean and easy to remember.) Try two:
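A sketch of that second attempt, with the aggregation moved into a sub-select:

```sql
SELECT people.characteristics
         || jsonb_build_object('enemies', enemies.new_list) AS new_characteristics
FROM people,
     (SELECT jsonb_agg(enemy) AS new_list
        FROM jsonb_array_elements(people.characteristics->'enemies') AS enemy
       WHERE enemy->>'id' <> '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4') AS enemies
WHERE people.id = 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d';
```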

However, that also results in an error: invalid reference to FROM-clause entry for table "people". That means, specifically, that we cannot refer to people.characteristics from inside the sub-select when people was defined outside of the sub-select. This is what the LATERAL keyword is for. From the documentation:

Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.)

Try three, with LATERAL:
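The same query with the LATERAL keyword added, so the sub-select may reference people.characteristics:

```sql
SELECT people.characteristics
         || jsonb_build_object('enemies', enemies.new_list) AS new_characteristics
FROM people,
     LATERAL (SELECT jsonb_agg(enemy) AS new_list
                FROM jsonb_array_elements(people.characteristics->'enemies') AS enemy
               WHERE enemy->>'id' <> '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4') AS enemies
WHERE people.id = 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d';
```

One detail worth noting for real use: jsonb_agg() returns NULL rather than '[]' when every element is filtered out, so a coalesce is prudent.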

It works! Now to make that an UPDATE statement and put it in the trigger function, except (and this one is important) we cannot change a column of a row that is also deleted in the same command.

Said another way: If we DELETE with a WHERE that would delete two or more people that cross-reference (are enemies of each other) it will throw an error when it tries to delete the updated row. This is particularly tricky since the trigger is called per-row, with no visibility into what other rows would be deleted.

The solution, thankfully, is fairly simple: move the UPDATE statement to an AFTER DELETE trigger instead of a BEFORE DELETE trigger. This allows the deletions of full rows to complete, after which the UPDATE can update the remaining rows.

Bringing it all together

Here is the final trigger function:
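A sketch of what that final function could look like, folding together the association-table sync, the array type check, and the AFTER DELETE scrub (the LATERAL query is recast as a correlated scalar subquery inside the UPDATE); all names and the {"id": ..., "reason": ...} shape of enemies entries are assumptions:

```sql
CREATE OR REPLACE FUNCTION maintain_enemy_association() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') AND TG_WHEN = 'BEFORE' THEN
        -- Data-shape check: 'enemies', when present, must be an array.
        IF NEW.characteristics ? 'enemies'
           AND jsonb_typeof(NEW.characteristics->'enemies') <> 'array' THEN
            RAISE EXCEPTION 'characteristics->enemies must be an array, not %',
                jsonb_typeof(NEW.characteristics->'enemies');
        END IF;

        -- Resync this person's rows in enemy_association.
        IF TG_OP = 'UPDATE' THEN
            DELETE FROM enemy_association WHERE person_id = OLD.id;
        END IF;
        IF NEW.characteristics ? 'enemies' THEN
            INSERT INTO enemy_association (person_id, enemy_id, reason)
            SELECT NEW.id, (enemy->>'id')::uuid, enemy->>'reason'
            FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy;
        END IF;
        RETURN NEW;
    END IF;

    IF TG_OP = 'DELETE' THEN
        IF TG_WHEN = 'BEFORE' THEN
            -- Rows pointing *at* this person are handled by the
            -- ON DELETE CASCADE foreign-key constraint.
            DELETE FROM enemy_association WHERE person_id = OLD.id;
            RETURN OLD;
        END IF;
        -- AFTER DELETE: scrub the deleted id out of the enemies lists
        -- of the surviving rows.
        UPDATE people
        SET characteristics = characteristics || jsonb_build_object(
                'enemies',
                (SELECT coalesce(jsonb_agg(enemy), '[]'::jsonb)
                   FROM jsonb_array_elements(people.characteristics->'enemies') AS enemy
                  WHERE (enemy->>'id')::uuid <> OLD.id))
        WHERE characteristics->'enemies'
              @> jsonb_build_array(jsonb_build_object('id', OLD.id));
        RETURN OLD;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```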

And the trigger setup:
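Along these lines, with the function firing BEFORE for all three operations and again AFTER for DELETE (function and trigger names are illustrative):

```sql
DROP TRIGGER IF EXISTS people_enemy_association ON people;

-- BEFORE handles the type check, the association-table sync, and the
-- per-person cleanup on DELETE...
CREATE TRIGGER people_enemy_association
    BEFORE INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();

-- ...and AFTER DELETE re-runs the function to scrub dangling references
-- out of the surviving rows' enemies lists.
CREATE TRIGGER people_enemy_association_cleanup
    AFTER DELETE ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();
```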

Note that for DELETE the trigger function is now called twice for each row. We also added checks of TG_WHEN = 'BEFORE' to ensure that we don’t execute the UPDATE and INSERT logic during the AFTER pass, where we don’t want it.

How to use the enemy_association table

Since we have triggers on people and constraints on enemy_association, it’s important to treat enemy_association as read-only and handle all writing through people.

However, you can safely query and join against enemy_association and get the full performance of Postgres indexing using traditional join syntax. We didn’t show adding indexes, but once a representative data set is loaded, we should test and add indexes as needed. Keep in mind that primary keys are implicitly indexed but foreign-key constraints do not implicitly create indexes.
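For example, joins from the enemy side would likely justify something like this (index name illustrative):

```sql
-- person_id is already covered by the primary key's implicit index
-- (assuming a (person_id, enemy_id) primary key); enemy_id needs its own.
CREATE INDEX enemy_association_enemy_id_idx
    ON enemy_association (enemy_id);
```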

Performance-wise, the triggers will cause inserts into people to be slightly slower since it’s actually inserting into both tables as well as providing some additional integrity checks. There is also some data duplication so this method will use more on-disk storage space. As usual, performance and data integrity have some trade-offs. However, there are still a few more options to explore.

Up next…

In Part 3 we explain how to make a read/write view with JSON data that is transparently backed by fully-indexable tables. That allows reading and writing in both traditional relational-database SQL as well as using rich JSON syntax, without on-disk data duplication and with maximum performance.


18 May, 2021