In Part 1 we explained why you absolutely need JSONb columns in your PostgreSQL tables, and how they simplify the expression of complex data, including complex references and relationships.
In this article, we’ll dive deeper into ways to use tables with JSONb columns, discussing constraints and triggers to guarantee data integrity, and show how to get full JSON expressivity while keeping most of the performance.
Views and JSON
Views are essentially virtual tables that are populated dynamically from an expression given at the time of view creation. They are typically read-only and are roughly as performant as running the underlying expression directly. There are minor optimization advantages from the pre-compiled expression, but mostly the value of views lies in simplifying the syntax of a hairy expression, particularly when it’s used in a join.
Views work with JSON columns, just like any other columns, and can be used to simplify queries involving JSON. This means that you could store your data in JSON format and make one or more views that break the JSON data out into columnar format. Alternatively, you could store columnar data (possibly across many tables) and create views that express the data in JSON format.
Here we’ll make a view that shows all enemies list changes in the characteristics column of the people table, according to the change_log table (note that the referenced tables are defined in Part 1):
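A sketch of such a view. The exact schemas come from Part 1, which isn’t reproduced here, so the column names below are assumptions: people(id uuid, name text, characteristics jsonb) and change_log(person_id uuid, change jsonb, changed_at timestamptz), where change holds the recorded characteristics value. The view name enemy_changes is also illustrative:

```sql
-- Sketch only; change_log's shape is assumed from context.
CREATE VIEW enemy_changes AS
SELECT cl.person_id,
       p.name,
       cl.changed_at,
       cl.change -> 'enemies' AS enemies
  FROM change_log AS cl
  JOIN people     AS p ON p.id = cl.person_id
 WHERE cl.change ? 'enemies';
```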
Another option is to make a view that expresses the person table with the enemies list as an associative table:
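Something like the following, assuming each element of characteristics->'enemies' is an object that carries the enemy’s id, e.g. {"id": "<uuid>", "status": "Archenemy"} (that entry shape is an assumption, not shown in the original):

```sql
-- Flatten each person's enemies array into association rows.
CREATE VIEW enemy_association AS
SELECT p.id                   AS person_id,
       (enemy ->> 'id')::uuid AS enemy_id
  FROM people AS p,
       jsonb_array_elements(p.characteristics -> 'enemies') AS enemy;
```

Note that a set-returning function in the FROM list is implicitly LATERAL, so it can reference p; people whose characteristics lack an enemies key simply contribute no rows.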
The enemy_association view can be used in joins without directly referring to the JSON content.
Although using this view is syntactically simpler than directly joining into the JSON, it will not be as performant or robust as an actual table that holds this information. Views are not indexed, and JSON column indexes are not quite as good as indexed scalar-valued columns. Further, with scalar-valued columns, foreign-key constraints can be used to ensure data integrity across data changes.
Note that Postgres does have the ability to update and insert through views, but only in very strict circumstances that generally rule out manipulation of data inside JSON columns. However, we can provide that functionality with INSTEAD OF triggers.
Trigger Functions to Get Performance and Expressivity
Careful application of indexing will improve the performance of tables that contain JSON columns (and the views that refer to them). That may be good enough for your application, depending on your usage patterns.
If you need better performance without changing the tables, there are two main options:
- Materialized Views – while still views populated by a given expression, these are “full tables” that use storage space and can be indexed. However, you can’t manipulate them directly. Instead, they are updated with the REFRESH MATERIALIZED VIEW command, which can be quite expensive, time- and resource-wise. If the view is only used for occasional reporting, or the source tables rarely change, then this may be the ideal option, as there’s no overhead per insert or update to the backing table.
- Trigger-maintained tables – where triggers on tables or views supplement or convert inserts and updates into changes to one or more other tables. Let’s explore this further.
Suppose we determined that our enemy_association view was not fast enough for our project, but we like, or don’t want to change, how we’re manipulating the data in the people table with JSON. To provide this mix of performance and flexibility, we will instead create an enemy_association table and put a trigger function on the people table to maintain it. To do this, we will:
- Create the table.
- Make a trigger function that handles INSERT (see the docs for more information).
- Apply the trigger to execute BEFORE changes to the people table.
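The three steps above can be sketched as follows. The table and function names, and the {"id": "<uuid>"} entry shape, are assumptions carried over from the earlier view example; we drop the view first since the table reuses its name:

```sql
DROP VIEW IF EXISTS enemy_association;  -- the table replaces the view

CREATE TABLE enemy_association (
    person_id uuid NOT NULL,
    enemy_id  uuid NOT NULL
);

CREATE OR REPLACE FUNCTION maintain_enemy_association()
RETURNS trigger AS $$
BEGIN
    -- Expand the enemies array (if present) into association rows.
    IF NEW.characteristics ? 'enemies' THEN
        INSERT INTO enemy_association (person_id, enemy_id)
        SELECT NEW.id, (enemy ->> 'id')::uuid
          FROM jsonb_array_elements(NEW.characteristics -> 'enemies') AS enemy;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER people_maintain_enemy_association
    BEFORE INSERT ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();
```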
Note the use of the ? operator (as in NEW.characteristics ? 'enemies') to see if a JSON object has the enemies key in it. For completeness, let’s test it:
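For example, with a deliberately malformed insert along these lines (the uuids are placeholders):

```sql
-- Wrong on purpose: enemies is an object here, not an array.
INSERT INTO people (id, name, characteristics)
VALUES ('11111111-1111-1111-1111-111111111111', 'Jerry',
        '{"enemies": {"id": "22222222-2222-2222-2222-222222222222"}}');
```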
Did you spot the mistake? Here’s the error:
cannot extract elements from an object.
That’s not a very clear error, but the trigger function did prevent us from making a data-structure mistake: enemies is supposed to be an array! And because we created a BEFORE trigger, the error occurred before the data was committed to either table.
Trying again, with a fixed query:
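The same insert, with enemies as an array as the trigger expects (uuids remain placeholders):

```sql
INSERT INTO people (id, name, characteristics)
VALUES ('11111111-1111-1111-1111-111111111111', 'Jerry',
        '{"enemies": [{"id": "22222222-2222-2222-2222-222222222222"}]}');
```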
That worked: both people and enemy_association have been updated as expected.
Let’s test an update. First, empty the people table and re-run the insert from Part 1 on the empty people table, then run this query to update George’s enemy list to include Kramer as a “Frenemy”:
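A possible shape for that update, assuming entries are objects with an id (looked up here by name) plus a free-form status field; note that keeping enemy_association in sync requires the trigger to fire on UPDATE as well as INSERT:

```sql
-- Append a Kramer entry to George's enemies array, creating the
-- array if it doesn't exist yet (jsonb_set creates missing paths).
UPDATE people
   SET characteristics = jsonb_set(
           characteristics,
           '{enemies}',
           coalesce(characteristics -> 'enemies', '[]'::jsonb)
             || jsonb_build_object(
                    'id', (SELECT id FROM people WHERE name = 'Kramer'),
                    'status', 'Frenemy'))
 WHERE name = 'George';
```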
Trigger Functions for Data-Shape Checking
The trigger function provides some minimal data-structure safety, mostly by accident, due to how it processes the data structure. However, we can take that further and use the trigger function to explicitly ensure data-structure consistency. To start, we can add a quick type check on NEW.characteristics->'enemies' before both the INSERT and UPDATE handling:
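Along these lines, using jsonb_typeof(), which reports 'array', 'object', 'string', and so on:

```sql
-- Shape check inside the trigger function: enemies, when present,
-- must be a JSON array.
IF NEW.characteristics ? 'enemies'
   AND jsonb_typeof(NEW.characteristics -> 'enemies') <> 'array' THEN
    RAISE EXCEPTION 'characteristics->enemies must be a JSON array, not %',
        jsonb_typeof(NEW.characteristics -> 'enemies');
END IF;
```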
And now an entry will fail with a nice clear error message if enemies isn’t an array, unlike the cryptic error we got before. Note that it does not throw an exception if enemies is missing, as we consider that acceptable.
We should also ensure that the referenced ids exist in the people table. We can do that in at least two ways:
- In the trigger function, as “virtual constraints”. Since we’re already creating the trigger function, this is relatively easy.
- With standard foreign key constraints. This is also an option since we are creating an output table that has a UUID column that directly references a column in another table.
Since standard foreign-key constraints aren’t always an option, I’ll demonstrate what a quick check would look like if we put it in the function:
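A sketch of such a check, using the same assumed entry shape as before:

```sql
-- "Virtual constraint": reject any enemy id that doesn't exist in people.
IF EXISTS (
    SELECT 1
      FROM jsonb_array_elements(NEW.characteristics -> 'enemies') AS enemy
     WHERE NOT EXISTS (
           SELECT 1 FROM people WHERE id = (enemy ->> 'id')::uuid)
) THEN
    RAISE EXCEPTION 'characteristics->enemies refers to a nonexistent person';
END IF;
```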
Note that the virtual constraint is not deferrable like proper foreign-key constraints are. We could use parameters and such to provide similar functionality, but that’s beyond the scope of this post.
Next we’ll demonstrate a real, deferrable foreign-key constraint:
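For example (the constraint names here are illustrative):

```sql
ALTER TABLE enemy_association
    ADD CONSTRAINT enemy_association_person_id_fkey
        FOREIGN KEY (person_id) REFERENCES people (id)
        ON DELETE CASCADE
        DEFERRABLE INITIALLY IMMEDIATE,
    ADD CONSTRAINT enemy_association_enemy_id_fkey
        FOREIGN KEY (enemy_id) REFERENCES people (id)
        ON DELETE CASCADE
        DEFERRABLE INITIALLY IMMEDIATE;
```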
Since we added DEFERRABLE, we can now use SET CONSTRAINTS ALL DEFERRED (docs) in a transaction to allow insertion of data without checking the constraint until commit time.
In-JSON Referential Integrity
With the foreign-key constraint set to ON DELETE CASCADE, we can end up with invalid references from the JSON. The foreign-key constraint is on enemy_association, not on people. This means that if you DELETE a row from people, any references to that person will be removed from enemy_association, but the characteristics->'enemies' entries that produced those enemy_association rows, if not updated, will now be invalid.
Since we are already triggering on the DELETE of a people row, we can take that opportunity to adjust any referencing characteristics->'enemies' entries as well.
We will need to update the trigger, but first we need to know how to remove an entry from the JSON array nested in the JSON structure. Here’s a reasonable first test using a SELECT to return the updated structure we want, with a few example values:
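A first attempt might look like this (the uuid stands in for the deleted person’s id):

```sql
-- Try one: rebuild each enemies array without the deleted person's id.
SELECT characteristics
       || jsonb_build_object('enemies', jsonb_agg(enemy)) AS new_characteristics
  FROM people,
       jsonb_array_elements(characteristics -> 'enemies') AS enemy
 WHERE enemy ->> 'id' <> '22222222-2222-2222-2222-222222222222';
```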
Sadly, that results in an error:
column "people.characteristics" must appear in the GROUP BY clause or be used in an aggregate function
That error comes from the characteristics || jsonb_build_object('enemies', jsonb_agg(enemy)) part of the query. Since jsonb_agg() aggregates multiple rows (in the same way count() does), any column that doesn’t participate directly in the aggregation must appear in a GROUP BY clause so Postgres knows which row supplies its value.
But we don’t have a GROUP BY clause, and we don’t really need one if we use the implicit grouping of the parent row being expanded into multiple rows by the jsonb_array_elements() call.
This is solved by isolating the jsonb_array_elements() into its own sub-select. (There are other ways, of course, but this one is fairly clean and easy to remember.) Try two:
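Moving the aggregation into a sub-select in the FROM clause might look like this:

```sql
-- Try two: aggregate inside a sub-select in FROM (this will also fail).
SELECT characteristics
       || jsonb_build_object('enemies', agg.enemies) AS new_characteristics
  FROM people,
       (SELECT jsonb_agg(enemy) AS enemies
          FROM jsonb_array_elements(people.characteristics -> 'enemies') AS enemy
         WHERE enemy ->> 'id' <> '22222222-2222-2222-2222-222222222222') AS agg;
```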
However, that also results in an error: invalid reference to FROM-clause entry for table "people". That means, specifically, that we cannot refer to people.characteristics from inside the sub-select when people was defined outside of the sub-select. This is what the LATERAL keyword is for. From the documentation:
Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.)
Try three, with LATERAL:
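The same query with the LATERAL keyword added before the sub-select:

```sql
-- Try three: LATERAL lets the sub-select reference the people row.
SELECT characteristics
       || jsonb_build_object('enemies', agg.enemies) AS new_characteristics
  FROM people,
       LATERAL (
           SELECT jsonb_agg(enemy) AS enemies
             FROM jsonb_array_elements(people.characteristics -> 'enemies') AS enemy
            WHERE enemy ->> 'id' <> '22222222-2222-2222-2222-222222222222') AS agg;
```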
It works! Now to make that an UPDATE statement and put it in the trigger function, except (and this is important) we cannot change a column of a row that is also deleted in the same command.
Said another way: if a DELETE with a WHERE clause would delete two or more people who cross-reference each other (are each other’s enemies), it will throw an error when it tries to delete the updated row. This is particularly tricky since the trigger is called per row, with no visibility into what other rows will be deleted.
The solution, thankfully, is fairly simple: move the UPDATE statement to an AFTER trigger instead of a BEFORE trigger. This allows the deletions of full rows to complete, then the UPDATE can update the remaining rows.
Bringing it all together
Here is the final trigger function:
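A sketch of how the combined function might look; the names and the {"id": "<uuid>", ...} entry shape are assumptions carried through from the earlier examples:

```sql
CREATE OR REPLACE FUNCTION maintain_enemy_association()
RETURNS trigger AS $$
BEGIN
    IF TG_WHEN = 'BEFORE' AND TG_OP IN ('INSERT', 'UPDATE') THEN
        -- Shape check: enemies, when present, must be an array.
        IF NEW.characteristics ? 'enemies'
           AND jsonb_typeof(NEW.characteristics -> 'enemies') <> 'array' THEN
            RAISE EXCEPTION 'characteristics->enemies must be a JSON array';
        END IF;

        -- Virtual constraint: every referenced enemy id must exist.
        IF EXISTS (
            SELECT 1
              FROM jsonb_array_elements(
                       coalesce(NEW.characteristics -> 'enemies', '[]'::jsonb)) AS enemy
             WHERE NOT EXISTS (
                   SELECT 1 FROM people WHERE id = (enemy ->> 'id')::uuid)
        ) THEN
            RAISE EXCEPTION 'characteristics->enemies refers to a nonexistent person';
        END IF;

        -- Rebuild this person's association rows.
        IF TG_OP = 'UPDATE' THEN
            DELETE FROM enemy_association WHERE person_id = OLD.id;
        END IF;
        INSERT INTO enemy_association (person_id, enemy_id)
        SELECT NEW.id, (enemy ->> 'id')::uuid
          FROM jsonb_array_elements(
                   coalesce(NEW.characteristics -> 'enemies', '[]'::jsonb)) AS enemy;
        RETURN NEW;
    END IF;

    IF TG_WHEN = 'BEFORE' AND TG_OP = 'DELETE' THEN
        -- Redundant with ON DELETE CASCADE, but explicit.
        DELETE FROM enemy_association WHERE person_id = OLD.id;
        RETURN OLD;
    END IF;

    IF TG_WHEN = 'AFTER' AND TG_OP = 'DELETE' THEN
        -- Scrub the deleted person out of every remaining enemies array.
        UPDATE people
           SET characteristics = characteristics || jsonb_build_object(
                   'enemies',
                   (SELECT coalesce(jsonb_agg(enemy), '[]'::jsonb)
                      FROM jsonb_array_elements(characteristics -> 'enemies') AS enemy
                     WHERE enemy ->> 'id' <> OLD.id::text))
         WHERE characteristics -> 'enemies'
               @> jsonb_build_array(jsonb_build_object('id', OLD.id));
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;
```

The AFTER DELETE branch uses a correlated subquery in SET rather than the LATERAL form shown earlier; both express the same per-row aggregation.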
And the trigger setup:
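One possible setup, with one trigger for all three operations BEFORE and a second one AFTER for DELETE only, both calling the same function (trigger names are illustrative):

```sql
CREATE TRIGGER people_enemy_association_before
    BEFORE INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();

CREATE TRIGGER people_enemy_association_after
    AFTER DELETE ON people
    FOR EACH ROW EXECUTE FUNCTION maintain_enemy_association();
```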
Since DELETE now calls the function twice for each row, we also added some checks on TG_WHEN = 'BEFORE' to ensure that we don’t execute the wrong logic during the AFTER invocation.
How to use these tables
Since we have triggers on people and constraints on enemy_association, it’s important to treat enemy_association as read-only and handle all writing through the people table.
However, you can safely query and join against enemy_association and get the full performance of Postgres indexing using traditional join syntax. We didn’t show adding indexes, but once a representative data set is loaded, you should test and add indexes as needed. Keep in mind that primary keys are implicitly indexed, but foreign-key constraints do not implicitly create indexes.
Performance-wise, the triggers will make inserts into people slightly slower, since each insert actually writes to both tables and performs some additional integrity checks. There is also some data duplication, so this method will use more on-disk storage space. As usual, performance and data integrity involve trade-offs. However, there are still a few more options to explore.
In Part 3 we explain how to make a read/write view with JSON data that is transparently backed by fully-indexable tables. That allows reading and writing in both traditional relational-database SQL as well as using rich JSON syntax, without on-disk data duplication and with maximum performance.