In Part 1 we explained why you absolutely need JSONb columns in your PostgreSQL tables, and how they simplify the expression of complex data, including complex references and relationships.
In this article, we’ll dive deeper into ways to use tables with JSONb columns, discuss constraints and triggers that guarantee data integrity, and show how to get full JSON expressivity while keeping most of the performance.
Views and JSON
Views are essentially virtual tables that are populated dynamically from an expression given at the time of view creation. They are typically read-only and are roughly as performant as running the underlying expression directly. There can be minor optimization advantages, but mostly the value of views lies in simplifying the syntax of a hairy expression, particularly when it’s used in a join.
Views work with JSON columns, just like any other columns, and can be used to simplify queries involving JSON. This means that you could store your data in JSON format and make one or more views that break the JSON data out into columnar format. Alternatively, you could store columnar data (possibly across many tables) and create views that express the data in JSON format.
Here we’ll make a view that shows all enemies-list changes in the characteristics column of the people table, according to the change_log table (note that the referenced tables are defined in Part 1):
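Part 1’s change_log definition isn’t reproduced here, so treat this as a sketch only: it assumes the log stores the changed person’s id, jsonb before/after snapshots of characteristics, and a timestamp, and all of the column names below are illustrative.

```sql
-- A sketch only: assumes change_log(person_id, changed_at, old_value, new_value),
-- where old_value/new_value are jsonb snapshots of people.characteristics.
CREATE VIEW enemies_change_log AS
SELECT cl.person_id,
       cl.changed_at,
       cl.old_value->'enemies' AS old_enemies,
       cl.new_value->'enemies' AS new_enemies
  FROM change_log AS cl
 WHERE cl.old_value->'enemies' IS DISTINCT FROM cl.new_value->'enemies';
```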
Another option is to make a view that expresses the people table with the enemies list as an associative table:
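A sketch of such a view, assuming (per Part 1) that people has a UUID id column and that characteristics->'enemies' holds an array of objects shaped like {"id": …, "flavor": …}:

```sql
CREATE VIEW enemy_association AS
SELECT p.id                 AS person_id,
       (enemy->>'id')::uuid AS enemy_id,
       enemy->>'flavor'     AS flavor
  FROM people AS p,
       -- expands each element of the enemies array into its own row
       jsonb_array_elements(p.characteristics->'enemies') AS enemy;
```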
Now the enemy_association view can be used in joins without directly referring to the JSON content.
Although using this view is syntactically simpler than directly joining into the JSON, it will not be as performant or robust as an actual table that holds this information. Views are not indexed, and JSON column indexes are not quite as good as indexed scalar-valued columns. Further, with scalar-valued columns, foreign-key constraints can be used to ensure data integrity across data changes.
Note that Postgres does have the ability to update and insert through views, but only in very strict circumstances that generally rule out manipulation of data inside JSON columns. However, we can provide that functionality with INSTEAD OF triggers.
Trigger Functions to Get Performance and Expressivity
Careful application of indexing will improve the performance of tables that contain JSON columns (and the views that refer to them). That may be good enough for your application, depending on your usage patterns.
If you need better performance without changing the tables, there are two main options:
- Materialized Views – while still views populated by a given expression, these are “full tables” that use storage space and can be indexed. However, you can’t manipulate them directly. Instead, they are updated with the REFRESH MATERIALIZED VIEW command, which can be quite expensive, time- and resource-wise. If the view is only used for occasional reporting, or the source tables rarely change, then this may be the ideal option, as there’s no overhead per insert or update to the backing table (see the sketch after this list).
- Trigger-maintained tables – where triggers on tables or views supplement or convert inserts and updates into changes to one or more other tables. Let’s explore this in more detail.
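Here’s roughly what the materialized-view route could look like, reusing the enemy_association expression from above (the _mat name is illustrative):

```sql
CREATE MATERIALIZED VIEW enemy_association_mat AS
SELECT p.id                 AS person_id,
       (enemy->>'id')::uuid AS enemy_id,
       enemy->>'flavor'     AS flavor
  FROM people AS p,
       jsonb_array_elements(p.characteristics->'enemies') AS enemy;

-- Unlike a plain view, it can be indexed...
CREATE INDEX ON enemy_association_mat (enemy_id);

-- ...but it only reflects later changes to people after a refresh:
REFRESH MATERIALIZED VIEW enemy_association_mat;
```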
Suppose we determined that our enemy_association view was not fast enough for our project, but we like how we’re manipulating the data in the people table with JSON and don’t want to change that. To provide this mix of performance and flexibility, we will instead create an enemy_association table and put a trigger function on the people table to maintain it. To do this, we will:
- Create the table.
- Make a trigger function that handles DELETE, UPDATE, and INSERT (see the docs for more information).
- Apply the trigger to execute BEFORE changes to the people table:
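A sketch of all three steps might look like this; the names are illustrative, the people shape is assumed from Part 1, and the view of the same name from earlier has to go first:

```sql
-- Step 1: replace the view with a real table.
DROP VIEW IF EXISTS enemy_association;

CREATE TABLE enemy_association (
    person_id uuid NOT NULL,
    enemy_id  uuid NOT NULL,
    flavor    text,
    PRIMARY KEY (person_id, enemy_id)
);

-- Step 2: a trigger function covering DELETE, UPDATE, and INSERT.
CREATE FUNCTION maintain_enemy_association() RETURNS trigger AS $$
BEGIN
    -- On UPDATE or DELETE, clear out the old associations first.
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        DELETE FROM enemy_association WHERE person_id = OLD.id;
    END IF;

    -- On INSERT or UPDATE, expand the JSON enemies array into rows.
    IF TG_OP IN ('INSERT', 'UPDATE') AND NEW.characteristics ? 'enemies' THEN
        INSERT INTO enemy_association (person_id, enemy_id, flavor)
        SELECT NEW.id,
               (enemy->>'id')::uuid,
               enemy->>'flavor'
          FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy;
    END IF;

    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Step 3: fire it BEFORE any change to people.
-- (On Postgres versions before 11, use EXECUTE PROCEDURE.)
CREATE TRIGGER people_enemy_association
    BEFORE INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW
    EXECUTE FUNCTION maintain_enemy_association();
```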
Note the use of the ? operator (as in NEW.characteristics ? 'enemies') to see if a JSON object has the enemies key in it. For completeness, let’s test it:
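A test along these lines; the UUIDs match the result table further below, and the insert assumes id and characteristics are the only columns we need to supply:

```sql
INSERT INTO people (id, characteristics)
VALUES ('ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d',
        '{"enemies": {"id": "9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4",
                      "flavor": "Occupational Rivalry"}}');
```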
Did you spot the mistake? Here’s the error: cannot extract elements from an object.
That’s not a very clear error, but the trigger function did prevent us from making a data-structure mistake: enemies is supposed to be an array! And because we created a BEFORE trigger, the error happened before the data was written anywhere, preventing it from being committed to either table.
Trying again, with a fixed query:
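This time with enemies as a proper array:

```sql
INSERT INTO people (id, characteristics)
VALUES ('ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d',
        '{"enemies": [{"id": "9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4",
                       "flavor": "Occupational Rivalry"}]}');
```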
That worked. Both people and enemy_association have been updated as expected. Here’s the enemy_association entry:
person_id | enemy_id | flavor
---|---|---
ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d | 9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4 | Occupational Rivalry
Let’s test an update. First, empty the people table and re-run the insert from Part 1, then run this query to update George’s enemy list to include Kramer as a “Frenemy”:
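A sketch of such an update, assuming the UUIDs below are George’s and Kramer’s ids from the Part 1 data:

```sql
UPDATE people
   SET characteristics = jsonb_set(
           characteristics,
           '{enemies}',
           -- appending an object to a jsonb array adds it as an element
           characteristics->'enemies'
               || '{"id": "6ae1da27-0565-4649-96c5-e4c1da48de21",
                    "flavor": "Frenemy"}'::jsonb)
 WHERE id = '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4';
```

Afterward, George’s rows in enemy_association look like this: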
person_id | enemy_id | flavor
---|---|---
9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4 | a4adc3fd-485f-4fc7-a561-863ea78e9bb9 | Teacher
9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4 | ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d | Unexplained
9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4 | 6ae1da27-0565-4649-96c5-e4c1da48de21 | Frenemy
Trigger Functions for Data-Shape Checking
The trigger function provides some minimal data-structure safety, mostly by accident, due to how it processes the data structure. However, we can take that further and use the trigger function to explicitly ensure data-structure consistency. To start, we can add a quick type check on NEW.characteristics->'enemies' before both the UPDATE and INSERT:
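A minimal sketch of such a check, sitting near the top of the trigger function body; jsonb_typeof() does the real work, and the error wording is mine:

```sql
IF NEW.characteristics ? 'enemies'
   AND jsonb_typeof(NEW.characteristics->'enemies') <> 'array' THEN
    RAISE EXCEPTION 'characteristics->enemies must be a JSON array, not a %',
                    jsonb_typeof(NEW.characteristics->'enemies');
END IF;
```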
And now an entry will fail with a nice clear error message if enemies isn’t an array, unlike that cryptic error we got before. Note that it does not throw an exception if enemies is missing, as we’re considering that acceptable.
Foreign-Key Constraints
We should also ensure that the referenced ids exist in the people table. We can do that in at least two ways:
- In the trigger function, as “virtual constraints.” Since we’re already creating the trigger function, this is relatively easy.
- With standard foreign-key constraints. This is also an option here, since our output table has a UUID column that directly references a column in another table.
Since standard foreign-key constraints aren’t always an option, I’ll first demonstrate what a quick check would look like if we were to put it in the function:
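Something like this, inside the trigger function’s INSERT/UPDATE handling (the error wording is illustrative):

```sql
IF EXISTS (
       SELECT 1
         FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy
        WHERE NOT EXISTS (
                  SELECT 1 FROM people WHERE id = (enemy->>'id')::uuid))
THEN
    RAISE EXCEPTION 'enemies entry references a nonexistent person';
END IF;
```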
Note that the virtual constraint is not deferrable like proper foreign-key constraints are. We could use parameters and such to provide similar functionality, but that’s beyond the scope of this post.
Next, we’ll demonstrate a real and deferrable foreign-key constraint:
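For instance (the constraint names are illustrative; ON DELETE CASCADE becomes important in the next section):

```sql
ALTER TABLE enemy_association
    ADD CONSTRAINT enemy_association_person_id_fkey
        FOREIGN KEY (person_id) REFERENCES people (id)
        ON DELETE CASCADE
        DEFERRABLE,
    ADD CONSTRAINT enemy_association_enemy_id_fkey
        FOREIGN KEY (enemy_id) REFERENCES people (id)
        ON DELETE CASCADE
        DEFERRABLE;
```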
Since we added DEFERRABLE, we can now use SET CONSTRAINTS ALL DEFERRED (docs) in a transaction to allow insertion of data without checking the constraint until commit time.
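For example:

```sql
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
-- Inserts that temporarily violate the foreign keys can run here...
COMMIT;  -- ...because the checks only happen at commit time.
```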
In-JSON Referential Integrity
With the foreign-key constraint set to ON DELETE CASCADE, we can still end up with invalid references from the JSON. The foreign-key constraint is on enemy_association, not on people. This means that if you DELETE from people, any references to that person will be removed from enemy_association, but the characteristics->'enemies' entries that produced those enemy_association rows will, if not updated, now be invalid.
Since we are already triggering on the DELETE of a people row, we can take that time to adjust any referencing characteristics->'enemies' while we are there.
We will need to update the trigger, but first we need to know how to remove an entry from the JSON array nested in the JSON structure. Here’s a reasonable first test, using a SELECT to return the updated structure we want, with a few example ids hard-coded:
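Something like the following, where we imagine the person with id ac2b65eb-… has been deleted and we want them stripped from George’s list:

```sql
SELECT characteristics || jsonb_build_object('enemies', jsonb_agg(enemy))
  FROM people,
       jsonb_array_elements(characteristics->'enemies') AS enemy
 WHERE id = '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4'
   AND enemy->>'id' <> 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d';
```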
Sadly, that results in an error: column "people.characteristics" must appear in the GROUP BY clause or be used in an aggregate function.
That error comes from the characteristics || jsonb_build_object('enemies', jsonb_agg(enemy)) part of the query. Since jsonb_agg() aggregates multiple rows (in the same way max() and count() do), any column that doesn’t participate directly in the aggregation must appear in a GROUP BY clause, so Postgres knows which row to take its value from.
But we don’t have a GROUP BY clause, and we don’t really need one if we use the implicit grouping of the parent row being expanded into multiple rows by the jsonb_array_elements() function.
This is solved by isolating the jsonb_agg()/jsonb_array_elements() pair into its own sub-select. (There are other ways, of course, but this one is fairly clean and easy to remember.) Try two:
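Moving the aggregation into a sub-select might look like this:

```sql
SELECT characteristics || enemies_object
  FROM people,
       (SELECT jsonb_build_object('enemies', jsonb_agg(enemy)) AS enemies_object
          FROM jsonb_array_elements(people.characteristics->'enemies') AS enemy
         WHERE enemy->>'id' <> 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d') AS filtered
 WHERE id = '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4';
```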
However, that also results in an error: invalid reference to FROM-clause entry for table "people". That means, specifically, that we cannot refer to people.characteristics from inside the sub-select when people was defined outside of the sub-select. This is what the LATERAL keyword is for. From the documentation:
Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.)
Try three, with LATERAL:
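The same query, with the keyword added:

```sql
SELECT characteristics || enemies_object
  FROM people,
       LATERAL (SELECT jsonb_build_object('enemies', jsonb_agg(enemy)) AS enemies_object
                  FROM jsonb_array_elements(people.characteristics->'enemies') AS enemy
                 WHERE enemy->>'id' <> 'ac2b65eb-713f-4f1b-bef6-c48ea8c9f72d') AS filtered
 WHERE id = '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4';
```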
It works! Now we can turn that into an UPDATE statement and put it in the trigger function, except (and this one is important) we cannot change a column of a row that is also deleted in the same command.
Said another way: if we DELETE with a WHERE clause that would delete two or more people who cross-reference each other (are each other’s enemies), it will throw an error when it tries to delete a row that has already been updated. This is particularly tricky since the trigger is called per row, with no visibility into what other rows will be deleted.
The solution, thankfully, is fairly simple: move the UPDATE statement to an AFTER DELETE trigger instead of a BEFORE DELETE trigger. This allows the deletion of the full rows to complete, after which the UPDATE can adjust the remaining rows.
Bringing it all together
Here is the final trigger function:
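Here’s a sketch of what it could look like, combining the pieces above. Inside the trigger, the LATERAL experiment becomes a correlated sub-select, which is allowed to reference the row being updated:

```sql
CREATE OR REPLACE FUNCTION maintain_enemy_association() RETURNS trigger AS $$
BEGIN
    IF TG_WHEN = 'BEFORE' THEN
        -- On UPDATE or DELETE, clear out the old associations first.
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            DELETE FROM enemy_association WHERE person_id = OLD.id;
        END IF;

        -- On INSERT or UPDATE, check the shape, then expand the JSON
        -- enemies array into rows.
        IF TG_OP IN ('INSERT', 'UPDATE') AND NEW.characteristics ? 'enemies' THEN
            IF jsonb_typeof(NEW.characteristics->'enemies') <> 'array' THEN
                RAISE EXCEPTION 'characteristics->enemies must be a JSON array';
            END IF;

            INSERT INTO enemy_association (person_id, enemy_id, flavor)
            SELECT NEW.id,
                   (enemy->>'id')::uuid,
                   enemy->>'flavor'
              FROM jsonb_array_elements(NEW.characteristics->'enemies') AS enemy;
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        -- AFTER DELETE: the row is gone, so strip the deleted person from
        -- the enemies list of any remaining row that still mentions them.
        -- (This UPDATE re-fires the BEFORE trigger on those rows, which
        -- rebuilds their enemy_association entries.)
        UPDATE people
           SET characteristics = characteristics || (
                   SELECT jsonb_build_object(
                              'enemies',
                              COALESCE(jsonb_agg(enemy), '[]'::jsonb))
                     FROM jsonb_array_elements(characteristics->'enemies') AS enemy
                    WHERE enemy->>'id' <> OLD.id::text)
         WHERE characteristics->'enemies'
                   @> jsonb_build_array(jsonb_build_object('id', OLD.id::text));
    END IF;

    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```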
And the trigger setup:
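One possible setup, attaching the same function twice:

```sql
DROP TRIGGER IF EXISTS people_enemy_association ON people;

CREATE TRIGGER people_enemy_association_before
    BEFORE INSERT OR UPDATE OR DELETE ON people
    FOR EACH ROW
    EXECUTE FUNCTION maintain_enemy_association();

CREATE TRIGGER people_enemy_association_after
    AFTER DELETE ON people
    FOR EACH ROW
    EXECUTE FUNCTION maintain_enemy_association();
```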
Note that the function is now called twice for each deleted row. We also added checks on TG_WHEN = 'BEFORE' to ensure that we don’t try to execute the UPDATE and INSERT handling during the AFTER pass, when we don’t want to.
How to use the enemy_association table
Since we have triggers on people and constraints on enemy_association, it’s important to treat enemy_association as read-only and handle all writing through people.
However, you can safely query and join against enemy_association and get the full performance of Postgres indexing using traditional join syntax. We didn’t show adding indexes, but once a representative data set is loaded, we should test and add indexes as needed. Keep in mind that primary keys are implicitly indexed, but foreign-key constraints do not implicitly create indexes.
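For example (a sketch; which indexes pay off depends on your query patterns):

```sql
-- Foreign-key columns are not indexed automatically, so add what joins need:
CREATE INDEX ON enemy_association (enemy_id);

-- A traditional join, with no JSON operators in sight:
SELECT p.*
  FROM people AS p
  JOIN enemy_association AS ea ON ea.person_id = p.id
 WHERE ea.enemy_id = '9eda3717-ee4a-4ac7-ba2e-d3b43b7e79c4';
```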
Performance-wise, the triggers will make inserts into people slightly slower, since each insert actually writes to both tables and runs some additional integrity checks. There is also some data duplication, so this method will use more on-disk storage space. As usual, performance and data integrity involve trade-offs. However, there are still a few more options to explore.
Up next…
In Part 3 we explain how to make a read/write view with JSON data that is transparently backed by fully indexable tables. That allows reading and writing in both traditional relational-database SQL and rich JSON syntax, without on-disk data duplication and with maximum performance.