Blog

Managing Complex Data with JSON in PostgreSQL, Part 1

Are you looking at your stale, flat relational database and thinking: “It’s so rigid and lifeless!” Do you need polymorphism? Do you have a NoSQL database that has no obvious shape and is causing you lack-of-relationship nightmares? With JSON in PostgreSQL, you can have your cake and relationships too!

Crystal ball on a fence post in which we see a tree

What Problem Are We Trying to Solve?

I’d like to explain the decades-old issue we’re working on here.

Let’s take a look at a fairly common use case: a Content Management System (CMS) for an online publication. The publication has articles or stories, which contain references to a mix of Copy, Images, Videos, and Ads. People are also involved in the Story – some involved in the making of the Story (author, editor, photographer, etc), some are referenced in the Story. Add in Places, Things, Events, Products, and other references you don’t even know about yet, and it gets pretty complicated.

Each of these entities have different but overlapping content types, metadata types, and references. Essentially, each entity has its own schema. To make matters worse, you may have new entities being discovered and added over time – like shooting scripts, image rights, embedded ad copy, etc.

The final straw is that you have been tasked with the impossible job of adding a change-log to your database that tracks certain changes of certain entities, keeping both the old and new data, and consolidating it into a single, unified stream to inform downstream systems.

This post will explain how to build just such a beast and we’ll do it using PostgreSQL’s powerful mix of relational and JSON capabilities!

Technology Choices

When building a system handling complex data, technology decisions boil down to finding the right balance of flexibility, performance, and maintainability. For this article, we’ll focus on the database storing the data behind all of this.

Relational databases are the poster child for rigid data schemas. Costs in performance, storage usage, and maintainability rise exponentially for every degree-of-freedom added to the dataset.

Representing complex, extensible data in a relational database is a decades-old problem and many solutions have been used:

  • Adding tables – adding one or more new tables for each new entity. This gets cumbersome, and usually leads to frequent database schema upgrades.
  • Adding columns / sparsing – adding additional columns to tables for specific cases. This makes a table row the union of all possible attributes for different types of entities. The data is spread thin and has to be gathered up knowing the type of entity the row represents. This pushes responsibility for data integrity and other constraints up to the application. These guarantees must be provided by the code writing to the database, compensated for in anything reading from the database, or both.
  • Column overloading – providing columns where the value is interpreted differently based on context. For example, using a text or binary column and storing data that may be a number, boolean, text, or even a deep data structure such as JSON. This is a terrible idea, but people do it. This also requires deep knowledge of the dataset, and has no data shape or referential integrity guarantees.
  • External data or opaque data – where the data that’s flexible is housed in a completely different system, such as in a file or other database system, or perhaps as an opaque “blob” within the database. This can have performance impact, and also lacks data quality guarantees.
  • Many-to-many relationships – are another sticking point for relational databases. We’ll come back to this later.

NoSQL databases solve some of these problems – but not all:

  • NoSQL databases are often JSON-document based with one of a variety of query mechanisms. Most NoSQL databases are schema-less, and without concept of relationships. These often amount to a folder of partially indexed JSON documents with a REST-like interface.
  • While NoSQL databases have their uses, relationships and data integrity are left up to the consuming code, and often end up being combined with indexing and caching services, and the data shape integrity is left up to the code using the database.

Looking at PostgreSQL: Why JSON?

JSON (JavaScript Object Notation) has grown quite a distance from its roots in JavaScript, and is now supported as the de-facto machine-to-machine data exchange format for simple-yet-deep structured data. JSON data is a tree structure using key-value maps (a.k.a. “objects”) and arrays, with the leaf nodes being simple scalars in the form of numbers, strings, booleans, or nulls.

JSON is also a text-based format that needs to be parsed to be useful, and doesn’t have a built-in mechanism to adhere to a schema. However, JSON is easily parsed by most programming languages including JavaScript in a web browser, and is heavily used in REST and GraphQL APIs.

Most NoSQL databases use JSON for its flexibility and expressivity. However, a JSON document on its own is a flat, two-dimensional tree without any linkage to other documents. Without structure enforcement, managing large amounts of this data can quickly become a burden that’s left up to the code (or people) using the database.

Contrast that to very rigid, row-and-column relational database tables that can, through references, become highly dimensional data. Since the data in columns is traditionally purely scalar, it’s effectively a tree of all leaves. The rigidity of relational databases provides intelligent indexing, storage, constraints providing the structural integrity of data and references, uniqueness guarantees, and prevention of losing essential data.

Now we can combine the two to get the best parts of both!

PostgreSQL Support for JSON

PostgreSQL added JSON support in version 9.2 and has greatly expanded this capability since. AWS Aurora Serverless (PostgreSQL-flavor) uses PostgreSQL 10 protocol and language. Unfortunately that doesn’t include the newer SQL/JSON Path that arrived in PostgreSQL 12, so plan accordingly. Nothing that can be done with SQL/JSON Path can’t be done without it, but it is more convenient.

What’s JSONb?

For better efficiency in usage and space, instead of storing strings and parsing them at the time of usage in json columns, one should use jsonb (JSON-binary) columns. To quote the docs:

The JSON data type stores an exact copy of the input text, which processing functions must reparse on each execution; while JSONb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process since no reparsing is needed. JSONb also supports indexing, which can be a significant advantage.

JSONb, due to the additional indexing, also has the operators needed to accomplish the tasks described below, whereas json columns don’t. In every case where I say “JSON data,” I mean JSON data stored in a jsonb column type.

Many-to-Many Relationships

One of the most powerful features of using JSON in a database column is the ability to express many-to-many relationships all at once, along with any additional data about those relationships. Options for many-to-many relations include: in order of more-to-less traditional:

  • Junction tables – in classic relational DBs it’s common to use associative tables, or more formally “Associative Entities“. This is a table with columns of primary key references to join two or more other tables. Additional metadata about the relationship can also be held in additional columns.
    • These index well, support constraint checks, and are a well-known design pattern. But they can be a lot of work to maintain.
    • Since the data is stored across multiple tables, it takes multiple statements to freeze or thaw a dataset complete with relationships.
  • Array columns – allowing a single column to contain a variable-length list of same-typed values. These can mimic an associative table, but joining is asymmetric, as it’s easier to join from an array column than it is to join to one. Array columns are supported in PostgreSQL, but not in many other databases.
    • Arrays index using the same GIN index type as jsonb columns, and cannot contain additional metadata with each value, as you can with JSON. (Note: It is possible to make a jsonb[] column, but we don’t recommend it, as there’s no value over a jsonb column that contains an array.)
  • JSON – allows a complete tree of relationships to be expressed, along with relevant metadata about those relationships. This is the pattern we use in this post.
    • This is not as performant as using an Associative Table for joins mostly due to differences in indexing, but this allows a detailed relationship to be expressed with metadata in a single freeze or thaw statement.
    • Similar to arrays, joining performance is asymmetric.
  • Graph databases – represent relationships as specific “first class entities”. Until PostgreSQL grows graph capabilities we compromise by using JSON.

Code examples: How to Use JSON in PostgreSQL

How to Insert JSON Data into a Row

A typical problem when writing a fictional narrative such as a book or television script is managing the relationships of the characters. For our examples, we’ll keep track of the dark side of a beloved TV comedy by tracking rivalries in Seinfeld. We’ll do that by tracking the asymmetrical relationship of who is perceived to be an enemy by whom.

Let’s start with a table that has a jsonb type column and INSERT our first few entries using the JSON in string form:

Any context that is assumed to be of json or jsonb type will automatically parse and use the JSON value provided as a string.

It’s also possible to use JSONb inline, and extract parts out of it:

This may seem somewhat useless, but imagine this used from code where you can’t easily manipulate the data structure before you provide it to SQL, such as in an AppSync Aurora Serverless Resolver:

How to Insert Whole Rows with JSON

With the jsonb_populate_recordset() (docs) function you can load JSON that contains an array or objects directly into a table, even if that table doesn’t have any json or jsonb columns!

This is likely the easiest way to populate an entire table at once:

Note that the id values are provided so that we could refer to other people in the enemies list.

Similarly, we SET CONSTRAINTS ALL DEFERRED to avoid constraints errors within the statement, as each row is inserted. (We don’t have constraints yet, and will talk more about how to implement constraints in the next article, so this can be ignored for now.)

If you provide the JSON as a parameter, you can load a JSON file and import entire tables at once, for example with python:

How to Extract JSON Data

We already saw how to reach into objects and arrays with [object]->[key] or [array]->[int] (which returns json or jsonb). There’s also [object]->>[key] or [array]->>[int] which returns the same value as the -> form, except converted to a string. You can then further cast it to another type, such as integer, boolean, or uuid.

To expand a JSON array into rows (ideal for use in a sub-select) use jsonb_array_elements() (docs):

How to Form JSON Data from SQL

The simplest way to form JSON from a SELECT result is row_to_json() (docs), but it only takes a single row at a time:

You can add columns by adding to the row, which requires a sub-select:

A jsonb object can be made “from scratch” with jsonb_build_object() (docs) to form JSON objects with key/value pairs:

The keys and values can be expression values and nested:

Use the || (docs) operator to merge jsonb objects, much like with strings:

We can combine multiple rows with jsonb_agg() / jsonb_object_agg() (docs) to aggregate rows of jsonb values into jsonb arrays / objects (respectively).

Bringing It All Together: Making a Change Log Table

We’ll create a change_log table that has:

  • Two jsonb columns: old_value and new_value to hold the said values
  • table_name to track what table changed
  • operation to hold the type of change it was ('DELETE', 'INSERT', 'UPDATE', or 'TRUNCATE')
  • id for an auto-generated UUID
  • created_timestamp for being able to track when the change happened, and sort events

Now there’ll need to be a mechanism to actually populate this change_log, here’s a simple trigger function:

And we’ll need to set the trigger function to run on all of the tables we want to track:

Note that it’s an AFTER trigger since we don’t want to risk preventing or delaying the change, we only want to record that it happened. If we wanted to do implement deep constraints here, we would want it to be a BEFORE trigger.

Now if we make a change to an entry:

We can now see a new entry in the change_log with Kramer getting credit for The Beach in the new_value but not in the old_value.

old_value new_value

Up next…

In Part 2 we go into how to enforce referential integrity, provide clean errors for data format violations, and take steps to get all of the performance of a relational database while using JSON to express the relationships.

Categories: Blog

11 May, 2021