Subselects can be used in many ways: filtering, retrieving specific data, and creating custom joins. Writing effective SQL using subselects, though, can be frustrating. While SQL is ubiquitous in software engineering and is a valuable skill for testing, data analysis and reporting, nowadays a lot of SQL is abstracted in ORM frameworks such as Hibernate and ibatis, and many projects have adopted No-SQL databases. This means some developers have little or no experience in SQL. While this is not a bad trend, in and of itself, there are still many solid business cases for storing data in a traditional, relational database. With that in mind, lets explore three ways to use this troublesome SQL Query.
The following is a simplified example from the music domain that models Albums, Tracks, Recordings, and People to illustrate the uses of the subselect. Note: the recording table is versioned.
1. WHERE subselect
The most common way subselects are used is to filter data in the WHERE clause. Using subselects in this way allows for a subset of data to be selected from a table in order to filter the values in another table.
The following is an example of getting all the recordings that appear on albums with titles that start with ‘Best of ABBA’.
SELECT * FROM subselect.recording r WHERE recording_id in (SELECT recording_id FROM subselect.track t, subselect.album a WHERE t.album_id=a.album_id AND a.title like 'Best of ABBA%') ORDER BY recording_id, version;
Note: the same results could be obtained by joining the track and album tables to the recording table.
The features of this query:
- Filters the recording data down to only those recordings from albums where the title starts with ‘Best of Abba’.
- This subselect is quick and returns a single column of data that will be used to filter the recording results.
- The subselect data does not show up in the final results
- The subselect must only return one column of data
- Good readability
2. JOIN subselect
JOINs offer another interesting way to filter data via subselect. In this case the subselect is joined to the query. This has the effect of filtering data while providing the query with values returned in the select statement of the inner query. Using subselects in this way allows the query to take multiple rows of similar data and transforming it into a columns.
The following is an example of getting all the recordings that appear on albums with titles that start with ‘Best of ABBA’, but ensuring that the latest version of the recording is returned.
SELECT r.recording_id, foo.current_version, r.title, r.artist, r.duration, t.position, a.title, a.artist FROM subselect.recording r JOIN subselect.track t ON t.recording_id=r.recording_id JOIN subselect.album a ON t.album_id=a.album_id JOIN (SELECT recording_id, MAX(version) AS current_version FROM subselect.recording r GROUP BY recording_id ORDER BY recording_id) foo ON foo.current_version = r.version AND foo.recording_id=r.recording_id WHERE a.title like 'Best of ABBA%' ORDER BY a.album_id;
The features of this query:
- Allows for multiple selected columns to be exposed to the main query
- Provides easy filtering for versioned data
- Filters data like a where clause
- Specific use causes
3. SELECT subselect
The SELECT statement is another place we can use a subselect. Unlike the first two subselect examples, the SELECT clause’s subselects do not filter, but they do allow the results to be enhanced. The main thing to watch out for when using subselects is that data returned from the subselect must be returned as one column and one row. This can be done using an aggregation function.
The following combines the join clause subselect and the select clause subselect to return a report of all the current recordings that appear on albums with titles that start with ‘Best of ABBA’, while returning a pipe delimited list of persons attributed to recording.
SELECT r.title, r.artist, r.duration, r.recording_id, a.title, t.position, (SELECT string_agg(name, ' | ') FROM subselect.person p, subselect.recording_person rp WHERE rp.person_id=p.person_id AND rp.recording_id=r.recording_id) FROM subselect.recording r JOIN subselect.track t ON t.recording_id=r.recording_id JOIN subselect.album a ON t.album_id=a.album_id JOIN (SELECT recording_id, MAX(version) AS current_version FROM subselect.recording r GROUP BY recording_id ORDER BY recording_id) foo ON foo.current_version = r.version AND foo.recording_id=r.recording_id WHERE a.title like 'Best of ABBA%' ORDER BY a.album_id;
In the above case, we are creating a result set that look more like a data report. The SELECT clause subselect is highly inefficient with respect to the JOIN subselect, but it provides us with rolled, delimited data that maybe more efficient than using an application to call back into the database for each row.
Note: it’s important to pick a delimiter that should not appear in the data.
The features of this query:
- Allows for aggregated data in the select statement that is not easily joined.
- Very inefficient. A subselect is called for each row returned from the outer query
- All results must be rolled up into one row and one column.
- Has very specific, but limited use case
Subselects are a very powerful tool to keep in your engineering toolbox, but knowing their uses and fallbacks are the difference between a fast useful query and a query that will leave you waiting and frustrated.
All examples where written in Postgres and may use functions that are native to Postgres.
Source can be found at https://github.com/djchi82/TamingOfTheSubselect