In the last installment of the Extract Transform and Load series, we discussed the types of Business Rules. In this post, we will discuss the final two steps that have to do with data storage: Data Identification and Persistence.
The identification step has to do with matching incoming data with data already persisted in the data store. While this step can be done before or after the business rules, it is best done after the canonical is created. There are instances when this step isn't needed at all, such as:
- Data scraped from websites
- Data that is unique to a transaction (transient data)
- Data snapshots or time-synced data
Three typical cases for identification are Single Source, Submitter Source, and Transaction Source.
Single Source refers to a single row per logical entity, no matter how many parties submit it. In this source type, if there are multiple submitters, as in the case above, data ownership needs to be taken into consideration. Rough Trade owns the recording, but Sony submitted it first. Some questions to think about are:
- What rules dictate that Rough Trade can update the recording?
- How does the system know who actually owns the recording?
- What happens to Rough Trade’s data if they aren’t allowed to update the row?
Pros for Single Source
- Conceptually easy to understand
- Models the real world
- Reduces the number of database rows
- A good model for dissemination
Cons for Single Source
- Complicated update logic
- Easy to duplicate data if identification fields aren’t well thought out
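The update logic above can be sketched in a few lines. This is a toy, in-memory illustration, not from the post: the store, field names, and first-submitter-owns rule are all assumptions chosen to show where ownership checks complicate a Single Source upsert.

```python
# Sketch of Single Source identification: one row per logical entity.
# The store, field names, and ownership rule here are hypothetical.

store = {}  # identity key -> row

def identity_key(recording):
    # Uniqueness is based solely on the entity's identification fields.
    return (recording["title"].lower(), recording["artist"].lower())

def upsert(recording, submitter):
    key = identity_key(recording)
    existing = store.get(key)
    if existing is None:
        # The first submitter to send the entity becomes the provisional owner.
        store[key] = dict(recording, owner=submitter)
        return "created"
    if existing["owner"] != submitter:
        # Ownership rule: only the owner may update the row.
        return "rejected"
    existing.update(recording)
    return "updated"
```

Under this (assumed) rule, Sony's first submission creates the row, and Rough Trade's later submission of the same recording is rejected, which is exactly the third question above: what happens to Rough Trade's data then?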
Submitter Source refers to a single row per entity, per submitter. This is where the same logical entity is duplicated in the database for each submitter. In other words, entity uniqueness is determined by the entity's identifier fields and who submitted it. In The Smiths example from above, each submitter would get its own unique row, resulting in two rows.
In this case, no consideration is needed to determine which submitter owns the data, because each submitter's data lives in its own row.
Pros for Submitter Source
- No ownership rules
Cons for Submitter Source
- Complicated identification fields
- More difficult dissemination model
- More database rows
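The difference from Single Source comes down to the identity key. A minimal sketch, assuming the same hypothetical in-memory store as before, with the submitter folded into the key:

```python
# Sketch of Submitter Source identification: the identity key includes
# who submitted the data, so each submitter gets its own row.
# The store and field names are illustrative, not from the post.

store = {}  # identity key -> row

def identity_key(recording, submitter):
    # Uniqueness = the entity's identifier fields AND who submitted it.
    return (recording["title"].lower(), recording["artist"].lower(), submitter)

def upsert(recording, submitter):
    key = identity_key(recording, submitter)
    created = key not in store
    store[key] = dict(recording, submitter=submitter)
    return "created" if created else "updated"
```

Sony and Rough Trade each submitting "This Charming Man" now yields two rows, and no ownership check is needed when either submitter sends an update.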
Transaction Source refers to a single row per submission. This approach is best when the system deals with request-based data. Imagine a system that coordinates communication between multiple parties on a particular data set.
Using the examples above with both Sony and Rough Trade, imagine Sony submits a second transaction containing the same recording, “This Charming Man”. This would result in 3 database rows, 1 for Rough Trade and 2 for Sony. Without getting too much into the details, the uniqueness of the data will typically be based around a submission’s identifier.
Pros for Transaction Source
- No ownership rules
- No identification needed
- Great for systems that broker requests between multiple parties
- Duplicate data concerns are irrelevant
Cons for Transaction Source
- Complicated dissemination model
- Most database rows
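A sketch of the Transaction Source approach, again against a hypothetical in-memory store. The UUID-based submission identifier is an assumption for illustration; the point is that uniqueness hangs off the submission, not the entity:

```python
# Sketch of Transaction Source persistence: one row per submission,
# keyed by a submission identifier rather than by entity fields.
import uuid

store = {}  # submission id -> row

def persist(recording, submitter):
    # Every submission gets its own identifier, so no matching is needed.
    submission_id = str(uuid.uuid4())
    store[submission_id] = dict(recording, submitter=submitter)
    return submission_id
```

Replaying the example, one Rough Trade submission and two Sony submissions of "This Charming Man" produce three rows, matching the count above.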
Ultimately, the system's requirements will dictate which source type to use, but the pros and cons above will help you think through the gotchas during development.
How to identify data
In well-established domains, industry-standard identifiers are assigned to items, such as ISBN (books), ISRC (musical recordings), and UPC (products). Oftentimes these identifiers are good enough, but they aren't always supplied with the incoming data.
Some domains can use a combination of entity fields to identify data. For example, a recording could be identified by title and artist. The trick is that submitters can send different spellings of a title or artist. Take "Telephone" by Lady Gaga: one submitter may say the artist is "Lady Gaga feat. Beyonce", another may send it as "Lady Gaga ft. Beyonce", and other versions might be "Lady Gaga, Beyonce" or "Lady Gaga featuring Beyonce Knowles". To further confuse the situation, the title might be "Telephone feat Beyonce" by Lady Gaga.
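One common mitigation is to normalize artist credits before matching. The toy normalizer below is my own sketch, not from the post, and it only handles the "feat."/"ft."/"featuring" variants from the example; real systems typically layer fuzzy matching on top of rules like this (and it still would not unify "Beyonce" with "Beyonce Knowles").

```python
# Toy sketch of normalizing artist credits before identity matching.
# Only covers the featured-artist variants discussed above.
import re

def normalize_artist(raw):
    s = raw.lower()
    # Unify the many ways a featured artist can be credited.
    s = re.sub(r"\b(featuring|feat|ft)\.?\s*", ",", s)
    # Split on commas, trim, drop empties, and sort so order doesn't matter.
    parts = sorted(p.strip() for p in s.split(",") if p.strip())
    return ", ".join(parts)
```

With this in place, "Lady Gaga feat. Beyonce", "Lady Gaga ft. Beyonce", "Lady Gaga, Beyonce", and "Lady Gaga featuring Beyonce" all collapse to the same key.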
Depending on what level the system needs to track the logical entities, data may have versions (think edits, remasterings, remixes, and release dates).
There is no definitive way to solve these issues, which is what makes ETL systems tricky to implement. Ultimately, the answers will be determined by the domain and the requirements of the system. This is why it is important to have a domain expert available during requirements gathering and design.
Persistence
In this step, the data is saved into the data store. Now that the data is in a canonical form (see part 3), simple routines can be written to transform the canonical into object-relational models or persist it directly.
The thing to watch out for is handling creates, updates, and deletes as directed by the identification layer and business rules.
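That hand-off can be made explicit by having the identification layer emit an action along with the canonical. A minimal sketch, with a hypothetical Action enum and a plain dict standing in for the data store:

```python
# Sketch of routing the identification layer's decision to the right
# persistence operation. The Action enum and store shape are hypothetical.
from enum import Enum

class Action(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"

def apply_action(store, action, key, canonical=None):
    # The identification layer decides the action; persistence just executes it.
    if action is Action.CREATE:
        store[key] = canonical
    elif action is Action.UPDATE:
        store[key].update(canonical)
    elif action is Action.DELETE:
        del store[key]
```

Keeping the decision (identification) separate from the execution (persistence) makes each piece easy to test on its own.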
The other thing to consider is whether data needs to be syndicated to multiple data stores, say a relational database plus a NoSQL data store or search index. Design patterns and solutions around persistence are well known and will not be covered as part of this series.
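Even so, the fan-out itself is small enough to sketch. The two store classes below are hypothetical stand-ins, not real drivers; the point is that a single canonical can be handed to any number of targets, each persisting the shape it needs:

```python
# Sketch of syndicating one canonical record to multiple data stores.
# Both store classes are hypothetical stand-ins for real backends.

class RelationalStore:
    def __init__(self):
        self.rows = []

    def save(self, canonical):
        # A real implementation would map the canonical to ORM models.
        self.rows.append(canonical)

class SearchIndex:
    def __init__(self):
        self.documents = []

    def save(self, canonical):
        # A search index might only need a denormalized subset of fields.
        self.documents.append({"title": canonical["title"]})

def syndicate(canonical, stores):
    # Fan the canonical out to every configured data store.
    for store in stores:
        store.save(canonical)
```

Because each target exposes the same `save` interface, adding another store later means writing one more class, not touching the pipeline.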
The final two steps, identification and persistence, complete the ETL process. As with the steps before them, the right choices depend on your domain and your system's requirements.