Blog

Data Identification and Persistence – Extract Transform and Load Part 5

In the last installment of the Extract Transform and Load series, we discussed the types of Business Rules. In this post, we will discuss the final two steps that have to do with data storage: Data Identification and Persistence.

Identification

Extract Transform and Load - identification barcode

The identification step has to do with matching incoming data with data already persistent in the data store. While this step can be done before the business rules or after, it is best done after the canonical is created. There are instances when this step isn’t needed at all such as

  • Data scraped from websites
  • Data that is unique to a transaction (transient data)
  • Data snapshots or time synced data

Source Types

Three typical cases for identification are Single, Submitter, and Transaction source.

Single Source

Single source refers to a single entity for the entire system. This is typically the case when the data models the world or the system is keeping a “gold” copy of each entity. Take the music industry for example: If the record label Sony Music submits “This Charming Man” by The Smiths and then Rough Trade also submits “This Charming Man” by The Smiths, the database should only save one record of the “This Charming Man”. This means that Sony’s copy is persisted first. Then Rough Trade’s submission should be identified and a new row should not be created.

In this source type, if there are multiple submitters, as in the case above, data ownership needs to be taken into consideration. Rough Trade owns the recording, but Sony submitted the recording first. Some questions to think about are:

  • What rules dictate that Rough Trade can update the recording?
  • How does the system know who actually owns the recording?
  • What happens to Rough Trade’s data if they aren’t allowed to update the row?

Pros for Single Source

  • Easily conceptually to understand
  • Models the real world
  • Reduces the number of database rows
  • A good model for dissemination

Cons for Single Source

  • Complicated update logic
  • Easy to duplicate data if identification fields aren’t well thought out

Submitter Source

Submitter source refers to a single entity for the system per submitter. This is where the same logical entity is duplicated in the database per submitter. In other words, entity uniqueness is determined by the entity’s identifier fields and who submitted it. In The Smiths example from above, each submitter would get its own unique row, resulting in two rows.

In this case, no consideration is needed to determine which submitter owns the data. However, identification and update logic need to be determined on a per submitter basis.

Pros

  • No ownership rules
  • Complicate identification fields

Cons

  • More difficult dissemination model
  • More database rows

Transaction Source

Transaction Source refers to a single row per submission. This approach is best when the system deals with request-based data. Imagine a system that coordinates communication between multiple parties on a particular data set.

Using the examples above with both Sony and Rough Trade, imagine Sony submits a second transaction containing the same recording, “This Charming Man”. This would result in 3 database rows, 1 for Rough Trade and 2 for Sony. Without getting too much into the details, the uniqueness of the data will typically be based around a submission’s identifier.

Pros

  • No ownership rules
  • No identification needed
  • Great for systems that broker requests between multiple parties
  • Duplicate data concerns are irrelevant

Cons

  • Complicated dissemination model
  • Most database rows

Ultimately the system’s requirements will dictate which source type to use, but the pros and cons will help you think about the gotchas during development.

How to identify data

In well-established domains, industry standard identifiers are assigned to items such as ISBN (Books), ISRC (Musical Recordings), and UPC (Products). Often times these identifiers are good enough, but depending on the industry standard identifiers may be misleading. Multiple agencies can assign different identifiers to the same entity.

Some domains can use a combination of entity fields to identify data. For example, a recording could be identified by title and artist. The trick with this is submitters can send different spellings of a title or artist. For Example “Telephone” by Lady Gaga. One submitter may say the artist is “Lady Gaga feat. Beyonce” another may send it as “Lady Gaga ft. Beyonce”, another version might be “Lady Gaga, Beyonce” or “Lady Gaga featuring Beyonce Knowles”. To further confuse the situations the title might be “Telephone feat Beyonce” by Lady Gaga.

Depending on what level the system needs to track the logic entities, data may have versions, think edits, remastering, remixes, release dates.

There is no definitive way to solve these issues, which is what makes ETL systems tricky to implements. Ultimately the answers will be determined by the domain and the requirements of the system. This is why it is important to have a domain expert available during requirement gathering and design.

Persistence

Extract Transform and Load Persistence

In this step, the data is saved into the data store. Now that the data is in a canonical form (see part 3), simple routines can be written to transform the canonical into object-relational models or persistent directly.

The thing to watch out for is handling creates, updates, and deletes as directed from the identification layer and business rules.

The other thing to consider is if data needs to be syndicated to multiple data stores. Say you have a relational database and a NoSQL data or search index. Design patterns and solutions around persistence are well known and will not be looked at as part of this series.

Summary

The final two steps, identification, and persistence can make or break the data chain. Identification handles the deduplication of data and informs the system how it should be persisted. Understanding how data will be used by the system whether the data should be Single, Submitter, Transaction Source. Once data is identified, persistence is easy.

Categories: Blog

Tags: , , , , ,

Ryan Van Fleet
11 Jul, 2019