Kitchen Sink: Audit/Governance/Work Queue – Extract Transform and Load Part 6

In the previous articles, we discussed the main pieces to consider when designing and implementing an Extract Transform and Load pipeline.

Submission Information Packages and High-level Validation: Data, End Points, and validation
Canonical Transformation: Converting data to a common object base
Business Rules: Low-level validation
Identification: Does this data already exist in the system
Persistence: Storing the data in a data store

I this article we will discuss the ETL kitchen sink. These are the requirements-based components for legal, tracking, and business intelligence that are not always needed in a system. These “kitchen sink” components are

Audit: Provenance (Where did the data come from, or why did it get persisted)
Governance: Who can access the data pipeline or make changes
Work Queue: What is done with ‘bad’ data.

Audit

Audit data is usually the first “kitchen sink” component to consider after the main pipeline is designed. The audit component allows for visibility into the data. It is the first thing that is looked at when data traceability comes into question. Most of the time it can be divided into 3 sections of audit data

Provenance: where did the data come from
Business rules: decisions about why data was accepted, rejected, or changed
Business intelligence: pipeline performance data

Provenance

authenticity text collage in red and white text with black background

Provenance data gives traceability from persisted data to its SIPs, which we discussed in ETL part 2. There are different levels of provenance data that are typically dictated by the system’s or the organization’s legal requirements. The highest level of provenance starts with data such as

Submitter (who submitted it)
Submission filename
File type
Submission and processing data
Status (was it processed, did it pass or fail)
File size
Where the original data is stored

Often submissions are done in batches and a single submission may contain multiple files. This may require a child relationship as is seen in the diagram below. Here we see a Submitter table that has many SIPs and an SIP table that has many submission entries.

flow chart from submitter to submission_entry

Depending on the systems requirements the SIP may need to have an aggregated status that is based on the combined statuses of the submission entries. The systems orchestration will dictate how this is handled.

Lower level provenance usually deals with field-level data. This is the data’s location in the SIP. For example, if a SIP has 4 XML files and a database entry was created from the 3^rd XML file. We may need to know what file and the location in that file the data was extracted from. This is best done through another provenance table that contains a link to the submission entry table along with a unique path identifier to the location in the file. Since provenance data can point to any entity, it is denormalized from the data entity tables. Provenance is linked to data entities using the entity_id and an entity_type that represents that data type or table name. In the below example entity_type would be ‘data_entity_1’ and ‘data_entity_2’.

Example of unique path identifiers are

XML: XPATH (/root/books/book[1]/title)
CSV: line number:column_number:separator_number (10:10:0)
JSON: JPATH (data/books/book[1]/title)

The last part of provenance has to do with versioning. In some cases, versioned changes of data need to be kept for historical or legal reasons. Handling data versioning and provenance is an advanced topic that is out of scope for this article.

Business Rules

Business rule audit data keeps track of decisions made about the data, either via the business rule components or manual interactions. This data could take the form of

Why data was accepted
Why data was updated
How data was identified (matched against existing data)
Why data was promoted from a work queue to attested/gold copy
Why data was rejected
Comments about data
Links to case numbers (linking database data to a CRM system)

In the below diagram we’ve added an action table, it uses the same denormalized design as the provenance table, so we can track business rule (action) data for individual entities.

Ultimately the operation and business requirements will dictate how this design is done.

Business Intelligence (BI)

In COTS workflow systems you may hear this called Business Analytics. This data deals with the ETL steps. Often this data relates to a system’s or user’s performance. Tracking this data will need a link to the submission entry table and the step or component of the pipeline. Often BI data deals with

How long a step took
How much data passed through a step.

If a user interaction is needed then the BI may include

How long has it been in this step
How many steps are outstanding for this user
Has the step be escalated

Typically I start considering a COTS product for ETL or Business Process Management when complex BI requirements exist for a system. The reason being, Business Intelligence often requires a less technical user to manage and manipulate ingest workflows. This requires a highly abstracted architecture, which allows COTS products to work for multiple clients, scenarios, and integrate with a variety of products. This often requires you to trade performance and customization for configuribility. The Build verse Buy discussion was briefly touched on in ETL Part 1 and is the subject of its own article.

Governance

System governance is a complex and often overlooked subject in software development and deserves honorable mention in this series, but will not be discussed in-depth here. Governance requirements take into consideration security and system-level auditing of the system. Often times government component will track the following

Access logs
Who executed a pipeline
Who can run a pipeline
Who can edit a pipeline
Who has access to the data

Governance requirements often deal with how the overall enterprise will interact with the system or pipelines. Many times these features are taken care of by the overall architecture. However, in true enterprise scenarios, a single pipeline can be created and share across multiple organizations within the enterprise. Access and edit rights may need to be maintained for that pipeline.

Governance requirements are another reason to consider a COTS product over building a custom system. The reasons are very similar to the reasons stated in the Business Intelligence section above. Governance requires an abstract design to allow for reusable, flexible pipelines or workflows.

Work Queue

people in line next to a building with a small wired fence and small bushes

The work queue is a less commonly seen feature in ETL pipelines and is often considered when the data quality is of the highest concern. An ETL Work Queue can be designed to contain two levels.

Submission entry level
Data level

Submission Entry Level Work Queue

Submission entry level work queue records deal with why an entire submission entry wasn’t ingested. This level deals with

High-level validation failures
Data pipeline errors such as
- Software bugs
- Architecture issues (database is down)
Any business rule that causes the data from an entire submission entry not to be ingested.

Submission entry work queue is important to consider when the organization maintains a data partnership with the submitting organization.

Data Level Work Queue

Data level work queue tracks data that was rejected due to business rules and data quality issues. Often the attested/gold data schema is duplicated to allow for lower quality data to be persistent without corrupting the attested version. This level of work queue allows for end-users to see what data is being ingested and act on it. Action that can be taken are

Promoting data to attested
Open a case and communicating the data issues to the submitters
Analyze business rules results
Delete data from the work queue

A data level work queue is important when dealing with a single source data system, see ETL part 5 for more details on Data Source Types. In a single-source scenario, two different submitters may be trying to modify an attested copy of an entity. If one submitter already has an attested copy ingested, the work queue can be used to track the second submitter’s data. This allows analysis to see what data the second submitter has submitted and allows for intelligent data curation.

Summary

In this article, we took a look at the “kitchen sink” components available for ETL systems. These included components to audit data, allow visibility into data quality, as well as controlling access to the pipelines. These components are unnecessary for most ETL pipelines, but knowing about them and how they fit into an ETL system can help with design and implementation.

Note from the author

male cartoon in a collared shirt and tie with his thumbs up sitting at a desk in front of laptop

Thank you for reading this lengthy series on ETL. The approach was to put down what I have learned from working with various forms of ETL systems over the last 12 years. Over the last year of writing this series the approach of how and what I communicate has evolved. The hope is that this information will help new ETL projects think about how they want to implement the pipeline and avoid some of the growing pains that my teams have experienced. Below are the links to each of the 6 articles in this series.

Extract Transform and Load Overview
Extract Transform and Load Part 2 – Submission Information Packages and High-level Validation
Reusability and Sustainability Using a Canonical – Enterprise Transform and Load Part 3
Business Rules – Extract Transform and Load Part 4
- Companion article: Basic Rules Engine Design Pattern
Data Identification and Persistence – Extract Transform and Load Part 5
Kitchen Sink: Audit/Governance/Work Queue – Extract Transform and Load Part 6