In the previous articles, we discussed the main pieces to consider when designing and implementing an Extract Transform and Load pipeline.
- Submission Information Packages and High-level Validation: Data, End Points, and validation
- Canonical Transformation: Converting data to a common object base
- Business Rules: Low-level validation
- Identification: Does this data already exist in the system
- Persistence: Storing the data in a data store
In this article we will discuss the ETL kitchen sink: the requirement-driven components for legal, tracking, and business intelligence needs that are not always required in a system. These “kitchen sink” components are
- Audit: Provenance (Where did the data come from, or why did it get persisted)
- Governance: Who can access the data pipeline or make changes
- Work Queue: What is done with ‘bad’ data.
Audit data is usually the first “kitchen sink” component to consider after the main pipeline is designed. The audit component provides visibility into the data and is the first thing examined when data traceability comes into question. Most of the time it can be divided into three categories of audit data:
- Provenance: where did the data come from
- Business rules: decisions about why data was accepted, rejected, or changed
- Business intelligence: pipeline performance data
Provenance data gives traceability from persisted data to its SIPs, which we discussed in ETL part 2. There are different levels of provenance data that are typically dictated by the system’s or the organization’s legal requirements. The highest level of provenance starts with data such as
- Submitter (who submitted it)
- Submission filename
- File type
- Submission and processing dates
- Status (was it processed, did it pass or fail)
- File size
- Where the original data is stored
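The high-level fields above can be sketched as a single record. This is a minimal illustration, not a prescribed schema; the field names and the example values are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SubmissionProvenance:
    """One high-level provenance row per submitted file."""
    submitter: str                    # who submitted it
    filename: str                     # submission filename
    file_type: str                    # e.g. "xml", "csv", "json"
    submitted_at: datetime            # when the SIP arrived
    processed_at: Optional[datetime]  # None until processing finishes
    status: str                       # "passed", "failed", "pending"
    file_size: int                    # bytes
    storage_uri: str                  # where the original data is archived

record = SubmissionProvenance(
    submitter="acme-corp",
    filename="books_2023.xml",
    file_type="xml",
    submitted_at=datetime(2023, 5, 1, 9, 30),
    processed_at=datetime(2023, 5, 1, 9, 31),
    status="passed",
    file_size=10_240,
    storage_uri="s3://archive/acme/books_2023.xml",
)
print(record.status)  # passed
```

In a relational store each of these fields would be a column on the provenance table, with `storage_uri` pointing at the archived original so the raw SIP can always be retrieved.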
Often submissions are done in batches, and a single submission may contain multiple files. This may require a parent/child relationship, as seen in the diagram below. Here we see a Submitter table that has many SIPs and a SIP table that has many submission entries.
Depending on the system's requirements, the SIP may need an aggregated status based on the combined statuses of its submission entries. The system's orchestration will dictate how this is handled.
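One way such a roll-up could work is sketched below. The precedence rules (a mix of passed and failed entries yields "partial", any unfinished entry yields "processing") are assumptions standing in for whatever policy the orchestration actually encodes.

```python
def aggregate_sip_status(entry_statuses):
    """Roll individual submission-entry statuses up into one SIP status."""
    if not entry_statuses:
        return "empty"
    if any(s == "failed" for s in entry_statuses):
        # failed entries alongside passed ones -> partially ingested
        return "partial" if any(s == "passed" for s in entry_statuses) else "failed"
    if all(s == "passed" for s in entry_statuses):
        return "passed"
    return "processing"  # some entries still pending

print(aggregate_sip_status(["passed", "failed", "passed"]))  # partial
```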
Lower-level provenance usually deals with field-level data: the data's location within the SIP. For example, if a SIP contains four XML files and a database entry was created from the third file, we may need to know which file, and where in that file, the data was extracted from. This is best done through another provenance table that links to the submission entry table and carries a unique path identifier for the location in the file. Since provenance data can point to any entity, it is denormalized from the data entity tables: provenance is linked to data entities using an entity_id and an entity_type that represents the data type or table name. In the example below, entity_type would be ‘data_entity_1’ or ‘data_entity_2’.
Examples of unique path identifiers are
- XML: XPATH (/root/books/book/title)
- CSV: line number:column_number:separator_number (10:10:0)
- JSON: JSONPath (data/books/book/title)
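The three locator formats above can be produced with small helpers like these. The exact conventions (separator characters, zero- versus one-based indexing) are assumptions; each system should pick one convention and apply it uniformly so locators stay comparable.

```python
def xml_path(*segments):
    """XPath-style locator, e.g. /root/books/book/title."""
    return "/" + "/".join(segments)

def csv_locator(line, column, separator=0):
    """line:column:separator locator, e.g. 10:10:0."""
    return f"{line}:{column}:{separator}"

def json_path(*segments):
    """JSONPath-style locator, e.g. data/books/book/title."""
    return "/".join(segments)

print(xml_path("root", "books", "book", "title"))   # /root/books/book/title
print(csv_locator(10, 10))                          # 10:10:0
print(json_path("data", "books", "book", "title"))  # data/books/book/title
```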
The last part of provenance has to do with versioning. In some cases, versioned changes of data need to be kept for historical or legal reasons. Handling data versioning and provenance is an advanced topic that is out of scope for this article.
Business rule audit data keeps track of decisions made about the data, either via the business rule components or manual interactions. This data could take the form of
- Why data was accepted
- Why data was updated
- How data was identified (matched against existing data)
- Why data was promoted from a work queue to attested/gold copy
- Why data was rejected
- Comments about data
- Links to case numbers (linking database data to a CRM system)
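A business-rule audit record covering the cases above could look like the sketch below. The entity_type/entity_id pair mirrors the denormalized provenance design, so one action table serves every data entity; all field names here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditAction:
    """Denormalized business-rule (action) audit record."""
    entity_type: str            # table name or data type, e.g. "data_entity_1"
    entity_id: int              # key of the affected entity row
    action: str                 # "accepted", "rejected", "updated", "promoted"
    reason: str                 # business rule or human rationale
    actor: str                  # rule name or user id that made the decision
    case_number: Optional[str] = None  # optional link into a CRM system

action = AuditAction(
    entity_type="data_entity_1",
    entity_id=42,
    action="rejected",
    reason="duplicate of existing attested record",
    actor="rule:dedupe-check",
)
print(action.action, action.case_number)  # rejected None
```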
In the diagram below we've added an action table. It uses the same denormalized design as the provenance table, so we can track business rule (action) data for individual entities.
Ultimately, the operational and business requirements will dictate how this design is done.
Business Intelligence (BI)
In COTS workflow systems you may hear this called Business Analytics. This data covers the ETL steps themselves and often relates to a system's or a user's performance. Tracking it requires a link to the submission entry table and to the step or component of the pipeline. BI data often covers
- How long a step took
- How much data passed through a step
If user interaction is needed, the BI data may also include
- How long has it been in this step
- How many steps are outstanding for this user
- Has the step been escalated
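A minimal sketch of capturing the step-level metrics above: each pipeline step records its duration and record count, keyed by submission entry. The in-memory list stands in for a real BI table, and all names here are assumptions.

```python
import time
from contextlib import contextmanager

bi_log = []  # stand-in for a BI table keyed by submission entry and step

@contextmanager
def track_step(submission_entry_id, step_name):
    """Record how long a step took and how much data passed through it."""
    start = time.perf_counter()
    metrics = {"entry": submission_entry_id, "step": step_name, "records": 0}
    try:
        yield metrics
    finally:
        metrics["seconds"] = time.perf_counter() - start
        bi_log.append(metrics)

with track_step(42, "canonical-transform") as m:
    m["records"] = 1500  # how much data passed through this step

print(bi_log[0]["step"], bi_log[0]["records"])  # canonical-transform 1500
```

User-interaction metrics (time in step, outstanding steps, escalation) would follow the same pattern, with the clock starting when the step is assigned rather than when code begins executing.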
Typically I start considering a COTS product for ETL or Business Process Management when a system has complex BI requirements. The reason is that business intelligence often needs to be managed and manipulated by less technical users. That calls for a highly abstracted architecture, which is what lets COTS products serve multiple clients and scenarios and integrate with a variety of products. It also often requires you to trade performance and customization for configurability. The build-versus-buy discussion was briefly touched on in ETL Part 1 and is the subject of its own article.
System governance is a complex and often overlooked subject in software development. It deserves honorable mention in this series but will not be discussed in depth here. Governance requirements take into account security and system-level auditing. Often the governance component will track the following
- Access logs
- Who executed a pipeline
- Who can run a pipeline
- Who can edit a pipeline
- Who has access to the data
Governance requirements often deal with how the overall enterprise will interact with the system or its pipelines. Many times these features are handled by the overall architecture. However, in true enterprise scenarios, a single pipeline may be created and shared across multiple organizations within the enterprise, and access and edit rights may need to be maintained for that pipeline.
Governance requirements are another reason to consider a COTS product over building a custom system, for much the same reasons stated in the Business Intelligence section above: governance requires an abstract design that allows for reusable, flexible pipelines or workflows.
The work queue is a less commonly seen feature in ETL pipelines and is usually considered when data quality is of the highest concern. An ETL work queue can be designed with two levels:
- Submission entry level
- Data level
Submission Entry Level Work Queue
Submission entry level work queue records capture why an entire submission entry wasn't ingested. This level deals with
- High-level validation failures
- Data pipeline errors, such as
  - Software bugs
  - Architecture issues (the database is down)
- Any business rule that causes the data from an entire submission entry not to be ingested.
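Routing a failed entry into this level of work queue can be as simple as classifying the failure into one of the categories above. The exception types, category names, and record shape below are assumptions for illustration.

```python
class ValidationError(Exception):
    """High-level validation failure (hypothetical)."""

class BusinessRuleError(Exception):
    """Business rule rejected the entire entry (hypothetical)."""

CATEGORY_BY_ERROR = {
    "ValidationError": "high-level-validation",
    "BusinessRuleError": "business-rule",
}

def to_work_queue(entry_id, error):
    """Build a submission-entry-level work queue record from a failure."""
    return {
        "submission_entry_id": entry_id,
        # anything unmapped (software bugs, a downed database) is a pipeline error
        "category": CATEGORY_BY_ERROR.get(type(error).__name__, "pipeline-error"),
        "detail": str(error),
    }

item = to_work_queue(7, BusinessRuleError("entry rejected: missing identifiers"))
print(item["category"])  # business-rule
```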
The submission entry work queue is important to consider when the organization maintains a data partnership with the submitting organization.
Data Level Work Queue
The data level work queue tracks data that was rejected due to business rules and data quality issues. Often the attested/gold data schema is duplicated so that lower-quality data can be persisted without corrupting the attested version. This level of work queue allows end users to see what data is being ingested and act on it. Actions that can be taken include
- Promote data to attested
- Open a case and communicate the data issues to the submitters
- Analyze business rule results
- Delete data from the work queue
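The promotion action could be sketched as below. Because the attested schema is duplicated, promotion copies the row from the work-queue tables into the attested tables and records why; the table stand-ins and field names are assumptions.

```python
# in-memory stand-ins for the duplicated work-queue and attested/gold tables
work_queue = {101: {"name": "Widget", "quality": "low"}}
attested = {}
audit_trail = []

def promote(entity_id, reason, actor):
    """Move a work-queue row into the attested schema and audit the decision."""
    row = work_queue.pop(entity_id)   # remove from the lower-quality copy
    attested[entity_id] = row         # persist into the attested/gold copy
    audit_trail.append({"entity_id": entity_id, "action": "promoted",
                        "reason": reason, "actor": actor})

promote(101, "manually verified against source", "curator-1")
print(101 in attested, 101 in work_queue)  # True False
```

Note the audit entry: promotion is exactly the kind of decision the business-rule audit component from earlier in this article should capture.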
A data level work queue is important when dealing with a single-source data system; see ETL Part 5 for more details on data source types. In a single-source scenario, two different submitters may be trying to modify the attested copy of an entity. If one submitter already has an attested copy ingested, the work queue can be used to track the second submitter's data. Analysts can then see what the second submitter has provided, enabling intelligent data curation.
In this article, we took a look at the “kitchen sink” components available for ETL systems. These included components that audit data, provide visibility into data quality, and control access to the pipelines. These components are unnecessary for most ETL pipelines, but knowing what they are and how they fit into an ETL system can help with design and implementation.
Note from the author
Thank you for reading this lengthy series on ETL. The approach was to write down what I have learned from working with various forms of ETL systems over the last 12 years. Over the year spent writing this series, how and what I communicate has evolved. I hope this information helps new ETL projects think through how they want to implement their pipelines and avoid some of the growing pains my teams have experienced. If you have any questions or would like to see a topic covered in more depth, please reach out to me via Twitter. Below are links to each of the six articles in this series.
- Extract Transform and Load Overview
- Extract Transform and Load Part 2 – Submission Information Packages and High-level Validation
- Reusability and Sustainability Using a Canonical – Enterprise Transform and Load Part 3
- Business Rules – Extract Transform and Load Part 4
- Data Identification and Persistence – Extract Transform and Load Part 5
- Kitchen Sink: Audit/Governance/Work Queue – Extract Transform and Load Part 6