Why Data Vault is the best model for data warehouse automation

Understand the automation potential of the different modeling approaches.

Michael Olschimke, CEO, Scalefree

Introduction

This paper describes and breaks down the pros and cons of different data modeling approaches. It explores the limitations of models such as Inmon and Kimball, and the comparative advantages of the Data Vault 2.0 model. It outlines how Data Vault 2.0 facilitates automation and speeds development and deployment through pattern-based structures and templates, explaining why it is the best approach for enterprise data warehouse (EDW) automation. It also describes how VaultSpeed tools support this model.

Overview of data warehouse approaches

Data warehouse basics

EDW systems are intended to process source data into useful information. The target model for the information is typically defined by the business user and is often a dimensional model, such as a star schema or snowflake schema, with facts and their dimensions. Business users select the model according to their information needs, for example when using a dashboard application.

The data model, however, is defined by the data warehouse architects and often selected based on traditional design decisions, known as bottom-up and top-down designs. These traditional models are limited in their automation capabilities.

Automation accelerates and standardizes the loading and modeling of the source data in the data warehouse's data layer, which serves as the foundation for the next layer of the enterprise data warehouse, the information layer. Standardization makes it possible to support analysis over time and to provide consistent views to different data scientists.

More modern approaches have been designed for the automation of the enterprise data warehouse and have had great success in the industry.

Traditional data warehouse modeling

There are two popular traditional options for modeling the data warehouse:

  • the Inmon data warehouse data model
  • the Kimball federated star schema.

Both options use a data mart for information delivery. Often, these data marts (also known as information marts in other architectures) are modeled using dimensional models, such as star schemas or snowflake schemas.

Star and snowflake schemas are organized around fact entities that provide information that can be aggregated, for example, about retail transactions, call records, flights, and so on. These transactions or events provide measure values, such as, respectively, revenues, call durations and flight durations. They also provide dimensional attributes or dimension references that can be used to break down the aggregated measure values by dimensions such as products, customers or airport locations. The facts, their measures and the dimensions are defined by the business user in an information requirement.

The difference between star and snowflake schemas is quite simple. In a star schema, only fact entities can reference dimension entities, while in a snowflake schema, dimensions can also reference other dimensions.

For example, consider a fact entity that refers to products and their categories.

In a star schema, the fact entity references both the product dimension and the product category dimension.

Star schema

In a snowflake schema, the fact entity references only the product dimension. The product dimension references the product category dimension.

Snowflake schema
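To make the structural difference concrete, here is a minimal Python sketch (the types and field names are illustrative, not from any specific implementation): in the star variant the fact carries both dimension references, while in the snowflake variant the category is reached through the product dimension.

```python
from dataclasses import dataclass

# Star schema: the fact row references BOTH dimensions directly.
@dataclass
class StarFactSale:
    product_key: int           # -> product dimension
    product_category_key: int  # -> product category dimension
    revenue: float

# Snowflake schema: the fact row references only the product dimension;
# the product dimension in turn references its category.
@dataclass
class SnowflakeDimProduct:
    product_key: int
    product_category_key: int  # -> product category dimension
    name: str

@dataclass
class SnowflakeFactSale:
    product_key: int  # category is reached via the product dimension
    revenue: float
```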

Other models for the data mart are possible. For example, a commonly used model is a fully denormalized fact entity that contains all dimensional attributes; it’s an easy-to-use model, especially with spreadsheet-like applications.

Inmon data warehouse data model

Inmon follows a top-down approach to data warehousing.

  • The first component modeled is the core data warehouse model that describes the enterprise based on operational corporate data.
  • The data warehouse model uses a normalized data model in the third normal form (3NF), similar to the operational source models, but adds effectivity timelines to the data model for data historization purposes.
  • Once that data model has been established, individual data marts for reporting purposes are derived – and sourced – from the data warehouse model.

Kimball federated star schema

In the Kimball schema, the modeling starts with the individual data mart in a bottom-up approach.

  • The individual data marts use conformed dimensions such as customer and product dimensions.
  • These dimensions are used by multiple data marts and integrate the data marts into the so-called federated star schema via the information bus.
  • The sum of all data marts is then considered the data warehouse.

Data Vault 2.0 for Enterprise Data Warehousing

The Data Vault 2.0 System of Business Intelligence provides concepts and techniques to build EDW solutions: an agile methodology, a reference architecture, best practices implementation and an extensible model.

The model is based on three basic entity types: hubs, links and satellites. They represent the three foundational components of any data, including enterprise data — all data consists of business keys, relationships between business keys, and descriptive data. For example, an employee has an employee number, which is the business key. It is related to an employer, which is also identified by a business key (e.g., an organization number) and there is a relationship between the two. Both business keys and their relationship can consist of descriptive data, for example, the employee’s first and last name, the employer’s organizational name, and the beginning and end date of the employment describing the relationship.

In the Data Vault 2.0 model, hubs capture a distinct list of business keys, links capture distinct lists of relationships and satellites capture any change to the descriptive data in delta records.
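As an illustration of these three loading rules, the following Python sketch models hub, link and satellite tables as in-memory dictionaries. The table layout, the MD5-based hash keys, and the `load_employee` routine are simplified assumptions for illustration, not a production implementation.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Hash key over one or more business keys (MD5 is one common convention)."""
    return hashlib.md5("||".join(business_keys).upper().encode()).hexdigest()

# In-memory stand-ins for hub, link and satellite tables (illustration only).
hub_employee: dict[str, dict] = {}        # hub key -> business key record
link_employment: dict[str, dict] = {}     # link key -> referenced hub keys
sat_employee: dict[str, list[dict]] = {}  # hub key -> list of delta records

def load_employee(emp_no: str, org_no: str, first: str, last: str) -> None:
    now = datetime.now(timezone.utc)
    hk_emp, hk_org = hash_key(emp_no), hash_key(org_no)

    # Hub: insert the business key only if it is new (distinct list of keys).
    hub_employee.setdefault(hk_emp, {"employee_no": emp_no, "load_dts": now})

    # Link: insert the relationship only if it is new (distinct list).
    link_employment.setdefault(hash_key(emp_no, org_no), {
        "hk_employee": hk_emp, "hk_employer": hk_org, "load_dts": now})

    # Satellite: append a delta record only when the descriptive data changed.
    history = sat_employee.setdefault(hk_emp, [])
    attrs = {"first_name": first, "last_name": last}
    if not history or history[-1]["attrs"] != attrs:
        history.append({"attrs": attrs, "load_dts": now})
```

Loading the same record twice leaves all three structures unchanged; only a change in the descriptive data produces a new satellite delta record.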

The Data Vault 2.0 model is sophisticated. It consists of roughly 18 entity types, depending on the definition. However, all these entity types are derived from the base entities (hubs, links, satellites) described above. They are called special entity types and are used to better capture specific enterprise data, such as transactional records, business effectivity timelines, or multi-active data (for example, when employees have multiple phone numbers assigned). Because the special entity types are based on the three base types, there is a clear path to learning the complete model.

The Data Vault model is used not for information delivery but for the EDW layer, which is the data layer. The information delivery model, typically a star or snowflake dimensional model, is derived from the Data Vault model.
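As a sketch of such a derivation (the in-memory rows and field names here are hypothetical), a product dimension can be assembled by joining a hub's business keys with the latest descriptive record from its satellite:

```python
# Hypothetical hub and satellite rows; in practice these would be tables.
hub_product = [{"hub_key": "h1", "product_no": "P-100"}]
sat_product = {"h1": [{"name": "Widget", "load_dts": 1},
                      {"name": "Widget Pro", "load_dts": 2}]}

def derive_dim_product() -> list[dict]:
    """Derive a (type 1) product dimension from the hub and its satellite."""
    dim = []
    for row in hub_product:
        # Pick the most recent satellite delta record for this business key.
        latest = max(sat_product[row["hub_key"]], key=lambda r: r["load_dts"])
        dim.append({"product_key": row["hub_key"],
                    "product_no": row["product_no"],
                    "name": latest["name"]})
    return dim
```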

Designed for EDW

The reason why the Data Vault model is a good fit for EDW is quite simple: unlike the 3NF approach (designed for operational systems) or dimensional models (designed for information delivery), the Data Vault model was designed for enterprise data warehousing. As such, it is not a very good fit for operational systems, even though it has occasionally successfully been used for operational purposes.

Advantages of Data Vault 2.0 for EDW

  • The Data Vault 2.0 model is simple by design and it’s easy to extend the model over time. It was designed for agile environments where a project team could start with a small scope and gradually extend the data model by additional entities — sprint by sprint. Even a zero-change-impact scenario can be achieved whereby existing entities are never touched when adding new entities. This reduces the need for testing and for required changes to downstream entities, such as in the dimensional model. It also simplifies the deployment of these additions.

  • The model can spread across multiple environments. Some data could live in an on-premise installation, while other data might be placed in a cloud environment. Multi-cloud scenarios, different database technologies, and even the integration of NoSQL databases with relational database technologies have all been done before, very successfully.

  • The model can also integrate structured data, often found in operational source systems based on relational database technologies, semi-structured data, such as JSON and XML data from web services, and unstructured data, such as PDF documents.

  • Besides ingesting data via traditional batch loads, e.g. once a night, other loading cycles are also supported. Data can be loaded as fast as it becomes available, including Change Data Capture (CDC) delivery, near-real-time loads, or data streaming. Instead of processing the data in separate speed and batch layers (as in the popular Lambda architecture), the Data Vault 2.0 model integrates all data sources regardless of the loading cycle. Real-time data is seamlessly integrated with batch data and can be used in a combined dimensional model during information delivery.

  • Data Vault has been developed for classified environments and so supports sophisticated security requirements. For example, the application of cell-level security, including both row-level and column-level security, is possible by design, even if the underlying database technology supports neither natively. Organizations use these features to implement solutions that include the application of an access control list (ACL) for dynamic adjustments of usage rights on the data set.

  • The model supports the removal or modification of records from the EDW to meet privacy requirements. Records to be deleted or reduced could be consumer records as required by the EU General Data Protection Regulation (GDPR), patient data as in the US Health Insurance Portability and Accountability Act (HIPAA), and similar regulations. In the best case, organizations want to preserve some non-personal data, thus reducing records instead of fully deleting them. This is easily supported by the Data Vault model. Data can be divided by privacy classes and their individual data retention periods respected.
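A minimal sketch of record reduction, assuming a hypothetical classification of satellite attributes into personal and non-personal:

```python
# Hypothetical privacy classification of satellite attributes.
PERSONAL_ATTRS = {"first_name", "last_name", "email"}

def reduce_record(record: dict) -> dict:
    """Strip personally identifiable attributes; keep non-personal data."""
    return {k: v for k, v in record.items() if k not in PERSONAL_ATTRS}

satellite_rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "department": "R&D"},
]
# The reduced rows preserve non-personal data (e.g. department) for analytics.
reduced = [reduce_record(r) for r in satellite_rows]
```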

How DV 2.0/VaultSpeed facilitate automation

The most obvious advantage of automation is increased productivity and the resulting higher agility of the project team. Instead of building Data Vault entities by hand, automation allows teams to produce more entities in the same time span.

This section describes how VaultSpeed, based on Data Vault 2.0, offers extensive automation possibilities and has the potential for automating even more aspects of the EDW.

Standardized structures

An additional advantage of the Data Vault model comes from its standardized structures: all hubs look alike, as do all links and satellites.

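Because all hubs share the same structure, a single template suffices to generate the loading statement for any hub; only names differ. The following Python sketch uses hypothetical table and column names to show the idea.

```python
# One template covers every hub load; only the names are parameterized.
HUB_LOAD_TEMPLATE = """
INSERT INTO {hub} (hub_key, {business_key}, load_dts, record_source)
SELECT DISTINCT stg.hub_key, stg.{business_key}, stg.load_dts, stg.record_source
FROM {staging} AS stg
WHERE NOT EXISTS (
    SELECT 1 FROM {hub} AS h WHERE h.hub_key = stg.hub_key
)
""".strip()

def render_hub_load(hub: str, business_key: str, staging: str) -> str:
    """Render the hub loading statement for one specific hub."""
    return HUB_LOAD_TEMPLATE.format(hub=hub, business_key=business_key,
                                    staging=staging)

sql = render_hub_load("hub_employee", "employee_no", "stg_employee")
```

Generating loads for links and satellites works the same way, which is exactly what makes the model so amenable to tool-based automation.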

Conclusion

The pattern-based design of Data Vault 2.0, especially of the model, greatly supports the automation of the enterprise data warehouse and in turn, increases the productivity of development teams. The emergence of tools such as VaultSpeed has led to the increased adoption of Data Vault in organizations of all sizes, in all regions and across all industries.

With its unique low-code / no-code features, VaultSpeed has the potential to automate more aspects of the EDW, including parts of the business logic. This trend is already visible in new features such as VaultSpeed Studio, where repetitive business logic, for example for currency conversions, can be automated.

With the automation capabilities of a modern tool, the extensibility of the Data Vault 2.0 model, and the ability to implement sophisticated and demanding user requirements, including security and privacy regulations, Data Vault 2.0 has a bright future for developing enterprise data warehouse solutions and we expect even more organizations to adopt the concept.

|                          | Inmon                                             | Kimball                                          | Data Vault                                              |
|--------------------------|---------------------------------------------------|--------------------------------------------------|---------------------------------------------------------|
| Model                    | 3NF                                               | Star schema                                      | Data Vault                                              |
| Original purpose         | Operational systems                               | Information delivery                             | Data warehousing                                        |
| Simplicity               | Medium                                            | Simple                                           | Medium                                                  |
| Scalability              | Limited                                           | Limited                                          | Any volume, any speed                                   |
| Multiple environments    | Limited                                           | Limited                                          | Multiple (cloud, on-premise)                            |
| Extensibility            | Issues with dependencies                          | Easy                                             | Easy                                                    |
| Agility                  | Initially low, improves in the long run           | Quick initial solutions, issues in the long run  | Quick initial solutions, keeps agility in the long run  |
| Support                  | Limited support by BI frontends                   | Preferred model by BI frontends                  | Not supported, but star schema can be derived easily    |
| Real-time versus batch   | Issues with real-time due to loading dependencies | No real-time data, no aggregation on the fly     | Supports any ingestion method                           |
| Automation compatibility | Limited                                           | Limited                                          | By design                                               |

About the author

This whitepaper was authored by Michael Olschimke. Michael has more than 20 years of experience in Information Technology. For the past eight years, he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. He has consulted for a number of clients in the automotive, insurance, banking, and non-profit fields.

His work includes research on massively parallel processing (MPP) systems for building artificial intelligence (AI) systems for the analysis of unstructured data. He co-authored the book “Building a scalable data warehouse with Data Vault 2.0,” which explains the concepts of Data Vault 2.0, a methodology to deliver high-performance, next-generation data warehouses.

Michael holds a Master of Science in Information Systems from Santa Clara University in Silicon Valley, California. Michael is a co-founder and one of the Chief Executive Officers (CEO) of Scalefree, where he is responsible for the business direction of the company.