
A Robust Foundation for Data Products

Daniel Jimenez, Content Marketing Manager
Jonas De Keuster, VP Product

Why data products?

Data products are an essential part of any data-driven organization. They give business operators the context they need to make informed decisions. For instance, a marketer might want to know which campaigns are driving revenue - there’s a data product for that. A risk analyst might want to assess which claims are high-risk or potentially fraudulent - there’s a data product for that, too. Across all business units, there are questions to answer, and data products can provide the answers.

Building data products, however, is no small feat. They often require integrating numerous data sources, each with its own nuances and challenges, resulting in complexity. This complexity introduces difficulties in the transformation layer, increasing the chances of human error and requiring significant upkeep. Fortunately, there are ways to streamline this process and reduce both overhead and risk - we’ll discuss these solutions shortly.


What are data products?

Data products come in various forms and sizes. They are systems designed to answer specific data-driven questions. In your organization, your teams are likely building multiple data products for different departments: some for product, others for finance, others for operations. The question is: how do you build and maintain them?

For example, let’s consider a corporate analytics team at Tandigm, a healthcare company. Their goal is to measure success metrics related to primary care delivery, which will be used to create an incentive program for physicians. The technical goals include measuring:

  1. Cost efficiency of care
  2. Customer satisfaction
  3. Treatment success rates

Meanwhile, the business goal is to create an incentive program to improve primary care delivery. To accurately measure these metrics over time, the data team must develop a robust model capable of adapting to future changes in system and data requirements. This foundational work is crucial for building a strong data product, and it’s where many organizations fall short.


Why use data vault?

When building data products, the model is the foundation on which everything downstream is built. But traditional approaches offer only manual, non-scalable ways to build that model. With the Data Vault methodology, however, the process becomes more streamlined. It standardizes the modeling patterns, creating “system-agnostic” models that can be automated - addressing human error, model complexity, and high maintenance costs. Here are some of the technical differences in how data is structured, modeled, and maintained in a traditional Kimball model vs. a Data Vault model:


Structure and Table Design

Kimball Model:

  • Fact Tables: In the Kimball methodology, the model is centered around fact tables that store transactional data, such as Sales or Purchase Facts. These tables often contain large amounts of numeric data (e.g., revenue, quantity sold) that can be aggregated and analyzed across different dimensions. For this use case, you would create a Sales Fact Table to store information about each transaction.
  • Dimension Tables: Alongside fact tables, dimension tables are created to store descriptive data about business entities. For example, you would create Customer Dimension, Product Dimension, and Marketing Campaign Dimension tables. These dimensions are used to filter, group, and aggregate the facts.

Technical Difference: In Kimball, a lot of the work revolves around defining the grain of the fact table (e.g., one row per transaction) and linking it to dimensions via foreign keys. The complexity increases as the number of dimensions grows, especially with slowly changing dimensions (SCDs), which require custom handling of historical changes.
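To make that structure concrete, here is a minimal sketch of the fact/dimension layout described above. It uses SQLite purely for illustration; the table and column names are assumptions, not a prescribed design.

```python
import sqlite3

# Minimal star-schema sketch (illustrative names, not a production design):
# one fact table at transaction grain, linked to dimensions via foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,     -- business key from the source system
    name         TEXT,
    segment      TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_id  TEXT,
    category    TEXT,
    price       REAL
);
CREATE TABLE fact_sales (
    sales_key    INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sale_date    TEXT,
    quantity     INTEGER,
    revenue      REAL      -- additive measure, one row per transaction
);
""")
```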

Data Vault Model:

  • Hubs: In the Data Vault model, the focus is on identifying key business entities and storing them in hubs. For this use case, you would create a Customer Hub, Product Hub, and Marketing Campaign Hub to store unique identifiers for each of these entities. Hubs are typically small and contain only the business key (e.g., Customer ID, Product ID, Campaign ID) and a record of when the entity was loaded.
  • Links: Relationships between entities are captured in links. In this use case, you would create a Sales Link to connect the Customer Hub, Product Hub, and Marketing Campaign Hub based on the purchase details. The Sales Link would capture which customer bought which product through which marketing campaign.
  • Satellites: Finally, descriptive data, such as customer attributes (name, location, etc.), product details, or campaign performance, would be stored in satellites. These satellites contain data that may change over time and is related to the hubs or links. For example, you might have a Customer Satellite for attributes like "Age," "Email," and "Segment," and a Product Satellite for "Price," "Category," and "Color."

Technical Difference: In the Data Vault methodology, the data model is more modular and flexible. Hubs represent unique entities (business keys), links represent relationships, and satellites store descriptive data. This modularity ensures that historical data is preserved without overwriting, allowing for easy integration of new data sources or changes in business requirements.
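For comparison, here is a minimal sketch of the same sales scenario laid out as hubs, a link, and a satellite, again using SQLite only for illustration; the hash keys and column names are illustrative assumptions.

```python
import sqlite3

# Minimal Data Vault sketch (illustrative names): hubs hold business keys,
# the link holds the relationship, and satellites hold descriptive attributes
# with load timestamps so history is kept insert-only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_id   TEXT,               -- business key
    load_dts      TEXT,
    record_source TEXT
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_id    TEXT,
    load_dts      TEXT,
    record_source TEXT
);
CREATE TABLE link_sales (
    sales_hk      TEXT PRIMARY KEY,   -- hash of the combined business keys
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    product_hk    TEXT REFERENCES hub_product(product_hk),
    load_dts      TEXT,
    record_source TEXT
);
CREATE TABLE sat_customer (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_dts      TEXT,
    email         TEXT,
    segment       TEXT,
    PRIMARY KEY (customer_hk, load_dts)   -- one row per version
);
""")
```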

Handling Data Changes

Kimball Model:

Slowly Changing Dimensions (SCDs): A key challenge in Kimball is handling slowly changing dimensions. For example, if a customer's address changes, Kimball requires a special handling technique (such as SCD Type 1, 2, or 3) to update the Customer Dimension table.

  • Type 1 overwrites the old value.
  • Type 2 creates a new record with the changed value while preserving the history.
  • Type 3 adds a new column to the dimension table to track the change.

Technical Difference: The implementation of SCDs in Kimball often requires custom SQL or ETL logic to manage updates and track changes, making it difficult to scale when data volume grows or business requirements change frequently.
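As a rough illustration of that custom logic, the sketch below implements a hand-rolled SCD Type 2 update for an address change against an assumed dimension layout; real ETL frameworks wrap this pattern in their own tooling.

```python
import sqlite3
from datetime import date

# Hand-written SCD Type 2 pattern (illustrative schema):
# expire the current row, then insert a new version of the record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT,
    address      TEXT,
    valid_from   TEXT,
    valid_to     TEXT,       -- NULL means current version
    is_current   INTEGER
);
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
VALUES ('C001', '12 Old Street', '2023-01-01', NULL, 1);
""")

def apply_scd2_address_change(conn, customer_id, new_address, change_date):
    # Close out the current record...
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # ...and insert the new version, preserving the old one as history.
    conn.execute(
        "INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_address, change_date),
    )

apply_scd2_address_change(conn, "C001", "34 New Avenue", str(date(2024, 6, 1)))
```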

Data Vault Model:

History Preservation: Data Vault is specifically designed to handle historical changes more effectively. The satellites in the Data Vault model automatically preserve historical data by storing changes over time. For example, if a customer’s address changes, a new record is added to the Customer Satellite, with a new timestamp, while the original record is preserved. This history-preserving feature is built into the model, so no special handling or custom logic is needed for SCDs. Each change is simply tracked in the satellite, and all changes can be traced back to their original source.

Technical Difference: In Data Vault, historical data is managed out-of-the-box by adding new satellite records without needing complex SCD logic. This makes it far easier to handle changes and ensures that no valuable historical data is lost.
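The sketch below shows what that insert-only pattern looks like for a customer satellite, with an assumed hash-diff column used to detect changes; it illustrates the general pattern, not any specific tool’s implementation.

```python
import hashlib
import sqlite3
from datetime import datetime

# Insert-only history in a satellite (illustrative schema): a new row is added
# only when the descriptive attributes actually change; nothing is overwritten.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sat_customer (
    customer_hk TEXT,
    load_dts    TEXT,
    hash_diff   TEXT,        -- hash of the descriptive columns
    address     TEXT,
    email       TEXT,
    PRIMARY KEY (customer_hk, load_dts)
)
""")

def load_customer_satellite(conn, customer_hk, address, email):
    hash_diff = hashlib.md5(f"{address}|{email}".encode()).hexdigest()
    latest = conn.execute(
        "SELECT hash_diff FROM sat_customer WHERE customer_hk = ? "
        "ORDER BY load_dts DESC LIMIT 1",
        (customer_hk,),
    ).fetchone()
    if latest is None or latest[0] != hash_diff:   # only insert when something changed
        conn.execute(
            "INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)",
            (customer_hk, datetime.utcnow().isoformat(), hash_diff, address, email),
        )

load_customer_satellite(conn, "abc123", "12 Old Street", "a@example.com")
load_customer_satellite(conn, "abc123", "34 New Avenue", "a@example.com")  # adds a second version
```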

ETL and Data Integration

Kimball Model:

ETL Complexity: In Kimball, the process of data integration and transformation is often complex. The ETL pipeline must carefully map data from operational systems into the fact and dimension tables. Transformations, aggregations, and joins are required before the data can be loaded into the data warehouse. If a new source is introduced (e.g., a new product category), the ETL process needs to be adjusted, and additional mappings need to be created.

Technical Difference: As new data sources are introduced in a Kimball model, ETL processes grow more complex, requiring additional manual adjustments and creating more opportunities for errors. The rigid structure of fact and dimension tables means that any changes or additions require substantial rework.

Data Vault Model:

Automation and Flexibility: In the Data Vault model, ETL is simplified because the data is loaded into modular hubs, links, and satellites. New data sources can be incorporated easily by adding new hubs or satellites without requiring major changes to the existing data model. Tools like VaultSpeed automate much of the transformation logic, generating the necessary ETL scripts for loading data into the Data Vault structure.

Technical Difference: The Data Vault methodology is more flexible in terms of ETL. Because the model is modular, new data sources can be added without affecting the core data structure. This greatly reduces the complexity of the ETL process and makes the integration of new data much easier and faster.
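As a small illustration of that modularity, the sketch below onboards an assumed second customer source by adding one new satellite alongside an existing hub; no existing tables are altered. The table names are hypothetical.

```python
import sqlite3

# Existing vault objects stay exactly as they are; onboarding the new source
# only adds objects alongside them. (Illustrative, hypothetical table names.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT,
                           load_dts TEXT, record_source TEXT);
CREATE TABLE sat_customer_crm (customer_hk TEXT, load_dts TEXT, email TEXT,
                               PRIMARY KEY (customer_hk, load_dts));
""")

# A new source arrives: it gets its own satellite hung off the same hub.
# No ALTER TABLE on hub_customer or the existing satellite is required, and
# the new satellite is loaded by the same repeatable pattern as the others -
# the part of the work that automation tools generate.
conn.executescript("""
CREATE TABLE sat_customer_loyalty (
    customer_hk  TEXT,
    load_dts     TEXT,
    loyalty_tier TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
""")
```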

Query Performance and Maintenance

Kimball Model:

Query Optimization: With Kimball, queries are often optimized for performance by using aggregated fact tables and indexes on dimension keys. While this works well for certain types of analysis (e.g., reporting and dashboarding), it can be limiting when dealing with more complex, ad-hoc queries or large-scale data volumes.

Technical Difference: As data volumes grow, Kimball can face challenges in performance, especially when the data model is too rigid or when the fact tables become too large. Query performance can suffer without proper indexing, partitioning, and materialized views.

Data Vault Model:

Decoupled Data Structures: Data Vault creates a more decoupled data structure. By separating business entities (hubs), relationships (links), and descriptive data (satellites), the Data Vault allows for easier expansion and performance tuning. Data is loaded in its raw form, which means the warehouse is optimized for scalability rather than for query performance upfront.

Technical Difference: While Data Vault does not prioritize query performance at the modeling stage, it offers significant flexibility in handling complex and large datasets. Queries can be optimized at the reporting layer or data mart layer (where data is transformed for business use), ensuring that performance is scalable as the dataset grows.
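One common way to recover query-friendly shapes on top of the vault is to build views or marts over the current satellite rows. The sketch below assumes a simple hub-plus-satellite layout and builds such a view; the names are illustrative.

```python
import sqlite3

# Mart-layer optimization sketch (illustrative schema): a view that picks the
# latest satellite row per customer, so downstream queries see a simple
# dimensional shape while the vault keeps full history underneath.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE sat_customer (
    customer_hk TEXT, load_dts TEXT, segment TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
CREATE VIEW dim_customer_current AS
SELECT h.customer_id, s.segment
FROM hub_customer h
JOIN sat_customer s ON s.customer_hk = h.customer_hk
WHERE s.load_dts = (
    SELECT MAX(load_dts) FROM sat_customer WHERE customer_hk = h.customer_hk
);
""")
```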


Consider these 3 real-world examples:

1. Retail Company

A retail company wants to track sales performance across various regions, product categories, and marketing campaigns. With a traditional Kimball Star Schema, the design might require creating new fact tables every time a new metric needs to be tracked (e.g. customer lifetime value, product discounts, etc.). Each time a new business need arises, these new tables need to be integrated into the existing model, requiring significant adjustments and potentially impacting existing reports.

  • In contrast, with Data Vault, new metrics can be incorporated easily as additional satellites without modifying the underlying hubs and links.

2. Financial Services Company

A financial services company needs to comply with new regulatory reporting requirements that demand the integration of new data sources, like transaction data from new payment providers. In a normal Kimball Star Schema, this could require a large-scale restructuring of fact and dimension tables to accommodate the new data.

  • In a Data Vault, however, this integration only requires adding a new hub for the payment provider and linking it to the relevant data through links and satellites.

3. Healthcare provider

For a healthcare provider that needs to track patient data for compliance with privacy regulations (e.g. HIPAA in the U.S.), the ability to audit and trace changes is crucial. With a Star Schema, historical changes to patient records might be aggregated or overwritten in fact tables, making it difficult to track the original source of the data.

  • With Data Vault, all patient data remains stored in its original form in the satellites, and any changes to the data are captured in new satellites rather than overwriting previous records. This ensures complete auditability and data lineage, which is essential for regulatory compliance.

Why automate data vault?

Data vault automation adds guardrails to your data models, eliminating much of the manual upkeep required when adding new data assets or changing transformations. Automation makes your data products more predictable and reliable, all while minimizing risk and cost. There are 5 main benefits of automating your data vault:

  1. Reducing manual work: A significant amount of time is spent on repetitive tasks like manually creating tables, defining relationships, and ensuring that all transformations are consistent across the data pipeline. With Data Vault automation, much of this is streamlined. VaultSpeed, for example, automatically generates the necessary hub, link, and satellite tables based on business requirements. The data modeler only needs to define the key business entities and relationships, and VaultSpeed then takes care of generating the corresponding code for the tables (a simplified sketch of this template-driven idea follows this list). This ultimately means fewer people are needed to create, maintain, and scale the model over time.
  2. Consistent and compliant data models: Maintaining consistency across large data models - especially when working with multiple data sources - can be a cumbersome task. VaultSpeed keeps the structure of your data model standardized, guaranteeing that your hubs, links, and satellites are created in a consistent manner that adheres to the Data Vault methodology. This standardization helps prevent errors and ensures that data products comply with business rules and requirements.
  3. Real-time changes with minimal effort: Without automation, when business requirements change or new data sources are added, you often have to rewrite large portions of the transformation logic, adjust data structures, and ensure that everything still works across the whole pipeline. With Data Vault automation, these changes are far less labor-intensive. VaultSpeed allows for easy addition of new data assets without disrupting the existing model. New satellites, hubs, or links can be added on top of the existing structure without needing to modify the foundational parts of the model, reducing the risk of breaking existing functionality.

  4. Efficiency in handling complex data sources: Working with diverse data sources can become challenging for modelers. Traditional modeling techniques, including vanilla Data Vault, require significant manual effort to reconcile and transform the data. VaultSpeed automates much of this work, ensuring that your data model can easily handle a variety of data types and scenarios. By automating the creation of hubs, links, and satellites, VaultSpeed enables data modelers to focus on data quality and business logic rather than on the mechanics of data integration.

  5. Continuous model evolution: Data models are never static - they need to evolve as business requirements change and new use cases emerge. With traditional modeling approaches, making changes to the model can involve reworking large sections of code, which is time-consuming and error-prone. VaultSpeed builds the model with flexibility in mind: when business requirements change, new data sources are added, or additional use cases arise, you can make quick changes without disrupting the entire model. This continuous adaptability is key to maintaining a flexible, future-proof data environment.
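To give a feel for what template-driven generation means in practice, here is a deliberately simplified, hypothetical sketch: given declared business entities and their business keys, the hub DDL follows a fixed pattern. This is not VaultSpeed's actual API or template set, only an illustration of the idea.

```python
# Hypothetical illustration of template-driven generation: declare entities
# once, and the repetitive DDL follows a fixed, consistent pattern.
HUB_TEMPLATE = """CREATE TABLE hub_{entity} (
    {entity}_hk    TEXT PRIMARY KEY,
    {business_key} TEXT,
    load_dts       TEXT,
    record_source  TEXT
);"""

# Assumed entity definitions (business entity -> business key).
entities = {
    "customer": "customer_id",
    "product":  "product_id",
    "campaign": "campaign_id",
}

for entity, business_key in entities.items():
    print(HUB_TEMPLATE.format(entity=entity, business_key=business_key))
```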

Why VaultSpeed?

The bull case for VaultSpeed comes down to a simple question: do you want your data products to be model-driven or code-driven?

A model-driven approach offers standardization and automation that saves your team headaches. It allows you to create, edit, and rework data products with ease. Any necessary changes to the data model are added on top, without added complexity. Here is a more in-depth explanation of the process, but at a high level, the model is adaptable, allowing you to add new business requirements without reworking the entire model. That’s the beauty of Data Vault - it makes your model modular, and VaultSpeed then takes that modular model and makes it easier to work with and scale. VaultSpeed reduces the complexity of creating and managing a scalable Data Vault model by offering a visual, graphical user experience instead of a code-heavy one. Best of all, its templating automation keeps your model in sync with your transformation code.

Take one of our largest customers, Nationale Nederlanden, an insurance company that developed over 15,000 mappings in VaultSpeed with a team of just six people. This took them about 6 months, compared to the 2-3 years it might have taken with a code-first approach. Most importantly, new data product requests are now completed in less than 1 month, instead of the 3 months their old approach required. Some context: Nationale Nederlanden used VaultSpeed to integrate 17 source systems (590 source tables), migrate their old Oracle DWH to Databricks Lakehouse, and create and maintain 1,452 objects in the Data Vault model. This is a great example of how VaultSpeed accelerates both the modeling process and the data product development process.