VAULTSPEED Alternative To The Driving Key Implementation In Data Vault 2.0

Alternative To The Driving Key Implementation In Data Vault 2.0

Back in 2017 we introduced the link structure with an example of a Data Vault model in the banking industry. We showed how the model looks like when a link represents either a relationship or a transaction between two business objects. A link can also connect more than two hubs. Furthermore, there is a special case when a part of the hub references stored in a link can change without describing a different relation. This has a great impact on the link satellites. What is the alternative to the Driving Key implementation in Data Vault 2.0?

THE DRIVING KEY

A relation or transaction is often identified by a combination of business keys in one source system. In Data Vault 2.0 this is modelled as a normal link connecting multiple hubs each containing a business key. A link contains also its own hash key, which is calculated over the combination of all parents business keys. So when the link connects four hubs and one business key changes, the new record will show a new link hash key. There is a problem when four business keys describe the relation, but only three of them identify it unique. We can not identify the business object by using only the hash key of the link. The problem is not a modeling error, but we have to identify the correct record in the related satellite when query the data. In Data Vault 2.0 this is called a driving key. It is a consistent key in the relationship and often the primary keys in the source system.

The following tables demonstrate the relationship between an employee and a department from a source system.

Table 1: employee-department relationship

 

The following Data Vault model can be derived from this source structure.

Figure 1: Data Vault model

 

The link table “Empl_Dep” is derived from the table “Employee” in the source system. The Driving Key in this example is the Employee_Number as it is the primary key in the source table, and an employee can work in only one department at the same time. This means, the true Driving Key “lives” in the satellite of the employee. If the department of an employee switches, there is no additional record in the employee’s satellite table, but a new one in the link table, what is legitimate.

Table 2: link data

 

To query the most recent delta you have to query it from the link table, grouped by the driving key.
To sum up you will always have a new link hash key when a business key changes in a relation. The challenge is to identify the driving key, which is a unique business key (or a combination of business keys) for the relationship between the connected hubs. Sometimes you would have to add an additional attribute to get a unique identifier.

Both present an issue for power users with access to the Data Vault model. Without naming conventions there is a risk that a group by statement is performed on more attributes than just the driving key which would lead to unexpected and incorrect aggregate values – even though the data itself is correctly modelled.

When dealing with transactions (e.g. the flight data from our book) there is a better solution available than the driving key: we typically prefer to model such data as a non-historized link and insert technical counter-transactions to the data when a hub reference changes.
In the case of a modified record in the source, we insert two records to the non-historized links: one for the new version of the modified record in the source and one for the old version that still exists in the target (non-historized link) but needs to be countered now – the technical counter record. To distinguish the records from the source and the counter transactions a new column is inserted, often called “Counter”.

The standard value for this counter attribute is 1 for records from the source and -1 for the technical counter transactions. Important: We do not perform any update statements, we still insert only the new counter records. When querying the measures from the satellite you just multiply the measures with the counter value.

Table 3: Link Connection with counter attribute

 

The table 3 shows a link with a counter attribute. When a record changes in the source system it is inserted with the original value and a counter value of -1 in the link table of the data warehouse. For the changed value there is a new link hash key which is also calculated over the descriptive attribute ‘Salary’. The counter value of the new record is 1.

CONCLUSION

Because identifying the driving key of a relation can be a problem in some situations you can use an alternative solution to avoid the driving key. All changes and deletes are tracked using a counter attribute in the non-historized link table. It stores also the descriptive attributes and the link hash key is calculated over all attributes.


VAULTSPEED More advanced Link Object Staging Logic

More advanced Link Object Staging Logic

When the source relationship is not based on the business keys

Note: This is a fictive example, created to explain a case. It is used only to illustrate the more advanced logic that is necessary in the staging layer. It explains how to solve the relationship and load the link table when the foreign/primary key relationship between the source objects is not based on the Business Key.

The relational source model as shown in the image above shows the relationship between an Invoice source table and a Customer table.

The Business Keys are indicated by the U-sign in front of the column. If they do not exist, then the Business Key is indicated by the Unique key. In this model, this means:

  • For invoices, the Business Key is INVOICE_NUMBER
  • For customers, the Business Key is a composite key that consists of these columns:
    • FIRST_NAME
    • LAST_NAME
    • BIRTH_DATE
    • GENDER

The relationship between the 2 tables in this model is based on the customer_id, which is the technical key (surrogate key). It is not based on the business key, meaning the combination of First_name, Last_name,Birth_date and Gender.

As a result:

  • The link table can’t be loaded by using the Invoices table only – where the relationship exists. A lookup is needed to find the Business Keys in the Customers table.
  • Not all the records to solve the relationships will be found during the lookup because the data in the landing area holds incremental data. This means that a lookup in the Customer Hub is necessary to find the missing related records and their Business Keys.

The following image illustrates the lookups that are needed to resolve the link between Invoices and Customers. In the Invoices landing table there are two incremental records which have 5 and 7 as the customer-id. When the lookup is done in the Customers landing table it only holds the record with the Customer_Number = 5, meaning that the record with Customer_Number = 7 is not in the landing area during this incremental load.

The only way to solve this is to do the lookup to the Satellite, on the condition that the Satellite holds the primary key of the Customers table = Customer_Number. Through the lookup to the HUB using the Hash key, the Business Keys can be found resulting in a fully solved Link table.

The following image shows the SQL used to load the invoices Staging table and used to resolve the Link table between Invoices and Customers in an Oracle Database prior to 12C.

Now what if this technical key was also put into the Hub? This would result in one less lookup from the Satellite to the Hub. However, Hubs should only contain Business Keys. It is good practice to put the technical key in the Satellite but make it a special attribute that does not take part in the Change of History. This means it will not be a part of the Hash Difference Attribute calculation. 

The target Data Vault model for this article is shown below