Spark Structured Streaming - VaultSpeed

Streaming Data Vault

VaultSpeed’s Streaming add-on module enables you to stream data into your Data Vault using Spark Structured Streaming. Out of the box, VaultSpeed supports conventional solutions for loading data from source to target, including multiple flavors of batch and CDC loading. But what if conventional data loading isn’t enough for you?

Kafka stream support

Say you have a lot of IoT data, trading data or inventory data coming in from Kafka topics. VaultSpeed Streaming enables you to read the metadata from these topics, build a Data Vault model on top of them and stream the data straight into your Data Vault. To do so, users set up their sources as streaming sources. For now, only Kafka streams are supported. VaultSpeed can read source metadata either straight from the Kafka schema registry or from the landing zone where Kafka streams enter the data warehouse.
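To make the metadata step more concrete, here is a minimal sketch of what harvesting column metadata from a topic schema could look like. It assumes the topic's value schema is an Avro record in its JSON form, which is one of the formats a schema registry can hold. The schema, the field names and the helper functions are hypothetical; this is not VaultSpeed's implementation, and it is written in plain Python purely for illustration.

```python
import json

# Hypothetical Avro value schema in JSON form.
# The record and field names are invented for illustration.
SENSOR_SCHEMA = """
{
  "type": "record",
  "name": "SensorReading",
  "namespace": "example.iot",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "reading_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "temperature", "type": ["null", "double"], "default": null}
  ]
}
"""


def describe_field(field):
    """Return (name, type_name, nullable) for one Avro field definition."""
    avro_type = field["type"]
    nullable = False
    # A union such as ["null", "double"] marks the field as optional.
    if isinstance(avro_type, list):
        nullable = "null" in avro_type
        non_null = [t for t in avro_type if t != "null"]
        avro_type = non_null[0] if non_null else "null"
    # Complex types are objects; prefer the logical type when present.
    if isinstance(avro_type, dict):
        avro_type = avro_type.get("logicalType", avro_type["type"])
    return field["name"], avro_type, nullable


def harvest_metadata(schema_json):
    """List the columns described by an Avro record schema."""
    schema = json.loads(schema_json)
    if schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    return [describe_field(f) for f in schema["fields"]]


columns = harvest_metadata(SENSOR_SCHEMA)
for name, type_name, nullable in columns:
    print(name, type_name, "NULL" if nullable else "NOT NULL")
```

A column list like this is the kind of input a Data Vault model needs: which attribute can serve as a business key, and which attributes are descriptive.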

Same but different

With Streaming, users can harvest metadata, configure sources and model Data Vaults, just as they are used to doing for any other source type. Once the Data Vault model has been designed, VaultSpeed generates two types of code:

• Scala code that is deployed to your Databricks cluster and serves as the runtime code.

• DDL code to create the corresponding Data Vault structures on the target platform.
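As a rough illustration of the second output, the sketch below builds DDL for one hub and one satellite. The table and column names (`hub_sensor`, `sensor_hkey`, `load_date`, `record_source`, `hash_diff`) follow common Data Vault conventions and are assumptions, as are the helper functions; the DDL that VaultSpeed actually generates depends on your model and target platform and is not reproduced here.

```python
def hub_ddl(entity, business_key, key_type="VARCHAR(100)"):
    """Build a CREATE TABLE statement for a hub: one row per business key."""
    return (
        f"CREATE TABLE hub_{entity} (\n"
        f"  {entity}_hkey CHAR(32) NOT NULL,\n"
        f"  {business_key} {key_type} NOT NULL,\n"
        f"  load_date TIMESTAMP NOT NULL,\n"
        f"  record_source VARCHAR(50) NOT NULL,\n"
        f"  PRIMARY KEY ({entity}_hkey)\n"
        f")"
    )


def satellite_ddl(entity, attributes):
    """Build a CREATE TABLE statement for a satellite: descriptive history."""
    attribute_lines = "".join(
        f"  {name} {sql_type},\n" for name, sql_type in attributes
    )
    return (
        f"CREATE TABLE sat_{entity} (\n"
        f"  {entity}_hkey CHAR(32) NOT NULL,\n"
        f"  load_date TIMESTAMP NOT NULL,\n"
        f"  hash_diff CHAR(32) NOT NULL,\n"
        f"{attribute_lines}"
        f"  PRIMARY KEY ({entity}_hkey, load_date)\n"
        f")"
    )


# Hypothetical model: a sensor hub with one descriptive attribute.
statements = [
    hub_ddl("sensor", "sensor_id"),
    satellite_ddl("sensor", [("temperature", "DOUBLE PRECISION")]),
]
print(";\n\n".join(statements) + ";")
```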

Scala code deployed in Databricks

At runtime, Databricks reads messages from the Kafka topics and transforms them into a Data Vault structure. The loading logic handles short-term data delivery issues such as latency, as well as long-term issues such as missing keys in the source system or downtime. Currently, Databricks is the only platform we support for hosting the runtime of the streaming solution.
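The transformation into a Data Vault structure can be sketched as follows. This is a simplified illustration in plain Python, not the generated Scala code: it assumes JSON-encoded messages, MD5 hash keys over upper-cased, trimmed values, and hypothetical column names. It also leaves out the handling of latency, missing keys and downtime mentioned above.

```python
import hashlib
import json
from datetime import datetime, timezone


def md5_hex(*parts):
    """Hash normalized parts joined by a delimiter (an assumed convention)."""
    normalized = "|".join(str(p).strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def to_data_vault(message_value, record_source, load_date):
    """Split one JSON message into a hub row and a satellite row."""
    payload = json.loads(message_value)
    # The hub hash key is derived from the business key only.
    hkey = md5_hex(payload["sensor_id"])
    hub_row = {
        "sensor_hkey": hkey,
        "sensor_id": payload["sensor_id"],
        "load_date": load_date,
        "record_source": record_source,
    }
    # The hash diff covers the descriptive attributes, so that a change
    # can be detected by comparing two hashes.
    sat_row = {
        "sensor_hkey": hkey,
        "load_date": load_date,
        "hash_diff": md5_hex(payload["temperature"]),
        "temperature": payload["temperature"],
    }
    return hub_row, sat_row


# Hypothetical message value and topic name.
message = b'{"sensor_id": "s-001", "temperature": 21.5}'
hub, sat = to_data_vault(
    message,
    "kafka.sensor_readings",
    datetime(2022, 6, 27, tzinfo=timezone.utc),
)
print(hub)
print(sat)
```

Both rows carry the same hash key, which is what ties the satellite's descriptive history to its hub.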

For our streaming solutions, we only generate incremental logic. There’s no initial streaming logic, since records are captured and processed sequentially, one record at a time. Initial loads, if necessary, can be executed using the standard initial loads, identical to how we deliver them for other source types.

The target for the Data Vault can be any platform that supports JDBC. You can choose to stay in Databricks, but you could also use Azure Synapse, Snowflake, and many others as your Data Vault platform. To make this possible, we created a JDBC sink connector that handles the loading towards the target.

Going beyond the Lambda Architecture

VaultSpeed Streaming works in combination with all other source types, such as CDC (Change Data Capture) and (micro-)batch sources. This allows customers to run their Data Vault at multiple speeds. At the same time, all data elements are available in the same integration layer, regardless of the loading strategy you choose.

Streaming data loaded into Databricks Satellite

Watch the data flow

Real-time does not get more real than this! Transactional data is streamed into Databricks, delivering live analytics.

More add-on modules

Flow Management Control

Generate workflow schedules for Airflow, ADF & Matillion ETL

Template Studio

Build your own custom, company-specific templates

Agent Extensions

Read metadata directly from any source that supports JDBC or Kafka.
