Source copy, Databricks and Apache Airflow 2.0 (Release 4.2.4)

We’re back with a new release, and it is stuffed with new features.
We added support for Databricks and updated our Flow Management connector to work with Apache Airflow 2.0. VaultSpeed users can now also copy an entire source configuration. These, and many more changes, come with VaultSpeed R4.2.4!

Databricks

Run your Data Vault in the Databricks data lakehouse!
You can now generate and deploy Spark code to Databricks and run it with Airflow. The deployment creates Spark SQL notebooks in Databricks for all your Data Vault mappings, and Airflow launches the jobs that run those notebooks. Integration with Azure Data Factory is coming soon.
The target database type is still Spark, but the ETL generation type has to be set to Databricks SQL.
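
To sketch how Airflow can launch such a notebook job, here is a minimal DAG using the Databricks provider for Airflow; the cluster id, notebook path and DAG name are hypothetical, and the workflows VaultSpeed generates may look different.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    with DAG(
        dag_id="dv_databricks_example",  # hypothetical DAG name
        start_date=datetime(2021, 4, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Run one generated Spark SQL notebook on an existing cluster.
        load_hub_customer = DatabricksSubmitRunOperator(
            task_id="load_hub_customer",
            databricks_conn_id="databricks_default",
            json={
                "existing_cluster_id": "1234-567890-abc123",  # hypothetical
                "notebook_task": {"notebook_path": "/VaultSpeed/hub_customer"},
            },
        )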

Airflow 2.0

Apache Airflow 2.0 brings a truckload of great new features: a modernized user interface, the new Airflow API, improved scheduler performance, the TaskFlow API and more. VaultSpeed now supports Airflow 2.0. The VaultSpeed plugin for Airflow and all generated code have been reworked, and all code still works on previous Airflow versions. Just like before, once you’ve installed our plugin into your Airflow environment, Airflow becomes VaultSpeed-aware: you can generate and deploy workflows and run all the code needed to load your Data Vault.
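
As a taste of the TaskFlow API, here is a minimal, self-contained Airflow 2.0 DAG (our own toy example, not VaultSpeed-generated code) in which task dependencies follow from plain function calls:

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(start_date=datetime(2021, 4, 1), schedule_interval="@daily", catchup=False)
    def taskflow_demo():
        @task
        def extract() -> dict:
            return {"rows": 42}

        @task
        def load(payload: dict) -> None:
            print(f"loaded {payload['rows']} rows")

        # Passing the return value wires up the dependency automatically.
        load(extract())

    demo = taskflow_demo()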

Copy Sources

Users now also have the ability to copy existing sources. In some cases, an organization needs to integrate multiple sources that share many similarities.

To give an example: Company ABC has the same version of their Sales CRM running in both Europe and the US. The only difference is that they have a few additional modules activated in the US.

Using the source copy functionality, you can now copy the entire source configuration from EU Sales to US Sales. All you need to do is identify and configure the objects and settings that are specific to the new source; you can skip all the configuration you had already done for the EU source.

Using this functionality can obviously save a lot of time when integrating similar sources into your Data Vault model.

User Experience Improvements

The new release comes with a few other changes, like a better screen to create a new data vault release. It has become a lot easier to indicate which version of which source you would like to include in a specific data vault release. You can also choose to exclude certain sources from your release.

We also made it possible to mark objects in the source editor as completed; completed objects are highlighted in green. This status can be toggled by right-clicking on an object. The selection page can filter out completed objects, and there is also a button to remove all completed objects from the canvas. This allows you to track progress in your source modelling and keep things organized.

Business Keys

We made it easier to change and reorder business keys in the Data Vault. A new screen lets you, per hub group, rename and reorder the business keys of the grouped objects, and reorder the business keys of the hubs in the group to match. The keys in different sources can now have different orders and names and still result in the same hash key calculation.
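
To illustrate why order and naming matter, here is a toy sketch (not VaultSpeed's actual implementation) of a hash key computed from an ordered list of business keys:

    import hashlib

    def hash_key(business_keys):
        # Concatenate the keys in a fixed order with a delimiter, then hash.
        normalized = "|".join(str(k).strip().upper() for k in business_keys)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # Two sources integrate into the same hub only if their keys line up:
    print(hash_key(["BE", "12345"]))  # country code + customer number
    print(hash_key(["12345", "BE"]))  # reversed order gives a different hash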

We added a similar ability to reorder the linked hubs in many-to-many links and non-historized links. In a separate screen, you can change the order of the hubs included in a many-to-many or non-historized link.

Other changes

  • We renamed the “build flag” property to “ignored” everywhere in the application.
  • We added extra template variables for the custom deploy scripts in the agent. Instead of only the zip name, you can now also get the generation id, the generation info and the generation type, similar to the git commit message functionality. Example:
    deploy.cmd = sh C:\Users\name\Documents\agent\deploy.sh {zipname} {code_type} "{info}"
  • The compare functionality in the source graphical overview now skips ignored releases. This means that it compares with the last non-ignored locked release before the current one.
  • We added support for overlapping loading windows to the Azure Data Factory FMC. This can be configured using the following parameters: FMC_OVERLAPPING_LOADING_WINDOWS, FMC_WINDOW_OVERLAP_SIZE and FMC_WINDOW_OVERLAP_TYPE (see the sketch after this list).
  • The metadata export has been converted to a task in order to support exporting data for very large Data Vaults. Previously, the export could time out and fail to return a file if it took too long.
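
As a toy sketch of what an overlapping loading window means (the parameter semantics and values below are assumptions, not the actual FMC configuration), each incremental run re-reads a configurable slice of the previous window to catch late-arriving data:

    from datetime import datetime, timedelta

    FMC_WINDOW_OVERLAP_SIZE = 15         # assumed value
    FMC_WINDOW_OVERLAP_TYPE = "minutes"  # assumed unit

    def window_start(previous_window_end: datetime) -> datetime:
        # Shift the start of the new window back by the configured overlap.
        overlap = timedelta(**{FMC_WINDOW_OVERLAP_TYPE: FMC_WINDOW_OVERLAP_SIZE})
        return previous_window_end - overlap

    print(window_start(datetime(2021, 5, 1, 12, 0)))  # 2021-05-01 11:45:00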

More releases are coming!

Want to stay up to date? Subscribe to our newsletter!



Meet EON Collective: our new Integration Partner in North America

Important news from the partnership front! We have recently teamed up with EON Collective. EON is a group of highly experienced data professionals located in the USA and Canada, and will act as an integration partner for VaultSpeed in the region.

We’re delighted to announce EON Collective as our newest integrator partner in North America. EON has a strong focus on automation and is very familiar with Data Vault 2.0. Their expertise in data warehousing and data integration is impressive, and we’re happy to team up with such a strong player.

Piet De Windt - CEO VaultSpeed

Every EONite team lead has over 20 years of experience in their discipline. They have all, at one time or another, worked for one of the world's largest consulting firms, and all understand that real change doesn't have to cost an arm and a leg. With that in mind, EON Collective's team developed tools that lower the hours needed to bring you real results. They help organizations gain validated business insights faster and with greater flexibility, and help companies ensure business value through proven methodology and automated tools.

They are partnering up with VaultSpeed as their preferred solution for data warehouse automation:

We are very excited about being VaultSpeed’s North American integration partner. Automation is a key component of any successful Data Vault implementation, and we feel VaultSpeed’s automation strategy in combination with our ADEPT methodology for Data Vault implementation is the perfect combination.

We are also looking forward to working with VaultSpeed as we start to integrate some of our ADEPT technology with the VaultSpeed solution.

Robert Scott - CTO EON Collective

The power of EON is having the collective capability to work alongside their clients utilizing the ADEPT Managed Solution. EON ADEPT links process model analysis and data-oriented analysis. In fact, ADEPT is not limited to automated process discovery based on event data; it also answers a wide variety of clients' performance and compliance questions based on the identified solution's operational metrics. ADEPT was built with the simple goal of greatly reducing the cost of consulting.

We should also mention that EON is joining us at the World Wide Data Vault Consortium starting May 17th, where you can bring any questions about getting started with VaultSpeed in their region and about the ADEPT integration. We can highly recommend Keith Belanger’s keynote presentation “Is your Data Vault speaking your language?”.


VaultSpeed @ World Wide Data Vault Consortium 2021

You simply have to check out the annual World Wide Data Vault Consortium on May 17, 2021.
This is where the worldwide user community comes to get in-depth knowledge from presenters about data hubs, the role of A.I. and, of course, automation.

At the conference, VaultSpeed will host three events: a hands-on demo session, a roadmap presentation and customer success stories.
What’s more, you can get a special 20% discount on the subscription from us.

Hands-on Session

Skip most of the data integration preparation with VaultSpeed

VaultSpeed’s data warehouse automation enables organizations to integrate data from numerous source platforms into one data vault. We harvest source metadata, users configure their source models, and our engine delivers generated structures, ELT and workflows. VaultSpeed’s guided automation framework helps users combine and enrich metadata from different sources in an intuitive way that corresponds to the target model.

Our out-of-the-box templates cover 90% of your implementation needs. They are 100% production-ready as VaultSpeed handles all the quality assurance and testing. We simplify the complex process of building a data vault by forcing the user to follow a pre-defined set of steps. This significantly reduces the chance of errors and ensuing rework.

VaultSpeed is quickly evolving. New functionalities are implemented every three to four weeks. Our cloud setup ensures our customers always run on the latest version.

We always try to help users eliminate time-consuming manual work and constantly work at developing new features by which they can reach even higher levels of automation.

Key takeaways:

- Integrate sources into your data warehouse quickly using Data Vault 2.0 and VaultSpeed

- Use VaultSpeed’s powerful source editor to tailor the Raw Data Vault and Business Vault towards your business taxonomy.

Customer Success

There are a lot of aspects that can make or break your data warehouse project. We’d like to cover three of those using three cases from the real world: Time to market, cloud architecture and fulfilling business requirements.

Learn how VaultSpeed is speeding up the implementation process at Eurocontrol, the European Organisation for the Safety of Air Navigation, using its out-of-the-box templates. One year into the project, Eurocontrol conducted an internal ROI analysis.

We’re launching a huge project at Olympus, a global player in the MedTech market. As Olympus plays on a global level, they are moving their data platform to the cloud, and VaultSpeed’s cloud architecture fits right in.

Finally, no project succeeds without fulfilling business requirements and speaking the business’s language. At Bank Degroof Petercam, VaultSpeed enabled developers to map their business taxonomy to the data vault model.

Roadmap Presentation

The VaultSpeed automation tool is at the top of the evolutionary/acceleration ladder.

Curious to know how a decade of hands-on experience in data integration projects has resulted in a SaaS platform that provides faster data warehouse automation?

For us, all along, that was the key to our evolution. It was not about our intrinsic strength or intellectual ability, but rather the ability to understand the difficulties that our customers encounter and to adapt and tweak our platform to help them survive.

And no, it definitely was not a rollercoaster. Or to quote Charles Darwin himself “In the long history of humankind (and animal kind, too) those who learned to collaborate and improvise most effectively have prevailed.”

We are living in an environment that continues to evolve. That’s why we’re happy to share a sneak peek of our roadmap and how we see the world of data warehouse automation evolving in the near and not-so-near future.


VaultSpeed enters the official Data Vault 2.0 Certification Program

The Data Vault Alliance proudly announced their brand-new Certified Software Vendor Program. VaultSpeed is happy to enter this new program as a continuation of our earlier certification efforts.

Why do we use Data Vault 2.0?

VaultSpeed has based its automation engine for the integration layer on Data Vault 2.0. We made this decision with a few key constraints in mind: flexibility, agility, support for multiple versions of the truth, repeatability and the use of a standard. Data Vault provides us with the best answer to these constraints.

Data Vault is very flexible: you can add new business elements to the model without affecting previous efforts, because Data Vault can easily absorb those changes. For the same reasons, it is a perfect fit for an agile approach; you can chop the entire workload into small, manageable sprints.

While there may be such a thing as the “single version of the truth”, we believe it is almost impossible to obtain. Not everybody has the same point of view, and views may change over time. This means you will always have multiple versions of the truth. To support this, Data Vault starts from a single version of the facts: the stable factor you need to deliver multiple versions of the truth and still manage the data integration effort over time.

Data Vault is also perfect for automation. You can define a clear relation between source metadata and the target model, and you can do so using a limited set of repeatable patterns.
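
As a toy illustration of such a repeatable pattern (our own sketch, not VaultSpeed's actual templates), one template plus source metadata can yield the loading code for every hub:

    # One pattern, many hubs: render loading SQL from source metadata.
    HUB_PATTERN = """
    INSERT INTO {hub}
    SELECT DISTINCT {hash_key}, {business_key}, load_date, record_source
    FROM {staging}
    WHERE {business_key} IS NOT NULL
    """

    def render_hub_load(metadata: dict) -> str:
        return HUB_PATTERN.format(**metadata)

    print(render_hub_load({
        "hub": "hub_customer",
        "hash_key": "customer_hkey",
        "business_key": "customer_nr",
        "staging": "stg_sales_customer",
    }))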

Why did VaultSpeed choose to be certified?

VaultSpeed values the Data Vault standard for all the benefits it brings, like resilience to change and repeatable patterns. Data Vault provides the foundation for automation. Being able to work with a well-defined standard that is documented, used across the world, and updated and safeguarded over time is key; it enables everyone to speak the same language. This emphasizes the importance of the Data Vault Alliance, led by founding father Dan Linstedt, as the organisation that sets the Data Vault standard. For these reasons, we want VaultSpeed to be Data Vault 2.0 certified by the DVA, to prove that VaultSpeed provides the means to work by that very same standard.

Certified Data Vault 2.0

In 2019 we started a track to get our tool certified together with Empowered Holdings and Scalefree.

Empowered Holdings, LLC and Scalefree teamed up in 2019 to work with VaultSpeed to get their Data Vault automation tool certified to Data Vault 2.0 standards. We are happy to announce that as of 2020, they have passed the tool certification process.

Certified Software Vendor Program

As of January 2021, Empowered Holdings, LLC merged its Data Vault practices with DataVaultAlliance Holdings, LLC. The DVA is currently developing a worldwide Vendor Tool Certification Program. This program and its details will be available to any software or hardware vendor interested in participating. The program will list a set of standards that a tool needs to meet in order for the components that automate Data Vault to be certified.

Read more about this program on the Data Vault Alliance’s website.

Vaultspeed raises €3.6 Million Series A


Vaultspeed raises €3.6 Million Series A to accelerate growth and bring its best of breed data warehouse automation solution to the global market

PRESS RELEASE, 17 March 2021 - Leuven, Belgium - Vaultspeed, the Belgium-based SaaS company specialized in data warehouse automation solutions, has closed a €3.6 million Series A round led by Fortino Capital Partners. The company was founded two years ago by Piet De Windt and Dirk Vermeiren with the support of The Cronos Group, which remains on board through its seed investment fund, the CoFoundry.

Vaultspeed’s data automation software serves data managers by accelerating and automating the entire lifecycle (design, build and maintain) of their Data Vault. Data Vault technology is an innovative approach to centralizing enterprise data for business analysts, who deliver the real-time insights that business leaders need to guide their decisions. This is where Vaultspeed comes into play. Dirk Vermeiren, CTO of Vaultspeed: “While ensuring quality and consistency, the tool automates the integration of data from multiple source systems into the Data Vault, making it available for further analysis throughout the enterprise. This is what agile business leaders need to accelerate their time to market, cut the complexity and reduce project risks.”

Vaultspeed’s customer base ranges from California (Department of Health of Santa Clara County) across Europe to Japan (Olympus). The rise of microservices driving the multiplication of distributed data sources, the move to the cloud, the increasing volume of data and degree of change and the scarcity of qualified talent stimulate organizations to respond faster and smarter, increasing the demand for automated data warehousing solutions.

Vaultspeed recently signed a global deal with the Japanese multinational Olympus, which will be building regional as well as global data integration platforms in order to make faster and better use of its enterprise data and improve its customer-driven solutions for the medical, life sciences and industrial markets. By bringing Vaultspeed into the process, Olympus can now create faster and better-integrated insights on its SAP and non-SAP data, which was not possible before.

Duco Sickinghe, Managing Partner at Fortino Capital: “We have seen rapidly increasing traction for data vaults over the past years and are truly excited to support Piet De Windt (CEO) and Dirk Vermeiren (CTO) in accelerating their growth. Vaultspeed’s data warehouse automation tool plays a crucial role in helping customers increase their agility, while responding to strong time-stamping, auditability and traceability requirements.”

Wim Bijnens, partner at CoFoundry: “We have seen Vaultspeed evolve from an idea into a prototype on to a proof of concept with some of our key customers in Belgium. Today, Vaultspeed is ready to scale up and deliver value to business leaders all over the world. With enthusiasm and belief in a great future for Vaultspeed we are pleased with the support of Fortino Capital in this exciting scaling phase.”

With the additional funding, Vaultspeed will look to further scale its organization and invest in its best-of-breed product in order to serve and expand its international customer base. Piet De Windt, CEO of Vaultspeed: “Vaultspeed’s cloud-based product is platform agnostic and integrates with top-tier tools in the data integration ecosystem. We strive to bring the best value and technology to our customers leveraging our strong and growing partner ecosystem. We are happy to onboard Fortino Capital and look forward to entering Vaultspeed’s next development phase together.”

About Vaultspeed

Vaultspeed is a Belgium-based software company. Its data warehouse automation solution speeds up the process of data integration through a best-in-class tool built on the Data Vault 2.0 methodology. More and more companies worldwide rely on Vaultspeed to build and maintain their enterprise data hubs with ease. The tool connects with the most popular ELT (ETL) tools, source and target technologies, and orchestration engines.

About Fortino Capital Partners

Fortino Capital Partners is a Benelux-focused B2B software investor with a pan European reach. Fortino Capital invests in both Venture Capital and Growth private equity assets. With offices in Antwerp and Amsterdam, Fortino Capital’s investment portfolio includes Teamleader, Insided, MobileXpense, Efficy CRM, iObeya and Oqton among others.

For more information, please visit https://fortinocapital.com/

About The CoFoundry

With a passion for innovation, The CoFoundry helps entrepreneurs transform their ideas into sustainable companies by funding them in a seed stage and by coaching them in the growth process. Embedded in the ecosystem of The Cronos Group, The CoFoundry has access to a wide network of relevant technology players.

For more information, please visit http://www.thecofoundry.co/

External tables & Template Previews (Release 4.2.3)

We have released VaultSpeed 4.2.3! Part of our focus was on improving the performance of code generation tasks, but we also included some novelties.

External tables

We have extended the DDL settings with support for the INI and CDC layers. You can now generate table definitions for external/foreign tables.
This enables you to define INI and CDC tables as external tables, directly connecting them to CSVs, XLS files, database exports and many other formats in your data lake.

VaultSpeed Studio Code Preview

VaultSpeed Studio, our templating module, now features previews. You can write a template and run a preview on a designated object to see what code it will generate. Copy and paste the preview code and test it in your development environment. Using previews, it won’t take long until you’ve written the perfect custom template!

 

Preview for a custom effectivity SAT template

Performance improvements

We improved performance for delta generations. Deltas calculate the difference between two separate Data Vault versions and generate all the code necessary to move from one version to the next.

 

Delta code generation

 

Calculation times have improved drastically, and the difference is especially noticeable when changes are located in only a limited set of sources.

Additionally, VaultSpeed’s agent now only harvests metadata for objects that are included in a release, instead of all objects in the schema. This can greatly improve metadata retrieval performance for large sources where only a limited number of objects are used for the data warehouse.

Source editor improvements

We also added some new stuff in our source editor.

From now on, you can run object or attribute mass updates from the source graphical editor. This enables you to set the object type, CDC type, comments, data length and more for all objects or attributes matching a certain pattern.

 

Object/Attribute Mass Update

 

We improved the layout of objects in the source graphical editor, and added the ability to switch between vertical and horizontal orientation.

Third, we added shortest-path functionality between two objects in the source graphical editor: another tool that can help you better understand your source model.

 

Shortest path

 

These changes were the final part of the roadmap to cover all functionality that was previously available in the tabular source editor. The old editor had become outdated and is no longer available.

For more info on VaultSpeed you can always subscribe below ⬇️

You value your privacy? We share your concern. So please check our privacy policy.


Azure Data Factory meets VaultSpeed FMC (Release 4.2.2)

Some of our developers just don’t know how to stop. They came up with something new during the Christmas holidays: support for Azure Data Factory (ADF) in our Flow Management Control (FMC) solution.

ADF FMC

In a previous blog post, we introduced our FMC solution on top of Apache Airflow. From now on, we also offer our workflow solution on top of Azure Data Factory.
This solution is an ideal fit for Azure DB or Synapse customers. They can use VaultSpeed to generate DDL and ELT to integrate their sources into the data warehouse, and VaultSpeed can now also generate the orchestration in the form of JSON, which you can automatically deploy to ADF.

The VaultSpeed FMC for ADF uses Azure PaaS components exclusively: Azure Data Factory and your data warehouse database (SQL Server or Synapse). The database contains procedures and load metadata tables, and the ADF FMC uses stored procedure activities to execute those procedures.
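
To give an idea of the generated orchestration, a stored procedure activity in an ADF pipeline has roughly the following JSON shape, written here as a Python dict (illustrative only; all names are hypothetical and VaultSpeed's actual output may differ):

    # One illustrative ADF stored procedure activity (hypothetical names).
    activity = {
        "name": "load_hub_customer",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
            "referenceName": "dwh_synapse",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "storedProcedureName": "[dv].[load_hub_customer]",
            "storedProcedureParameters": {
                "load_date": {
                    "value": "@{pipeline().parameters.load_date}",
                    "type": "String",
                },
            },
        },
    }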

 

Choose your preferred FMC platform

 

ADF offers a visual presentation of the workflows and built-in monitoring, and provides seamless pipeline restartability and failure management. You can also create Azure Dashboards based on ADF metrics, or export the metrics to an external reporting tool like Power BI or Grafana.

 

The ADF FMC allows you to optimize parallelism and Azure cloud costs. Code generation is fully metadata-driven, while still allowing for integration with existing ADF pipelines, such as pre-staging or post-processing.

Other changes

Despite this being a smaller release, some other quality-of-life changes are included. Every page in VaultSpeed now contains a link to the relevant VaultSpeed docs (book icon). We also added all the subscription info to the dashboard, such as extra modules and support tiers.

Want to stay tuned about our releases? Leave your info below 👇 and we’ll add you to our mailing list.

Spark SQL & Non-Historized Links (Release 4.2.0)

A new major release of VaultSpeed is available, and it comes with a few key changes.
First, we introduce a new target platform with Apache Spark. Second, our users can now experience a giant leap in UX with our new source editor. We also added support for non-historized links, and last but not least: VaultSpeed Studio becomes available in open alpha.

Apache Spark & Spark SQL

From now on, we support Spark SQL. VaultSpeed has added a new target platform type, APACHE SPARK, with Spark SQL as the ETL language. The actual object storage behind Spark can be Hive, Delta Lake or others.

 

For this first implementation we only support batch mode, but Spark streaming support is in the works. You can generate DDL for Spark together with the ETL. We do not support auto-deploy yet, but we will add it later. VaultSpeed delivers the ETL code in the form of SQL files, and our flow management solution uses a combination of JDBC and the Spark SQL CLI to execute the Spark code as optimally as possible.
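
As an illustration of the CLI path, a generated SQL file could be executed like this (a minimal sketch; the file name and invocation are assumptions, not VaultSpeed's actual flow management code):

    import subprocess

    # Run one generated Spark SQL file through the spark-sql CLI.
    result = subprocess.run(
        ["spark-sql", "-f", "generated/fl_hub_customer.sql"],  # hypothetical file
        capture_output=True,
        text=True,
        check=True,  # raise if the load fails
    )
    print(result.stdout)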

Non-historized links

In our quest to fully support Data Vault 2.0, we added support for non-historized links.
Also known as the transactional link, this object type is very important in Data Vault for loading large tables with transactional data like sales, payments or other events.
VaultSpeed supports two variants: one with a unique identifier (such as a transaction id) and one without. You can set the transaction id in the source editor; it is a new attribute type.

A non-historized link is a variant of a many-to-many link table, but it does not have a satellite, so it does not track changes (no hash difference). You only ever insert records into this table type. When you are using a unique identifier, the load also applies a where-not-exists filter, and firewall views will filter out records based on that attribute when no CDC solution is available.
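
To make the loading pattern for the unique-identifier variant concrete, here is a hedged Spark SQL sketch (table and column names are hypothetical, not VaultSpeed's generated code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Insert-only load: new transactions are appended, existing ones skipped.
    spark.sql("""
        INSERT INTO dv.lnk_payment
        SELECT s.payment_hkey, s.customer_hkey, s.transaction_id,
               s.amount, s.load_date, s.record_source
        FROM stg.payments s
        WHERE NOT EXISTS (
            SELECT 1 FROM dv.lnk_payment l
            WHERE l.transaction_id = s.transaction_id
        )
    """)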

 

The payments table is modeled as a non-historized link

Source Graphical editor

Current users might have noticed it in the screenshot above: our source graphical editor was completely redesigned! It works faster and smoother, is more user-friendly, contains more information and is more pixel-perfect than ever before. It also conforms better to the standards of modern browsers. In the video below, you can see how to model your sources and prepare them for introduction into your data vault model.

 

Source Editor Demo Video

VaultSpeed Studio

VaultSpeed Studio is now in open alpha. Everyone can try it for free for 30 days; the trial period starts when you create your first template and is limited to 5 templates. Read more about VaultSpeed Studio in this previous post.

 

Other Changes

  • You can now ping the agents, which will display the host names of the machines where agents are running. On the Agent page you can also kill agents.

 

  • We added support for PITs on sources with different loading frequencies. VaultSpeed’s Flow Management will dynamically scale the Business Vault loading window based on the loading windows of the sources that were loaded before.
  • In addition to the non-historized link, we also added support for same-as links and hierarchical links.
  • We improved the level of parallelism between tasks. The following tasks can now run at the same time: DDL and ETL generation, deploy and generation.

If you want to get notified about our releases, you can leave your details below 👇

 


Snowflake Procedures (Release 4.1.17)

In the latest VaultSpeed release, we included important improvements for Snowflake customers ❄️.
We also used valuable customer feedback to build some improvements for VaultSpeed Studio. Finally, we partly redesigned ELT generation for Talend to increase generation speed and robustness.

Snowflake

Snowflake is one of the most popular data platforms around these days, and the success of their recent IPO emphasizes that. We support Snowflake as a target, and starting now, the integration just got better.
Following up on exciting developments at Snowflake, VaultSpeed now generates ELT code for Snowflake that is wrapped in JavaScript procedures. From now on, you can deploy and store your procedures inside Snowflake. One of the main advantages is that our workflow solution in Airflow can call these procedures instead of executing saved SQL files.
We also enabled auto-deploy for Snowflake: you can now deploy DDL and ETL to the target using VaultSpeed’s agent. Together, these changes make it possible to start loading your data warehouse without any manual interaction by your developers.
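
To make this concrete, here is a minimal sketch of calling such a deployed procedure with the Snowflake Python connector (the account, credentials and procedure name are hypothetical; the generated Airflow workflow performs the equivalent call for you):

    import snowflake.connector

    # Connect to Snowflake (hypothetical account and credentials).
    conn = snowflake.connector.connect(
        account="xy12345.eu-west-1",
        user="LOADER",
        password="...",
        warehouse="LOAD_WH",
        database="DV",
        schema="RAW",
    )
    # Call a deployed ELT procedure instead of executing a saved SQL file.
    conn.cursor().execute("CALL LOAD_HUB_CUSTOMER(%s)", ("2021-01-31",))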

 

Snowflake stored procedures

VaultSpeed Studio

In a previous release, we announced VaultSpeed Studio in a closed alpha version. In the past few weeks, we went to work with initial customer feedback. The first thing we improved is the integrated template code editor. It includes smoother navigation options, changing and saving a template is more robust, and you now have the option to compare your changes to the pre-edit version.

Template code editor

We added a view-all option to the target definition, where you can see all attributes of the template target. Previously, existing and newly created target attributes were shown in separate windows.

In a previous post, we explained the need for signature fields when doing automation. We completely redesigned the signature definition screen. You can create and select signature attributes at the top of the screen and assign them to attributes in a list below. This list can be filtered before assigning a certain signature to a filtered set of fields.

Signature attributes

VaultSpeed Studio will move to open alpha in one of the next releases. From then on, all clients can start a one-month free trial of VaultSpeed Studio.

Talend

Talend is one of the first ELT tools for which we supported automation. We do this by generating jobscript that can be deployed to create ELT mappings inside Talend Studio’s repository.

Due to improvements and changes in their product, the need for a major update of our template compiler for Talend became apparent. The result is that the generation of Talend jobscript is much more robust. We also took a giant leap in terms of speed: jobscript generation is up to 3 times faster compared to the previous release.

Quality of life changes

Some smaller changes will certainly improve user experience:

  • We added a download button to the automatic deployment screen. This allows users to download generated code through the browser instead of having to obtain it from the agent.

  • We improved the description of the attribute-based SAT split.
  • You can now directly upload CSV files through your browser to update short and abbreviated names in the source.
  • We moved the automatic deployment menu out of the generation settings since it didn't really belong there.
  • Users will experience improved loading performance of parameter screens.
  • We added an extra tab to the DDL settings page where you can get an overview of all applied settings per layer.
  • Business views are checked for naming conflicts with other objects. We also provided users with a button to disable/enable generation for all business views (to use in combination with filters).
  • Releases in the delta generation tab are now sorted such that the latest one is always shown first.
  • We added subscription updates as data points to the generations graph on the dashboard. It also shows the next subscription reset date.
    Based on the number of jobs granted in your subscription, we show a warning when the mapping counter is above 80% or at 100% of the subscription limit.
  • And a lot more that did not make this post... 🤷‍♂️

 

Our next release is coming up quite fast. Would you like to stay tuned on future VaultSpeed releases? Fill in this form 👇

 


Automated workflows for Apache Airflow

Data warehouse automation is much broader than the generation and deployment of DDL and ELT code alone. One of the areas that should also be automated is running and monitoring. In this area, VaultSpeed chose to integrate with Apache Airflow. Airflow is one of the most extensive and popular workflow and scheduling tools available, and VaultSpeed generates the workflows (or DAGs) to run and monitor the execution of loads using Airflow.

Intro

The ultimate goal of building a data hub or data warehouse is to store data and make it accessible to users throughout the organisation. To do that, you need to start loading data into it. One of the advantages of Data Vault 2.0, and the use of hash keys in particular, is that objects have almost no loading dependencies. This means that with Data Vault, your loading process can reach very high degrees of parallelism.
Another advantage is that data can be loaded at multiple speeds. You might want to load some objects in (near) real time, while hourly, daily or even monthly load cycles might be satisfactory for other data sources.
In every case, you need to set up the workflow to load each object, and you need to be able to schedule, host and monitor this loading process.
In this article, we will explain how to leverage automation for another important part of the data warehouse: developing, scheduling and monitoring your workflows.

 

VaultSpeed setup

VaultSpeed generates DDL and ELT code for your data warehouse, data hub or other integration system. Once deployed on the target environment, these components form the backbone of your data ingestion process. The next step is to organize these objects into a proper workflow and build proper Flow Management Control.

Parameters

The first step in this process is to set up some parameters inside VaultSpeed. This setup will influence how the Airflow DAGs are built:

 

FMC parameters

 

  • USE_FMC: whether or not you will use Flow Management Control. Enabling this causes some extra objects to be generated with your DDL.
  • FMC_DYNAMIC_LOADING_WINDOW: determines whether the beginning of the loading window is set by the end of the last successful run. Alternatively, the load gets a static loading window but waits for the successful execution of the previous run (see the sketch after this list).
  • SCHEMA_PL_PROCEDURES: the schema where your Presentation Layer procedures are stored. This is used when generating code for loading the PL.
  • FMC_GENERATE_SRC_LOADING_SCRIPTS: if enabled, extra tasks are added to the generated workflow to transfer data from your source system to the data warehouse.
  • FMC_SKIP_COMPLETED_TASKS: in case of an error, you need to restart your load. This parameter indicates whether tasks that already succeeded are run only once, or whether all tasks are rerun upon restarting.
  • SOURCE_PL_DEP: determines whether the load of the Business Vault and Presentation Layer should wait for the load of a source to be successful or simply completed.
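
To illustrate FMC_DYNAMIC_LOADING_WINDOW, here is a toy sketch of the two behaviours under the semantics described above (our own simplification, not the actual FMC implementation):

    from datetime import datetime, timedelta

    def next_window(last_success_end: datetime, interval: timedelta, dynamic: bool):
        """Return the (start, end) of the next incremental loading window."""
        if dynamic:
            # Dynamic: pick up exactly where the last successful run ended.
            return last_success_end, datetime.utcnow()
        # Static: fixed slots; the run still waits for the previous run to succeed.
        return last_success_end, last_success_end + interval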

Add workflows

Once you have set up the parameters, it is time to generate Airflow code. In our Flow Management Control screen, you can find all workflows and settings. Adding an FMC workflow is quite easy:

 

Adding an FMC Workflow

 

You can generate flows for your Raw Data Vault or Business Vault.
You have to select a load type, a data vault and one of your sources. VaultSpeed always builds a single flow for a single source; you can add dependencies between them in Airflow.

Choose a DAG name (a Directed Acyclic Graph, or DAG, is the name Airflow uses for a workflow). This name should be unique to the workflow, e.g. <dv name>_<source name>_<load type>. Each DAG should also have a description.

We also offer a feature to enable task grouping. When using SQL code and mini-batches, this reduces the overhead of creating connections to the database. So if connecting takes a lot of time compared to the actual mapping execution, enabling this option should save you some load time.

Choose a start date: the date and time at which your initial load will run, and the start of your first incremental load window.

By setting the concurrency, you indicate how many tasks in this workflow are allowed to execute at the same time. This depends on the scalability of the target platform and the resource usage (CPU) of your Airflow scheduler and workers (it can easily be changed later).

Finally, you enter names for your source and target database connections; these connections will be defined in Airflow later.

When you choose the incremental flow type, you need to set a schedule interval. This is the time between incremental loads, and it can be any of the following (see the sketch after this list):

"@hourly" : run once an hour at the beginning of the hour
"@daily" : run once a day at midnight
"@weekly" : run once a week at midnight on Sunday morning
"@monthly" : run once a month at midnight on the first day of the month
"@yearly" : run once a year at midnight on January 1
a cron expression (e.g. "0 0 * * *")
a Python timedelta (e.g. timedelta(minutes=15))
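
For reference, the interval can be expressed in a DAG definition in any of these forms (a minimal sketch with a hypothetical DAG name):

    from datetime import datetime, timedelta

    from airflow import DAG

    dag = DAG(
        dag_id="dv_sales_incr",                   # hypothetical name
        start_date=datetime(2021, 5, 1),
        schedule_interval=timedelta(minutes=15),  # or "@hourly", or "0 0 * * *"
    )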

Generate code

To generate code for a workflow, you need to pick a version of your generated ELT and select the one for which you want to build your workflow. Hitting "Start Generation" launches the code generation.

A few files are generated behind the scenes: first, a Python script containing the actual DAG; in addition, some JSON files containing the rest of the setup.

 

Generating workflows

 

You can always look back at previous workflow generations: their settings at the time, the data vault and source releases, when they were generated and the name of the zip file containing the generated Python code.

Airflow Setup

To get started with Airflow, you need to set up a metadata database and install Airflow. Once you have a running Airflow environment, some setup remains, and our VaultSpeed FMC plugin for Airflow does most of it for you.

After setting up database connections and adding some variables, you can import your newly generated DAGs. You can do this manually, but automated deployment from VaultSpeed straight to Airflow (or through a versioning system) is of course also possible.

 

DAGs in Airflow

 

To start executing loads, start the scheduler and the workers.
When you unpause the freshly imported workflow, your DAG will start running.
For the initial load, the current date and time need to be past the start date; the incremental load will start after the start date plus one interval.

The picture below shows the DAG we generated in this example. You can clearly see that we used the grouping function, as it has grouped tasks into four parallel loading sets per layer.

 

Generated Airflow DAG

 

Airflow offers an intuitive interface in which you can monitor workflows, view logs and more. You can get a good overview of execution times and search for outliers. Upon failure you can inspect logs of separate tasks and start debugging.

 

Conclusion

Building a data warehouse, data hub or any other type of integration system requires building data structures and ELT code. But those two components are useless without workflows. You can build them manually, but VaultSpeed offers a solution that is less time-consuming and more cost-effective.

Automation in VaultSpeed is not limited to code generation and deployment. VaultSpeed also automates your workflows. When considering a standard workflow solution, Apache Airflow is our tool of choice.

About Apache Airflow: it is well established in the market and open source. It features an intuitive interface and makes it easy to scale out workers horizontally when you need to execute lots of tasks in parallel. It is easy to run both locally and in a (managed) cloud environment.

In a few steps, you can set up workflow generation, versioning and auto-deployment into your Airflow environment. Once that is completed, you can start loading data.

VaultSpeed’s FMC workflow generation is sold as a separate module in addition to the standard environment. It targets clients looking for a proper solution to organize workflows in their integration platforms. More details on pricing for this automation module are available upon request.