Make Data Warehousing great again!

Tag: dv2.0

All objects (advanced level) of Data Vault 2.0

In Data Vault 2.0, there are several types of advanced objects that can be used to model and manage data in an enterprise data warehouse. These include:

Basic objects

  1. Hubs: A hub is a central point in the Data Vault model that represents a unique business entity. Hubs are used to store the unique business keys and attributes of an entity, and they act as the central reference point for all related satellite and link objects.
  2. Satellites: A satellite is a supporting object in the Data Vault model that stores additional attributes and historical data about a business entity represented by a hub. Satellites are used to capture changes in the attributes of an entity over time.
  3. Links: A link is a connecting object in the Data Vault model that represents a relationship between two or more business entities. Links are used to storing the relationships between entities and the attributes that describe those relationships.


Intermediate objects

  1. Bridges: A bridge is an advanced object in the Data Vault model that is used to model many-to-many relationships between entities. Bridges are used to store the relationships between entities and the attributes that describe those relationships in a way that is more flexible and efficient than using multiple links.
  2. Reference Data: Reference data is a type of satellite object that stores data that is used to classify or categorize other data in the Data Vault model. Reference data can be used to store codes, descriptions, and other attributes that are used to describe business entities in the Data Vault model.

    https://www.briansestrada.com/2017/09/data-vault-20-9-modelado-references.html
  3. Link Satellite (LSAT): link satellite is a type of object that is used to store additional attributes and historical data about a link object in the Data Vault model. Link satellites are used to capture changes in the attributes of a link over time, and they are typically used to store data that is specific to the link object and not relevant to the entities that are related by the link.

    Link satellites are implemented as tables in a database, and they are typically composed of three columns: a link identifier, a satellite identifier, and a satellite attribute column. The link identifier is a foreign key that references the link object, and the satellite identifier is a unique identifier for the link satellite. The satellite attribute column is used to store the attributes and data that are specific to the link satellite.

    Link satellites are used in conjunction with hubs, satellites, and links in the Data Vault model. The hub stores the unique business keys and attributes of an entity, the satellite stores additional attributes and historical data about the entity, and the link is used to represent the relationship between the entities. The link satellite is used to store data about the link object itself and the relationship between the entities.

    https://communities.sas.com/t5/SAS-Communities-Library/Using-SAS-DI-Studio-to-Load-a-Data-Vault-Part-II-DV2-0/ta-p/221776
    Image copyright: Anna Brown (https://communities.sas.com/t5/SAS-Communities-Library/Using-SAS-DI-Studio-to-Load-a-Data-Vault-Part-II-DV2-0/ta-p/221776)

Advance objects

  1. Non-historized Links (NHLINK): A non-historized link is a type of link object in the Data Vault model that is used to represent a current or active relationship between two or more business entities. Non-historized links do not store historical data about the relationships between entities, and they are typically used to represent relationships that are expected to remain unchanged over time.

    https://www.scalefree.com/modeling/the-value-of-non-historized-links/
    Image copyright: Scalefree (https://www.scalefree.com/modeling/the-value-of-non-historized-links/)
  2. Point-in-Time (PIT) Tables: A Point-in-Time (PIT) table is a type of table in the Data Vault model that stores historical data about the state of a business entity at a specific point in time. PIT tables are used to capture snapshot data about an entity and can be used to track changes in the attributes of an entity over time.

    Image copyright: KentGraziano (https://vertabelo.com/blog/data-vault-series-the-business-data-vault/)
  3. Same as link (SAL): Same as Link is a type of connecting object that is used to represent a relationship between two or more business entities that are considered to be the same or equivalent. SALs are used to eliminate redundancy and consolidate data in the Data Vault model.

    SALs are typically used when two or more entities represent the same real-world concept or object, but they have different business keys or identifiers. For example, a customer may have multiple accounts with a company, and each account may have a different account number. A SAL could be used to link the different accounts together and indicate that they represent the same customer.


    Image copyright: Michael Olschimke (https://www.sciencedirect.com/topics/computer-science/raw-data-vault)

Did I miss any object? please let me know in the comments section below.

Best practices when splitting your satellites in Data Vault modelling

In data vault modeling, satellites are used to store historical and contextual data about an entity in a data vault. When splitting satellites, it is important to consider the following best practices:

  1. Use a consistent and logical naming convention: Use a naming convention that is easy to understand and follow. This will help you easily identify and locate the satellites you need.
  2. Keep related data together: Group data that is related or belongs to the same entity in the same satellite. This will make it easier to understand and analyse the data.
  3. Avoid overloading satellites: Avoid adding too much data to a single satellite. If a satellite becomes too large, it can be difficult to manage and maintain.
  4. Use the correct data types: Make sure to use the correct data types for each attribute in the satellite. This will ensure that the data is stored and used efficiently.
  5. Consider data integrity: When splitting satellites, make sure to consider the impact on data integrity. You want to ensure that you do not lose any data or create inconsistencies when splitting the satellites.

By following these best practices, you can ensure that your satellites are organised and maintained in a way that makes it easy to understand and use the data in your data vault.

Criteria to follow when splitting your Satellites

There are a few steps you can follow to split your satellites in data vault modeling:

  1. Identify the reason for splitting: Determine the reason for splitting the satellites. This could be because the satellite has become too large, or because you want to group related data together in a more logical way.
  2. Determine the criteria for splitting: Decide on the criteria for splitting the satellite. This could be based on the type of data being stored, the time period it covers, or any other relevant factors.
  3. Create a new satellite: Create a new satellite for the data that meets the splitting criteria. Make sure to use a consistent and logical naming convention, and to include all relevant attributes in the new satellite.
  4. Migrate the data: Migrate the data from the old satellite to the new satellite. Make sure to carefully check the data to ensure that it has been migrated correctly and that there are no inconsistencies or data loss.
  5. Update any related links: If the satellite being split is linked to other objects in the data vault, make sure to update the links to point to the new satellite.

Avoiding when splitting your Satellites

When splitting satellites in data vault modeling, it is important to avoid the following:

  1. Losing data: Make sure to carefully migrate all data from the old satellite to the new satellite, to ensure that no data is lost during the split.
  2. Creating inconsistencies: Pay attention to the data being migrated, to ensure that it is consistent and that there are no inconsistencies introduced during the split.
  3. Overloading the new satellite: Avoid adding too much data to the new satellite. If a satellite becomes too large, it can be difficult to manage and maintain.
  4. Using an inconsistent naming convention: Make sure to use a consistent and logical naming convention when creating the new satellite. This will help you easily identify and locate the satellite in the future.

By avoiding these pitfalls, you can ensure that the process of splitting your satellites in the data vault is smooth and does not compromise the integrity of your data

Introduction to DataVault 2.0

My previous blog, where I had articles for more than 10 years around Data Warehousing and Data Modeling was deleted due to some misunderstanding with my hosting provider, but after some thought I would like to start fresh a new one and focus mainly on Data Modeling and Data Warehousing surrounding my favourite architecture, Data Vault 2.0.

DISCLAIMER: If you already know DV2.0 then you can skip this article.

What is Data Vault 2.0

Data Vault 2.0 is a data modeling approach that was developed by Dan Linstedt and is designed to support the creation of a long-term, scalable data warehouse. It is based on the original Data Vault model, which was designed to provide a flexible and scalable way to manage data in a data warehouse environment.

Data Vault 2.0 is an extension of the original Data Vault model and includes additional features and improvements. Some of the key features of Data Vault 2.0 include:

  • A focus on building a data warehouse that can be easily maintained and evolved over time
  • A modular design that allows data to be added or modified without requiring changes to the entire data model
  • The use of standardized, reusable components to improve the efficiency and speed of data modeling
  • A flexible, scalable architecture that can support large volumes of data and high levels of concurrency

Data Vault 2.0 is often used in conjunction with other data management tools and technologies, such as ETL (extract, transform, load) tools and data lakes, to support the creation of a comprehensive data management solution.

Main advantages

There are several advantages to using Data Vault 2.0 as a data modeling approach:

  1. Scalability: Data Vault 2.0 is designed to support the management of large volumes of data and high levels of concurrency, making it well-suited for use in data warehouses with a high volume of data.
  2. Modularity: The modular design of Data Vault 2.0 allows data to be added or modified without requiring changes to the entire data model, which can make it easier to maintain and evolve the data warehouse over time.
  3. Reusability: Data Vault 2.0 uses standardized, reusable components, which can improve the efficiency and speed of data modeling.
  4. Flexibility: Data Vault 2.0 is a flexible data modeling approach that can accommodate a wide range of data types and structures.
  5. Historical data management: Data Vault 2.0 is designed to support the management of historical data, allowing users to track changes to data over time and support the creation of historical reports.
  6. Data governance: Data Vault 2.0 includes features that support data governance, such as the ability to track data lineage and ensure data quality.
  7. Integration with other tools: Data Vault 2.0 can be used in conjunction with other data management tools and technologies, such as ETL (extract, transform, load) tools and data lakes, to support the creation of a comprehensive data management solution.

Main disadvantages

Some potential disadvantages of using Data Vault 2.0 as a data modeling approach include:

  1. Complexity: Data Vault 2.0 can be complex to implement and may require specialized training and expertise to use effectively.
  2. Performance: Data Vault 2.0 can potentially result in a larger number of tables and relationships compared to other data modeling approaches, which may impact query performance.
  3. Lack of support for certain types of data: Data Vault 2.0 may not be well-suited for certain types of data, such as data with complex relationships or data that requires a high level of normalization.
  4. Limited support for real-time reporting: Data Vault 2.0 is primarily designed for use in data warehouses and may not be well-suited for supporting real-time reporting or analytics.
  5. Data quality challenges: Data Vault 2.0 relies on the accuracy and completeness of the data being loaded into the data warehouse, and may not include built-in features for data cleansing or validation.

It’s worth noting that the suitability of Data Vault 2.0 for a given situation will depend on the specific requirements and constraints of the data management project. It may be necessary to carefully consider the trade-offs and potential drawbacks of Data Vault 2.0 before deciding to use it as the data modeling approach.

Conclusion

Knowing that all the alternative techniques such 3NF or Dimensional Modeling, have also their pros and cons, I would say DV2.0 is the best option for a complex, long lasting and modern Data Warehouse. We could say, that it is a recent technique that can fit in our current technologies.