Make Data Warehousing great again!

Tag: data vault

All objects (advanced level) of Data Vault 2.0

Data Vault 2.0 defines several types of objects, from basic to advanced, that can be used to model and manage data in an enterprise data warehouse. These include:

Basic objects

  1. Hubs: A hub is a central point in the Data Vault model that represents a unique business entity. A hub stores only the entity's unique business keys (typically alongside a hash key, a load date, and a record source), not its descriptive attributes, and acts as the central reference point for all related satellite and link objects.
  2. Satellites: A satellite is a supporting object in the Data Vault model that stores the descriptive attributes of a business entity represented by a hub, together with their history. Satellites are used to capture changes in the attributes of an entity over time.
  3. Links: A link is a connecting object in the Data Vault model that represents a relationship between two or more business entities. Links store the relationships between hubs; attributes that describe those relationships belong in a link satellite.
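
To make the three basic objects more concrete, here is a minimal Python sketch of how their rows could be shaped. The field names, the `hash_key` helper, the MD5-based hashing, and the sample keys are assumptions made for illustration; in a real warehouse these would be database tables, and the hashing convention varies between teams.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more business keys (assumed convention)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HubRow:
    hub_hash_key: str
    business_key: str       # the only business data a hub holds
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class SatelliteRow:
    hub_hash_key: str       # points at the parent hub
    load_date: datetime     # one row per detected change
    record_source: str
    attributes: tuple       # descriptive attributes as (name, value) pairs


@dataclass(frozen=True)
class LinkRow:
    link_hash_key: str
    hub_hash_keys: tuple    # keys of the hubs being related
    load_date: datetime
    record_source: str


# Hypothetical sample data: one customer, one product, one purchase relationship.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
customer = HubRow(hash_key("C-1001"), "C-1001", now, "crm")
product = HubRow(hash_key("P-77"), "P-77", now, "erp")
customer_details = SatelliteRow(
    customer.hub_hash_key, now, "crm", (("name", "Ada"), ("city", "Lima"))
)
purchase = LinkRow(
    hash_key("C-1001", "P-77"),
    (customer.hub_hash_key, product.hub_hash_key),
    now,
    "shop",
)
```

Note how the hub carries nothing but the business key and load metadata, while everything descriptive sits in the satellite and the relationship sits in the link.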


Intermediate objects

  1. Bridges: A bridge is a query-assistance object, usually placed in the Business Vault, that spans several hubs and links. A bridge stores precomputed combinations of the keys along a path of related entities, so that queries crossing many relationships avoid repeating the same multi-table joins.
  2. Reference Data: Reference data is stored in reference tables that classify or categorize other data in the Data Vault model. Unlike satellites, reference tables are usually not attached to a hub; they hold codes, descriptions, and other attributes that are looked up by code from other objects.

    https://www.briansestrada.com/2017/09/data-vault-20-9-modelado-references.html
  3. Link Satellite (LSAT): A link satellite is a type of object that is used to store additional attributes and historical data about a link object in the Data Vault model. Link satellites capture changes in the attributes of a link over time, and they typically store data that is specific to the relationship itself and not relevant to the individual entities that the link relates.

    Link satellites are implemented as tables in a database. A typical link satellite contains the key of the parent link (a foreign key that references the link object), a load date that together with the link key identifies each version of the row, a record source, and one or more columns for the descriptive attributes of the relationship.

    Link satellites are used in conjunction with hubs, satellites, and links in the Data Vault model. The hub stores the unique business keys of an entity, the satellite stores descriptive attributes and historical data about the entity, and the link represents the relationship between the entities. The link satellite stores data about the link object itself, that is, about the relationship between the entities.

    Image copyright: Anna Brown (https://communities.sas.com/t5/SAS-Communities-Library/Using-SAS-DI-Studio-to-Load-a-Data-Vault-Part-II-DV2-0/ta-p/221776)
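
As a rough illustration of the link satellite described above, the following Python sketch models its rows and looks up the current version for a given link. The class name, the `discount_pct` and `status` attributes, and the sample keys are hypothetical; the sketch only assumes that each row is identified by the parent link key plus a load date.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LinkSatelliteRow:
    link_hash_key: str      # references the parent link
    load_date: datetime     # together with link_hash_key identifies the row
    record_source: str
    discount_pct: float     # hypothetical attribute of the relationship
    status: str             # hypothetical attribute of the relationship


def current_row(rows: List[LinkSatelliteRow], link_hash_key: str) -> Optional[LinkSatelliteRow]:
    """Return the most recently loaded row for a link, or None if there is none."""
    candidates = [r for r in rows if r.link_hash_key == link_hash_key]
    return max(candidates, key=lambda r: r.load_date, default=None)


utc = timezone.utc
rows = [
    LinkSatelliteRow("lnk-1", datetime(2024, 1, 1, tzinfo=utc), "shop", 0.0, "active"),
    LinkSatelliteRow("lnk-1", datetime(2024, 3, 1, tzinfo=utc), "shop", 10.0, "active"),
    LinkSatelliteRow("lnk-2", datetime(2024, 2, 1, tzinfo=utc), "shop", 5.0, "paused"),
]

latest = current_row(rows, "lnk-1")
```

Older rows are never overwritten, so the earlier discount for `lnk-1` remains available as history.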

Advanced objects

  1. Non-historized Links (NHLINK): A non-historized link (also known as a transactional link) is a type of link object in the Data Vault model that records events or transactions between two or more business entities, such as sales transactions or sensor readings. Because such records are not expected to change after they have been loaded, non-historized links do not track history for them: each record is inserted once and never updated.

    Image copyright: Scalefree (https://www.scalefree.com/modeling/the-value-of-non-historized-links/)
  2. Point-in-Time (PIT) Tables: A Point-in-Time (PIT) table is a query-assistance table in the Data Vault model that records, for each business entity and each snapshot date, which row of each surrounding satellite was valid at that point in time. PIT tables usually store keys and load dates rather than the attributes themselves, which makes it simpler and faster to reconstruct the state of an entity as of a given date.

    Image copyright: Kent Graziano (https://vertabelo.com/blog/data-vault-series-the-business-data-vault/)
  3. Same-as Link (SAL): A same-as link is a type of connecting object that is used to represent a relationship between two or more business keys that are considered to identify the same or equivalent business entity. SALs are used to map duplicates to one another and consolidate data in the Data Vault model.

    SALs are typically used when two or more records represent the same real-world concept or object, but they have different business keys or identifiers. For example, the same customer may be registered under different customer numbers in two source systems. A SAL could be used to link the different customer numbers together and indicate that they represent the same customer.


    Image copyright: Michael Olschimke (https://www.sciencedirect.com/topics/computer-science/raw-data-vault)
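
The Point-in-Time idea from the list above can be sketched in a few lines of Python. This is a simplified illustration, not a production pattern: the `build_pit` function, the satellite names, and the sample dates are assumptions, and `None` stands in for the case where a satellite has no row yet on a snapshot date.

```python
from bisect import bisect_right
from datetime import date


def build_pit(snapshot_dates, satellite_load_dates):
    """For each hub key and snapshot date, record the satellite load date in effect.

    satellite_load_dates maps satellite name -> {hub_key: sorted list of load dates}.
    """
    hub_keys = sorted({k for per_hub in satellite_load_dates.values() for k in per_hub})
    pit = []
    for hub_key in hub_keys:
        for snapshot in snapshot_dates:
            row = {"hub_key": hub_key, "snapshot_date": snapshot}
            for sat_name, per_hub in satellite_load_dates.items():
                loads = per_hub.get(hub_key, [])
                # Number of loads on or before the snapshot date.
                idx = bisect_right(loads, snapshot)
                row[sat_name] = loads[idx - 1] if idx else None
            pit.append(row)
    return pit


# Hypothetical load history for one customer across two satellites.
loads = {
    "sat_customer_name": {"cust-1": [date(2024, 1, 5), date(2024, 2, 10)]},
    "sat_customer_address": {"cust-1": [date(2024, 1, 20)]},
}
pit = build_pit([date(2024, 1, 31), date(2024, 2, 29)], loads)
```

Each PIT row holds only pointers (load dates) into the satellites, so a query for a given snapshot can join straight to the right satellite rows instead of searching their history.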

Did I miss any object? Please let me know in the comments section below.

When to use or not use Data Vault

Data Vault is a data modeling technique that is often used when there is a need to store and manage large amounts of data in a data warehouse, and where the data is expected to change frequently or have complex relationships. It is well-suited for situations where the data needs to be accessed and processed in real-time or near real-time, and where the data needs to be maintained in a way that allows for easy tracking of changes over time.

When to use it

Data Vault is also well-suited for projects that require the ability to track the history and evolution of data over time. It provides a robust and flexible framework for storing and managing data, and can be easily adapted to changing business needs.

Scenarios to detect that you need it

  • When the data is expected to change frequently: Data Vault records each change as a new row rather than overwriting existing data, so frequent changes can be tracked and stored over time.
  • When the data has complex relationships: Data Vault is well-suited for storing and managing data with complex relationships, as it uses a flexible hub-and-link architecture that preserves these relationships.
  • When the data needs to be accessed and processed in real-time or near real-time: Data Vault is often used in combination with real-time or near real-time ETL processes, which is useful when data needs to be accessed and processed as it is being generated or updated.
  • When changes need to be tracked over time: Data Vault stores data in a way that keeps the full history of changes, which is useful when the data needs to be audited or analyzed historically.
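
The change-tracking behaviour mentioned in the scenarios above can be illustrated with a small Python sketch of an insert-only satellite load. The `hash_diff` and `load_satellite` helpers, the MD5-based comparison, and the sample customer are assumptions for illustration; the sketch also assumes rows arrive in load order.

```python
import hashlib
from datetime import datetime, timezone


def hash_diff(attributes: dict) -> str:
    """Fingerprint a set of descriptive attributes so changes are cheap to detect."""
    payload = "||".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(history: list, hub_key: str, attributes: dict, load_date: datetime) -> bool:
    """Append a new version only if the attributes differ from the latest stored one."""
    new_diff = hash_diff(attributes)
    latest = next((r for r in reversed(history) if r["hub_key"] == hub_key), None)
    if latest is not None and latest["hash_diff"] == new_diff:
        return False  # nothing changed, so nothing is written
    history.append(
        {"hub_key": hub_key, "load_date": load_date, "hash_diff": new_diff, **attributes}
    )
    return True


utc = timezone.utc
history = []
results = [
    load_satellite(history, "cust-1", {"city": "Lima"}, datetime(2024, 1, 1, tzinfo=utc)),
    load_satellite(history, "cust-1", {"city": "Lima"}, datetime(2024, 1, 2, tzinfo=utc)),
    load_satellite(history, "cust-1", {"city": "Cusco"}, datetime(2024, 1, 3, tzinfo=utc)),
]
```

Existing rows are never updated or deleted, which is what makes the history usable for auditing: the unchanged second load is skipped, and the move to a new city becomes a second row.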

When you do NOT need it

On the other hand, Data Vault may not be the best choice for all projects. It can be more complex and resource-intensive to implement compared to some other data modeling techniques, and may not be the most suitable choice for projects with more limited resources or simpler data structures.

Data Vault may not be the best choice in situations where the data is not expected to change frequently, or where the data relationships are relatively simple. Additionally, Data Vault may not be the most appropriate choice for situations where the data does not need to be accessed and processed in real-time, or where the data volume is relatively small.

Scenarios where you’ll find that you DO NOT need it

  • When the data is not expected to change frequently: If the data is not expected to change frequently, it may be more efficient to use a different data modeling technique that is better suited for static data.
  • When the data has relatively simple relationships: If the data has relatively simple relationships, it may be more efficient to use a different data modeling technique that is better suited for storing and managing simpler data structures.
  • When the data does not need to be accessed and processed in real-time or near real-time: If the data does not need to be accessed and processed in real-time or near real-time, it may be more efficient to use a different data modeling technique that is better suited for batch processing.
  • When the data volume is relatively small: If the data volume is relatively small, it may be more efficient to use a different data modeling technique that is better suited for smaller data sets.

Key factors to identify when evaluating Data Vault as an option

There are a few key factors to consider when determining if Data Vault is the most appropriate data modeling technique for a particular project:

  1. Data volume: Data Vault is typically used to store and manage large amounts of data, and may not be the most efficient choice for smaller data sets.
  2. Data complexity: Data Vault is well-suited for storing and managing data with complex relationships, and may not be the most efficient choice for data with relatively simple relationships.
  3. Data change frequency: Data Vault is designed to handle data that is expected to change frequently, and may not be the most efficient choice for data that is not expected to change frequently.
  4. Real-time or near real-time processing requirements: Data Vault is often used in combination with real-time or near real-time ETL processes, and may not be the most appropriate choice for situations where the data does not need to be accessed and processed in real-time or near real-time.
  5. Data maintenance requirements: Data Vault is designed to store data in a way that allows for easy tracking of changes over time, which may be useful for scenarios where the data needs to be maintained in a way that allows for easy auditing or analysis. If these requirements are not present, a different data modeling technique may be more appropriate.

Conclusion

Overall, the specific choice of data modeling technique will depend on the needs and goals of the project, as well as the resources and constraints of the organization. Careful planning and evaluation of the trade-offs and benefits of different approaches may be necessary in order to determine the most appropriate solution.

How Data Vault 2.0 works with Data Mesh

Introduction

Data Vault is a data modeling approach that is designed to provide a flexible and scalable foundation for storing and managing data in a data warehouse. It is based on the idea of creating a central repository for all data in an organization, which is then used to support the various data needs of the organization, such as reporting, analytics, and data integration.

Data Mesh is an approach to building and managing data systems that emphasizes collaboration, transparency, and shared ownership. It is based on the idea of creating a decentralized architecture for data management, in which domain teams are empowered to take ownership of the data they produce and use, to treat that data as a product, and to work together under shared governance to create a common understanding of the data landscape.

In a Data Mesh environment, the Data Vault can be used as a central repository for storing and managing data that is shared across the organization. This can include data from multiple sources, such as transactional systems, IoT devices, and external data sources. The Data Vault can be used to store both raw data and transformed data, and can be accessed by various teams and applications as needed.

By using the Data Vault as a central repository for data in a Data Mesh environment, organizations can benefit from the flexibility and scalability of the Data Vault model, while also taking advantage of the decentralized, collaborative approach of Data Mesh. This can help organizations to better manage their data assets, foster greater collaboration among teams, and improve the overall quality and value of their data.

How Data Vault helps with Data Mesh architecture

Data Vault can help with Data Mesh in a number of ways:

  1. Central repository: Data Vault can serve as a central repository for storing and managing data in a Data Mesh environment. This can help to ensure that data is consistent, reliable, and easily accessible to all teams and applications that need it.
  2. Flexibility and scalability: Data Vault is designed to be flexible and scalable, which can be particularly useful in a Data Mesh environment where teams are working on a variety of data-related projects. The Data Vault model can accommodate a wide range of data types and structures, and can easily adapt to changes in the data landscape over time.
  3. Data governance: The Data Vault can help to facilitate data governance in a Data Mesh environment by providing a central location for storing and managing data. This can help to ensure that data is consistently defined and properly documented, and can make it easier for teams to access and use data in a consistent way.
  4. Collaboration: By using Data Vault as a central repository for data in a Data Mesh environment, teams can more easily collaborate on data-related projects. The Data Vault model provides a clear, structured approach for storing and managing data, which can help teams to work together more effectively and to achieve better results.
  5. Data integration: The Data Vault can be used to integrate data from multiple sources in a Data Mesh environment. This can help to ensure that data is properly cleaned, transformed, and integrated, and can make it easier for teams to access and use data from different sources.
  6. Data sharing: The Data Vault can be used to store data that is shared across the organization in a Data Mesh environment. This can help to foster collaboration and shared ownership of data, and can make it easier for teams to access and use data that is relevant to their work.

Overall, Data Vault can support Data Mesh by providing a flexible, scalable, and well-governed foundation for storing and managing data in a decentralized, collaborative environment.