Data Warehouse-Friendly Database Design

A data warehouse is a collection of data that facilitates the decision-making process. It is non-volatile, time-variant, and integrated. Database design, on the other hand, refers to the creation of a detailed data model of a database. The model includes logical and physical design choices along with physical storage parameters.

More and more technical experts are emphasizing on creating a database design that is coherent with the data warehouse (data warehouse-friendly designs).

OLAP vs. OLTP

Two types of database systems exist: OLTP and OLAP.

  1. Online Transactional Processing (OLTP): OLTP collects data and then synchronizes the data to OLAP. OLTP focuses on short but rapid and high-frequency querying.

These two database systems work closely together to create a complete data warehouse ecosystem for enterprises. In simple terms, OLTP systems feed data to data warehouses, and OLAP systems analyze the data.

Data Synchronization Method

There are two methods to synchronize data from OLTP to OLAP. Below is a brief description:

  1. Full Synchronization: This method synchronizes data from the entire table to the corresponding table of OLAP each time.

Compared with incremental synchronization, full synchronization needs to synchronize all the data each time. It takes a longer time and imposes a higher load on the system. Whereas, for incremental synchronization, the challenge is identifying the incremental data.

Typical Table Fields

The following fields are contained when creating a table:

  1. primary key: Each table has a primary key

Synchronization Process

Data warehouse includes the ODS layer and the DW layer. ODS stores OLTP’s raw data. The synchronization process mainly occurs on the ODS layer. If the modified_time is the record of the day, it indicates that the relevant data is the incremental data of the very day. Therefore, when synchronizing, you can make a judgment based on the modified_time.

  1. Possible Exceptions: For instance, the created_time of entry A is 20170304, but modified at 20170305 00:20, then the modified_time was changed to 20170305.

Deduplication Process

  1. When summarizing the ODS layer tables on the DW layer, execution of deduplication becomes necessary. This is because the same entry may have been modified several times and during different time periods.

Data Verification Process

When execution of deduplication is implemented for the data of the ODS layer on the DW layer, you can get the same data as the production data. Verify whether it is true with the following method:

Data volume on the DW layer = All the data for production (total) — Data volume created today (created) — Data volume deleted before today (deleted)

Using table_a table as an example:

total = select count(*) from table_a
created = select count(*) from table_a where created_time = T+1
deleted = select count(*) from table_a where created_time <= T and modified_time < = T and deleted_flag = true
result = total - created - deleted

During this process, understanding the calculation method of the deleted data volume (deleted) is difficult. You can understand it like this:

If the modified_time is today, it can be confirmed that this data existed before today, because when the deleted_flag is set as true, the modified_time remains unchanged.

Conclusion

In this blog, we discussed about the types of database systems that exist in data warehousing and how we can achieve data synchronization by adopting different methods. Various processes could be followed to facilitate the creation of a database design that is supportive and coherent to the database system. The consolidation of a database system and a database design yield advantages such as enhanced data availability, reduced response time, along with efficient volume replication.

Reference:

https://www.alibabacloud.com/blog/data-warehouse-friendly-database-design_593712?spm=a2c41.11614065.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.