A data warehouse is a collection of data that facilitates the decision-making process. It is non-volatile, time-variant, and integrated. Database design, on the other hand, refers to the creation of a detailed data model of a database. The model includes logical and physical design choices along with physical storage parameters.
More and more technical experts are emphasizing on creating a database design that is coherent with the data warehouse (data warehouse-friendly designs).
OLAP vs. OLTP
Two types of database systems exist: OLTP and OLAP.
- Online Transactional Processing (OLTP): OLTP collects data and then synchronizes the data to OLAP. OLTP focuses on short but rapid and high-frequency querying.
- Online Analytical Processing (OLAP): OLAP reduces the load of the OLTP and improves the utilization rate of data along with analyzing the data. OLAP focuses on relatively low but complex operations, such as data mining.
These two database systems work closely together to create a complete data warehouse ecosystem for enterprises. In simple terms, OLTP systems feed data to data warehouses, and OLAP systems analyze the data.
Data Synchronization Method
There are two methods to synchronize data from OLTP to OLAP. Below is a brief description:
- Full Synchronization: This method synchronizes data from the entire table to the corresponding table of OLAP each time.
- Incremental Synchronization: This procedure synchronizes modified data to the corresponding table of OLAP each time.
Compared with incremental synchronization, full synchronization needs to synchronize all the data each time. It takes a longer time and imposes a higher load on the system. Whereas, for incremental synchronization, the challenge is identifying the incremental data.
Typical Table Fields
The following fields are contained when creating a table:
- primary key: Each table has a primary key
- created_time: This field indicates when the data was created and it remains unchanged.
- modified_time: This field indicates when the data was modified. It updates to the latest time whenever it is modified.
- deleted_flag: This field indicates that physical deletion is not enabled. If you want to delete data, you need to set this field as true. If this field is set as true, it remains the same, and the modified_time will not change.
Data warehouse includes the ODS layer and the DW layer. ODS stores OLTP’s raw data. The synchronization process mainly occurs on the ODS layer. If the modified_time is the record of the day, it indicates that the relevant data is the incremental data of the very day. Therefore, when synchronizing, you can make a judgment based on the modified_time.
- Possible Exceptions: For instance, the created_time of entry A is 20170304, but modified at 20170305 00:20, then the modified_time was changed to 20170305.
- The synchronization task would have been executed at 20170305 00:00. However, due to the schedule delay, it was not executed until 20170305 00:30.
- At this time, entry A will be missed, because the modified_time of entry A has been changed to 20170305 at the time of synchronization.
- Possible Solution: One of the solutions is to change the synchronization condition. If the modified_time of the record is of the same day, or the created_time of the record is of the same day, it indicates that relevant data is the incremental data of the very day.
- When summarizing the ODS layer tables on the DW layer, execution of deduplication becomes necessary. This is because the same entry may have been modified several times and during different time periods.
- Deduplication execution can occur by using the analysis function row_number() over (partition by primary_key order by modified_time desc). That is, get the latest modified_time record using the primary key. Once completed, you can delete the record whose deleted_flag is true during this process.
Data Verification Process
When execution of deduplication is implemented for the data of the ODS layer on the DW layer, you can get the same data as the production data. Verify whether it is true with the following method:
Data volume on the DW layer = All the data for production (total) — Data volume created today (created) — Data volume deleted before today (deleted)
Using table_a table as an example:
total = select count(*) from table_a
created = select count(*) from table_a where created_time = T+1
deleted = select count(*) from table_a where created_time <= T and modified_time < = T and deleted_flag = true
result = total - created - deleted
During this process, understanding the calculation method of the deleted data volume (deleted) is difficult. You can understand it like this:
If the modified_time is today, it can be confirmed that this data existed before today, because when the deleted_flag is set as true, the modified_time remains unchanged.
In this blog, we discussed about the types of database systems that exist in data warehousing and how we can achieve data synchronization by adopting different methods. Various processes could be followed to facilitate the creation of a database design that is supportive and coherent to the database system. The consolidation of a database system and a database design yield advantages such as enhanced data availability, reduced response time, along with efficient volume replication.