Problem Statement
What is data-movement, and when does a system require it? What architectural solutions are available for such a system? How can the most appropriate data-movement solution be determined?
Solution Abstract
What? When? How?
What is data-movement? The act of moving a copy of data from its current physical location(s) to different location(s), reliably and repeatedly.
When to use data-movement? A data-movement from source-system to target-system is needed if:
- the need is to provide a redundant copy of the data to the target-system;
- the target-system is restricted from accessing the source-system data;
- the systems have different data structure requirements;
- the target-system requires read-only access, or does not require updates made to it to persist;
- the availability window of the source-system is no longer adequate, e.g., an 8x5 availability becomes 24x7, leaving no window for offline processing;
- the network or platform of the source-system is not reliable; and/or
- the network bandwidth is inadequate to support the real-time data access and performance needs.
How to determine the solution? This is a multi-step process:
- identify the driving forces which warrant a data-movement solution;
- determine the high-level architectural approach that can be applied; and
- finally, determine the design approach at a more precise level.
The Inner-workings
Any data-movement system is composed of three building blocks, AMW, i.e., Acquire, Manipulate, Write. Each block implements the following.
- (A) Acquire: The extraction of data from the source-system.
- (M) Manipulate: The enrichment of data acquired from the source-system.
- (W) Write: Writing the acquired and manipulated data to the target-system.
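The three building blocks can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`acquire`, `manipulate`, `write`, `move`), the dictionary record shape, and the use of in-memory lists as source and target are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # hypothetical record shape, assumed for this sketch


def acquire(source: Iterable[Record]) -> Iterator[Record]:
    """(A) Acquire: extract records from the source-system."""
    for record in source:
        yield dict(record)  # shallow copy, so the source rows are not shared


def manipulate(records: Iterable[Record],
               enrich: Callable[[Record], Record]) -> Iterator[Record]:
    """(M) Manipulate: apply an enrichment function to each acquired record."""
    for record in records:
        yield enrich(record)


def write(records: Iterable[Record], target: list) -> int:
    """(W) Write: append the records to the target-system, return the count."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


def move(source: Iterable[Record], target: list,
         enrich: Callable[[Record], Record] = lambda r: r) -> int:
    """Chain the three blocks: Acquire, then Manipulate, then Write."""
    return write(manipulate(acquire(source), enrich), target)


source_rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]

# Default enrichment is the identity: the target receives a plain copy.
replica: list = []
moved = move(source_rows, replica)

# A non-trivial Manipulate block: upper-case the name on the way through.
warehouse: list = []
move(source_rows, warehouse,
     enrich=lambda r: {**r, "name": r["name"].upper()})
```

Keeping each block a separate function means the responsibility of one block (here, Manipulate) can change without touching the other two.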
The exact nature or responsibility of each of these building blocks can lead to two architectural solutions, i.e.,
- Replication, and
- ETL or Extract-Transform-Load.
Replication Pipe-line
ETL Pipe-line
Decision Matrix
With the basic understanding of data-movement above, the two bus-matrices below can be used to derive the most applicable architecture and design approach. In each matrix, the higher the number of crosses 'X', the better aligned the solution is to the problem statement.
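The counting rule described above can be sketched as follows. The matrix contents here are invented placeholders, not the paper's actual matrices; the names `PLACEHOLDER_MATRIX` and `rank_solutions` are likewise hypothetical. Only the scoring logic (more crosses means better alignment) comes from the text.

```python
# Placeholder matrix, assumed for illustration only: each solution maps to the
# set of driving forces under which it would carry a cross 'X'.
PLACEHOLDER_MATRIX = {
    "Replication": {"redundant copy", "restricted source access"},
    "ETL": {"different data structures", "restricted source access"},
}


def rank_solutions(matrix: dict, forces: set) -> list:
    """Count the crosses each solution has for the given driving forces
    and return (solution, crosses) pairs, highest count first."""
    scores = {name: len(crosses & forces) for name, crosses in matrix.items()}
    # sorted() is stable, so solutions with equal counts keep matrix order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_solutions(
    PLACEHOLDER_MATRIX,
    {"redundant copy", "restricted source access"},
)
```

With these placeholder entries, `ranking` lists "Replication" first with two crosses and "ETL" second with one.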
Glossary
Master | The source data store which is to be copied. This is considered the original source of data. |
Slave | The target data store. |
Master-Master Replication | A bi-directional replication between source and target systems. Based on the replication direction, the master and slave switch roles. |
Master-Slave Replication | A uni-directional replication from source (master) to target (slave). |
Master-Master Row Synch | A special Master-Master replication in which conflict resolution is done at the row level. |
Master-Slave-Snapshot Replication | A special Master-Slave replication in which the complete data from the source is copied to the target at a point in time. |
Master-Slave-Cascade | A Master-Slave replication topology where replication from source to target is achieved through a cascade of intermediate masters/slaves. |
Batch ETL | An ETL in which data from the source systems is copied to the target schema in a batch process, primarily for Data Warehouse based reporting. |
Real-time/Streaming ETL | An ETL in which a continuous stream of data is processed and copied incrementally to the target schema, primarily used for real-time analytics. |
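Two of the glossary terms, Master-Slave-Snapshot Replication and Master-Slave-Cascade, can be illustrated with a small sketch. It assumes in-memory dictionaries standing in for data stores, and the function names `snapshot_replicate` and `cascade_replicate` are hypothetical; real replication products work very differently in detail.

```python
import copy


def snapshot_replicate(master: dict, slave: dict) -> None:
    """Master-Slave-Snapshot: replace the slave's contents with a complete
    point-in-time copy of the master."""
    slave.clear()
    slave.update(copy.deepcopy(master))


def cascade_replicate(master: dict, chain: list) -> None:
    """Master-Slave-Cascade: each store in the chain is the slave of the
    previous one and acts as the master of the next one."""
    upstream = master
    for node in chain:
        snapshot_replicate(upstream, node)
        upstream = node


master = {"k1": {"value": 1}}
intermediate: dict = {}
target = {"stale": True}

cascade_replicate(master, [intermediate, target])

# A later change on the master does not reach the target until the next run.
master["k1"]["value"] = 2
```

After the cascade, `target` holds the master's data as of the time of the copy; the later update to `master` is not reflected, which is the point-in-time behaviour the glossary describes.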
References
Paper Code: TWP_1001.10, Version: 1.0, Author: Biswajit Dash, License: CC-BY-ND, Published: Jan-2016