Data Movement - Architecture Essentials

Modified on Wed, 03 Oct 2018 12:45 by Biswajit Dash — Categorized as: blueprint, classical architecture, data architecture, whiteprint

Problem Statement

What is data-movement, and when does a system require it? What architectural solutions are available for such a system? How can the most appropriate data-movement solution be determined?


Solution Abstract

What? When? How?

What is data-movement? The act of moving a copy of data from its current physical location(s) to different location(s), reliably and repeatedly.

When to use data-movement? A data-movement from source-system to target-system is needed if :-
How to determine the solution? This is a multi-step process, i.e., :-

The Inner-workings

Any data-movement system is composed of three building blocks, abbreviated AMW: Acquire, Manipulate, and Write. Each block implements the responsibilities shown below.


Image
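
As a rough illustration of the three AMW blocks, the sketch below chains an acquire, a manipulate, and a write step into one pipeline. The function names, the in-memory lists standing in for the source and target stores, and the sample rows are all assumptions made for illustration; they are not taken from the blueprint itself.

```python
from typing import Callable, Iterable, Iterator

Row = dict


def acquire(source: Iterable[Row]) -> Iterator[Row]:
    """Acquire: read a copy of each row from the (hypothetical) source store."""
    for row in source:
        yield dict(row)  # copy, so the source rows are never mutated


def manipulate(rows: Iterable[Row], transform: Callable[[Row], Row]) -> Iterator[Row]:
    """Manipulate: apply a caller-supplied transformation to each row."""
    for row in rows:
        yield transform(row)


def write(rows: Iterable[Row], target: list) -> int:
    """Write: append each row to the target store and return the row count."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


def run_pipeline(source, target, transform=lambda row: row) -> int:
    """Chain the three blocks; the default transform passes rows through as-is."""
    return write(manipulate(acquire(source), transform), target)


def upper_name(row: Row) -> Row:
    """Example transformation (an assumption): upper-case the 'name' field."""
    changed = dict(row)
    changed["name"] = changed["name"].upper()
    return changed


source_rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
copy_target: list = []
transformed_target: list = []

copied = run_pipeline(source_rows, copy_target)
transformed = run_pipeline(source_rows, transformed_target, upper_name)
```

Keeping the three blocks as separate functions means the Manipulate step can be swapped without touching how data is acquired or written, which is one way to read the point that the responsibility given to each block shapes the resulting architecture.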

The exact nature or responsibility of each of these building blocks can lead to two architectural solutions, i.e.,

Replication Pipe-line

Image

ETL Pipe-line

Image

Decision Matrix

With the basic understanding of data-movement above, the two bus-matrices below can be used to derive the most applicable architecture and design approach. In each matrix, the higher the number of crosses ('X'), the better aligned the solution is to the problem statement.

Image

Image
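
The cross-counting rule can be sketched in a few lines. The matrix entries and requirement names below are hypothetical placeholders, not the values from the matrices above; only the two solution names come from this document.

```python
# Hypothetical bus-matrix: rows are requirements, columns are solutions.
# An "X" marks a solution that aligns with the requirement. These entries
# are placeholders and do NOT reproduce the original matrices.
MATRIX = {
    "requirement_1": {"Replication Pipe-line": "X", "ETL Pipe-line": ""},
    "requirement_2": {"Replication Pipe-line": "X", "ETL Pipe-line": "X"},
    "requirement_3": {"Replication Pipe-line": "", "ETL Pipe-line": "X"},
}


def score_solutions(matrix: dict, applicable: list) -> dict:
    """Count the crosses each solution earns over the applicable requirements."""
    scores: dict = {}
    for requirement in applicable:
        for solution, mark in matrix[requirement].items():
            scores[solution] = scores.get(solution, 0) + (1 if mark == "X" else 0)
    return scores


def best_aligned(scores: dict) -> list:
    """Return every solution that shares the highest cross count."""
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


# Suppose only the first two requirements apply to the problem statement.
scores = score_solutions(MATRIX, ["requirement_1", "requirement_2"])
winners = best_aligned(scores)
```

Returning a list rather than a single name keeps ties visible, so a draw between solutions is surfaced to the architect instead of being silently broken.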

Glossary

Master: The source data store which is to be copied. This is considered the original source of data.
Slave: The target data store.
Master-Master Replication: A bi-directional replication between source and target systems. Depending on the replication direction, the master and slave switch roles.
Master-Slave Replication: A uni-directional replication from source (master) to target (slave).
Master-Master Row Synch: A special Master-Master replication in which conflict resolution is done at row level.
Master-Slave-Snapshot Replication: A special Master-Slave replication in which the complete data from the source is copied to the target at a point in time.
Master-Slave-Cascade: A Master-Slave replication topology in which replication from source to target is achieved through a cascade of intermediate masters/slaves.
Batch ETL: An ETL in which data from the source systems is copied to the target schema in a batch process, primarily for Data Warehouse based reporting.
Real-time/Streaming ETL: An ETL in which a continuous stream of data is processed and copied incrementally to the target schema, primarily for real-time analytics.
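
To make the Master-Master Row Synch entry more concrete, here is a minimal sketch of row-level conflict resolution between two stores. The glossary does not say which resolution policy is used; the "highest version wins" rule, the `version` field, and the dictionary-based stores are assumptions chosen only for illustration.

```python
def row_sync(node_a: dict, node_b: dict) -> None:
    """Bring two stores to the same state, resolving conflicts row by row.

    Each store maps a row key to a row dict carrying a 'version' counter.
    Assumed policy: the row with the higher version wins.
    """
    for key in set(node_a) | set(node_b):
        row_a = node_a.get(key)
        row_b = node_b.get(key)
        if row_a is None:
            node_a[key] = dict(row_b)   # row exists only on B: copy to A
        elif row_b is None:
            node_b[key] = dict(row_a)   # row exists only on A: copy to B
        elif row_a["version"] > row_b["version"]:
            node_b[key] = dict(row_a)   # A is newer: A acts as master for this row
        elif row_b["version"] > row_a["version"]:
            node_a[key] = dict(row_b)   # B is newer: B acts as master for this row
        # equal versions: the row is treated as already in sync


node_a = {
    1: {"value": "a1", "version": 2},
    2: {"value": "a2", "version": 1},
}
node_b = {
    1: {"value": "b1", "version": 1},
    3: {"value": "b3", "version": 1},
}

row_sync(node_a, node_b)
```

Because the decision is made per row, each store can be the master for some rows and the slave for others within the same synchronisation pass.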


Paper Code: TWP_1001.10, Version: 1.0, Author: Biswajit Dash, License: CC-BY-ND, Published: Jan-2016