Problem Statement
What is a batch system? What are the essential design elements of such a system?
Design Abstract
What? When?
What is a batch system? A batch process is the non-interactive execution of a series of steps or programs on a set of inputs. Its three key elements are :-
- non-interactive execution : the execution does not involve any manual intervention at run-time;
- set of inputs : a set of inputs, such as records from a database or lines from a file, is processed; and
- series of programs : each input is passed through a series of steps or programs.
A batch system provides the containing infrastructure for such batch processing.
When to use a batch? A batch system is used to process a large volume of inputs in an offline mode. Some day-to-day uses include :-
- online order - orders captured through an online system are processed in an offline batch;
- message queue - requests queued in a queuing system are picked up and processed, mostly in sequence;
- data integration - data received from a source-system is fed into a target-system;
- data warehouse - data from multiple systems is unified and fed into a warehouse;
- month end processing - a large volume of data is processed for report generation; and
- many more.
The Inner-workings
Batch Flow - Basic
At a basic level, a batch system consists of three steps :-
- read each input from the set of inputs;
- process each input through one or more steps; and
- repeat the above until all inputs are processed.
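The three steps above can be sketched as a plain loop. This is a minimal illustration rather than a prescribed implementation; the function name `run_basic_batch` and the idea of modelling each step as a callable are assumptions made for the example.

```python
from typing import Callable, Iterable, List


def run_basic_batch(inputs: Iterable[str], steps: List[Callable[[str], str]]) -> List[str]:
    """Read each input, pass it through every step in order, and collect the results."""
    results = []
    for item in inputs:        # read each input from the set of inputs
        for step in steps:     # process the input through one or more steps
            item = step(item)
        results.append(item)
    return results             # the loop ends once all inputs are processed


if __name__ == "__main__":
    lines = ["  alice ", " bob"]
    print(run_basic_batch(lines, [str.strip, str.upper]))  # prints ['ALICE', 'BOB']
```

Note that this loop has no error handling at all: a failure in any step stops the whole batch, which is exactly the gap a real-life flow has to close.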
Batch Flow - Advanced
What are the key challenges? Beyond the basic steps above, a real-life batch system needs to answer some key questions, such as :-
- What action should be taken if the batch is executed repeatedly with the same set of inputs?
- Should the batch continue or stop if an error occurs while reading a specific input?
- Should the batch continue or stop if an error occurs while processing a specific input?
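One possible way to approach the first question is to fingerprint the input set and compare it against fingerprints of earlier runs. The sketch below assumes inputs are text records and keeps the seen fingerprints in an in-memory set; the names `batch_fingerprint` and `is_duplicate_execution` are hypothetical, and a real system would more likely persist the fingerprints in a database or a control file.

```python
import hashlib
from typing import Iterable, Set


def batch_fingerprint(inputs: Iterable[str]) -> str:
    """Return a SHA-256 digest computed over the inputs, in order."""
    digest = hashlib.sha256()
    for item in inputs:
        data = item.encode("utf-8")
        # A length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def is_duplicate_execution(inputs: Iterable[str], seen: Set[str]) -> bool:
    """Record the fingerprint of this input set and report whether it was seen before."""
    fingerprint = batch_fingerprint(inputs)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False


if __name__ == "__main__":
    seen_runs: Set[str] = set()
    print(is_duplicate_execution(["order-1", "order-2"], seen_runs))  # prints False
    print(is_duplicate_execution(["order-1", "order-2"], seen_runs))  # prints True
```

Whether a detected duplicate should be skipped, rejected, or re-processed is a business decision; the check only supplies the fact needed to make it.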
The sketch depicts the detailed flow of a real-life (but simple and sequential) batch system.
Step | Description |
---|---|
Config Settings | The set of configuration parameters used by the batch system in run-time decision making. |
Init Batch Context | The step initializes the batch execution context, based on which different run-time decisions are taken. |
Source: Database/Files | The source of inputs to the batch process. |
Detect Duplicate Execution | The step detects whether the batch is being executed again on the same set of inputs. |
Connect Data Source | The step to connect to the input source and buffer/read the inputs. |
Read Input | The step to read or pick a single input from the input set for processing. |
Verify Input Format | The step to verify the format compliance of the current input. This is mostly useful for file-based inputs, specifically to verify field lengths, field count, data types, etc. |
Log Error | The step to log a run-time error. |
Log Format Error | The step to log an input that does not comply with the format specifications. This log is used to take corrective action on the failing inputs. |
Process Input | The step to process the current input. This step is functionality-specific, and may be a composition of one or more programs/steps. |
Abort Batch | The step to abort the current batch execution. This can perform clean-up tasks such as logging and closing connections. |
Close Batch | The step to successfully complete the batch. This can perform clean-up tasks such as logging and closing connections. |
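The steps in the table can be tied together roughly as follows. This is a simplified, sequential sketch under several assumptions: inputs are comma-separated lines, duplicate execution is detected through a caller-supplied batch id, the "Connect Data Source" step is left out (the lines are passed in directly), and all names (`BatchConfig`, `BatchContext`, `run_batch`, and the configuration fields) are hypothetical.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set

log = logging.getLogger("batch")


@dataclass
class BatchConfig:                      # "Config Settings" (field names are assumptions)
    field_count: int = 3
    abort_on_format_error: bool = False
    abort_on_process_error: bool = True


@dataclass
class BatchContext:                     # state created by "Init Batch Context"
    batch_id: str
    processed: int = 0
    format_errors: List[str] = field(default_factory=list)
    status: str = "RUNNING"


def run_batch(batch_id: str, lines: Iterable[str],
              process: Callable[[List[str]], None],
              config: BatchConfig, completed_ids: Set[str]) -> BatchContext:
    ctx = BatchContext(batch_id)
    if batch_id in completed_ids:       # "Detect Duplicate Execution"
        ctx.status = "SKIPPED_DUPLICATE"
        return ctx
    for line in lines:                  # "Read Input"
        fields = line.split(",")
        if len(fields) != config.field_count:       # "Verify Input Format"
            log.warning("format error: %r", line)   # "Log Format Error"
            ctx.format_errors.append(line)
            if config.abort_on_format_error:
                ctx.status = "ABORTED"              # "Abort Batch"
                return ctx
            continue
        try:
            process(fields)             # "Process Input"
            ctx.processed += 1
        except Exception:
            log.exception("processing failed: %r", line)  # "Log Error"
            if config.abort_on_process_error:
                ctx.status = "ABORTED"              # "Abort Batch"
                return ctx
    ctx.status = "COMPLETED"            # "Close Batch"
    completed_ids.add(batch_id)
    return ctx
```

The two `abort_on_*` flags are where the continue-or-stop questions are answered: with the defaults shown, a badly formatted line is logged and skipped, while a processing failure aborts the batch.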
Implementation Notes
Besides the above basic flow, the design and implementation of a real-life, high-performing batch system also needs to support :-
- parallel processing of multiple inputs for faster completion;
- a scale-out configuration where subsets of inputs can be processed on different nodes/hardware;
- a fail-over and recovery mechanism where the pending tasks on a failed node can be picked up by a different node;
- real-time notification for faster corrective action; and
- capture of run-time stats such as batch duration, failed/passed counts, and average processing time.
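Two of these requirements, parallel processing and run-time stats, can be illustrated together on a single node. The sketch below uses a thread pool from the Python standard library; the names `BatchStats` and `run_parallel` and the default of four workers are assumptions for the example. Scale-out across nodes, fail-over, and notification are not covered here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class BatchStats:
    passed: int
    failed: int
    duration_s: float          # wall-clock duration of the whole batch
    avg_processing_s: float    # mean time spent per input


def _timed(process: Callable[[str], object], item: str) -> Tuple[bool, float]:
    """Process one input and return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        process(item)
        ok = True
    except Exception:
        ok = False             # a failed input is counted, not re-raised
    return ok, time.perf_counter() - start


def run_parallel(inputs: List[str], process: Callable[[str], object],
                 workers: int = 4) -> BatchStats:
    batch_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda item: _timed(process, item), inputs))
    duration = time.perf_counter() - batch_start
    passed = sum(1 for ok, _ in outcomes if ok)
    total_time = sum(elapsed for _, elapsed in outcomes)
    return BatchStats(
        passed=passed,
        failed=len(outcomes) - passed,
        duration_s=duration,
        avg_processing_s=total_time / len(outcomes) if outcomes else 0.0,
    )


if __name__ == "__main__":
    stats = run_parallel(["1", "2", "x", "4"], int)
    print(stats.passed, stats.failed)  # prints 3 1
```

Threads tend to suit I/O-bound steps such as database or file access; for CPU-bound steps in Python, a process pool would usually be the more appropriate choice.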
Glossary
Input | A "single input element" over which processing is applied. It can be a record from a database record-set, or a line from a file. |
Node | A hardware unit hosting the batch system, capable of independently executing a batch process end-to-end. |
Continue vs. Abort | The decision to either "continue processing" or "abort processing". |
Paper Code: TWP_1003.10, Version: 1.0, Author: Biswajit Dash, License: CC-BY-ND, Published: Aug-2016