Managing Large Datasets – data availability
Author: Marcel Hylkema, Solution Architect at Group 2000
Data retention is all about persisting and disclosing data to meet legal, compliance and business data archival requirements.
One of the key requirements of data retention is that the stored data is available for retrieval during the entire operational
period. This poses some challenges for the responsible organizations. The data must be protected against hardware and software
failures, should survive site disasters and be guarded against human errors. And yet, it must be possible to disclose requested
data within acceptable time frames.
Protection against hardware failures
The selection and deployment of the database in which the data is retained determines its ability to withstand hardware failures
such as power supply failures, disk crashes and more severe server problems.
Common SQL databases typically consist of a single database engine running on a single server or on a shared server architecture
with shared storage. When a single server is used, the stored data may be replicated to a backup server. When shared storage is used
(e.g. a Storage Area Network (SAN) or Network Attached Storage (NAS)), the storage device typically consists of redundant hardware and RAID
disks, and it may be deployed in a redundant configuration that keeps a replica of the data on a backup storage device. However, when
very large amounts of data have to be stored or the size of the data grows, the hardware setup may become challenging.
NoSQL databases provide an alternative when it comes to storing large amounts of data. In such databases, the data is typically distributed
across multiple servers and is intrinsically redundant because each piece of data is stored multiple times (data replication). Because the copies
are placed on different hardware components, these databases can, by design, withstand the loss of a hardware component without losing data.
Maintenance processes and tooling ensure that failed components are replaced and that data is redistributed to the new components.
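The placement rule described above can be illustrated with a minimal sketch. It assumes a fixed list of uniquely named nodes and a replication factor of three. The node names and the functions `place_replicas` and `readable_after_failure` are hypothetical and do not reflect the behaviour of any specific database product.

```python
import hashlib


def place_replicas(key: str, nodes: list[str], replication_factor: int = 3) -> list[str]:
    """Return the distinct nodes that should hold a copy of `key`.

    Assumes `nodes` contains no duplicates (hypothetical cluster layout).
    """
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    # Hash the key to pick a deterministic starting position in the node list.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the list from that position so every copy lands on a different node.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable_after_failure(replicas: list[str], failed: set[str]) -> bool:
    """A record stays available as long as one copy is on a healthy node."""
    return any(node not in failed for node in replicas)


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # assumed cluster
replicas = place_replicas("record-42", nodes)
```

With three copies on three different nodes, the record remains readable after any two of those nodes fail; it is lost only if all three fail before the data has been redistributed.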
Another advantage of distributing and replicating the data is that the insert and query load is spread across multiple components, increasing
performance. Typically, such databases are designed to run on low-cost hardware and scale in capacity and performance by adding more servers.
Protection against site disasters
Protection against site disasters is another challenge for data retention. The stored data may be lost if the site where the database is
located is struck by, for example, fire, flooding or an earthquake. Data can be protected against such disasters in roughly two ways:
• Keeping a recent backup copy of the data
• Replicating the data in a live (backup) data store
The main difference between the two options is the speed with which the data becomes accessible again after a disaster.
In the first option, keeping a backup copy, the database must be rebuilt from scratch if the site is lost: all hardware must be replaced,
the software installed and the backup copies restored. The major advantage is that the initial costs are kept low, as backup storage is
relatively cheap. Replacement costs are incurred only when a disaster has actually happened. However, the time before the database is operational
again may be considerable due to hardware delivery times and the required restoration effort.
The second option, maintaining a live backup data store, leads to almost no data unavailability in case of a disaster. The database is fully
operational at the backup site and operation can continue with practically no delay. To achieve this, however, a live database environment must be
set up and maintained that is strictly needed only when a disaster actually happens. To make better use of it, the backup data store may be included in
the operational process of data retrieval, which increases the query capacity.
Protection against human errors
The risk of human error is largely removed when data processing and handling are automated throughout the entire data lifetime.
The data lifetime consists of three phases:
1. Insert or ETL (Extract, Transform, Load) phase: this typically consists of one or more processes tasked to receive/retrieve data feeds, transform the data and load the transformed data into the database.
2. Retention phase: the data is available for disclosure; during this period the data is disclosed on request.
3. Expiration phase: in data retention applications, data may only be stored for a limited time; the organization must ensure that the data is no longer available once the retention period has expired.
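The expiration phase in particular lends itself to automation. The sketch below shows one possible way to select records whose retention period has passed. The `Record` type, the `split_expired` function and the 365-day retention period are assumptions made for illustration; the actual retention period depends on the applicable legal and business requirements.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Record:
    record_id: str
    stored_at: datetime  # moment the record entered the retention phase


def split_expired(records, retention: timedelta, now: datetime):
    """Split records into those still retained and those past the retention period."""
    cutoff = now - retention
    kept = [r for r in records if r.stored_at > cutoff]
    expired = [r for r in records if r.stored_at <= cutoff]
    return kept, expired


# Hypothetical example: a 365-day retention period evaluated on 1 January 2024.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    Record("old", datetime(2022, 6, 1, tzinfo=timezone.utc)),
    Record("fresh", datetime(2023, 6, 1, tzinfo=timezone.utc)),
]
kept, expired = split_expired(records, timedelta(days=365), now)
```

In a real deployment, a scheduled job would run such a check and delete the expired records from the database and from any replicas and backup copies, so that no manual action is needed.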
During each phase, data access should be fully automated to eliminate the possibility of human error. The main risk that remains is
that of software updates. However, extensive testing, close monitoring during and after the software update, and creating backup copies
reduce the risk of data loss caused by a software update.
Protection against intentional data loss (data tampering) is not covered here in detail. This risk can be mitigated using organizational measures
(e.g. adequate procedures, authorizing selected personnel and safeguarding access credentials) and technical measures (e.g. access restrictions and data hashing).
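As an illustration of the data hashing measure, the sketch below computes a SHA-256 digest when a record is stored and checks it again when the record is disclosed. The function names `fingerprint` and `is_unmodified` and the sample payload are hypothetical. The sketch also assumes that the digest is kept separately from the data under stricter access control, since a digest stored next to the data could be recomputed by whoever altered the data.

```python
import hashlib
import hmac


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 digest of a stored record as a hexadecimal string."""
    return hashlib.sha256(payload).hexdigest()


def is_unmodified(payload: bytes, expected_digest: str) -> bool:
    """Check a record against the digest recorded when it was inserted."""
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(fingerprint(payload), expected_digest)


original = b"subscriber=123;start=2024-01-01T10:00:00Z"  # hypothetical record
stored_digest = fingerprint(original)  # recorded at insert time, kept separately

tampered = b"subscriber=999;start=2024-01-01T10:00:00Z"
```

Here `is_unmodified(original, stored_digest)` succeeds, while the check on `tampered` fails, so a modification is detected at disclosure time.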
In data retention, keeping the data available during its lifetime is a key requirement. The risks of data loss due to hardware failures,
site disasters and human errors can be minimized by careful deliberation and preparation, and by taking the right precautions. The costs of uninterrupted
data availability should be weighed against the risk of data being unavailable for a longer period.