Managing Large Datasets – data availability


Author: Marcel Hylkema, Solution Architect at Group 2000



Data retention is all about persisting and disclosing data to meet legal, compliance and business data archival requirements. One of the key requirements of data retention is that the stored data is available for retrieval during the entire operational period. This poses some challenges for the responsible organizations. The data must be protected against hardware and software failures, should survive site disasters and be guarded against human errors. And yet, it must be possible to disclose requested data within acceptable time frames.

Protection against hardware failures

The choice of database in which the data is retained, and the way it is deployed, determine how well the data withstands hardware failures such as power supply failures, disk crashes and more severe server problems.

Common SQL databases typically consist of a single database engine running on a single server, or on a shared server architecture with shared storage. With a single server, the stored data may be replicated to a backup server. If shared storage is used (e.g. a Storage Area Network (SAN) or Network Attached Storage (NAS)), the storage device typically consists of redundant hardware and RAID disks, and it may be deployed in a redundant configuration that keeps a replica of the data on a backup storage device. However, when very large amounts of data have to be stored, or when the data keeps growing, this hardware setup may become challenging.

NoSQL databases provide an alternative when it comes to storing large amounts of data. In many NoSQL databases, the data is distributed across multiple servers and is intrinsically redundant because each item is stored multiple times (data replication). Because the copies are stored on different hardware components, such databases can by design withstand the loss of hardware components without losing data. Maintenance processes and tooling ensure that failed components are replaced and that data is redistributed to the new components.
Another advantage of distributing and replicating the data is that the insert and query load is spread across multiple components, which increases performance. Typically, such databases are designed to run on low-cost hardware and scale out in capacity and performance by adding more hardware.
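
The replication idea can be illustrated with a minimal sketch. It assumes a simple ring placement in which each record is copied to a fixed number of consecutive nodes; the function names, node names and the replication factor of 3 are hypothetical and do not describe any particular database product.

```python
import hashlib


def place_replicas(key: str, nodes: list[str], replication_factor: int = 3) -> list[str]:
    """Return the distinct nodes that hold a copy of `key` (illustrative ring placement)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    # Use a stable hash so the placement is the same on every run.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Take the next `replication_factor` nodes on the ring, wrapping around.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def survives_failure(key: str, nodes: list[str], failed: set[str],
                     replication_factor: int = 3) -> bool:
    """True if at least one replica of `key` sits on a node that has not failed."""
    replicas = place_replicas(key, nodes, replication_factor)
    return any(node not in failed for node in replicas)


# Hypothetical five-node cluster: any two nodes may fail without losing a record.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(place_replicas("record-1", nodes))
print(survives_failure("record-1", nodes, failed={"node-a", "node-b"}))
```

With a replication factor of 3, a record is lost only if all three nodes holding its copies fail before the data has been redistributed.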

Protection against site disasters

Protection against site disasters is another challenge for data retention. The stored data may be lost if the site where the database is located is struck by, for example, fire, flooding or an earthquake. Data can be protected against such disasters in roughly two ways:
• Keeping a recent backup copy of the data
• Replicating the data in a live (backup) data store
The main difference between the two options is how quickly the data is accessible again after the disaster.

With the first option, keeping a backup copy, the database must be rebuilt from scratch if the site is lost: all hardware must be replaced, the software installed and the backup copies restored. The major advantage is that the initial costs are kept low, as backup storage is relatively cheap, and replacement costs are incurred only when a disaster has actually happened. However, the time before the database is operational again may be considerable due to hardware delivery times and the required restoration effort.

The second option, maintaining a live backup data store, leads to almost no data unavailability in case of a disaster. The database is fully operational at the backup site and operation can continue with practically no delay. To achieve this, however, a live database environment must be set up and maintained that is needed only when a disaster actually happens. This cost can be partly offset by including the backup data store in the operational data retrieval process, which increases the query capacity.
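
To make the recovery time of the backup-copy option concrete, the following sketch adds up the steps named above: hardware delivery, installation and restoring the backup. The class, its field names and all figures in the usage example are hypothetical and purely illustrative; actual values depend on suppliers, data volume and restore tooling.

```python
from dataclasses import dataclass


@dataclass
class BackupRestorePlan:
    """Rough recovery-time estimate for rebuilding a database from a backup copy."""
    hardware_delivery_hours: float
    installation_hours: float
    data_size_tb: float
    restore_throughput_tb_per_hour: float

    def recovery_hours(self) -> float:
        # Restoring takes as long as the data volume divided by the restore speed.
        restore_hours = self.data_size_tb / self.restore_throughput_tb_per_hour
        return self.hardware_delivery_hours + self.installation_hours + restore_hours


# Illustrative values only: 3 days delivery, 8 hours installation,
# 100 TB restored at 2 TB per hour.
plan = BackupRestorePlan(
    hardware_delivery_hours=72,
    installation_hours=8,
    data_size_tb=100,
    restore_throughput_tb_per_hour=2,
)
print(plan.recovery_hours())  # 130.0
```

An estimate of this kind can be set against the near-immediate failover of a live backup data store when comparing the two options.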

Protection against human errors

Human errors are largely avoided when data processing and handling is automated throughout the entire data lifetime.
The data lifetime consists of a number of phases:
1. Insert or ETL (Extract, Transform, Load) phase: this typically consists of one or more processes tasked to receive/retrieve data feeds, transform the data and load the transformed data into the database.
2. Retention phase: the data is available for disclosure; during this period the data is disclosed on request.
3. Expiration phase: in data retention applications, data may only be stored for a limited time; the organization must ensure that the data is no longer available once the retention period has expired.
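
The expiration phase in particular lends itself to automation. The following is a minimal sketch, assuming each record carries the timestamp at which it was stored and that a single retention period applies; the function names, the in-memory dictionary and the 365-day period are hypothetical stand-ins for a real database and a real legal requirement.

```python
from datetime import datetime, timedelta, timezone


def is_expired(stored_at: datetime, retention: timedelta, now: datetime) -> bool:
    """A record expires once its full retention period has passed."""
    return now >= stored_at + retention


def purge_expired(records: dict[str, datetime], retention: timedelta,
                  now: datetime) -> list[str]:
    """Delete expired records in place and return their identifiers."""
    expired = [record_id for record_id, stored_at in records.items()
               if is_expired(stored_at, retention, now)]
    for record_id in expired:
        del records[record_id]
    return expired


# Illustrative run with a hypothetical 365-day retention period.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = {
    "old-record": now - timedelta(days=400),
    "recent-record": now - timedelta(days=10),
}
print(purge_expired(records, timedelta(days=365), now))  # ['old-record']
```

Running such a job on a schedule, rather than deleting data by hand, removes one of the steps in which a human error could destroy data that must still be retained.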

During each phase, data access should be fully automated to minimize the possibility of human errors. The main remaining risk is that of software updates. However, extensive testing, close monitoring during and after the update, and creating backup copies beforehand reduce the risk of data loss caused by a software update.

Protection against intentional data loss (data tampering) is not covered here in detail. This risk can be mitigated using organizational measures (e.g. adequate procedures, authorizing selected personnel and safeguarding access credentials) and technical measures (e.g. access restrictions and data hashing).
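
As an illustration of data hashing, the sketch below computes a SHA-256 digest when a record is stored and checks it again before disclosure. The function names are hypothetical, and the sketch assumes the digest is kept separately from the record and is itself protected; a plain hash stored alongside the data could simply be recomputed by whoever altered the record.

```python
import hashlib
import hmac


def fingerprint(record: bytes) -> str:
    """Return the SHA-256 digest of a record as a hexadecimal string."""
    return hashlib.sha256(record).hexdigest()


def verify(record: bytes, stored_digest: str) -> bool:
    """True if the record still matches the digest taken when it was stored."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(fingerprint(record), stored_digest)


# Illustrative use: take the digest at insert time, check it at disclosure time.
original = b"subscriber=12345;event=call;duration=42"
digest = fingerprint(original)
print(verify(original, digest))                                    # True
print(verify(b"subscriber=12345;event=call;duration=1", digest))  # False
```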

Conclusion

In data retention, keeping the data available during its lifetime is a key requirement. The risks of data loss due to hardware failures, site disasters and human errors can be minimized through careful deliberation, preparation and the right precautions. The costs of uninterrupted data availability should be weighed against the risk of data being unavailable for a longer period.