The New ACID Database Properties for Big Data
From aggregation techniques to the data lifecycle, many things need to be revisited for big data, including the popular ACID database properties. While the original ACID properties are still relevant, IT teams need to take note of a different set of ACID properties for big data.
How Are ACID Properties Related to Database Type?
First, it is important to know the history and importance of the ACID properties. In 1983, German computer science professors Andreas Reuter and Theo Härder coined the acronym ACID to qualify, categorize, and streamline the transaction concept. These properties (Atomicity, Consistency, Isolation, and Durability) capture the major characteristics of the standard transaction and work to guarantee transaction validity even in the event of an error.
This concise naming and categorization has influenced many aspects of standardization and development in database systems since its creation. In short, a database should guarantee the ACID properties to ensure reliability.
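To make "validity even in the event of an error" concrete, here is a minimal sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the sample balances are assumptions made up for illustration; they do not come from any particular system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if an
        # exception escapes it, so a failed debit also undoes the credit.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical in-memory database with a consistency rule on balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

Calling `transfer(conn, "alice", "bob", 500)` returns `False` and leaves both balances untouched: the credit to `bob` is applied first, the debit from `alice` then violates the constraint, and the rollback discards the partial work.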
Big Data Properties
The database landscape is growing to address the storage needs of big data. Data volumes are exploding in the form of data sets that traditional relational databases were not designed to handle. For example, more than 400 hours of content are uploaded to YouTube every minute, in various unstructured and semi-structured data types. As the database landscape has grown, the ACID properties have had to be revisited to accommodate big data properties. The original ACID properties remain relevant, but IT teams should also take note of a different set of ACID properties tailored to big data.
Distinctive Big Data Properties:
- High scalability
- Low latency
- Semi-structured and unstructured (flexible) data formats
- High-speed data ingestion
- No single point of failure
The big data properties listed above are a far cry from the traditional characteristics of databases, which are better suited to simple transactions. Big data properties call for an update to the ACID model. The next section explains how the updated ACID properties relate to databases.
The Updated ACID Database Properties
The new ACID database properties in relation to big data properties are:
- Acquire: The primary function of any data lake is to gather data from various transactional source systems, log files, and structured and unstructured sources into a framework that can manage big data processing and storage, such as Hadoop and its file system, HDFS. Acquire the raw data before pushing it down to HDFS.
- Cleanse: Because data types can vary widely, data analysts, data scientists, tech leads, software architects, and others cannot directly use the raw data that comes into the database. To continue the earlier example, YouTube allows its users to upload data in different formats, resolutions, and sizes, including MOV, MP4 (MPEG4), AVI, WMV, FLV, 3GP, MPEGPS, and WebM. The copious amounts of raw data coming from various sources around the world must be cleansed and converted into high-quality, readable data. This data must be cleaned iteratively to integrate with YouTube’s NoSQL databases. This step means that downstream consumers do not have to worry about the quality or appropriateness of what was uploaded. If required, one can generate derived keys and make these keys consistent across all of the datasets.
- Integrate: The third property is about joining related datasets together using derived keys. Big data practice favors flattening the data into a small number of datasets with many columns, and Integrate is the key step in doing that. Because of its scalability, distributed processing, use of clusters of commodity hardware, and available tools, Hadoop is one of the most widely used platforms for big data integration projects.
- Distribute: Make the integrated datasets available for data analysts, data scientists, and business analysts to consume.
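The four steps above can be sketched as a tiny in-memory pipeline. This is only an illustration under stated assumptions: the two sample sources (`ORDERS_CSV`, `PROFILE_LOG`), the choice of a normalized email address as the derived key, and all function names are hypothetical, and a real data lake would read from and write to distributed storage rather than Python strings.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two source systems.
ORDERS_CSV = (
    "email,amount\n"
    " Ann@Example.com ,20\n"
    "bob@example.com,not-a-number\n"
    "bob@example.com,15\n"
)
PROFILE_LOG = (
    '{"email": "ann@example.com", "country": "DE"}\n'
    '{"email": "BOB@example.com", "country": "US"}\n'
)


def acquire():
    """Acquire: land raw records from each source without altering them."""
    orders = list(csv.DictReader(io.StringIO(ORDERS_CSV)))
    profiles = [json.loads(line) for line in PROFILE_LOG.splitlines()]
    return orders, profiles


def derived_key(email):
    """Derive a join key that is consistent across all datasets."""
    return email.strip().lower()


def cleanse(orders, profiles):
    """Cleanse: drop unparseable rows, normalize types, attach derived keys."""
    clean_orders = []
    for row in orders:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # discard rows whose amount is not numeric
        clean_orders.append({"key": derived_key(row["email"]), "amount": amount})
    clean_profiles = [
        {"key": derived_key(p["email"]), "country": p["country"]} for p in profiles
    ]
    return clean_orders, clean_profiles


def integrate(orders, profiles):
    """Integrate: join on the derived key into one flat, wide dataset."""
    country_by_key = {p["key"]: p["country"] for p in profiles}
    return [{**o, "country": country_by_key.get(o["key"])} for o in orders]


def distribute(rows):
    """Distribute: publish the integrated rows as JSON Lines for consumers."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)


published = distribute(integrate(*cleanse(*acquire())))
```

Keeping each step as a separate function mirrors the ordering in the list: raw data is landed untouched, cleansing produces consistent keys, and only the integrated result is published to consumers.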
In the enterprise data lake, everyone wants to get their hands on a variety of datasets. IT teams should follow the updated ACID database properties before making the data widely available. All of the other rules for discovery, access control lists (ACLs), derived datasets, and data exporting continue to apply to these distributed datasets. It is important to realize that data is no longer the sole prerogative of IT; the responsibility and actions lie equally with data scientists and analysts. How they process it further and discover meaningful insights is up to them.
To stay competitive, enterprise IT teams need to understand how the ACID properties relate to their databases. Going forward, they should prioritize the updated ACID database properties (Acquire, Cleanse, Integrate, Distribute) for big data in the same way that they have applied Atomicity, Consistency, Isolation, and Durability in traditional transactional systems.
“ACID compliance means that a database provides consistent views of changing data.” – Owen O’Malley, Software Architect for Yahoo and a significant contributor to the Hadoop project.