Since the invention of the transistor, the proliferation and application of computer technologies have been shaped by Moore’s Law. The growth in CPU compute capacity, high-density memory, and low-cost data storage has resulted in the invention and mass adoption of a variety of computing devices over time. These devices have become ubiquitous in our lives and provide various modes of communication, computation, and intelligent sensing. As more and more of these devices are connected to the cloud, the amount of online data they generate is growing tremendously. Until recently, businesses had no cost-effective means to store, analyze, and use this data to improve competitiveness and efficiency. In fact, the sheer volume and sparse nature of this data have necessitated the development of new technologies to store and analyze it.
Introduction to NoSQL Systems
In recent years, there has been a huge surge in the use of big data technologies to gain additional business insights and benefits. Big data is an informal term that encompasses the analysis of a variety of data from sources such as sensors, audio and video, location information, weather data, web logs, tweets, blogs, user reviews, and SMS messages, among others. This large, interactive, and rapidly growing data presents its own data management challenges. NoSQL data management refers to the broad class of data management solutions that are designed to address this space.
The idea of leveraging non-intuitive insights from big data is not new, but the work of producing these insights requires understanding and correlating interesting patterns in human behavior and aggregating the findings. Historically, such insights were largely based on the use of secret, custom-built, in-house algorithms and systems. Only a handful of enterprises were able to do this successfully, because it was very difficult to analyze the large volume of data and the various types of data sources involved.
During the first decade of the twenty-first century, techniques and algorithms for processing large amounts of data were popularized by web enterprises such as Google and Yahoo!. Because of the sheer volume of data and the need for cost-effective solutions, such systems incorporated design choices that made them diverge significantly from traditional relational databases, leading to their characterization as NoSQL systems. Though the term suggests that these systems are the antithesis of traditional row and column relational systems, NoSQL solutions borrow many concepts from contemporary relational systems as well as from earlier systems such as hierarchical and CODASYL systems. Therefore, NoSQL systems are probably better characterized as “Not only SQL” rather than “Not SQL.”
Brief Historical Perspective
It is useful to review a brief history of data management systems to understand how they have influenced modern NoSQL systems. Database systems of the early 1960s were invented to address data processing for scenarios where the amount of data was larger than the available memory on the computer. The obvious solution to this problem was to use secondary storage such as magnetic disks and tapes to store the additional data. Because access to secondary storage is typically a few hundred (or more) times slower than access to memory, early research in data processing was focused on addressing this performance disparity. Techniques such as efficient in-memory data structures, buffer management, sequential scanning, batch processing, and access methods (indices) for disk-resident data were created to improve the performance of such systems.
The issue of data modeling also posed significant challenges because each application had its own view of data. The manner in which information was organized in memory as well as on disk had a huge influence on application design and processing. In the early days, data organization and modeling were largely the responsibility of the application. As a result, any change to the way data was stored or organized forced drastic changes to applications. This was hugely inefficient, and gave the impetus to decouple data storage from applications.
Early database management systems were based on the hierarchical data model. Each entity in this model consists of a parent record and several sub-records that are associated with it, organized in a hierarchy. For example, an employee entity might have a sub-record for payroll information, another sub-record for human resource (HR) information, and so on. Modeling the data in this manner improves performance because an application needs to access only the sub-records that are required, resulting in fewer disk accesses and better memory utilization. For example, a payroll application needs to reference only the payroll sub-record (and the parent record that is the “root” of the hierarchy). Application development is also simplified because applications that manage separate sub-records can be modularized and developed independently. Figure 1 illustrates how an employee entity might be organized in the hierarchical model.
FIGURE 1. Employee entity represented in the hierarchical model of data
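The access pattern described for the hierarchical model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class, its `read` method, and the field names and values are hypothetical and do not correspond to any particular hierarchical database product. The sketch shows that a payroll application touches only the root record and the payroll sub-record.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A record in a hierarchy: its own fields plus named sub-records."""
    fields: dict
    children: dict = field(default_factory=dict)

    def read(self, *path):
        """Walk from this record down the named sub-records.

        Returns the fields of the record reached and the number of
        records visited along the way (a stand-in for disk accesses).
        """
        node, visited = self, 1  # the root record is always read
        for name in path:
            node = node.children[name]
            visited += 1
        return node.fields, visited


# Hypothetical employee entity with payroll and HR sub-records.
employee = Record(
    fields={"emp_id": 1001, "name": "A. Example"},
    children={
        "payroll": Record({"salary": 50000, "pay_period": "monthly"}),
        "hr": Record({"hire_date": "2001-04-01", "grade": "E3"}),
    },
)

# A payroll application reads the root and the payroll sub-record only;
# the HR sub-record is never touched.
payroll_fields, records_read = employee.read("payroll")
```

In this sketch, `records_read` counts two records (the root and the payroll sub-record), regardless of how many other sub-records the employee entity has.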
The CODASYL model improved upon the hierarchical data model by providing indexing and links between related sub-records, resulting in further improvements in performance and simplified application development. If we use the earlier example of modeling Employee records, the CODASYL data model allows the designer to link the records of all the dependents of an employee, as shown in Figure 2.
FIGURE 2. Employee entity and child records in the CODASYL model
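The linking of dependent records can be pictured as a chain of pointers from an owner record through its member records. The following Python sketch is a loose illustration, not CODASYL syntax: the `Rec` class, the `connect` and `members` helpers, and the sample names are all assumptions made for the example.

```python
class Rec:
    """A record that can own a chain of member records or belong to one."""

    def __init__(self, **fields):
        self.fields = fields
        self.owner = None         # member -> owner of the set it belongs to
        self.first_member = None  # owner -> first member of its set
        self.next_member = None   # member -> next member in the same set


def connect(owner, member):
    """Append member to the end of owner's chain of linked records."""
    member.owner = owner
    if owner.first_member is None:
        owner.first_member = member
        return
    node = owner.first_member
    while node.next_member is not None:
        node = node.next_member
    node.next_member = member


def members(owner):
    """Follow the links from the owner through every member, in order."""
    node = owner.first_member
    while node is not None:
        yield node
        node = node.next_member


# Hypothetical employee with two linked dependent records.
employee = Rec(emp_id=1001, name="A. Example")
connect(employee, Rec(name="Dependent One", relation="spouse"))
connect(employee, Rec(name="Dependent Two", relation="child"))

dependent_names = [d.fields["name"] for d in members(employee)]
```

An application that needs all dependents of an employee follows the links directly instead of scanning unrelated records, which is the performance benefit the model was designed to provide.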
Despite these improvements, the issue of record structure and schema design continued to be the dominant factor in application design. To add to the complexity, the data model was relatively inflexible; making a significant change to the organization of data often necessitated significant changes to the applications that used the data. In spite of these limitations, it is important to remember that these early systems provided excellent performance for data management problems of the day. The overall simplicity of the system also contributed to better stability and reliability of the software. To this day, several common database applications such as airline reservation systems and banking applications are based on these architectures, a testament to their simplicity, performance, and reliability.
Ted Codd’s seminal research on relational database theory in the early 1970s, the introduction of Structured Query Language (SQL) for data manipulation, and the subsequent work on relational database management systems revolutionized the data management industry. Relational database systems support logical relationships between data items and provide a clean separation between the data model and the application. The database system assumes the responsibility of mapping logical relationships to physical data organization. This data model independence has several important benefits, including significant acceleration of application development and maintenance, ease of physical data reorganization, and the ability to evolve and reuse the relational data repository to manage a variety of data for multiple applications. Relational data is also referred to as structured data to highlight the “row and column” organization of the data. Since the mid-1980s, the use of relational database systems has been growing exponentially; it is fair to say that present-day enterprise data management is dominated by SQL-based systems.
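The separation between the logical data model and the physical organization can be illustrated with Python’s built-in sqlite3 module. This is a minimal sketch: the `employee` table, its columns, the index name, and the sample rows are hypothetical. The same query is run before and after a physical change (adding an index), and the application code does not change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ada", "ENG"), (2, "Bob", "HR"), (3, "Cy", "ENG")],
)

# The application states only what it wants, not how to find it.
QUERY = "SELECT name FROM employee WHERE dept = ? ORDER BY name"

before = conn.execute(QUERY, ("ENG",)).fetchall()

# Physical reorganization: add an index on the dept column.
conn.execute("CREATE INDEX employee_dept_idx ON employee (dept)")

# The query text is unchanged and returns the same rows.
after = conn.execute(QUERY, ("ENG",)).fetchall()
```

Whether the database actually uses the new index is its own decision; the point of the sketch is only that the application’s query and its result are unaffected by the change.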
In addition to the advances in data modeling and application design, the last 40 years have also seen major architectural and technological innovations such as transactions, indexing, concurrency control, recovery, and high availability. Transactions embody the intuitive notion of the all-or-nothing unit of work, typically involving multiple operations on different data entities. Various indexing techniques provide fast and efficient access to specific data; concurrency control ensures proper operation when multiple operations simultaneously manipulate shared resources. Recovery and high availability ensure that the system is resilient to a variety of failures. These technologies have been adapted and used in a variety of ways in modern NoSQL solutions.
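The all-or-nothing behavior of a transaction can be demonstrated with Python’s built-in sqlite3 module. The sketch below rests on assumptions made for illustration: the `account` table, the `transfer` and `balances` helpers, and the amounts are hypothetical. A transfer updates two rows; if the second update fails, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one unit of work.

    Returns True if both updates were committed, False if the
    transaction was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src has insufficient funds.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


ok = transfer(conn, 1, 2, 30)       # both updates succeed and commit
after_ok = balances(conn)
failed = transfer(conn, 1, 2, 500)  # debit fails, so the credit is undone too
after_failed = balances(conn)
```

The credit is deliberately applied before the debit so that the failing case leaves a partial change in flight; the rollback removes it, and the balances after the failed transfer are identical to those before it.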
Modern NoSQL systems were developed in the early 2000s in response to demands for processing the vast amounts of data produced by increasing Internet usage and mobile and geo-location technologies. Traditional solutions were too expensive, did not scale, or required too much time to process data. Out of necessity, companies such as Google and Yahoo! were forced to invent solutions that could address big data processing challenges. These modern NoSQL systems borrowed from earlier solutions but made significant advances in horizontal scalability and the efficient processing of diverse types of data such as text, audio, video, image, and geo-location.