Big data is often characterized by the three Vs: volume, variety, and velocity. Volume refers to the terabytes and petabytes of data that need to be processed, often in unstructured or semi-structured form. In a relational database system, every row in a table has the same structure (the same number of columns, with a well-defined data type for each column, and so on). By contrast, each individual entity (row) in an unstructured or semi-structured system can be structurally very different and may therefore contain more, less, or different information than another entity in the same repository. This variety is a fundamental aspect of big data and can pose interesting management and processing challenges, which NoSQL systems can address. Yet another aspect of big data is the velocity at which the data is generated. For data capture scenarios, NoSQL systems need to be able to ingest data at very high throughput rates (for example, hundreds of thousands to millions of entities per second). Similarly, results often need to be delivered at very high throughput and with very low latency (milliseconds to a few seconds per recipient).
Unlike in relational database systems, the intrinsic value of an individual entity in a big dataset may vary widely, depending on the intended use. Take the common case of capturing web log data in files for later analysis. A sentiment analysis application aggregates information from millions or billions of individual data items in order to draw conclusions about trends and patterns in the data. An individual data item in the dataset provides very little insight, but contributes to the aggregate results. Conversely, in the case of an application that manages user profile data for ecommerce, each individual data item has a much higher value because it represents a customer (or potential customer). In a traditional relational database, by comparison, every row is typically a “high value” row. We will refer to this variability in value as the fourth V of big data.
In addition to this “four Vs” characterization of big data, there are a few implicit characteristics as well. Often, the volume of data is variable and changes in unpredictable or unexpected ways. For example, it may arrive at rates of terabytes per day during some periods and gigabytes per day during others. In order to handle this variability in volume, most NoSQL solutions provide dynamic horizontal scalability, making it possible to add more hardware to the online system to gracefully adapt to the increased demand. Traditional solutions also provide some level of scalability in response to growing demand; however, NoSQL systems can scale to significantly higher levels (10 times or more) compared to these systems.
Another characteristic of most NoSQL systems is high availability. In the vast majority of usage scenarios, big data applications must remain available and process information in spite of hardware failures, software bugs, bad data, power and/or network outages, routine maintenance, and other disruptions. Traditional systems also provide high availability; however, the massive scalability of NoSQL systems poses unique and interesting availability challenges. Unlike traditional relational database solutions, many NoSQL systems accept the possibility of data loss, relaxed transaction guarantees, and data inconsistency in order to provide availability and scalability over hundreds or thousands of nodes.
Types of Big Data Processing
Big data processing falls into two broad categories: batch (or analytical) processing and interactive (or “real-time”) processing. Batch processing of big data aims to derive aggregate value (data analytics) from data by combining terabytes or petabytes of data in interesting ways. MapReduce and Hadoop are the best-known big data batch processing technologies available today. As a crude approximation, this is similar to data warehousing, which also involves aggregating vast quantities of data in order to identify trends and patterns in the data.
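The shape of a MapReduce-style batch job can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce steps only; it does not use the Hadoop API, and the log format ("METHOD PATH STATUS"), the sample lines, and all function names are assumptions made for the example. A real job would distribute each phase across many nodes.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical web log lines in a made-up "METHOD PATH STATUS" format.
LOG_LINES = [
    "GET /home 200",
    "GET /cart 200",
    "GET /home 200",
    "POST /checkout 500",
    "GET /home 404",
]


def map_phase(line):
    """Map: emit one (key, value) pair per log line, keyed by request path."""
    _method, path, _status = line.split()
    yield (path, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single aggregate."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence within one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


hits = run_job(LOG_LINES)
print(hits)
```

No single log line says much on its own; the value emerges only from the per-path totals, which is the aggregate character of batch processing described above.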
As the term suggests, interactive big data processing is designed to serve data very quickly with minimal overhead. The most common example of interactive big data processing is managing web user profiles. Whenever an ecommerce user connects to the web application, the user profile needs to be accessed with very low latency (within a few milliseconds); otherwise, the user is likely to visit a different site. A 2010 study by Amazon.com found that every 100-millisecond increase in latency results in a 1 percent reduction in sales. Oracle NoSQL Database is a great example of a database that can handle the stringent throughput and response-time requirements of an interactive big data processing solution.