Oracle NoSQL Database leverages the features and functionality of Oracle Berkeley DB Java Edition. We begin with a high-level description of the features and characteristics of the system and then explore some of the topics in more detail in the following blog.
Oracle NoSQL Database is a shared-nothing system designed to run and scale on commodity hardware. Key-value pairs are hash partitioned across server groups known as shards. At any point in time, a single key-value pair is always associated with a unique shard in the system. The major key (described shortly) of the key-value pair is hashed in order to determine which shard the record will belong to. Most Oracle NoSQL Database deployments use multiple machines (also referred to as nodes) per shard; each shard is configured as a highly available system using Berkeley DB Java Edition’s high availability feature. The recommended configuration requires a minimum of three machines per shard; this is called the replication factor for the configuration. Depending on application requirements, a replication factor greater or less than 3 might be more appropriate. For example, a highly available 10-shard system with a replication factor of 3 would be deployed on 30 nodes. Of course, other configurations are possible in practice.
At the API level, Oracle NoSQL Database provides a key-value paradigm that is similar to Berkeley DB’s key-value API. Oracle NoSQL Database supports the notion of a primary key (called major key) of a key-value pair, which is used to determine which shard the key-value pair should belong to. Because Oracle NoSQL Database is a shared-nothing, sharded, key-value client-server system, there are some key differences between the features offered in Berkeley DB Java Edition and Oracle NoSQL Database.
Oracle NoSQL Database supports the notion of minor keys. The combination of major and minor keys can be used to identify and address specific portions of the information in a key-value pair record. Minor keys are optional, but provide a significant convenience to the application developer. The combination of major and minor keys serves as the fully qualified unique primary key. Oracle NoSQL Database provides APIs for accessing all the contents of a specific key-value pair record as well as APIs for accessing parts of the record identified by a major and minor key combination. For example, a record in Oracle NoSQL Database might contain textual information about a user as well as the user’s photo (image). The textual information and the image would each have a minor key; of course, the major key would be the person’s identifier key. The value associated with a minor key can be retrieved and updated without having to access or modify other content in a key-value pair. Thus, the notion of minor keys not only provides a significant convenience to the application developer, but also results in performance improvements.
Oracle NoSQL Database leverages Berkeley DB Java Edition’s ACID transaction capabilities in order to provide transactional semantics for data access. The notion of Berkeley DB Java Edition transactions is more general and can support multiple operations on multiple records within a single transaction. Unlike Berkeley DB, an operation in Oracle NoSQL Database can only affect the contents of a single major key. Further, Oracle NoSQL Database operations are single-API call transactions (except for scanning the contents of the entire database) where each API request from the client to the server is an atomic unit of work. Within the context of a single major key, a client request might modify the contents of some minor keys, delete others, and add some new minor keys (and values); all these activities are executed as a single transaction. Robust ACID transaction support is one of the key distinguishing features of Oracle NoSQL Database.
Oracle NoSQL Database leverages the high availability features in Berkeley DB in order to provide resiliency, fault tolerance, and read scalability. In the event of a node failure, Oracle NoSQL Database manages elections automatically and transparently to the application. Other than a momentary delay while the election is in progress, the application is not affected by node failures. Further, Oracle NoSQL Database automatically optimizes the placement of masters and replicas on the hardware servers in order to ensure the best performance of the system.
Berkeley DB provides APIs for administering and monitoring the database. Maintenance activities such as backups and log archiving can be initiated by the application by invoking the appropriate APIs. The application can also monitor resource usage, performance, and other metrics of the system using the provided APIs. Administering and monitoring a distributed system such as Oracle NoSQL Database is significantly more complex than managing a Berkeley DB application. Oracle NoSQL Database provides an administration console, a command-line interface and APIs for managing and monitoring all components of the system. This provides a tremendous convenience to the system administrator running a production Oracle NoSQL Database application. Besides support for activities such as backups and troubleshooting, it is possible to configure and alter the topology of the system, add more shards to the cluster in response to increased demand and data volumes, identify hotspot nodes, and redistribute the data as needed in order to maintain optimal performance. These simple-to-use but powerful administration capabilities are critical for smooth operation of a large production Oracle NoSQL Database deployment.
Let us now look at some of the features of Oracle NoSQL Database in more detail.
Oracle NoSQL Database System Architectures
Database systems are generally categorized as shared memory (database system runs in a single, shared address space), shared-disk (the database system runs in multiple processes and multiple address spaces on different computers with access to shared storage—Oracle pioneered the concept of shared-disk database systems in the 1980s), or shared-nothing systems (the database system runs in multiple processes and multiple address spaces on multiple machines without any shared storage; database system processes communicate with each other using network messages). Tandem Non-Stop SQL pioneered the concept of scalable shared-nothing database systems in the mid-1980s. A shared-nothing system partitions the data into disjoint subsets (called shards), each shard managed by a node (along with replicas for providing high availability).
Berkeley DB is a shared memory database system; this means that a Berkeley DB application is generally constrained to run on a single computer. Though the high availability and replication features do support running Berkeley DB on multiple computers, this is still considered to be a shared memory architecture since the total amount of data managed by Berkeley DB is constrained by the capacity of one computer. On the other hand, Oracle NoSQL Database is a shared-nothing system. Oracle Real Application Clusters is an example of a shared-disk architecture. Figure 1 illustrates the differences between the three types of database system architectures.
FIGURE 1. Shared memory, shared-disk, and shared-nothing database architectures
Shared Memory, Shared-Disk, and Shared-Nothing Systems
A shared memory system’s performance is limited by the hardware (memory, processor, disk). To get additional performance, you need to get a bigger system. A shared-disk system needs to synchronize and coordinate access to shared data. The performance of such a system is limited by the performance of low-level synchronization primitives. A shared-nothing system’s performance is constrained by the limits of the messaging infrastructure. One of the keys to getting scalable performance from shared-nothing systems is to reduce the messaging overhead to the minimum possible. Besides the obvious solution of using a faster network, NoSQL systems optimize messaging requirements by limiting the kinds of operations that can be performed by the application. As we shall see, this is a key aspect of all big data systems.
Unlike Berkeley DB Java Edition, which is an embeddable database library, Oracle NoSQL Database is a client-server system. Figure 2 illustrates the architecture of the system in a typical deployment; in this example, there are two client nodes and six server nodes, configured as two shards, for managing the data. Each shard is a highly available cluster of three nodes and manages a subset of the data. The system is designed to be dynamically scalable in the number of clients as well as the number of shards. In short, a highly available shard is the building block for implementing a highly scalable, shared-nothing system. Similarly, a node is the building block for implementing a highly available shard.
FIGURE 2-4. Typical Oracle NoSQL Database deployment architecture
Partitioning and Sharding
The primary goal of shared-nothing systems is to achieve horizontal scalability by using additional compute and storage hardware in order to keep pace with growing demands for data capacity and data retrieval. This “army of ants” approach requires that the storage and processing be partitioned across the individual nodes. In contrast, data and processing are managed by a single compute and storage resource in a shared memory system; similarly, in a shared-disk system, the data are shared, but processing is partitioned across multiple compute nodes.
A shared-nothing system is composed of shards; each shard in the cluster manages a distinct and disjoint dataset. Each shard is termed the owner of the subset of data that it manages. There are various alternatives to partitioning the data across shards. The partitioning algorithm uses a key (specified by the user) in order to partition the data. The most popular technique is to hash-partition the data, with the intent of distributing data evenly across shards. Hash partitioning works well when user requests are limited to accessing a single entity in the system and there is no relationship between consecutive data requests. Another technique is to range partition the record key. Range partitioning is advantageous when multiple related records (records with adjacent keys) are accessed by the application. Range partitioning is vulnerable to skewed distribution of data and “hotspots” in data access if a disproportionate number of requests have to be satisfied by the same shard. Thus, range partitioning is not widely used in practice.
A request for a specific record is routed to the owning shard. The owner processes the request and returns the answers to the client. The key to achieving scalability is to eliminate any single point of control and minimize the network messages that are required to satisfy a request. This architectural requirement has a significant impact on the features and functionality that a shared-nothing system can offer. In particular, the system is designed so that the vast majority of network messages are between the Oracle NoSQL Database client driver and a specific node in a shard, rather than messages between shards (intra-shard messages). Further, the system minimizes data processing in the client driver and pushes as much processing as possible onto the shard so that messages between a client and server are mostly simple request-response kinds of messages. Any new state information from the node serving the request is also included (piggybacked) on the response message to the client driver. The client driver is primarily responsible for maintaining information about data distribution and topology of the system so it can route requests intelligently to the appropriate node in a shard. This separation of responsibility between the client driver and the nodes in the shard results in minimal messaging overhead and a highly scalable architecture.
Most often, the amount of data managed by a system keeps growing over time. Oracle NoSQL Database supports the ability to add new hardware resources as data and processing demands grow. When one or more new shards are added, data on existing shards must be repartitioned across all the shards in the cluster. The simplest approach is to disable user requests temporarily, redistribute the data, and resume normal operation. However, this is unacceptable in practice, because the temporary outage for adding new capacity can last for a significant period of time (several hours or days), depending on the volume of data to be repartitioned. Most systems, including Oracle NoSQL Database, redistribute data online and dynamically, without compromising availability of the system. This is not a trivial exercise; the data movement needs to be correct and atomic (all or nothing), the client drivers need to be updated to reflect the new distribution, and the repartitioning needs to be done in a way that takes maximum advantage of the newly added resources. Careful implementation of data migration within a highly available system is a key metric of a product’s maturity. Maximizing throughput, minimizing the impact on user queries, allowing for operation failure and restart, and updating the system with the new topology are all key functions and design considerations in Oracle NoSQL Database.
If the data and/or processing requirements decrease, then it is possible to free up some of the resources, thus reducing the amount of hardware needed by the system. When the usage of the system is cyclical over time, or follows a predictable pattern (for example, dramatic spike in processing demand for ecommerce systems during the holiday season), there is a temporary need to grow the number of shards. At the end of the peak demand period, the number of shards can be reduced to handle “steady-state” requirements. Data redistribution required to shrink the cluster is also an online activity.
We will use the term dynamic elasticity to refer to the ability of the system to add and remove resources dynamically in response to changes in data and processing requirements. Dynamic elasticity is a key characteristic of shared-nothing systems that makes them very attractive for big data applications. Oracle NoSQL Database provides administrative tools to support dynamic elasticity.
High availability is another key characteristic of Oracle NoSQL Database. Processors, memory, storage, software, and networks can fail in unpredictable ways. As the number of such components in a system increases, the probability of failure of some components increases dramatically. For example, the failure rate of an individual disk may be one failure during a period of two years. Statistically, a shared-nothing system with 1,000 disks will experience at least one disk failure every day! If you also take processors, memory, software, and network components into account, the frequency of a failure is even higher! A distributed system needs to be designed to handle these failures without impact to the application.
Availability is achieved by adding redundancy to the system. In NoSQL systems, redundancy is commonly achieved by maintaining multiple copies of the data on multiple nodes. Each shard comprises two or more nodes (called replicas) that have identical copies of the data. As changes are made to the data on one node, they are propagated to the other replicas to keep them current. Monitoring tools are used to detect and repair failures. Should one of the nodes fail, the system automatically detects and handles the change in the membership of a shard without any noticeable impact to the application. It is not trivial to determine whether a particular node is currently a member of a shard since a node might fail or there may be a temporary network outage that makes the node temporarily unreachable.
There are two alternative approaches to handling data updates in a highly available shard. One approach is to designate one of the replicas as the master node; a master can serve update requests as well as read requests, while all other nodes can only serve read requests. This architecture is called single-master. Note that it is possible to have passive, standby replicas, but this is not common in practice. Another approach is to allow updates at any node of the shard and then propagate those changes to the other replicas. This architecture is called multi-master.
The advantage of a single-master architecture is that there cannot be concurrent changes to the same record on multiple replicas; the master always has the most current value of any record in the shard. This property of single-master architectures simplifies the job of the application developer because there is no possibility of lost updates, or conflicting, concurrent changes to the same record. A single-master system needs to have a mechanism to elect a new master from one of the surviving replicas in the shard, should the current master fail. Master re-election uses a distributed quorum-based algorithm to unambiguously choose a new master. This is why most highly available systems have three (or an odd number of) replicas; this ensures that it is possible to gather a majority of votes to correctly determine the outcome of a master election. Electing a new master is typically a very quick process, lasting no more than a second or two; during this period, update activity to the shard is temporarily suspended. Figure 3 illustrates the process of master re-election in a shard with a master and two replicas.
FIGURE 3. Electing a new master
The advantage of a multi-master system is that any node can handle application requests to change a record. In fact, it is possible that the same record is changed concurrently on two different replicas. If a node fails, a request can simply be routed to another surviving node without any pause in update activity (there is no need for master election). As in the single-master case, changes to records are constantly propagated to the other replicas in the shard. Resolving conflicting changes to the same record provides interesting challenges in a multi-master system. Because the changes occur on separate and distinct machines, it is not easy to determine the timing and sequence of the conflicting changes. In some cases, the system can resolve the change on its own. Most often, however, conflicts are detected when a record is retrieved, and conflict resolution is left up to the application (or even the end user, in some cases). Update operations (even concurrent updates to the same record on different replicas) proceed normally. For read requests, the application typically requests the same record from multiple replicas of the shard. If the versions (along with the timestamp of the latest change) returned by different replicas are not identical, then it is necessary to determine which is the most current version using timestamps, application-specific semantics, and knowledge of the data. Oracle NoSQL Database is a single-master per shard system.
A discussion about distributed systems such as Oracle NoSQL Database would not be complete without a mention of eventual consistency. A distributed system maintains copies of data on multiple machines in order to provide high availability and scalability. When an application makes a change to a data item on one machine, that change has to be propagated to the other replicas. Because the change propagation is not instantaneous, there’s an interval of time during which some of the copies will have the most recent change, but others won’t. In other words, the copies will be mutually inconsistent. However, the change will eventually be propagated to all the copies. In a single-master system, if an application makes a change to a record, that request will be handled by the master node. As soon as the update request completes, if the application retrieves the same record (same major key), it is possible that the request will be routed to one of the replicas in the shard. If the master has not yet propagated the changes to that replica, the application will see the older version of the data. However, if the application requests the data after the changes on the master have been propagated to the replica, then the application will see the latest version of the record. Depending on the relative timing of the read request, the application might see different values!
Thus, the notion of eventual consistency is simply an acknowledgment that there is an unbounded delay in propagating a change made on one machine to all the other copies. Eventual consistency is not relevant in centralized (single-copy) systems because there is no need for change propagation.
Various distributed systems address consistency in different ways because there is a trade-off between operation latency, availability, and consistency. In some systems, the machine where the change originates will simply send asynchronous (and possibly unreliable) messages to the other machines and declare the operation as successful. This is fast, but at the cost of potential data loss if the originating machine fails before the replica(s) have received the update. Other systems send synchronous (blocking) messages to all other machines, receive acknowledgments, and only then, declare the operation as successful. These systems favor consistency and availability at the cost of performance. Finally, a system might implement some variant of these two extremes (for example, wait for acknowledgments from a majority of the replicas).
Oracle NoSQL Database allows the application designer to choose the consistency level required, on a per-operation basis; of course, there is a default setting of consistency as well. The developer can either choose to use the default semantics of consistency or specify consistency on a per-operation level for critical operations. Per-operation choice of consistency is the most flexible and the most application-friendly option because the application designer has a clear understanding and control on the performance as well as the consistency guarantees without additional complexity in the application program.
Oracle NoSQL Database offers several choices for read consistency. The application can specify absolute consistency if it needs the most recent version; in this case, the client driver will route the request to the master node of the shard. The application developer can also specify time-based or transaction ID–based consistency for read operations. For example, an application might be willing to tolerate reading data that is no more than one second out-of-date with respect to the most recent update. Transaction ID-based consistency is useful in scenarios where the application modifies a record at a certain point in time and wants to ensure that a subsequent read operation will read a version of that same record that is at least as current as the change it made to that record (it is okay to read a more recent version). The client driver keeps track of the change propagation between each master and its replicas, so it is able to route the request to the replica that can satisfy such a request. Finally, the application can also specify that it doesn’t care how consistent the data are for a particular read request. The Oracle NoSQL Database client driver is free to route the request to any of the replicas of the shard.
Thus, depending on the kind of read consistency required, the client driver will route the request to the most appropriate replica of the shard. This also serves to distribute the workload across the various nodes of the shard, thus achieving better system utilization and improved performance.
Durability - Making Changes Permanent
Generally, a database system ensures that a change is made permanent (durable) by writing the updated version to stable storage. Making a change durable means that the change survives processor and memory failure. However, I/O is very expensive compared to memory access. As a first approximation, I/O is 1,000 times slower (millisecond latency) than accessing memory (microsecond latency). Over the years, relational database system designers have invented several optimizations such as write-ahead logging in order to alleviate the cost of I/O for providing durability.
Some demanding applications require more performance than what is achievable in a cost-effective manner using the traditional optimizations such as write-ahead logging. Quite often, these applications are willing to relax the durability guarantees in order to achieve better performance. Some database systems (and NoSQL systems, in particular) have implemented a variety of relaxed durability guarantees in order to meet the needs of such applications.
For example, some systems buffer changes in memory and only propagate changes to disk periodically. A system might choose to write the contents of the buffer to disk every 5 seconds. Clearly, this strategy alleviates the I/O overhead significantly, resulting in dramatic performance benefits. However, if there is a failure (memory loss), the most recent set of changes will be lost. Other systems might choose to issue the I/O to operating system buffers and declare the change to be durable before the operating system writes the buffered data to disk. In this case, a process failure (but not operating system failure) will not affect the durability of the changes; however, an operating system failure will result in data loss of the most recent changes. Of course, the most stringent (and most expensive) method to ensure durability is to issue the I/O and then wait for the write operation to complete. It is also possible to write multiple disks (usually, this is done by the operating system or storage subsystem) to ensure that the changes can also survive a disk failure.
A distributed system such as Oracle NoSQL Database can take advantage of the multiple replicas to ensure durability. Because the goal of durability is to protect against processor, memory, and operating system failures, distributed systems leverage the fact that an update can be made durable by propagating the change to one or more replicas concurrently while writing the change to the local disk. The system can declare an operation to be durable after receiving acknowledgments for the update from the replicas, without waiting for the disk I/O to complete because the replicas have received the update (it is durable on another node). Depending on the speed of the network, the message delivery and receipt may be faster than the time it takes to complete a local write (I/O) operation.
Oracle NoSQL Database supports the notion of varying degrees of durability for update operations and exposes these options through the API so that the application designer can make the appropriate trade-offs between performance and durability on a per-operation basis. Three independent dimensions of durability are supported and the application developer can choose the option that best suits the requirements of the application. In the case of the master node, the application designer can choose whether the change should be considered durable when it is written to the log buffer, when it is written to the file system buffers, or when it is written to disk. The application designer can also choose whether the change should be propagated to the replicas asynchronously or synchronously (with acknowledgment). Finally, when the change has been propagated to the replicas, the application can also choose whether the change is considered durable when it is written to the log buffer, when it is written to the file system buffers, or when it is written to disk on the replicas. Thus, the developer has complete control over the degree of durability and required performance for each operation. For example, the choice of “write to local disk, wait for acknowledgments from all replicas, write to replica disk” is the most stringent option an application can choose.
Figure 4 illustrates the durability and consistency options that are available in Oracle NoSQL Database. Oracle NoSQL Database allows the user to choose the durability policy on a per-operation basis. Oracle NoSQL Database uses this information during transaction commit processing in order to achieve the best performance while honoring the durability requirements of the operation.
FIGURE 4. Configurable durability and consistency policies
Atomicity, consistency, isolation, and durability (ACID) are the key characteristics provided by transactions. Oracle NoSQL Database leverages the transaction capabilities of the underlying Berkeley DB storage engine. Berkeley DB supports row-level locking and two-phase locking to ensure that the effects of one transaction are isolated from other, concurrent transactions. We’ve already discussed the semantics of consistency and durability. In the rest of this section, we discuss the property of atomicity.
Transactional access to data is a critical requirement in many Oracle NoSQL Database applications. Transactions provide atomicity (“all or nothing” semantics) to ensure that either all or none of the changes in a transaction are made durable. Consider an Oracle NoSQL Database application that stores the list of items that the user intends to purchase (popularly referred to as the shopping cart) during a particular shopping session. Most often, the shipping costs depend on when the user expects the items to be delivered. For example, overnight delivery is more expensive than delivery within 8 business days. If the user changes the delivery dates for some items during the session, then it is important that the total cost of the transaction be updated to reflect the changes in delivery costs. Thus, the changes to the delivery date for each item and the shipping total cost of the purchase (including shipping costs) need to be updated atomically. Oracle NoSQL Database supports atomicity for all changes performed on various contents of the same major key, as long as all those changes are specified in a single request to the server.
Data modeling is a critical aspect of proper application design for Oracle NoSQL Database applications. The data model is very flexible and enables the application designer to model a wide variety of data structures, without compromising efficiency of storage or data access. Let us examine these capabilities in more detail below.
Major Keys, Minor Keys, and Values
Oracle NoSQL Database provides a key-value paradigm to the application developer. Every entity (record) is a set of key-value pairs. A key has multiple components, specified as an ordered list. The major key identifies the entity and consists of the leading components of the key. The subsequent components are called minor keys. This organization is similar to a directory path specification in a file system (for example, /Major/minor1/minor2/). The “value” part of the key-value pair is simply an uninterpreted string of bytes of arbitrary length.
This concept is best explained using an example. Consider storing information about a person, John Smith, who works at Oracle Headquarters, start date January 1, 2012, and has a telephone number +1-650-555-9999. The employee ID might be a logical choice for the major key for the person entity (for example, 123456789). In addition, the “person” entity might contain personal information (such as the person’s telephone number) and employment information (such as work location and hire date). The application designer can associate a minor key (for example, personal_info) with the personal information (+1-650-555-9999) and another minor key (for example, employment_info) with the employment information (Oracle Headquarters, start date January 1, 2012). Specifying the major key “123456789” would return “John Smith.” Specifying “/123456789/personal_info” as the key would access John Smith’s personal information; similarly, “/123456789/employment_ info” would be the key to access the employment information. Leading components of the key are always required. Oracle NoSQL Database internally stores these as separate key-value pair records; one for the user_id, a second for
user_id/personal_ info, and the third for
The API for manipulating key-value pairs is simple. The user can insert a single key-value pair into the database using a put() operation. Given a key, the user can retrieve the key-value pair using a
get() operation or delete it using a
delete() operation. The
get(), put(), and delete() operations operate on only a single (multi-component) key. Oracle NoSQL Database provides additional APIs that allow the application to operate on multiple key-value pairs within an entity (same major key) in a single transaction.
The major key determines which shard the record will belong to. All key-value pairs associated with the same entity (same major key) are always stored on the same shard. This implementation enables efficient, single-shard access to logically related subsets of the record. Figure 5 illustrates the concept of major and minor keys. Note that minor keys can be nested.
FIGURE 5. Major and minor keys
Oracle NoSQL Database also provides an unordered scan API that can be used to iterate over all the records in the database; unordered scans do not have transaction semantics, although only committed data will be returned to the application.
Large Object Support
An Oracle NoSQL database is often used to store large objects such as images, audio, videos, and maps. In the vast majority of usage scenarios, once such content is stored in the database, it is either retrieved or deleted, but never updated. For example, an audio or video streaming service might store vast amounts of such media content and then serve it up on demand.
Oracle NoSQL Database provides efficient support for managing large objects in the database and a streaming API for easy access to the information. A large object is stored internally as a sequence of object fragments (or chunks). Because each object fragment is much smaller than the entire object, this design is much more efficient in terms of memory requirements in the user application as well as the server. Further, the streaming API ensures that the fragments can be fetched efficiently from the containing shards.
Oracle NoSQL Database manages key-value pair data; the key and value can be arbitrary byte strings that are interpreted only by the application. Minor keys are a great convenience for representing the structure of the record. These capabilities provide a lot of flexibility in terms of evolving and changing the structure of content stored in Oracle NoSQL Database. However, the interpretation of the contents of a record is left entirely up to the application; the contents (value portion of the key-value pair) are represented as byte-arrays, which can make it difficult to share the data between multiple applications.
Oracle NoSQL Database also supports JSON schemas and Apache Avro for specifying the structure of the value in a key-value pair. JSON schemas are self-describing, support schema evolution, and are widely used in big data applications. Apache Avro is an extremely space-efficient serialization format for JSON schemas; thus, the use of JSON schemas and Avro serialization enables ease of application design and data exchange between various applications and systems. For example, JSON schemas enable easy sharing of data between Oracle NoSQL Database applications and MapReduce (Hadoop). Very often, big data applications use a variety of tools and technologies such as key-value stores, map-reduce processing, relational databases, and analytics in order to derive new insights; the easy and efficient exchange of data from one system to another is critical in such scenarios.
Oracle NoSQL Database Performance
Oracle NoSQL Database has been designed for applications that need fast, predictable, low latency access to vast amounts of data. Let us examine how Oracle NoSQL Database benefits such applications by considering a typical ecommerce environment. Such systems manage vast numbers of user profiles and have stringent response-time requirements. Whenever a user visits the site, the retailer provides a personalized web page based on the user’s profile. If no such profile exists, the site must create one. These user profiles will change over time as the retailer learns more about the users. Different user profiles may contain radically different information and the retailer may decide to collect new information at any time. Oracle NoSQL Database addresses this use case by virtue of its flexible key-value paradigm and scales to meet increasing customer demand. Oracle NoSQL Database has been optimized extensively to provide excellent scalable throughput and low latency. As of this writing, Oracle NoSQL Database has been benchmarked at over 1.2 million operations per second with an average latency of 1 millisecond for the 95 percent reads and 5 percent updates workload in the Yahoo! Cloud Serving Benchmark test suite. This test was performed on a 15-machine cluster running 10 shards with over 2 terabytes of data. To put these numbers in perspective, credit card fraud scoring applications typically require a throughput of less than 10,000 operations per second. Thus, Oracle NoSQL Database delivers performance that is more than sufficient to meet the requirements of the most demanding applications.
Oracle NoSQL Database Administration
A distributed system is composed of large numbers of hardware and software components. This necessitates a comprehensive and easy-to-use monitoring and administration tool to manage the system.
Oracle NoSQL Database includes administration utilities to manage operational tasks such as configuring the system, defining the topology of the system configuration, as well as adding new resources as needed. It also includes monitoring tools to track the health of the overall system as well as individual components, detect performance issues and hotspots, and dynamically redistribute the work as needed. These monitoring capabilities are invaluable for ensuring that the system continues to operate smoothly in spite of component and software failures.
Oracle NoSQL Database also provides JMX (Java Management Extensions) and SNMP (Simple Network Management Protocol) APIs for programmatic monitoring of the system. This makes it easy to integrate Oracle NoSQL Database with other monitoring and administration tools that might already be in use. This is a huge convenience to system administrators because it allows them to minimize the number of separate tools that might be required in order to ensure smooth operation of a production system.
Integration with Other Products
Most big data applications use multiple technologies including Oracle NoSQL Database in order to derive value from big data. For example, an ecommerce site might use Oracle NoSQL Database for the customer-facing application, a relational database repository to store master data, data warehousing and business intelligence tools for tracking key business parameters, and a MapReduce system to process and analyze unstructured information. It is crucial that the components of a big data application be well integrated so as to simplify the task of the application designer as well as the system administrator.
Oracle NoSQL Database integrates well with these related technologies and tools. The external tables capability allows the developer to query data stored in Oracle NoSQL Database from Oracle Database using SQL. SQL is arguably the most popular programming language today; being able to query Oracle NoSQL Database data using SQL is a tremendous benefit to many developers. This also provides a huge benefit for applications that need to reference key-value data along with relational data. For example, this is very useful in data warehousing applications that need to have a unified view of all data.
Oracle NoSQL Database also integrates with MapReduce technologies. MapReduce typically reads input data from a file system (most commonly, HDFS). Oracle NoSQL Database provides an interface that allows the mapper in a MapReduce job to read data directly from Oracle NoSQL Database. Because MapReduce is designed to process all semi-structured and unstructured data, this capability is very important in big data applications.
Oracle NoSQL Database is integrated with Oracle Event Processing. Data stored in Oracle NoSQL Database can be referenced by the event processing engine in interactive time in order to provide real-time alerts and notifications for meaningful events. For example, the event processing engine might be used for real-time monitoring and trading of stocks. If a user is “watching” a particular stock, then the event processing engine can look up the user’s parameters (buy, sell thresholds) and alert the user immediately when the specified conditions are satisfied.
Oracle Coherence is an in-memory data grid solution that enables organizations to predictably scale mission-critical applications by providing fast access to frequently used data. Oracle NoSQL Database is integrated with Oracle Coherence; Coherence is used to cache the most frequently accessed data in memory, while NoSQL Database provides a scalable persistent repository for vast amounts of data stored on disk.
Oracle NoSQL Database is also integrated with Oracle RDF Graph; this makes it easy to discover relationships between key-value pair records stored in the Oracle NoSQL Database. The most obvious example of this capability is social networking to discover new friends and contacts, as popularized by Facebook and LinkedIn. There are many other scenarios such as fraud detection and security where graph traversal capabilities for big data are important as well.
Oracle NoSQL Database Licensing
Oracle NoSQL Database is distributed as an open source version as well as an enterprise version. The Community Edition is available under the open source AGPLv3 license and is intended for use in open source applications. Oracle NoSQL Database Enterprise Edition is available under a commercial license and is intended for proprietary applications.
Both versions of the product provide the same basic capabilities that are needed to manage large amounts of key-value data. The Enterprise Edition also offers tighter integration between Oracle NoSQL Database and other related Oracle products such as RDF Graph, Oracle Event Processing, Oracle Coherence, and Oracle Database. These additional capabilities make it easy to use Oracle NoSQL Database within a larger data management ecosystem that may include semantic data and streaming event data as well as relational data. Most big data applications use a combination of products and technologies in order to derive new insights and business value from multiple data sources. Oracle NoSQL Database Enterprise Edition is an excellent choice for a scalable key-value store in such deployments.
In this article, we discussed Oracle Berkeley DB, which is the foundational building block for Oracle NoSQL Database. We discussed the three types of database architectures and then examined some of the characteristics of Oracle NoSQL Database like partitioning and sharding, availability, consistency and durability options, support for transactions and data modeling. We discussed the need for integrating Oracle NoSQL Database with related technologies such as MapReduce, Oracle Database, Oracle Event Processing, and Oracle Coherence. All these features and capabilities make Oracle NoSQL Database a compelling solution for today’s big data processing needs.