Our economy is changing. Companies that want to remain competitive need to find new ways to attract and retain their customers. To do this, the technology and people who create it must support these efforts quickly and in a cost-effective way. New thoughts about how to implement solutions are moving away from traditional methods toward processes, procedures, and technologies that at times seem bleeding-edge.
The following case studies demonstrate how business problems have successfully been solved faster, cheaper, and more effectively by thinking outside the box. Table 1 summarizes five case studies where NoSQL solutions were used to solve particular business problems. It presents the problems, the business drivers, and the ultimate findings. As you view subsequent sections, you’ll begin to see a common theme emerge: some business problems require new thinking and technology to provide the best solution.
Table 1. The key case studies associated with the NoSQL movement: the name of the case study/standard, the business drivers, and the results (findings) of the selected solutions
Case study/standard | Driver | Finding
Google’s Bigtable | Need to flexibly store tabular data in a distributed system. | By using a sparse matrix approach, users can think of all data as being stored in a single table with billions of rows and millions of columns without the need for up-front data modeling.
Amazon’s Dynamo | Need to accept a web order 24 hours a day, 7 days a week. | A key-value store with a simple interface can be replicated even when there are large volumes of data to be processed.
MarkLogic | Need to query large collections of XML documents stored on commodity hardware using standard query languages. | By distributing queries to commodity servers that contain indexes of XML documents, each server can be responsible for processing data in its own local disk and returning the results to a query server.
Case study: LiveJournal’s Memcache
Engineers working on the blogging system LiveJournal started to look at how their systems were using their most precious resource: the RAM in each web server. LiveJournal had a problem. Their website was so popular that the number of visitors using the site continued to increase on a daily basis. The only way they could keep up with demand was to continue to add more web servers, each with its own separate RAM.
To improve performance, the LiveJournal engineers found ways to keep the results of the most frequently used database queries in RAM, avoiding the expense of rerunning the same SQL queries on their database. But each web server had its own copy of the query results in RAM; there was no way for any web server to know that the server next to it in the rack already had a copy of the same results sitting in RAM.
So the engineers at LiveJournal created a simple way to create a distinct “signature” of every SQL query. This signature, or hash, was a short string that represented a SQL SELECT statement. By sending a small message between web servers, any web server could ask the other servers if they had a copy of the SQL result already executed. If one did, it would return the results of the query and avoid an expensive round trip to the already overwhelmed SQL database. They called their new system Memcache because it managed a memory cache held in RAM.
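The signature-and-lookup idea can be sketched in a few lines of Python. This is a minimal illustration, not LiveJournal's code or the memcached API: the names query_signature, QueryCache, and run_query are hypothetical, SHA-256 is an assumed choice of hash, and a single in-process dictionary stands in for the RAM shared across web servers.

```python
import hashlib


def query_signature(sql):
    """Return a short, stable string that identifies a SQL SELECT statement."""
    # Assumption: stripping surrounding whitespace is the only normalization.
    return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()


class QueryCache:
    """Keeps query results in memory, keyed by query signature."""

    def __init__(self, run_query):
        self._run_query = run_query  # hypothetical callable that hits the database
        self._results = {}           # stands in for RAM shared by the web servers
        self.db_calls = 0            # counts the expensive database round trips

    def fetch(self, sql):
        key = query_signature(sql)
        if key in self._results:
            # Cache hit: reuse the stored result and skip the database.
            return self._results[key]
        # Cache miss: run the query once and remember the result.
        self.db_calls += 1
        result = self._run_query(sql)
        self._results[key] = result
        return result
```

Calling fetch twice with the same SELECT statement runs the database query only once; the second call is answered from memory.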
Many other software engineers had come across this problem in the past. The concept of large pools of shared-memory servers wasn’t new. What was different this time was that the engineers for LiveJournal went one step further. They not only made this system work (and work well), they shared their software using an open source license, and they also standardized the communications protocol used by the web front ends (called the memcached protocol). Now anyone who wanted to keep their database from getting overwhelmed with repetitive queries could use their front-end tools.
Case study: Google’s MapReduce - use commodity hardware to create search indexes
One of the most influential case studies in the NoSQL movement is the Google MapReduce system. In a paper describing the system, Google shared their process for transforming large volumes of web content into search indexes using low-cost commodity CPUs.
Though sharing of this information was significant, the concepts of map and reduce weren’t new. Map and reduce functions are simply names for two stages of a data transformation, as described in figure 1.
The initial stages of the transformation are called the map operation. They’re responsible for extracting, transforming, and filtering data. The results of the map operation are then sent to a second layer: the reduce function. The reduce function is where the results are sorted, combined, and summarized to produce the final result.
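The two stages can be illustrated with a small word-count sketch in Python. It is a toy, single-process example under stated assumptions, not Google's implementation: the function names map_phase, reduce_phase, and word_count are hypothetical, and the "filtering" step simply drops tokens that aren't purely alphabetic.

```python
from collections import defaultdict


def map_phase(document):
    """Extract words, normalize them, filter out non-words, emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split() if word.isalpha()]


def reduce_phase(pairs):
    """Group the pairs by key, sort the keys, and sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in sorted(grouped.items())}


def word_count(documents):
    mapped = []
    for document in documents:
        # Each map_phase call depends only on its own document, which is
        # what would let the calls run on separate servers.
        mapped.extend(map_phase(document))
    return reduce_phase(mapped)
```

Because map_phase never looks at any document other than its own, the map work can be split across machines and only the emitted pairs need to travel to the reduce step.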
The core concepts behind the map and reduce functions are based on solid computer science work that dates back to the 1950s, when programmers at MIT implemented these functions in the influential LISP system. LISP was different from other programming languages because it emphasized functions that transformed isolated lists of data. This focus is now the basis for many modern functional programming languages, which have properties that are desirable in distributed systems.
Google extended the map and reduce functions to execute reliably across billions of web pages on hundreds or thousands of low-cost commodity CPUs. It was Google’s use of MapReduce that encouraged others to take another look at the power of functional programming and the ability of functional programming systems to scale over thousands of low-cost CPUs. Software packages such as Hadoop have closely modeled these functions.
Figure 1. The map and reduce functions are ways of partitioning large datasets into smaller chunks that can be transformed on isolated and independent transformation systems. The key is isolating each function so that it can be scaled onto many servers.
The use of MapReduce inspired engineers from Yahoo! and other organizations to create open source versions of Google’s MapReduce. It fostered a growing awareness of the limitations of traditional procedural programming and encouraged others to use functional programming systems.
Case study: Google’s Bigtable - a table with a billion rows and a million columns
Google also influenced many software developers when they published their Bigtable white paper, titled Bigtable: A Distributed Storage System for Structured Data. The motivation behind Bigtable was the need to store results from the web crawlers that extract HTML pages, images, sounds, videos, and other media from the internet. The resulting dataset was so large that it couldn’t fit into a single relational database, so Google built their own storage system. Their fundamental goal was to build a system that would easily scale as their data increased without forcing them to purchase expensive hardware. The solution was neither a full relational database nor a filesystem, but what they called a “distributed storage system” that worked with structured data.
By all accounts, the Bigtable project was extremely successful. It gave Google developers a single tabular view of the data by creating one large table that stored all the data they needed. In addition, they created a system that allowed the hardware to be located in any data center, anywhere in the world, and created an environment where developers didn’t need to worry about the physical location of the data they manipulated.
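The idea of one very large, mostly empty table can be sketched as a sparse mapping from (row, column) pairs to values. This is only an illustration of the sparse approach, not Bigtable's API or storage format; the class name SparseTable and its methods are hypothetical.

```python
class SparseTable:
    """A single logical table in which only populated cells use storage."""

    def __init__(self):
        self._cells = {}  # (row_key, column_name) -> value

    def put(self, row, column, value):
        # No up-front schema: any row may use any column name.
        self._cells[(row, column)] = value

    def get(self, row, column, default=None):
        # Cells that were never written simply don't exist.
        return self._cells.get((row, column), default)

    def row(self, row):
        """Return the populated columns of one row as a dictionary."""
        return {c: v for (r, c), v in self._cells.items() if r == row}

    def cell_count(self):
        return len(self._cells)
```

A table like this can have any number of rows and columns in principle, while memory use grows only with the number of cells that actually hold data.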
Case study: Amazon’s Dynamo - accept an order 24 hours a day, 7 days a week
Google’s work focused on ways to make distributed batch processing and reporting easier, but wasn’t intended to support the need for highly scalable web storefronts that ran 24/7. This development came from Amazon. In 2007, Amazon published another significant NoSQL paper, Dynamo: Amazon’s Highly Available Key-Value Store. The business motivation behind Dynamo was Amazon’s need to create a highly reliable web storefront that supported transactions from around the world 24 hours a day, 7 days a week, without interruption.
Traditional brick-and-mortar retailers that operate in a few locations have the luxury of having their cash registers and point-of-sale equipment operating only during business hours. When not open for business, they run daily reports and perform backups and software upgrades. The Amazon model is different. Not only are their customers from all corners of the world, but they shop at all hours of the day, every day. Any downtime in the purchasing cycle could result in the loss of millions of dollars. Amazon’s systems need to be ironclad in their reliability and able to scale without a loss in service.
In its initial offerings, Amazon used a relational database to support its shopping cart and checkout system. They had unlimited licenses for RDBMS software and a consulting budget that allowed them to attract the best and brightest consultants for their projects. In spite of all that power and money, they eventually realized that a relational model wouldn’t meet their future business needs.
Many in the NoSQL community cite Amazon’s Dynamo paper as a significant turning point in the movement. At a time when relational models were still the norm, it challenged the status quo and current best practices. Amazon found that because key-value stores had a simple interface, the data was easier to replicate and the resulting system was more reliable. In the end, Amazon used a key-value store to build a turnkey system that was reliable, extensible, and able to support their 24/7 business model, making them one of the most successful online retailers in the world.
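To see why a simple interface makes replication easier, consider a store that offers only put and get. The sketch below is a deliberately naive illustration and not Dynamo's design: the class ReplicatedKeyValueStore, the replica_count parameter, and the failed argument used to simulate outages are all hypothetical, and it ignores real concerns such as partitioning and conflicting writes.

```python
class ReplicatedKeyValueStore:
    """Copies every write to several replicas; reads use any live replica."""

    def __init__(self, replica_count=3):
        # Each dictionary stands in for a separate server.
        self._replicas = [dict() for _ in range(replica_count)]

    def put(self, key, value):
        # With only whole-value writes, replication is a plain copy.
        for replica in self._replicas:
            replica[key] = value

    def get(self, key, failed=()):
        # 'failed' lists replica indexes to treat as down (simulation only).
        for index, replica in enumerate(self._replicas):
            if index in failed:
                continue
            if key in replica:
                return replica[key]
        raise KeyError(key)
```

For example, after store.put("cart:42", ["book"]), a call to store.get("cart:42", failed={0, 1}) still returns the cart, because the third replica holds its own copy.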
Case study: MarkLogic
In 2001 a group of engineers in the San Francisco Bay Area with experience in document search formed a company that focused on managing large collections of XML documents. Because XML documents contained markup, they named the company MarkLogic.
MarkLogic defined two types of nodes in a cluster: query and document nodes. Query nodes receive query requests and coordinate all activities associated with executing a query. Document nodes contain XML documents and are responsible for executing queries on the documents in the local filesystem.
Query requests are sent to a query node, which distributes queries to each remote server that contains indexed XML documents. All document matches are returned to the query node. When all document nodes have responded, the query result is then returned.
The MarkLogic architecture, moving queries to documents rather than moving documents to the query server, allowed them to achieve linear scalability with petabytes of documents.
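The query-node and document-node roles described above can be sketched as a simple scatter-gather pattern. This is an illustrative toy, not MarkLogic's API: the classes DocumentNode and QueryNode are hypothetical, a plain substring match stands in for a real XML index, and the fan-out runs sequentially where a real cluster would query the nodes in parallel.

```python
class DocumentNode:
    """Holds documents locally and runs queries against them in place."""

    def __init__(self, documents):
        self._documents = documents  # doc_id -> XML text stored on this node

    def search(self, term):
        # Only the matching IDs leave the node; the documents stay put.
        return [doc_id for doc_id, text in self._documents.items() if term in text]


class QueryNode:
    """Receives a query, sends it to every document node, merges the matches."""

    def __init__(self, document_nodes):
        self._nodes = document_nodes

    def search(self, term):
        matches = []
        for node in self._nodes:  # sequential here; parallel in a real cluster
            matches.extend(node.search(term))
        # The result is returned once every document node has responded.
        return sorted(matches)
```

The point of the design is that the small query travels to the large documents, so adding document nodes adds both storage and query-processing capacity.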
MarkLogic found a demand for their products in US federal government systems that stored terabytes of intelligence information and large publishing entities that wanted to store and search their XML documents. Since 2001, MarkLogic has matured into a general-purpose highly scalable document store with support for ACID transactions and fine-grained, role-based access control. Initially, the primary language of MarkLogic developers was XQuery paired with REST; newer versions support Java as well as other language interfaces.
MarkLogic is a commercial product that requires a software license for any datasets over 40 GB. NoSQL is associated with commercial as well as open source products that provide innovative solutions to business problems.
Applying your knowledge
To demonstrate how the concepts can be applied, we introduce you to Sally Solutions. Sally is a solution architect at a large organization that has many business units. Business units that have information management issues are assigned a solution architect to help them select the best solution to their information challenge.
Sally works on projects that need custom applications developed and she’s knowledgeable about SQL and NoSQL technologies. Her job is to find the best fit for the business problem.
Now let’s see how Sally applies her knowledge in two examples. In the first example, a group that needed to track equipment warranties of hardware purchases came to Sally for advice. Since the hardware information was already in an RDBMS and the team had experience with SQL, Sally recommended they extend the RDBMS to include warranty information and create reports using joins. In this case, it was clear that SQL was appropriate.
In the second example, a group that was in charge of storing digital image information within a relational database approached Sally because the performance of the database was negatively impacting their web application’s page rendering. In this case, Sally recommended moving all images to a key-value store, which referenced each image with a URL. A key-value store is optimized for read-intensive applications and works with content distribution networks. After removing the image management load from the RDBMS, the web application as well as other applications saw an improvement in performance.
Note that Sally doesn’t see her job as a black-and-white, RDBMS versus NoSQL selection process. Sometimes the best solution involves using hybrid approaches.
This blog began with an introduction to the concept of NoSQL. We then showed how the power wall forced systems designers to use highly parallel processing designs and required a new type of thinking for managing data. You also saw that traditional systems that use object-middle tiers and RDBMS databases require the use of complex object-relational mapping systems to manipulate the data. These layers often get in the way of an organization’s ability to react quickly to changes (agility).
When we venture into any new technology, it’s critical to understand that each area has its own patterns of problem solving. These patterns vary dramatically from technology to technology. Making the transition from SQL to NoSQL is no different. NoSQL is a new paradigm and requires a new set of pattern recognition skills, new ways of thinking, and new ways of solving problems. It requires a new cognitive style.
Opting to use NoSQL technologies can help organizations gain a competitive edge in their market, making them more agile and better equipped to adapt to changing business conditions. NoSQL approaches that leverage large numbers of commodity processors save companies time and money and increase service reliability.
As you’ve seen in the case studies, these changes impacted more than early technology adopters: engineers around the world realize there are alternatives to the RDBMS-as-our-only-option mantra. New companies focused on new thinking, technologies, and architectures have emerged not as a lark, but as a necessity for solving real business problems that don’t fit into a relational mold. As organizations continue to change and move into global economies, this trend will continue to expand.