The amount of data being generated is on the verge of an explosion. According to a 2012 International Data Corporation (IDC) report, the total amount of data stored by corporations globally would surpass a zettabyte (1 zettabyte = 1 billion terabytes) by the end of 2012. It is therefore critical for data-driven companies to be prepared with an infrastructure that can store and analyze extremely large datasets and generate actionable intelligence that in turn can drive business decisions. Oracle offers a broad portfolio of products to help enterprises acquire, manage, and integrate big data with existing corporate data, and perform rich and intelligent analytics.
Implementing big data solutions with tools and techniques that are not tested or integrated is too risky and problematic. The approach to solving big data problems should follow best practice methodologies and toolsets that are proven in real-world deployments. The typical best practices for processing big data can be categorized by the flow of data in the processing stream: data acquisition, data organization, and data analysis. Oracle’s big data technology stack includes hardware and software components that can process big data during all the critical phases of its lifecycle, from acquisition to storage to organization to analysis.
Oracle engineered systems such as Oracle Big Data Appliance, Oracle Exadata, and Oracle Exalytics, along with Oracle’s proprietary and open source software, are able to acquire, organize, and analyze all enterprise data, including structured and unstructured data, to help make informed business decisions.
The acquire phase refers to the acquisition of incoming big data streams from a variety of sources such as social media, mobile devices, machine data, and sensor data. The data often has flexible structures, and arrives at high velocity and in large volumes. The infrastructure needed to ingest and persist these big datasets needs to provide low and predictable latencies when writing data, high throughput on scans, and very fast lookups, and it needs to support dynamic schemas. Some of the popular technologies that support the requirements of storing big data are NoSQL databases, Hadoop Distributed File System (HDFS), and Hive.
NoSQL databases are designed to support high performance and dynamic schema requirements; in fact, they are considered the real-time databases of big data. They are able to provide fast throughput on writes because they use a simple data model in which the data is stored as-is with its original structure, along with a single identifying key, rather than interpreting and converting the data into a well-defined schema. The reads also become very simple: You supply a key and the database quickly returns the value by performing a key-based index lookup. The NoSQL databases are also distributed and replicated to provide high availability and reliability, and can linearly scale in performance and capacity just by adding more Storage Nodes to the cluster. With this lightweight and distributed architecture, NoSQL databases can rapidly store a large number of transactions and provide extremely fast lookups.
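The key-value model described above can be illustrated with a minimal Python sketch. It is a toy, not a depiction of Oracle NoSQL Database or any other product: the class names (`StorageNode`, `KeyValueStore`), the simple modulo-based placement, and the replication factor of 2 are all assumptions made for illustration. It shows the value being stored as-is under a single key, replicated to more than one node, and read back with a key lookup.

```python
import hashlib


class StorageNode:
    """One node holding an in-memory key-to-value map (a stand-in for disk storage)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class KeyValueStore:
    """Toy distributed key-value store: hash the key to choose nodes, store the value as-is."""

    def __init__(self, node_names, replication_factor=2):
        self.nodes = [StorageNode(n) for n in node_names]
        self.replication_factor = min(replication_factor, len(self.nodes))

    def _replicas(self, key):
        # A stable hash picks the first node; replicas go on the nodes that follow it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replication_factor)]

    def put(self, key, value):
        # The value is treated as opaque bytes: no parsing and no schema check.
        for node in self._replicas(key):
            node.data[key] = value

    def get(self, key):
        # Any replica that holds the key can answer the lookup.
        for node in self._replicas(key):
            if key in node.data:
                return node.data[key]
        return None


store = KeyValueStore(["node1", "node2", "node3"])
store.put("user:42", b'{"name": "Ann", "last_login": "2012-11-05"}')
print(store.get("user:42"))
```

Because the write path does nothing beyond hashing the key and copying bytes, write cost stays low and predictable. A production store would use a placement scheme that avoids reshuffling most keys when nodes are added, which this sketch does not attempt.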
NoSQL databases are well suited for storing data with dynamic structures. They simply capture the incoming data without parsing or making sense of its structure. This provides low latencies at write time, which is a great benefit, but it shifts complexity to the application at read time because the application must interpret the structure of the stored data. This is often a worthwhile trade-off: when the underlying data structures change, the effect is noticed only by the application querying the data. Modifying application logic to support schema evolution is considered more cost-effective than reorganizing the data, which is resource-intensive and time-consuming, especially when multiple terabytes of data are involved. Project planners already assume that change is part of an application lifecycle, but they rarely plan for reorganization of data.
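As a rough illustration of interpreting structure at read time, the Python sketch below keeps two records whose layouts differ and lets the reading code reconcile them. The record contents, the key names, and the `read_event` helper are hypothetical; the choice of JSON as the stored format is also an assumption.

```python
import json

# Raw records are stored exactly as they arrived; the store never inspects them.
stored = {
    "event:1": b'{"user": "ann", "page": "/home"}',                     # older layout
    "event:2": b'{"user": {"id": 7, "name": "bob"}, "page": "/cart"}',  # newer layout
}


def read_event(raw):
    """Interpret a stored record at read time, handling both layouts."""
    record = json.loads(raw)
    user = record["user"]
    # Older records keep the user as a plain string; newer ones nest an object.
    user_name = user["name"] if isinstance(user, dict) else user
    return {"user_name": user_name, "page": record["page"]}


for key in sorted(stored):
    print(key, read_event(stored[key]))
```

When the layout changed, only `read_event` had to change. The older record was never rewritten, which is the cost saving the paragraph above describes.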
Hadoop Distributed File System (HDFS) is another option to store big data. HDFS is the storage engine behind the Apache Hadoop project, which is the software framework built to handle storage and processing of big data. Typical use of HDFS is for storing data warehouse–oriented datasets whose needs are store-once and scan-many-times, with the scans being directed at most of the stored data. HDFS works by splitting each file into chunks called blocks, and then storing the blocks across a cluster of HDFS servers. Like NoSQL databases, HDFS provides high scalability, availability, and reliability by replicating the blocks multiple times and by providing the capability to grow the cluster simply by adding more nodes.
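The splitting and replication idea can be sketched in a few lines of Python. This is not the HDFS implementation: the function names, the round-robin placement, the node names, and the tiny 256-byte block size are assumptions chosen to keep the example readable. Real HDFS blocks are far larger, and HDFS decides placement with its own policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    copies = min(replication, len(nodes))
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)] for r in range(copies)]
    return placement


file_bytes = b"x" * 1000
blocks = split_into_blocks(file_bytes, block_size=256)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on three different nodes in this sketch, losing any single node leaves at least two copies of each block, and adding a node to the list gives the placement function more room to spread data.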
Apache Hive is another option for storing data warehouse-like big data. It is a SQL-based infrastructure originally built at Facebook for storing and processing data residing in HDFS. Hive simply imposes a structure on HDFS files by defining a table with columns and rows, which means it is ideal for supporting structured big datasets. HiveQL is the SQL interface into Hive, through which users query data using the popular SQL language.
Neither HDFS nor Hive is designed for OLTP workloads, and neither offers update or real-time query capabilities, for which NoSQL databases are best suited. On the flip side, HDFS and Hive are best suited for batch jobs over big datasets that need to scan large amounts of data, a capability that NoSQL databases currently lack.
Once the data is acquired and stored in a persistent store such as a NoSQL database or HDFS, it needs to be organized further in order to extract any meaningful information on which further analysis could be performed. You could think of data organization as a combination of knowledge discovery and data integration, in which large volumes of big data undergo multiple phases of data crunching, at the end of which the data takes a form suitable for meaningful business analysis. It is only after the organization phase that you begin to see business value from the otherwise yet-to-be-valued big data.
Multiple technologies exist for organizing big data, the popular ones being Apache Hadoop MapReduce Framework, Oracle Database In-Database Analytics, R Analytics, Oracle R Enterprise, and Oracle Big Data Connectors.
The MapReduce framework is a programming model, originally developed at Google, to assist in building distributed applications that work with big data. MapReduce allows the programmer to focus on writing the business logic rather than on the management and control of the distributed tasks, such as task parallelization, inter-task communication, data transfers, and restarts upon failures.
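To make the programming model concrete, here is a minimal single-process word count in Python. It does not use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the parallelism, data transfer, and failure handling that a real framework supplies are deliberately absent.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    # A real framework would run map and reduce tasks in parallel on many
    # nodes and restart failed tasks; here everything runs in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big insight", "data drives decisions"])
print(counts)
```

The programmer writes only `map_phase` and `reduce_phase`, which hold the business logic. Everything in `run_job` and `shuffle` stands in for what the framework would handle.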
As you can imagine, MapReduce can be used to code any business logic to analyze large datasets residing in HDFS. MapReduce is a programmer’s paradise for analyzing big data, along with the help of several other Apache projects such as Mahout, an open source machine learning framework. However, MapReduce requires the end user to know a programming language such as Java, and even a simple scenario takes quite a few lines of code. Hive, on the other hand, translates SQL-like statements (HiveQL) into MapReduce programs behind the scenes, a nice alternative to coding in Java because SQL is a language that most data analysts are already familiar with.
Open source R along with its add-on packages can also be used to perform MapReduce-like statistical functions on the HDFS cluster without using Java. R is a statistical programming language and an integrated graphical environment for performing statistical analysis. The R language is a product of a community of statisticians, analysts, and programmers who are not only working on improving and extending R, but are also able to strategically steer its development by providing open source packages that extend the capability of R.
The results of R scripts and MapReduce programs can be loaded into the Oracle Database where further analytics can be performed (see the next section on the analyze phase). This leads to an interesting topic: the integration of big data with transactional data resident in a relational database management system such as the Oracle Database. Transactional data of an enterprise has extreme value in itself, whether it is the data about enterprise sales, or customers, or even business performance. The big data residing in HDFS or NoSQL databases can be combined with the transactional data in order to achieve a complete and integrated view of business performance.
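The idea of combining summarized big data with transactional data can be sketched with Python's built-in `sqlite3` module standing in for a relational database. Nothing here is specific to Oracle Database: the table names, columns, and values are hypothetical, and the `page_views` rows play the role of results produced earlier by a MapReduce job or R script.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, total_sales REAL)"
)
conn.execute("CREATE TABLE page_views (customer_id INTEGER, views INTEGER)")

# Transactional data that already lives in the database.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", 1200.0), (2, "Bob", 300.0)],
)
# Summarized big data loaded from an upstream batch job (hypothetical values).
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [(1, 15), (2, 240)])

# One query now sees both sales history and web activity for each customer.
rows = conn.execute(
    """
    SELECT c.name, c.total_sales, v.views
    FROM customers c
    JOIN page_views v ON v.customer_id = c.customer_id
    ORDER BY c.customer_id
    """
).fetchall()
print(rows)
```

Once both datasets sit side by side, an analyst can ask questions that neither answers alone, such as which heavy browsers have low sales.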
Oracle Big Data Connectors is a suite of optimized software packages to help enterprises integrate data stored in Hadoop or Oracle NoSQL Database with Oracle Database. It enables very fast data movements between these two environments using Oracle Loader for Hadoop and Oracle Direct Connector for Hadoop Distributed File System (HDFS), while Oracle Data Integrator Application Adapter for Hadoop and Oracle R Connector for Hadoop provide non-Hadoop experts with easier access to HDFS data and MapReduce functionality.
Oracle NoSQL Database also has the capability to expose the key-value store data to the Oracle Database by combining the powerful integration capabilities of the Oracle NoSQL Database with the Oracle Database external table feature. The external table feature allows users to access data (read-only) from sources that are external to the database such as flat files, HDFS, and Oracle NoSQL Database. External tables act like regular database tables for the application developer. The database creates a link that just points to the source of the data, and the data continues to reside in its original location. This feature is quite useful for data analysts who are accustomed to using SQL for analysis.
The infrastructure required for analyzing big data must be able to support deeper analytics such as data mining, predictive analytics, and statistical analysis. It should support a variety of data types and scale to extreme data volumes, while at the same time delivering fast response times. Supporting the ability to combine big data with traditional enterprise data is also important, because new insight comes not just from analyzing new data or existing data, but from combining and analyzing them together to provide new perspectives on old problems.
Oracle Database supports the organize and analyze phases of big data through the in-database analytics functionality that is embedded within the database. Some of the useful in-database analytics features of the Oracle Database are Oracle R Enterprise, Data Mining and Predictive Analytics, and in-database MapReduce. The point here is that further organization and analysis on big data can still be performed even after the data lands in Oracle Database. If you do not need further analysis, you can still leverage SQL or business intelligence tools to expose the results of these analytics to end users.
Oracle R Enterprise (ORE) allows the execution of R scripts on datasets residing inside the Oracle Database. The ORE engine interacts with datasets residing inside the database in a transparent fashion using standard R constructs, thus providing a rich end-user experience. ORE also enables embedded execution of R scripts, and utilizes the underlying Oracle Database parallelism to run R on a cluster of nodes.
In-Database Data Mining offers the capability to create complex data mining models for performing predictive analytics. Data mining models can be built by data scientists, and business analysts can leverage the results of these predictive models using standard BI tools. In this way the knowledge of building the models is abstracted from the analysis process.
In-Database MapReduce provides the capability to write procedural logic conforming to the popular MapReduce model, and seamlessly leverage Oracle Database parallel execution. In-database MapReduce allows data scientists to create high-performance routines with complex logic, using PL/SQL, C, or Java.
Each one of the analytical components in Oracle Database is quite powerful by itself, and combining them creates even more value for the business. Once the data is fully analyzed, tools such as Oracle Business Intelligence Enterprise Edition and Oracle Endeca Information Discovery assist the business analyst in the final decision-making process.
Oracle Business Intelligence Enterprise Edition (OBI EE) is a comprehensive platform that delivers full business intelligence capabilities, including BI dashboards, ad-hoc queries, notifications and alerts, enterprise and financial reporting, scorecard and strategy management, business process invocation, search and collaboration, mobile, integrated systems management, and more.
OBI EE includes the BI Server that integrates a variety of data sources into a Common Enterprise Information Model and provides a centralized view of the business model. The BI Server also comprises an advanced calculation and integration engine, and provides native database support for a variety of databases, including Oracle. Front-end components in OBI EE provide ad-hoc query and analysis, high precision reporting (BI Publisher), strategy and balanced scorecards, dashboards, and linkage to an action framework for automated detection and business processes. Additional integration is also provided to Microsoft Office, mobile devices, and other Oracle middleware products such as WebCenter.
Oracle Endeca Information Discovery is a platform designed to provide rapid and intuitive exploration and analysis of both structured and unstructured data sources. Oracle Endeca enables enterprises to extend their analytical capabilities to unstructured data, such as social media, websites, e-mail, and other big data. Endeca indexes all types of incoming data so that the search and discovery process can be fast, thereby saving time and cost and leading to better business decisions. The information can also be enriched further by integrating with other analytical capabilities such as sentiment and lexical analysis, and presented in a single user interface that can be utilized to discover new insights.