Integrator ETL is the central hub of Oracle Endeca Integrator. The other components listed previously provide data to Integrator ETL. Only Integrator ETL is capable of executing all the steps in Figure 1. Integrator ETL is an ETL tool, and the “load” part of Integrator ETL facilitates data ingest into Endeca Server. Integrator ETL is not limited to loading data from Endeca IAS and Endeca WAT. Integrator ETL can read data from many data sources, including flat files, Excel spreadsheets, XML files, and nearly every type of database.
Integrator ETL Background
Integrator ETL is an Oracle distribution of CloverETL, a widely used commercial ETL product. The version of CloverETL distributed by Oracle has special features for Endeca, and this is what makes it Endeca ETL. These special features are in a palette of commands on the Discovery tab under the New drop-down menu, with two types of wizards for Endeca. These are shown in Figure 1.
FIGURE 1. Endeca features added to CloverETL for Endeca ETL
Endeca Server operations are carried out through the web services using Bulk Add/Replace Records, available on the Discovery tab. In addition, output writers are available for nearly all of the major databases, including Oracle, MySQL, DB2, and PostgreSQL, as well as for Hadoop, via native HDFS connectivity or Hive through JDBC.
This versatility enables Endeca ETL to serve as a general-purpose ETL tool for your organization to address any integration needs since it can read nearly any data source and output to a wide array of data servers.
Oracle provides usage documentation for Integrator ETL, including the “Integrator ETL Designer Guide,” which is documentation by CloverETL on how to use the tool, and the “Integrator ETL User’s Guide,” which is documentation by Oracle that gives instructions and examples specific to Endeca.
Text Enrichment
Earlier in my blog we covered two unstructured data acquisition tools that are part of Oracle Endeca: the Endeca IAS and the Endeca WAT. Both tools acquire unstructured data and make it available to Endeca ETL. Unstructured data usually consists of large volumes of text, and when little is known about its content, Text Enrichment can provide insight into it. A product known as the Salience Engine from Lexalytics adds the Text Enrichment capability to Endeca ETL. The Salience Engine extracts entities from text data, including people, places, and organizations, and it also extracts quotations and themes. Sentiment Analysis is another feature available from the Salience Engine, and when bundled with Text Enrichment, it is known as Text Enrichment with Sentiment Analysis. Text Enrichment with Sentiment Analysis provides an overall sentiment score for the current document, for specific entities, or for specific themes.
Text Enrichment and Text Enrichment with Sentiment Analysis features can be accessed only from within Integrator ETL. The Salience Engine is installed separately from Integrator ETL and is also licensed separately. We will cover aspects of the installation in the “Integrator ETL Text Enrichment Installation” section of this blog.
Be aware that Text Enrichment will only process sentences ending with appropriate punctuation, which includes periods, question marks, and exclamation marks. If sentences are not properly terminated with punctuation, then themes may not be extracted. Text Enrichment also supports foreign languages and can process Twitter feeds. Note that Twitter feed processing is supported only for English.
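Because unterminated sentences can cause themes to be missed, it can help to check text before it reaches Text Enrichment. The following is a minimal Python sketch of such a check. It is not part of Integrator ETL; the function names are ours, and ignoring trailing quotation marks and brackets is our assumption about what counts as properly terminated.

```python
# Sentence-ending punctuation named above: period, question mark, exclamation mark.
TERMINATORS = (".", "?", "!")

# Closing characters we choose to ignore after the terminator (an assumption).
TRAILING_CLOSERS = "\"')]”’"


def ends_with_terminator(text: str) -> bool:
    """Return True if text ends with . ? or ! once trailing
    whitespace and closing quotes or brackets are ignored."""
    stripped = text.rstrip().rstrip(TRAILING_CLOSERS)
    return stripped.endswith(TERMINATORS)


def unterminated_lines(lines):
    """Return the non-empty lines that lack terminal punctuation."""
    return [line for line in lines
            if line.strip() and not ends_with_terminator(line)]
```

Lines flagged by unterminated_lines are candidates for cleanup, for example headlines or list items pulled from a web page, before the text is passed to the enrichment step.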
Integrator ETL Basic Usage
Previously we provided a thorough example and explanation of using Endeca ETL. In this blog article we will cover some aspects of Endeca ETL not covered earlier. As was stated earlier in my blog, Integrator ETL is based on CloverETL, a popular ETL tool, and has features to support its use with Endeca. In this section, we will cover some basic concepts of Integrator ETL and explain these concepts with some examples. This section does not endeavor to explain ETL tools in general and assumes you are familiar with the basic features of ETL tools. If you have used ETL tools before, many of the components in Endeca ETL will be familiar to you in terms of their function. Because Endeca ETL is a mainstream ETL tool adapted to Endeca usage, we will focus on features specific to Endeca.
Clarifying Enrichment Features of Endeca Studio and Integrator ETL
You will recall that enrichment in Endeca Studio was covered in this blog with an example using Whitelist Text Tagging. The enrichment feature in Endeca Studio consists of Whitelist Text Tagging and another feature, Extract Terms. In the Integrator ETL documentation called “Integrator ETL User’s Guide,” there is a section titled “Choosing Text Enrichment or Text Tagging.” This can be confusing, so let’s try to get this straightened out. Endeca Studio’s enrichment features that provide whitelisting and term extraction have a counterpart in Endeca ETL known as Text Tagging. Endeca Studio Enrichment and Endeca ETL Text Enrichment are not the same. Endeca Studio Enrichment does not have the Text Enrichment features provided by the Lexalytics Salience Engine.
Integrator ETL provides a design environment where extract, transform, and load (ETL) processes are developed for Endeca. These processes are known as graphs and consist of components connected by lines that are referred to as edges. Components are added to a graph by dragging them from a palette, and edges are created by using the mouse to connect one component to another. Reader components collect data from a variety of sources and can be thought of as providing input to the graphs. Writer components send transformed data to data stores and can be thought of as producing output from the graphs. Between reader components and writer components are transformation components that modify data, combine data from multiple sources, and manage control of program flow in a graph. Let’s take a look at a simple example of a graph, as shown in Figure 2.
FIGURE 2. Endeca ETL graph
In this example, there are two reader components for Microsoft Excel spreadsheets whose data is joined and then written to Endeca Server. This graph is used to combine the FDIC bank failure data with longitude and latitude data to allow the map component to be used with this data in Endeca Studio. The transformation component of interest in this example is the ExtHashJoin component, which joins the data from the two spreadsheets. The lines between the components, known as edges, have metadata associated with them.
Edges determine what data moves from one component to another and always have metadata associated with them. You can view the metadata associated with an edge simply by double-clicking it. Figure 3 shows how this metadata appears in Endeca ETL; this is the metadata associated with the input to the BankFailureData writer component.
FIGURE 3. Endeca ETL metadata
In this metadata, item number 8 is the geo data used to drive the map component. This data consists of latitude and longitude data joined together and separated by a space. This is performed in the ExtHashJoin component, as shown in Figure 4.
FIGURE 4. Creating geo data by joining latitude and longitude
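To make the join and the geo field concrete, here is a rough Python sketch of the idea. This is not the ExtHashJoin transformation code from the graph; the join key (city), the field names, and the sample rows are made up for illustration. Only the shape of the geo value, latitude followed by a space and then longitude, comes from the description above.

```python
def hash_join(failures, coordinates, key):
    """Inner-join two lists of dicts on `key` using an in-memory hash table,
    adding a `geo` field built from the matching coordinates."""
    # Build side: index the coordinate rows by the join key.
    lookup = {row[key]: row for row in coordinates}
    joined = []
    # Probe side: look up each bank failure row in the index.
    for row in failures:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            # Geo value: latitude, a single space, then longitude.
            merged["geo"] = f"{match['latitude']} {match['longitude']}"
            joined.append(merged)
    return joined


# Made-up sample rows; the real spreadsheets have different columns.
failures = [
    {"city": "Atlanta", "bank": "Example Bank A"},
    {"city": "Nowhere", "bank": "Example Bank B"},
]
coordinates = [
    {"city": "Atlanta", "latitude": 33.749, "longitude": -84.388},
]

records = hash_join(failures, coordinates, "city")
```

In this sketch, rows without matching coordinates are dropped, which is one possible join behavior; whether unmatched records should be kept is a design choice to make in the actual join component.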
For Endeca Server to recognize the geo field as GeoSpatial data that can be used with the Endeca Studio map component, you must create a special attribute in the metadata. To do this, you add a custom attribute, mdexType, and assign it a value of mdex:geocode, as shown in Figure 5.
FIGURE 5. mdex:geocode attribute required for the Endeca Studio Map component
Integrator ETL and Text Enrichment
Earlier in my blog, we showed an example of a crawl of FDIC press release information using Endeca IAS. This press release information is unstructured data containing unknown information and can be enhanced by using Text Enrichment. First let’s look at how you connect to Endeca IAS from Endeca ETL. Endeca ETL has a component specifically for collecting data from the web service exposed by Endeca IAS, the RECORD_STORE_READER. Figure 6 shows the parameters for this component. This component requires the DNS name or IP address of the IAS server, the port used for the web service, and the name of the record store. The client ID is an arbitrary name used to identify the client. The IAS read type specifies that all records should be read, with the Full Extract option, or only records that have changed since the last read by the same client ID, with the incremental option.
FIGURE 6. RECORD_STORE_READER Endeca ETL component
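The difference between the two read types can be illustrated with a toy model. The Python sketch below is an in-memory stand-in written for this explanation only; it is not the Endeca IAS web service or the RECORD_STORE_READER component, and the class, method, and record names are hypothetical. It shows only the behavior described above: a full extract returns everything, while an incremental read returns what has changed since the last read by the same client ID.

```python
class ToyRecordStore:
    """In-memory stand-in for a record store; not the Endeca IAS API."""

    def __init__(self):
        self._records = {}    # record id -> (version, data)
        self._version = 0     # increases on every write
        self._last_read = {}  # client id -> store version at its last read

    def write(self, record_id, data):
        """Add or update a record and stamp it with a new version."""
        self._version += 1
        self._records[record_id] = (self._version, data)

    def read(self, client_id, read_type="full"):
        """Return all records for a full extract, or only records changed
        since this client's last read for an incremental read."""
        if read_type == "full":
            since = 0
        else:
            since = self._last_read.get(client_id, 0)
        result = {rid: data
                  for rid, (version, data) in self._records.items()
                  if version > since}
        # Remember how far this client has read.
        self._last_read[client_id] = self._version
        return result
```

Note that a client ID with no previous read gets every record even in incremental mode, which is why the choice of client ID matters when several graphs read from the same record store. Deleted records are not modeled in this sketch.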
The edge on the output of the RECORD_STORE_READER must have metadata that matches the data stored in the record store. Endeca ETL has a wizard to make this possible. Figure 7 shows this wizard.
FIGURE 7. Loading metadata from Endeca IAS record store
To make sense of the text collected from RECORD_STORE_READER, you can use the Text Enrichment component. The Text Enrichment component requires that a file called TextEnrichment.properties be present on the Integrator ETL file system. This file determines how Text Enrichment will function. An example of this file is shown here:
Also, the Salience Engine by Lexalytics must be installed on the same server as Integrator ETL. The installation is covered in the “Integrator ETL Text Enrichment Installation” section later in this article. Be careful to use forward slashes in all path names on Windows systems when referring to the installation location of the Salience Engine. The TextEnrichment.properties file is used to specify which features of Text Enrichment are used with the Text Enrichment component. In this example, you will extract entities, themes, and a summary. Figure 8 shows the Text Enrichment component. Note that the input field defines which field will be enriched by the component. The location of the text enrichment configuration file must be specified on the component in the configuration parameter field. Note the forward slashes used on Windows systems instead of the usual backslashes used in directory names. In this example, the Endeca_Document_Text field will be enriched. This is one of the fields in the metadata for the RECORD_STORE_READER component. The Salience data path refers to the installation location of the Salience Engine.
FIGURE 8. Endeca ETL Text Enrichment component
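If you copy a path from Windows Explorer, it will contain backslashes and needs to be converted before you paste it into the component or the properties file. A small Python sketch of the conversion is shown here; the helper name and the example path are hypothetical and are not the actual installation location.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path: str) -> str:
    """Rewrite a Windows path so that it uses forward slashes."""
    return PureWindowsPath(windows_path).as_posix()


# Hypothetical path, used only to show the conversion.
converted = to_forward_slashes(r"C:\Oracle\Endeca\TextEnrichment.properties")
print(converted)
```

The same conversion applies to the Salience data path and to any path written inside TextEnrichment.properties.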
One final item to configure for the Text Enrichment component is the metadata for the output of the component. This metadata must include all fields added by the Text Enrichment component. These field names are defined in the TextEnrichment.properties file. This is shown in Figure 9.
FIGURE 9. Endeca ETL Text Enrichment metadata
With this configuration done, you can now run the graph in debug mode and examine some of the output. Figure 10 shows some of the entities discovered by Text Enrichment.
FIGURE 10. Sample of entities extracted by Text Enrichment
From Figure 10 you can see Text Enrichment generally extracts people correctly, but some additional filtering of this field may be needed. Using Text Enrichment and developing applications that use Text Enrichment are iterative processes, where extraction processes are optimized, or “dialed in,” and the quality of the source data is evaluated.
Integrator ETL Installation
Oracle provides an installation guide specifically for Integrator ETL called “Oracle Endeca Information Discovery Integrator: Integrator ETL Installation Guide” that covers in depth the installation of Integrator ETL. Red Hat and Oracle Linux are supported, as is Microsoft Windows 2008 Server. Only Intel 64-bit x86 processors are supported. We have used both the Linux and Windows versions of Integrator ETL and found the Windows version to be a better experience. The installation guide states that Windows 7 can be used for development and nonproduction environments. We have used Windows 7 and Windows 8 successfully.
Since Integrator ETL is a Java-based application, installation differs from the process you would normally use with Windows installer programs. The installation file is delivered as a ZIP file, which should be unzipped to a convenient location. A Windows batch script file is used to facilitate the installation.
We recommend running the batch file from the Windows command-line interface instead of from Windows Explorer. This involves opening a Windows command prompt and changing to the directory where the batch file is located using the Windows change directory command, cd. Run the batch file to begin the installation. The installation guide describes two required files, the Eclipse IDE for Java Developers version 3.7 and the Remote System Explorer. These two ZIP files should be located on the Windows file system where you are installing Integrator ETL. Type install.bat to run the installation; an example of this is shown here:
To start Integrator ETL, type the name of the executable, as shown here:
To allow easy access to Integrator ETL, create a Windows shortcut and copy it to the Windows taskbar with your other favorite development tools, as shown in Figure 11.
FIGURE 11. Windows shortcut to Integrator ETL for ease of access
Integrator ETL Text Enrichment Installation
If you plan on using Text Enrichment, be aware that a separate installation is required for the Salience Engine. You can download this from the Oracle Software Delivery Cloud; it is available in two versions, Text Enrichment and Text Enrichment with Sentiment Analysis. The Salience Engine must be installed on the same server hosting Endeca ETL. These are large downloads, 1.7GB as of this writing. Oracle provides a document with this download describing the installation of Text Enrichment titled “Oracle Endeca Text Enrichment Installation Guide.” For Windows installations, the Text Enrichment installer downloads and installs required components from the Internet.
If you plan on using Text Enrichment with foreign languages, then you must download and install the Text Enrichment foreign language packages from the Oracle Software Delivery Cloud.