This blog provides more detail about the major components of Oracle Endeca Information Discovery Integrator. The Endeca Information Discovery Integrator consists of the following five components:
- Integrator ETL
- Integration Server
- Integrator Acquisition System
- Web Acquisition Toolkit
- IKM SQL to Endeca Server
Integrator ETL
Integrator ETL is an integration platform that extracts source records from many types of sources. It is a graphical development environment whose primary purpose is to develop, debug, and execute ETL processes. The diagrams of these processes are referred to as graphs; a graph is a sequence of components that process data. The user interface for Integrator ETL is based on the open-source Eclipse integrated development environment (IDE). Integrator ETL can run graphs without the integration server; in that case, the graphs run on the machine that hosts the Integrator ETL IDE.
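To make the idea of a graph more concrete, here is a minimal Python sketch of records flowing through a sequence of components. It is only an illustration of the concept, not Integrator ETL's own graph format or API; the function names (`parse_csv`, `trim_fields`, `drop_missing_id`, `run_graph`) and the sample data are invented for this example.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical types for this sketch: a record is a flat dict of strings,
# and a component is any function that consumes and produces records.
Record = Dict[str, str]
Component = Callable[[Iterable[Record]], Iterator[Record]]


def parse_csv(text: str) -> Iterator[Record]:
    """Reader step: turn delimited text into records."""
    yield from csv.DictReader(io.StringIO(text))


def trim_fields(records: Iterable[Record]) -> Iterator[Record]:
    """Transform step: strip stray whitespace from every value."""
    for record in records:
        yield {key: value.strip() for key, value in record.items()}


def drop_missing_id(records: Iterable[Record]) -> Iterator[Record]:
    """Filter step: discard records that have an empty id."""
    for record in records:
        if record.get("id"):
            yield record


def run_graph(source: Iterable[Record], components: List[Component]) -> List[Record]:
    """Pass records through each component in order and collect the output."""
    stream = source
    for component in components:
        stream = component(stream)
    return list(stream)


SAMPLE = "id,name\n1, Widget \n,Orphan\n2,Gadget\n"
result = run_graph(parse_csv(SAMPLE), [trim_fields, drop_missing_id])
print(result)
```

Each step only sees the output of the step before it, which is the property that lets a graphical tool draw the process as a chain of connected boxes.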
Integrator ETL supports connectivity to a wide variety of other systems and software. JDBC is used to provide connectivity to the following databases:
Bulk loader capability is available for Oracle, Microsoft SQL Server, Informix, and IBM DB2.
Integration with systems supporting Java Message Service (JMS) is available for the following:
- IBM WebSphere MQ
- Apache ActiveMQ
- JBoss Messaging
The following protocols are also supported:
Some proprietary formats are also supported, including:
- Microsoft Excel
- FoxPro (dBase)
Integration Server
The integration server allows graphs to be treated as a production process and run in an enterprise-wide team environment. As discussed in the previous section, graphs can run in Integrator ETL. The integration server provides an alternative and should be considered for any large-scale deployment of Oracle Endeca. Like other production process tools, it can schedule graphs and monitor their execution status.
The integration server comes with a web-based user interface for configuration and administration. In addition, the integration server features an API for remote operations and interoperability with other systems.
Integrator Acquisition System
The Integrator Acquisition System (IAS) is a set of components that crawl source data stored in a variety of formats, including file systems, JDBC databases, flat files, web sources, and custom data sources. IAS transforms the data, if necessary, and outputs it to an XML file or a record store that can be accessed by Integrator ETL. Within IAS there are two major processing entities: the IAS Server and the Web Crawler.
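The crawl-then-output flow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how IAS is implemented: the functions `crawl_directory` and `records_to_xml` are hypothetical, and the `records`/`record`/`property` element names are made up for the example rather than taken from the actual IAS output schema.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import Dict, List


def crawl_directory(root_dir: str) -> List[Dict[str, str]]:
    """Walk a directory tree and emit one record per file found."""
    records = []
    for folder, _subdirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            records.append({
                "name": name,
                "extension": os.path.splitext(name)[1].lstrip("."),
                "size_bytes": str(os.path.getsize(path)),
            })
    return records


def records_to_xml(records: List[Dict[str, str]]) -> str:
    """Serialize records to XML (element names are assumptions for this sketch)."""
    root = ET.Element("records")
    for record in records:
        node = ET.SubElement(root, "record")
        for key, value in record.items():
            ET.SubElement(node, "property", name=key).text = value
    return ET.tostring(root, encoding="unicode")


# Demonstration against a throwaway directory containing a single file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as handle:
        handle.write("hello")
    crawled = crawl_directory(tmp)
    xml_output = records_to_xml(crawled)

print(xml_output)
```

The point of the intermediate record structure is that the crawler and the output writer stay independent, so the same records could go to a file or to a store that a downstream ETL process reads.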
The IAS Server crawls JDBC sources, file systems, or custom-created sources, and it supports an extensive range of source types and file formats. Table 1 lists the types of sources along with some examples that illustrate the versatility of this product. For a complete list of supported formats, please refer to the appendix.
The IAS Web Crawler is installed by default as part of IAS and gathers data by crawling HTTP and HTTPS web sites. Once a crawl is completed, Integrator ETL can access the data acquired during the crawl. A crawl usually writes data directly to Endeca Server. However, data can also be written to an XML file for debugging and development. The Web Crawler is designed for large-scale crawling and has an architecture that allows developers to create custom plug-ins. Plug-ins provide a means to extract additional content, such as HTML meta tags, from web pages.
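As a rough idea of what a meta-tag extraction step does, the following Python sketch pulls `name`/`content` pairs out of a page using only the standard library. It does not use the Web Crawler's plug-in interface; the class `MetaTagExtractor`, the helper `extract_meta_tags`, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaTagExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without a name/content pair, such as charset.
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta_tags(html: str) -> Dict[str, str]:
    """Parse a page and return its named meta tags, keyed by lowercase name."""
    parser = MetaTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


PAGE = """<html><head>
<title>Catalog</title>
<meta name="Description" content="Product catalog">
<meta name="keywords" content="widgets, gadgets">
<meta charset="utf-8">
</head><body><p>Hello</p></body></html>"""

meta_tags = extract_meta_tags(PAGE)
print(meta_tags)
```

In a real crawl, the extracted pairs would be attached to the record for the page so that they travel with the rest of the crawled content.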
Web Acquisition Toolkit
The Oracle Endeca Web Acquisition Toolkit (WAT) provides an intuitive, simple-to-use graphical interface for collecting content from the Web, allowing users to rapidly access and integrate any information exposed through a web front end. Endeca WAT allows for information to be collected from many sources, including content from consumer sites, industry forums, government or supplier portals, cloud applications, and other big data sources.
Endeca Web Acquisition Toolkit Design Studio is an IDE for building data integration workflows. It combines the best aspects of a web browser and a visual flow editor. Developers navigate applications and data sources visually and use an XML editor to generate workflows in minutes, which eliminates the need to write code.
IKM SQL to Endeca Server
IKM SQL to Endeca Server is an integration knowledge module (IKM) that allows users of Oracle Data Integrator (ODI) to write directly to Endeca Server.