The Oracle Endeca Web Acquisition Toolkit, or Endeca WAT, is a new offering with Endeca 3.1 and, like Endeca IAS, is a tool intended to capture unstructured data from web sources and make it available to Integrator ETL.
Endeca WAT Background
Oracle has partnered with Kapow Software to offer Kapow Katalyst as part of Endeca, branded as the Oracle Endeca Information Discovery Web Acquisition Toolkit. Kapow Katalyst is a widely used software product for acquiring web data. By partnering with Kapow for this offering, Oracle has chosen a venerable and reliable solution.
One compelling feature of Endeca WAT is the graphical user interface, used to design web acquisition applications. This graphical user interface is referred to as Design Studio, and for the purpose of this book, we will refer to it as Endeca WAT Studio.
Perhaps the most compelling feature of Endeca WAT is its ability to extract field data from elements of web pages that can be used to create attributes. With this feature, extracted dates, product names, user ratings, and comments are each distinct attributes of a schema.
Oracle Endeca Studio is already covered in our blogs. Endeca WAT refers to its primary user interface as Design Studio. Studio is a commonly used name for software features that provide a graphical environment for development tasks. We chose the name Endeca WAT Studio to avoid confusion between Kapow Katalyst Design Studio and Endeca Studio.
Endeca WAT Insights
Endeca WAT Studio is used to create what are referred to as robots. Robots allow the automation of any task that can be performed in a browser, from clicking buttons or hyperlinks to selecting elements on a web form. Endeca WAT Studio is well suited to developing applications that click a set of recurring buttons or hyperlinks on a web form and extract data after each click. A common use of Endeca WAT is extracting opinion information from retail web sites that allow users to enter ratings and opinions. A WAT robot of this kind typically contains nested loops that depend on the web forms being consistent, regardless of how many pages of opinion data are available. This paradigm is successful until the tag information that Endeca WAT relies on changes from one element to the next, which tends to occur on web sites whose content spans years or decades.
For example, in this blog, you used data from the FDIC web site on bank failures. The web page for this site provides hyperlinks for bank failures starting in 2000. When you attempt to drill into these hyperlinks with Endeca WAT to obtain press release information on each bank closure, the robot begins reporting errors once it reaches the older hyperlinks, because the underlying HTML document format changes. WAT does offer exception handling to manage this, but in these circumstances the robot quickly turns into a complex programming and configuration task.
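To make the problem concrete, here is a minimal Python sketch of the same idea outside of Endeca WAT. It is not how a WAT robot is built; robots are designed graphically. The two HTML fragments, the patterns, and the function names are all hypothetical and do not reflect the actual FDIC markup. The sketch tries one extraction pattern per known page layout and records an error for any element that matches none of them, which is roughly the work that exception handling in a robot has to take on.

```python
import re

# Hypothetical fragments: a newer layout, an older layout, and an unknown one.
NEW_LAYOUT = '<tr><td class="date">2014-01-17</td><td class="title">Bank A closes</td></tr>'
OLD_LAYOUT = '<p><b>01/12/2001</b> Bank B closes</p>'
UNKNOWN_LAYOUT = '<div>unrecognized markup</div>'

# One pattern per known layout, tried in order. Regular expressions are a
# crude stand-in for the tag paths a robot would use.
PATTERNS = [
    re.compile(r'<td class="date">(?P<date>[^<]+)</td>'
               r'<td class="title">(?P<text>[^<]+)</td>'),
    re.compile(r'<b>(?P<date>[^<]+)</b>\s*(?P<text>[^<]+)</p>'),
]


def extract(fragment):
    """Return (date, text) from the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(fragment)
        if match:
            return match.group("date").strip(), match.group("text").strip()
    return None


def run(fragments):
    """Loop over fragments, collecting records and the indexes that failed."""
    records, errors = [], []
    for index, fragment in enumerate(fragments):
        result = extract(fragment)
        if result is None:
            errors.append(index)  # layout not handled by any known pattern
        else:
            records.append(result)
    return records, errors
```

Every additional layout on the site means another pattern and another branch to test, which is why the effort grows quickly on sites with many years of content.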
Endeca WAT can also collect data from Microsoft Excel spreadsheets and from PDF files. Endeca WAT stores its data in a database, and it supports all of the major database servers, including Oracle RDBMS and MySQL.
Endeca WAT Example
In the example for Endeca WAT, you will revisit the FDIC press release web site you crawled with Endeca IAS. You will focus on the page for 2014 press releases, using the URL. Create a new project and specify this URL. For this example, choose the date tag on the page as the basis for a loop. Figure 1 shows the robot used to collect date and text information for each press release. After each pass through the loop, the date and text are written to an Oracle database. The Oracle database in this example serves as the repository for data collected by the robot and makes the data available to Integrator ETL.
FIGURE 1. Robot to collect FDIC press release data
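For readers who think in code, the loop that the robot performs can be sketched in Python as follows. This is an illustration under stated assumptions, not an export of the robot: the sample markup, the `ReleaseParser` class, the `load` function, and the `press_release` table are all hypothetical, and an in-memory SQLite database stands in for the Oracle database so that the sketch runs on its own.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup standing in for a press release index page.
SAMPLE_PAGE = """
<ul>
  <li><span class="date">January 17, 2014</span> <a href="#">First example release</a></li>
  <li><span class="date">January 24, 2014</span> <a href="#">Second example release</a></li>
</ul>
"""


class ReleaseParser(HTMLParser):
    """Collect (date, text) pairs, one per date tag found on the page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # which field the next text node belongs to
        self._date = None   # date waiting for its matching text

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "date") in attrs:
            self._field = "date"
        elif tag == "a" and self._date is not None:
            self._field = "text"

    def handle_data(self, data):
        if self._field == "date":
            self._date = data.strip()
        elif self._field == "text":
            self.rows.append((self._date, data.strip()))
            self._date = None
        self._field = None


def load(page, connection):
    """Parse the page and write one row per press release; return the row count."""
    parser = ReleaseParser()
    parser.feed(page)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS press_release "
        "(release_date TEXT, release_text TEXT)"
    )
    for row in parser.rows:  # one insert per pass, as in the robot's loop
        connection.execute("INSERT INTO press_release VALUES (?, ?)", row)
    connection.commit()
    return len(parser.rows)
```

Calling `load(SAMPLE_PAGE, sqlite3.connect(":memory:"))` fills the table with one row per date tag. In the real example, the table lives in the Oracle database, where Integrator ETL can read it.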
You can use the For Each Tag Path control to step through the robot for debugging purposes. This feature makes testing and debugging robots easy. Variables can be viewed in a watch window during this single-stepping mode, as shown in Figure 2.
FIGURE 2. Endeca WAT variable debugging window
When robots are debugged and ready for production usage, they can be uploaded to the Endeca WAT management console. The management console can schedule a robot for execution and notify administrators when a robot has a run-time issue. Figure 3 shows the management console.
FIGURE 3. Endeca WAT management console
Endeca WAT Installation
In this section, we provide insight into the installation of Endeca WAT that is not covered in Oracle’s official installation documentation. Oracle provides an installation guide specifically for Endeca WAT called “Oracle Endeca Web Acquisition Toolkit: Installation Guide”. Red Hat and Oracle Linux are supported, as is Microsoft Windows 2008 Server. We have used both the Linux and Windows versions of Endeca WAT and found the Windows version to be superior with regard to the clarity of the visual presentation and usability. Windows 7 and Windows 8 can be used for development and nonproduction environments. We have used Windows 8 successfully. Installing Endeca WAT is straightforward, and in minutes you will be using it.