Text Analytics Library Plugins
This document shows how to build custom connectors to any NLP / text analytics library. These connectors invoke the underlying library to process text data in table columns as part of the query processing pipeline.
Building Text Analytics Library Connectors
To build a custom text analytics connector, you need to provide implementations of the following abstract classes in the SDK:
- The service class provides text analytics operators as a service to Sclera, using the specified library.
  - Contains an id that identifies this service.
  - Contains the method createObject that is used to create a new task object for the task named in the parameter taskName for this service.
- The task class is a wrapper over classes implementing text analytics algorithms.
  - Provides a function eval that takes a data stream (an iterator over rows, with associated metadata) as input and returns the same data stream, with each row augmented by columns resultCols containing the output of executing the task taskName on the text in column inputCol. If the evaluation on a row emits multiple evaluation results, the input row is repeated in the output for each such result.
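The division of labor between the two classes can be illustrated with a minimal, self-contained sketch. It does not use the Sclera SDK: the names ConnectorSketch, TextService, TextTask, ToyService, WhitespaceTokenizer and the Row alias are hypothetical stand-ins, and rows are modeled as plain maps rather than the SDK's data stream with metadata. The source does not say what happens to a row that yields no results; this sketch simply drops it, which is an assumption.

```scala
object ConnectorSketch {
  // Hypothetical stand-in for a table row: column name -> value.
  type Row = Map[String, String]

  // Hypothetical stand-in for the task class: wraps one text analytics algorithm.
  trait TextTask {
    // Analyze one text value. Each result holds one value per result column;
    // a text may produce zero, one, or many results.
    def analyze(text: String): List[List[String]]

    // Augment every input row with the result columns.
    // A row with several results is repeated once per result.
    // Assumption: a row with no results is dropped.
    def eval(
      rows: Iterator[Row],
      inputCol: String,
      resultCols: List[String]
    ): Iterator[Row] =
      rows.flatMap { row =>
        analyze(row.getOrElse(inputCol, "")).map { result =>
          row ++ resultCols.zip(result)
        }
      }
  }

  // Hypothetical stand-in for the service class: an id plus a task factory.
  trait TextService {
    val id: String
    def createObject(taskName: String): TextTask
  }

  // Toy algorithm: one result per whitespace-separated token.
  object WhitespaceTokenizer extends TextTask {
    override def analyze(text: String): List[List[String]] =
      text.split("\\s+").toList.filter(_.nonEmpty).map(tok => List(tok))
  }

  object ToyService extends TextService {
    override val id: String = "TOY"

    // Map the task name to the task object that implements it.
    override def createObject(taskName: String): TextTask =
      taskName.toUpperCase match {
        case "TOKENIZE" => WhitespaceTokenizer
        case other =>
          throw new IllegalArgumentException(s"Unknown task: $other")
      }
  }

  def main(args: Array[String]): Unit = {
    val task = ToyService.createObject("tokenize")
    val rows = Iterator(Map("id" -> "1", "body" -> "hello world"))
    // The single input row appears twice in the output, once per token.
    task.eval(rows, "body", List("token")).foreach(println)
  }
}
```

Because eval works on an iterator and returns an iterator, rows are processed lazily, one at a time, which matches the streaming description above.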
The Sclera - Apache OpenNLP Connector, included with the Sclera platform, is open source and implements the interface described above. Its Scala code also appears as an illustrative example in the Sclera Extensions (Scala) GitHub repository.
Packaging and Deploying the Connector
The implementation depends on:
- the library for the text analytics package used.
- the "sclera-config" core components. Note that these dependencies are annotated "provided" since these libraries will already be available in the CLASSPATH when this connector is run with Sclera.
- (optional) the test framework scalatest for running the tests.
These are specified in the build file. As an example, see the Sclera - Apache OpenNLP Connector's build file.
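As a rough illustration of how such dependencies might be declared, here is a sketch of an sbt build file. The project name, the "com.scleradb" organization, and the bracketed version placeholders are assumptions made for illustration only; the connector's actual build file is the authoritative reference for the correct coordinates and versions. Apache OpenNLP stands in for whichever text analytics library is being wrapped.

```scala
// build.sbt (sketch; replace every <...> placeholder with a real version)

name := "my-textanalytics-connector" // hypothetical project name

libraryDependencies ++= Seq(
  // The text analytics library being wrapped (Apache OpenNLP as an example).
  "org.apache.opennlp" % "opennlp-tools" % "<opennlp-version>",

  // Sclera component, marked "provided" because Sclera already has it
  // on the CLASSPATH at run time. The organization name is an assumption.
  "com.scleradb" %% "sclera-config" % "<sclera-version>" % "provided",

  // Optional: test framework, needed only when running the tests.
  "org.scalatest" %% "scalatest" % "<scalatest-version>" % "test"
)
```

Marking a dependency "provided" keeps it on the compile classpath but out of the packaged artifact, so the connector does not ship a second copy of libraries that Sclera already supplies.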
Follow steps similar to those described here.
Note: Please ensure that the identifier you assign to the connector is unique, that is, it does not conflict with the identifier of any other available