Using spaCy in RapidMiner

In this post, I will run through how RapidMiner (RM) users can integrate Python, and more specifically spaCy, into their processes. We will cover the steps necessary to run Python code within RM, walk through a basic example that gets the vector representation of a document, and finish with a worked example comparing two models that classify articles based on their headline text.

To follow along, I am using RapidMiner 9.9 with the following extensions:

  1. Deep Learning and ND4J (two separate extensions)
  2. Python Scripting
  3. Operator Toolbox

Setup

  1. I am using conda to manage my Python environments. More specifically, I am using miniconda, but that is a matter of preference.
  2. With conda installed, I created an environment called rapidminer
  3. I have a number of libraries installed (via pip), most notably pandas and spacy[full]. You can review the conda environment file as well as the pip packages here: https://github.com/Btibert3/rapidminer-examples
  4. With the packages installed, you will need to download the medium-sized English language model via python -m spacy download en_core_web_md. This part is important! The small model that ships with spaCy does not include word vectors (see the quick check after this list).
  5. Tell RapidMiner that you want to use the rapidminer conda environment by default. The Python Scripting operator will then be able to make use of the tooling within this environment via the script we pass in.
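To confirm the environment is wired up correctly before touching RapidMiner, you can run a quick check from the rapidminer environment; the sample sentence below is arbitrary:

import spacy

# sanity check: the medium model should load and include word vectors
nlp = spacy.load("en_core_web_md")
doc = nlp("college enrollment deadlines extended")
print(doc.has_vector)    # True -- the md model ships with vectors
print(doc.vector.shape)  # (300,)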

Basic Example

One note: I do not believe it is possible to pass a column name as a parameter to the operator, so your dataset must have a field called text.

The last operator executes a script, one constructed to comply with what RM needs:

import pandas as pd
import numpy as np
import spacy

# requires that the language model has already been installed:
# python -m spacy download en_core_web_md

# rm_main is a mandatory function
def rm_main(data):
    # load the nlp model
    nlp = spacy.load("en_core_web_md")
    docs = list(nlp.pipe(data.text))
    vectors = [doc.vector for doc in docs]
    vectors = np.array(vectors)
    vectors = pd.DataFrame(vectors)

    # will result in a single output port, and be an ExampleSet
    return vectors

The only real trick above is that we need to use the rm_main function to house our logic.

By running that simple process, you will get a 1x300 ExampleSet for the document. Under the hood, spaCy will parse the text and, using the pre-trained word vectors, create a document vector, which is simply the average across the word vectors.
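If you want to convince yourself of that averaging behavior, here is a minimal check you can run outside of RM; the sample text is arbitrary, and this should print True as long as the tokens are in the model's vocabulary:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("college admissions season")

# Doc.vector defaults to the mean of the individual token vectors
manual_avg = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_avg))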

Deep Learning with RapidMiner

Let’s step through this. First, the Build Dataset operator is a cached operator from Operator Toolbox. This allows us to cache the results of the subprocess, which is helpful as you build your process iteratively.

There is a lot going on above, and while this could certainly be refactored, I did want to highlight that RapidMiner includes the building blocks to visually “code” our solution.

  1. We are using the headlines dataset, which can be found here
  2. I am selecting just what I need to pass to spaCy via the Execute Python operator; the trick is that I am using Rename to ensure that the headline column is now called text.
  3. Once the documents are parsed via the Python operator, I am adding an arbitrary ID to the dataset for the join.
  4. I am using the full dataset, adding the same ID (I don’t believe we can row-align datasets in RM), and joining the data together. I am setting some roles for RapidMiner and ensuring that our target y is treated as binomial by RM. The benefit of keeping this logic in the cached operator is that we only have to load and pre-process the data via spaCy once. A rough pandas sketch of these steps follows below.
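For those who think in code, the equivalent of steps 2 through 4 in pandas might look like the following sketch; the file name headlines.csv and the raw column names are assumptions for illustration:

import pandas as pd
import spacy

# hypothetical stand-in for the headlines dataset
raw = pd.read_csv("headlines.csv")

# step 2: rename so the Execute Python operator sees a column called text
raw = raw.rename(columns={"headline": "text"})

# parse the documents and collect the 300-dim vectors
nlp = spacy.load("en_core_web_md")
vectors = pd.DataFrame([doc.vector for doc in nlp.pipe(raw["text"])])

# steps 3-4: add the same arbitrary id to both tables and join
vectors["id"] = range(len(vectors))
raw["id"] = range(len(raw))
dataset = raw.merge(vectors, on="id")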

Above is the output of that first process. We have our id column that was created for the join, our target y, and a simple hack of using the cluster role in RM to retain the site variable as a “special attribute.” From here, we have our document vectors for each record. Simply put, each document can be viewed as a 300-length vector. We will use these vectors as the inputs to predict the column y.

The dataset shown above is then split into train/test sets (70/30). The training and test splits are sent into the Build Models subprocess. The fit of the two models is shown below.

Above, I am using the Multiply operator to feed the same training and test sets to the kNN and Deep Learning models. The kNN model is a simple k=5 model using cosine similarity. The deep learning model is more involved and will be discussed in a moment, but once both models are trained, each is passed to an Apply Model operator and applied to the test set. The subprocess will have two outputs, one performance vector per model.
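Outside of RM, the same kNN setup can be sketched with scikit-learn; the document vectors and target below are random placeholders purely so the snippet runs end-to-end:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# placeholder 300-dim document vectors and a binomial target,
# standing in for the output of the cached Build Dataset subprocess
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))
y = rng.integers(0, 2, size=500)

# 70/30 train/test split, mirroring the RM process
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k=5 with cosine similarity, as configured in the kNN operator
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
print(f1_score(y_test, knn.predict(X_test)))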

Let’s dive into the Deep Learning operator, which is a subprocess.

The vector representation (300 attributes) will be fed into these layers. Above, all layers are Fully Connected Feed Forward layers, with the last operator representing the output layer.
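The layer stack itself lives in the Deep Learning extension (which builds on ND4J/DL4J), but for readers more comfortable in Python, a rough Keras analogue might look like the sketch below; the hidden layer sizes and activations are assumptions on my part, since only the 300-attribute input and the binomial output follow from the process:

from tensorflow import keras
from tensorflow.keras import layers

# hypothetical analogue of the fully connected feed-forward stack;
# hidden layer sizes and activations are assumed, not taken from the RM process
model = keras.Sequential([
    layers.Input(shape=(300,)),             # one input per document-vector attribute
    layers.Dense(64, activation="relu"),    # fully connected layer (size assumed)
    layers.Dense(32, activation="relu"),    # fully connected layer (size assumed)
    layers.Dense(1, activation="sigmoid"),  # output layer for the binomial target y
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])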

With the models constructed, I am simply converting the performance output to a “dataset,” exporting the metrics I asked for, and passing through the performance vector to see the Confusion Matrix and AUC outputs.
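As a point of reference, that “performance vector to dataset” idea has a simple analogue in plain Python via scikit-learn’s classification_report; the labels below are placeholders:

import pandas as pd
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 0]  # placeholder test labels
y_pred = [0, 1, 0, 0, 1, 1]  # placeholder model predictions

# flatten the per-class metrics into a DataFrame (the "dataset")
metrics = pd.DataFrame(classification_report(y_true, y_pred, output_dict=True)).T
print(metrics)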

When comparing the two models, the (admittedly arbitrary) construction of the Deep Learning model outperformed the simple kNN on F1 (.76 for the kNN vs. .81 for the Deep Learning model).

Conclusion

Finally, all of my work in this post can be found in my GitHub repo here: https://github.com/Btibert3/rapidminer-examples.

Future posts will explore alternative applications and use cases of RapidMiner, including RM as an ETL tool.
