Using spacy in RapidMiner

Brock Tibert
Apr 19, 2021

In this post, I will run through how RapidMiner (RM) users can integrate python, and more specifically, spacy, into their RM processes. We will cover the steps necessary to run python code within RM, review a basic example to get the vector representation of a document, and finally, a worked example that will compare two models that classify articles based on their headline text.

To follow along, I am using RapidMiner 9.9 with the following extensions:

  1. Deep Learning and ND4J
  2. Python Scripting
  3. Operator Toolbox

Setup

While there are other text-specific extensions that we could use within RM, namely Text Processing and Word2Vec, the python scripting operator allows us to pull in advanced NLP toolkits, including spacy. The process to get up and running is relatively painless, but it does require a few steps.

Below, I will outline my approach.

  1. I am using conda to manage my python environments. More specifically, I am using miniconda, but that is just a preference.
  2. With conda installed, I created an environment called rapidminer.
  3. I installed a number of libraries (via pip), most importantly pandas and spacy[full]. You can review the conda environment file as well as the pip packages here: https://github.com/Btibert3/rapidminer-examples
  4. After the packages are installed, you will need to download the medium-sized English language model via python -m spacy download en_core_web_md.

This part is important! The small English model (en_core_web_sm) does not include word vectors!

  5. Tell RapidMiner that you want to use the rapidminer conda environment by default. The Execute Python operator will then be able to make use of the tooling within this environment via the script we pass in (the terminal side of this setup is sketched below).
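For reference, a minimal sketch of the terminal side of that setup (the environment name rapidminer and the python version are my choices, not requirements):

conda create -n rapidminer python=3.8
conda activate rapidminer
pip install pandas "spacy[full]"
python -m spacy download en_core_web_md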

Basic Example

With the setup in place, we can now use spacy within our RapidMiner process! To just show the basics, we will create a simple document, convert it to a dataset (RM calls this an ExampleSet), and then pass this through the python operator which will parse the document and return the document vector (length 300).

One note: I do not believe it is possible to pass a column name as a parameter to the operator, so your dataset must have a field called text.
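If your data uses a different column name, you can rename it upstream with RM's Rename operator (which I do later in this post), or handle it at the top of the script; a one-line sketch, assuming a hypothetical column named headline:

# rename a hypothetical headline column to the required text field
data = data.rename(columns={"headline": "text"})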

The last operator executes a script, one constructed to comply with what RM needs:

import pandas as pd
import spacy
import numpy as np

# requires that the language model has already been installed:
# python -m spacy download en_core_web_md

# rm_main is a mandatory function; RM passes the input ExampleSet in as a DataFrame
def rm_main(data):
    # load the nlp model
    nlp = spacy.load("en_core_web_md")

    # parse the documents and collect each 300-length document vector
    docs = list(nlp.pipe(data.text))
    vectors = [doc.vector for doc in docs]
    vectors = np.array(vectors)
    vectors = pd.DataFrame(vectors)

    # will result in a single output port, and be an exampleset
    return vectors
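Because rm_main is just a Python function, you can sanity-check the script outside of RapidMiner before wiring it into the process; a quick test with a made-up document:

# quick local check, outside of RM
df = pd.DataFrame({"text": ["RapidMiner meets spacy."]})
print(rm_main(df).shape)  # expect (1, 300)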

The only real trick above is that we need to use the rm_main function to house our logic.

By running that simple process, you will get a 1x300 ExampleSet for the document. Under the hood, spacy parses the text and, using the pre-trained word vectors, creates a document vector that is simply the average of the word vectors.
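You can verify that averaging behavior directly; a small sketch (the sentence is arbitrary):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("RapidMiner can parse headlines with spacy.")

# doc.vector defaults to the average of the token vectors
manual = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual))  # True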

A Deep Learning Example with RapidMiner

With that basic example grounding the mechanics of how we can pull in the powerful spacy toolkit, we will now run through a more practical example.

Let’s step through this. First, the Build Dataset operator is a (renamed) Cache operator from Operator Toolbox. This allows us to cache the results of its subprocess, which is helpful as you build your process iteratively.

There is a lot going on above, and while this could certainly be refactored, I did want to highlight that RapidMiner includes building blocks to visually “code” our solution.

  1. We are using the headlines dataset, which can be found here.
  2. I am selecting just what needs to be passed to spacy via the Execute Python operator. The trick is that I am using Rename to ensure that the headline column is now called text.
  3. Once the documents are parsed via the python operator, I add an arbitrary ID to the dataset for the join.
  4. I take the full dataset, add the same ID (I don’t believe we can row-align datasets in RM), and join the data together. I set some roles for RapidMiner and ensure that our target y is treated as binomial by RM. The benefit of keeping this logic in the Cache operator is that we only have to load and pre-process the data via spacy once. A rough pandas equivalent of this flow is sketched after this list.
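For readers who think in code, that flow maps roughly onto pandas (the file name and the headline, y, and site columns are assumptions about the dataset, and rm_main is the script from earlier):

import pandas as pd

raw = pd.read_csv("headlines.csv")  # hypothetical file and column names
to_nlp = raw[["headline"]].rename(columns={"headline": "text"})
vectors = rm_main(to_nlp)           # one 300-length vector per headline
vectors["id"] = range(len(vectors))
raw["id"] = range(len(raw))
full = raw.merge(vectors, on="id")  # row-align via the arbitrary ID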

The above highlights the output of that first process. We have the id column that was created for the join, our target y, and a simple hack of using the cluster role in RM to retain the site variable as a “special attribute.” From there, we have our document vectors for each headline; simply put, each document can be viewed as a 300-length vector. We will use these vectors as the inputs to predict the column y.

The dataset I show above is then split into train/test sets (70/30). The training and test splits are sent into the Build Models subprocess. Below represents the fit of the two models.

Above, I am using the Multiply operator in order to feed the same training and test sets to the kNN and Deep Learning models. The kNN model is a simple k=5 model using cosine similarity. The deep learning model is more involved and will be discussed in a moment, but once both models are trained, each is passed to an Apply Model operator and applied to the test set. The subprocess has two outputs, one performance vector per model.
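For reference, the kNN side of that comparison maps closely onto scikit-learn; a sketch of an equivalent baseline (not the RM operator itself), assuming full is the joined dataset from the earlier sketch and y is encoded 0/1:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

X = full[list(range(300))]  # the 300 document-vector columns
y = full["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")  # k=5, cosine similarity
knn.fit(X_train, y_train)
print(f1_score(y_test, knn.predict(X_test)))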

Let’s dive into the Deep Learning operator, which is a subprocess.

The vector representation (300 attributes) will be fed into these layers. Above, all layers are Fully Connected Feed Forward layers, with the last operator representing the output layer.
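To make that architecture concrete, here is a rough Keras sketch of a comparable network (the hidden-layer sizes and activations are my assumptions; the post itself uses RM's Deep Learning extension, which builds on ND4J rather than Keras):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(300,)),                    # the document vector
    tf.keras.layers.Dense(64, activation="relu"),    # fully connected feed-forward
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer for the binomial y
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])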

With the models constructed, I simply convert each performance output to a “dataset” and export the metrics I asked for, as well as pass through the performance vectors to see the Confusion Matrix and AUC outputs.

When comparing the two models, the (admittedly arbitrary) construction of the Deep Learning model outperformed the simple kNN on F1 (.81 vs .76).

Conclusion

Hopefully the above highlights what is possible in RapidMiner when you go beyond the “basic” operators. I (mostly) enjoy using RapidMiner to teach my IS841 course. The visual programming interface can be expressive (if you know where to look) and helps form the foundation of coding via “lego pieces.” Moreover, via the extensions marketplace, we can leverage the power of python and advanced toolkits like spacy.

Finally, all of my work in this post can be found in my github repo here: https://github.com/Btibert3/rapidminer-examples.

Future posts will explore alternative applications and use-cases of RapidMiner, including RM as an ETL tool.
