In this post, I will walk through how RapidMiner (RM) users can integrate Python, and more specifically spaCy, into their processes. We will cover the steps necessary to run Python code within RM, look at a basic example that gets the vector representation of a document, and finish with a worked example that compares two models that classify articles based on their headline text.

To follow along, I am using RapidMiner 9.9 with the following extensions:

  1. Deep Learning and ND4J (two separate extensions)
  2. Python Scripting
  3. Operator Toolbox


While there are other text-specific extensions that we could use within RM, namely Text Processing…

In this short post, I want to share a quick proof-of-concept that shows how simple it can be to get data into, and out of, Salesforce. In addition, I want to discuss how an organization can leverage machine learning on the data that it is already capturing within its CRM.

The goals of this post are to:

  1. Discuss the basics (and benefits) of using Airflow for data integration with Salesforce
  2. Highlight how easy it can be to implement custom machine learning models on top of data within a CRM

If you are familiar with the Salesforce ecosystem, you might be…

A first-pass at uncovering shooter embeddings for skaters in the 2019–20 NHL season.


  • A discussion of using the RStudio IDE and R Markdown/knitr to write a post that demonstrates how to achieve the same result in both R and Python, side-by-side. Yes, this is possible, given the amazing reticulate package alongside knitr.
  • Use TensorFlow and Keras to estimate shooter embeddings based on shot location/type data for NHL skaters. This work is only superficial, but demonstrates a pathway to what might be possible with deep learning and play-by-play data.
  • Demonstrate how we can easily inject Tableau into our exploratory data analysis work via the excellent Python library pantab, as well as my simple port of that…

TL;DR: The setup is generic, but this tutorial is aimed at readers who are interested in moving their data analysis off their laptop and into the cloud. With a few clicks on DigitalOcean and a handful of commands at the terminal, you can have RStudio Server, Jupyter Notebook with multiple kernels, and a Neo4j database.

More database backends to come.

Ben Hammer recently wrote on Quora about reproducibility. I encourage you to read his piece, as it is central to the rationale behind this post. His insights are below:

Repeatable actions allow us to save time and talk about…

Brock Tibert

Lecturer, Data Scientist, and Product/Analytics Consultant with a focus on higher ed and Strategic Enrollment Management. I teach a number of R and Python courses.
