Apply Scikit-Learn Models to Your Salesforce Data via Apache Airflow

Brock Tibert
6 min read · Apr 13, 2021

In this post, I want to walk through a quick proof-of-concept that shows how simple it can be to get data into, and out of, Salesforce. In addition, I want to discuss how an organization can leverage machine learning on the data it is already capturing within its CRM.

The goals of this post are to:

  1. Discuss the basics (and benefits) of using Airflow for data integration with Salesforce
  2. Highlight how easy it can be to implement custom machine learning models on top of data within a CRM

If you are familiar with the Salesforce ecosystem, you might be asking “Why not use Einstein?” You could, and to be fair, Salesforce does allow some customization when building predictive models. In addition, those models can be applied directly on top of the data in the CRM.

In my approach, I am suggesting a flow that extracts new records, applies a classifier, and then updates the records back in Salesforce. Of course, there are always tradeoffs, but in my proposal, I believe the benefits of ETL’ing the data out of a Salesforce instance far outweigh the drawbacks. We have far more control over the modeling process and we are not limited to the models supported by Salesforce.

Business Problem

Let’s frame this as a business problem.

Suppose that you use the Case object to collect and process customer support requests. Classifying cases in order to route each ticket to the appropriate team has proven to be time-consuming and error-prone. You want to employ machine learning to predict the intent of each new Case record. This process does not need to be real time, but tickets should be classified within 5 minutes of hitting Salesforce.

This use-case generalizes well beyond support tickets. For example, you could:

  1. Score new marketing leads in your CRM
  2. Calculate the probability of a Closed Won Opportunity
  3. Predict the likelihood of an inquiry enrolling at your institution

In the examples above, I am setting aside real-time scoring within Salesforce and instead suggesting that we can leverage tools like Airflow and scikit-learn to apply domain-specific machine learning models to data within the CRM. The end result is a robust, scalable machine learning pipeline.

Process Flow

Before I discuss Apache Airflow, let’s review how one could implement a machine learning process on top of their data in Salesforce.

  1. Data are extracted from Salesforce to train a model. This allows analysts to experiment offline, assess risk (e.g. precision/recall tradeoffs), and persist the final models on disk via joblib.
  2. With the model in place, configure an Airflow DAG (essentially a structured Python script) that extracts new records, scores them, and loads the results back into Salesforce.
  3. Schedule the Airflow DAG to run at whatever interval the business requires. In the use-case above, the data are not real-time, but with Airflow we could run this DAG throughout the day, say, every 5 minutes.
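Step 1 above might look like the sketch below, which trains a simple text classifier and persists it with joblib. The sample texts mimic the ATIS-style requests used later in the post; the training data, labels, and file name are illustrative assumptions, not the author's actual pipeline:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-ins for fields exported from Salesforce, e.g. Case.Description
# plus a human-assigned intent label on historical records.
texts = [
    "I want to book a flight to Boston",
    "What is the cheapest fare to Denver",
    "Show me ground transportation in Atlanta",
    "How do I get from the airport to downtown",
]
labels = ["flight", "airfare", "ground_service", "ground_service"]

# A single pipeline keeps the vectorizer and classifier together,
# so the scoring job only has to load one artifact.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Persist the fitted pipeline so the Airflow task can load it later.
joblib.dump(model, "case_intent_model.joblib")
```

Persisting the whole Pipeline (rather than the classifier alone) means the exact same preprocessing is applied at scoring time.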

To simplify, the process above simply looks for new records of interest (e.g. un-scored Case records) and, if any are found, applies the custom model to the data and ships the results back to Salesforce.
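Finding those new records can be a single SOQL query. Below is a sketch using the simple-salesforce library; the custom field name (`Predicted_Intent__c`) is an assumption for illustration, not a standard Salesforce field:

```python
def unscored_cases_query(limit=200):
    """Build the SOQL to fetch new Cases that have no predicted intent yet."""
    return (
        "SELECT Id, Subject, Description "
        "FROM Case "
        "WHERE Predicted_Intent__c = null "
        f"LIMIT {limit}"
    )


def fetch_unscored_cases(sf):
    # `sf` is an authenticated simple_salesforce.Salesforce connection, e.g.
    #   sf = Salesforce(username=..., password=..., security_token=...)
    # query_all pages through results and returns a dict with a "records" key.
    return sf.query_all(unscored_cases_query())["records"]
```

Filtering on a null prediction field means the job is idempotent: re-running it only ever touches records that have not been scored yet.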

Apache Airflow

Airflow is a Python-based workflow orchestration tool born out of work at Airbnb. It allows anyone who can write a few lines of Python to construct ETL pipelines. Airflow organizes work into DAGs (directed acyclic graphs): tasks are connected in a graph structure that lets us express dependencies between them and monitor the status of each one. What’s more, Airflow ships with a great web UI for monitoring the state of the system, or of a given task.
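A minimal DAG for this use-case might look like the following configuration sketch (Airflow 2.x style). The task names, schedule, and function bodies are illustrative assumptions rather than the author's actual code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_score_load():
    # 1. Pull un-scored Cases from Salesforce
    # 2. Apply the persisted scikit-learn model (joblib.load)
    # 3. Push predictions back via the Bulk API
    ...


def notify():
    # e.g. log a run summary for downstream monitoring
    ...


with DAG(
    dag_id="case_sfdc",
    start_date=datetime(2021, 4, 1),
    schedule_interval=timedelta(minutes=5),  # matches the 5-minute SLA above
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="extract_score_load",
                         python_callable=extract_score_load)
    done = PythonOperator(task_id="notify", python_callable=notify)

    etl >> done  # notify only runs after the ETL task succeeds
```

The `>>` operator is how Airflow expresses the dependency edges of the graph; failures in `extract_score_load` stop the downstream task and surface in the web UI.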

For a great introduction, please refer to this article: https://towardsdatascience.com/a-complete-introduction-to-apache-airflow-b7e238a33df

Proof-of-Concept

Let’s say we have some new Case records that were created via Email-to-Case.

The list view above shows these new cases. For now, we can ignore the Actual Intent column: the data shown are seeded with the known, or true, customer service request. If you are curious, I am using the ATIS dataset from Kaggle to build the POC classifier.

Because this is a demo, I don’t need to schedule my DAG workflow; Airflow is great because I can simply trigger it to run by hand.

Above, my DAG is called case_sfdc. I can trigger this process to run by simply clicking the play icon.

We can monitor the process of our jobs, as shown below.

Above you can see that I have a very simple DAG: just two linear tasks. The main ETL task collects the data from Salesforce, scores any new records it finds, and then updates those records back in Salesforce using the Bulk API. Behind the scenes, Airflow also logs data about each job, which would help us investigate any issues that pop up if we were actually running scheduled jobs in production.
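The update step can batch the predictions back through the Bulk API. The sketch below shows the shape of the payload; the field name is an illustrative assumption, and `sf.bulk.Case.update` is the simple-salesforce wrapper around Salesforce's Bulk API:

```python
def build_update_payload(case_ids, predictions):
    """Pair each Case Id with its predicted intent for a bulk update."""
    return [
        {"Id": cid, "Predicted_Intent__c": label}
        for cid, label in zip(case_ids, predictions)
    ]


def push_predictions(sf, case_ids, predictions):
    # `sf` is an authenticated simple_salesforce.Salesforce connection.
    # bulk.Case.update sends the records through the Bulk API in batches,
    # which is far cheaper than one REST call per record.
    return sf.bulk.Case.update(build_update_payload(case_ids, predictions))
```

Because the payload only carries the Id and the fields being changed, the update leaves every other field on the Case untouched.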

Within Salesforce, that same list view showing new records indicates that the records have been processed.

In my case, I would also create a list view to monitor recently scored cases. While I am only showing list views here, it’s not hard to imagine using other Salesforce tools like Flow Builder to take these newly scored records and process them per additional business requirements.

In the picture above, the machine learning process also sends over some other data for each scored record, namely the confidence of the prediction. This is important, as the business might require a human in the loop when the predicted confidence falls below a certain threshold.
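That confidence value can come straight from the classifier's predicted probabilities. A minimal sketch, assuming a scikit-learn model that supports `predict_proba` and an illustrative 0.70 review threshold:

```python
import numpy as np


def score_with_confidence(model, texts, threshold=0.70):
    """Return label, confidence, and a human-review flag for each text."""
    probs = model.predict_proba(texts)   # shape: (n_texts, n_classes)
    best = probs.argmax(axis=1)          # index of the top class per row
    results = []
    for i, idx in enumerate(best):
        confidence = float(probs[i, idx])
        results.append({
            "label": model.classes_[idx],
            "confidence": confidence,
            # Low-confidence predictions get routed to a person.
            "needs_review": confidence < threshold,
        })
    return results
```

Both the label and the confidence can then be written back to custom fields on the Case, so a Flow or list view can filter on the review flag.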

From here, if you are familiar with Salesforce, you can easily extend this process to monitor the data via Reports and Dashboards, allowing end-users to recognize when there may be issues with the model (e.g. decay) or an even larger problem such as a broken process.

Summary

I often hear from colleagues and clients that data integration with Salesforce can be painful. Yes, the process above does assume that you are comfortable with Python. However, given the large price tags of the “fancy” ETL tools in the Salesforce ecosystem, I really do believe organizations should consider Airflow for their SFDC data integration needs. Airflow is relatively simple to spin up, and building the workflow DAG is a simple modification to an analyst’s existing Python script.

Moreover, while I do find Einstein interesting as a product person, the data scientist in me wants to own the entire machine learning process for my company. The example above, while extremely basic, can help organizations get up and running rather quickly while also enabling a much richer ecosystem of tools than those currently supported in Salesforce’s Einstein modeling product.


Brock Tibert

IS Faculty (Questrom) — Embedded Data/Product/Analytics Advisor — @brocktibert