Trying Out Machine Learning on Elasticsearch using Python

Calvin Ku
5 min read · Mar 29, 2021

Recently I’ve been doing some research on Elasticsearch to see if it suits my needs at work. The thing I care about most is its Python support. If you do data science, there are three Python libraries for Elasticsearch you might want to look into: elasticsearch-py, elasticsearch-dsl-py, and eland.

elasticsearch-py provides low-level APIs with which you can do most of what you need to do with Elasticsearch. elasticsearch-dsl-py is a higher-level client library that is more Pythonic and sits on top of elasticsearch-py. To see the difference, here’s some code straight from their GitHub:

elasticsearch-py vs elasticsearch-dsl-py

Below is a search request using elasticsearch-py:

from elasticsearch import Elasticsearch
client = Elasticsearch()

response = client.search(
    index="my-index",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"title": "python"}}],
                "must_not": [{"match": {"description": "beta"}}],
                "filter": [{"term": {"category": "search"}}]
            }
        },
        "aggs": {
            "per_tag": {
                "terms": {"field": "tags"},
                "aggs": {
                    "max_lines": {"max": {"field": "lines"}}
                }
            }
        }
    }
)

for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])

for tag in response['aggregations']['per_tag']['buckets']:
    print(tag['key'], tag['max_lines']['value'])

In comparison, here’s code using elasticsearch-dsl-py:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()

s = Search(using=client, index="my-index") \
    .filter("term", category="search") \
    .query("match", title="python") \
    .exclude("match", description="beta")

s.aggs.bucket('per_tag', 'terms', field='tags') \
    .metric('max_lines', 'max', field='lines')

response = s.execute()

for hit in response:
    print(hit.meta.score, hit.title)

for tag in response.aggregations.per_tag.buckets:
    print(tag.key, tag.max_lines.value)

I guess it’s quite obvious which to go for if you just need to do some simple queries. :)
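
To make the comparison a bit more concrete: both snippets end up describing the same kind of JSON request body, and the DSL mostly saves you from nesting dicts by hand. Here’s a rough plain-Python sketch of that idea. Note that `bool_query` is a name I made up for illustration; it isn’t part of elasticsearch-py or elasticsearch-dsl-py.

```python
def bool_query(must=None, must_not=None, filter=None):
    """Assemble an Elasticsearch bool query body as a plain dict.

    Hypothetical helper for illustration only; not part of
    elasticsearch-py or elasticsearch-dsl-py.
    """
    clauses = {"must": must, "must_not": must_not, "filter": filter}
    # Keep only the clauses that were actually supplied.
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


# Rebuild the query part of the first example.
body = bool_query(
    must=[{"match": {"title": "python"}}],
    must_not=[{"match": {"description": "beta"}}],
    filter=[{"term": {"category": "search"}}],
)
print(body)
```

The resulting dict has the same shape as the "query" part of the body passed to client.search in the first example.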

Machine Learning with eland…not so much

eland isn’t a full-fledged machine learning library, that’s for sure. As far as I know, what it does so far is:

  1. Lets you manipulate data in Elasticsearch as if you were dealing with a pandas dataframe, so you can do data preprocessing on Elasticsearch instead of doing it in your RAM.
  2. Lets you deploy your scikit-learn models to Elasticsearch and run inference there.

And that’s that. If you’re not too disappointed so far, read on. :)

I’ve used the toy examples on their website to test out some of the features on my own with some modifications.

Just so you know, before I started trying out Elasticsearch I was hoping to do some anomaly detection with it, so I started with some generated data just for this purpose.

Constructing Toy Dataset

from datetime import datetime

import eland as ed
from eland.conftest import *
from eland.ml import MLModel
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm

sample_size = 100000
outlier_fraction = 0.01
outlier_set_size = int(sample_size * outlier_fraction)
inlier_set_size = sample_size - outlier_set_size

random_gen = np.random.RandomState(1113)

# Two interleaving half-moons serve as the "normal" data
X_raw, y_raw = make_moons(n_samples=sample_size, shuffle=True, noise=outlier_fraction, random_state=None)
# Shift and stretch the moons
X_normed = 4 * (X_raw - np.array([0.5, 0.25]))
# Uniform noise over the square [-6, 6) x [-6, 6) plays the role of outliers
X_noise = random_gen.uniform(low=-6, high=6, size=(outlier_set_size, 2))
X = np.concatenate([X_normed, X_noise])

df = pd.DataFrame(X)
df.columns = ['feature_1', 'feature_2']
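
One thing worth double-checking is how many points actually end up in the dataset. Since make_moons is asked for sample_size points and the uniform noise is appended on top, a quick back-of-the-envelope check in plain Python (only the constants from above, no new data) looks like this:

```python
sample_size = 100000
outlier_fraction = 0.01
outlier_set_size = int(sample_size * outlier_fraction)

# make_moons(n_samples=sample_size) yields sample_size rows,
# and the uniform noise adds outlier_set_size more rows on top.
total_rows = sample_size + outlier_set_size
noise_share = outlier_set_size / total_rows

print(total_rows)             # 101000
print(round(noise_share, 4))  # 0.0099
```

So the noise makes up slightly less than 1% of the rows, which is close enough to the contamination value used for the Isolation Forest later on.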

Let’s visualize the data:

plt.figure(figsize=(16, 9))
plt.scatter(X[:, 0], X[:, 1])

Training Model

I chose a simple Isolation Forest just to see what happens.

anomaly_algo = IsolationForest(n_estimators=100,
                               contamination=outlier_fraction,
                               random_state=1113)
y_iso = anomaly_algo.fit(X).predict(X)

And here’s the result:

plt.figure(figsize=(16, 9))
plt.scatter(X[:, 0], X[:, 1], c=y_iso)

At this point I had really high hopes to see how inferencing would work on Elasticsearch, but unfortunately it didn’t go so well…

Connecting and Deploying Model to Elasticsearch

I couldn’t get the deployment to work on my local machine. If you try to do this you’ll get a 403 error saying your license doesn’t cover what you’re trying to do. I’m guessing it’s either a settings issue or that I didn’t have X-Pack installed. Either way, I decided it was too much trouble, so I signed up for an Elastic Cloud 14-day trial account and just did it there.

Here’s the minimum code to get it to work (…or not work).

es = Elasticsearch(cloud_id='<your_cloud_id>',
                   http_auth=("<your_username>", "<your_password>"))
es_model = MLModel.import_model(es_client=es,
                                model=anomaly_algo,
                                model_id='model',
                                feature_names=df.columns
                                )

If you run this you’ll get this error:

NotImplementedError: Importing ML models of type <class 'sklearn.ensemble._iforest.IsolationForest'>, not currently implemented

Too bad. It turned out that Elasticsearch only supports a handful of tree-based supervised learning models. I’ll list them here to save you some time:

  • DecisionTreeClassifier
  • DecisionTreeRegressor
  • RandomForestRegressor
  • RandomForestClassifier
  • XGBClassifier
  • XGBRegressor
  • LGBMRegressor
  • LGBMClassifier
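
If you’d rather fail fast than run into the NotImplementedError, a small guard like the one below might help. It’s only a sketch: the set of names is copied from the list above and may well be out of date for your eland version, `is_importable` is a helper name I made up, and the two empty classes are stand-ins so the snippet runs without scikit-learn installed.

```python
# Class names copied from the list above (an assumption, not an official API).
SUPPORTED_MODEL_NAMES = {
    "DecisionTreeClassifier",
    "DecisionTreeRegressor",
    "RandomForestRegressor",
    "RandomForestClassifier",
    "XGBClassifier",
    "XGBRegressor",
    "LGBMRegressor",
    "LGBMClassifier",
}


def is_importable(model):
    """Return True if the model's class name appears in the list above."""
    return type(model).__name__ in SUPPORTED_MODEL_NAMES


# Stand-in classes so the sketch runs on its own; with real code you would
# pass the actual scikit-learn, XGBoost, or LightGBM model instance.
class IsolationForest:
    pass


class DecisionTreeClassifier:
    pass


print(is_importable(IsolationForest()))         # False
print(is_importable(DecisionTreeClassifier()))  # True
```

Matching on the class name is crude (any class with the same name would pass), but it keeps the check free of extra imports.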

So after some googling I found out about this sad truth and decided to at least try out one algorithm just as a proof of concept.

Supervised Learning with Elasticsearch

Our dataset isn’t labeled, but we can make use of the Isolation Forest we set up just above. First we assign its predictions to each row as pseudo-labels, then change the outlier values from -1 to 0.

df['label'] = y_iso
df.loc[df['label'] != 1, 'label'] = 0
df.head()
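
In case the relabeling looks cryptic: IsolationForest.predict returns 1 for inliers and -1 for outliers, so the df.loc line simply turns every non-1 value into 0. Here’s the same rule on a tiny made-up list (the values in `y_iso_sample` are invented for illustration, not taken from the real predictions):

```python
# Made-up stand-in for y_iso: 1 means inlier, -1 means outlier.
y_iso_sample = [1, 1, -1, 1, -1]

# Same rule as the df.loc line: anything that isn't 1 becomes 0.
labels = [1 if y == 1 else 0 for y in y_iso_sample]

print(labels)  # [1, 1, 0, 1, 0]
```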

And then we split the data:

X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ['feature_1', 'feature_2']],
                                                    df['label'],
                                                    test_size=0.33,
                                                    shuffle=True,
                                                    random_state=42)

At this stage we can build a simple decision tree model:

sl_model = DecisionTreeClassifier()
sl_model.fit(X_train, y_train)

Now comes the key part. With just one line of code, we can easily deploy our model to Elasticsearch:

es_model = MLModel.import_model(es_client=es,
                                model=sl_model,
                                model_id='model001',
                                # Only the two feature columns; df.columns now also contains 'label'
                                feature_names=list(X_train.columns)
                                )

Now let’s try running inference on Elasticsearch:

y_pred = es_model.predict(X_test[:500].values)

plt.figure(figsize=(16, 9))
plt.scatter(x=X_test.iloc[:500, 0], y=X_test.iloc[:500, 1], c=y_pred[:500])
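
A plot is nice, but a number is easier to trust. One possible sanity check is to compare what Elasticsearch returns with what the local model predicts for the same rows. Below is a minimal sketch of such a comparison; `agreement_rate` is a hypothetical helper of mine, and the two short lists are made up so the snippet runs without a cluster.

```python
def agreement_rate(local_preds, remote_preds):
    """Fraction of positions where the two prediction sequences agree."""
    if len(local_preds) != len(remote_preds):
        raise ValueError("prediction sequences must have the same length")
    if len(local_preds) == 0:
        return 1.0
    matches = sum(1 for a, b in zip(local_preds, remote_preds) if a == b)
    return matches / len(local_preds)


# Made-up predictions standing in for sl_model.predict(...) and es_model.predict(...).
local = [1, 1, 0, 1]
remote = [1, 1, 0, 0]
print(agreement_rate(local, remote))  # 0.75
```

With the real objects you would pass sl_model.predict(X_test[:500]) and y_pred instead of the toy lists.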

Key Takeaways

  1. Scikit-learn models can be deployed to Elasticsearch with just a one-liner
  2. Only tree-based models are supported for the moment
  3. Unsupervised learning models are not supported

What you can’t do with Elasticsearch

  1. A pandas dataframe can be converted to an eland dataframe using ed.pandas_to_eland, but an eland dataframe can only be used for data processing. You can’t train a model on it, either with scikit-learn or with any Elasticsearch API, so after processing the data with eland you’ll need to convert the dataframe back to pandas using to_pandas.
  2. You can’t train models on Elasticsearch
  3. You can’t do inference for data stored in Elasticsearch
  4. There’s a limit to how much data you can run inference on at a time. Anything too big will result in a timeout.
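
For the last point, one workaround that might help is to send the data in smaller chunks. Here’s a minimal sketch, assuming the predict call accepts a slice of rows and returns one prediction per row. `predict_in_batches` and `fake_predict` are names I made up, and the batch size of 500 is just the size that worked for me above, not a documented limit.

```python
def predict_in_batches(predict_fn, rows, batch_size=500):
    """Call predict_fn on consecutive slices of rows and join the results."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        predictions.extend(predict_fn(batch))
    return predictions


# Stand-in for es_model.predict so the sketch runs without a cluster.
def fake_predict(batch):
    return [1 if x >= 0 else 0 for x, _ in batch]


rows = [(-1.0, 0.5), (2.0, 0.1), (0.3, -0.7), (-4.2, 1.1), (1.5, 1.5)]
print(predict_in_batches(fake_predict, rows, batch_size=2))  # [0, 1, 1, 0, 1]
```

With the real model you would pass es_model.predict and X_test.values instead of the stand-ins.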
