Machine Learning for Everyday Tasks

Machine learning is often thought to be too complicated for everyday development tasks. We often associate it with things like big data, data mining, data science, and artificial intelligence. Sometimes it feels something like this:

artificial intelligence
Machine Learning is hard

I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.

Real life example

At Mailgun, we work with e-mail and as part of our offering, we parse HTML quotations. This allows a user to grab the latest reply instead of the entire conversation, which is returned as part of our webhook response. You can read more about how we handle inbound message processing in our documentation

For those of you who don't know, here's what parsing HTML from the public Internet looks like:

driving offroad
Parsing HTML from public internet

It's messy and sometimes processes get stuck.

Changing the parsing library can help, but it won't solve the issue completely because every library has its limitations. You have to restrict the parsing to something reasonable.

Cluster analysis to the rescue

But what should the criteria and threshold be? Should we limit by HTML length or tag count? Maybe both? Maybe by something else? The objective obviously is to process as many messages as possible without shooting yourself in the foot, but the path isn't super obvious.

That's where cluster analysis and statistical classification become handy. For the research I was using R, but depending on your task and personal preferences you might use something else - scikit-learn, Weka, MOA, etc.

Collect a dataset

First, we logged HTML length, tags count, message processing time and put them into a csv file:

htmllen,tagscount,took  
2893762,85527,34.300139904  
31378,518,0.0368919372559  
19105,413,0.0545339584351  
...

The vast majority of messages take fractions of a second to process. So, when collecting the dataset, we had to make sure that we have enough "slow" messages.

We ended up with two csv files, collected on different days. One had 13831 lines and was reserved for analysis and model-training (the train dataset). Another had 12149 lines and was reserved for model validation (the validation dataset).

Generally you want to have at least two datasets - one for training and one for validation. Otherwise you might run into overfitting problem, when your model is well adjusted to the train data but fails in the real world.

Determine the number of clusters

To visualize the data and look for patterns k-means clustering was first tried:

messages

datapoints proximity within clusters for different number of clusters

As you can see there is a significant performance improvement up to 4 clusters. After that there is no real boost.

Clustering

The next step was to figure out how the data points get distributed between the clusters:

## K-Means Clustering with 4 clusters
fit 
And you can somewhat anticipate the problem already: the clusters form by nipping off the datapoints that are far away, while the interval we're interested in (1-20 sec) is in the very midst. The issue persists with increasing the number of clusters.
Moreover there is a significant overlap between the clusters in the interval:
## get clusters mean, min, max
mean 
Compare clusters number 1 and 4. The issue persists with increasing the number of clusters.
At this point, we decided to try a different approach and look at the percentiles for message processing time:
percentiles 
As you can see, after the 78th percentile the processing time quickly bubbles up.  Here was our first threshold - 78th percentile that corresponded to 6.5 seconds.
All datapoints that took less than 6.5 sec were marked as “fast” and others as “slow”.
Classification
For classification we tried SVM (Support Vector Machines), Random Forests and CART (Classification And Regression Tree).
CART showed slightly better results for this task but its main advantage is that it gives you a decision tree that is easy to understand, explain and implement vs SVM or Random Forests that work like a black box and require using heavy ML libraries in production.
Here's how you classify using CART:
x Here's the decision tree:
And that's how the implementation looks like:
def html_too_big(s):  
    return s.count(' _MAX_TAGS_COUNT
Isn't it beautiful? All the research complexity in a single line!
Validation
For evaluation we took the validation dataset and tried to predict the classification using our model:
validate Here's the confusion matrix and some common classification metrics:
Lessons Learned
Machine learning is not just for data scientists. Even simple decisions powered by ML can benefit you. Know your data. Do not rely blindly on scientific algorithms and models.
Useful links

Happy machine learning and data mining!

Sergey Obukhov

Software Developer

Machine Learning for Everyday Tasks

Real life example

Cluster analysis to the rescue

Collect a dataset

Determine the number of clusters

Clustering

Classification

Validation

Lessons Learned

Useful links

Sergey Obukhov

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112