Jonah's AI Lab - Machine Learning for Anomaly Detection
This is the first in a series of Jonah Lab articles. Jonah Group offers
free technology labs that provide 4-6 weeks of consultant time to solve
your business problem with leading-edge AI/machine learning, blockchain,
or big data technologies. If you are interested in participating,
contact us to let us know!
Business technology systems are frequently complex and involve many
sources of data. As system complexity increases, errors in data
sources become increasingly difficult to identify. Anomaly detection
is an important tool for quickly identifying and resolving potential
problems before they cause significant damage and cost to the business.
Anomaly detection matters most for critical services, where potential
problems must be fixed quickly to avoid extended damage to the
business. The difficulty lies in defining what an anomaly is in the
business context. Traditional methods, such as rules-based anomaly
detection, suffer from having to be prescriptive: you have to define the
rules in advance for infinitely many unforeseen events or variations in
the data. This makes anomaly detection an ideal problem for machine
learning (ML), because ML can learn directly from the relevant data set
and can keep improving as it adapts to real-time patterns.
Anomaly Detection - Proof of Concept
For Jonah's AI Lab, we acquired one of our client's production
transaction data sets to demonstrate the use of ML as a solution. The
data was fully anonymized to allow us to analyze and report on
techniques and findings without risking disclosure of confidential or
The emphasis of the PoC is a realistic data set where the anomaly is
unknown initially to our Lab team. Our goals were to:
i) describe the steps to analyze this transaction data set,
ii) discover the anomalous points that deviated significantly from the
regular transactional patterns, and
iii) provide an initial solution that alerts the client immediately
when a system fault is detected.
We proceeded by taking the following steps:
Step 1: Prepare Data: Analyze data by breaking it down into
categories and understanding it in order to know what an anomaly is
and what we're looking for.
Step 2: Label Data: Understanding the data lets us define labels
for specific data characteristics by applying unsupervised learning
algorithms to the data and scoring the results visually.
Step 3: Train Model: With labelled data, we can then train our
supervised learning algorithms and develop a scoring method to
evaluate the results accurately.
Step 4: Deploy Model: Once the algorithm is “trained”, we deploy
our model to the machine learning server (Apache PredictionIO) and
predict anomalies in real time. The results are made available via
RESTful web services.
Step 5: Monitor Model: When our detection algorithms are put
into production, we publish the ongoing results via a dashboard that
allows users to monitor incoming transactions and be alerted when a
possible anomaly is detected.
Our lab met all project objectives. We analyzed the data for a
single transactional data source (also called a contributor) and
achieved the desired results. An anomaly was defined as either a sudden
spike in the received data or data that is outright missing (such as
truncated files). We combined unsupervised and supervised machine
learning: the results from the unsupervised algorithms labelled the
data, and the labelled data trained the supervised learning algorithm
to generate a model. We then deployed the trained model to a
server to predict real-time anomalies with an accuracy above 95%.
A more detailed explanation of each step follows:
Step 1: Prepare Data
The first step is to understand the data itself: how to break it down,
analyze it, and visualize it. This analysis lets us explore the best way
to structure the detection approach. Since the client data set is a
discrete time series (data that is received at specific points in
time), we should treat anomalies as "point" anomalies within a day. In
other words, we should design the prediction to flag an anomalous
period within a day. The anomalous period could be triggered by
receiving too little data (3 transactions in an hour versus the
expected 20) or by suddenly receiving a large amount of data (500
transactions in that hour). The chart below shows examples: a large
amount of data received on September 11, no data received for a
specific time on September 18, and very little data received on
October 16.
Data analytic techniques guided our exploration and helped us visualize
the detection goal clearly. We then chose to break down the data set by
transaction data source, since each contributor has its own daily
pattern of timing and frequency for sending transactions. From that
discovery, we performed feature engineering, extracting the useful
features from the data set. For example, befitting a non-continuous time
series, we extracted the day of the week on which the transaction
occurred, the transaction time, and the cumulative total number of
transactions. The engineered feature set helps the machine learning
algorithms focus on the anomalous events.
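As a rough illustration of the features named above (day of the week, transaction time, and cumulative transaction count), the following minimal sketch derives them from raw timestamps using only the Python standard library. The function name `extract_features`, the field names, the sample dates, and the choice to restart the cumulative count each calendar day are all assumptions made for the example, not details taken from the project.

```python
from collections import defaultdict
from datetime import datetime


def extract_features(timestamps):
    """Turn raw transaction timestamps into one feature row per transaction.

    Each row holds the day of week (0 = Monday), the time of day in
    seconds since midnight, and the running count of transactions
    received so far on that calendar day (an assumed definition of
    "cumulative total").
    """
    running_total = defaultdict(int)  # calendar date -> transactions seen so far
    rows = []
    for ts in sorted(timestamps):
        running_total[ts.date()] += 1
        rows.append({
            "day_of_week": ts.weekday(),
            "seconds_since_midnight": ts.hour * 3600 + ts.minute * 60 + ts.second,
            "cumulative_count": running_total[ts.date()],
        })
    return rows


# Hypothetical timestamps for a single contributor (the year is invented).
sample = [
    datetime(2017, 9, 11, 9, 0, 5),
    datetime(2017, 9, 11, 9, 30, 0),
    datetime(2017, 9, 12, 8, 15, 0),
]
features = extract_features(sample)
```

Running the extraction separately for each contributor keeps one contributor's daily pattern from leaking into another's features.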
Step 2: Label Data
After constructing the useful feature set, this step automatically
assigns labels to the data points, marking each as anomalous or not.
Since the client data is unlabelled, we applied unsupervised learning
algorithms to predict the anomalies. We experimented with 5 different
unsupervised learning algorithms; of these, LocalOutlierFactor
produced the best scores when detecting anomalies. To obtain the best
labelled data, we labelled a data point as an anomaly based on the
consensus of the 5 algorithms: if more than 3 algorithms identified the
same data point as an anomaly, we kept that result. With
this consensus rule, we obtained the following labelled data set:
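The consensus rule itself is simple enough to sketch in a few lines. The example below assumes each detector has already produced one boolean per data point (True meaning "flagged as anomalous") and reads "more than 3 of 5" as "at least 4 votes". The function name `consensus_labels` and the sample votes are hypothetical.

```python
def consensus_labels(votes_per_algorithm, min_votes=4):
    """Label a point anomalous when at least `min_votes` detectors agree.

    `votes_per_algorithm` holds one list per detector; each list contains
    one boolean per data point, aligned by position.
    """
    if not votes_per_algorithm:
        return []
    n_points = len(votes_per_algorithm[0])
    if any(len(v) != n_points for v in votes_per_algorithm):
        raise ValueError("every detector must score the same data points")
    # zip(*...) regroups the votes so each tuple holds all votes for one point.
    return [sum(point_votes) >= min_votes
            for point_votes in zip(*votes_per_algorithm)]


# Hypothetical outputs of five detectors for four data points.
votes = [
    [True,  False, True,  False],
    [True,  False, True,  False],
    [True,  False, False, False],
    [True,  True,  True,  False],
    [False, False, False, False],
]
labels = consensus_labels(votes)
```

Here only the first point collects four votes, so it is the only one labelled anomalous; the third point has three votes and falls just short of the threshold.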
Step 3: Train Model
We then proceeded to supervised learning with the labelled data. Because
it is unreasonable to expect the algorithm to pinpoint anomalies at
their exact time, we scored its accuracy based on how many anomalies it
detected in each hour. For example, because we built features from
hourly data, if a specific hour has 10 anomalies, we expect our
algorithm to produce that number of anomalies within that hour. With
this scoring method defined, we split the data into training and
testing data sets. We used the training set to train the algorithms and
the testing set to evaluate them using the scoring method.
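One way to picture the hourly scoring idea is to compare anomaly counts per hour bucket rather than point by point. The sketch below is an assumption about how such a score could be computed, not the exact formula used in the lab. The names `hourly_counts` and `hourly_score` and the sample data are hypothetical.

```python
from collections import Counter


def hourly_counts(hours, flags):
    """Count flagged anomalies per hour bucket."""
    return Counter(hour for hour, flagged in zip(hours, flags) if flagged)


def hourly_score(hours, expected_flags, predicted_flags):
    """Fraction of hour buckets whose predicted anomaly count equals the expected count."""
    expected = hourly_counts(hours, expected_flags)
    predicted = hourly_counts(hours, predicted_flags)
    buckets = set(hours)
    if not buckets:
        return 0.0
    matches = sum(1 for hour in buckets if expected[hour] == predicted[hour])
    return matches / len(buckets)


# Hypothetical test set: the hour each point falls in, its label, and the prediction.
hours = [9, 9, 9, 10, 10, 11]
expected = [True, True, False, False, False, True]
predicted = [True, False, True, False, False, False]
score = hourly_score(hours, expected, predicted)
```

In this sample the 9 o'clock bucket still counts as a match, because the prediction finds two anomalies in that hour even though it flags a different point than the label does. That tolerance is the reason for scoring per hour instead of per point.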
Of the 5 algorithms we tried, the RandomForest Classifier and AdaBoost
Classifier showed the most promising results. In particular, the
RandomForest Classifier gave the best result on the larger data set.
The classification report showed an accuracy rate above 95%,
illustrated in the image below:
The colored points are:
TN: True Negative - how many points the algorithm correctly
determined were not anomalous (represented by the green line).
TP: True Positive - how many points the algorithm correctly
determined were anomalous (represented by the red dots).
FN: False Negative - how many points the algorithm incorrectly
determined were not anomalous (represented by the black dots).
FP: False Positive - how many points the algorithm incorrectly
determined were anomalous (represented by the yellow dots).
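To make the four categories concrete, here is a minimal sketch that tallies them from a list of true labels and a list of predictions and derives accuracy from the tallies. The function names and the sample data are hypothetical and are not taken from the project's classification report.

```python
def confusion_counts(actual, predicted):
    """Tally TP, TN, FP and FN, where True means "anomalous"."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_anomaly, flagged in zip(actual, predicted):
        if is_anomaly and flagged:
            counts["TP"] += 1      # anomaly correctly flagged
        elif not is_anomaly and not flagged:
            counts["TN"] += 1      # normal point correctly passed
        elif flagged:
            counts["FP"] += 1      # normal point wrongly flagged
        else:
            counts["FN"] += 1      # anomaly missed
    return counts


def accuracy(counts):
    """Share of all points that were classified correctly."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Hypothetical labels and predictions for eight data points.
actual = [False, False, False, True, True, False, False, True]
predicted = [False, False, True, True, False, False, False, True]
counts = confusion_counts(actual, predicted)
acc = accuracy(counts)
```

Because anomalies are usually rare, accuracy alone can look high even when several anomalies are missed, so the FN and FP counts are worth reading alongside it.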
Step 4: Deploy Model
After obtaining a trained model with high confidence, we can deploy
it as a real-time RESTful detection service. We have the option
to deploy locally or in the cloud.
We use a high-performance framework, such as
PredictionIO, to run the best-performing
machine learning algorithm with the trained model. After
deployment, the detection web service can predict whether the incoming
real-time data stream is anomalous or not.
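A client of such a service could look roughly like the sketch below. It assumes a PredictionIO engine deployed with its default settings, which, as far as we know, listens on port 8000 and accepts JSON queries via POST at `/queries.json`. The query fields are hypothetical, since the accepted fields and the response shape depend on the engine that was deployed.

```python
import json
from urllib import request


def build_query(features, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the engine's query endpoint."""
    body = json.dumps(features).encode("utf-8")
    return request.Request(
        url=f"{base_url}/queries.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url="http://localhost:8000", timeout=5.0):
    """Send the query and return the decoded JSON response.

    Requires a running engine at `base_url`; the response format is
    defined by that engine and is not assumed here.
    """
    with request.urlopen(build_query(features, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical feature payload for one incoming transaction.
req = build_query({
    "day_of_week": 0,
    "seconds_since_midnight": 32405,
    "cumulative_count": 1,
})
```

Keeping request construction separate from sending makes the payload easy to inspect or log without a live server.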
Step 5: Monitor Model
The final step is to provide a user-friendly UI dashboard that allows
the user to monitor the transaction stream. We can graphically annotate
the anomalous points on the incoming transaction data and alert the user
when an anomaly is detected. The UI dashboard can be customized
according to the user's monitoring preferences.
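The alerting side of the dashboard can be pictured as a small loop over scored transactions. This is only a sketch: the record fields (`timestamp`, `contributor`, `is_anomaly`), the sample values, and the `notify` callback are hypothetical placeholders for whatever the dashboard actually uses.

```python
def collect_alerts(scored_transactions, notify):
    """Call `notify` for every transaction flagged as anomalous.

    Returns the number of alerts that were raised.
    """
    fired = 0
    for tx in scored_transactions:
        if tx["is_anomaly"]:
            notify(f"Possible anomaly at {tx['timestamp']} from {tx['contributor']}")
            fired += 1
    return fired


# Hypothetical scored stream; `messages.append` stands in for a real alert channel.
stream = [
    {"timestamp": "09:00", "contributor": "A", "is_anomaly": False},
    {"timestamp": "10:00", "contributor": "A", "is_anomaly": True},
]
messages = []
alert_count = collect_alerts(stream, messages.append)
```

Passing the notification channel in as a callback lets the same loop feed a dashboard annotation, an email, or a test list.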
What We Learned
A key lesson for anomaly detection is the value of upfront analysis of
the data to understand where anomalies can occur. Since
the data set does not display strong time-series seasonality, we framed
the problem as point anomaly detection. Our methodology uses
multiple unsupervised learning algorithms to label the data set, with
consensus among the algorithms used to categorize data points as
anomalous.
The labelled points are then fed into supervised learning
to train a detection model with 95% or better accuracy. After
deployment, incoming transaction data is streamed through the
high-performance RESTful detection web service to predict anomalies.
Our research shows that each transaction stream should run with its
own model, which fits the distinct business practices of our real
clients.
Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly
Detection: A Survey, ACM Computing Surveys, September 2009.
Jonah Group is a digital consultancy that designs and builds high-performance software applications for the enterprise. Our industry is constantly changing, so we help our clients keep pace by making them aware of the possibilities of digital technology as it relates to their business.