
Jonah's AI Lab - Machine Learning for Anomaly Detection

By Benny Cheung, Senior Technical Architect, and Einas Madi, Intermediate Technical Developer
July 18, 2018

This is the first in a series of Jonah Lab articles. Jonah Group provides free technology labs, offering 4-6 weeks of consultant time to solve your business problem with leading-edge AI/machine learning, blockchain, or big data technologies. If you are interested in participating, Contact Us to let us know!

Business technology systems are frequently complex and involve many sources of data. As system complexity increases, errors with data sources become increasingly difficult to identify. Anomaly detection is an important tool for quickly identifying and resolving potential problems before they cause significant damage and cost to the business.

Anomaly detection is an important business problem, especially for a critical service where potential problems must be fixed quickly to avoid extended damage to the business. The difficulty in anomaly detection lies in defining what an anomaly actually is in the business context. Traditional methods, such as rules-based anomaly detection, suffer from having to be prescriptive: you have to define the rules in advance for infinitely many unforeseen events or variations in data.

This makes it an ideal problem for machine learning (ML), because ML can learn directly from the relevant data set and can keep improving by adapting to real-time pattern changes.

Anomaly Detection - Proof of Concept

For Jonah's AI Lab, we acquired one of our client's production transaction data sets to demonstrate the use of ML as a solution. The data was fully anonymized to allow us to analyze and report on techniques and findings without risking disclosure of confidential or private data.

The emphasis of the PoC is a realistic data set where the anomaly is initially unknown to our Lab team. Our goals were to:

i) describe the steps to analyze this transaction data set,

ii) discover the anomalous points that significantly deviated from the regular transactional patterns, and

iii) provide an initial solution that alerts the client immediately when a system fault is detected.

We proceeded by taking the following steps:

AI Lab Anomaly Detection Workflow
Figure 1: Jonah's AI Lab - Anomaly Detection Workflow

 

  • Step 1: Prepare Data: Analyze data by breaking it down into categories and understanding it in order to know what an anomaly is and what we're looking for.
  • Step 2: Label Data: Understanding the data lets us define labels for specific data characteristics by applying unsupervised learning algorithms on the data and visually scoring it.
  • Step 3: Train Model: With the labelled data, we then train our supervised learning algorithms and develop a model that accurately scores the results.
  • Step 4: Deploy Model: When the algorithm is “trained”, we deploy our model to the machine learning server (Apache PredictionIO) and predict anomalies in real time. The results are made available via RESTful web services.
  • Step 5: Monitor Model: When our detection algorithms are put into production, we publish the ongoing results via a dashboard that allows users to monitor incoming transactions and be alerted when a possible anomaly is detected.

TL;DR

Our lab was able to meet all project objectives. We analyzed the data for a single transactional data source (a.k.a. contributor) and achieved the desired results. An anomaly was classified either as a sudden spike in the received data or as outright missing data (such as truncated files). We incorporated both unsupervised and supervised machine learning: the results from the unsupervised algorithms were used to label data, which in turn trained the supervised learning algorithm to generate a model. We then deployed the trained model to a server to predict real-time anomalies with an accuracy above 95%.

A more detailed explanation of each step follows:

Step 1: Prepare Data

The first step is to understand the data itself: how to break it down, analyze it, and visualize the data set. This analysis allows us to explore the best way to structure the detection approach. Since the client data set is a discrete time series (data received at specific points in time), we should predict "point" anomalies within a day. In other words, we should design the prediction to target an anomalous period within a day. An anomalous period could be triggered by a lack of incoming data (3 transactions in an hour versus the expected 20) or by suddenly receiving a large amount of data (500 transactions in that hour). Looking at the chart below, we see an example of receiving a large amount of data on September 11, not receiving data for a specific period on September 18, and not receiving much data on October 16.

Step 1: Prepare Data
Figure 2: Cumulative number of transactions that occurred each Monday from September 4 to January 29

 

Data analytic techniques guided our exploration to clearly visualize the detection goal. We subsequently chose to break down the data set by transaction data source, since each contributor is unique, with its own daily pattern of timing and frequency with which it sends its transactions. From that discovery, we performed Feature Engineering - extracting the useful features from the data set. For example, as befits a non-continuous time series, we extracted the day of the week the transaction occurred, the transaction time, and the cumulative total number of transactions. The engineered feature set helps the machine learning algorithms detect the anomalous events.
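As an illustration, the features described above can be extracted with pandas; the timestamps below are made up, since the real client data is confidential:

```python
import pandas as pd

# Hypothetical transaction timestamps standing in for the client data.
timestamps = pd.to_datetime([
    "2017-09-04 09:15", "2017-09-04 09:40", "2017-09-04 10:05",
    "2017-09-04 10:30", "2017-09-04 11:55",
])
df = pd.DataFrame({"timestamp": timestamps})

# Extract the features named above: day of the week, transaction hour,
# and the cumulative number of transactions seen so far that day.
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["hour"] = df["timestamp"].dt.hour
df["cumulative_count"] = df.groupby(df["timestamp"].dt.date).cumcount() + 1
```

Each row now carries the engineered features alongside the raw timestamp, ready to feed into the labelling step.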

 

Step 2: Label Data

After constructing the useful feature set, this step automatically assigns labels to the data points, marking whether they are anomalous or not. Since the given client data is unlabelled, we applied Unsupervised Learning algorithms to predict the anomalies. We experimented with 5 different unsupervised learning algorithms; the LocalOutlierFactor (LOF) produced the best scores when detecting anomalies. To determine the best labelled data, we decided to label a data point as an anomaly based on the consensus of the 5 algorithms: if more than 3 algorithms identified the same data point as an anomaly, we kept that result. With this consensus rule, we obtained the following labelled data set:

Step 2: Label Data
Figure 3: Cumulative number of transactions that occurred each Monday from September 4 to January 29 with created labels

 

Step 3: Train Model

We then proceeded to Supervised Learning algorithm training with the labelled data. Because it is unreasonable to pinpoint anomalies at the exact time, we scored the algorithm's accuracy on how many anomalies it detected in each hour. For example, because we engineered hourly features, if a specific hour has 10 anomalies, we expect our algorithm to report that number of anomalies within that hour. With this scoring method, we split the data into training and testing sets, used the training set to train, and used the testing set to evaluate against the designed scoring method.

Of the 5 algorithms we tried, the RandomForest Classifier and AdaBoost Classifier showed the most promising results. In particular, the RandomForest Classifier showed the best results on the larger data set. The Classification Report showed an accuracy rate above 95%, as illustrated in the image below:
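The training and scoring flow can be sketched with scikit-learn; the feature layout and data below are illustrative stand-ins, not the client's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic labelled features: [day_of_week, hour, hourly_count].
# Normal hours see around 20 transactions; anomalous hours see 0 or 500.
n = 500
X_normal = np.column_stack([
    rng.randint(0, 7, n), rng.randint(0, 24, n), rng.normal(20, 2, n),
])
X_anomaly = np.column_stack([
    rng.randint(0, 7, 25), rng.randint(0, 24, 25),
    rng.choice([0.0, 500.0], 25),
])
X = np.vstack([X_normal, X_anomaly])
y = np.array([0] * n + [1] * 25)  # 1 = anomaly, 0 = normal

# Split into training and testing sets, train, then score held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The `classification_report` call is what produces the per-class precision, recall, and accuracy figures referred to above.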

Step 3: Train Model
Figure 4: Cumulative number of transactions that occurred each Monday from September 4 to January 29 with RandomForest algorithm results
 

The colored points are:

  • TN: True Negative - how many points the algorithm correctly determined were not anomalous (represented by the green line).
  • TP: True Positive - how many points the algorithm correctly determined were anomalous (represented by the red dots).
  • FN: False Negative - how many points the algorithm incorrectly determined were not anomalous (represented by the black dots).
  • FP: False Positive - how many points the algorithm incorrectly determined were anomalous (represented by the yellow dots).
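These four counts are exactly what a confusion matrix holds; with scikit-learn they can be recovered as follows (the labels here are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical hourly labels: 1 = anomaly, 0 = normal.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```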

Step 4: Deploy Model

After obtaining a trained model with high confidence, we can deploy it as a real-time RESTful detection service. We have the option to deploy locally or on a Cloud Cluster. We use a high-performance framework, such as PredictionIO, to run the best machine learning algorithm with the trained model. After deployment, the detection web service can predict whether the incoming real-time data stream is anomalous.
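In production the service runs on PredictionIO, but the request/response shape can be illustrated with a minimal Flask stand-in; the /queries path, the hourly_count field, and the threshold rule below are all hypothetical, not PredictionIO's actual API:

```python
# A minimal Flask stand-in for the detection web service; the real
# deployment uses PredictionIO. The endpoint path and field name
# below are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_anomaly(features):
    # Placeholder for the trained model; a real service would load the
    # serialized classifier and call its predict method here.
    return features.get("hourly_count", 0) > 100

@app.route("/queries", methods=["POST"])
def queries():
    features = request.get_json()
    return jsonify({"anomaly": bool(predict_anomaly(features))})
```

Run with `flask run` (or any WSGI server); clients POST a JSON feature vector to `/queries` and receive a JSON verdict back.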

Step 4: Deploy Model
Figure 5: Sample request sent to the deployed RESTful web service with a response

 

Step 5: Monitor Model

The final step is to provide a user-friendly UI dashboard that allows the user to monitor the transaction stream. We can graphically annotate the anomalous points on the incoming transaction data and alert the user when an anomaly is detected. The UI dashboard can be customized according to the user's monitoring preferences.

What We Learned

A key learning for anomaly detection algorithms is the value of upfront analysis of the data to understand where anomalies can occur. Since the data set does not display strong time-series seasonality, we framed the problem as point anomaly detection. The methodology uses multiple unsupervised learning algorithms to label the data set; consensus among the algorithms determines which data points are categorized as anomalous.

Subsequently, the labelled points are fed into supervised learning to train the detection model to 95% or better accuracy. After deployment, the incoming transaction data is streamed through the high-performance RESTful detection web service to predict anomalies. Our research shows that each transaction stream should run with its own unique model, a finding that fits the unique business practices of our real clients.

References

  1. Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, September 2009.
  2. Feature Engineering: Extracting useful information from data - https://en.wikipedia.org/wiki/Feature_engineering
  3. Unsupervised Learning: Determine function of data without labels - https://en.wikipedia.org/wiki/Unsupervised_learning
  4. LocalOutlierFactor: Unsupervised learning algorithm - http://scikit-learn.org/stable/auto_examples/neighbors/plot_lof.html
  5. Supervised Learning: Predict outcome of labelled data - https://en.wikipedia.org/wiki/Supervised_learning
  6. RandomForest Classifier: Supervised learning algorithm - https://en.wikipedia.org/wiki/Random_forest
  7. AdaBoost Classifier: Supervised learning algorithm - https://en.wikipedia.org/wiki/AdaBoost
  8. Classification Report: Scoring a supervised learning algorithm's performance - https://en.wikipedia.org/wiki/Precision_and_recall
  9. Cloud Cluster: Container that runs a group of specified tasks - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_clusters.html
  10. PredictionIO: Machine learning server that allows a user to deploy a built model - https://predictionio.apache.org/

About Jonah Group

Jonah Group is a digital consultancy that designs and builds high-performance software applications for the enterprise. Our industry is constantly changing, so we help our clients keep pace by making them aware of the possibilities of digital technology as it relates to their business.

  • 24,465 sq. ft office in downtown Toronto
  • 160 team members in our close-knit group
  • 21 years in business, and counting