K-means clustering and a real use case in the security domain: detecting a DDoS attack on a live Apache server
In this article, I will use K-means to detect a DDoS attack on one of my personally deployed Apache web servers, which hosts the website stjgps.org, by analyzing its access log file. I will then integrate the analysis with Jenkins for automation, to show a real use case in the security domain.
First, let me define what clustering is.
Clustering is an unsupervised classification technique widely used for web usage mining. Its primary objective is to group a given collection of unlabeled objects into meaningful clusters.
Many algorithms can perform clustering, but one of the best known is K-means, which can be explained as follows:
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
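To make the iterative assignment concrete, here is a minimal sketch of the K-means loop in plain Python. The function name `kmeans`, its parameters, and the sample request counts are illustrative assumptions, not code from this article; in the rest of the article the scikit-learn `KMeans` class does this work.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical requests-per-IP counts: three quiet clients and two very busy ones.
visits = [(3,), (5,), (4,), (950,), (1020,)]
labels, centroids = kmeans(visits, k=2)
```

On this toy data the three small counts end up in one group and the two large counts in the other, which is the same separation we will look for in the real access log.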
Since the prime focus of this article is the security use case, log files are very helpful. On enterprise servers, we perform log analysis to uncover valuable insights that would otherwise remain hidden in the log files. The goal is to improve the business's understanding of what the log file records and to discover potential competitive advantages. That said, the most important first step is to outline the organizational goal of the log file analysis.
What is a DDoS attack?
Distributed network attacks are often referred to as Distributed Denial of Service (DDoS) attacks. This type of attack takes advantage of the specific capacity limits that apply to any network resource, such as the infrastructure that enables a company’s website. The DDoS attack sends multiple requests to the attacked web resource, with the aim of exceeding the website’s capacity to handle multiple requests and preventing the website from functioning correctly.
How a DDoS attack works
Network resources — such as web servers — have a finite limit to the number of requests that they can service simultaneously. In addition to the capacity limit of the server, the channel that connects the server to the Internet will also have a finite bandwidth / capacity. Whenever the number of requests exceeds the capacity limits of any component of the infrastructure, the level of service is likely to suffer in one of the following ways:
- The response to requests will be much slower than normal.
- Some, or all, users’ requests may be totally ignored.
Usually, the attacker’s ultimate aim is the total prevention of the web resource’s normal functioning: a total ‘denial of service’. The attacker may also request payment for stopping the attack. In some cases, a DDoS attack may even be an attempt to discredit or damage a competitor’s business.
So, let’s get started.
Our objectives are:
- Pull the log file and put it in centralized storage such as AWS S3.
- Create and test the analysis code, then push it to an SCM tool such as GitHub.
- Use an automation tool such as Jenkins to pull the code and the log file, find the suspicious IPs that may be causing a DDoS attack, and take the necessary actions, such as sending an email, using the AWS API to block the IP address by updating the firewall, or doing further research.
First things first: we need the log file, which we can pull from /var/log/apache2/access_log. For my convenience, I uploaded it to AWS S3 manually, but this step can also be automated with AWS. If you are not familiar with AWS, you can simply download the log file, or run the analysis on the same machine.
Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from datetime import datetime
Since our Apache log file is not a structured dataset, we need to convert it into a pandas data frame before we can analyze it.
I parsed the log with a short script, which gives the dataset used in the rest of this article.
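The original parsing code is not reproduced here, so the following is only a minimal sketch of one way to do it. It assumes the log uses the Apache common or combined log format; the pattern `LOG_PATTERN`, the helper `parse_line`, the field names, and the sample lines (which use documentation-only IP addresses) are all hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "request" status size ...
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])  # "-" means no body
    return row


# Made-up sample lines for illustration only.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2020:13:55:37 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
rows = [row for row in map(parse_line, sample_lines) if row is not None]
```

Passing `rows` to `pd.DataFrame(rows)` would then give a data frame with one row per request and the columns `ip`, `time`, `request`, `status`, and `size`.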
Now, since we need to find the successful visits to our website, we will keep only the relevant columns.
Next, let’s count the visits that each unique IP has made.
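As a sketch of these two steps, the snippet below filters for successful requests and counts them per IP. The column names `ip` and `status`, the choice of HTTP status 200 as the meaning of "successful", and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical parsed rows; a real frame would come from the access log.
df = pd.DataFrame({
    "ip": ["192.0.2.10", "192.0.2.10", "198.51.100.7", "192.0.2.10", "203.0.113.5"],
    "status": [200, 200, 200, 404, 200],
})

# Keep only successful requests (assumed here to mean HTTP 200).
successful = df[df["status"] == 200]

# One row per IP with the number of successful requests it made.
visits = successful.groupby("ip").size().reset_index(name="count")
```

Here `visits` has the columns `ip` and `count`, and the 404 request from 192.0.2.10 is not counted.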
Since ip is an object (string) column, it cannot be used directly for ML training, so let’s drop it from the training data.
Next, we scale the data, since K-means is distance-based and works better on scaled features.
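The article does not show which scaler was used, so the sketch below assumes scikit-learn's `StandardScaler`, applied to a hypothetical `visits` frame with `ip` and `count` columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical per-IP visit counts (documentation-only IP addresses).
visits = pd.DataFrame({
    "ip": ["192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.5", "203.0.113.9"],
    "count": [3, 5, 4, 950, 1020],
})

# Drop the non-numeric column, but keep `visits` so IPs can be looked up later.
features = visits.drop(columns=["ip"])

# Standardize to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)
```

Keeping the original `visits` frame around matters because the row order of `scaled` matches it, which is how cluster labels can later be mapped back to IP addresses.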
Now, let’s find the best value of k for the dataset.
From the graph, we can see that the best value of k is 5.
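A common way to produce such a graph is the elbow method: fit K-means for a range of k values and look at where the inertia (within-cluster sum of squares) stops dropping sharply. Whether the original graph was made this way is an assumption, and the synthetic data below is invented purely for illustration, so its elbow will not match the k = 5 found on the real log.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic request counts with three obvious groups (illustration only).
rng = np.random.default_rng(0)
counts = np.concatenate([
    rng.normal(5, 1, 50),     # many quiet clients
    rng.normal(60, 5, 20),    # some moderately active clients
    rng.normal(900, 30, 5),   # a few very heavy clients
]).reshape(-1, 1)

# Fit K-means for k = 1..7 and record the inertia of each fit.
k_values = list(range(1, 8))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(counts)
    inertias.append(model.inertia_)
```

Plotting the result with `plt.plot(k_values, inertias)` shows the curve; the "elbow" is the k after which adding more clusters barely reduces the inertia.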
Now, it’s time to train the model using the K-means algorithm.
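A minimal training sketch with scikit-learn follows. The input array stands in for the scaled visit counts and is made up; it also uses k = 2 because the toy data has only two groups, whereas this article uses k = 5 on the real log.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the scaled per-IP features (illustration only).
scaled = np.array([[3.0], [5.0], [4.0], [950.0], [1020.0]])

# n_init and random_state are fixed so repeated runs give the same clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict trains the model and returns one cluster label per row.
labels = model.fit_predict(scaled)
```

The label numbers themselves are arbitrary, so "cluster 2" in one run may be a different group in another unless `random_state` is fixed.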
To visualize the result, let’s first plot it as a bar graph.
To visualize it better, I used a scatter plot that displays the clusters, which makes it easy to see which IPs have visited the site.
We can see that the clusters at the top of the plot may have caused a DDoS attack. To verify this, I used two approaches: first, custom Python code, and second, selecting the items from cluster 2.
From these results, it is clear that the IPs belonging to cluster 2 may have caused a DDoS attack, so we can now easily get their IP addresses.
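The verification code itself is not shown in this article. One possible sketch, assuming a hypothetical `visits` frame that already holds each IP's visit count and cluster label, is to pick the cluster with the highest average count and list its members.

```python
import pandas as pd

# Hypothetical result of the clustering step (documentation-only IP addresses).
visits = pd.DataFrame({
    "ip": ["192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.5", "203.0.113.9"],
    "count": [3, 5, 4, 950, 1020],
    "cluster": [0, 0, 0, 1, 1],
})

# Find the cluster whose members make the most requests on average.
heaviest = visits.groupby("cluster")["count"].mean().idxmax()

# List the IP addresses that belong to that cluster.
suspects = visits.loc[visits["cluster"] == heaviest, "ip"].tolist()
```

Choosing the cluster by its average count, rather than hard-coding a label such as 2, keeps the check valid even when K-means numbers the clusters differently on another run.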
What’s next?
We can use Jenkins to automate the analysis and run the jobs regularly. To learn how to integrate it with Jenkins, like, support, and follow me.
The approach described above is simple, easy to use, and yet powerful. We can set up the code and automate it with a tool like Jenkins so that it runs at regular intervals and produces a report, allowing the responsible team to take the necessary actions.
!! Thanks for Reading !!