K-Means Clustering and its Real Use-Case in the Security Domain

Aakash Choudhary
8 min read · Aug 12, 2021

Clustering :

Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects.

Each object is described by a set of characteristics called features. The first step in dividing objects into clusters is to define the distance between objects. Choosing an adequate distance measure is crucial for the success of the clustering process.
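As a minimal sketch of one common choice, the Euclidean distance between two objects with numeric features could be computed as follows. The feature values are made up for illustration, and other distance measures may suit other kinds of data better.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects, each described by two numeric features.
p = (0.0, 0.0)
q = (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
```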

K-Means Clustering :

There are many clustering algorithms, each with its own advantages and disadvantages. A popular one is K-means, which iteratively searches for the best k cluster centers. Each cluster center serves as a “representative” of the objects associated with its cluster.

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means divides objects into clusters whose members are similar to each other and dissimilar to the objects in other clusters.

The ‘K’ in K-Means is a number: you must tell the algorithm how many clusters to create. For example, K = 2 means two clusters. A good value of K for a given dataset can be estimated, for example with the elbow method, which compares clustering quality across several values of K.

K-means’ key features are also its drawbacks :

  • The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
  • Because k-means is iterative, it may converge to a local minimum, so the result depends on the initial centroids.
  • The clusters are assumed to be spherical.

The outputs of executing a k-means on a dataset are :

  • k centroids: centroids for each of the k clusters identified from the dataset.
  • labels for the complete dataset: each data point is assigned to exactly one of the clusters.

Work Flow Of K-Means Algorithm :

  1. Collecting dataset.
  2. Identifying the number of clusters (k).
  3. Initializing the k centroids (the k means) for the data.
  4. Computing the distance from each data point to each centroid and assigning the point to the cluster whose centroid is closest.
  5. Recomputing the centroid of each cluster as the mean of the points assigned to it.
  6. Steps 4 and 5 are repeated until there is no change in cluster centroids.
  7. If formed clusters do not look reasonable, repeat steps 1–6 for different numbers of clusters.
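The workflow above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the random choice of initial centroids, and the sample points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 3: initialize the centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The two return values correspond to the outputs listed earlier: one centroid per cluster and one label per data point. Which cluster receives index 0 depends on the random initialization.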

The goal of the K-Means algorithm is to find clusters in the given input data. Because the right number of clusters is rarely known in advance, we keep changing the value of k and re-running the algorithm until the clusters look reasonable.
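One way to compare different values of k is to record the inertia (the sum of squared distances from each point to its nearest centroid) for each k and look for the point where it stops dropping sharply, often called the elbow. The sketch below assumes made-up data with three groups. The helper `kmeans_inertia` is a name chosen for this example, and the ten random restarts are an assumed way to reduce the effect of poor initial centroids.

```python
import math
import random

def kmeans_inertia(points, k, n_init=10, max_iter=100, seed=0):
    """Run plain k-means n_init times and return the lowest inertia found."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(n_init):
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            # Assign every point to the group of its nearest centroid.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                groups[nearest].append(p)
            # Move each centroid to the mean of its group (keep it if empty).
            updated = [
                tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
                for i, g in enumerate(groups)
            ]
            if updated == centroids:
                break
            centroids = updated
        inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
        best = min(best, inertia)
    return best

# Made-up data with three visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
          (9.0, 1.0), (9.1, 1.2), (8.8, 0.9)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

On data shaped like this, the inertia would be expected to fall steeply up to k = 3 and change little afterwards, which suggests three clusters. On real data the elbow is often less clear, so the choice of k remains a judgment call.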

Applications Of K-Means :

The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goal of a cluster analysis is usually one of the following:

  1. Get a meaningful intuition of the structure of the data we’re dealing with.
  2. Cluster-then-predict, where a different model is built for each subgroup because we believe behavior varies widely between subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of a heart attack.

Use-Cases in the Security Domain :

k-means is typically applied to data that is numeric and continuous and has a small number of dimensions. Think of a scenario in which you want to form groups of similar things from a randomly distributed collection of things; k-means is well suited to such scenarios.

Clustering Analysis for Malware Behavior Detection in Cyber Crime

Cyber-attacks have become the biggest threat to computer and network systems around the world. It is therefore important to have an intrusion detection system (IDS) that can detect and analyze data with high accuracy (i.e., many true positives and true negatives) and few false detections (i.e., few false positives and false negatives) in minimal detection time. A detection model based on data mining, and on K-Means clustering in particular, is a notable approach to this problem. The need for continuous IDS improvement in terms of malware-analysis accuracy, detection time, and a suitable detection approach is the motivation for this line of research.

Malware Detection :

When malware enters a computer, it typically creates and modifies files and Windows registry entries, and it also interferes with inter-process communication and basic network interaction. Intrusion attacks such as malware breach an organization’s network security policy and continuously try to undermine the core principles of cybersecurity: Confidentiality, Integrity, and Availability, known as the CIA triad.

Therefore, cybersecurity researchers have proposed a detection framework for malware intrusion that monitors the behavior of system activity. The framework analyzes that behavior and notifies users if there is a sign of intrusion.

Analysis of Intrusion Detection System :

Clustering divides the data into groups according to the attributes of the data. Network intrusion detection is the process of monitoring the events occurring in a computing system or network and analyzing them for signs of intrusions, defined as attempts to compromise confidentiality, integrity, or availability.

The intrusion attacks can be divided into four categories: probe (e.g. IP sweep, vulnerability scanning), denial of service (DoS) (e.g. mail bomb, UDP storm), user-to-root (U2R) (e.g. buffer overflow attacks, rootkits), and remote-to-local (R2L) (e.g. password guessing, worm attack).

Clustering is the method of grouping objects into meaningful subclasses so that the members from the same cluster are quite similar, and the members from different clusters are quite different from each other. Therefore clustering methods can be useful for classifying log data and detecting intrusion.
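One common way to turn clusters into a detector, shown here as an assumption rather than as the method of any particular IDS, is to learn centroids from normal traffic and flag records that lie far from every centroid. The features, centroid values, and threshold below are hypothetical.

```python
import math

# Hypothetical centroids learned from normal traffic, described by two
# made-up, already-scaled features: (connections per minute, mean packet size).
centroids = [(0.2, 0.3), (0.7, 0.8)]
THRESHOLD = 0.25  # assumed cut-off; in practice it would be tuned on real data

def nearest_centroid(record, centroids):
    """Return (cluster index, distance) of the closest centroid."""
    distances = [math.dist(record, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

def is_suspicious(record, centroids, threshold=THRESHOLD):
    """Flag a record that lies far from every known cluster."""
    _, distance = nearest_centroid(record, centroids)
    return distance > threshold

records = [(0.22, 0.28), (0.68, 0.83), (0.95, 0.05)]
for r in records:
    print(r, is_suspicious(r, centroids))
```

The first two records sit close to a centroid and are treated as normal. The third is far from both centroids, so it is flagged for review.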

Cyber Profiling using Log Analysis and K-Means Clustering :

The activities of Internet users increase from year to year and have changed the behavior of the users themselves. Assessment of user behavior is often based only on interaction across the Internet, without knowledge of any other activities. Activity logs offer another way to study user behavior. Internet activity logs are a form of big data, so data mining with the K-Means technique can be used to analyze user behavior.

In general, cyber profiling analysis is the exploration of data to determine what users do while they access the internet. One method that can support the profiling process is K-Means clustering. With this algorithm, the data can be grouped by the number of websites visited, which shows which websites users access most frequently.
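Before any clustering can happen, the raw log has to be turned into numeric features per user. The sketch below assumes a hypothetical log format of "<user> <url>" per line. Real proxy or web-server logs will look different, and the two features chosen here are only an illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical log lines in the form "<user> <url>".
log_lines = [
    "alice https://news.example.com/world",
    "alice https://news.example.com/sport",
    "alice https://mail.example.com/inbox",
    "bob https://video.example.com/watch?v=1",
    "bob https://video.example.com/watch?v=2",
    "carol https://mail.example.com/inbox",
]

def visits_per_user(lines):
    """Map each user to {website host: number of requests}."""
    visits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        user, url = line.split(maxsplit=1)
        host = urlparse(url).netloc
        visits[user][host] += 1
    return {user: dict(hosts) for user, hosts in visits.items()}

def feature_vector(hosts):
    """Two simple numeric features: distinct sites and total requests."""
    return (float(len(hosts)), float(sum(hosts.values())))

profile = visits_per_user(log_lines)
features = {user: feature_vector(hosts) for user, hosts in profile.items()}
print(features)
```

Each user ends up as a numeric point, which is the form k-means expects as input.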

Identify Outlier Access :

The average user has more than 100 entitlements, which can be very difficult to manage manually. Using a K-Means clustering model, we can detect access outliers by analyzing the behavior of dynamic peer groups of users.

Let’s look at an example.

On a Saturday afternoon, the company access data shows an employee from IT working on your production finance system. This looks like outlier activity, as it is not typical for someone in an IT role to access a production finance system, much less on a Saturday afternoon. So, is this risky activity? At the exact same time on the same day, you also have a business analyst accessing and working on that same production finance application.

If we examine these two access activities individually, we might perceive a problem. Yet, if we combine these two access data points dynamically, the situation may appear to be less risky. Read on.

Now, let’s add an additional person from the Finance organization, a financial analyst, and they are also accessing the same production finance application and on the same Saturday. We have three instances of three different people, from different workgroups, all accessing the production finance system at the same time and on the same day. So, what’s going on?

What’s most likely taking place in this scenario is these employees are working together to perform a system upgrade or are resolving a production issue occurring in the financial system. From a real-world viewpoint, where we can examine traditional static data attributes such as job title or department number, these three employees would not be considered a relevant peer group. From a behavioral analytics standpoint, these three employees do comprise a dynamically generated peer group, as there is system data logging their actions of accessing the same production finance system at the same time.

Dynamic peer groups are clusters of users that are created as Risk Analytics ingests log data, in near real-time, all internal to the machine learning algorithms. Dynamic peer groups are fairly transient, yet they can be retained for future reference.
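The idea of a dynamic peer group can be illustrated with a deliberately simplified sketch that buckets access events by resource and hour. This is plain grouping rather than k-means, and it is not the method of any particular product. The event fields, user IDs, and resource names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical access events: (user, department, resource, ISO timestamp).
events = [
    ("u1", "IT", "prod-finance", "2021-08-07T14:05:00"),
    ("u2", "Business Analysis", "prod-finance", "2021-08-07T14:20:00"),
    ("u3", "Finance", "prod-finance", "2021-08-07T14:40:00"),
    ("u4", "IT", "hr-portal", "2021-08-07T03:10:00"),
]

def peer_groups(events):
    """Group users who touched the same resource within the same hour."""
    groups = defaultdict(set)
    for user, _dept, resource, stamp in events:
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        groups[(resource, hour)].add(user)
    return groups

def lone_accesses(groups):
    """Users who have no peers in their (resource, hour) bucket."""
    return sorted(
        user
        for members in groups.values()
        if len(members) == 1
        for user in members
    )

groups = peer_groups(events)
print(lone_accesses(groups))  # ['u4']
```

The three users on the production finance system form one group even though their departments differ, while the single early-morning access has no peers and would be a candidate for review.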

Automatic Clustering of IT Alerts :

Large enterprise infrastructure technology components such as network, storage, or database generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering this data can provide insight into categories of alerts and mean time to repair, and can help in failure prediction.

Rideshare Data Analysis :

The publicly available Uber ride information dataset provides a large amount of valuable data on traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also for gaining insight into urban traffic patterns and helping us plan for the cities of the future.

Conclusion

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply to a clustering task to get an idea of the structure of the dataset. The goal of k-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters are roughly spherical, but it suffers as the shapes of the clusters deviate from spherical. It also does not learn the number of clusters from the data and requires it to be defined in advance. To be a good practitioner, it helps to know the assumptions behind an algorithm so that you have a good idea of its strengths and weaknesses and can decide when to use it. In this post, we covered how k-means works, its strengths and weaknesses, and some of its use-cases in the security domain.

So, that’s all about K-means clustering. I hope you got something from it.

I hope you liked this blog. For next blog topic suggestions or queries, you can DM me on LinkedIn. If you liked this blog, make sure you clap for it.

Thank You

Signing off 🙌
