k-means Method
K-means is a clustering method widely recognized in the literature and used to subdivide a data set into distinct groups called clusters. The purpose of the algorithm is to assign each element of the data set to a cluster so that elements within the same cluster are more similar to each other than to elements in other clusters.
The K-means process can be summarized as follows:
- Initialization: The algorithm begins by randomly selecting K centroids, where K is the number given by the user, which represents the number of clusters to be formed.
- Point Assignment: Each data element is assigned to the cluster whose centroid is closest in terms of Euclidean distance.
- Centroid Update: The cluster centroids are recalculated based on the elements assigned to them.
- Repetition: The point assignment and centroid update steps are repeated until convergence is achieved, that is, until there are no significant changes in the centroids or in the cluster assignments.
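The four steps above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the applet's own implementation. The function name `kmeans`, its parameters, the fixed random seed, and the handling of empty clusters are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, n_iterations=100, seed=0):
    """Cluster a list of equal-length numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)

    def nearest(point, centroids):
        # Index of the centroid with the smallest Euclidean distance to the point.
        return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

    # Initialization: K data points, drawn at random without replacement.
    centroids = rng.sample(points, k)

    for _ in range(n_iterations):
        # Point assignment: each element joins the cluster of its closest centroid.
        labels = [nearest(p, centroids) for p in points]

        # Centroid update: coordinate-wise mean of the elements in each cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption of this sketch: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])

        # Repetition: stop as soon as the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    # Final assignment, so the labels always match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Illustrative data: two well-separated groups of three points each.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
centroids, labels = kmeans(data, k=2)
```

Here `labels[i]` is the index of the cluster assigned to `data[i]`, and `centroids[j]` holds the coordinates of cluster `j`. Because the starting centroids are random, a different seed may number the clusters differently, and on less clearly separated data it may lead to a different final grouping.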
In the IPB application, users can supply their own data by loading a data set in .xlsx or .csv format. Once the data is loaded, two or three features can be selected to generate the clusters.
Note that the applet accepts features in both numerical and categorical formats, but the k-means algorithm supports only numerical variables for creating clusters. If no data set is available, the example option generates a default data set that can be used to explore the applet's functionality.
After the data set is read, the user must define the input parameter K and the desired number of iterations of the k-means algorithm. The applet then returns a graphical visualization, in two or three dimensions depending on the number of features selected, showing the generated clusters and the centroids together with their respective coordinates.
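Since only numerical features can be used for clustering, it can help to check which columns of a file qualify before selecting them. The sketch below shows one possible way to do this for .csv data with the Python standard library. The helper `numeric_columns` and the sample content are hypothetical, and the applet may handle this differently.

```python
import csv
import io


def numeric_columns(rows):
    """Return the names of the columns in which every value parses as a number."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    return [name for name in rows[0] if all(is_number(row[name]) for row in rows)]


# Hypothetical file content; in practice this would come from the user's .csv file.
sample_csv = io.StringIO(
    "height,weight,species\n"
    "1.2,30,cat\n"
    "0.4,8,rabbit\n"
    "1.5,42,dog\n"
)
rows = list(csv.DictReader(sample_csv))

# Keep only the numerical columns and convert each row to a tuple of floats.
features = numeric_columns(rows)
points = [tuple(float(row[name]) for name in features) for row in rows]
```

In this sample the categorical `species` column is left out, so `points` contains only numerical values that a distance-based method such as k-means can work with.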
Scientific Area:
Learning
Language/Environments:
Python
Target Group:
Basic
Keywords:
Unsupervised, Clustering, Association, Dataset, k-means