Smiley Face Clustering
This interactive demonstration shows how a simple form of clustering can be visualized. Initially, a smiley face is plotted using predetermined data points. By clicking the "Cluster" button, the data points are re-colored based on their proximity to predefined centroids, simulating a basic clustering process.
How It Works
Plotting Portion
The smiley face is generated using a JavaScript function that plots points in specific patterns to form a face, eyes, and a smile. For the face and eyes, we use the circle equation centered at x=4,5,6 and of radius 0.5, 3, 0.5, respectively. For the smile, we have used part of an ellipse having center at (5,3), major axis 2, and minor axis 1.
Coloring Portion
The clustering is simulated by assigning each point to the nearest of three centroids (4,7), (6,7), and (5,4). Here we chose the centroids by our own but there is a mathematical code behind it discussed in the next paragraph. Then we changed the color of the points who are close to these three centroids changing the point's color as red, green, and blue respectively.
Clustering Portion
Next, we clubbed all the points of the same color close to each other.
What is K_Means?
K-Means is a technique that we use to cluster a group of data.
Objective
The goal of K-means clustering is to partition
n observations into k clusters in which each observation belongs to the cluster with the nearest mean,
serving as a prototype of the cluster.
Steps and Math Involved
Initialization:
Choose k initial centroids (mean points) for the clusters. This can be done randomly or by using more sophisticated methods to help improve convergence and results.
(In some portion we can start with random clusters as well, the results should be almost identical)
Assignment Step:
Assign each data point xi to the nearest centroid. The "nearest" is usually determined by the Euclidean distance, though other distances can be used depending on the context.
The Euclidean distance between two points \(x\) and \(y\) in a plane is calculated as: \(d(x,y) = ||x - y||\). If \(x=(x_1,x_2)\) and \(y=(y_1,y_2)\) then
\(d(x,y) = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2}\).
In multidimensional space, this extends to:
Update Step: \(d(\bar x, \bar y) = \sqrt{\sum_{i=1}^n(x_i-y_i)^2}\) where \(\bar x =(x_1,\cdots, x_n)\).
After all points have been assigned to clusters, recalculate the centroids of each cluster as the mean of all points in the cluster.
The new centroid of a cluster is calculated as: \(c_j:=\text{centriod}(x_j)=\frac 1{|S_j|} \sum_{x_i \in S_j} x_i \)
Here, \(S_j\) represents the set of all points assigned to cluster \(j\) and \(|S_j|\) is the no of points in \(S_j\).
Iteration:
Repeat the assignment and update steps until the centroids no longer change significantly, indicating that the algorithm has converged, or until a specified number of iterations has been reached.
Objective Function
Objective Function:
K-means aims to minimize the within-cluster sum of squares (WCSS), which is the sum of squared distances between each point and its centroid within a cluster. The objective function, also known as the cost function, to be minimized is:
\(J=\sum_{j=1}^k\sum_{x_i \in S_j}||x_i-c_j||^2\).
Challenges and Considerations
Choosing k: Determining the optimal number of clusters \(k\) is a critical step and often not straightforward. Methods like the "elbow method" or the "silhouette score" can help in choosing k.
Sensitivity to Initial Centroids: The final clusters can be significantly influenced by the initial choice of centroids. Multiple runs with different initializations or using methods like k-means++ can mitigate this issue.
Convergence to Local Minima: K-means may converge to a local minimum of the cost function, which may not be the global minimum. Again, multiple runs and comparisons can help find better solutions.
The K-means algorithm is a powerful tool for exploratory data analysis and pattern recognition, with applications ranging from image compression to market segmentation. Its simplicity and efficiency make it widely used, although it is essential to understand its limitations and the underlying mathematical principles to apply it effectively.
Reference
- I want to thank my wife first as she gave me this idea of making some interactive notes. She saw it in some other portfolios of quant people. This is my first writing in html and Java.
- Being that said ChatGPT and Google made my life easier by telling me the syntax of html and Java.
Thanks a lot to all of you for reading this.