Optimizing MSE for clustering with balanced size constraints (Q2335035)

From MaRDI portal
scientific article

    Statements

    Optimizing MSE for clustering with balanced size constraints (English)
    13 November 2019
    Summary: Clustering groups data so that observations in the same group are more similar to each other than to those in other groups. \(k\)-means is a popular clustering algorithm in data mining whose objective is to optimize the mean squared error (MSE). The traditional \(k\)-means algorithm is not suitable for applications where the sizes of clusters need to be balanced. Given \(n\) observations, our objective is to optimize the MSE under the constraint that the observations are evenly divided into \(k\) clusters. In this paper, we propose an iterative method for the task of clustering with balanced size constraints. Each iteration splits into two steps, an assignment step and an update step. In the assignment step, the data are evenly assigned to the clusters. The balanced assignment task is formulated as an integer linear program (ILP), and we prove that the constraint matrix of this ILP is totally unimodular. The ILP can therefore be relaxed to a linear program (LP), which is efficiently solved with the simplex algorithm. In the update step, the new centers are updated as the centroids of the observations in the clusters. Assuming that there are \(n\) observations and the algorithm needs \(m\) iterations to converge, we show that the average time complexity of the proposed algorithm is \(O(m n^{1.65})\)--\(O(m n^{1.70})\). Experimental results indicate that, compared with state-of-the-art methods, the proposed algorithm is efficient and derives more accurate clusterings.
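    The two steps described in the summary (a balanced assignment solved as an LP relaxation, followed by a centroid update) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, \(n\) is assumed divisible by \(k\), and the LP is solved with SciPy's "highs" solver rather than a bare simplex routine. Because the constraint matrix is totally unimodular, the LP relaxation returns an integral (0/1) assignment.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    def balanced_assign(points, centers):
        """Assignment step: evenly assign n points to k clusters via an LP.

        Variables x[i, j] in [0, 1] indicate membership of point i in
        cluster j; total unimodularity makes the LP optimum integral.
        """
        n, k = len(points), len(centers)
        assert n % k == 0, "sketch assumes n divisible by k"
        # cost[i, j] = squared distance from point i to center j
        cost = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)

        # Equality constraints:
        #   each point belongs to exactly one cluster: sum_j x[i, j] = 1
        #   each cluster receives exactly n/k points:  sum_i x[i, j] = n/k
        A_eq = np.zeros((n + k, n * k))
        for i in range(n):
            A_eq[i, i * k:(i + 1) * k] = 1.0   # row for point i
        for j in range(k):
            A_eq[n + j, j::k] = 1.0            # row for cluster j
        b_eq = np.concatenate([np.ones(n), np.full(k, n / k)])

        res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, 1), method="highs")
        # Recover the (integral) assignment as cluster labels.
        return res.x.reshape(n, k).argmax(axis=1)

    def update_centers(points, labels, k):
        """Update step: new centers are centroids of the assigned points."""
        return np.array([points[labels == j].mean(axis=0) for j in range(k)])
    ```

    Iterating the two functions until the labels stop changing gives the overall scheme; the paper's complexity analysis concerns solving this LP efficiently with the simplex algorithm, which the generic solver above does not reproduce.
    
    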
    clustering
    constrained clustering
    balanced size constraints
    mean squared error
    linear program

    Identifiers