Questions tagged [k-means]

k-means is a clustering algorithm, implemented in popular data science tools. Use this tag for questions related to the k-means clustering algorithm itself, or to its use with the tools that implement it (alongside other tags specific to those tools).

463 votes
8 answers
284k views

Cluster analysis in R: determine the optimal number of clusters

How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters would be appropriate? How can I perform a cluster dendrogram analysis? n = 1000 ...
user2153893's user avatar
  • 4,667
234 votes
11 answers
129k views

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
bmasc's user avatar
  • 2,470
154 votes
20 answers
126k views

How do I determine k when using k-means clustering?

I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
Jason Baker's user avatar
121 votes
3 answers
186k views

Will scikit-learn utilize GPU?

Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm ...
blue-sky's user avatar
  • 52.9k
60 votes
6 answers
3k views

Branchless K-means (or other optimizations)

Note: I'd appreciate more of a guide to how to approach and come up with these kinds of solutions rather than the solution itself. I have a very performance-critical function in my system showing up ...
60 votes
18 answers
58k views

K-means algorithm variation with equal cluster size

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
pixelistik's user avatar
  • 7,740
55 votes
3 answers
75k views

Scikit Learn - K-Means - Elbow - criterion

Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
Linda's user avatar
  • 2,395
50 votes
7 answers
76k views

How to get the samples in each cluster?

I am using the sklearn.cluster KMeans package. Once I finish the clustering if I need to know which values were grouped together how can I do it? Say I had 100 data points and KMeans gave me 5 cluster....
user77005's user avatar
  • 1,819
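
For the question above, a minimal sketch of one way to recover the members of each cluster once every sample has a label. The `labels` list here is a hypothetical stand-in for the per-sample cluster indices a fitted model would provide (in scikit-learn, the `labels_` attribute of a fitted `KMeans`); the helper name `group_by_cluster` is an assumption for illustration, not part of any library.

```python
from collections import defaultdict


def group_by_cluster(samples, labels):
    """Return a dict mapping each cluster label to the samples assigned to it."""
    clusters = defaultdict(list)
    for sample, label in zip(samples, labels):
        clusters[label].append(sample)
    return dict(clusters)


# Hypothetical data: five samples and the cluster index assigned to each.
samples = [1.0, 1.2, 8.9, 9.1, 5.0]
labels = [0, 0, 1, 1, 2]
grouped = group_by_cluster(samples, labels)
```

The same grouping works for any sequence of samples (rows, tuples, document ids), as long as `labels` is in the same order as `samples`.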
49 votes
8 answers
91k views

Python k-means algorithm

I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.
Eeyore's user avatar
  • 2,126
48 votes
3 answers
47k views

Simple approach to assigning clusters for new data after k-means clustering

I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable ...
josliber's user avatar
  • 44.1k
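
The idea in the question above (find the closest existing cluster center for each new observation) can be sketched as follows. This is a plain-Python illustration assuming squared Euclidean distance; the original question is about R data frames, and the names `assign_to_nearest`, `centers`, and `new_observations` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(centers, observations):
    """For each observation, return the index of its closest center."""
    labels = []
    for obs in observations:
        distances = [squared_distance(obs, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical centers from an earlier clustering run, and new data to label.
centers = [(0.0, 0.0), (10.0, 10.0)]
new_observations = [(1.0, 1.0), (9.0, 8.0), (5.1, 5.1)]
labels = assign_to_nearest(centers, new_observations)
```

Ties go to the lowest-indexed center, because `list.index` returns the first match. The centers are not updated, so this is only a labelling step, not a re-fit.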
46 votes
4 answers
38k views

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). I get the following ...
Anna Dunietz's user avatar
42 votes
7 answers
32k views

Kmeans without knowing the number of clusters? [duplicate]

I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters. I ...
Legend's user avatar
  • 115k
41 votes
3 answers
34k views

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like the original K-Means ...
Anton Andreev's user avatar
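
As an illustration of the initialization step the question above asks about, here is a minimal sketch of k-means++ seeding in plain Python. It assumes squared Euclidean distance and the usual rule of drawing each new center with probability proportional to its squared distance from the nearest center already chosen; the function and variable names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Draw an index with probability proportional to d2[i].
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        newest = centers[-1]
        d2 = [min(d, squared_distance(p, newest)) for d, p in zip(d2, points)]
    return centers


points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
```

An already-chosen point has weight zero, so it cannot be drawn again unless all remaining weights are zero. After seeding, the ordinary k-means iterations proceed unchanged.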
40 votes
2 answers
52k views

Calculating the percentage of variance measure for k-means?

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
Legend's user avatar
  • 115k
37 votes
2 answers
66k views

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is pandas dataframe. This is sklearn.cluster.KMeans km = KMeans(n_clusters = n_Clusters) km.fit(dataset) prediction = km.predict(dataset) This is how I decide which entity belongs to ...
Dark Knight's user avatar
36 votes
3 answers
36k views

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
tumultous_rooster's user avatar
34 votes
1 answer
37k views

Cluster one-dimensional data optimally? [closed]

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the most optimal way to do k-means clustering in one-dimension?
Laciel's user avatar
  • 367
32 votes
2 answers
42k views

Scikit-learn: How to run KMeans on a one-dimensional array?

I have an array of 13.876(13,876) values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it ...
Irene's user avatar
  • 589
31 votes
5 answers
43k views

What is the difference between "k means" and "fuzzy c means" objective functions?

I am trying to see whether the performance of both can be compared based on the objective functions they work on.
n0ob's user avatar
  • 1,275
31 votes
3 answers
44k views

Understanding "score" returned by scikit-learn KMeans

I applied clustering on a set of text documents (about 100). I converted them to Tfidf vectors using TfIdfVectorizer and supplied the vectors as input to scikitlearn.cluster.KMeans(n_clusters=2, init='...
Prateek Dewan's user avatar
30 votes
1 answer
20k views

Online k-means clustering

Is there an online version of the k-Means clustering algorithm? By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
Theodor's user avatar
  • 5,606
28 votes
5 answers
95k views

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following ...
Jonathan Rhein's user avatar
27 votes
2 answers
54k views

What is the time complexity of k-means?

I was going through the k-means Wikipedia page. Based on the algorithm, I think the complexity is O(n*k*i) (n = total elements, k = number of clusters, i = number of iterations). So can someone explain this to me ...
parallel's user avatar
  • 313
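
To make the O(n*k*i) count from the question above concrete, here is a small sketch of Lloyd-style iterations that tallies distance evaluations. It is an illustration under stated assumptions (squared Euclidean distance, fixed starting centers, stopping when the centers no longer move), and the names are hypothetical. Each distance evaluation itself costs time proportional to the number of dimensions.

```python
def lloyd(points, centers, max_iter=100):
    """Run Lloyd-style k-means; also count iterations and distance evaluations."""
    centers = [tuple(c) for c in centers]
    labels = []
    iterations = 0
    distance_evals = 0
    for _ in range(max_iter):
        iterations += 1
        # Assignment step: every point is compared with every center (n * k).
        labels = []
        for p in points:
            dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            distance_evals += len(centers)
            labels.append(dists.index(min(dists)))
        # Update step: move each center to the mean of its members
        # (written for clarity rather than speed).
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels, iterations, distance_evals


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, labels, iterations, distance_evals = lloyd(points, [(0, 0), (10, 0)])
```

In this run the tally equals n * k * i exactly, which is where the three factors in the estimate come from; the number of iterations i depends on the data and the starting centers.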
27 votes
2 answers
22k views

Group n points in k clusters of equal size [duplicate]

Possible Duplicate: K-means algorithm variation with equal cluster size EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
Pierre-David Belanger's user avatar
26 votes
1 answer
42k views

Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
Nabila Shahid's user avatar
26 votes
6 answers
18k views

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...
John Hawksley's user avatar
26 votes
3 answers
36k views

Using K-means with cosine similarity - Python

I am trying to implement Kmeans algorithm in python which will use cosine distance instead of euclidean distance as distance metric. I understand that using different distance function can be fatal ...
ise372's user avatar
  • 261
26 votes
2 answers
4k views

Estimation of number of Clusters via gap statistics and prediction strength

I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into python scripts for the estimation of number of ...
Riyaz's user avatar
  • 1,450
25 votes
3 answers
72k views

kmeans scatter plot: plot different colors per cluster

I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a certain color. ...
jxn's user avatar
  • 7,865
25 votes
2 answers
25k views

K-Means: Lloyd,Forgy,MacQueen,Hartigan-Wong

I'm working with the K-Means algorithm in R and I want to figure out the differences between the 4 algorithms Lloyd, Forgy, MacQueen and Hartigan-Wong, which are available for the function "kmeans" in the ...
user2974776's user avatar
24 votes
11 answers
124k views

setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part

import os import numpy as np from scipy.signal import * import csv import matplotlib.pyplot as plt from scipy import signal from brainflow.board_shim import BoardShim, BrainFlowInputParams, LogLevels,...
ILovePhysics's user avatar
24 votes
5 answers
33k views

Changes of clustering results after each time run in Python scikit-learn

I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But, every time I run it I get different results. ...
user3430235's user avatar
23 votes
2 answers
16k views

How does pytorch backprop through argmax?

I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid. To ...
jammygrams's user avatar
22 votes
7 answers
30k views

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification. Suppose I have done a simple k-means clustering. Assume I have a lot of data, I use k-means clustering, then get 2 clusters A,...
Sirius Wang's user avatar
22 votes
6 answers
25k views

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters? What I want to be able to say is that ...
cmgerber's user avatar
  • 2,229
21 votes
4 answers
38k views

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score

I am trying to calculate silhouette score as I find the optimal number of clusters to create, but get an error that says: ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (...
Suhail Gupta's user avatar
  • 22.8k
21 votes
3 answers
19k views

How would I implement k-means with TensorFlow?

The intro tutorial, which uses the built-in gradient descent optimizer, makes a lot of sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my ...
Raphie Palefsky-Smith's user avatar
21 votes
5 answers
38k views

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data ...
Jaz's user avatar
  • 591
20 votes
1 answer
29k views

How to add k-means predicted clusters in a column to a dataframe in Python

I have a question about kmeans clustering in python. So I did the analysis that way: from sklearn.cluster import KMeans km = KMeans(n_clusters=12, random_state=1) new = data._get_numeric_data()....
Keithx's user avatar
  • 3,066
19 votes
3 answers
33k views

plot a document tfidf 2D graph

I would like to plot a 2D graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix but I do not ...
jxn's user avatar
  • 7,865
19 votes
2 answers
38k views

Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python

Using the following code to cluster geolocation coordinates results in 3 clusters: import numpy as np import matplotlib.pyplot as plt from scipy.cluster.vq import kmeans2, whiten ...
rokpoto.com's user avatar
  • 9,959
19 votes
5 answers
25k views

How to calculate BIC for k-means clustering in R

I've been using k-means to cluster my data in R but I'd like to be able to assess the fit vs. model complexity of my clustering using the Bayesian Information Criterion (BIC) and AIC. Currently the code I'...
UnivStudent's user avatar
18 votes
3 answers
36k views

OpenCV using k-means to posterize an image

I want to posterize an image with k-means and OpenCV using the C++ interface (cv namespace) and I get weird results. I need it to reduce some noise. This is my code: #include "cv.h" #include "...
nkint's user avatar
  • 11.7k
17 votes
2 answers
38k views

KMeans clustering in PySpark

I have a Spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude), using them as simple values. I want to extract 7 ...
user3245256's user avatar
  • 1,918
17 votes
4 answers
23k views

Can I use K-means algorithm on a string?

I am working on a Python project where I study RNA structure evolution (represented as a string, for example: "(((...)))" where the parentheses represent base pairs). The point is that I have an ...
Doni's user avatar
  • 173
17 votes
2 answers
21k views

How to set k-Means clustering labels from highest to lowest with Python?

I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-Means implementation from scikit-learn, ...
Sergio's user avatar
  • 377
17 votes
2 answers
60k views

How to identify Cluster labels in kmeans scikit learn

I am learning python scikit. The example given here displays the top occurring words in each Cluster and not Cluster name. http://scikit-learn.org/stable/auto_examples/document_clustering.html I ...
vij555's user avatar
  • 349
16 votes
1 answer
67k views

How to use silhouette score in k-means clustering from sklearn library?

I'd like to use silhouette score in my script, to automatically compute number of clusters in k-means clustering from sklearn. import numpy as np import pandas as pd import csv from sklearn.cluster ...
Jessica Martini's user avatar
16 votes
1 answer
36k views

initial centroids for scikit-learn kmeans clustering

If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class. This post (k-means with selected ...
webmaker's user avatar
  • 476
16 votes
2 answers
3k views

How to detect multiple objects with OpenCV in C++?

I got inspiration from this answer here, which is a Python implementation, but I need C++. That answer works very well. The idea is: use detectAndCompute to get keypoints, use kmeans to ...
Suge's user avatar
  • 2,857
