Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

463 votes
8 answers
284k views

Cluster analysis in R: determine the optimal number of clusters

How can I choose the best number of clusters to do a k-means analysis? After plotting a subset of the data below, how many clusters will be appropriate? How can I perform cluster dendro analysis? n = 1000 ...
user2153893's user avatar
  • 4,667
234 votes
11 answers
129k views

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
bmasc's user avatar
  • 2,470
199 votes
20 answers
235k views

Difference between classification and clustering in data mining? [closed]

Can someone explain what the difference is between classification and clustering in data mining? If you can, please give examples of both to understand the main idea.
Kristaps's user avatar
  • 2,031
154 votes
20 answers
126k views

How do I determine k when using k-means clustering?

I've been studying about k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
Jason Baker's user avatar
120 votes
8 answers
46k views

What is an intuitive explanation of the Expectation Maximization technique? [closed]

Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong if it is not a classifier. What is an intuitive explanation of this EM technique? ...
London guy's user avatar
  • 27.7k
115 votes
7 answers
92k views

1D Number Array Clustering

So let's say I have an array like this: [1,1,2,3,10,11,13,67,71] Is there a convenient way to partition the array into something like this? [[1,1,2,3],[10,11,13],[67,71]] I looked through similar ...
E.H.'s user avatar
  • 3,351
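
For the 1D case described above, one simple approach is to sort the values and start a new group whenever the gap to the previous value exceeds a threshold. This is only a sketch: the function name `split_by_gap` and the `max_gap` threshold are assumptions for illustration and are not taken from the question or its answers.

```python
def split_by_gap(values, max_gap):
    """Group sorted 1D values, starting a new group when a gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            clusters.append([cur])      # gap too large: open a new cluster
        else:
            clusters[-1].append(cur)    # close enough: extend the current cluster
    return clusters


print(split_by_gap([1, 1, 2, 3, 10, 11, 13, 67, 71], max_gap=5))
# prints [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
```

The result depends entirely on the chosen threshold, so this suits data where the gaps between groups are clearly larger than the gaps within them.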
99 votes
7 answers
69k views

Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less ...
London guy's user avatar
  • 27.7k
60 votes
18 answers
58k views

K-means algorithm variation with equal cluster size

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
pixelistik's user avatar
  • 7,740
55 votes
3 answers
75k views

Scikit Learn - K-Means - Elbow - criterion

Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
Linda's user avatar
  • 2,395
55 votes
2 answers
32k views

plotting results of hierarchical clustering on top of a matrix of data

How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure: This is Figure 6 from: A panel of ...
50 votes
7 answers
76k views

How to get the samples in each cluster?

I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I find out which values were grouped together? Say I had 100 data points and KMeans gave me 5 clusters....
user77005's user avatar
  • 1,819
50 votes
9 answers
39k views

scikit-learn: Predicting new points with DBSCAN

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7): from sklearn.cluster import DBSCAN dbscan = DBSCAN(random_state=0) dbscan.fit(X) However, I found that there was no built-in ...
slaw's user avatar
  • 6,727
49 votes
8 answers
91k views

Python k-means algorithm

I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.
Eeyore's user avatar
  • 2,126
47 votes
5 answers
43k views

Plot dendrogram using sklearn.AgglomerativeClustering

I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster since agglomerative clustering provided in ...
Shukhrat  Khannanov's user avatar
46 votes
4 answers
38k views

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). I get the following ...
Anna Dunietz's user avatar
42 votes
3 answers
40k views

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering,...
Alex Kinman's user avatar
  • 2,497
42 votes
3 answers
28k views

Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which ...
Jamie Bull's user avatar
  • 13.2k
41 votes
3 answers
34k views

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means ...
Anton Andreev's user avatar
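
As a rough illustration of the initialization step the question asks about: k-means++ seeding picks the first center uniformly at random, then picks each further center with probability proportional to the squared distance from each point to its nearest already-chosen center. The sketch below uses only the standard library; the name `kmeans_pp_init` and the `seed` parameter are assumptions made for this example.

```python
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using D^2-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]          # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice.
            centers.append(rng.choice(points))
        else:
            # Already-chosen points have weight 0, so they are not picked again.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeans_pp_init(pts, k=3))
```

After seeding, the usual k-means assignment and update iterations proceed unchanged.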
41 votes
6 answers
74k views

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data ...
Belinda Chiera's user avatar
40 votes
2 answers
52k views

Calculating the percentage of variance measure for k-means?

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
Legend's user avatar
  • 115k
38 votes
5 answers
28k views

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...
JamesT's user avatar
  • 417
37 votes
2 answers
66k views

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is pandas dataframe. This is sklearn.cluster.KMeans km = KMeans(n_clusters = n_Clusters) km.fit(dataset) prediction = km.predict(dataset) This is how I decide which entity belongs to ...
Dark Knight's user avatar
37 votes
4 answers
30k views

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that ...
Alexandros's user avatar
  • 2,180
37 votes
5 answers
38k views

sklearn agglomerative clustering linkage matrix

I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering. However, sklearn....
Presian Abarov's user avatar
36 votes
4 answers
33k views

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word ...
Renato Dinhani's user avatar
36 votes
3 answers
36k views

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
tumultous_rooster's user avatar
36 votes
2 answers
28k views

Extracting clusters from seaborn clustermap

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results). However I am having trouble figuring out how to programmatically extract ...
sedavidw's user avatar
  • 11.4k
35 votes
6 answers
36k views

How to group latitude/longitude points that are 'close' to each other?

I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet. At first it seemed I could just ...
Tim Lytle's user avatar
  • 17.5k
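
Any grouping of this kind needs a distance between two latitude/longitude pairs. A minimal sketch using the haversine formula is shown here, assuming a spherical Earth with a mean radius of 6,371 km; the names `haversine_feet` and `is_close` and the default 500-foot limit are illustrative assumptions, not part of the question.

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # assumed mean Earth radius in metres
METRES_PER_FOOT = 0.3048


def haversine_feet(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    metres = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    return metres / METRES_PER_FOOT


def is_close(p, q, limit_ft=500.0):
    """True when two (lat, lon) pairs are within limit_ft of each other."""
    return haversine_feet(p[0], p[1], q[0], q[1]) <= limit_ft


print(is_close((40.0, -75.0), (40.001, -75.0)))  # about 365 ft apart
print(is_close((40.0, -75.0), (40.01, -75.0)))   # about 3,650 ft apart
```

A predicate like this can feed a threshold-based or density-based grouping; which one fits depends on how chains of nearby points should be treated.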
34 votes
17 answers
6k views

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria. Imagine you are managing newspaper delivery persons. You have a set of street addresses, each of which is ...
carrier's user avatar
  • 32.6k
34 votes
3 answers
32k views

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering. Spectral clustering is a more general technique which can be applied not only to graphs, but also images, or any sort of data, ...
Alex Lenail's user avatar
  • 13.7k
34 votes
1 answer
37k views

Cluster one-dimensional data optimally? [closed]

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the most optimal way to do k-means clustering in one-dimension?
Laciel's user avatar
  • 367
34 votes
2 answers
24k views

Reordering matrix elements to reflect column and row clustering in naive python [duplicate]

I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reorder the data in the matrix to reflect the clustering and putting it all together. The clustering ...
Boris Gorelik's user avatar
33 votes
5 answers
58k views

DBSCAN for clustering of geographic location data

I have a dataframe with latitude and longitude pairs. Here is what my dataframe looks like. order_lat order_long 0 19.111841 72.910729 1 19.111342 72.908387 2 19.111342 72.908387 3 19....
Neil's user avatar
  • 8,057
33 votes
5 answers
20k views

Scikit Learn GridSearchCV without cross validation (unsupervised learning)

Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation. The ...
DataMan's user avatar
  • 3,295
33 votes
6 answers
19k views

Which machine learning library to use [closed]

I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on similarity or distance matrix) implements support vector ...
Björn Pollex's user avatar
33 votes
4 answers
32k views

Clustering Algorithm for Mapping Application

I'm looking into clustering points on a map (latitude/longitude). Are there any recommendations as to a suitable algorithm that is fast and scalable? More specifically, I have a series of latitude/...
Codebeef's user avatar
  • 43.7k
32 votes
7 answers
24k views

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs). I'm looking for something that takes in (x,y) pairs ...
Murat Derya Özen's user avatar
31 votes
5 answers
43k views

What is the difference between "k means" and "fuzzy c means" objective functions?

I am trying to see if the performance of both can be compared based on the objective functions they work on?
n0ob's user avatar
  • 1,275
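
For reference, the two objective functions in their standard textbook form are written out here. The notation is an assumption for this note: $x_i$ are the $n$ data points, $c_j$ the $k$ cluster centers, $r_{ij}$ a hard assignment, $u_{ij}$ a fuzzy membership, and $m > 1$ the fuzzifier.

```latex
% k-means: each point belongs to exactly one cluster
J_{\text{k-means}} = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \, \lVert x_i - c_j \rVert^2,
\qquad r_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{k} r_{ij} = 1

% fuzzy c-means: each point has a degree of membership in every cluster
J_{\text{FCM}} = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2,
\qquad u_{ij} \in [0, 1], \quad \sum_{j=1}^{k} u_{ij} = 1, \quad m > 1
```

The only structural difference is the weight on each squared distance: a binary indicator in k-means versus a membership raised to the power $m$ in fuzzy c-means.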
31 votes
14 answers
13k views

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: Any human looking at this would be able to tell me that this helicopter is based out ...
Ryan's user avatar
  • 15k
31 votes
2 answers
28k views

python scikit-learn clustering with missing data

I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column. With scikit-learn, missing data is not possible. ...
Michael Hecht's user avatar
30 votes
1 answer
20k views

Online k-means clustering

Is there an online version of the k-Means clustering algorithm? By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
Theodor's user avatar
  • 5,606
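
One common way to process points one at a time is the sequential (MacQueen-style) update: assign the incoming point to its nearest center, then move that center toward the point by a step of 1/count. The sketch below assumes the centers have already been initialized somehow; the name `online_kmeans_update` and the list-based data layout are illustrative choices.

```python
def online_kmeans_update(centers, counts, x):
    """Assign x to its nearest center and update that center in place.

    centers: list of lists of floats, counts: points seen per center.
    Returns the index of the center that absorbed x.
    """
    # Nearest center by squared Euclidean distance.
    j = min(
        range(len(centers)),
        key=lambda i: sum((c - a) ** 2 for c, a in zip(centers[i], x)),
    )
    counts[j] += 1
    step = 1.0 / counts[j]
    # Running-mean update: the center stays the mean of the points assigned to it.
    centers[j] = [c + step * (a - c) for c, a in zip(centers[j], x)]
    return j


centers = [[0.0], [10.0]]
counts = [1, 1]
online_kmeans_update(centers, counts, [2.0])
print(centers)  # prints [[1.0], [10.0]]
```

Each update costs one pass over the k centers, and no past points need to be stored.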
30 votes
1 answer
49k views

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots of heatmap.2. The appropriate results depend on the analysis but I'm trying to ...
28 votes
5 answers
95k views

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric - at least as far as I can see - normalized values of the following form: When I am executing k <- kmeans(norm,center=3) I am receiving the following ...
Jonathan Rhein's user avatar
28 votes
1 answer
16k views

How to compute cluster assignments from linkage/distance matrices

If you have this hierarchical clustering call in scipy in Python: from scipy.cluster.hierarchy import linkage # dist_matrix is long form distance matrix linkage_matrix = linkage(squareform(...
27 votes
3 answers
38k views

Clustering values by their proximity in python (machine learning?) [duplicate]

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set. The sorted output is something like this: ...
PCoelho's user avatar
  • 7,920
27 votes
2 answers
22k views

Group n points in k clusters of equal size [duplicate]

Possible Duplicate: K-means algorithm variation with equal cluster size EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
Pierre-David Belanger's user avatar
27 votes
1 answer
2k views

Clustering (fkmeans) with Mahout using Clojure

I am trying to write a short script to cluster my data via clojure (calling Mahout classes though). I have my input data in this format (which is an output from a php script) format: (tag) (image) (...
Jeffrey04's user avatar
  • 6,238
26 votes
1 answer
42k views

Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
Nabila Shahid's user avatar
26 votes
3 answers
26k views

Understanding concept of Gaussian Mixture Models

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means. Here is what I have understood, please ...
StuckInPhDNoMore's user avatar
26 votes
6 answers
18k views

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...
John Hawksley's user avatar
