### Question :

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set.

The sorted output is something like this:

[1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

If you lay these values down on a spreadsheet you see that they make up groups

[1,1,5,6,1,5] [10,22,23,23] [50,51,51,52] [100,112,130] [500,512,600] [12000,12230]

Is there a way to programatically get those groupings?

Maybe some clustering algorithm using a machine learning library? Or am I overthinking this?

I’ve looked at scikit but their examples are way too advanced for my problem…

##
Answer #1:

# Don’t use clustering for 1-dimensional data

Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, **sort** it, and look for the **largest gaps**. This is trivial and *fast* in 1d, and not possible in 2d. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to split the data set.

There are a number of duplicates of this question:

##
Answer #2:

A good option if you don’t know the number of clusters is `MeanShift`

:

```
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
x = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
X = np.array(zip(x,np.zeros(len(x))), dtype=np.int)
bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
for k in range(n_clusters_):
my_members = labels == k
print "cluster {0}: {1}".format(k, X[my_members, 0])
```

Output for this algorithm:

```
cluster 0: [ 1 1 5 6 1 5 10 22 23 23 50 51 51 52]
cluster 1: [100 112 130]
cluster 2: [500 512]
cluster 3: [12000]
cluster 4: [12230]
cluster 5: [600]
```

Modifying `quantile`

variable you can change the clustering number selection criteria

##
Answer #3:

You can use clustering to group these. The trick is to understand that there are two dimensions to your data: the dimension you can see, and the “spatial” dimension that looks like [1, 2, 3… 22]. You can create this matrix in numpy like so:

```
import numpy as np
y = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
x = range(len(y))
m = np.matrix([x, y]).transpose()
```

Then you can perform clustering on the matrix, with:

```
from scipy.cluster.vq import kmeans
kclust = kmeans(m, 5)
```

kclust’s output will look like this:

```
(array([[ 11, 51],
[ 15, 114],
[ 20, 12115],
[ 4, 9],
[ 18, 537]]), 21.545126372346271)
```

For you, the most interesting part is the first column of the matrix, which says what the centers are along that x dimension:

```
kclust[0][:, 0]
# [20 18 15 4 11]
```

You can then assign your points to a cluster based on which of the five centers they are closest to:

```
assigned_clusters = [abs(cluster_indices - e).argmin() for e in x]
# [3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 0, 0, 0]
```