The 365 Data Science team is proud to invite you to our community forum, a well-built system to support your questions and give you the chance to share your knowledge and help others on their path to becoming Data Science specialists. Anybody can ask a question, anybody can answer, and the best answers are voted up and moderated by our team.

Hyperparameter tuning in DBSCAN clustering


Hi everybody.
I read the article in "" about customer segmentation with DBSCAN clustering. The article says nothing about DBSCAN hyperparameter tuning; the writer simply used eps = 0.5 and min_samples = 4.
I tried this example my own way. My code is below:
STEP 1: Read a CSV file into a data frame (df_data).
STEP 2: Standardize the data in df_data with StandardScaler.
STEP 3: Hyperparameter tuning:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Candidate values for min_samples and eps
range_min = list(range(2, 51))
range_eps = [x / 100.0 for x in range(1, 51)] + \
            [y / 10.0 for y in range(1, 51)] + \
            [round(z, 2) for z in np.arange(1.10, 1.31, 0.01)]

dic = {}
for m in range_min:
    for e in range_eps:
        model_1 = DBSCAN(eps=e, min_samples=m).fit(df_data)
        labels = model_1.labels_
        # silhouette_score requires at least 2 distinct labels
        if len(set(labels)) > 1:
            silhouette_avg = silhouette_score(df_data, labels)
            if silhouette_avg > 0:
                dic[str(m) + " - " + str(e)] = silhouette_avg
                print("min_samples value is:", m, "eps value is:", e,
                      "The average silhouette score is:", silhouette_avg)

max_key = max(dic, key=dic.get)
print("parameter values are:", max_key)
print("maximum silhouette score value is:", dic[max_key])

For eps = 0.5 and min_samples = 4 the silhouette score is 0.15198104684634364.

For eps = 1.22 and min_samples = 13 the silhouette score is 0.2912153228904898.

Since 0.29 is greater than 0.15, I chose eps = 1.22 and min_samples = 13 and then ran DBSCAN clustering with these parameter values.
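As a sanity check on the eps choice, a common heuristic (not used in the article; this is my own sketch on synthetic data standing in for df_data) is the k-distance curve: sort each point's distance to its k-th nearest neighbor for k = min_samples, and the bend ("elbow") of that sorted curve suggests an eps candidate.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stands in for the standardized df_data

k = 13  # tie k to the candidate min_samples value
# n_neighbors = k + 1 because each point is returned as its own 0-distance neighbor
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)   # shape: (n_samples, k + 1)
k_dist = np.sort(distances[:, -1])  # distance to the k-th real neighbor, ascending

# Instead of plotting, print a few quantiles to locate the bend numerically.
for q in (0.50, 0.90, 0.95):
    print("quantile %.2f of %d-NN distance: %.3f" % (q, k, np.quantile(k_dist, q)))
```

On real data you would usually plot k_dist and read eps off the elbow by eye.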
STEP 4:
from sklearn import metrics
model = DBSCAN(eps=1.22, min_samples=13).fit(df_data)
core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
core_samples_mask[model.core_sample_indices_] = True
labels = model.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Number of outliers
n_outlier = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters)
print("Estimated number of outlier points: %d" % n_outlier)


{0, -1} Estimated number of clusters: 1 Estimated number of outlier points: 3 Silhouette Coefficient: 0.291
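One thing I suspect (an assumption on my part, not stated in the article): silhouette_score treats the noise label -1 as a cluster of its own, so a score of 0.291 for one cluster plus a few outliers may be inflated. A minimal sketch on synthetic data of scoring with the noise points excluded:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 4 tight blobs (illustrative parameters only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

mask = labels != -1  # drop noise points before scoring
if len(set(labels[mask])) > 1:
    score = silhouette_score(X[mask], labels[mask])
    print("silhouette with noise excluded: %.3f" % score)
else:
    print("fewer than 2 real clusters found; silhouette is undefined")
```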

In that article the number of clusters is 4 and the number of outliers is 54, which is different from my results.

Could you please check my code and show me my mistake?

If you know a better way to do hyperparameter tuning for DBSCAN, please let me know.

Thank you.


1 Answer

365 Team

Hi nazanin, 
thanks for reaching out! 
In our course on customer segmentation we use a combination of a dimensionality reduction technique (PCA) and a clustering technique (K-means) to find customer clusters.
You can check out the section on K-means and then K-means Clustering based on Principal Components Analysis. Here is a link to the first video:
As for your code, since it is outside the scope of our courses, it would be hard to assist you with the specifics. You could try asking other members of the Kaggle community, who would likely be better able to help you there.
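For reference, the PCA plus K-means combination mentioned above can be sketched roughly as follows (illustrative parameters and synthetic data, not the course's exact settings):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))  # stands in for the customer features

# Standardize, project onto principal components, then cluster the scores
X_std = StandardScaler().fit_transform(X)
scores = PCA(n_components=3).fit_transform(X_std)
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

print("segment sizes:", np.bincount(segments))
```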
Hope this helps!
365 Eli
