The 365 Data Science team is proud to invite you to our community forum, a well-built system to support your questions and give you the chance to share your knowledge and help others on their path to becoming Data Science specialists. Anybody can ask a question, anybody can answer, and the best answers are voted up and moderated by our team.

Hyperparameter tuning in DBSCAN clustering


Hi everybody.
I read the article in "" about customer segmentation with DBSCAN clustering. The article says nothing about DBSCAN hyperparameter tuning; the writer simply used eps = 0.5 and min_samples = 4.
I tried this example my own way. My code is below:
STEP 1: Read a CSV file into a data frame (df_data).
STEP 2: Standardize the data in df_data with StandardScaler.
STEP 3: Hyperparameter tuning:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Candidate values for min_samples and eps
range_min = list(range(2, 51))
range_eps = [x / 100.0 for x in range(1, 51)] + \
            [y / 10.0 for y in range(1, 51)] + \
            [round(z, 2) for z in np.arange(1.10, 1.31, 0.01)]

dic = {}
for m in range_min:
    for e in range_eps:
        model_1 = DBSCAN(eps=e, min_samples=m).fit(df_data)
        labels = model_1.labels_
        # silhouette_score requires at least 2 distinct labels
        if len(set(labels)) > 1:
            silhouette_avg = silhouette_score(df_data, labels)
            if silhouette_avg > 0:
                dic[str(m) + " - " + str(e)] = silhouette_avg
                print("min_samples value is:", m, "eps value is:", e,
                      "The average silhouette score is:", silhouette_avg)

max_key = max(dic, key=dic.get)
print("parameter values are:", max_key)
print("maximum silhouette score value is:", dic[max_key])

For eps = 0.5 and min_samples = 4 the silhouette score is 0.15198104684634364.

For eps = 1.22 and min_samples = 13 the silhouette score is 0.2912153228904898.

Since 0.29 is greater than 0.15, I chose eps = 1.22 and min_samples = 13 and then ran DBSCAN clustering with these parameter values.
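As a sanity check on the eps choice, a common heuristic (not used in the article; this is my own sketch on synthetic data standing in for df_data) is the k-distance curve: sort each point's distance to its k-th nearest neighbor for k = min_samples, and the bend ("elbow") of that sorted curve suggests an eps candidate.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stands in for the standardized df_data

k = 13  # tie k to the candidate min_samples value
# n_neighbors = k + 1 because each point is returned as its own 0-distance neighbor
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)   # shape: (n_samples, k + 1)
k_dist = np.sort(distances[:, -1])  # distance to the k-th real neighbor, ascending

# Instead of plotting, print a few quantiles to locate the bend numerically.
for q in (0.50, 0.90, 0.95):
    print("quantile %.2f of %d-NN distance: %.3f" % (q, k, np.quantile(k_dist, q)))
```

On real data you would usually plot k_dist and read eps off the elbow by eye.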
STEP 4:
from sklearn import metrics
model = DBSCAN(eps=1.22, min_samples=13).fit(df_data)
core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
core_samples_mask[model.core_sample_indices_] = True
labels = model.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Number of outliers
n_outlier = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters)
print("Estimated number of outlier points: %d" % n_outlier)


{0, -1} Estimated number of clusters: 1 Estimated number of outlier points: 3 Silhouette Coefficient: 0.291
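One thing I suspect (an assumption on my part, not stated in the article): silhouette_score treats the noise label -1 as a cluster of its own, so a score of 0.291 for one cluster plus a few outliers may be inflated. A minimal sketch on synthetic data of scoring with the noise points excluded:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 4 tight blobs (illustrative parameters only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

mask = labels != -1  # drop noise points before scoring
if len(set(labels[mask])) > 1:
    score = silhouette_score(X[mask], labels[mask])
    print("silhouette with noise excluded: %.3f" % score)
else:
    print("fewer than 2 real clusters found; silhouette is undefined")
```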

In that article the number of clusters is 4 and the number of outliers is 54, which is different from my results.

Could you please check my code and show me my mistake?

If you know a better way to do hyperparameter tuning for DBSCAN, please let me know.

Thank you.


1 Answer

365 Team

Hi nazanin, 
thanks for reaching out! 
In our course on customer segmentation we use a combination of a dimensionality reduction technique (PCA) and a clustering technique (K-means) to find customer clusters.
You can check out the section on K-means and then K-means Clustering based on Principal Components Analysis. Here is a link to the first video:
As for your code, since it is outside the scope of our courses, it would be hard to assist you with the specifics. You could try asking other members of the Kaggle community, who would likely be better able to help you there.
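For reference, the PCA plus K-means combination mentioned above can be sketched roughly as follows (illustrative parameters and synthetic data, not the course's exact settings):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))  # stands in for the customer features

# Standardize, project onto principal components, then cluster the scores
X_std = StandardScaler().fit_transform(X)
scores = PCA(n_components=3).fit_transform(X_std)
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

print("segment sizes:", np.bincount(segments))
```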
Hope this helps!
365 Eli
