Hyperparameter tuning in DBSCAN clustering
Hi everybody.
I read the article at https://www.kaggle.com/aeyjpn/women-love-shopping-clustering-with-dbscan about
customer segmentation with DBSCAN clustering. The article says nothing about DBSCAN hyperparameter
tuning; the writer simply used eps = 0.5 and min_samples = 4.
I did this example my own way. My code is below:
STEP 1: Read a CSV file into a data frame (df_data).
STEP 2: Standardize the data in df_data with StandardScaler.
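A minimal sketch of what STEPs 1 and 2 might look like (the filename customers.csv and the numeric-columns-only selection are assumptions for illustration, not from the original article):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# STEP 1: load the raw data (filename is illustrative)
df_raw = pd.read_csv("customers.csv")

# STEP 2: standardize the numeric columns so that eps is measured
# on a comparable scale for every feature
numeric = df_raw.select_dtypes(include="number")
scaler = StandardScaler()
df_data = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)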
STEP 3: Hyperparameter tuning:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# grid of candidate parameter values
range_min = [x for x in range(2, 51, 1)]
range_eps = [x / 100.0 for x in range(1, 51, 1)] + \
            [y / 10.0 for y in range(1, 51, 1)] + \
            [round(z, 2) for z in np.arange(1.10, 1.31, 0.01)]

dic = {}
for m in range_min:
    for e in range_eps:
        model_1 = DBSCAN(eps = e, min_samples = m).fit(df_data)
        core_samples_mask = np.zeros_like(model_1.labels_, dtype = bool)
        core_samples_mask[model_1.core_sample_indices_] = True
        labels = model_1.labels_
        # silhouette_score needs at least two distinct labels
        if len(set(labels)) > 1:
            silhouette_avg = silhouette_score(df_data, labels)
            if silhouette_avg > 0:
                dic[str(m) + " - " + str(e)] = silhouette_avg
                print("min_samples value is: " + str(m) + " eps value is: " + str(e),
                      "The average silhouette_score is:", silhouette_avg)

max_key = max(dic, key = dic.get)
print("parameter values are: ", max_key)
print("maximum silhouette score value is: ", dic[max_key])

For eps = 0.5 & min_samples = 4 the silhouette_score is: 0.15198104684634364.
For eps = 1.22 & min_samples = 13 the silhouette score is: 0.2912153228904898.
BUT as we can see 0.29 is greater than 0.15, so I chose eps = 1.22 & min_samples = 13 and ran DBSCAN clustering with these parameter values.

STEP 4:
from sklearn import metrics

model = DBSCAN(eps = 1.22, min_samples = 13).fit(df_data)
core_samples_mask = np.zeros_like(model.labels_, dtype = bool)
core_samples_mask[model.core_sample_indices_] = True
labels = model.labels_

# Get number of clusters in labels, ignoring noise if present.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Get number of outliers
n_outlier = list(labels).count(-1)

print(set(labels))
print("Estimated number of clusters: %d" % n_clusters)
print("Estimated number of outlier points: %d" % n_outlier)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(df_data, labels))
OUTPUT:
{0, -1}
Estimated number of clusters: 1
Estimated number of outlier points: 3
Silhouette Coefficient: 0.291
In that article the number of clusters is 4 and the number of outliers is 54,
which is different from my result.
Could you please check my code and show me my mistake?
If you know a better way to do hyperparameter tuning for DBSCAN, please let me know.
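For reference, a heuristic that is often suggested for choosing eps is the k-distance "elbow" plot: sort every point's distance to its k-th nearest neighbor and read eps off the knee of the curve. A minimal sketch, assuming the standardized df_data from STEP 2 (k = 13 here only mirrors the min_samples value found above):

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np

k = 13
nbrs = NearestNeighbors(n_neighbors=k).fit(df_data)
# kneighbors on the training data returns each point itself as the
# first neighbor, so the last column holds the k-th neighbor distance
distances, _ = nbrs.kneighbors(df_data)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to %d-th nearest neighbor" % k)
plt.show()  # candidate eps = the y-value at the elbow of this curve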
Thank you.
1 answer (0 marked as helpful)
Hi nazanin,
Thanks for reaching out!
In our course on customer segmentation we use a combination of a dimensionality reduction technique (PCA) and a clustering technique (K-means) to find customer clusters.
You can check out the section on K-means and then K-means Clustering based on Principal Components Analysis. Here is a link to the first video: https://learn.365datascience.com/courses/customer-analytics-in-python/k-means-clustering-background
As for your code, since it is outside the scope of our courses, it'd be hard to assist you with the specifics. You could try asking other members of the Kaggle community, who'd likely be better able to help you there.
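For illustration, a minimal sketch of that PCA + K-means combination (the choices of 3 components and 4 clusters are assumptions for the example, not course recommendations):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce the standardized data to a few principal components
pca = PCA(n_components=3)
scores = pca.fit_transform(df_data)

# Cluster on the component scores rather than on the raw features
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segment_labels = kmeans.fit_predict(scores)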
Hope this helps!
Best,
365 Eli