ValueError: Found unknown categories [nan] in column 2 during fit
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score,auc
from sklearn.metrics import roc_curve,roc_auc_score,plot_roc_curve
from io import StringIO
path = r"C:\Users\thund\Downloads\Boat.csv"
data = pd.read_csv(path) # pip install xlrd
print(data.shape)
print(data.columns)
print(data.isnull().sum())
print (data.dropna(axis=0)) #dropping rows that have missing values
print (data['Class'].value_counts())
print(data['Class'].value_counts().plot(kind = 'bar'))
#plt.show()
data['safety'].value_counts().plot(kind = 'bar')#plt.show()
import seaborn as snssns.countplot(data['demand'], hue = data['Class'])
#plt.show()
X = data.drop(['Class'], axis = 1)y = data['Class']
from sklearn.preprocessing import OrdinalEncoder
demand_category = ['low', 'med', 'high', 'vhigh']
maint_category = ['low', 'med', 'high', 'vhigh']
seats_category = ['2', '3', '4', '5more']
passenger_category = ['2', '4', 'more']
storage_category = ['Nostorage', 'small', 'med']
safety_category = ['poor', 'good', 'vgood']
all_categories = [demand_category, maint_category,seats_category,passenger_category,storage_category,safety_category]
oe = OrdinalEncoder(categories= all_categories)
X = oe.fit_transform( data[['demand','maint', 'seats', 'passenger', 'storage', 'safety']])
Dataset: https://drive.google.com/file/d/1O0sYZGJep4JkrSgGeJc5e_Nlao2bmegV/view?usp=sharing
I don't know why I keep encountering the 'ValueError: Found unknown categories [nan] in column 2 during fit' error, since I need the encoder to preprocess the data. I am quite new to Python and would appreciate an in-depth explanation of this, please.
Hey Edward,
Thank you for your question!
Unfortunately, since the dataset doesn't seem to be part of the course, I am not able to run the code and reproduce the error you are getting.
Kind regards,
365 Hristina
Hey again Edward,
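Thank you for sharing the dataset.
What I notice is that you never actually drop the datapoints which have missing values. Calling data.dropna(axis=0) returns a new DataFrame rather than modifying data in place, so printing its result leaves data unchanged and the NaN values still reach the encoder. This is, in fact, the reason you are getting the error. What I suggest is to replace the following line in your code: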
print (data.dropna(axis=0))
with the following:
data = data.dropna(axis=0)
Doing so, you will notice a difference when printing data.shape before and after performing this operation. For me, the value changed from (1724, 7) to (1714, 7), which means that 10 datapoints have been removed.
One thing I would like to mention is that you might have gotten a warning by executing the following line of code:
sns.countplot(data['demand'], hue = data['Class'])
The reason is that from version 0.12 of seaborn onwards, you need to provide the keywords explicitly, as you've done with hue, namely:
sns.countplot(x = data['demand'], hue = data['Class'])
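As a small self-contained sketch of the keyword form (the toy DataFrame is made up for illustration, and the Agg backend is only chosen here so the snippet runs without opening a window), you can also pass the DataFrame through the data keyword and refer to the columns by name:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window opens
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with the same column names as in the question.
toy = pd.DataFrame({
    "demand": ["low", "med", "low", "high", "med", "low"],
    "Class": ["acc", "unacc", "acc", "unacc", "acc", "unacc"],
})

# Every argument is passed by keyword, so no positional-argument warning applies.
ax = sns.countplot(data=toy, x="demand", hue="Class")
x_label = ax.get_xlabel()
plt.close(ax.figure)
```

In your own script, you would call plt.show() instead of closing the figure.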
Lastly, and this might just be a result of formatting while copy-pasting, two pairs of statements have ended up merged onto single lines. The following lines of code
import seaborn as snssns.countplot(data['demand'], hue = data['Class'])
X = data.drop(['Class'], axis = 1)y = data['Class']
should instead read
import seaborn as sns
sns.countplot(x = data['demand'], hue = data['Class'])
X = data.drop(['Class'], axis = 1)
y = data['Class']
Hope this helps you run the code without any errors!
Kind regards,
365 Hristina