ValueError: Found unknown categories [nan] in column 2 during fit

Question

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score,auc
from sklearn.metrics import roc_curve,roc_auc_score,plot_roc_curve
from io import StringIO


path = r"C:\Users\thund\Downloads\Boat.csv"
data = pd.read_csv(path)  # pip install xlrd

print(data.shape)
print(data.columns)

print(data.isnull().sum())
print (data.dropna(axis=0))  #dropping rows that have missing values

print (data['Class'].value_counts())
print(data['Class'].value_counts().plot(kind = 'bar'))
#plt.show()

data['safety'].value_counts().plot(kind = 'bar')#plt.show()


import seaborn as snssns.countplot(data['demand'], hue = data['Class'])
#plt.show()

X = data.drop(['Class'], axis = 1)y = data['Class']

from sklearn.preprocessing import OrdinalEncoder
demand_category = ['low', 'med', 'high', 'vhigh']
maint_category = ['low', 'med', 'high', 'vhigh']
seats_category = ['2', '3', '4', '5more']
passenger_category = ['2', '4', 'more']
storage_category = ['Nostorage', 'small', 'med']
safety_category = ['poor', 'good', 'vgood']
all_categories = [demand_category, maint_category,seats_category,passenger_category,storage_category,safety_category]


oe = OrdinalEncoder(categories= all_categories)
X = oe.fit_transform( data[['demand','maint', 'seats', 'passenger', 'storage', 'safety']])

Dataset: https://drive.google.com/file/d/1O0sYZGJep4JkrSgGeJc5e_Nlao2bmegV/view?usp=sharing

I don't know why I keep getting 'ValueError: Found unknown categories [nan] in column 2 during fit'. I need the encoder to preprocess the data. I am quite new to Python and would appreciate an in-depth explanation, please.

Answer 1

Hey Edward,

Thank you for your question!

Unfortunately, since the dataset doesn't seem to be part of the course, I am not able to run the code and reproduce the error you are getting.

Kind regards,
365 Hristina

Answer 2

Hey again Edward,

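Thank you for sharing the dataset.

What I notice is that the rows with missing values are never actually removed, and this is the reason you are getting the error. data.dropna(axis=0) returns a new DataFrame and leaves data itself unchanged, so printing the result simply discards it. The NaN values are therefore still present when OrdinalEncoder is fitted. Since nan is not in the category list you supplied for column 2 of the selected columns (counting from 0, that is 'seats'), fit raises the ValueError. What I suggest is to substitute the following line in your code: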

print (data.dropna(axis=0))

with the following:

data = data.dropna(axis=0)

Doing so, you will notice a difference when printing data.shape before and after performing this operation. For me, the shape changed from (1724, 7) to (1714, 7). This suggests that 10 rows containing missing values have been removed.
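To see why the reassignment matters, here is a minimal sketch on a toy DataFrame. The values below are made up for illustration and are not taken from Boat.csv; only the column names 'seats' and 'Class' mirror your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Boat.csv; one 'seats' value is missing.
toy = pd.DataFrame({
    "seats": ["2", np.nan, "4"],
    "Class": ["acc", "unacc", "acc"],
})

# dropna returns a new DataFrame; without assignment, toy is left unchanged.
toy.dropna(axis=0)
rows_before = toy.shape[0]

# Reassigning keeps the cleaned result.
toy = toy.dropna(axis=0)
rows_after = toy.shape[0]

print(rows_before, rows_after)
```

The first call has no lasting effect, which is the same situation as print(data.dropna(axis=0)) in your script.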

One thing I would like to mention is that you might have gotten a warning by executing the following line of code:

sns.countplot(data['demand'], hue = data['Class'])

The reason is that, from seaborn version 0.12 onwards, you need to pass these arguments as keywords explicitly, as you've already done with hue, namely:

sns.countplot(x = data['demand'], hue = data['Class'])

Lastly, some statements in your code have been merged onto a single line, which might just be a result of formatting while copy-pasting. The following lines of code

import seaborn as snssns.countplot(data['demand'], hue = data['Class'])
X = data.drop(['Class'], axis = 1)y = data['Class']

should instead read

import seaborn as sns
sns.countplot(data['demand'], hue = data['Class'])
X = data.drop(['Class'], axis = 1)
y = data['Class']
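With the missing rows dropped and the statements split, the encoding step can be sketched as follows. This is only an illustration on a made-up toy DataFrame with two of your columns (the rows are hypothetical, the category lists are copied from your code), so you can check the behaviour without the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows; the real data comes from Boat.csv.
toy = pd.DataFrame({
    "seats": ["2", "5more", np.nan, "3"],
    "safety": ["poor", "vgood", "good", "good"],
    "Class": ["unacc", "acc", "acc", "unacc"],
})

# Remove rows with missing values and keep the result.
toy = toy.dropna(axis=0)

# Category orders copied from the question.
seats_category = ["2", "3", "4", "5more"]
safety_category = ["poor", "good", "vgood"]

# One category list per column, in the same order as the selected columns.
oe = OrdinalEncoder(categories=[seats_category, safety_category])
X = oe.fit_transform(toy[["seats", "safety"]])
y = toy["Class"]

print(X)
```

If you comment out the dropna line, fit should fail with the same "Found unknown categories [nan]" error, because nan is not listed in seats_category.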

Hope this helps you run the code without any errors!

Kind regards,
365 Hristina
