Resolved: How can we be sure that missing values are not all from different IDs?

Question

Good morning. You said that we can drop the rows with missing values if they make up less than 5% of the total rows. 172 missing values are in Price and 150 are in EngineV. If some of these fall in the same rows, the number of rows with missing values may be less than 5%, but how can we be sure of that? In the worst case they are all in different rows, which means that we drop 172 + 150 = 322 rows, and 322/4345 ≈ 7%, so maybe we shouldn't drop these rows because they are more than 5%. Thank you.

Answer 1

Hey Alessandro,

Thank you for your question, that is an excellent point to make.

By applying the .describe(include='all') method to data and then to data_no_mv, we can see that 320 rows have been removed: there are 4345 datapoints in the former DataFrame and 4025 in the latter. That means we are removing about 7.4% of the datapoints (320/4345), which is indeed more than 5%. Doing this is fine for the purpose of the exercise. It is, however, incorrect to say that we are removing less than 5% of the observations.
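As a minimal sketch of that before-and-after check, the snippet below compares row counts directly. It uses a small made-up DataFrame in place of the course dataset and assumes that data_no_mv is produced with data.dropna(axis=0); neither the toy values nor that exact call are taken from the course material.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the course dataset (values are made up for illustration)
data = pd.DataFrame({
    'Price':   [4200.0, np.nan, 13300.0, np.nan, 18300.0, 7900.0],
    'EngineV': [2.0,    2.9,    np.nan,  np.nan, 2.0,     1.6],
})

# Assumption: data_no_mv drops every row that has at least one null
data_no_mv = data.dropna(axis=0)

# Rows removed, in absolute terms and as a share of the original rows
rows_removed = len(data) - len(data_no_mv)
share_removed = rows_removed / len(data)

print(f"{rows_removed} of {len(data)} rows removed ({share_removed:.1%})")
```

In the toy data each column has two nulls, but one row is null in both columns, so three rows are removed rather than four.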

A good question that you pose is 'how can we know beforehand the number of observations containing a null value?' One solution I have come up with is to extract (as an ndarray) the indices of all observations whose Price column is null and concatenate this array with one containing the indices of all observations whose EngineV column is null, thereby creating a single array with 322 numbers. This array is then converted to a set - sets have the property of removing duplicate values. The length of the set is 320, indicating that there are indeed 320 observations containing at least one null value.

Below is the piece of code I've used:

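import numpy as np

# Index labels of rows with a null Price, joined with those with a null EngineV;
# converting to a set drops labels that appear in both arrays.
set_with_nulls = set(np.concatenate((data[data['Price'].isnull()].index.to_numpy(),
                                     data[data['EngineV'].isnull()].index.to_numpy()),
                                    axis=0))

len(set_with_nulls)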

I'd be happy to learn if there are any alternative and more straightforward methods which would arrive at the same result.
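One possibly more direct route is to build a row-wise boolean mask with isnull().any(axis=1) and sum it, since the mask counts each row once no matter how many of its columns are null. The sketch below runs on a made-up DataFrame rather than the course data, and the names has_null, rows_with_nulls and overlap are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the course dataset (values are made up for illustration)
data = pd.DataFrame({
    'Price':   [4200.0, np.nan, 13300.0, np.nan, 18300.0, 7900.0],
    'EngineV': [2.0,    2.9,    np.nan,  np.nan, 2.0,     1.6],
})

cols = ['Price', 'EngineV']

# True for every row with a null in at least one of the two columns
has_null = data[cols].isnull().any(axis=1)

# Number of distinct rows that dropna() would remove for these columns
rows_with_nulls = int(has_null.sum())

# With two columns, (sum of per-column null counts) - (rows with any null)
# equals the number of rows that are null in both columns
overlap = int(data[cols].isnull().sum().sum()) - rows_with_nulls

print(rows_with_nulls, overlap)
```

Applied to the real dataset, the same subtraction would be 322 - 320 = 2, that is, two rows missing both Price and EngineV.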

Removing the observations that contain missing values is only one way of solving the problem. There are various techniques that could handle it better; they are discussed in Section 8 of the Data Preprocessing with NumPy course. Additionally, there is a comprehensive article on the pandas website where you can get acquainted with the techniques implemented in the library.
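As one illustration only (not necessarily the approach covered in that course), missing values can be filled with each column's median instead of dropping the rows. The DataFrame below is made up, and whether imputation is appropriate for a column such as Price depends on how the column is used later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the course dataset (values are made up for illustration)
data = pd.DataFrame({
    'Price':   [4200.0, np.nan, 13300.0, np.nan, 18300.0, 7900.0],
    'EngineV': [2.0,    2.9,    np.nan,  np.nan, 2.0,     1.6],
})

# Replace each null with the median of its own column; no rows are dropped
data_imputed = data.fillna(data.median(numeric_only=True))

print(len(data_imputed), int(data_imputed.isnull().sum().sum()))
```

All six rows are kept, at the cost of introducing values that were never observed.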

Hope this helps!

Kind regards,
365 Hristina

Answer 2

Alessandro Imbrìaco, posted on 26 Sept 2022

Really exhaustive answer. Thank you!

Answer 3

Can't we do it like this?

total_nulls = data[(data['Price'].isna()) | (data['EngineV'].isna())]

I think len(total_nulls) will also give the same result.
