Use pandas

Replace Missing Data using Pandas

Replace Missing Data using Pandas

When working with real-world data sometimes you'll find that the data have missing values. This can happen for a lot of reasons: Errors in the ETL, the user did not provide that information, a new field was introduce to the database and old rows have no values, etc.

Suppose you have this dataset and you want to replace all the NaN values in the Bedrooms column with non-NaN values.

homes_sorted.csv

AddressPriceBedrooms
992 Settled St823,0494.0
1506 Guido St784,0493.0
247 Fort St299,238NaN
132 Walrus Ave299,0012.0
491 Python St293,923NaN
4981 Anytown Rd199,0004.0
938 Zeal Rd148,398NaN
123 Main St99,0001.0

Let's try the simplest method first.

Method 1: Replace NaN with constant value

import pandas as pd
df_homes = pd.read_csv("C:/Users/kennethcassel/homes_sorted.csv")
# To fill NaN values from a column use pandas fillna() function
# and pass it the value with which you want to replace the missing values
df_homes['Bedrooms'] = df_homes['Bedrooms'].fillna(1)
df_homes.to_csv('homes_imputed.csv', index=False)

After that, your dataset should look like this:

homes_imputed.csv

AddressPriceBedrooms
992 Settled St823,0494.0
1506 Guido St784,0493.0
247 Fort St299,2381.0
132 Walrus Ave299,0012.0
491 Python St293,9231.0
4981 Anytown Rd199,0004.0
938 Zeal Rd148,3981.0
123 Main St99,0001.0

Method 2: Replace NaN with most common value

import pandas as pd
df_homes = pd.read_csv("C:/Users/kennethcassel/homes_sorted.csv")
# Instead of filling missing values with a random number
# we fill them with the most common value.
# The [0] after function mode() is necessary because mode()
# returns a dataset insted of an int/float
df_homes['Bedrooms'] = df_homes['Bedrooms'].fillna(
df_home['Bedrooms'].mode()[0]
)
df_homes.to_csv('homes_imputed.csv', index=False)
df = pd.DataFrame()

After that, all NaN values should be replaced with 4.0

AddressPriceBedrooms
992 Settled St823,0494.0
1506 Guido St784,0493.0
247 Fort St299,2384.0
132 Walrus Ave299,0012.0
491 Python St293,9234.0
4981 Anytown Rd199,0004.0
938 Zeal Rd148,3984.0
123 Main St99,0001.0
Edit this page on GitHub