Mohit Sharma

a year ago

Python Pandas: Working With Missing Data

Missing data is a constant problem in real-life scenarios. In areas like machine learning and data mining, missing values lower data quality and can seriously reduce the accuracy of model predictions. Treating missing values is therefore a major point of focus when making models more accurate and valid.

Consider an online survey for a product. People often do not share all the information related to them. Some share their experience but not how long they have been using the product; others share how long they have used the product and their experience, but not their contact information. In one way or another, part of the data is almost always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

```
# import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# build a 5x3 frame of random values, then reindex to introduce rows of NaN
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
```

Its **output** is as follows:

```
        one       two     three
a  0.077988  0.476149  0.965836
b       NaN       NaN       NaN
c -0.390208 -0.551605 -2.301950
d       NaN       NaN       NaN
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g       NaN       NaN       NaN
h  0.085100  0.532791  0.887415
```

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects. The names isna() and notna() are aliases for the same functions.

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True wherever column 'one' is missing
print(df['one'].isnull())
```

Its **output** is as follows:

```
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
```

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True wherever column 'one' holds a value
print(df['one'].notnull())
```

Its **output** is as follows:

```
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
```
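A common follow-up is to count how many values are missing, or to use the boolean result as a row filter. The sketch below assumes a small hand-written frame (the values are illustrative, not from the example above) so that the counts are reproducible.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the counts are reproducible (values are illustrative).
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [np.nan, 2.0, 3.0, 4.0],
        "three": [1.0, 2.0, 3.0, 4.0],
    },
    index=["a", "b", "c", "d"],
)

# isnull() returns a boolean frame; summing it counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The notnull() mask keeps only the rows where column 'one' holds a value.
rows_with_one = df[df["one"].notnull()]
print(rows_with_one)
```

Here column `one` has two missing values, `two` has one and `three` has none, and the filter keeps rows `a` and `c`.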

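Missing values also affect calculations:

- When summing data, NA is treated as zero
- If the data are all NA, the result is NA in pandas versions before 0.22; newer versions return 0 unless `min_count=1` is passed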

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# NaN entries are skipped, so only the five real values are added
print(df['one'].sum())
```

Its **output** is as follows (the exact value varies from run to run because the data is random):

`2.02357685917`

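```
import pandas as pd
import numpy as np
# a frame with no data at all: every value is NaN
df = pd.DataFrame(index=[0,1,2,3,4,5], columns=['one','two'], dtype=float)
# min_count=1 requires at least one valid value; without it, pandas 0.22+ returns 0.0
print(df['one'].sum(min_count=1))
```

Its **output** is as follows:

`nan`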
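The two rules about summing can be checked side by side. This is a minimal sketch on two hand-written Series (the names `partly_missing` and `all_missing` are illustrative), assuming pandas 0.22 or newer, where `min_count` is available.

```python
import numpy as np
import pandas as pd

partly_missing = pd.Series([1.0, np.nan, 2.0])
all_missing = pd.Series([np.nan, np.nan])

# Default behaviour: NaN is skipped, so the sum is 1.0 + 2.0.
total_skip = partly_missing.sum()

# skipna=False lets a single NaN propagate into the result.
total_strict = partly_missing.sum(skipna=False)

# With no valid values at all, pandas 0.22+ returns 0.0 by default...
total_empty = all_missing.sum()

# ...and returns NaN only when at least one valid value is required.
total_empty_nan = all_missing.sum(min_count=1)

print(total_skip, total_strict, total_empty, total_empty_nan)
```

Choosing between the default and `min_count=1` depends on whether "no data" should read as zero or as unknown in the downstream analysis.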

Pandas provides various methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which are illustrated in the following sections.

The following program shows how you can replace "NaN" with "0".

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
# 'b' is not in the original index, so its row is all NaN
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
```

Its **output** is as follows:

```
        one       two     three
a -0.576991 -0.741695  0.553172
b       NaN       NaN       NaN
c  0.744328 -1.735166  1.749580
NaN replaced with '0':
        one       two     three
a -0.576991 -0.741695  0.553172
b  0.000000  0.000000  0.000000
c  0.744328 -1.735166  1.749580
```

Here we fill with zero, but any other value can be used instead.
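As an illustration of filling with something other than zero, the sketch below fills each column with its own mean, and then with a different constant per column via a dictionary. The frame and the fill constants are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [10.0, 20.0, np.nan]})

# df.mean() is a Series indexed by column name, so each column's gaps
# are filled with that column's own mean (NaN is skipped when averaging).
filled_mean = df.fillna(df.mean())
print(filled_mean)

# A dict maps column names to the constant used for that column.
filled_dict = df.fillna({"one": 0.0, "two": -1.0})
print(filled_dict)
```

Note that fillna returns a new object in both cases; the original `df` still contains its NaN values.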

Using the filling concepts discussed in the ReIndexing chapter, we can also fill missing values from neighbouring rows:

- pad/ffill: fill forward, carrying the last valid value onward

- bfill/backfill: fill backward, using the next valid value

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# forward fill; equivalent to fillna(method='pad'), which is deprecated since pandas 2.1
print(df.ffill())
```

Its **output** is as follows:

```
        one       two     three
a  0.077988  0.476149  0.965836
b  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
```

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# backward fill; equivalent to fillna(method='backfill'), which is deprecated since pandas 2.1
print(df.bfill())
```

Its **output** is as follows:

```
        one       two     three
a  0.077988  0.476149  0.965836
b -0.390208 -0.551605 -2.301950
c -0.390208 -0.551605 -2.301950
d -2.000303 -0.788201  1.510072
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g  0.085100  0.532791  0.887415
h  0.085100  0.532791  0.887415
```
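Forward and backward filling are also available as the `ffill()` and `bfill()` methods, which accept a `limit` on how many consecutive gaps get filled. A minimal sketch on a hand-written Series (values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0], index=list("abcd"))

# Carry the last valid value forward across the whole gap.
forward = s.ffill()

# Fill at most one consecutive NaN; the second gap entry stays missing.
forward_limited = s.ffill(limit=1)

# Pull the next valid value backward instead.
backward = s.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

A limit is useful when a long run of missing rows should not be papered over with a single stale value.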

If you simply want to exclude missing values, use the dropna function with the axis argument. By default axis=0, which operates on rows: any row containing at least one NA value is dropped. With axis=1, any column containing an NA value is dropped instead.

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# drop every row that contains at least one NaN
print(df.dropna())
```

Its **output** is as follows:

```
        one       two     three
a  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
```

```
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# every column has a NaN in rows b, d and g, so all columns are dropped
print(df.dropna(axis=1))
```

Its **output** is as follows:

```
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
```
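Dropping on any single NA can be too aggressive, as the empty result above shows. dropna also accepts the `how`, `thresh` and `subset` parameters for finer control. The sketch below uses a small made-up frame to compare them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan, 4.0],
        "two": [1.0, np.nan, 3.0, np.nan],
        "three": [1.0, np.nan, 3.0, np.nan],
    },
    index=list("abcd"),
)

# Default (how='any'): drop rows with at least one NaN.
drop_any = df.dropna()

# how='all': drop only the rows where every value is NaN.
drop_all = df.dropna(how="all")

# thresh=2: keep rows that have at least two non-NA values.
keep_two = df.dropna(thresh=2)

# subset: only look at column 'one' when deciding what to drop.
by_subset = df.dropna(subset=["one"])

for name, result in [("any", drop_any), ("all", drop_all),
                     ("thresh=2", keep_two), ("subset", by_subset)]:
    print(name, list(result.index))
```

Only row `a` survives the default, while `how='all'` removes just the fully empty row `b`.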

Often we need to replace a generic value with a specific one. We can do this with the replace() method.

Replacing NA with a scalar value is equivalent to what the fillna() function does.

```
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
```

Its **output** is as follows:

```
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
```
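replace() is also handy in the other direction: turning a placeholder value into a real NaN so that the missing-data tools apply. The sketch below assumes a hypothetical sentinel of -999 marking missing entries; that value is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# -999 is a hypothetical "missing" marker used only for this example.
df = pd.DataFrame({"one": [10, -999, 30], "two": [-999, 20, 30]})

# Turn the sentinel into NaN so isnull(), fillna() and dropna() can see it.
with_nan = df.replace(-999, np.nan)
print(with_nan.isnull().sum())

# From here the usual cleaning applies, for example filling with zero.
cleaned = with_nan.fillna(0)
print(cleaned)
```

Because NaN is a floating-point value, the columns become float once the sentinel is replaced.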

I hope you enjoyed reading this article and now know how to handle **missing data in Python Pandas**.

For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies, do visit us at InsideAIML.

Thanks for reading…

Happy Learning…