Pandas Handling Duplicate Values

In large datasets, we often encounter duplicate entries in tables. These duplicate entries can throw off our analysis and skew the results.

Pandas provides several methods to find and remove duplicate entries in DataFrames.

Find Duplicate Entries

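We can find duplicate entries in a DataFrame using the duplicated() method. It returns a boolean Series with True for each row that repeats an earlier row and False otherwise. By default, the first occurrence of a row is not marked as a duplicate.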

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# check for duplicate entries
print(df.duplicated())

Output

0    False
1    False
2     True
3     True
4    False
dtype: bool

In the above example, we checked for duplicate entries in df using the duplicated() method. It returned a Series of boolean values indicating whether each row is a duplicate.

Here, we got True for the third and the fourth rows because they are duplicates of the first and the second rows respectively. The fifth row is False because its Age and City values differ from those of the other John rows.
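Since duplicated() returns a boolean Series, we can also use it to count or select the duplicate rows. The following is a minimal sketch using the same data as above; the variable names mask, duplicate_count and duplicate_rows are our own and are only for illustration.

```python
import pandas as pd

# create dataframe (same data as the example above)
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# boolean mask: True for rows that repeat an earlier row
mask = df.duplicated()

# count the duplicate rows (True counts as 1)
duplicate_count = int(mask.sum())

# select only the duplicate rows using the mask
duplicate_rows = df[mask]

print(duplicate_count)
print(duplicate_rows)
```

Here, the count is 2 and the selected rows are the ones at index 2 and 3, which matches the True values in the output above.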

Example: Find Duplicates Based on Columns

By default, duplicated() considers all columns. To find duplicates based on certain columns, we can pass them as a list to the subset parameter of the duplicated() method.

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# check for duplicate entries in columns Name and Age
print(df.duplicated(subset=['Name', 'Age']))

Output

0    False
1    False
2    False
3     True
4    False
dtype: bool

In this example, we checked for duplicate entries based on the Name and Age columns only.

If you look at the Name and Age columns, the fourth row is a duplicate of the second row, even though their City values differ. Hence, the boolean value of the fourth row is True in the output.
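Sometimes it helps to see every row that belongs to a duplicate group, including the first occurrence. The sketch below assumes the keep=False argument of duplicated(), which marks all copies as True; the variable name all_copies is hypothetical.

```python
import pandas as pd

# create dataframe (same data as the example above)
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# keep=False marks every row whose Name and Age appear more than once
all_copies = df.duplicated(subset=['Name', 'Age'], keep=False)

# show both the first occurrence and its duplicate
print(df[all_copies])
```

With this data, both Anna rows (the second and the fourth) are selected, which makes it easier to compare the copies side by side before deciding which one to keep.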

Remove Duplicate Entries

We can remove duplicate entries in Pandas using the drop_duplicates() method. For example,

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# remove duplicates
df.drop_duplicates(inplace=True)

print(df)

Output

   Name  Age         City
0  John   28     New York
1  Anna   24  Los Angeles
4  John   19      Chicago

In this example, we removed duplicate entries from df using drop_duplicates().

Here, inplace=True specifies that the changes are made to the original DataFrame instead of returning a new one.

Notice that the drop_duplicates() method keeps the first occurrence of each duplicated row and removes the later ones by default. Here, the first and the second rows are kept while the third and the fourth rows are removed.
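If we want to leave the original DataFrame untouched, we can omit inplace=True and use the DataFrame that drop_duplicates() returns. The sketch below also passes ignore_index=True to renumber the remaining rows from 0; this assumes pandas 1.0 or later, and the variable name deduped is our own.

```python
import pandas as pd

# create dataframe (same data as the example above)
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# return a new dataframe without duplicates; df itself is not modified
# ignore_index=True relabels the remaining rows as 0, 1, 2
deduped = df.drop_duplicates(ignore_index=True)

print(deduped)
print(len(df))
```

Here, df still has all five rows, while deduped has three rows with a fresh index.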

To keep the last entry, we can pass the keep='last' argument. For example,

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# remove duplicates, keep last entries
df.drop_duplicates(keep='last', inplace=True)

print(df)

Output

   Name  Age         City
2  John   28     New York
3  Anna   24  Los Angeles
4  John   19      Chicago
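Besides 'first' and 'last', the keep argument also accepts False, which removes every row that has a duplicate. drop_duplicates() also takes a subset argument, just like duplicated(). The following is a minimal sketch of both options; the variable names unique_only and by_name are hypothetical.

```python
import pandas as pd

# create dataframe (same data as the example above)
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# keep=False drops all copies of any duplicated row
unique_only = df.drop_duplicates(keep=False)

# subset=['Name'] compares rows on the Name column only
by_name = df.drop_duplicates(subset=['Name'])

print(unique_only)
print(by_name)
```

With this data, unique_only contains only the last row (index 4), since it is the only row without a copy. by_name keeps the first John row and the first Anna row (index 0 and 1).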
