Pandas get_dummies()

The get_dummies() function in Pandas converts categorical variables into dummy (indicator) variables.

Each category is transformed into a new column of binary values (1 or 0) indicating the presence of that category in the original data.

Example

import pandas as pd

# create a Series
data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# use get_dummies on the Series
dummies = pd.get_dummies(data)
print(dummies)

'''
Output

   A  B  C
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
4  0  1  0
'''

get_dummies() Syntax

The syntax of the get_dummies() function in Pandas is:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, drop_first=False)

get_dummies() Arguments

The get_dummies() function takes the following arguments:

  • data - the input data to be transformed (a Series, DataFrame, or array-like)
  • prefix (optional) - string to prepend to the dummy column names
  • prefix_sep (optional) - separator placed between the prefix and the category name (default '_')
  • dummy_na (optional) - if True, add a column that indicates NaNs; if False (default), NaNs are ignored
  • drop_first (optional) - if True, drop the first category level so that k categories produce k-1 columns (default False)

get_dummies() Return Value

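The get_dummies() function returns a DataFrame in which each distinct value in the input becomes a separate column. Each column holds binary indicators marking the presence or absence of that value in each row of the original data.

Note: The outputs in this tutorial show 1s and 0s. In pandas 2.0 and later, the dummy columns are boolean by default and print as True and False; pass dtype=int to get_dummies() to get 1s and 0s.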
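The return value can be inspected directly. The sketch below is only an illustration: it passes dtype=int, a get_dummies() parameter that is not listed in the syntax above, so that the columns hold integers regardless of the installed pandas version.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# dtype=int requests integer 1/0 columns on any pandas version
dummies = pd.get_dummies(data, dtype=int)

print(type(dummies).__name__)   # DataFrame
print(list(dummies.columns))    # ['A', 'B', 'C']
print(dummies.shape)            # (5, 3)
```

The result has one row per input element and one column per distinct value.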


Example 1: Convert a Series Into Dummy Variables

import pandas as pd

# create a Series
data = pd.Series(['apple', 'orange', 'apple', 'banana'])

# use get_dummies() to convert the series into dummy variables
dummy_data = pd.get_dummies(data)
print(dummy_data)

Output

   apple  banana  orange
0      1       0       0
1      0       0       1
2      1       0       0
3      0       1       0

In the above example, we have created the data Series with fruit names.

We then applied get_dummies() which creates a new DataFrame where each fruit name becomes a column.

For each row in the data Series, the column matching that row's fruit holds 1 and every other column holds 0.
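One way to confirm this one-column-per-row property is to sum each row and to look up which column holds the 1. This is a minimal sketch, assuming dtype=int (a parameter not covered above) so the sums are plain integers.

```python
import pandas as pd

data = pd.Series(['apple', 'orange', 'apple', 'banana'])
dummy_data = pd.get_dummies(data, dtype=int)

# every row has exactly one 1, so each row sums to 1
row_sums = dummy_data.sum(axis=1)
print(row_sums.tolist())   # [1, 1, 1, 1]

# the column holding the 1 is the original category
recovered = dummy_data.idxmax(axis=1)
print(recovered.tolist())  # ['apple', 'orange', 'apple', 'banana']
```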


Example 2: Apply get_dummies() With Prefix

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# create a DataFrame
df = pd.DataFrame(data)

# get dummies with a specified prefix
dummies = pd.get_dummies(df['Color'], prefix='Color')
print(dummies)

Output

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            1          0
4           0            0          1

Here, we have passed the prefix='Color' argument to get_dummies(), so the new dummy variable columns are prefixed with Color_.

Hence, the resulting DataFrame contains columns Color_Blue, Color_Green, and Color_Red, representing the presence or absence of the respective color categories.
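When a whole DataFrame is passed instead of a single column, get_dummies() uses each column name as the prefix by default. The sketch below illustrates this; the Size column is a hypothetical addition for the example, and dtype=int is assumed only to keep the values as integers.

```python
import pandas as pd

# 'Size' is a hypothetical second column added for illustration
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Size': ['S', 'L', 'S'],
})

# no prefix given: each column name is used as the prefix
dummies = pd.get_dummies(df, dtype=int)
print(list(dummies.columns))
# ['Color_Blue', 'Color_Green', 'Color_Red', 'Size_L', 'Size_S']
```

The prefixes keep the dummy columns from different source columns apart, which matters when two columns share a category name.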


Example 3: Get Dummies With Specified Prefix and Prefix Separator

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# create a DataFrame
df = pd.DataFrame(data)

# get dummies with a specified prefix and prefix separator
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='--')
print(dummies)

Output

   Color--Blue  Color--Green  Color--Red
0            0             0           1
1            0             1           0
2            1             0           0
3            0             1           0
4            0             0           1

In this example, the prefix_sep='--' argument means that the prefix and the original category name will be separated by --.

So, for a color like Blue, the resulting column name in the dummies DataFrame would be Color--Blue and so on.
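A distinctive separator can make the column names easier to split back into prefix and category, for instance when category names themselves might contain underscores. The following is a rough sketch of that idea using plain string splitting:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='--')

# split each column name once at the separator to get [prefix, category]
parts = [name.split('--', 1) for name in dummies.columns]
print(parts)
# [['Color', 'Blue'], ['Color', 'Green'], ['Color', 'Red']]
```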


Example 4: Use dummy_na to Manage Missing Data

import pandas as pd

# sample data with a missing value
data = {'Color': ['Red', 'Green', 'Blue', None, 'Red']}

# create a DataFrame
df = pd.DataFrame(data)

# get dummies without considering NaN
dummies_without_nan = pd.get_dummies(df['Color'])

# get dummies considering NaN
dummies_with_nan = pd.get_dummies(df['Color'], dummy_na=True)

print("Dummies without NaN handling:\n", dummies_without_nan)
print("\nDummies with NaN handling:\n", dummies_with_nan)

Output

Dummies without NaN handling:
   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      0    0
4     0      0    1

Dummies with NaN handling:
   Blue  Green  Red  NaN
0     0      0    1    0
1     0      1    0    0
2     1      0    0    0
3     0      0    0    1
4     0      0    1    0

Here,

  1. get_dummies(df['Color']) - generates columns for Red, Green, and Blue, but no indication of the NaN value.
  2. get_dummies(df['Color'], dummy_na=True) - generates the same columns and an additional one called NaN indicating where NaN values were present in the original data.
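The difference between the two calls shows up clearly in the row sums. This is a small sketch, assuming dtype=int (not covered above) so that the sums are integers: without dummy_na the row with the missing value is all zeros, while with dummy_na=True every row contains exactly one 1.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', None, 'Red']})

without_nan = pd.get_dummies(df['Color'], dtype=int)
with_nan = pd.get_dummies(df['Color'], dummy_na=True, dtype=int)

# row 3 (the missing value) is all zeros without dummy_na
print(without_nan.sum(axis=1).tolist())  # [1, 1, 1, 0, 1]

# with dummy_na=True the extra column keeps every row summing to 1
print(with_nan.sum(axis=1).tolist())     # [1, 1, 1, 1, 1]
```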

Example 5: Drop the First Category With drop_first

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# creating a DataFrame
df = pd.DataFrame(data)

# getting dummies without dropping any columns
dummies_all = pd.get_dummies(df['Color'])

print("DataFrame with all dummy columns:")
print(dummies_all)
print("\n")

# getting dummies and dropping the first category column ('Blue' in this case)
dummies = pd.get_dummies(df['Color'], drop_first=True)

print("DataFrame after dropping 'Blue':")
print(dummies)

Output

DataFrame with all dummy columns:
   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      1    0
4     0      0    1


DataFrame after dropping 'Blue':
   Green  Red
0      0    1
1      1    0
2      0    0
3      1    0
4      0    1

Here, the drop_first=True argument is passed to get_dummies() to indicate that the first category (Blue, the first in sorted order) should be dropped.

Hence the resulting DataFrame contains two columns, Green and Red. Blue no longer has its own column; a row where both Green and Red are 0 (row 2 here) represents Blue.
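The dropped category can still be identified from the remaining columns. The sketch below, again assuming dtype=int for integer values, flags the rows where every remaining column is 0:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
dummies = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

# rows where every remaining column is 0 belong to the dropped category
is_blue = dummies.sum(axis=1) == 0
print(is_blue.tolist())        # [False, False, True, False, False]
print(list(dummies.columns))   # ['Green', 'Red']
```

This only works as shown when the data has no missing values, since a NaN row would also be all zeros.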
