Q: How do I remove duplicate rows from a DataFrame based on all columns?
A: Call the drop_duplicates() method with no arguments. By default it compares every column and keeps the first occurrence of each duplicated row.
Here's an example:
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
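To see which rows are removed, it can help to compare the row counts and index labels before and after. The following is a minimal sketch using the same sample data; the variable names (deduped, rows_before, rows_after, kept_labels) are chosen here only for illustration.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# drop_duplicates() returns a new DataFrame; df itself is left unchanged
deduped = df.drop_duplicates()

rows_before = len(df)                 # 5
rows_after = len(deduped)             # 4
kept_labels = deduped.index.tolist()  # [0, 1, 3, 4]

print(rows_before, rows_after, kept_labels)
```

Row 2 is the only row removed, because it repeats row 1 in both columns. The original index labels are kept, so the result has a gap at label 2.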
Q: How can I remove duplicates based on a specific column in a DataFrame?
A: You can specify the subset of columns to consider when removing duplicates using the subset parameter in the drop_duplicates() method.
Here's an example:
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows based on column 'A'
df_no_duplicates = df.drop_duplicates(subset=['A'])
print(df_no_duplicates)
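The subset parameter also accepts more than one column, in which case rows count as duplicates only when they match on every listed column. The sketch below assumes an extra column 'C' that is not part of the sample data above; it is added only to make the difference visible.

```python
import pandas as pd

# Hypothetical data: column 'C' is added only for this illustration
df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X'],
                   'C': [10, 20, 30, 40, 50]})

# Every row is unique when all three columns are compared
all_columns = df.drop_duplicates()

# Rows 1 and 2 match on both 'A' and 'B', so row 2 is dropped
a_and_b = df.drop_duplicates(subset=['A', 'B'])

# 'X' appears in rows 0 and 4, 'Y' in rows 1 and 2
b_only = df.drop_duplicates(subset=['B'])

print(len(all_columns), len(a_and_b), len(b_only))  # 5 4 3
```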
Q: How can I keep the last occurrence of a duplicate row instead of the first one?
A: You can use the keep parameter in the drop_duplicates() method to specify whether to keep the first occurrence or the last occurrence of duplicate rows. To keep the last occurrence, set keep='last'.
Here's an example:
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Keep the last occurrence of duplicate rows based on column 'A'
df_no_duplicates = df.drop_duplicates(subset=['A'], keep='last')
print(df_no_duplicates)
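One situation where keep='last' is useful is keeping the most recent record per key after sorting. This is only a sketch under assumed data: the column names 'user' and 'updated' are hypothetical and do not come from the example above.

```python
import pandas as pd

# Hypothetical records: 'user' and 'updated' are illustrative column names
df = pd.DataFrame({'user': ['ann', 'bob', 'ann', 'bob'],
                   'updated': [3, 1, 1, 2]})

# Sort so the newest row for each user comes last, then keep that row
latest = (df.sort_values('updated')
            .drop_duplicates(subset=['user'], keep='last'))

# Map each user to the 'updated' value of the row that was kept
latest_by_user = latest.set_index('user')['updated'].to_dict()
print(latest_by_user)
```

Here the kept rows are the ones with updated=3 for 'ann' and updated=2 for 'bob'.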
Q: How do I remove duplicates and reset the index of the resulting DataFrame?
A: After removing duplicates, you can call the reset_index() method to renumber the rows from 0. Passing drop=True discards the old index instead of adding it to the DataFrame as a new column.
Here's an example:
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows based on all columns and reset the index
df_no_duplicates = df.drop_duplicates().reset_index(drop=True)
print(df_no_duplicates)
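As an alternative, drop_duplicates() has an ignore_index parameter that renumbers the result in one step. The sketch below assumes pandas 1.0 or later, where this parameter is available, and compares it with the chained reset_index() call.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# Two-step version: drop duplicates, then discard the old index
chained = df.drop_duplicates().reset_index(drop=True)

# One-step version (assumes pandas >= 1.0)
direct = df.drop_duplicates(ignore_index=True)

print(direct.index.tolist())   # [0, 1, 2, 3]
print(chained.equals(direct))  # True
```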
Important Interview Questions and Answers on Pandas - Removing Duplicates
Q: How can you check for duplicate rows in a Pandas DataFrame?
A: You can check for duplicate rows in a Pandas DataFrame using the duplicated() method. It returns a boolean Series with one value per row. By default the first occurrence of a row is marked False and every later repeat of it is marked True.
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Check for duplicate rows
duplicates = df.duplicated()
print(duplicates)
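The boolean Series from duplicated() can also be used to count duplicates or to select them. A minimal sketch with the same sample data follows; the names dup_mask, n_dupes and all_copies are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# True marks a row that repeats an earlier row
dup_mask = df.duplicated()

# Summing the mask counts the True values
n_dupes = int(dup_mask.sum())

# keep=False marks every copy, including the first occurrence
all_copies = df[df.duplicated(keep=False)]

print(dup_mask.tolist())          # [False, False, True, False, False]
print(n_dupes)                    # 1
print(all_copies.index.tolist())  # [1, 2]
```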
Q: How can you remove duplicate rows from a Pandas DataFrame?
A: You can remove duplicate rows from a Pandas DataFrame using the drop_duplicates() method. This method returns a new DataFrame with duplicate rows removed.
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Q: How can you remove duplicate rows based on a specific column in a Pandas DataFrame?
A: Pass that column to the subset parameter of the drop_duplicates() method. The subset parameter accepts either a single column label or a list of labels, so subset='A' and subset=['A'] give the same result.
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows based on column 'A'
df_no_duplicates = df.drop_duplicates(subset='A')
print(df_no_duplicates)
Q: How can you keep the first occurrence of duplicate rows and remove subsequent duplicates in a Pandas DataFrame?
A: Set keep='first' in the drop_duplicates() method. This is the default, so calling drop_duplicates() with no keep argument behaves the same way.
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Keep the first occurrence of duplicates
df_no_duplicates = df.drop_duplicates(keep='first')
print(df_no_duplicates)
Q: How can you remove all occurrences of duplicate rows in a Pandas DataFrame?
A: Set keep=False in the drop_duplicates() method. Note that this is the boolean False, not the string 'False'. Every row that has a duplicate is then removed, including its first occurrence.
Example Code:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove all occurrences of duplicates
df_no_duplicates = df.drop_duplicates(keep=False)
print(df_no_duplicates)
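To compare the three keep options side by side, the sketch below applies each one to the same sample data and lists the index labels that survive. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# Rows 1 and 2 are identical; each option treats them differently
kept_first = df.drop_duplicates(keep='first').index.tolist()
kept_last = df.drop_duplicates(keep='last').index.tolist()
kept_none = df.drop_duplicates(keep=False).index.tolist()

print(kept_first)  # [0, 1, 3, 4]
print(kept_last)   # [0, 2, 3, 4]
print(kept_none)   # [0, 3, 4]
```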