Q: How do I handle missing data (NaN) in a DataFrame?
A: You can handle missing data in Pandas using the fillna() method to replace NaN values with a specified value or strategy.
Here's an example:
import pandas as pd
import numpy as np
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)
# Replace NaN values with 0
df.fillna(0, inplace=True)
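A fill "strategy" can also be a per-column statistic or a forward fill rather than a constant. The following is a minimal sketch on the same two-column frame; using the mean for column A and a forward fill for column B is only an illustrative assumption, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8]})

filled = df.copy()
# Fill column A with its mean (NaN is skipped when the mean is computed)
filled['A'] = filled['A'].fillna(filled['A'].mean())
# Fill column B by carrying the previous valid value forward
filled['B'] = filled['B'].ffill()
print(filled)
```

Working on a copy keeps the original frame available for comparison.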
Q: How can I remove rows or columns with missing data?
A: You can use the dropna() method to remove rows or columns containing NaN values.
Here's an example:
# Remove rows that contain at least one NaN value
df_without_nan_rows = df.dropna(axis=0)
# Remove columns that contain at least one NaN value
df_without_nan_cols = df.dropna(axis=1)
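dropna() also accepts how, thresh, and subset for cases where dropping on any single NaN is too aggressive. A small sketch with a hypothetical three-column frame, assuming you want to compare the three options side by side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, 4, 13],
                   'B': [5, np.nan, 7, 8, np.nan],
                   'C': [9, np.nan, 11, 12, np.nan]})

# Drop only rows where every value is missing
all_missing_dropped = df.dropna(how='all')
# Keep rows that have at least 2 non-missing values
thresh_kept = df.dropna(thresh=2)
# Drop rows only when column 'A' is missing
subset_dropped = df.dropna(subset=['A'])

print(all_missing_dropped.index.tolist())
print(thresh_kept.index.tolist())
print(subset_dropped.index.tolist())
```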
Q: How do I correct data types in a DataFrame?
A: You can use the astype() method to convert the data types of columns in a DataFrame. For example, to convert a column to the integer data type:
df['Column_Name'] = df['Column_Name'].astype(int)
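Note that astype(int) raises an error when the column contains NaN or text that cannot be parsed as a number. One possible workaround, sketched here with made-up sample values, is to coerce with pd.to_numeric() and then use the nullable Int64 dtype:

```python
import pandas as pd

# Hypothetical column holding numbers as text, with one bad entry
df = pd.DataFrame({'Column_Name': ['1', '2', 'three', '4']})

# Turn unparseable entries into NaN instead of raising an error
numeric = pd.to_numeric(df['Column_Name'], errors='coerce')
# The nullable integer dtype keeps missing entries as <NA>
df['Column_Name'] = numeric.astype('Int64')
print(df['Column_Name'])
```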
Q: How can I replace incorrect values in a DataFrame?
A: You can use the replace() method to replace specific values in a DataFrame. For example, to replace all occurrences of a value with another value:
df['Column_Name'] = df['Column_Name'].replace(old_value, new_value)
Q: How can I correct inconsistent string data in a DataFrame column?
A: You can use the str.replace() method to replace or correct string data in a column. For instance, to replace substring occurrences:
df['Column_Name'] = df['Column_Name'].str.replace('incorrect', 'correct', regex=False)
Q: How can I handle outliers in numerical data?
A: You can handle outliers by applying statistical methods like z-score or IQR (Interquartile Range) and then filtering or transforming the data. Here's an example using the IQR method:
Q1 = df['Column_Name'].quantile(0.25)
Q3 = df['Column_Name'].quantile(0.75)
IQR = Q3 - Q1
# Filter outliers
df = df[(df['Column_Name'] >= Q1 - 1.5 * IQR) & (df['Column_Name'] <= Q3 + 1.5 * IQR)]
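The z-score approach mentioned above can be sketched in a similar way. The sample values and the cutoff of 2.5 are assumptions for illustration; 3 is a common choice for larger samples, but in a very small sample no value can reach a z-score that high.

```python
import pandas as pd

# Hypothetical data: ten ordinary values and one extreme value
df = pd.DataFrame({'Column_Name': [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]})

col = df['Column_Name']
# Z-score: distance from the mean in units of standard deviation
z_scores = (col - col.mean()) / col.std()
# Keep rows whose absolute z-score is below the chosen cutoff
threshold = 2.5  # assumed cutoff; adjust for your data
filtered = df[z_scores.abs() < threshold]
print(filtered)
```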
Q: How can I handle duplicate rows in a DataFrame?
A: You can remove duplicate rows using the drop_duplicates() method.
For example:
df.drop_duplicates(inplace=True)
Important Interview Questions and Answers on Pandas - Fixing Wrong Data
Q: How do you identify and handle missing values in a Pandas DataFrame?
Use isna() (or its alias isnull()) to identify missing values, then handle them with fillna(), dropna(), or an imputation technique.
Example Code:
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]}
df = pd.DataFrame(data)
# Identify missing values
print(df.isna())
# Option 1: fill missing values with a specific value (e.g., 0)
df_filled = df.fillna(0)
# Option 2: drop rows that contain missing values
df_dropped = df.dropna()
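Before choosing a strategy it often helps to count the missing values per column. The sketch below does that and then shows one simple imputation technique, filling each column with its own median; the choice of median is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8]})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Impute each column with its own median (NaN is ignored by median())
imputed = df.fillna(df.median())
print(imputed)
```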
Q: How can you replace incorrect or wrong data in a DataFrame with the correct values?
Use the replace() method, which swaps every occurrence of a given value for the value you specify.
Example Code:
import pandas as pd
data = {'A': ['apple', 'banana', 'cherry', 'apple', 'cherry'],
'B': [5, 10, 15, 20, 25]}
df = pd.DataFrame(data)
# Replace 'apple' with 'orange' in column 'A' (assign the result back to the column)
df['A'] = df['A'].replace('apple', 'orange')
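When several wrong values need fixing at once, replace() also accepts a dictionary that maps each wrong value to its correction. A minimal sketch, where the misspellings are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'A': ['aple', 'banana', 'chery', 'apple', 'cherry']})

# Map each known misspelling to its correct form (hypothetical typos)
corrections = {'aple': 'apple', 'chery': 'cherry'}
df['A'] = df['A'].replace(corrections)
print(df['A'].tolist())
```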
Q: How do you handle outliers in a Pandas DataFrame?
Outliers can be handled in several ways, such as removing them, transforming them, or capping them. You can use techniques like the Z-score or the IQR (Interquartile Range) method to identify and handle outliers.
Example Code (IQR method):
import pandas as pd
# Create a DataFrame with outliers
data = {'A': [1, 2, 3, 100, 5, 6, 7, 200]}
df = pd.DataFrame(data)
# Calculate the IQR (Interquartile Range)
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Replace outliers with the median (computed once, before any value is changed)
median = df['A'].median()
is_outlier = (df['A'] < lower_bound) | (df['A'] > upper_bound)
df['A'] = df['A'].mask(is_outlier, median)
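Capping, the third option mentioned above, can be sketched with clip(). This assumes the same IQR bounds are used as the caps; the column name A_capped is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 100, 5, 6, 7, 200]})

q1 = df['A'].quantile(0.25)
q3 = df['A'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Cap values outside the bounds at the nearest bound
df['A_capped'] = df['A'].clip(lower=lower_bound, upper=upper_bound)
print(df)
```

Unlike filtering, capping keeps every row, which matters when other columns in those rows are still useful.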
Q: How can you handle inconsistent or misspelled data in a Pandas DataFrame?
You can handle inconsistent or misspelled data using the replace() method or by using string manipulation functions like str.lower(), str.strip(), or regular expressions.
Example Code (correcting capitalization):
import pandas as pd
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'san Francisco', 'Miami']}
df = pd.DataFrame(data)
# Convert all city names to title case
df['City'] = df['City'].str.title()
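The other string tools mentioned above can be chained in the same way. A sketch assuming the inconsistencies are stray whitespace and mixed capitalization; the sample values are made up:

```python
import pandas as pd

df = pd.DataFrame({'City': ['  New York', 'los angeles ', 'CHICAGO', 'San   Francisco']})

df['City'] = (
    df['City']
    .str.strip()                           # remove leading/trailing whitespace
    .str.replace(r'\s+', ' ', regex=True)  # collapse repeated inner whitespace
    .str.title()                           # normalize capitalization
)
print(df['City'].tolist())
```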
Q: How do you handle duplicate data in a DataFrame?
You can handle duplicate data using the drop_duplicates() method to remove duplicate rows or by using the duplicated() method to identify duplicates.
Example Code:
import pandas as pd
data = {'A': [1, 2, 2, 3, 4],
'B': ['X', 'Y', 'Y', 'Z', 'X']}
df = pd.DataFrame(data)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Identify duplicate rows based on a subset of columns (column 'B' in this case)
duplicates = df[df.duplicated(subset=['B'])]
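Both methods take a keep argument that controls which occurrence counts as the original. A short sketch on the same data, assuming you want to inspect every member of a duplicated group and then keep the last occurrence per value of B:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['X', 'Y', 'Y', 'Z', 'X']})

# Flag every row that belongs to a fully duplicated group, not just the repeats
all_dupes = df[df.duplicated(keep=False)]
# Keep the last occurrence for each value of 'B' instead of the first
last_per_b = df.drop_duplicates(subset=['B'], keep='last')

print(all_dupes)
print(last_per_b)
```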