Q: What are descriptive statistics?
A: Descriptive statistics summarize and describe the main features of a dataset using measures such as mean, median, mode, standard deviation, and range. They provide a basic understanding of the data.
Q: What are inferential statistics?
A: Inferential statistics use sample data to make predictions or inferences about a larger population. Common techniques include hypothesis testing, confidence intervals, and regression analysis.
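As a rough illustration of inference from a sample, the sketch below estimates a 95% confidence interval for a population mean using the t-distribution. The five data values are made up for the example, and the approach assumes the sample is random and roughly normally distributed.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (illustrative values only)
sample = np.array([20, 25, 30, 35, 40])

sample_mean = sample.mean()
standard_error = stats.sem(sample)  # sample standard deviation / sqrt(n)

# 95% confidence interval for the population mean, with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(
    0.95, len(sample) - 1, loc=sample_mean, scale=standard_error
)

print("Sample mean:", sample_mean)
print("95% CI:", (round(ci_low, 2), round(ci_high, 2)))
```

The interval is wide here because the sample is very small; larger samples shrink the standard error and tighten the interval.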
Q: What is the difference between a population and a sample?
A: A population is the entire group of individuals or items of interest in a study, while a sample is a smaller subset of that population. Statistics are often calculated on a sample to make inferences about the larger population.
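The distinction can be sketched with a simulated population. Everything below is hypothetical: the population is 10,000 randomly generated heights, and the sample is 100 values drawn from it without replacement.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]

# A simple random sample of 100 individuals, drawn without replacement
sample = random.sample(population, 100)

population_mean = statistics.mean(population)  # a parameter (usually unknown in practice)
sample_mean = statistics.mean(sample)          # a statistic (what we can actually compute)

print("Population mean:", round(population_mean, 2))
print("Sample mean:", round(sample_mean, 2))
```

The sample mean will generally be close to, but not exactly equal to, the population mean; that gap is the sampling error that inferential methods try to quantify.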
Q: What is a probability distribution?
A: A probability distribution is a mathematical function that describes the likelihood of different outcomes occurring in a random experiment. Common examples include the normal distribution, binomial distribution, and Poisson distribution.
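For a concrete feel, here is a small sketch that evaluates one probability from each of the three distributions mentioned, using scipy.stats. The specific parameter values (10 coin flips, an average rate of 4 events) are chosen only for illustration.

```python
from scipy.stats import norm, binom, poisson

# Normal: probability that a standard normal value falls below 1.96
p_normal = norm.cdf(1.96)

# Binomial: probability of exactly 3 heads in 10 fair coin flips
p_binom = binom.pmf(3, n=10, p=0.5)

# Poisson: probability of exactly 2 events when the average rate is 4
p_poisson = poisson.pmf(2, mu=4)

print("P(Z < 1.96):", round(p_normal, 4))
print("P(3 heads in 10 flips):", round(p_binom, 4))
print("P(2 events | rate 4):", round(p_poisson, 4))
```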
Q: How to calculate the mean, median, and mode in Python?
A: Here's an example that calculates the mean, median, and mode using Python's built-in statistics module:
import statistics
data = [10, 15, 20, 25, 30, 35, 40]
mean_value = statistics.mean(data)
median_value = statistics.median(data)
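# Every value here is unique, so there is no meaningful mode: Python 3.8+ returns the first value (10), older versions raise StatisticsError
mode_value = statistics.mode(data)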
print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
Q: What is correlation?
A: Correlation measures the strength and direction of the linear relationship between two continuous variables. It is quantified by the (Pearson) correlation coefficient, which ranges from -1 to 1: -1 indicates a perfect negative linear relationship, 1 a perfect positive linear relationship, and 0 no linear relationship (the variables may still be related nonlinearly).
Q: How to calculate the correlation coefficient in Python?
A: You can use the numpy library to calculate the correlation coefficient in Python:
import numpy as np
# Example data
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
# np.corrcoef returns a 2x2 correlation matrix; the [0, 1] entry is the x-y correlation
correlation_coefficient = np.corrcoef(x, y)[0, 1]
print("Correlation Coefficient:", correlation_coefficient)
Q: How to perform a t-test in Python?
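A: An independent two-sample t-test compares the means of two groups. Here's an example using the scipy library in Python. By default, ttest_ind assumes equal variances in both groups; pass equal_var=False to run Welch's t-test instead: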
from scipy.stats import ttest_ind
# Example data for two groups
group1 = [20, 25, 30, 35, 40]
group2 = [15, 18, 22, 27, 32]
t_stat, p_value = ttest_ind(group1, group2)
print("T-statistic:", t_stat)
print("P-value:", p_value)
Important Interview Questions and Answers on Data Science - Intro to Statistics
Q: What is the difference between descriptive statistics and inferential statistics?
Descriptive statistics involves summarizing and presenting data in a meaningful way to describe its main features, such as measures of central tendency (mean, median, mode) and measures of dispersion (standard deviation, range). Inferential statistics, on the other hand, uses data from a sample to make inferences or predictions about a larger population.
Example Code (Descriptive Statistics):
import numpy as np
data = [12, 18, 21, 15, 24, 30, 16, 20]
mean = np.mean(data)
median = np.median(data)
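# np.std defaults to the population standard deviation (ddof=0); use ddof=1 for the sample standard deviation
std_dev = np.std(data)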
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
Q: What is the Central Limit Theorem, and why is it important in statistics?
The Central Limit Theorem states that, for independent observations drawn from a population with finite variance, the sampling distribution of the sample mean is approximately normal when the sample size is large enough, regardless of the shape of the population distribution. It is essential in statistics because it justifies using normal-based methods, such as confidence intervals and many hypothesis tests, on sample means even when the underlying data are not normally distributed.
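The theorem can be checked with a quick simulation. The sketch below assumes a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1) and draws 5,000 samples of size 50; these numbers are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

sample_size = 50
num_samples = 5_000

# Each row is one sample of 50 values from a skewed (exponential) population
samples = rng.exponential(scale=1.0, size=(num_samples, sample_size))

# One mean per sample: this is the sampling distribution of the sample mean
sample_means = samples.mean(axis=1)

print("Mean of sample means:", round(sample_means.mean(), 3))
print("Std of sample means:", round(sample_means.std(), 3))
print("Theoretical std (sigma / sqrt(n)):", round(1.0 / np.sqrt(sample_size), 3))
```

The sample means cluster around the population mean of 1 with a spread close to sigma / sqrt(n), and a histogram of sample_means would look roughly bell-shaped even though the individual values are far from normal.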
Q: Explain the concept of p-value in hypothesis testing.
The p-value is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true. In hypothesis testing, if the p-value is smaller than a chosen significance level (often 0.05), we reject the null hypothesis in favor of the alternative hypothesis. A smaller p-value indicates stronger evidence against the null hypothesis, but it is not the probability that the null hypothesis is true.
Example Code (Hypothesis Testing - t-test):
from scipy.stats import ttest_ind
# Sample data for two groups (e.g., exam scores of two classes)
group1_scores = [85, 88, 92, 78, 80]
group2_scores = [76, 82, 90, 70, 85]
# Perform independent two-sample t-test
t_statistic, p_value = ttest_ind(group1_scores, group2_scores)
print("T-statistic:", t_statistic)
print("P-value:", p_value)
Q: What is correlation, and how is it different from causation?
Correlation measures the statistical relationship between two variables, indicating how they tend to vary together. Causation means that a change in one variable produces a change in the other. A correlation between two variables does not by itself show that one causes the other, because hidden factors (confounding variables) may be influencing both.
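A simulated confounder makes this concrete. In the sketch below, the variable names (temperature, ice_cream_sales, sunburn_cases) and the helper residuals are hypothetical: both outcomes are generated from temperature plus independent noise, so neither causes the other, yet they are strongly correlated until temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n = 5_000

# Hypothetical confounder that drives both outcomes
temperature = rng.normal(size=n)
ice_cream_sales = temperature + rng.normal(scale=0.5, size=n)
sunburn_cases = temperature + rng.normal(scale=0.5, size=n)

# Raw correlation between the two outcomes
r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(values, confounder):
    """Remove the linear effect of the confounder from values."""
    slope, intercept = np.polyfit(confounder, values, 1)
    return values - (slope * confounder + intercept)

# Correlation after removing the confounder's linear effect from both outcomes
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print("Raw correlation:", round(r_raw, 3))
print("Correlation after adjusting for temperature:", round(r_adjusted, 3))
```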
Q: What are outliers, and how can they affect statistical analysis?
Outliers are data points that deviate significantly from the rest of the data in a dataset. They can affect statistical analysis by skewing measures of central tendency and dispersion, leading to inaccurate conclusions and predictions. It is essential to detect and handle outliers appropriately to avoid bias in statistical analyses.
Example Code (Detecting Outliers using Z-score):
import numpy as np
from scipy.stats import zscore
data = [12, 18, 21, 15, 24, 30, 16, 20, 100]
z_scores = zscore(data)
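# With only 9 points, |z| cannot exceed sqrt(8) (about 2.83), so a cutoff of 3 would flag nothing; 2.5 is used here
outliers_indices = np.where(np.abs(z_scores) > 2.5)[0]
print("Outliers indices:", outliers_indices)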
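The z-score rule can miss outliers in small samples, because the outlier itself inflates the mean and standard deviation. A commonly used alternative, sketched here as a different technique rather than a replacement, is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR beyond the quartiles. The 1.5 multiplier is a convention, not a fixed requirement.

```python
import numpy as np

data = [12, 18, 21, 15, 24, 30, 16, 20, 100]

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("Bounds:", (lower_bound, upper_bound))
print("Outliers:", outliers)
```

Because quartiles are barely affected by a single extreme value, this rule flags 100 in this dataset.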