Data and Sampling Distributions¶
Random sampling and sample bias¶
Random sampling is the process of selecting a subset of individuals or items from a larger population in such a way that each member of the population has an equal chance of being chosen. Sample bias occurs when the individuals or elements selected for a sample are not representative of the larger population, leading to inaccurate and non-generalizable results.
Terms:
- Sample: A subset of a larger dataset.
- Population: The larger dataset.
- N (n): Size of the population (sample).
- Random sampling: Drawing elements into a sample at random from the population.
- Stratified sampling: Stratified sampling is a sampling method where the population is divided into distinct subgroups, or strata, and random samples are independently collected from each stratum to ensure representation from all segments of the population.
- Sample bias: A sample that misrepresents the population.
Notes:
- Random sampling is very important in data science.
- Data quality is often more important than data quantity, and random sampling can reduce bias and enable quality improvements that would otherwise be prohibitively expensive.
Selection Bias¶
Selection bias refers to a systematic error introduced into a study or sample when the selection of participants is not random or representative, leading to a distorted or unrepresentative sample that may not accurately reflect the larger population.
Terms:
- Bias: Systematic error or deviation from the true value in a measurement or estimate, often resulting from flaws in the sampling, data collection, or analysis process, and potentially leading to inaccurate and non-representative results.
- Data snooping: Data snooping refers to the practice of repeatedly analyzing data or testing hypotheses until a desired result is found, potentially leading to false or overly optimistic conclusions.
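As an aside (a minimal sketch, not part of the original notebook), data snooping can be illustrated by running many hypothesis tests on pure noise: roughly 5% of them will come out "significant" at the 0.05 level by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)  # a random binary "outcome"

false_positives = 0
for _ in range(100):
    noise = rng.normal(size=200)  # a feature that is pure noise
    # two-sample t-test between the two label groups
    result = stats.ttest_ind(noise[labels == 0], noise[labels == 1])
    if result.pvalue < 0.05:
        false_positives += 1

print(f"'Significant' results found by snooping: {false_positives} / 100")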
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.colors
import scipy.stats as stats
data_path = f"./data/"
raw_players_data = pd.read_csv(f'{data_path}players.csv')
players_data = raw_players_data.copy()
print(players_data.shape)
players_data.head(15)
(1683, 7)
 | nflId | height | weight | birthDate | collegeName | position | displayName
---|---|---|---|---|---|---|---
0 | 25511 | 6-4 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
1 | 29550 | 6-4 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
2 | 29851 | 6-2 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
3 | 30842 | 6-6 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
4 | 33084 | 6-4 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |
5 | 33099 | 6-6 | 245 | 1985-01-16 | Delaware | QB | Joe Flacco |
6 | 33107 | 6-4 | 315 | 1985-08-30 | Virginia Tech | T | Duane Brown |
7 | 33130 | 5-10 | 175 | 1986-12-01 | California | WR | DeSean Jackson |
8 | 33131 | 6-8 | 300 | 1986-09-01 | Miami | DE | Calais Campbell |
9 | 33138 | 6-3 | 222 | 1985-07-02 | Michigan | QB | Chad Henne |
10 | 34452 | 6-3 | 220 | 1988-02-07 | Georgia | QB | Matthew Stafford |
11 | 34754 | 6-0 | 229 | 1986-10-07 | Missouri | QB | Chase Daniel |
12 | 34843 | 6-2 | 215 | 1985-10-13 | Michigan State | QB | Brian Hoyer |
13 | 35443 | 6-5 | 320 | 1988-07-19 | Oklahoma | T | Trent Williams |
14 | 35449 | 6-3 | 304 | 1987-05-12 | California | NT | Tyson Alualu |
players_data.dtypes.sort_values()
nflId           int64
weight          int64
height         object
birthDate      object
collegeName    object
position       object
displayName    object
dtype: object
players_data["height"].unique()
array(['6-4', '6-2', '6-6', '5-10', '6-8', '6-3', '6-0', '6-5', '6-1', '5-9', '5-11', '5-8', '6-7', '6-9', '5-6', '5-7'], dtype=object)
def height_to_inches(h):
    # convert a height string like "6-4" (feet-inches) to total inches
    try:
        feet, inches = h.split("-")
        return int(feet) * 12 + int(inches)
    except (ValueError, AttributeError):
        raise ValueError(f"height should be a 'feet-inches' string such as '6-4', got {h!r}")
players_data["height"] = players_data["height"].apply(height_to_inches)
players_data['height'].sample()
1540 77 Name: height, dtype: int64
# Random sampling the data
r_sample = players_data.sample(frac=0.3, random_state=45)
print(f'random sample size: {r_sample.shape}')
r_sample.head()
random sample size: (505, 7)
 | nflId | height | weight | birthDate | collegeName | position | displayName
---|---|---|---|---|---|---|---
780 | 47801 | 75 | 305 | 1995-06-20 | North Carolina State | C | Garrett Bradbury |
378 | 43510 | 76 | 305 | 1993-03-15 | Illinois | C | Ted Karras |
1034 | 52457 | 76 | 227 | 1998-07-07 | Notre Dame | WR | Chase Claypool |
690 | 46284 | 77 | 317 | 1994-11-24 | Alabama | C | Bradley Bozeman |
1570 | 54607 | 71 | 175 | NaN | South Carolina State | CB | Cobie Durant |
# stratified sampling: take a 30% random sample within each position group
def stratify_sample(stratum):
    return stratum.sample(frac=0.3, random_state=45)
def biased_stratify_sample(stratum):
    # deliberately biased: keep only heights above the overall mean (~74.5 inches)
    stratum = stratum.loc[stratum["height"] > 74, "height"]
    return stratum.sample(frac=0.3, random_state=45)
strat_sample = players_data.groupby("position").apply(stratify_sample)
biased_strat_sample = players_data.groupby("position").apply(biased_stratify_sample)
print(f"stratified sample size: {strat_sample.shape}")
strat_sample.sample(10)
stratified sample size: (503, 7)
position |  | nflId | height | weight | birthDate | collegeName | position | displayName
---|---|---|---|---|---|---|---|---
TE | 345 | 43399 | 78 | 257 | 1993-01-01 | Western Kentucky | TE | Tyler Higbee |
SS | 606 | 46122 | 71 | 205 | 1995-09-16 | North Carolina | SS | M.J. Stewart |
DE | 466 | 44892 | 76 | 266 | 1994-03-18 | Ohio | DE | Tarell Basham |
G | 248 | 42477 | 76 | 310 | 1992-05-03 | West Virginia | G | Mark Glowinski |
ILB | 405 | 44174 | 73 | 236 | 1993-02-22 | Texas Tech | ILB | Sam Eguavoen |
QB | 867 | 47916 | 75 | 214 | 1996-08-08 | Auburn | QB | Jarrett Stidham |
OLB | 1270 | 53481 | 73 | 215 | NaN | Notre Dame | OLB | Jeremiah Owusu-Koramoah |
T | 843 | 47874 | 78 | 307 | 1996-09-05 | Sioux Falls | T | Trey Pipkins |
G | 1114 | 52557 | 76 | 300 | 1996-06-19 | Ball State | G | Danny Pinter |
OLB | 139 | 41269 | 75 | 250 | 1991-03-26 | Brigham Young | OLB | Kyle Van Noy |
# Create histograms using Plotly Express
fig_pop = px.histogram(players_data, x="height", nbins=30, color_discrete_sequence=['orange'])
fig_random = px.histogram(r_sample, x='height', nbins=30, color_discrete_sequence=['purple'])
fig_stratified = px.histogram(strat_sample, x='height', nbins=30, color_discrete_sequence=['green'])
fig_biased = px.histogram(biased_strat_sample, x="height", nbins=30)
#add legends
fig_pop.update_traces(name='Population', showlegend=True)
fig_random.update_traces(name='Random', showlegend=True)
fig_stratified.update_traces(name='Stratified', showlegend=True)
fig_biased.update_traces(name='Biased Stratified', showlegend=True)
# Combine the two histograms into one figure
fig_combined = go.Figure(data=[fig_pop['data'][0], fig_random['data'][0], fig_stratified['data'][0], fig_biased["data"][0]])
# Update layout for better visualization
fig_combined.update_layout(title_text='Distribution Comparison: Population vs. Random vs. Stratified vs. Biased Stratified Sample', xaxis_title='Heights in inches', yaxis_title='Frequency')
# Show the plot
fig_combined.show()
The height distribution looks reasonable, with a mean of roughly 74.5 inches. In the biased stratified sample, however, we kept only heights greater than the mean, so its distribution is shifted (skewed) to the right of the others.
Sampling Distribution of a Statistic¶
- Data Distribution: The frequency distribution of individual values in a data set. It provides insights into the spread, central tendency, and shape of the data, helping to understand its characteristics and make informed statistical analyses.
- Sampling Distribution: Sampling distribution is the distribution of a statistic (such as the mean or standard deviation) calculated from multiple samples drawn from the same population, providing insights into the variability of the statistic and aiding in statistical inference.
- Central limit theorem: The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample mean of a random variable will be approximately normally distributed, regardless of the shape of the original population distribution.
- Standard error: Standard error is a measure of the variability or spread of sample statistics, representing the standard deviation of the sampling distribution for a particular statistic (e.g., the mean), providing an estimate of how much the sample statistic is expected to vary from the true population parameter.
$$ \text{Standard error (SE)} = \frac{s}{\sqrt{n}} $$ where $s$ is the sample standard deviation and $n$ is the sample size.
Note: The standard error describes the variability of a sample statistic, whereas the standard deviation describes the variability of individual data values.
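As a quick sanity check (a sketch, not in the original notebook; the sample size n = 50 is arbitrary), the formula $s/\sqrt{n}$ should roughly agree with the spread of many simulated sample means:
n = 50
heights = players_data["height"]
# formula-based standard error from a single sample
one_sample = heights.sample(n, random_state=0)
se_formula = one_sample.std() / np.sqrt(n)
# empirical standard error: the SD of many simulated sample means
sim_means = [heights.sample(n).mean() for _ in range(2000)]
se_empirical = np.std(sim_means)
print(f"SE from formula:    {se_formula:.3f}")
print(f"SE from simulation: {se_empirical:.3f}")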
samples = range(10, 150, 25)
sample_means = {}
for size in samples:
    # draw 1000 samples of this size and record each sample's mean height
    sample_mean = []
    for i in range(1000):
        sample_mean.append(round(players_data["height"].sample(size).mean(), 2))
    sample_means[size] = sample_mean
    plt.hist(sample_mean)
plt.legend([f'{str(lbl)} samples' for lbl in samples])
plt.title("Central Limit Theorem")
Text(0.5, 1.0, 'Central Limit Theorem')
# compute the standard error for each sample size:
# the SD of the simulated sample means estimates the standard error of the mean
error = {}
for k, v in sample_means.items():
    error[str(k)] = pd.Series(v).std()
# Create a bar chart
fig = go.Figure()
fig.add_trace(go.Bar(x=list(error.keys()), y=list(error.values()), marker_color='orange'))
# Update layout
fig.update_layout(title='Standard Error for each sample size',
                  xaxis_title='Sample size',
                  yaxis_title='Error')
# Show the plot
fig.show()
In this figure, smaller sample sizes have a higher standard error, and larger sample sizes a lower one.
Confidence intervals¶
A confidence interval is a statistical range that expresses the uncertainty or margin of error around an estimated value, providing a level of confidence that the true value lies within that range.
Process:
Understand the Type of Data:
- Identify whether you are working with a sample or the entire population.
- Ensure that the data is approximately normally distributed or the sample size is sufficiently large for the Central Limit Theorem to apply.
Choose the Confidence Level:
- Decide on the confidence level you want (common choices are 95%, 99%).
Collect and Analyze the Data:
- Collect a sample of data.
Calculate the Sample Statistic of Interest, typically the Mean ($\bar{X}$), or another statistic of your choice (e.g., the median):
- Compute the mean of the sample data.
Determine the Sample Standard Deviation (s):
- Calculate the standard deviation of the sample data.
Find the Standard Error of the Mean (SE):
- Use the formula $SE = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation and $n$ is the sample size.
Determine the Critical Value (z or t):
- Based on the chosen confidence level and sample size, find the critical value from the standard normal distribution (z) or t-distribution.
Calculate the Margin of Error (ME):
- Multiply the standard error by the critical value to obtain the margin of error: $ME = \text{Critical Value} \times SE$.
Compute the Confidence Interval:
- Use the formula: $\text{Confidence Interval} = \bar{X} \pm \text{Margin of Error}$.
Interpret the Results:
- Communicate the confidence interval along with its interpretation. For example, if you have a 95% confidence interval, you can say, "We are 95% confident that the true population mean falls within this interval."
Note that the choice between the z-score and the t-score depends on the sample size. For large samples ($n \geq 30$), the z-score is often used; for smaller samples, the t-distribution is more appropriate. In the code below, however, we use the z-score when the sample is larger than thirty percent of the population.
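For reference (a compact sketch using the same 20% sample and 98% level as the call below), the t-based interval can also be obtained directly with scipy's `t.interval` helper, mirroring the small-sample branch of the function that follows:
sample = players_data["height"].sample(frac=0.2, random_state=45)
ci_level = 0.98
# t-based interval: mean ± critical value × standard error of the mean
t_low, t_high = stats.t.interval(ci_level, df=len(sample) - 1,
                                 loc=sample.mean(), scale=sample.sem())
print(f"{int(ci_level * 100)}% CI for the mean height: ({t_low:.2f}, {t_high:.2f}) inches")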
def compute_plot_ci(df: pd.Series, sample_size: float, ci: float):
    if sample_size > 1 or ci > 1:
        raise ValueError("check either sample size or CI value")
    sample_data = df.sample(frac=sample_size, random_state=45)
    sample_mean = sample_data.mean()
    # standard error
    se = sample_data.sem() if sample_size < 0.3 else sample_data.std()
    # Determine critical value (t for smaller samples, z otherwise)
    # t.ppf (percent point function) for Student's t-distribution
    # norm.ppf (percent point function) for the Normal (Gaussian) distribution
    critical_value = stats.t.ppf((1 + ci) / 2, df=len(sample_data)-1) if sample_size < 0.3 else stats.norm.ppf((1 + ci) / 2)
    # compute margin of error
    margin_of_error = critical_value * se
    # compute CI
    confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
    # plotting
    # Create a histogram
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=sample_data, nbinsx=20, opacity=0.7, marker_color='blue', name=sample_data.name))
    # Add vertical line for the mean
    fig.add_trace(go.Scatter(x=[sample_mean, sample_mean], y=[0, 12], mode='lines', line=dict(color='red'), name='Mean'))
    # Add shaded region for the confidence interval
    fig.add_trace(go.Scatter(x=[confidence_interval[0], confidence_interval[1], confidence_interval[1], confidence_interval[0]],
                             y=[0, 0, 12, 12], fill='toself', fillcolor='rgba(255,0,0,0.3)', line=dict(color='rgba(255,0,0,0)'),
                             name=f'{ci*100}% Confidence Interval'))
    # Update layout
    fig.update_layout(title=f'Height Confidence Interval, sample: {int(sample_size * len(df))} CI: ({round(confidence_interval[0], 2)} - {round(confidence_interval[1], 2)})',
                      xaxis_title=sample_data.name,
                      yaxis_title='Frequency')
    # Show the plot
    fig.show()
compute_plot_ci(players_data["height"], 0.2, 0.98)
- A CI is a way to present an estimate as an interval range rather than a single number.
- The lower the confidence level, the narrower the CI will be, as the example below illustrates.
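For example (an illustrative re-run of the function defined above; output not shown), dropping the confidence level from 98% to 80% noticeably narrows the shaded interval:
compute_plot_ci(players_data["height"], 0.2, 0.80)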
Normal Distribution | Gaussian¶
A symmetric probability distribution that forms a bell-shaped curve. It is defined by two parameters: the mean (center of the curve) and the standard deviation (spread of the curve). In a standard normal distribution, the mean is 0, and the standard deviation is 1. The distribution is characterized by its bell-shaped, continuous curve.
- Error: The difference between a data point and an average or predicted value.
- Standardize: Subtract the mean and divide by the standard deviation.
- z-score: A z-score is a statistical measure that quantifies the number of standard deviations a data point is from the mean of a dataset, providing a standardized way to assess the position of an individual data point within the distribution.
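A minimal sketch (not in the original notebook) of standardizing the heights and inspecting a few z-scores:
heights = players_data["height"]
# standardize: subtract the mean, divide by the standard deviation
z_scores = (heights - heights.mean()) / heights.std()
# z-scores of the tallest players (6-8, i.e. 80 inches)
print(z_scores[heights >= 80].round(2).head())
print(f"mean of z-scores: {z_scores.mean():.2f}, SD: {z_scores.std():.2f}")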
data = players_data["height"]
# compute mean and standard deviation
mean_val = np.mean(data)
std_dev = np.std(data)
# Set up figure and axis
fig, ax = plt.subplots(figsize=(15, 6))
# histogram
counts, bins, _ = plt.hist(data, bins=30, density=True, alpha=0.7, color='gray', label='Histogram')
# normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mean_val, std_dev)
plt.plot(x, p, 'k', linewidth=2, label='Normal Distribution')
# standard deviations
for i, color in zip(range(-3, 4), ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']):
    plt.axvline(mean_val + i * std_dev, color=color, linestyle='dashed', linewidth=1, alpha=0.7)
    # Markings for first, second, and third standard deviations
    if i == 1:
        plt.text(mean_val + i * std_dev, 0.25, '1st Std', color=color, rotation=90, verticalalignment='bottom')
    elif i == 2:
        plt.text(mean_val + i * std_dev, 0.25, '2nd Std', color=color, rotation=90, verticalalignment='bottom')
    elif i == 3:
        plt.text(mean_val + i * std_dev, 0.25, '3rd Std', color=color, rotation=90, verticalalignment='bottom')
# confidence intervals (68% and 95%)
conf_interval_68 = stats.norm.interval(0.68, loc=mean_val, scale=std_dev)
conf_interval_95 = stats.norm.interval(0.95, loc=mean_val, scale=std_dev)
plt.axvline(conf_interval_68[0], color='violet', linestyle='dashed', linewidth=2, label='68% Confidence Interval')
plt.axvline(conf_interval_68[1], color='violet', linestyle='dashed', linewidth=2)
plt.axvline(conf_interval_95[0], color='green', linestyle='dashed', linewidth=2, label='95% Confidence Interval')
plt.axvline(conf_interval_95[1], color='green', linestyle='dashed', linewidth=2)
# labels and legend
plt.xlabel('Height')
plt.ylabel('Density')
plt.title('Height Distribution with Standard Deviations and Confidence Intervals')
plt.legend()
plt.show()
QQ Plot¶
A QQ-Plot is used to visually determine how close a sample is to the normal distribution. The QQ-Plot orders the z-scores from low to high, and plots each value's z-score on the y-axis; the x-axis is the corresponding quantile of a normal distribution for that value's rank. Since the data are standardized, the units correspond to the number of standard deviations away from the mean. If the points roughly fall on the diagonal line, then the sample distribution can be considered close to normal.
# QQ plot
stats.probplot(players_data["height"], dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
# Show the plot
plt.show()
- The Gaussian distribution is essential, as it allows mathematical approximation of uncertainty and variability.
- Raw data are often not normally distributed and frequently contain errors.
Student's t-distribution¶
Student's t-distribution, in short, is a probability distribution used for estimating population parameters when the sample size is small and the population standard deviation is unknown; it is similar to the normal distribution but has heavier tails.
- $n$: sample size
- Degrees of freedom: A parameter that allows the t-distribution to adjust to different sample sizes, statistics, and number of groups.
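A quick sketch (not part of the original analysis) makes the "heavier tails" point concrete by overlaying t-densities for a few degrees of freedom on the standard normal:
x = np.linspace(-4, 4, 400)
plt.figure(figsize=(8, 4))
plt.plot(x, stats.norm.pdf(x), 'k', linewidth=2, label='Standard normal')
for dof in [2, 5, 30]:
    # heavier tails for small degrees of freedom; approaches the normal as dof grows
    plt.plot(x, stats.t.pdf(x, df=dof), linestyle='dashed', label=f't, df={dof}')
plt.title("Student's t-distribution vs. standard normal")
plt.xlabel('Standardized value')
plt.ylabel('Density')
plt.legend()
plt.show()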
Binomial Distribution¶
It models the probability of obtaining a specific number of successes in a fixed number of independent and identical Bernoulli trials, where each trial has two possible outcomes (success or failure) with a constant probability of success.
- Trial: An event with a discrete outcome.
- Success: The outcome of interest.
- Binomial distribution: The distribution of the number of successes in $x$ trials; the Bernoulli distribution is the special case of a single trial.
The probability of success is $p$ and the probability of failure is $1-p$, or vice versa.
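A small sketch of the binomial PMF with scipy (illustrative parameters: 10 trials, success probability 0.3), mirroring the style of the Poisson example below:
from scipy.stats import binom
n_trials, p_success = 10, 0.3
k_values = np.arange(0, n_trials + 1)
# P(exactly k successes in n_trials independent trials)
pmf_values = binom.pmf(k_values, n_trials, p_success)
plt.bar(k_values, pmf_values, color='skyblue', edgecolor='black')
plt.title(f'Example Binomial Distribution (n={n_trials}, p={p_success})')
plt.xlabel('Number of Successes (k)')
plt.ylabel('Probability')
plt.xticks(k_values)
plt.show()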
Poisson Distribution¶
(Events spread over time)
The probability of a specific number of events occurring in a fixed interval of time or space, given the average rate of occurrence, with events happening independently and at a constant average rate. For instance, the number of visitors arriving at a website, or the number of emails a person receives in an hour.
- Lambda ($\lambda$): The average rate of occurrence (e.g., the average number of emails per hour); for a Poisson distribution, this is also the variance.
- Poisson distribution: The frequency distribution of the number of events in sampled units of time or space.
- Exponential distribution: The frequency distribution of the time or distance from one event to the next event. It is commonly used to model the time until the next event in a sequence of events, and it is characterized by its memoryless property, meaning that the probability of an event occurring in the future is independent of the past.
- Weibull distribution: A generalized version of the exponential, in which the event rate is allowed to shift over time. The Weibull distribution is a continuous probability distribution that is often used to model the distribution of lifetimes of objects or the time until an event occurs. It is characterized by its flexibility to capture a variety of shapes, including exponential, normal, and heavy-tailed distributions.
from scipy.stats import poisson
# Average rate
lambda_value = 3
# Values of k (number of events)
k_values = np.arange(0, 11)
# PMF for each k
pmf_values = poisson.pmf(k_values, lambda_value)
# Plotting
plt.bar(k_values, pmf_values, color='skyblue', edgecolor='black')
plt.title('Example Poisson Distribution')
plt.xlabel('Number of Events (k)')
plt.ylabel('Probability')
plt.xticks(k_values)
plt.show()
from scipy.stats import expon
# Rate parameter
lambda_value = 0.5
# Values of x (time)
x_values = np.linspace(0, 10, 1000)
# Calculate PDF for each x
pdf_values = expon.pdf(x_values, scale=1/lambda_value)
# Plotting
plt.plot(x_values, pdf_values, color='orange', label=f'Exponential Distribution (lambda={lambda_value})')
plt.title('Example Exponential Distribution')
plt.xlabel('Time (x)')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
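The memoryless property mentioned above can be checked numerically (a quick sketch reusing `lambda_value` from the cell above): for an exponential variable, $P(X > s + t \mid X > s) = P(X > t)$.
def survival(x):
    # P(X > x) for an exponential distribution with rate lambda_value
    return expon.sf(x, scale=1/lambda_value)

s, t = 2.0, 3.0
conditional = survival(s + t) / survival(s)   # P(X > s + t | X > s)
unconditional = survival(t)                   # P(X > t)
print(f"P(X > s+t | X > s) = {conditional:.4f}")
print(f"P(X > t)           = {unconditional:.4f}")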
Estimating the failure rate¶
Estimating the failure rate involves determining the rate at which a system or process experiences failures over time. The failure rate is a crucial measure in reliability engineering and is often used to assess the reliability and maintainability of systems.
The failure rate ($\lambda$) is typically estimated using historical data or through observations during the operation of a system. There are several methods to estimate the failure rate:
Historical Data Analysis: Analyzing data from past failures can provide valuable insights into the frequency and pattern of failures. The failure rate can be estimated by dividing the number of observed failures by the total operating time.
$$ \lambda = \frac{\text{Number of Failures}}{\text{Total Operating Time}} $$
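For instance (hypothetical numbers, purely to illustrate the formula), 12 observed failures over 5,000 hours of operation give:
n_failures = 12                 # hypothetical count
total_operating_hours = 5_000   # hypothetical operating time
failure_rate = n_failures / total_operating_hours
print(f"Estimated failure rate: {failure_rate:.4f} failures per hour")  # 0.0024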
Maximum Likelihood Estimation (MLE): MLE is a statistical method that seeks the parameter values that maximize the likelihood of observing the given data. For failure rate estimation, MLE involves finding the rate that maximizes the likelihood function based on observed failure times.
Survival Analysis: Survival analysis techniques, such as the Kaplan-Meier estimator or Cox proportional hazards model, can be used to estimate the failure rate. These methods consider the time until an event (failure) occurs and account for censoring (incomplete observations).
Understanding the failure rate is crucial for predicting the reliability of systems, planning maintenance activities, and making informed decisions about system design and operation. It is a key parameter in reliability engineering and plays a vital role in ensuring the optimal performance and longevity of systems.
from lifelines import KaplanMeierFitter
# Generate data for time until failure (in hours)
np.random.seed(42)
failure_times = np.random.exponential(scale=50, size=100)
# survival function (complementary cumulative distribution function)
kmf = KaplanMeierFitter()
kmf.fit(durations=failure_times, event_observed=np.ones_like(failure_times))
kmf.plot_survival_function(show_censors=True)
plt.title('Survival Function (Time until Failure)')
plt.xlabel('Time (hours)')
plt.ylabel('Survival Probability')
plt.show()
# Estimate failure rate using MLE
lambda_mle = 1 / np.mean(failure_times)
print(f"Estimated Failure Rate (MLE): {lambda_mle:.4f} per hour")
# Alternatively, fit an exponential distribution and convert its scale to a rate
loc_fit, scale_fit = expon.fit(failure_times, floc=0)
lambda_fit = 1 / scale_fit  # expon.fit returns (loc, scale); the failure rate is 1/scale
print(f"Estimated Failure Rate (Exponential Fit): {lambda_fit:.4f} per hour")
Estimated Failure Rate (MLE): 0.0219 per hour Estimated Failure Rate (Exponential Fit): 0.0219 per hour
from scipy.stats import weibull_min
# Example data for Weibull distribution
x_values = np.linspace(0, 5, 1000)
# Plotting with different (lambda, k) combinations
plt.figure(figsize=(10, 6))
for lambda_val, k in [(1, 0.5), (1, 1), (0.5, 1.5), (2, 5)]:
    # convert the rate-style parameter lambda to the scale parameter beta
    beta = lambda_val**(-1/k)
    # Probability density function (PDF) for the Weibull distribution
    pdf_values = weibull_min.pdf(x_values, k, scale=beta)
    # Plotting
    plt.plot(x_values, pdf_values, label=f'(lambda, k) = ({lambda_val}, {k})')
plt.title('Weibull Distribution with Different (lambda, k) Combinations')
plt.xlabel('x')
plt.ylabel('Probability Density Function (PDF)')
plt.legend()
plt.show()
In this example, we use different combinations of $\lambda$ and $k$, computing the scale $\beta = \lambda^{-1/k}$ for each combination. Adjust the $\lambda$ and $k$ values to explore different shapes and scales of the Weibull distribution.
The plot illustrates the Weibull distribution with different combinations of the rate-style parameter ($\lambda$) and the shape parameter ($k$).
($\lambda$, $k$) = (1, 0.5):
- A shape parameter ($k$) below 1 results in a decreasing hazard function; the probability density is highest near zero and falls off gradually.
($\lambda$, $k$) = (1, 1):
- These parameters reduce to the standard exponential distribution ($k = 1$), which is often used as a baseline for comparison.
($\lambda$, $k$) = (0.5, 1.5):
- A smaller $\lambda$ (and hence a larger scale $\beta = \lambda^{-1/k}$) combined with a shape parameter ($k$) greater than 1 leads to a distribution with a peak and a slower decrease in probability density, indicating a higher probability of events occurring later in time.
($\lambda$, $k$) = (2, 5):
- A larger $\lambda$ and a shape parameter ($k$) much greater than 1 produce a pronounced, nearly symmetric peak concentrated around the scale $\beta$, so most events occur within a narrow window of time.
In summary, the plot demonstrates how different combinations of $\lambda$ and $k$ values influence the shape and scale of the Weibull distribution. Adjusting these parameters allows you to model distributions with varying characteristics, providing flexibility in reliability and survival analysis.
Summary¶
- For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution.
- In this scenario, you can also model the time or distance between one event and the next as an exponential distribution.
- A changing event rate over time (e.g., an increasing probability of device failure) can be modeled with the Weibull distribution.