Data and Sampling Distributions¶

Random sampling and sample bias¶

Random sampling is the process of selecting a subset of individuals or items from a larger population such that each member of the population has an equal chance of being chosen. Sample bias occurs when the individuals or elements selected for a sample are not representative of the larger population, leading to inaccurate and non-generalizable results.

Terms:

  • Sample: A subset of a larger dataset.
  • Population: The larger dataset.
  • N (n): Size of the population (sample).
  • Random sampling: Drawing elements into a sample at random from the population.
  • Stratified sampling: A sampling method in which the population is divided into distinct subgroups, or strata, and random samples are drawn independently from each stratum to ensure every segment of the population is represented.
  • Sample bias: A sample that misrepresents the population.

Notes:

  • Random sampling is fundamental in data science.
  • Data quality often matters more than data quantity, and random sampling can reduce bias and enable quality improvements that would otherwise be prohibitively expensive.

Selection Bias¶

Selection bias refers to a systematic error introduced into a study or sample when the selection of participants is not random or representative, leading to a distorted or unrepresentative sample that may not accurately reflect the larger population.

Terms:

  • Bias: Systematic error or deviation from the true value in a measurement or estimate, often resulting from flaws in the sampling, data collection, or analysis process, and potentially leading to inaccurate and non-representative results.
  • Data snooping: Data snooping refers to the practice of repeatedly analyzing data or testing hypotheses until a desired result is found, potentially leading to false or overly optimistic conclusions.
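
To make data snooping concrete, here is a minimal sketch (not part of the original analysis; the counts and the 0.05 threshold are illustrative only) that tests many purely random "features" against a random target. Even though nothing is truly related, a handful of tests come out "significant" by chance.

In [ ]:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_features = 100, 200

# purely random data: no feature is truly related to the target
target = rng.normal(size=n_obs)
features = rng.normal(size=(n_obs, n_features))

# test every feature against the target and count the "significant" ones
p_values = [stats.pearsonr(features[:, j], target)[1] for j in range(n_features)]
false_hits = sum(p < 0.05 for p in p_values)
print(f"'significant' correlations found by chance: {false_hits} of {n_features}")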
In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px
import plotly.graph_objects as go
import plotly.colors
import scipy.stats as stats
In [2]:
data_path = "./data/"
In [3]:
raw_players_data = pd.read_csv(f'{data_path}players.csv')
players_data = raw_players_data.copy()
In [4]:
print(players_data.shape)
players_data.head(15)
(1683, 7)
Out[4]:
nflId height weight birthDate collegeName position displayName
0 25511 6-4 225 1977-08-03 Michigan QB Tom Brady
1 29550 6-4 328 1982-01-22 Arkansas T Jason Peters
2 29851 6-2 225 1983-12-02 California QB Aaron Rodgers
3 30842 6-6 267 1984-05-19 UCLA TE Marcedes Lewis
4 33084 6-4 217 1985-05-17 Boston College QB Matt Ryan
5 33099 6-6 245 1985-01-16 Delaware QB Joe Flacco
6 33107 6-4 315 1985-08-30 Virginia Tech T Duane Brown
7 33130 5-10 175 1986-12-01 California WR DeSean Jackson
8 33131 6-8 300 1986-09-01 Miami DE Calais Campbell
9 33138 6-3 222 1985-07-02 Michigan QB Chad Henne
10 34452 6-3 220 1988-02-07 Georgia QB Matthew Stafford
11 34754 6-0 229 1986-10-07 Missouri QB Chase Daniel
12 34843 6-2 215 1985-10-13 Michigan State QB Brian Hoyer
13 35443 6-5 320 1988-07-19 Oklahoma T Trent Williams
14 35449 6-3 304 1987-05-12 California NT Tyson Alualu
In [5]:
players_data.dtypes.sort_values()
Out[5]:
nflId           int64
weight          int64
height         object
birthDate      object
collegeName    object
position       object
displayName    object
dtype: object
In [6]:
players_data["height"].unique()
Out[6]:
array(['6-4', '6-2', '6-6', '5-10', '6-8', '6-3', '6-0', '6-5', '6-1',
       '5-9', '5-11', '5-8', '6-7', '6-9', '5-6', '5-7'], dtype=object)
In [7]:
def height_to_inches(h):
    """Convert a 'feet-inches' string such as '6-4' into total inches."""
    feet, inches = h.split("-")
    try:
        return int(feet) * 12 + int(inches)
    except ValueError:
        raise ValueError(f"height must be a 'feet-inches' string, got {h!r}")
In [8]:
players_data["height"] = players_data["height"].apply(height_to_inches)
players_data['height'].sample()
Out[8]:
1540    77
Name: height, dtype: int64
In [9]:
# Random sampling the data 
r_sample = players_data.sample(frac=0.3, random_state=45)
print(f'random sample size: {r_sample.shape}')
r_sample.head()
random sample size: (505, 7)
Out[9]:
nflId height weight birthDate collegeName position displayName
780 47801 75 305 1995-06-20 North Carolina State C Garrett Bradbury
378 43510 76 305 1993-03-15 Illinois C Ted Karras
1034 52457 76 227 1998-07-07 Notre Dame WR Chase Claypool
690 46284 77 317 1994-11-24 Alabama C Bradley Bozeman
1570 54607 71 175 NaN South Carolina State CB Cobie Durant
In [10]:
# stratified sampling: draw 30% within each position group
def stratify_sample(stratum): return stratum.sample(frac=0.3, random_state=45)

def biased_stratify_sample(stratum):
    # bias: keep only heights above 74 inches (roughly the overall mean of ~74.5)
    stratum = stratum.loc[stratum["height"] > 74, "height"]
    return stratum.sample(frac=0.3, random_state=45)
In [11]:
strat_sample = players_data.groupby("position").apply(stratify_sample)
biased_strat_sample = players_data.groupby("position").apply(biased_stratify_sample)
print(f"stratified sample size: {strat_sample.shape}")
strat_sample.sample(10)
stratified sample size: (503, 7)
Out[11]:
nflId height weight birthDate collegeName position displayName
position
TE 345 43399 78 257 1993-01-01 Western Kentucky TE Tyler Higbee
SS 606 46122 71 205 1995-09-16 North Carolina SS M.J. Stewart
DE 466 44892 76 266 1994-03-18 Ohio DE Tarell Basham
G 248 42477 76 310 1992-05-03 West Virginia G Mark Glowinski
ILB 405 44174 73 236 1993-02-22 Texas Tech ILB Sam Eguavoen
QB 867 47916 75 214 1996-08-08 Auburn QB Jarrett Stidham
OLB 1270 53481 73 215 NaN Notre Dame OLB Jeremiah Owusu-Koramoah
T 843 47874 78 307 1996-09-05 Sioux Falls T Trey Pipkins
G 1114 52557 76 300 1996-06-19 Ball State G Danny Pinter
OLB 139 41269 75 250 1991-03-26 Brigham Young OLB Kyle Van Noy
In [12]:
# Create histograms using Plotly Express
fig_pop = px.histogram(players_data, x="height", nbins=30, color_discrete_sequence=['orange'])
fig_random = px.histogram(r_sample, x='height', nbins=30,  color_discrete_sequence=['purple'])
fig_stratified = px.histogram(strat_sample, x='height', nbins=30, color_discrete_sequence=['green'])
fig_biased = px.histogram(biased_strat_sample, x="height", nbins=30)

#add legends 
fig_pop.update_traces(name='Population', showlegend=True)
fig_random.update_traces(name='Random', showlegend=True)
fig_stratified.update_traces(name='Stratified', showlegend=True)
fig_biased.update_traces(name='Biased Stratified', showlegend=True)

# Combine the four histograms into one figure
fig_combined = go.Figure(data=[fig_pop['data'][0], fig_random['data'][0], fig_stratified['data'][0], fig_biased["data"][0]])

# Update layout for better visualization
fig_combined.update_layout(title_text='Distribution Comparison: Population vs. Random vs. Stratified vs. Biased Stratified Sample', xaxis_title='Heights in inches', yaxis_title='Frequency')

# Show the plot
fig_combined.show()

The height distribution looks roughly normal, with a mean of about 75 inches. In the biased stratified sample, however, we kept only heights above the mean, so its distribution is shifted to the right relative to the population.

Sampling Distribution of a Statistic¶

  • Data Distribution: The frequency distribution of individual values in a data set. It provides insights into the spread, central tendency, and shape of the data, helping to understand its characteristics and make informed statistical analyses.
  • Sampling Distribution: Sampling distribution is the distribution of a statistic (such as the mean or standard deviation) calculated from multiple samples drawn from the same population, providing insights into the variability of the statistic and aiding in statistical inference.
  • Central limit theorem: The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample mean of a random variable will be approximately normally distributed, regardless of the shape of the original population distribution.
  • Standard error: Standard error is a measure of the variability or spread of sample statistics, representing the standard deviation of the sampling distribution for a particular statistic (e.g., the mean), providing an estimate of how much the sample statistic is expected to vary from the true population parameter.

$$\text{Standard error (SE)} = \frac{s}{\sqrt{n}}$$ where $s$ is the sample standard deviation and $n$ is the sample size.

Note: The standard error describes the variability of a sample statistic, whereas the standard deviation describes the variability of individual data values.
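
As a quick check of the formula (a small sketch, not in the original notebook, using the height column loaded above), the analytic estimate $s/\sqrt{n}$ should roughly match the empirical spread of many independent sample means:

In [ ]:
n = 50
heights = players_data["height"]

# analytic standard error from a single sample of size n
one_sample = heights.sample(n, random_state=45)
analytic_se = one_sample.std() / np.sqrt(n)   # equivalent to one_sample.sem()

# empirical standard error: the standard deviation of many sample means
means = [heights.sample(n).mean() for _ in range(2000)]
empirical_se = np.std(means)

print(f"analytic SE: {analytic_se:.3f}, empirical SE: {empirical_se:.3f}")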

In [13]:
sample_sizes = range(10, 150, 25)
sample_means = {}
for size in sample_sizes:
    # draw 1,000 samples of this size and record each sample mean
    means = [round(players_data["height"].sample(size).mean(), 2) for _ in range(1000)]
    sample_means[size] = means
    plt.hist(means, alpha=0.6)
plt.legend([f'sample size {size}' for size in sample_sizes])
plt.title("Central Limit Theorem")
Out[13]:
Text(0.5, 1.0, 'Central Limit Theorem')
[Figure: overlaid histograms of 1,000 sample means for each sample size (Central Limit Theorem)]
In [14]:
# compute the standard error for each sample size:
# the standard deviation of the 1,000 sample means drawn at that size
error = {}
for k, v in sample_means.items():
    error[str(k)] = pd.Series(v).std()


# Create a bar chart
fig = go.Figure()

fig.add_trace(go.Bar(x=list(error.keys()), y=list(error.values()), marker_color='orange'))

# Update layout
fig.update_layout(title='Standard error for each sample size',
                  xaxis_title='Sample size',
                  yaxis_title='Error')

# Show the plot
fig.show()

In this figure, smaller sample sizes have a higher standard error, and larger sample sizes a lower one.

Confidence intervals¶

A confidence interval is a statistical range that expresses the uncertainty or margin of error around an estimated value, providing a level of confidence that the true value lies within that range.

Process:

  1. Understand the Type of Data:

    • Identify whether you are working with a sample or the entire population.
    • Ensure that the data is approximately normally distributed or the sample size is sufficiently large for the Central Limit Theorem to apply.
  2. Choose the Confidence Level:

    • Decide on the confidence level you want (common choices are 95%, 99%).
  3. Collect and Analyze the Data:

    • Collect a sample of data.
  4. Calculate the Sample Mean ($\bar{X}$) (or another statistic of your choice, such as the median):

    • Compute the mean of the sample data.
  5. Determine the Sample Standard Deviation (s):

    • Calculate the standard deviation of the sample data.
  6. Find the Standard Error of the Mean (SE):

    • Use the formula $(SE = \frac{s}{\sqrt{n}})$, where $(s)$ is the sample standard deviation and $(n)$ is the sample size.
  7. Determine the Critical Value (z or t):

    • Based on the chosen confidence level and sample size, find the critical value from the standard normal distribution (z) or t-distribution.
  8. Calculate the Margin of Error (ME):

    • Multiply the standard error by the critical value to obtain the margin of error: $ME = \text{Critical Value} \times SE$.
  9. Compute the Confidence Interval:

    • Use the formula: $(\text{Confidence Interval} = \bar{X} \pm \text{Margin of Error})$.
  10. Interpret the Results:

    • Communicate the confidence interval along with its interpretation. For example, if you have a 95% confidence interval, you can say, "We are 95% confident that the true population mean falls within this interval."

Note that the choice between the z-score and the t-score depends on the sample size. For large sample sizes ($n \geq 30$), the z-score is often used; for smaller sample sizes, the t-distribution is more appropriate. In the code below, however, we switch to the z-score when the sample is larger than thirty percent of the population.

In [35]:
def compute_plot_ci(df: pd.Series, sample_size: float, ci: float):
    if sample_size > 1 or ci > 1: raise ValueError("sample_size and ci must both be fractions between 0 and 1")
    sample_data = df.sample(frac=sample_size, random_state=45)
    sample_mean = sample_data.mean()
    # standard error of the mean (used for the margin of error in both cases)
    se = sample_data.sem()
    
    # Determine critical value (t for small samples, z for large ones)
    # t.ppf (percent point function) for Student's t-distribution
    # norm.ppf (percent point function) for the Normal (Gaussian) distribution
    critical_value = stats.t.ppf((1 + ci) / 2, df=len(sample_data) - 1) if sample_size < 0.3 else stats.norm.ppf((1 + ci) / 2)
    
    # compute margin of error
    margin_of_error = critical_value * se
    
    #compute CI
    confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
    
    # plotting 

    # Create a histogram
    fig = go.Figure()

    fig.add_trace(go.Histogram(x=sample_data, nbinsx=20, opacity=0.7, marker_color='blue', name=sample_data.name))

    # Add vertical line for the mean
    fig.add_trace(go.Scatter(x=[sample_mean, sample_mean], y=[0, 12], mode='lines', line=dict(color='red'), name='Mean'))

    # Add shaded region for the confidence interval
    fig.add_trace(go.Scatter(x=[confidence_interval[0], confidence_interval[1], confidence_interval[1], confidence_interval[0]],
                            y=[0, 0, 12, 12], fill='toself', fillcolor='rgba(255,0,0,0.3)', line=dict(color='rgba(255,0,0,0)'),
                            name=f'{ci*100}% Confidence Interval'))

    # Update layout
    fig.update_layout(title=f'Height Confidence Interval, sample: {int(sample_size * len(df))} CI: ({round(confidence_interval[0], 2)} -{round(confidence_interval[1], 2)})',
                    xaxis_title=sample_data.name,
                    yaxis_title='Frequency')

    # Show the plot
    fig.show()
    
    
In [37]:
compute_plot_ci(players_data["height"], 0.2, 0.98)
  • A CI presents an estimate as an interval range rather than a single number.
  • The lower the confidence level, the narrower the interval (see the sketch below).
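
To see the second point numerically, here is a small sketch (not part of the original notebook) that reuses the same 20% height sample and the t-based interval from above, printing the interval at a few confidence levels; the width shrinks as the confidence level drops.

In [ ]:
sample = players_data["height"].sample(frac=0.2, random_state=45)
se = sample.sem()

for level in [0.99, 0.95, 0.90, 0.80]:
    t_crit = stats.t.ppf((1 + level) / 2, df=len(sample) - 1)
    moe = t_crit * se
    print(f"{int(level * 100)}% CI: {sample.mean() - moe:.2f} to {sample.mean() + moe:.2f} (width {2 * moe:.2f})")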

Normal Distribution | Gaussian¶

A symmetric probability distribution that forms a bell-shaped curve. It is defined by two parameters: the mean (center of the curve) and the standard deviation (spread of the curve). In a standard normal distribution, the mean is 0, and the standard deviation is 1. The distribution is characterized by its bell-shaped, continuous curve.

  • Error: The difference between a data point and an average or predicted value.
  • Standardize: Subtract the mean and divide by the standard deviation.
  • z-score: A statistical measure that quantifies the number of standard deviations a data point lies from the mean of a dataset, providing a standardized way to assess the position of an individual data point within the distribution (see the short sketch below).
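
As a short sketch of standardizing (added here for illustration, using the height column from above), the z-score of each value is simply (value - mean) / standard deviation:

In [ ]:
heights = players_data["height"]

# standardize: subtract the mean, divide by the standard deviation
z_scores = (heights - heights.mean()) / heights.std()

print(f"mean of z-scores: {z_scores.mean():.4f} (should be ~0)")
print(f"std of z-scores:  {z_scores.std():.4f} (should be ~1)")
print(f"z-score of an 80-inch (6-8) player: {(80 - heights.mean()) / heights.std():.2f}")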
In [63]:
data = players_data["height"]

# compute mean and standard deviation
mean_val = np.mean(data)
std_dev = np.std(data)

# Set up figure and axis
fig, ax = plt.subplots(figsize=(15, 6))

#  histogram
counts, bins, _ = plt.hist(data, bins=30, density=True, alpha=0.7, color='gray', label='Histogram')

# normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mean_val, std_dev)
plt.plot(x, p, 'k', linewidth=2, label='Normal Distribution')

# standard deviations
for i, color in zip(range(-3, 4), ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']):
    plt.axvline(mean_val + i * std_dev, color=color, linestyle='dashed', linewidth=1, alpha=0.7)
    # Markings for first, second, and third standard deviations
    if i == 1:
        plt.text(mean_val + i * std_dev, 0.25, '1st Std', color=color, rotation=90, verticalalignment='bottom')
    elif i == 2:
        plt.text(mean_val + i * std_dev, 0.25, '2nd Std', color=color, rotation=90, verticalalignment='bottom')
    elif i == 3:
        plt.text(mean_val + i * std_dev, 0.25, '3rd Std', color=color, rotation=90, verticalalignment='bottom')

# confidence intervals (68% and 95%)
conf_interval_68 = stats.norm.interval(0.68, loc=mean_val, scale=std_dev)
conf_interval_95 = stats.norm.interval(0.95, loc=mean_val, scale=std_dev)
plt.axvline(conf_interval_68[0], color='violet', linestyle='dashed', linewidth=2, label='68% Confidence Interval')
plt.axvline(conf_interval_68[1], color='violet', linestyle='dashed', linewidth=2)
plt.axvline(conf_interval_95[0], color='green', linestyle='dashed', linewidth=2, label='95% Confidence Interval')
plt.axvline(conf_interval_95[1], color='green', linestyle='dashed', linewidth=2)

# labels and legend
plt.xlabel('Height')
plt.ylabel('Density')
plt.title('Height Distribution with Standard Deviations and Confidence Intervals')
plt.legend()

plt.show()
[Figure: height distribution with fitted normal curve, standard deviation markers, and 68%/95% confidence intervals]

QQ Plot¶

A QQ-Plot is used to visually determine how close a sample is to the normal distribution. The QQ-Plot orders the z-scores from low to high and plots each value's z-score on the y-axis; the x-axis is the corresponding quantile of a normal distribution for that value's rank. Since the data is normalized, the units correspond to the number of standard deviations away from the mean. If the points roughly fall on the diagonal line, the sample distribution can be considered close to normal.

In [68]:
# QQ plot 
stats.probplot(players_data["height"],  dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')

# Show the plot
plt.show()
[Figure: Q-Q plot of player heights against theoretical normal quantiles]
  • The Gaussian distribution is essential because it allows mathematical approximation of uncertainty and variability.
  • Raw data are often not normally distributed and are frequently error-prone.

Student's t-distribution¶

Student's t-distribution, in short, is a probability distribution used for estimating population parameters when the sample size is small and the population standard deviation is unknown; it is similar to the normal distribution but has heavier tails.

  • $n$: sample size
  • Degrees of freedom: A parameter that allows the t-distribution to adjust to different sample sizes, statistics, and number of groups.
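
This section has no accompanying cell, so here is a small sketch (relying on the scipy/matplotlib imports from the first cell; the degrees-of-freedom values are chosen only for illustration) comparing the t-distribution's heavier tails with the standard normal:

In [ ]:
x = np.linspace(-4, 4, 400)

plt.figure(figsize=(8, 4))
plt.plot(x, stats.norm.pdf(x), 'k', linewidth=2, label='Standard normal')
for dof in [2, 5, 30]:
    # smaller degrees of freedom -> heavier tails
    plt.plot(x, stats.t.pdf(x, df=dof), linestyle='dashed', label=f't-distribution, df={dof}')
plt.title("Student's t-distribution vs. standard normal")
plt.xlabel('z')
plt.ylabel('Density')
plt.legend()
plt.show()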

Binomial Distribution¶

It models the probability of obtaining a specific number of successes in a fixed number of independent and identical Bernoulli trials, where each trial has two possible outcomes (success or failure) with a constant probability of success.

  • Trial: An event with a discrete outcome.
  • Success: The outcome of interest.
  • Binomial distribution: The distribution of the number of successes in $x$ trials; the single-trial case is known as the Bernoulli distribution.

Each trial has success probability $p$ (positive outcome) and failure probability $1-p$ (negative outcome), or vice versa.
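
Below is a small sketch of the binomial PMF with scipy (the trial count and success probability are arbitrary choices for illustration): the probability of observing exactly k successes in n independent trials with success probability p.

In [ ]:
from scipy.stats import binom

n_trials, p_success = 10, 0.3
k = np.arange(0, n_trials + 1)

# P(exactly k successes in n_trials trials)
pmf_values = binom.pmf(k, n_trials, p_success)

plt.bar(k, pmf_values, color='skyblue', edgecolor='black')
plt.title(f'Example Binomial Distribution (n={n_trials}, p={p_success})')
plt.xlabel('Number of Successes (k)')
plt.ylabel('Probability')
plt.xticks(k)
plt.show()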

Poisson Distribution¶

(Events spread over time)

The probability of a specific number of events occurring in a fixed interval of time or space, given the average rate of occurrence, with events happening independently and at a constant average rate. For instance, visitors arriving at a website, and number of emails a person receives in an hour.

  • Lambda ($\lambda$): The average rate of occurrence (e.g., the average number of emails per hour); it is also the variance of the distribution.
  • Poisson distribution: The frequency distribution of the number of events in sampled units of time or space.
  • Exponential distribution: The frequency distribution of the time or distance from one event to the next event. It is commonly used to model the time until the next event in a sequence of events, and it is characterized by its memoryless property, meaning that the probability of an event occurring in the future is independent of the past.
  • Weibull distribution: A generalized version of the exponential, in which the event rate is allowed to shift over time. The Weibull distribution is a continuous probability distribution that is often used to model the distribution of lifetimes of objects or the time until an event occurs. It is characterized by its flexibility to capture a variety of shapes, including exponential, normal, and heavy-tailed distributions.
In [70]:
from scipy.stats import poisson

# Average rate
lambda_value = 3

# Values of k (number of events)
k_values = np.arange(0, 11)

# PMF for each k
pmf_values = poisson.pmf(k_values, lambda_value)

# Plotting
plt.bar(k_values, pmf_values, color='skyblue', edgecolor='black')
plt.title('Example Poisson Distribution')
plt.xlabel('Number of Events (k)')
plt.ylabel('Probability')
plt.xticks(k_values)
plt.show()
[Figure: example Poisson distribution, PMF for lambda = 3]
In [73]:
from scipy.stats import expon

# Rate parameter
lambda_value = 0.5

# Values of x (time)
x_values = np.linspace(0, 10, 1000)

# Calculate PDF for each x
pdf_values = expon.pdf(x_values, scale=1/lambda_value)

# Plotting
plt.plot(x_values, pdf_values, color='orange', label=f'Exponential Distribution (lambda={lambda_value})')
plt.title('Example Exponential Distribution')
plt.xlabel('Time (x)')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
[Figure: example exponential distribution, PDF for lambda = 0.5]

Estimating the failure rate¶

Estimating the failure rate involves determining the rate at which a system or process experiences failures over time. The failure rate is a crucial measure in reliability engineering and is often used to assess the reliability and maintainability of systems.

The failure rate ($\lambda$) is typically estimated using historical data or through observations during the operation of a system. There are several methods to estimate the failure rate:

  1. Historical Data Analysis: Analyzing data from past failures can provide valuable insights into the frequency and pattern of failures. The failure rate can be estimated by dividing the number of observed failures by the total operating time.

    $$ \lambda = \frac{\text{Number of Failures}}{\text{Total Operating Time}} $$

  2. Maximum Likelihood Estimation (MLE): MLE is a statistical method that seeks the parameter values that maximize the likelihood of observing the given data. For failure rate estimation, MLE involves finding the rate that maximizes the likelihood function based on observed failure times.

  3. Survival Analysis: Survival analysis techniques, such as the Kaplan-Meier estimator or Cox proportional hazards model, can be used to estimate the failure rate. These methods consider the time until an event (failure) occurs and account for censoring (incomplete observations).

Understanding the failure rate is crucial for predicting the reliability of systems, planning maintenance activities, and making informed decisions about system design and operation. It is a key parameter in reliability engineering and plays a vital role in ensuring the optimal performance and longevity of systems.

In [76]:
from lifelines import KaplanMeierFitter

# Generate  data for time until failure (in hours)
np.random.seed(42)
failure_times = np.random.exponential(scale=50, size=100)

# survival function (complementary cumulative distribution function)
kmf = KaplanMeierFitter()
kmf.fit(durations=failure_times, event_observed=np.ones_like(failure_times))
kmf.plot_survival_function(show_censors=True)
plt.title('Survival Function (Time until Failure)')
plt.xlabel('Time (hours)')
plt.ylabel('Survival Probability')
plt.show()

# Estimate failure rate using MLE
lambda_mle = 1 / np.mean(failure_times)
print(f"Estimated Failure Rate (MLE): {lambda_mle:.4f} per hour")

# Alternatively, fit an exponential distribution and convert its scale (1/lambda) to a rate
_, scale_fit = expon.fit(failure_times, floc=0)
print(f"Estimated Failure Rate (Exponential Fit): {1 / scale_fit:.4f} per hour")
[Figure: Kaplan-Meier survival function (time until failure)]
Estimated Failure Rate (MLE): 0.0219 per hour
Estimated Failure Rate (Exponential Fit): 0.0219 per hour
In [85]:
from scipy.stats import weibull_min

# Example data for Weibull distribution
x_values = np.linspace(0, 5, 1000)

# Plotting with different (lambda, k) combinations
plt.figure(figsize=(10, 6))

for lambda_val, k in [(1, 0.5), (1, 1), (0.5, 1.5), (2, 5)]:
    beta = lambda_val**(-1/k)
    
    # Probability density function (PDF) for Weibull distribution
    pdf_values = weibull_min.pdf(x_values, k, scale=beta)

    # Plotting
    plt.plot(x_values, pdf_values, label=f'(lambda, k) = ({lambda_val}, {k})')

plt.title('Weibull Distribution with Different (lambda, k) Combinations')
plt.xlabel('x')
plt.ylabel('Probability Density Function (PDF)')
plt.legend()
plt.show()
[Figure: Weibull PDFs for different (lambda, k) combinations]

In this example, we use different combinations of $\lambda$ and $k$; for each combination, the scale passed to weibull_min is $\beta = \lambda^{-1/k}$. Adjust the $\lambda$ and $k$ values to explore different shapes and scales of the Weibull distribution.

The plot illustrates the Weibull distribution with different combinations of the scale parameter ($\lambda$) and the shape parameter ($k$).

  1. ($\lambda$, $k$) = (1, 0.5):

    • A shape parameter ($k$) below 1 gives a decreasing hazard function; the probability density is highest near zero and falls off gradually.
  2. ($\lambda$, $k$) = (1, 1):

    • The scale and shape parameters result in the standard exponential distribution ($k = 1$). The distribution is often used as a baseline for comparison.
  3. ($\lambda$, $k$) = (0.5, 1.5):

    • A smaller rate ($\lambda$), which under this parameterization gives a larger effective scale, combined with a shape parameter ($k$) greater than 1, produces a peaked distribution whose density decreases slowly after the peak. This indicates a higher probability of events occurring later in time.
  4. ($\lambda$, $k$) = (2, 5):

    • A shape parameter ($k$) much greater than 1 results in a sharply peaked, nearly symmetric distribution concentrated around its scale, with a rapidly increasing hazard; very early and very late events are both unlikely.

In summary, the plot demonstrates how different combinations of $\lambda$ and $k$ values influence the shape and scale of the Weibull distribution. Adjusting these parameters allows you to model distributions with varying characteristics, providing flexibility in reliability and survival analysis.

Summary¶

  • For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution.
  • In this scenario, you can also model the time or distance between one event and the next as an exponential distribution.
  • A changing event rate over time (e.g., an increasing probability of device failure) can be modeled with the Weibull distribution.