Machine Learning Foundations: Statistics

Below are important statistics terms for machine learning:

bias: the difference between an estimator’s expected value and the true value of the parameter it is trying to estimate (example: a model that always predicts the average price for every house, regardless of size or location, has high bias because it’s too simple to capture important patterns)
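
To make the estimator sense of bias concrete, here is a minimal simulation sketch. It assumes data drawn from a standard normal distribution (true variance 1) and compares two variance estimators; the function names, seed, and trial counts are illustrative choices, not part of any library.

```python
import random

def variance_divide_by_n(sample):
    """Variance estimator that divides by n (biased for small samples)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

def variance_divide_by_n_minus_1(sample):
    """Variance estimator that divides by n - 1 (unbiased)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

rng = random.Random(42)   # fixed seed so the run is repeatable
trials = 20000            # assumed number of repeated experiments
n = 5                     # assumed sample size per experiment

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1.0
    biased_total += variance_divide_by_n(sample)
    unbiased_total += variance_divide_by_n_minus_1(sample)

# Averaging over many experiments approximates each estimator's expected value
avg_biased = biased_total / trials
avg_unbiased = unbiased_total / trials
print(f"divide by n:     {avg_biased:.3f}")
print(f"divide by n - 1: {avg_unbiased:.3f}")
```

Under these assumptions the divide-by-n estimator should average close to (n - 1)/n = 0.8 rather than 1, and that gap is its bias.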

correlation: a statistical measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1 (example: time spent on a website might be positively correlated with purchases)

covariance: the degree to which two random variables change together; a positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates an inverse relationship (example: if the number of products viewed and the total purchase amount both increase in user data, the covariance between these variables will be positive)
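
As a minimal sketch, the snippet below computes the sample covariance and the Pearson correlation for a small made-up data set of products viewed and purchase amounts. The helper names and the numbers are illustrative assumptions.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance using the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    std_x = math.sqrt(sample_covariance(xs, xs))
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

# Illustrative (made-up) user data
products_viewed = [1, 2, 3, 4, 5]
purchase_amount = [12.0, 18.0, 33.0, 41.0, 52.0]

cov = sample_covariance(products_viewed, purchase_amount)
corr = pearson_correlation(products_viewed, purchase_amount)
print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")
```

Covariance depends on the units of both variables, while correlation is unitless, which is why correlation is usually easier to compare across feature pairs.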

dependent variable: the variable your model is trying to predict; its value is affected by other variables (example: in predicting house prices, the price is the dependent variable)

hypothesis testing: a method for deciding whether an observed result is statistically significant or could plausibly be explained by random chance (example: A/B testing in model deployment helps decide if version B leads to significantly more clicks than version A)
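
One common way to run such an A/B comparison is a two-proportion z-test. The sketch below is a hand-rolled version using only the standard library; the function name and the click counts are hypothetical, and it assumes samples large enough for the normal approximation to be reasonable.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Made-up counts: 200 of 2000 users clicked on A, 260 of 2000 clicked on B
z, p_value = two_proportion_z_test(clicks_a=200, n_a=2000, clicks_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (for example below a pre-chosen threshold such as 0.05) suggests the difference is unlikely to be chance alone; the threshold itself is a convention chosen by the analyst.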

independent variable: a variable that helps predict the dependent variable (example: square footage is an independent variable when predicting house prices)

interquartile range (IQR): the difference between the upper and lower quartile values in the set of data; covers the middle 50% of your data, from 25th to 75th percentile (example: used to detect outliers in income prediction datasets)
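
A minimal sketch of IQR-based outlier detection follows. It assumes the widely used 1.5 × IQR rule of thumb and the "inclusive" quartile method from Python's statistics module; other quartile conventions give slightly different cut points. The income values are made up.

```python
import statistics

# Made-up incomes in thousands; the last value is suspiciously large
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 250]

# quantiles(..., n=4) returns the three quartile cut points: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Rule-of-thumb fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"outliers: {outliers}")
```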

least squares method: a standard approach for estimating the parameters of a model by minimizing the sum of the squared differences between the observed values and the model's predicted values (example: in predicting product prices based on features like size and weight, least squares adjusts the model so that the predicted prices are as close as possible to real prices)

linear regression model: a statistical model that expresses the relationship between a dependent variable and one or more independent variables as a linear equation (example: in modeling user engagement on a website, linear regression can predict session duration based on features like time of day and number of clicks)
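
For a single independent variable, the least squares fit of a linear regression has a closed-form solution. The sketch below implements it directly; the function name and the house data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope minimizes the sum of squared residuals
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: square footage and price in thousands
square_feet = [800, 1000, 1200, 1500, 2000]
price = [160, 195, 240, 310, 395]

slope, intercept = fit_line(square_feet, price)
predicted = slope * 1300 + intercept
print(f"price = {slope:.4f} * sqft + {intercept:.2f}")
print(f"predicted price for 1300 sqft: {predicted:.1f}")
```

With several independent variables the same idea applies, but the solution is usually computed with a linear algebra routine rather than by hand.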

mean: the sum of the values in a sample divided by the number of values

median: the middle value when the values are sorted (for an even number of values, the average of the two middle values)

mode: the most frequently occurring value
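
As a quick illustration, Python's standard statistics module computes the mean, median, and mode directly. The list of values below is made up.

```python
import statistics

# Made-up prediction errors
errors = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(errors)      # sum of values / number of values
median_value = statistics.median(errors)  # average of the two middle values here
mode_value = statistics.mode(errors)      # most frequent value

print(mean_value, median_value, mode_value)
```

The mean is pulled toward the large value 10, while the median is not, which is why the median is often preferred for skewed data.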

multinoulli distribution: a probability distribution over a finite set of discrete outcomes, exactly one of which occurs per trial; also called the categorical distribution (example: the softmax output of a classifier that predicts one label out of dog, cat, or bird)
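
The sketch below turns made-up classifier scores into a multinoulli distribution with softmax and draws one label from it. The logits, the seed, and the helper name are assumptions for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "cat", "bird"]
logits = [2.0, 1.0, 0.1]  # made-up classifier scores

probs = softmax(logits)

# Draw exactly one outcome according to those probabilities
rng = random.Random(0)
sample = rng.choices(labels, weights=probs, k=1)[0]

print([round(p, 3) for p in probs], sample)
```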

outlier: a data point that is very different from others (example: a customer who spends 100× more than others might be noise or might be a VIP)

percentile: the value below which a certain percentage of data falls (example: the 90th percentile prediction error means 90% of errors are smaller than that value)
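
There are several conventions for computing percentiles. The sketch below uses the simple nearest-rank method, under which at least the given percentage of values is less than or equal to the returned value; the function name and the error values are illustrative.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Made-up absolute prediction errors
errors = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 3.0]

p90 = percentile_nearest_rank(errors, 90)
print(f"90th percentile error: {p90}")
```

Libraries often interpolate between neighboring values instead, so their results can differ slightly from the nearest-rank answer.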

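percentage change: how much a value increases or decreases relative to its starting value, expressed as a percent (example: “model accuracy improved by 10%” read as a relative change means accuracy went from, say, 80% to 88%; a move from 80% to 90% is an improvement of 10 percentage points)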
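
Because relative change and percentage points are easy to confuse when the quantity is itself a percentage, here is a short sketch computing both for made-up accuracy values. The function name is hypothetical.

```python
def percentage_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

old_accuracy = 0.80  # made-up accuracy before the change
new_accuracy = 0.88  # made-up accuracy after the change

relative_change = percentage_change(old_accuracy, new_accuracy)
point_change = (new_accuracy - old_accuracy) * 100

print(f"relative change: {relative_change:.1f}%")
print(f"absolute change: {point_change:.1f} percentage points")
```

Here the relative change is about 10%, while the absolute change is about 8 percentage points.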

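quantiles: cut points that divide sorted data into groups containing equal numbers of observations (example: quartiles split the data into four groups, percentiles into one hundred)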

range: the difference between the largest and smallest value

sample: a subset of a population used to draw conclusions about the whole population (example: a training set is a sample of all the data the model could encounter, used to learn patterns)

sampling bias: when the data sample doesn’t reflect the real population (example: a model trained only on young users might fail on older ones)

standard deviation: a measure of how spread out values are around the mean; it’s the square root of the variance

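standard error: the standard deviation of the sampling distribution of a statistic, such as the sample mean; it tells you how precise your estimate is (example: the standard error of the mean is the sample standard deviation divided by the square root of the sample size, so it shrinks as the sample grows)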
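
For the common case of the sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of the sample size. The sketch below shows that calculation; the function name and data are illustrative.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Made-up measurements
sample = [2, 4, 4, 4, 5, 5, 7, 9]

se = standard_error_of_mean(sample)
print(f"mean = {statistics.mean(sample)}, standard error = {se:.3f}")
```

Collecting more data with the same spread lowers the standard error, because the denominator grows with the square root of the sample size.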

uniform distribution: a probability distribution where all outcomes in a given interval or finite set are equally likely

variance: the average of the squared differences between each value and the mean; measures how spread out a set of numbers is
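
The sketch below computes the variance by hand and checks it against Python's statistics module. It treats the made-up values as a whole population (dividing by n); for a sample, statistics.variance divides by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = sum(values) / len(values)
# Average squared distance from the mean (population variance)
manual_variance = sum((x - mean) ** 2 for x in values) / len(values)

population_variance = statistics.pvariance(values)
population_std = statistics.pstdev(values)  # square root of the variance
sample_variance = statistics.variance(values)  # divides by n - 1

print(manual_variance, population_variance, population_std, sample_variance)
```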
