Machine Learning Foundations: Statistics
Below are important statistics terms for machine learning:
- bias: the difference between an estimator’s expected value and the true value of the parameter it is trying to estimate (example: a model that always predicts the average price for every house, regardless of size or location, has high bias because it’s too simple to capture important patterns)
- correlation: the statistical measure that describes the linear relationship between two variables (example: time spent on a website might be positively correlated with purchases)
- covariance: the degree to which two random variables change together; a positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates an inverse relationship (example: if the number of products viewed and the total purchase amount both increase in user data, the covariance between these variables will be positive)
- dependent variable: the variable your model is trying to predict; its value depends on other variables (example: in predicting house prices, the price is the dependent variable)
- hypothesis testing: a method to decide if a result is meaningful or just random chance (example: A/B testing in model deployment helps decide if version B leads to significantly more clicks than version A)
- independent variable: a variable that helps predict the dependent variable (example: square footage is an independent variable when predicting house prices)
- interquartile range (IQR): the difference between the upper and lower quartile values in the set of data; covers the middle 50% of your data, from 25th to 75th percentile (example: used to detect outliers in income prediction datasets)
- least squares method: a standard approach for estimating the parameters of a model by minimizing the sum of the squared differences between the observed values and the model's predicted values (example: in predicting product prices based on features like size and weight, least squares adjusts the model so that the predicted prices are as close as possible to real prices)
- linear regression model: a statistical model that expresses the relationship between a dependent variable and one or more independent variables as a linear equation (example: in modeling user engagement on a website, linear regression can predict session duration based on features like time of day and number of clicks)
- mean: the sum of the values in a sample divided by the number of values in the sample
- median: the middle value when the numbers are sorted
- mode: the most frequently occurring value
- multinoulli distribution: also called the categorical distribution; a probability distribution over a fixed set of discrete outcomes, where exactly one outcome occurs per trial (example: a softmax output in a classifier that predicts one label, such as dog, cat, or bird)
- outlier: a data point that is very different from others (example: a customer who spends 100× more than others might be noise or might be a VIP)
- percentile: the value below which a certain percentage of data falls (example: the 90th percentile prediction error means 90% of errors are smaller than that value)
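- percentage change: how much a value increases or decreases relative to its starting value, expressed as a percent (example: read as a percentage change, “model accuracy improved by 10%” means the new accuracy is 1.10 times the old one, such as going from 80% to 88%; a rise from 80% to 90% is a gain of 10 percentage points instead)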
- quantiles: cut points dividing the data into equal-sized chunks
- range: the difference between the largest and smallest value
- sample: a smaller subset of data used to make general conclusions (example: a training set is a sample of the whole dataset used to learn patterns)
- sampling bias: when the data sample doesn’t reflect the real population (example: a model trained only on young users might fail on older ones)
- standard deviation: a measure of how spread out values are around the mean; it’s the square root of the variance
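- standard error: the standard deviation of the sampling distribution of a statistic; for the sample mean it is estimated as the sample standard deviation divided by the square root of the sample size, and it tells you how precise your estimate is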
- uniform distribution: a probability distribution where all outcomes in a given interval or finite set are equally likely
- variance: the average of the squared differences between each value and the mean; it measures how spread out a set of numbers is
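Several of the terms above can be computed in a few lines. The following is a minimal sketch using only Python's standard library, not a definitive implementation. The sample values, the helper names (percentage_change, covariance, correlation, least_squares_line), and the 1.5 × IQR outlier rule are assumptions chosen for illustration; none of them come from the glossary itself.

```python
import math
import statistics

# Hypothetical sample: purchase amounts for eight users (made-up numbers).
values = [12.0, 15.0, 15.0, 18.0, 20.0, 22.0, 25.0, 33.0]

# Central tendency and range.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)

# Spread: sample variance (divides by n - 1) and its square root.
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

# Standard error of the mean: sample standard deviation / sqrt(n).
std_error = std_dev / math.sqrt(len(values))

# Quartiles are the quantiles that cut the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumption: flag outliers with the common 1.5 * IQR rule of thumb.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


def percentage_change(old, new):
    """Relative change from old to new, expressed as a percent."""
    return (new - old) / old * 100.0


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


def least_squares_line(xs, ys):
    """Slope and intercept of the line minimizing squared residuals."""
    slope = covariance(xs, ys) / statistics.variance(xs)
    intercept = statistics.mean(ys) - slope * statistics.mean(xs)
    return slope, intercept


if __name__ == "__main__":
    print("mean:", mean, "median:", median, "mode:", mode)
    print("range:", value_range, "IQR:", iqr, "outliers:", outliers)
    print("variance:", variance, "std dev:", std_dev, "std error:", std_error)
```

Note that statistics.variance and statistics.stdev use the sample (n - 1) form; statistics.pvariance and statistics.pstdev give the population form. The least squares helper covers only the one-variable case, where the slope reduces to the covariance of x and y divided by the variance of x.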