Machine Learning Foundations: Statistics



 Below are important statistics terms for machine learning:


  • bias: the difference between an estimator’s expected value and the true value of the parameter it is trying to estimate (example: a model that always predicts the average price for every house, regardless of size or location, has high bias because it’s too simple to capture important patterns)

  • correlation: a statistical measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1 (example: time spent on a website might be positively correlated with purchases)

  • covariance: the degree to which two random variables change together; a positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates an inverse relationship (example: if users who view more products also tend to spend more in total, the covariance between these two variables will be positive)

  • dependent variable: the variable your model is trying to predict; its value depends on other variables (example: in predicting house prices, the price is the dependent variable)

  • hypothesis testing: a method to decide whether an observed result is statistically significant or could plausibly be explained by random chance (example: A/B testing in model deployment helps decide if version B leads to significantly more clicks than version A)

  • independent variable: a variable that helps predict the dependent variable (example: square footage is an independent variable when predicting house prices)

  • interquartile range (IQR): the difference between the upper and lower quartile values in the set of data; covers the middle 50% of your data, from 25th to 75th percentile (example: used to detect outliers in income prediction datasets)

  • least squares method: a standard approach for estimating the parameters of a model by minimizing the sum of the squared differences between the observed values and the model's predicted values (example: in predicting product prices based on features like size and weight, least squares adjusts the model so that the predicted prices are as close as possible to real prices)

  • linear regression model: a statistical model that expresses the relationship between a dependent variable and one or more independent variables as a linear equation (example: in modeling user engagement on a website, linear regression can predict session duration based on features like time of day and number of clicks)

  • mean: the sum of the values in a sample divided by the number of values in the sample
  • median: the middle value when the numbers are sorted; with an even number of values, the average of the two middle values
  • mode: the most frequently occurring value

  • multinoulli distribution: also called the categorical distribution; a probability distribution over a fixed set of discrete outcomes where exactly one outcome happens per trial (example: a softmax output in a classifier that predicts one label out of dog, cat, or bird)

  • outlier: a data point that is very different from others (example: a customer who spends 100× more than others might be noise or might be a VIP)

  • percentile: the value below which a certain percentage of data falls (example: the 90th percentile prediction error means 90% of errors are smaller than that value)

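  • percentage change: how much a value increases or decreases relative to its original value, expressed as a percent (example: “model accuracy improved by 10%” means the new accuracy is 10% higher than the old one in relative terms, so 80% becomes 88%; a rise from 80% to 90% is a change of 10 percentage points, which is a 12.5% percentage change)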

  • quantiles: cut points dividing the sorted data into equal-sized groups (example: quartiles split the data into four groups and percentiles into one hundred)

  • range: the difference between the largest and smallest value

  • sample: a subset of a larger population of data, used to draw general conclusions about that population (example: a training set is a sample of the whole dataset used to learn patterns)

  • sampling bias: when the data sample doesn’t reflect the real population (example: a model trained only on young users might fail on older ones)

  • standard deviation: a measure of how spread out values are around the mean; it’s the square root of the variance

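  • standard error: the standard deviation of the sampling distribution of a statistic, such as the sample mean; it tells how precise your estimate is (example: the standard error of the mean is usually estimated as the sample standard deviation divided by the square root of the sample size)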

  • uniform distribution: a probability distribution where all outcomes in a given interval or finite set are equally likely

  • variance: the average of the squared differences between each value and the mean; tells you how spread out a set of numbers is around the average
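
To make the descriptive terms above concrete (mean, median, mode, range, variance, standard deviation, standard error, quartiles, IQR, and outliers), here is a minimal sketch using only Python's standard library. The `errors` list is made-up illustration data, and flagging outliers with the 1.5 × IQR rule is one common rule of thumb rather than the only choice.

```python
import math
import statistics

# Hypothetical sample: prediction errors from some model (made-up values).
errors = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 40.0]

mean = statistics.mean(errors)            # sum of values / number of values
median = statistics.median(errors)        # middle value of the sorted data
mode = statistics.mode(errors)            # most frequently occurring value
value_range = max(errors) - min(errors)   # largest value minus smallest value

# Sample variance and standard deviation (both divide by n - 1).
variance = statistics.variance(errors)
std_dev = statistics.stdev(errors)        # square root of the variance

# Standard error of the mean: sample standard deviation / sqrt(n).
std_error = std_dev / math.sqrt(len(errors))

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(errors, n=4, method="inclusive")
iqr = q3 - q1                             # spread of the middle 50% of the data

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in errors if x < lower_fence or x > upper_fence]

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"std_dev={std_dev:.3f}, std_error={std_error:.3f}, iqr={iqr}")
print(f"outliers={outliers}")
```

Note how the single large value pulls the mean well above the median, which is one reason the median and IQR are often preferred for skewed data.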

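The relationship terms (covariance, correlation, least squares, and linear regression with one independent variable) can be sketched the same way. The house data and all function names below are hypothetical, chosen only to mirror the house-price examples in the list; this is an illustration under those assumptions, not a production implementation.

```python
import math

# Hypothetical data: square footage (independent variable) and
# price in thousands of dollars (dependent variable).
sqft = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [200.0, 280.0, 370.0, 450.0, 500.0]

def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    """Average product of paired deviations from the means (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_correlation(x, y):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    sx = math.sqrt(sample_covariance(x, x))
    sy = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sx * sy)

def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared residuals."""
    slope = sample_covariance(x, y) / sample_covariance(x, x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

def sum_squared_errors(x, y, slope, intercept):
    return sum((obs - (slope * a + intercept)) ** 2 for a, obs in zip(x, y))

cov = sample_covariance(sqft, price)
corr = pearson_correlation(sqft, price)
slope, intercept = least_squares_line(sqft, price)
sse = sum_squared_errors(sqft, price, slope, intercept)

print(f"covariance={cov:.1f}, correlation={corr:.4f}")
print(f"price ~ {slope:.3f} * sqft + {intercept:.1f}, SSE={sse:.1f}")
```

Nudging the slope or intercept away from the fitted values makes the sum of squared errors larger, which is what "least squares" means in practice.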

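Percentage change is easy to confuse with a change in percentage points, so a short sketch may help. Both function names and the accuracy figures are hypothetical and used only for illustration.

```python
def percentage_change(old, new):
    """Relative change from old to new, expressed as a percent of old."""
    if old == 0:
        raise ValueError("percentage change is undefined when the old value is 0")
    return (new - old) * 100.0 / abs(old)

def percentage_point_change(old_pct, new_pct):
    """Plain difference between two values that are already percentages."""
    return new_pct - old_pct

# Hypothetical accuracies, given in percent.
old_acc, new_acc = 80.0, 88.0

relative = percentage_change(old_acc, new_acc)      # relative improvement
points = percentage_point_change(old_acc, new_acc)  # absolute improvement

print(f"relative change: {relative:.1f}%")
print(f"absolute change: {points:.1f} percentage points")
```

When reporting model metrics, stating which of the two is meant avoids ambiguity.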