# Machine Learning

## Explore the Flashcards:

Machine Learning

An area of artificial intelligence that focuses on developing algorithms that can learn patterns from data without being explicitly programmed.

Supervised Learning

Involves training a model using a dataset where the input comes paired with the correct output.

Unsupervised Learning

Involves training a model without explicit instructions, using data that isn't labeled.

Reinforcement Learning

A type of machine learning where an agent learns to behave in an environment by performing actions and receiving rewards for them.

Population

The entire set of items or individuals of interest in a study. Denoted by N.

Linear Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables. Very useful for direct interpretation, as it takes values in [-1, 1]. Denoted ρxy for a population and rxy for a sample.
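
As a small illustration of the sample coefficient rxy, here is a pure-Python sketch (the function name and example data are invented for demonstration):

```python
import math

def pearson_r(x, y):
    """Sample linear correlation coefficient r_xy."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Numerator: co-movement of x and y around their means.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Denominator: product of the spreads of x and y.
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields r ≈ 1.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```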

Correlation

A statistical measure that describes the extent to which two variables change together. There are several ways to compute it, the most common being the linear correlation coefficient.

Critical Value

A threshold value from a statistical table (z, t, F, etc.) associated with a chosen significance level.

Degrees of Freedom

The number of values in a statistical calculation that are free to vary without violating the data's constraints.

Hypothesis

A testable proposition or assumption about a population parameter.

Null Hypothesis

The default hypothesis in a test. Whenever we conduct a test, we are trying to gather enough evidence to reject the null hypothesis.

Alternative Hypothesis

The hypothesis that contradicts the null hypothesis. It represents the researcher's claim.

Significance Level

The probability of rejecting the null hypothesis when it is true. Denoted α. You choose the significance level; all else equal, the lower the level, the stronger the evidence required to reject the null.

Rejection Region

The part of the distribution for which we would reject the null hypothesis.

P-Value

The smallest significance level at which the null hypothesis can be rejected based on the observed data.

Causation

Causation refers to a causal relationship between two variables: when one variable changes, the other changes as a result. If variable A causes a change in variable B, it is not required that B also causes a change in A.

Regression Analysis

A method to model and analyze the relationships between variables. Usually, it is used for building predictive models.

Linear Regression Model

A model that describes a linear relationship between two or more variables.

Dependent Variable ( ŷ )

The outcome variable being predicted or explained; it 'depends' on the other variables. Usually denoted y.

Independent Variable ( xi )

The variable(s) used to predict or explain variations in the dependent variable. This is the observed data (your sample data). Usually denoted x1, x2, …, xk.

Coefficient ( βi )

A factor that quantifies the relationship between an independent variable and the dependent variable.

Constant ( β0 )

A constant term that does not interact with any independent variable but shifts the dependent variable by a fixed amount.

Epsilon ( ε )

The error of prediction. Difference between the observed value and the (unobservable) true value.

Regression Equation

An equation representing the relationship between variables, with coefficients estimated from data. Think of it as an estimator of the linear regression model.

b0, b1, …, bk

Estimates of the coefficients β0, β1, …, βk.

Regression Line

The best-fitting line through the data points.

Residual ( e )

Difference between the observed value and the estimated value by the regression line. Point estimate of the error ( ε ).

b0

The intercept of the regression line with the y-axis for a simple linear regression.

b1

The slope of the regression line for a simple linear regression.
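
The intercept b0 and slope b1 can be computed directly from the data. A minimal sketch, with a made-up function name and example data generated from y = 1 + 2x:

```python
def ols_simple(x, y):
    """Least-squares estimates for a simple linear regression: y ≈ b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: co-movement of x and y divided by the variability of x.
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    # Intercept: the line passes through the point of means (x̄, ȳ).
    b0 = my - b1 * mx
    return b0, b1

# Noise-free data from y = 1 + 2x is recovered exactly: b0 = 1, b1 = 2.
b0, b1 = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])
```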

ANOVA

Abbreviation of 'analysis of variance'. A statistical framework that compares means by analyzing variability.

SST

Sum of squares total. SST is the sum of the squared differences between the observed dependent variable and its mean.

SSR

Sum of squares regression. SSR is the sum of the squared differences between the predicted values and the mean of the dependent variable. This is the variability explained by the regression model.

SSE

Sum of squares error. SSE is the sum of the squared differences between the observed and predicted values. This is the variability that is NOT explained by the model.

R-Squared ( R2 )

A measure ranging from 0 to 1 that shows how much of the total variability of the dataset is explained by the regression model.
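
The decomposition SST = SSR + SSE and the ratio R² = SSR / SST can be verified numerically. A sketch with invented example data (the fit itself uses the standard simple-OLS formulas):

```python
def anova_decomposition(x, y):
    """Fit simple OLS, then split total variability into explained + unexplained."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    y_hat = [b0 + b1 * a for a in x]
    sst = sum((b - my) ** 2 for b in y)            # total variability
    ssr = sum((h - my) ** 2 for h in y_hat)        # explained by the model
    sse = sum((b - h) ** 2 for b, h in zip(y, y_hat))  # left unexplained
    return sst, ssr, sse

sst, ssr, sse = anova_decomposition([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
r_squared = ssr / sst  # share of total variability explained by the regression
```

For any OLS fit with an intercept, SST equals SSR + SSE, so R² always lands between 0 and 1.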

OLS

An abbreviation of 'ordinary least squares'. A method to estimate the coefficients of a regression model by minimizing the sum of squared residuals.

Regression Tables

Tables summarizing the results of a regression analysis.

Multivariate Linear Regression

Also known as multiple linear regression. A regression model with multiple independent variables.

Adjusted R-Squared

A version of R-squared adjusted for the number of predictors in the model. It penalizes the excessive use of independent variables.

F-Statistic

A statistic used to test the overall significance of a model. The F-statistic is connected with the F-distribution in the same way the z-statistic is related to the Normal distribution.

F-Test

A test for the overall significance of the model.

Assumptions

Preconditions required for the validity of statistical techniques, like linear regression.

Linearity

The assumption that the relationship between variables is linear.

Homoscedasticity

The assumption that the variance of residuals is constant across all levels of the independent variables.

Endogeneity

In statistics, a situation where an independent variable is correlated with the error term.

Autocorrelation

The correlation of a variable with itself over successive time intervals.

Multicollinearity

A situation where two or more independent variables are highly correlated, making it difficult to isolate the effect of individual predictors.

Omitted Variable Bias

Bias introduced when a relevant variable is left out of a regression model.

Heteroscedasticity

The presence of non-constant variance in the residuals of a regression model.

Log Transformation

Applying the logarithm function to a variable to linearize relationships or stabilize variances.

Semi-Log Model

A regression model where either the dependent or independent variable is logarithmically transformed.

Log-Log Model

A regression model where both the dependent and independent variables are logarithmically transformed.

Serial Correlation

Another term for autocorrelation.

Cross-Sectional Data

Data collected at a single point in time.

Time Series Data

Data collected at regular intervals over time (e.g. stock prices).

Day of the Week Effect

A well-known phenomenon in finance: returns tend to be disproportionately high on Fridays and low on Mondays.

Durbin-Watson Test

A test for detecting autocorrelation (a violation of the fourth OLS assumption).
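
The statistic itself is simple to compute from the residuals. A sketch (function name and residuals are made up): values near 2 suggest no autocorrelation, near 0 positive autocorrelation, near 4 negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals show strong negative autocorrelation (DW well above 2).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```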

Total Variability = ? + ?

Total Variability = Explained variability + Unexplained variability.

Clustering

A technique used to group similar data points together based on certain features, without having predefined categories.

Classification

An algorithmic approach to determining which category an input belongs to out of a set of categories.

Decision Tree

A decision support tool that uses a tree-like model of decisions and their potential consequences.

Categorical Data

Data that represents categories or labels without inherent numerical value.

Dummy Variables

Also known as indicator variables. They're used to represent categorical data as a series of binary values to include in statistical models.
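
A minimal pure-Python sketch of dummy encoding (names and data invented for illustration):

```python
def make_dummies(values):
    """One-hot encode a categorical column: one 0/1 indicator per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

cols, rows = make_dummies(["red", "blue", "red"])
# cols is ['blue', 'red']; each row has a 1 in its category's column.
```

In a regression with an intercept, one category is usually dropped to avoid perfect multicollinearity (the 'dummy variable trap').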

Overfitting Model

When a model captures noise in the data and is too complex. It will perform exceptionally well on training data but poorly on unseen data.

Underfitting Model

When a model is too simple to capture the underlying trends in the data, resulting in poor performance on both the training and testing sets.

Training Dataset

The set of data used to train a machine learning model.

Testing Dataset

After training, this dataset is used to evaluate how well a model performs on data it hasn't seen before.

Logistic Regression Model

A statistical method for predicting binary outcomes. It's used when the dependent variable is categorical and binary.

Logit Regression Model

Another term for logistic regression. It models the log odds of the probability of the event occurring.
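
The link between log-odds and probability can be sketched in a few lines (coefficients here are arbitrary, chosen only for illustration):

```python
import math

def predicted_probability(b0, b1, x):
    """Logistic regression: log-odds are linear in x; the probability is their sigmoid."""
    log_odds = b0 + b1 * x
    return 1 / (1 + math.exp(-log_odds))

p = predicted_probability(-1.0, 0.5, 4.0)  # log-odds = -1 + 0.5*4 = 1
# Inverting the sigmoid, log(p / (1 - p)) recovers the linear log-odds.
```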

MLE Method

A method to estimate the parameters of a model. It chooses the parameter values that maximize the likelihood of the observed data given the model.

Likelihood Function

A function which estimates how likely it is that the model at hand describes the real underlying relationship of the variables.

LL-Null

Log likelihood-null – the log-likelihood of a model with no independent variables.

LLR P-Value

Log likelihood ratio p-value – measures whether our model is statistically different from the LL-null, i.e. a model with no predictors.

Confusion Matrix

A table used in classification problems where the accuracy of a model's predictions is summarized. Typically a 2x2 matrix for binary classification problems.

True Positives (TP)

Correctly predicted positive observations.

False Positives (FP)

Instances falsely predicted as positive (Type I error).

True Negatives (TN)

Correctly predicted negative observations.

False Negatives (FN)

Instances falsely predicted as negative (Type II error).

Pseudo R-squared

A counterpart to the R-squared from linear regression, used in logistic regression to measure how well the model explains the variation in the dependent variable.

AIC

A measure used to compare models that rewards goodness of fit while penalizing model complexity. A lower AIC indicates a better model.

BIC

Similar to AIC, but has a higher penalty for models with more parameters.
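
Both criteria combine the log-likelihood with a complexity penalty; a sketch of the standard formulas (function names are made up):

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2*ln(L), where k = number of parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: the penalty grows with sample size n."""
    return k * math.log(n) - 2 * log_likelihood

# Once n exceeds e^2 (about 7.4 observations), BIC penalizes
# each extra parameter more heavily than AIC does.
```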

McFadden’s R-squared

A measure used in logistic regression to indicate the goodness of fit of the model compared to a model with no predictors.

Accuracy of Model

The ratio of correctly predicted observations to the total number of observations.
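
Accuracy follows directly from the confusion-matrix counts. A sketch with invented binary labels (1 = positive):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct predictions / all predictions
```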

Cluster Analysis

A multivariate statistical technique that groups observations on the basis of some of the features or variables they are described by.

Euclidean Distance

The 'ordinary' distance between two points in space, calculated using the Pythagorean theorem.
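
A short sketch of the formula, generalized to any number of dimensions (names and points are illustrative):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((0, 0), (3, 4))  # a 3-4-5 right triangle, so d = 5.0
```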

K-Means Clustering

An iterative algorithm that tries to partition the dataset into K pre-defined distinct non-overlapping subgroups or clusters.

WCSS (Within-Cluster Sum of Squares)

WCSS is a measure used in clustering algorithms that represents the total squared distance between each point and the centroid of the cluster it belongs to.
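
A sketch of the computation, assuming clusters are already assigned (the function name and the two example clusters are made up):

```python
def wcss(clusters):
    """Within-cluster sum of squares: squared distance of each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        n = len(points)
        # Centroid: coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / n for d in range(len(points[0]))]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(len(centroid)))
            for p in points
        )
    return total

# Two tight clusters: every point sits 1 unit from its centroid, so WCSS = 4.
value = wcss([[(0, 0), (2, 0)], [(10, 0), (12, 0)]])
```

Adding more clusters can only lower WCSS, which is why the elbow method looks for the point where further clusters stop paying off.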

Elbow Method

A method in K-means clustering to identify the optimal number of clusters by locating the "elbow" point in a plot of WCSS against cluster count. This point reflects the most effective balance between precision and computation.

Standardized Variable

A variable which has been standardized using the z-score formula - by first subtracting the mean and then dividing by the standard deviation.
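
A quick sketch of z-score standardization using the standard library (example data is arbitrary):

```python
import statistics

def standardize(data):
    """z-scores: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return [(x - mean) / std for x in data]

z = standardize([2, 4, 6])
# The standardized variable has mean 0 and standard deviation 1,
# which puts all features on a common scale before clustering.
```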

Flat Clustering

A method where the number of clusters is defined in advance, and the dataset is partitioned into the specified number of clusters.

Hierarchical Clustering

Involves building a nested hierarchy of clusters, ordered from top to bottom.

Divisive (Top-Down) Clustering

Begins with all data points in a single cluster and recursively divides into smaller clusters.

Agglomerative (Bottom-Up) Clustering

Starts with each data point as an individual cluster and merges them into larger clusters based on similarity.

Dendrogram

A tree-like diagram that records the sequences of merges or splits in hierarchical clustering.