## Explore the Flashcards:

Population

The entire set of items or individuals of interest in a study. Its size is denoted by N.

Sample

A subset selected from the larger population. Its size is denoted by n.

Parameter

A numerical value that describes a characteristic of the entire population. Its sample counterpart is a statistic.

Statistic

A numerical value that describes a characteristic of a sample and is used to estimate a population parameter. Its population counterpart is a parameter.

Random Sample

A sample in which every member of the population has an equal chance of being selected.
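
As a minimal sketch of the terms so far, the snippet below draws a simple random sample from a made-up population and compares a parameter with the statistic that estimates it. The population of ages, the seed, and the sizes N = 1000 and n = 50 are assumptions chosen only for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of N = 1000 individuals.
population = [random.randint(18, 80) for _ in range(1000)]
N = len(population)

# Simple random sample without replacement:
# every member of the population is equally likely to be selected.
n = 50
sample = random.sample(population, n)

parameter = mean(population)  # population mean (a parameter)
statistic = mean(sample)      # sample mean (a statistic estimating the parameter)

print(f"N = {N}, n = {n}")
print(f"population mean = {parameter:.2f}, sample mean = {statistic:.2f}")
```

The two means will usually be close but not identical; that gap is what the later cards on estimators and standard error describe.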

Representative Sample

A sample that accurately mirrors the characteristics of the larger population.

Variable

A characteristic or attribute that can take on different values or categories. Examples include height, occupation, and age.

Type of Data

The classification of data based on its nature. There are two types of data - categorical and numerical.

Categorical Data

Data that represents categories or labels without inherent numerical value.

Numerical Data

Data that represents quantifiable amounts or values. Can be further classified into discrete and continuous.

Discrete Data

Numerical data that can only take on specific, distinct values. Opposite of continuous.

Continuous Data

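Numerical data that can take any value within a range, so its possible values are infinite and cannot be counted, only measured. Examples include height and weight. Opposite of discrete.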

Levels of Measurement

A way to classify data by how its values can be compared. The levels of measurement fall into two groups - qualitative and quantitative.

Qualitative Data

A subgroup of levels of measurement. There are two types of qualitative data - nominal and ordinal.

Quantitative Data

A subgroup of levels of measurement. There are two types of quantitative data - ratio and interval.

Nominal Level of Measurement

Nominal level of measurement refers to variables that describe different categories or names. These categories cannot be put in any specific order.

Ordinal Level of Measurement

Ordinal level of measurement refers to variables that describe different categories, and they can be ordered.

Ratio Level of Measurement

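Ratio level of measurement refers to numerical variables that have a unique and unambiguous (true) zero point, whether the values are whole numbers or fractions. Because zero means "none of the quantity", ratios of values are meaningful. For example, the temperature in Kelvin is a ratio variable.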

Interval Level of Measurement

Interval level of measurement refers to numerical variables for which differences between values are meaningful, but there isn't a unique and unambiguous zero point, so ratios are not meaningful. For example, degrees in Celsius and Fahrenheit are interval variables.

Frequency Distribution Table

A table showing the frequency of each value or category of a variable.

Frequency

The number of times a particular value or category occurs in a dataset.

Absolute Frequency

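Measures the **number** of occurrences of a value or category of a variable.

Relative Frequency

Measures the **relative number** of occurrences of a value or category: its absolute frequency divided by the total number of observations. Usually expressed as a percentage.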

Cumulative Frequency

The sum of the relative frequencies of all members in a dataset up to a certain point. The cumulative frequency of all members is 100% or 1.

Pareto Diagram

A type of bar chart where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative frequency.
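
The frequency measures above, together with the descending order a Pareto diagram uses, can be sketched as follows. The list of car brands is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical categorical data: car brands sold in one week.
sales = ["Audi", "BMW", "Audi", "Mercedes", "BMW", "Audi", "BMW", "Audi"]

absolute = Counter(sales)        # absolute frequency of each category
total = sum(absolute.values())   # total number of observations

# Rows are sorted by descending absolute frequency, as in a Pareto diagram.
table = []
running = 0.0
for category, count in absolute.most_common():
    relative = count / total     # relative frequency
    running += relative          # cumulative (relative) frequency
    table.append((category, count, relative, running))

for category, count, relative, cumulative in table:
    print(f"{category:<10} {count:>3} {relative:>7.1%} {cumulative:>7.1%}")
```

The last row's cumulative frequency is 100%, as the Cumulative Frequency card states.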

Histogram

A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation to the last. The intervals (bins) are adjacent - where one stops, the other starts.

Cross Table (Contingency Table)

A table in a matrix format that displays the joint frequency distribution of two or more categorical variables.

Bins (Histogram)

The intervals that are represented in a histogram.
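
One possible way to build equal-width, adjacent, non-overlapping bins is sketched below. The data, the choice of three bins, and the convention that the last bin includes its right edge are all assumptions made for the example; plotting libraries may use different defaults.

```python
# Hypothetical numerical observations.
data = [1, 2, 2, 3, 5, 6, 7, 9, 10, 10]
n_bins = 3

lo, hi = min(data), max(data)
width = (hi - lo) / n_bins  # equal-width bins from the first to the last observation

counts = [0] * n_bins
for x in data:
    # Bins are [lo, lo + width), [lo + width, lo + 2 * width), ...
    # The maximum value is placed in the last bin (closed on the right).
    index = min(int((x - lo) / width), n_bins - 1)
    counts[index] += 1

edges = [lo + i * width for i in range(n_bins + 1)]
for i, count in enumerate(counts):
    print(f"[{edges[i]:.1f}, {edges[i + 1]:.1f}): {count}")
```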

Scatter Plot

A plot that represents two numerical variables, one on each axis. Graphically, each observation looks like a point on the scatter plot.

Measures of Central Tendency

Measures of central tendency are statistical values that represent the center or typical value of a dataset. The most common are the mean, median and mode.

Mean

The arithmetic average of all data points in a dataset.

Median

The middle number in a dataset sorted in ascending or descending order. When the number of observations is even, it is the average of the two middle numbers.

Mode

The value that occurs most frequently in the dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode at all.
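
The three measures can be computed with Python's standard `statistics` module. This is only a small sketch on made-up data.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 10]  # hypothetical observations

data_mean = mean(data)        # (2 + 3 + 3 + 5 + 7 + 10) / 6
data_median = median(data)    # even count, so the two middle values (3 and 5) are averaged
data_modes = multimode(data)  # list of all most frequent values

print(data_mean, data_median, data_modes)
```

Note that `multimode` returns every value when all values are equally frequent, a case the card above describes as having no mode.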

Skewness

A measure which indicates whether the observations in a dataset are concentrated on one side.

Sample Formula

A formula that is calculated on a sample. The value obtained is a statistic.

Population Formula

A formula that is calculated on a population. The value obtained is a parameter.

Measures of Variability

Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation.

Variance

Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted \(σ^2\) for a population and \(s^2\) for a sample.

Standard Deviation

Measures the dispersion of the dataset around its mean. It is measured in original units. Denoted σ for a population and s for a sample.

Coefficient of Variation

Measures the dispersion of the dataset around its mean. The coefficient of variation is unitless. Therefore, it is useful when comparing the dispersion across different datasets that have different units of measurement.
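
A minimal sketch of the three measures of variability, assuming the usual conventions: the population variance divides by N, the sample variance divides by n - 1, and the coefficient of variation is the standard deviation divided by the mean. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance, variance

data = [1, 2, 3, 4, 5]  # hypothetical observations
m = mean(data)
squared_deviations = [(x - m) ** 2 for x in data]

population_variance = sum(squared_deviations) / len(data)    # divide by N
sample_variance = sum(squared_deviations) / (len(data) - 1)  # divide by n - 1

sample_std = sqrt(sample_variance)         # back in the original units
coefficient_of_variation = sample_std / m  # unitless ratio

# Cross-check the hand-written formulas against the standard library.
assert population_variance == pvariance(data)
assert sample_variance == variance(data)

print(population_variance, sample_variance)
print(f"s = {sample_std:.3f}, CV = {coefficient_of_variation:.3f}")
```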

Univariate Measure

A measure which refers to a single variable, such as its mean or variance.

Multivariate Measure

A measure which refers to multiple variables.

Covariance

A statistical measure that quantifies the degree to which two random variables in a dataset change together. Usually, because of its scale of measurement, covariance is not directly interpretable.

Linear Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables. Very useful for direct interpretation, as it takes on values in [-1, 1]. Denoted \(\rho_{xy}\) for a population and \(r_{xy}\) for a sample.

Correlation

A statistical measure that describes the extent to which two variables change together. There are several ways to compute it, the most common being the linear correlation coefficient.
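
The relationship between covariance and the linear correlation coefficient can be sketched with the sample formulas, assuming division by n - 1 throughout. The paired values `x` and `y` are made up for illustration.

```python
from math import sqrt

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of each variable.
s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

# Linear correlation coefficient: covariance rescaled to lie in [-1, 1].
r_xy = cov_xy / (s_x * s_y)

print(f"cov = {cov_xy:.3f}, r = {r_xy:.3f}")
```

The covariance depends on the units of both variables, while the correlation coefficient does not, which is why the latter is easier to interpret.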

Distribution

A function that shows the possible values for a variable and the probability of their occurrence.

Bell Curve

A common name for the normal distribution.

Normal Distribution

A continuous, symmetric probability distribution that is completely described by its mean and its variance. Also known as the Gaussian distribution or bell curve.

Gaussian Distribution

The original name of the normal distribution. Named after the famous mathematician Gauss, who was the first to explore it through his work on the Gaussian function.

Standard Normal Distribution

A normal distribution with a mean of 0 and a standard deviation of 1.

z-statistic

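A statistic that is associated with the standard normal distribution, in the same way the t-statistic is associated with the Student's t-distribution. It is obtained by standardizing a value: subtracting the mean and dividing by the standard deviation (or the standard error).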

Standardized Variable

A variable which has been standardized using the z-score formula - by first subtracting the mean and then dividing by the standard deviation.
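
Standardization can be sketched in a few lines. The data are hypothetical, and the population standard deviation (`pstdev`) is an assumption of this example; with sample data one would typically use the sample standard deviation instead.

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mu = mean(data)       # 5
sigma = pstdev(data)  # population standard deviation, 2.0 for this data

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in data]

print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

After standardization the variable has a mean of 0 and a standard deviation of 1, regardless of the original units.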

What does the Central Limit Theorem state?

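The sampling distribution of the sample mean will approximate a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided its variance is finite). In general, a sample of at least 30 is often considered sufficient for the theorem to hold.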

Sampling Distribution

The probability distribution of a given statistic (like the mean or variance) based on all possible samples of a fixed size from a population.

Standard Error

The standard deviation of the sampling distribution, which reflects the variability of sample means. For the sample mean it equals \(\sigma/\sqrt{n}\), so it accounts for the sample size: larger samples have smaller standard errors.
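
The Central Limit Theorem, the sampling distribution, and the standard error can be explored with a small simulation. This is only a sketch under stated assumptions: the population is taken to be exponential with rate 1 (mean 1, standard deviation 1), and the seed, the sample size of 30, and the 5000 repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

n = 30              # size of each sample
num_samples = 5000  # number of samples drawn

# Each entry is the mean of one sample from a clearly non-normal population.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

theoretical_se = 1.0 / sqrt(n)       # sigma / sqrt(n), with sigma = 1
empirical_se = stdev(sample_means)   # spread of the simulated sampling distribution

print(f"mean of sample means = {mean(sample_means):.3f}")
print(f"theoretical SE = {theoretical_se:.3f}, empirical SE = {empirical_se:.3f}")
```

The sample means cluster around the population mean, and their spread is close to the theoretical standard error even though the population itself is skewed.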

Estimator

A function or rule of the sample data, according to which we make estimations of a population parameter.

Estimate

The particular value that was estimated through an estimator.

Bias

The difference between an estimator's expected value and the true population parameter. An unbiased estimator has an expected value equal to the population parameter.
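
Bias can be illustrated by comparing two estimators of the variance: one that divides by n and one that divides by n - 1. The simulation below is a sketch, assuming a standard normal population (true variance 1), samples of size 5, and an arbitrary seed and number of trials.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n = 5           # size of each sample
trials = 20000  # number of simulated samples
biased_total = 0.0
unbiased_total = 0.0

for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1
    m = sum(sample) / n
    sum_sq = sum((x - m) ** 2 for x in sample)
    biased_total += sum_sq / n          # divides by n
    unbiased_total += sum_sq / (n - 1)  # divides by n - 1

avg_biased = biased_total / trials      # expected value is (n - 1) / n = 0.8
avg_unbiased = unbiased_total / trials  # expected value is 1.0

print(f"divide by n:     {avg_biased:.3f}")
print(f"divide by n - 1: {avg_unbiased:.3f}")
```

On average, dividing by n underestimates the true variance, while dividing by n - 1 does not, which is why the sample variance formula uses n - 1.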

Efficiency (in Estimators)

Refers to an estimator's variability. An efficient estimator has minimal variability compared to others.

Point Estimator

A function or a rule, according to which we make estimations that will result in a single number.

Point Estimate

The specific numerical value obtained from a point estimator.

Interval Estimator

A function or a rule, according to which we make estimations that will result in an interval. In this course, we will only consider confidence intervals. Another kind of interval, which we don't discuss, is the credible interval (Bayesian statistics).

Interval Estimate

The specific interval obtained from an interval estimator.

Confidence Interval

A confidence interval is the range within which you expect the population parameter to be. The probability that it contains the parameter is the level of confidence, 1 - α, where α is the significance level.

Reliability Factor

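The critical value (from the z-table or t-table) that corresponds to the chosen level of confidence. It is multiplied by the standard error to obtain the margin of error of a confidence interval.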

Level of Confidence

The probability that the population parameter lies within a given confidence interval. Denoted 1 - α. Example: 95% confidence level means that in 95% of the cases, the population parameter will fall into the specified interval.

Critical Value

A threshold value from a statistical table (z, t, F, etc.) associated with a chosen significance level.

z-table

A table showing values of the Z-statistic for various probabilities under the standard normal distribution.

t-statistic

A statistic that is generally associated with the Student's t-distribution, in the same way the z-statistic is associated with the normal distribution.

t-table

A table showing t-statistic values for given probabilities and degrees of freedom.

Degrees of Freedom

The number of values in a statistical calculation that are free to vary without violating the data's constraints.

Margin of Error

The half-width of a confidence interval: the reliability factor multiplied by the standard error. The confidence interval is the point estimate plus or minus the margin of error, so it quantifies the uncertainty associated with a sample estimate at a specific confidence level.
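
A confidence interval for a mean can be sketched as follows, using the common convention that the margin of error is the reliability factor times the standard error. The example assumes a known population standard deviation, so the reliability factor comes from the standard normal distribution; the sample mean, σ, n, and the 95% level are hypothetical inputs.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs.
sample_mean = 100.0
sigma = 15.0       # population standard deviation, assumed known
n = 36
confidence = 0.95  # level of confidence, 1 - alpha

alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)  # reliability factor (critical value)
standard_error = sigma / sqrt(n)
margin_of_error = z * standard_error

lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"z = {z:.3f}, margin of error = {margin_of_error:.2f}")
print(f"{confidence:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")
```

When the population standard deviation is unknown, the reliability factor would instead come from the t-table with the appropriate degrees of freedom.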

Hypothesis

A testable proposition or assumption about a population parameter.

Hypothesis Test

A test that is conducted in order to verify if a hypothesis is true or false.

Null Hypothesis

A default hypothesis for testing. Whenever we are conducting a test, we are trying to reject the null hypothesis.

Alternative Hypothesis

The hypothesis that contradicts the null hypothesis. It represents the researcher's claim.

To Accept a Hypothesis

The statistical evidence shows that the hypothesis is likely to be true.

To Reject a Hypothesis

The statistical evidence shows that the hypothesis is likely to be false.

One-Tailed (One-Sided) Test

A test that examines if a parameter is greater than or less than a specified value. In a one-tailed test, the alternative hypothesis focuses on a difference in one specific direction (higher than or lower than).

Two-Tailed (Two-Sided) Test

A test that examines if a parameter is different from (or equal to) a specified value. A two-tailed test considers the possibility of a difference in either direction from the null hypothesis.

Significance Level

The probability of rejecting the null hypothesis when it's true. Denoted α. You choose the significance level. A lower level reduces the risk of a Type I error but, all else equal, makes it harder to reject a false null hypothesis.

Rejection Region

The part of the distribution, for which we would reject the null hypothesis.

Type I Error (False Positive)

Rejecting a null hypothesis that is true. The probability of committing it is α, the significance level.

Type II Error (False Negative)

Accepting a null hypothesis that is false. The probability of committing it is β.

Power of the Test

The probability of correctly rejecting a false null hypothesis (the researcher's goal). Denoted by 1 - β.
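
The link between α, β, and power can be sketched for a one-sided z-test. Every number here is hypothetical, including the "true" mean, which must be assumed in order to compute β at all.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical one-sided z-test: H0: mu = 100 versus H1: mu > 100.
mu_0 = 100.0
mu_true = 105.0  # assumed true mean, needed to compute beta and power
sigma = 15.0     # population standard deviation, assumed known
n = 36
alpha = 0.05     # significance level, the probability of a Type I error

se = sigma / sqrt(n)

# The rejection region starts here: reject H0 when the sample mean exceeds it.
threshold = mu_0 + NormalDist().inv_cdf(1 - alpha) * se

# Type II error: the sample mean stays below the threshold
# even though the true mean is mu_true.
beta = NormalDist(mu_true, se).cdf(threshold)
power = 1 - beta

print(f"threshold = {threshold:.2f}, beta = {beta:.3f}, power = {power:.3f}")
```

Lowering α pushes the threshold further from `mu_0`, which increases β and reduces power when everything else is held fixed.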

z-score

A value indicating how many standard deviations an element is from the mean.

p-value

The smallest significance level at which the null hypothesis can be rejected based on the observed data.
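
A two-tailed z-test ties together the z-score, the significance level, and the p-value. The sketch below assumes a known population standard deviation; all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical two-tailed z-test: H0: mu = 100 versus H1: mu != 100.
mu_0 = 100.0
sample_mean = 103.0
sigma = 15.0  # population standard deviation, assumed known
n = 100
alpha = 0.05  # significance level

standard_error = sigma / sqrt(n)
z = (sample_mean - mu_0) / standard_error  # standardized test statistic

# Two-tailed p-value: probability of a result at least this extreme in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Here the p-value is the smallest significance level at which this sample would lead to rejecting the null hypothesis, so any α above it results in rejection.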