Data Literacy
The ability to read, understand, create, and communicate data as information.
Data Literate Person
A data-literate person can articulate a problem that can potentially be solved using data.
Confirmation bias
The tendency to search for, interpret, favor, and recall information that supports your views while dismissing non-supportive data.
Data-driven Decision Making
The process of making organizational decisions based on actual data rather than intuition or observation alone.
This approach aims to make the decision-making process more objective and fact-based.
Automated Manufacturing
A method of manufacturing that uses automation to control production processes and equipment with minimal human intervention.
It enhances efficiency, consistency, and quality in the production process.
Data
Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.
Datum
A single value of a single variable.
Quantitative Data
Information collected in numerical form.
Discrete Data
A type of quantitative data that can only take certain values (counts).
Continuous Data
A type of quantitative data that can take any value.
Interval Data
A type of quantitative data measured along a scale with no absolute zero.
Ratio Data
A type of quantitative data measured along a scale with equal intervals and an absolute zero, so that ratios between values are meaningful.
Qualitative Data
Information collected in non-numerical (descriptive) form.
Nominal Data
A type of qualitative data that don't have a natural order or ranking.
Ordinal Data
A type of qualitative data that have ordered categories.
Dichotomous Data
A type of qualitative data with binary categories. Dichotomous Data can only take two values.
Structured Data
Data that can conform to a predefined data model.
Semi-structured Data
Data that do not conform to a tabular data model, yet they have some structure.
Unstructured Data
Data that are not organized in a predefined manner.
Metadata
Data about the data.
Data at Rest
Data that are stored physically on computer data storage media such as cloud servers, laptops, hard drives, USB sticks, etc.
Data in Motion
Data that are flowing through a network of two or more systems.
Transactional Data
Data recorded from transactions.
Master Data
Data that describe core entities around which a business is conducted.
Although master data pertain to transactions, they are not transactional in nature.
Big Data
Data that are too large or too complex to be handled by traditional data-processing techniques and software.
Usually described by their volume, variety, velocity, and veracity.
Volume
A characteristic of big data referring to the vast amount of data generated every second.
RAM (Random Access Memory)
A type of computer memory that can be accessed randomly. Determines how much memory the operating system and open applications can use.
RAM is volatile and temporary.
Variety
The multiplicity of types, formats, and sources of data available.
ERP (Enterprise Resource Planning)
A type of software that organizations use to manage and integrate the important parts of their businesses.
An ERP software system can integrate planning, purchasing, inventory, sales, marketing, finance, human resources, and more.
SCM (Supply Chain Management)
The management of the flow of goods and services, involving the movement and storage of raw materials, work-in-process inventory,
and of finished goods from the point of origin to the point of consumption.
CRM (Customer Relationship Management)
A technology for managing all your company's relationships and interactions with customers and potential customers.
It helps improve business relationships to grow your business.
Velocity
The frequency of incoming data.
Veracity
The accuracy or truthfulness of a data set.
Data Security Management
The process of implementing controls and safeguards to protect data from unauthorized access, use, disclosure, disruption, modification,
or destruction to ensure confidentiality, integrity, and availability.
Master Data Management
A method of managing the organization's critical data. It provides a unified master data service that provides accurate, consistent,
and complete master data across the enterprise and to business partners.
Data Quality Management
The process of ensuring the accuracy, completeness, reliability, and relevance of data. It involves the establishment of processes, roles, policies,
and metrics to ensure the high quality of data.
Metadata Management
The administration of data that describes other data.
It provides information about a data item's content, context, and structure, essential for resource management and discovery.
Database
A systematic collection of data.
Data Model
Shows the interrelationships and data flow between different tables.
Relational Databases
Databases that use a relational model. They are very efficient and flexible when accessing structured information.
SQL (Structured Query Language)
A standard programming language used in managing and manipulating relational databases.
It allows users to query, update, insert, and modify data, as well as manage database structures.
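As a quick illustration, here is a minimal sketch of these basic SQL operations using Python's built-in sqlite3 module; the customers table and its columns are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Manage database structures: create a table
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert data
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Berlin"), ("Bob", "Madrid")],
)

# Query data
for row in cur.execute("SELECT name, city FROM customers WHERE city = ?", ("Berlin",)):
    print(row)  # ('Alice', 'Berlin')

# Update data
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Paris", "Bob"))
conn.commit()
conn.close()
```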
Data Warehouse
A single centralized repository that integrates data from various sources and is designed to facilitate analytical reporting for workers throughout the enterprise.
Data Mart
A subset of a data warehouse, focusing on a single functional area, line of business,
or department of an organization.
ETL
Extract, Transform, Load: the process of extracting data from source systems, transforming them into a usable format, and loading them into a target system such as a data warehouse.
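A minimal ETL sketch in Python, assuming a hypothetical CSV source and an in-memory SQLite target:

```python
import csv, io, sqlite3

# Extract: read raw records from a source (a hypothetical CSV string here)
raw = "order_id,amount\n1,19.90\n2,45.50\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast strings to proper types
transformed = [(int(r["order_id"]), float(r["amount"])) for r in rows]

# Load: write the cleaned records into a target store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
print(conn.execute("SELECT ROUND(SUM(amount), 2) FROM orders").fetchone())  # (65.4,)
```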
Hadoop
A set of software utilities designed to enable the use of a network of computers to solve big data problems.
Hadoop Distributed File System (HDFS)
A distributed file system designed to run on commodity hardware. It provides high throughput access to application data and is suitable for applications with large data sets.
Large Hadron Collider (LHC)
The world's largest and most powerful particle collider, located at CERN. It's used by physicists to study the smallest known particles and to probe the fundamental forces of nature.
Data Lake
A single repository of data stored in their native format.
They constitute a comparatively cost-effective solution for storing large amounts of data.
Cloud Systems
Cloud systems store data and applications on remote servers, enabling users to access them from any internet-connected device without installation.
Cyber Defense Center (CDC)
A centralized unit that deals with advanced security threats and manages the organization's cyber defense technology, processes, and people.
Batch
A block of data points collected within a given period (like a day, a week, or a month).
Batch Processing
Data are processed in large volumes all at once.
Stream Processing
Data are processed continuously as they arrive, supporting non-stop data flows and enabling near-instantaneous reactions.
Fog/Edge Computing
The computing structure is not located in the cloud, but rather at the devices that produce and act on data.
Graph Database
Stores connections alongside the data in the model.
Graphs
The network of connections that are stored, processed, modified, and queried.
Notably used for social networks, as they allow for much faster, higher-performing queries than traditional databases.
Edges
Describe the relationships between entities.
Recommendation Engines
Systems that predict and present items or content that a user may prefer based on their past behaviour, preferences, or interaction with the system.
Fraud Detection
The process of identifying and preventing fraudulent activities.
It involves the use of analytics, machine learning, and other techniques to detect and prevent unauthorized financial activities or transactions.
Analysis
A detailed examination of anything complex in order to understand its nature or
to determine its essential features.
Data Analysis
The in-depth study of all the components in a given data set.
Data Analytics
The examination, collection, organization, and storage of data.
Data Analytics implies a more systematic and scientific approach to working with data.
Data Acquisition
The process of collecting, measuring, and analyzing real-world physical or electrical data from various sources.
Often involves instrumentation and control systems, sensors, and data acquisition software.
Data Filtering
The technique of refining data sets by applying criteria or algorithms to remove unwanted or irrelevant data.
This process helps in enhancing the quality and accuracy of data for analysis.
Data Extraction
The process of retrieving data from various sources, which may include databases, websites, or other repositories.
The extracted data can then be processed, transformed, and stored for further use.
Data Aggregation
The process of gathering and summarizing information from different sources, typically for the purposes of analysis or reporting.
It often involves combining numerical data or categorical information to provide a more comprehensive view.
Data Validation
The practice of checking data against a set of rules or constraints to ensure it is accurate, complete, and reliable.
It is a critical step in data processing and analysis to ensure the integrity of the data.
Data Cleansing
The process of detecting, correcting, or removing corrupt or inaccurate records from a dataset, table, or database.
Data Visualization
The graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
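A minimal line-chart sketch with matplotlib; the monthly sales figures are hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales data
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 150, 162]

plt.plot(months, sales, marker="o")  # a line chart (run chart) over time
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```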
Statistics
The science of collecting, analyzing, presenting, and interpreting data.
It involves applying various mathematical and computational techniques to help make decisions in the presence of uncertainty.
Population
The entire pool from which a statistical sample is drawn.
Sample
A subset of a population that is statistically analyzed to estimate the characteristics of the entire population.
Sample size
The count of individual observations in a sample.
Descriptive Statistics
Summarizing or describing a collection of data.
Measures of Central Tendency
Focus on the average (middle) values of the data set.
Mean
The average of a set of numbers. It is a non-robust measure of central tendency.
The mean is very susceptible to outliers.
What is the mean of the data set: 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
5.5
Median
The middle value in a data set. The median is not affected by extreme values or outliers.
It is a robust measure of central tendency.
What is the median of the data set: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
4
What is the median of the data set: 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
4.5
Mode
The most commonly observed value in a data set. A data set can have one, several or no modes.
What is the mode of the data set: 0, 1, 2, 2, 4, 4, 5, 6, 7, 9, 16?
2 and 4
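The answers above can be checked with Python's statistics module:

```python
import statistics

print(statistics.mean([1, 2, 2, 3, 4, 5, 6, 7, 9, 16]))       # 5.5
print(statistics.median([0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]))  # 4
print(statistics.median([1, 2, 2, 3, 4, 5, 6, 7, 9, 16]))     # 4.5
print(statistics.multimode([0, 1, 2, 2, 4, 4, 5, 6, 7, 9, 16]))  # [2, 4]
```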
Measures of Spread
Describe the dispersion of data within the data set.
Minimum and maximum
The lowest and highest values of the data set.
What are the minimum and maximum of the data set: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
0 and 16
Range
The difference between the maximum and minimum values in a data set.
What is the range of the data set: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
16
Variance
How far each value lies from the mean.
What is the variance of the data set: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
18.73
Standard Deviation
Measures the dispersion of a data set relative to its mean. It is expressed in the same unit as the data.
What is the standard deviation of the data set: 0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16?
4.33
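The spread answers can be reproduced with the statistics module; pvariance and pstdev treat the data as a full population, which matches the card's figures:

```python
import statistics

data = [0, 1, 2, 2, 3, 4, 5, 6, 7, 9, 16]

print(min(data), max(data))                  # 0 16
print(max(data) - min(data))                 # range: 16
print(round(statistics.pvariance(data), 2))  # 18.73
print(round(statistics.pstdev(data), 2))     # 4.33
```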
Correlation
Indicates the direction and the strength of the linear relationship between two variables.
Correlation analysis is only applicable to quantitative (numerical) values.
Correlation Coefficient
Correlation coefficient is a number between -1 and 1 that measures the strength of the relationship between two variables. -1 or 1 indicates a perfect negative or positive correlation, while 0 means no linear relationship between the variables.
Pearson Correlation Coefficient
A measure of the linear correlation between two variables X and Y, giving a value between +1 and −1 inclusive,
where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
Spearman's Rank Correlation Coefficient
A nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).
It assesses how well the relationship between two variables can be described using a monotonic function.
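A minimal sketch contrasting the two coefficients on made-up data (the arrays are hypothetical); Pearson measures the linear relationship, while Spearman depends only on rank order:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly linear in x

pearson_r, _ = stats.pearsonr(x, y)    # sensitive to the linear form
spearman_r, _ = stats.spearmanr(x, y)  # depends only on rank order

print(round(pearson_r, 3))   # close to 1.0
print(round(spearman_r, 3))  # exactly 1.0 (y increases monotonically with x)
```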
Exponential Dependency
A relationship between two variables where changes in one variable result in exponential changes in another variable.
This relationship is typically represented by an exponential function.
Logarithmic Dependency
A type of relationship between two variables where one variable changes logarithmically as the other variable changes linearly.
This relationship is often represented by a logarithmic function.
Direct Causality
A relationship between two variables where a change in one variable directly causes a change in the other.
This implies a cause-and-effect relationship where the independent variable influences the dependent variable.
Reverse Causality
A situation where the presumed effect actually influences the presumed cause, reversing the assumed direction of cause and effect. It is discussed in feedback loops and econometric analyses because it can complicate causal interpretations.
Bidirectional Causality
A relationship where two variables affect each other reciprocally.
Changes in one variable lead to changes in the other, and vice versa, indicating a feedback loop or interdependent causal relationship.
Pure Coincidence
An occurrence of two or more events at the same time by chance, without any causal connection.
It refers to situations where the association between the events is not based on cause and effect.
Goodness of Fit
A statistical measure that describes how well a model fits a set of observations.
It determines the extent to which the observed data correspond to the model's predictions.
R-squared (Coefficient of determination)
R-squared is a statistical measure that explains how much of the variation in a dependent variable is explained by the independent variable. It ranges from 0 to 1, where 0 means the model doesn't explain any of the variance and 1 means the model explains all the variance.
What is a good R-squared value: 0.70, 0.50, 0.60, or 0.10?
0.70
Mean Absolute Error (MAE)
The average of the absolute differences between the predicted and actual values.
Mean Absolute Percentage Error (MAPE)
The mean of the individual absolute errors, each divided by its underlying actual value.
MAPE understates the influence of big but rare errors.
Root Mean Square Error (RMSE)
The square root of the average squared error. It gives more importance to the most significant errors.
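A from-scratch sketch of these fit metrics; the y_true and y_pred values are hypothetical:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 20.0])  # hypothetical actual values
y_pred = np.array([11.0, 11.5, 16.0, 18.0])  # hypothetical predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                          # Mean Absolute Error
mape = np.mean(np.abs(errors) / np.abs(y_true)) * 100  # in percent
rmse = np.sqrt(np.mean(errors ** 2))                   # penalizes large errors

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot                 # share of variance explained

print(mae, round(mape, 1), round(rmse, 2), round(r_squared, 2))
```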
Inferential Statistics
Helps us understand, realize, or know more about populations, phenomena,
and systems that we can't measure otherwise.
Probability
A branch of mathematics dealing with numerical descriptions of how likely an event is to occur.
Hypothesis
A proposed explanation of a problem derived from limited evidence; a statement that helps communicate an understanding of the question or issue at stake.
Hypothesis Testing
Determines whether there is enough statistical evidence in favour of a certain idea or assumption (e.g. about a population parameter).
The process involves testing the assumption by measuring and analyzing a random sample taken from the population in question.
Null Hypothesis (H0)
A statement or default position that there is no association between two measured phenomena or no association among groups.
In hypothesis testing, it is the hypothesis to be tested and possibly rejected in favor of an alternative hypothesis.
Alternative Hypothesis (H1)
What you believe to be true or hope to prove to be true.
P-value
A measure of the probability that an observed result occurred by chance. The smaller the P-value, the stronger the evidence that you should reject H0.
Significance Level
The strength of evidence that must be presented to reject the null hypothesis.
A/B Testing
A randomized experiment with two variants, A and B, which are the control and treatment in the experiment.
It's used to compare two versions of a product or service to determine which one performs better in terms of specific metrics.
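A minimal A/B-test sketch using a two-sample t-test from scipy; the samples and the metric (e.g. checkout time per variant) are hypothetical:

```python
from scipy import stats

# Hypothetical measurements for each variant
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]  # control
group_b = [11.2, 11.5, 11.0, 11.4, 11.3, 11.6]  # treatment

# H0: the two variants perform the same on this metric
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level chosen before the test
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, the variants differ")
else:
    print(f"p = {p_value:.4f} >= {alpha}: not enough evidence to reject H0")
```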
Business Intelligence (BI)
Comprises the strategies and technologies used by enterprises for the data analysis of business information.
Enterprise Reporting
The regular creation and distribution of reports across the organization to depict business performance.
Dashboarding
The provision of displays that visually track key performance indicators (KPIs)
relevant to a particular objective or business process.
Online Analytical Processing (OLAP)
An approach that enables users to query from multiple database systems at the same time.
Financial Planning and Analysis (FP&A)
A set of activities that support an organization's health in terms of financial planning, forecasting, and budgeting.
Business Sphere
A visually immersive data environment that facilitates decision-making
by channelling real-time business information from around the world.
Machine Learning
Finds hidden insights without being explicitly programmed where to look, and makes predictions based on data and findings.
Supervised Learning
A type of machine learning where the model is trained on a labeled dataset,
which means it learns from input data that is tagged with the correct output, facilitating the model to accurately predict the output when given new data.
Algorithm
A procedure for solving a mathematical problem in a finite number of steps
that frequently involves repetition of an operation. It can take several iterations for the algorithm to produce a good enough solution to the problem.
Training Data
Units of information that teach the machine trends and similarities derived from the data.
Training the Model
The process through which the algorithm goes through the training data again and again,
and helps the model make sense of the data.
Regression
A supervised machine learning method used for determining
the strength and character of a relationship between a dependent variable and one or more independent variables. Works with continuous data.
Dependent Variable
The one to be explained or predicted.
Independent Variable
The ones used to make a prediction. Also called "predictors".
Forecasting
A sub-discipline of prediction, and its predictions are made specifically about the future.
Polynomial Regression
Polynomial regression models the relationship between x and y as an nth-degree polynomial, fitting a nonlinear relationship between the value of x and the corresponding mean of y.
Multivariable Regression
A technique used to model the relationship between two or more independent variables and a single dependent variable.
It assesses how much variance in the dependent variable can be explained by the independent variables.
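A minimal least-squares sketch with one independent variable, using numpy.polyfit on hypothetical data points:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)  # independent variable (predictor)
y = np.array([2.2, 4.1, 5.9, 8.2, 9.9])     # dependent variable

# Fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # roughly 1.95 and 0.21

# For polynomial regression, raise the degree:
coeffs = np.polyfit(x, y, deg=2)  # coefficients of a 2nd-degree polynomial
```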
Time Series
A sequence of discrete data points taken at particular time intervals.
Time Series Forecasting
Uses models to predict future values based on previously observed values.
Time Series Analysis
A supervised machine learning approach that predicts the future values of a series,
based on the history of that series. Input and output are arranged as a sequence of time data.
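A minimal forecasting sketch, assuming a simple moving average over hypothetical observations (one of many possible forecasting models):

```python
# Hypothetical series of past observations, e.g. monthly demand
series = [112, 118, 132, 129, 121, 135, 148, 148]

# Predict the next value as the average of the last 3 observations
window = 3
forecast = sum(series[-window:]) / window
print(round(forecast, 1))  # 143.7 -- the prediction for the next period
```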
Classification
A supervised machine learning approach where observations are assigned to classes.
The output is categorical data where the categories are known in advance.
Marketing Mix Modeling
A technique used to analyze and optimize the different components of a marketing strategy
(often framed as the 4Ps: product, price, place, promotion) to understand their effect on sales or market share.
Unsupervised Learning
Unsupervised machine learning identifies patterns and structures without labeled responses or a guiding output.
Clustering
An unsupervised machine learning approach that involves splitting a data set into a number of categories,
which can be referred to as classes or labels. A clustering algorithm automatically identifies the categories.
Clustering Algorithm
Has the objective of grouping the data in a way that maximizes both the homogeneity within and
the heterogeneity between the clusters.
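A minimal k-means sketch using scikit-learn; the 2-D points and the choice of k = 2 are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical groups of points: one near (1, 1), one near (5, 5)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# k-means maximizes homogeneity within and heterogeneity between clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- the two groups are found automatically
```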
Association Rule Learning
An unsupervised machine learning technique used to discover interesting associations hidden in large data sets.
Market Basket Analysis
The exploration of customer buying patterns by finding associations between items
that shoppers put in their baskets.
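A from-scratch sketch of support and confidence, the two core metrics behind association rules; the baskets are hypothetical:

```python
# Hypothetical shopping baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

def support(items):
    """Fraction of baskets containing all the given items."""
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """How often the consequent appears in baskets with the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {butter}
print(support({"bread", "butter"}))       # 0.5  (2 of 4 baskets)
print(confidence({"bread"}, {"butter"}))  # 0.67 (2 of 3 bread baskets)
```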
Reinforcement Learning
A machine ("agent") learns through trial and error, using feedback from its own actions.
Deep Learning
Applies methods inspired by the function and structure of the brain.
Artificial Neural Networks (ANN)
Computing systems inspired by the biological neural networks that constitute animal brains.
These systems learn to perform tasks by considering examples, generally without being programmed with task-specific rules.
Natural Language Processing (NLP)
A field of artificial intelligence concerned with the interactions between computers and human language.
Natural Language Understanding (NLU)
Deals with machine reading comprehension, including tasks such as sentiment analysis.
Natural Language Generation (NLG)
Deals with the creation of meaningful sentences in the form of natural language.
Data Quality
The degree to which data is accurate, complete, timely, consistent, relevant,
and reliable to meet the needs of its usage.
GIGO
Garbage in. Garbage out. Without meaningful data, you cannot get meaningful results.
Out-of-sample Validation
A method of evaluating the predictive performance of a model on a new,
unseen dataset to test the model's ability to generalize beyond the data it was trained on.
Confusion Matrix
Shows the actual and predicted classes of a classification problem.
Accuracy
The proportion of the total number of correct predictions.
Recall
The ability of a classification model to identify all relevant instances.
Precision
The ability of a classification model to return only relevant instances.
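A from-scratch sketch of these three metrics for a binary classifier; the counts are hypothetical:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45  # true/false positives and negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
recall = tp / (tp + fn)                     # did we find all relevant instances?
precision = tp / (tp + fp)                  # did we return only relevant instances?

print(accuracy, round(recall, 3), round(precision, 3))  # 0.85 0.889 0.8
```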
Additional image-based cards (titles only):
Data Literacy Skills
Benefits of Data Literacy
Data Literacy Journey
Volume Measures
Data Warehouse Characteristics
Advantages of Data Marts
Data Lake vs. Data Warehouse
Cloud Storage vs. On-Premise Storage
Benefits of Fog Computing
Data Analysis vs. Data Analytics
How to calculate the mode?
Types of Correlation Coefficient
Types of Causality
What does R-squared depend on?
Common Questions about Inferential Statistics
Steps for Hypothesis Testing
Significance Level Characteristics
Line Chart (Run Chart)
Reinforcement Learning vs. Supervised Learning
Acceptable vs. Poor Quality Data