1. Introduction to Statistical Models




  1. Overview of Statistical Models (00:10)

    In analytics, a model is a simplified mathematical approximation of some real-world process. We use models to identify relationships between fields or observations in a data set. When we create a good model, we should be able to use it to analyze other data sets, or make predictions for the future.

  2. Course Overview (01:01)

    Models are used in many areas of analytics. In this course, we’ll focus on three main areas, which are applied in other Kubicle courses:

    1. Regression and Clustering

    2. Classification Models

    3. Time Series Analysis


In this course, we're going to learn about the principles of statistical data models. In analytics, a model is a simplified mathematical approximation of some real-world process.

For example, it could take the form of an equation that describes the relationship we expect to see between a set of variables. Models like this are used when we want to identify the relationships between fields or observations in our data.
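As a minimal sketch of a model in equation form, the Python snippet below writes a straight-line relationship between two variables as a function. The variable names (`ad_spend`, sales) and the coefficient values are invented for illustration and do not come from any real data set.

```python
def predicted_sales(ad_spend, intercept=1.0, slope=2.0):
    """A model written as an equation: sales = intercept + slope * ad_spend.

    The intercept and slope are assumed values chosen for illustration only.
    """
    return intercept + slope * ad_spend


# Use the equation to estimate sales for a given level of advertising spend.
estimate = predicted_sales(3.0)
print(estimate)
```

In practice the coefficients would be estimated from data rather than chosen by hand, which is what techniques like regression do.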

When we create a model that describes our data well, it should also be able to describe other data just as well. This other data could take the form of data we previously collected but did not use to create the model. Alternatively, it could take the form of future data that had not been collected when we created the model. For this reason, models are a vital tool in predictive analytics, where we aim to predict future data and events using previous data.

Models are used across a huge variety of areas involving analytics. These range from longstanding areas, like statistical analysis, to newer and trendier areas, like machine learning. In this course, we'll focus on the principles of some of the statistical models that are applied in various other Kubicle courses.
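The idea of checking a model against data that was not used to create it can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the numbers are invented, the helper names `fit_line` and `mean_abs_error` are hypothetical, and a least-squares straight line stands in for whatever model is actually being built.

```python
def fit_line(xs, ys):
    """Estimate intercept and slope of a straight line by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_abs_error(xs, ys, intercept, slope):
    """Average absolute gap between observed values and the line's predictions."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


# Invented data: the first five points build the model, the last two are held out.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]
holdout_x = [6.0, 7.0]
holdout_y = [13.1, 14.8]

intercept, slope = fit_line(train_x, train_y)
train_error = mean_abs_error(train_x, train_y, intercept, slope)
holdout_error = mean_abs_error(holdout_x, holdout_y, intercept, slope)

print("error on data used to build the model:", round(train_error, 3))
print("error on held-out data:", round(holdout_error, 3))
```

If the error on the held-out points is close to the error on the points used to fit the line, that suggests the model describes new data about as well as the data it was built from.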

We'll break these principles down into three main areas.

First, we'll look at regression and clustering.

Regression is probably the most well-known technique for modeling the relationships between variables.

We'll look at two common types of regression, linear regression and logistic regression.

Clustering focuses on understanding the relationships between points in the data and grouping similar points together.

Next, we'll look at several different classification models. These models deal with the problem of deciding how to categorize a data observation. We'll study decision trees, possibly the most common classification model, and then look at several models that work by using multiple decision trees to classify new data.

Third, we'll look at time series analysis. Analyzing data collected over time allows us to clearly view trends and developments in a business. We'll take a detailed look at the ARIMA model, one of the most common models for analyzing time series data. Understanding and modeling time series data makes it easier to forecast future observations in the series.

All of the areas we've just mentioned are large areas of study in themselves. We won't cover all the details of these areas in this course. Instead, we'll simply focus on introducing some of the models that you'll see being used in other Kubicle courses.

Many of the models we'll see can also get quite mathematically complicated. We'll aim to minimize this complexity and instead focus our attention on the intuition behind how these models work.

In the next lesson, we'll start learning by looking at the concept of regression analysis.
