5. Shaping Data

 
Subtitles Enabled

Sign up for a free trial to access more free content.

Free trial

Overview

Data often needs to be in a particular shape in order to be analyzed. In this lesson, we introduce how to get a dataset into the right format for further analysis.

Lesson Notes

Lesson Goal

The goal of this lesson is to outline the steps required to get data into an appropriate structured format.

Giving Data Structure

Unstructured data needs to be given structure before analyzing it. The techniques used to do this are advanced and beyond the scope of this course. An example of their use would be studying the words of an article to analyze if it is positive or negative in tone.

You’re more likely to encounter a situation where the data you need to analyze is not in an appropriate layout. This can happen in Excel, where a single worksheet can mix data storage and data analysis.

An Example in Excel

In small datasets, data and analysis can end up on the same worksheet. Applications like Tableau and Power BI import entire sheets, so this layout won’t work in those applications.

Data collected in Excel is often in a pivoted layout. In this layout, information of interest is spread across multiple columns. For example, a dataset of survey responses, where the response to each question is in a separate column would be pivoted. A layout like this is often described as short and fat, short and wide or other similar terms. This is because the dataset has a large number of columns and a small number of rows.

A better layout is the unpivoted layout, where the information of interest is spread over multiple rows. For example, the survey dataset should have the response to each question in a separate row. The unpivoted layout can be described as tall and thin, tall and narrow or similar terms.

When you’re preparing a dataset for analysis, getting the layout right is one of the most important steps. The details of how to create the unpivoted layout vary between software programs, so we don’t cover them in this course.

Transcript

The final step before starting an analytics project is to shape the data into the right format for analysis. In this lesson we'll outline the steps required to get data into an appropriate structured format. As we discussed in a previous lesson applications like Alteryx, Tableau, or Power BI require your data to be in a structured, tabular format. However, your data might not originally be formatted this way. We'll consider what steps can be taken to get your data into this structured format. First, let's quickly consider unstructured data. We mentioned previously that unstructured data often comes in the form of text data. However, it can also take the form of other media such as images, videos, etc.

For example, you might want to analyze news articles about a company. You could analyze the words used in the headline in the article to come up with a number that judges if the article is positive or negative.

These numbers can be stored in a structured format.

In reality the techniques you would use to obtain this structured data are quite advanced so we won't discuss their details here. Instead, we'll consider a different situation that you're more likely to encounter.

If you create or use data sets in Excel you'll often find that the layout is not suited for analysis in another program. In this context shaping the data involves transforming it into a structured tabular layout that will suit that program. We'll look at a few changes you might need to make to get your data into shape.

In Excel we often store and analyze data on the same sheet.

In this example we have a sales data set along with some calculations we've created using its data.

In this case the sales data set only had 23 rows so it probably made sense to do our analysis on the same sheet. However, if we try to import this data into an application like Power BI the summary statistics would also be imported as these applications absorb entire Excel sheets. As a result you should make sure that data and analysis are on separate sheets in your Excel files. A common issue with data sets created in Excel is that they can come in a pivoted layout. In this layout information of interest is spread across multiple columns. For example, in this data set recording survey responses each row represents a person. And their responses to each question is recorded across over 100 columns. This data is often described as short and fat because there are a lot of columns but a minimal number of rows. A better layout involves unpivoting these columns so that the responses are spread across rows instead of columns.

Here we see that each row represents a single person's response to a single question. As a result each person now has over 100 rows but each of those rows is much narrower. This data is considered tall and thin. Fortunately you can transform data sets from a pivoted to an unpivoted layout or vice versa using applications like Power Query, Tableau Prep, or Alteryx.

We'll learn about why this layout is useful in our later lesson on databases but for now you should be aware that a tall and thin layout is usually preferable. Finally, you should consider other cleaning tasks that may need to be performed on your data. There are various steps that you may need to go through to get your data in the right format. In some instances these could involve altering the structure of the data set. For example, if you collected address data you may prefer to split this into street address, city, state, country, and so on.

This would help you analyze your data at different geographic levels.

The mechanics for doing this vary between software applications so we'll explain them in our other software specific courses.

There are other cleaning tasks which will not involve altering the structure of the data set, but we won't consider them in this course.

We've now learned how to shape our data before analyzing it. Data you obtain from external sources will often need some level of transformation to get it into the right format. Once this transformation is complete your data should look a bit like a database which we'll discuss in the next lesson.