4. Data Sources
In many cases, the data you need for an analytics project will already exist somewhere. This lesson discusses where you might find it, and how you might get your hands on it.
The goal of this lesson is to learn about different sources of existing data.
Internal Data Sources
In some situations, all the data you need may be available within your company, in databases or files. If your company has good policies surrounding file organization and retention, it can be easy to use internal data.
Internal data doesn’t always come in files and databases though. Individual knowledge and experience can be useful sources of data that are often overlooked. However, it’s rare for this knowledge to be stored somewhere in a formal manner. Obviously, you should be careful about relying too heavily on experience, but it can be combined with numeric data quite effectively.
External Data Sources
In recent years, many organizations have made their data available to the public, in keeping with the principles of open data. As an example, the U.S. Census Bureau provides access to a wide range of statistics concerning the US population. Other bodies, including public and private organizations, provide similar access to their data.
Many companies, particularly technology-focused companies, allow you to connect to their data through Application Programming Interfaces, or APIs. An API is a method for connecting to a web-based application. Many companies, such as Google, use APIs to give people and other companies access to their various services. As an example, a Google Analytics API could provide useful information about traffic to your website.
In the previous lesson, we saw how to create your own data for use in an analytics project.
More often than not, you won't create new data for your projects.
Instead, you'll use information from an existing data source. In this lesson, we'll learn about different sources of existing data and discuss their advantages and disadvantages.
We'll first look at internal data sources like individual knowledge.
We'll then look at external sources coming from organizations that subscribe to principles of open data.
Finally, we'll see how you can connect to online data sources using APIs.
The easiest way to gather data is to use existing data.
This could be internal company data, such as information stored in a database, or external data, for example from a website.
In some cases, the data you need may be available within your company.
It can be located within databases or in other files.
We'll discuss databases in more detail later on in this course so we won't talk about the technical details of connecting to them right now.
If your company has good policies surrounding file organization and retention, then using internal data can be straightforward.
However, in other situations, locating the relevant data within your company can be problematic.
One internal source of data that's often overlooked is individual knowledge.
While it's common for all of us to use personal experience when making decisions, it's rare that this knowledge is aggregated, recorded, and archived in a formal manner.
Basically, you shouldn't assume that the only information useful to you is found on computers.
Depending on the project you're working on, you may be able to gather useful information from talking to other people.
Of course, you should be careful about over-reliance on individual knowledge and should generally combine it with numeric data as well.
You can also obtain data from external sources.
In recent years, this has become a lot easier as various organizations subscribe to the principles of open data and share their data for public use.
For example, the U.S. Census Bureau collects a vast array of data including population statistics, economic indicators, employment and education statistics, and income data.
This information can generally be downloaded in standard formats that are easy for you to work with like CSV files.
Similar data is available from a range of other sources including statistical bodies in other countries as well as global organizations like the International Monetary Fund, or the World Health Organization.
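As a rough illustration of why CSV downloads like those mentioned above are easy to work with, here is a minimal Python sketch. The sample text, column names, and figures are made-up placeholders standing in for a downloaded file, not real census data, and the helper names are hypothetical.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file. The column names
# and figures are placeholders, not real census data.
SAMPLE_CSV = """region,population,median_income
North,1200,52000
South,800,47000
East,1500,61000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({
            "region": row["region"],
            "population": int(row["population"]),
            "median_income": int(row["median_income"]),
        })
    return rows


def total_population(rows):
    """Add up the population column across all rows."""
    return sum(r["population"] for r in rows)


rows = load_rows(SAMPLE_CSV)
print(total_population(rows))
```

With a real download, you would open the saved file instead of the sample string and adjust the column names to match the file's header row.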
Some private organizations also provide access to some of their data. This includes many of the social media companies like Facebook and Twitter, who, as we know, collect vast amounts of data from their users.
Note that data from private companies can come with some restrictions so it's important that you read all the rules and regulations before starting your analysis.
Finally, we'll look at a method of obtaining data from online sources using application programming interfaces, commonly known as APIs. APIs are tools that allow you to connect to a web-based application.
Software developers use APIs to connect to software from other companies. For example, let's say you run a website advertising a network of stores.
You want to add a map that lets people locate the stores.
The easiest way to do this is using the API of a map service like Google Maps.
Google provides an API that lets you embed a Google map on your website.
This is beneficial for you since you don't need to develop your own map, and users don't need to leave the site to view a map.
It's also good for Google as they can charge for use of the API, and they can control how their maps are presented on external websites.
If they update the appearance of their maps in the API, that update is automatically rolled out to websites using the API.
APIs can also be used for analytics.
As we mentioned previously, many of the major social networks provide access to their data.
They do this using APIs.
For example, Google Analytics provides several APIs including a real-time reporting API. This lets you analyze traffic on your website. You can use this API to identify the most popular pages, or create real-time dashboards.
As you can imagine, these APIs could be a valuable source of data.
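To make the idea of pulling analytics data through an API a little more concrete, here is a minimal Python sketch. It is not the Google Analytics API: the endpoint URL, the query parameters, and the shape of the response are all assumptions invented for illustration. The sketch builds a request URL, defines a fetch helper (which is not called here, since the endpoint is not real), and finds the most popular page in a made-up JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration. It is not a real service.
BASE_URL = "https://api.example.com/v1/pageviews"


def build_url(base_url, params):
    """Append query parameters to the endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Send a GET request and decode the JSON body (not called below)."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def most_popular_page(payload):
    """Return the path of the page with the highest view count."""
    return max(payload["pages"], key=lambda page: page["views"])["path"]


# Made-up response body standing in for what such an API might return.
SAMPLE_RESPONSE = (
    '{"pages": [{"path": "/home", "views": 120},'
    ' {"path": "/stores", "views": 340}]}'
)

url = build_url(BASE_URL, {"site": "example", "limit": 10})
payload = json.loads(SAMPLE_RESPONSE)
print(url)
print(most_popular_page(payload))
```

Real APIs usually also require some form of authentication, such as an API key, and each provider documents its own endpoints and response formats.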
APIs can get complex and most people won't need to know the details of how they work. However, it's good to know that they exist and provide a way of accessing data. As we've seen, there are various sources of existing data. Obviously, different sources will be useful for different projects and types of data. However, when you get data from an external source, it might not be in the ideal format for your analysis. In the next lesson, we'll look at how you can shape data into a neat, structured format.