Missing Values, Special Values & Outliers in R Programming

Instructor: Martin Gibbs

Martin has 16 years of experience in Human Resources Information Systems and has a PhD in Information Technology Management. He is an adjunct professor of computer science and computer programming.

In this lesson, you will learn the consequences of missing data in R, and how to locate, transform, or ignore rows with missing values. You will also learn how special values and outliers impact data analysis.

Missing in Action

A colleague has sent you an e-mail, stating the following:

"Based on the data, it has rained every day!"

Not quite believing this, you check the data. The following table shows the results for the past 10 days of measurements:

Day    Precip (cm)
1      1.98
2
3
4      2.05
5
6
7
8      1.06
9      0.005
10

After considering his statement, and the data (or lack thereof), you conclude that those blank measures weren't recorded or analyzed. Could this be a problem, especially with more sensitive or highly visible information? Definitely!

Depending on how missing data is handled, the same data set can tell a completely different story. You may even draw an inaccurate correlation or inference from it. If another researcher uses the same data but accounts for the missing values differently, the two sets of results will conflict, and neither may be seen as trustworthy.

Special Values and Outliers

Just as concerning are special values (e.g., @, &, or -0) and outliers (a temperature of 134 in Anchorage, AK, for example). For the most part, you will want to investigate outliers and remove those that are clearly errors, since a single extreme value can heavily skew summary statistics such as the mean.

The topic of outliers is covered in detail in other lessons in this course.
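As a quick illustration, one common way to flag a suspicious value is the 1.5 x IQR rule. The sketch below is only one possible approach, and the temperature readings in it are made up for the example (they are not from any data set in this lesson); the variable names are likewise hypothetical.

```r
# Hypothetical temperature readings; 134 is the suspicious value.
temps <- c(55, 61, 58, 134, 60, 57, 59)

# First and third quartiles (unname() drops the "25%" / "75%" labels).
q <- unname(quantile(temps, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Flag anything more than 1.5 * IQR outside the quartiles.
is_outlier <- temps < q[1] - 1.5 * iqr | temps > q[2] + 1.5 * iqr

temps[is_outlier]     # the flagged readings
temps[!is_outlier]    # the readings that remain
```

Whether a flagged value should actually be dropped is a judgment call that depends on the data and the question being asked.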

Another scenario is special characters, such as dollar signs in your data. You'll most likely want to remove dollar signs and commas so that you can do appropriate analysis.

So consider the following vector in R:

input <- c("$3,425.34", "$256.65")

We can use the gsub function to strip out the $:

input_clean <- gsub("\\$", "", input)
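The call above removes only the dollar sign, so the comma in the first value is still there and the result is still text. A minimal sketch of a fuller cleanup, assuming the goal is a numeric vector (the name amounts is introduced here for illustration):

```r
input <- c("$3,425.34", "$256.65")

# The character class [$,] matches either a dollar sign or a comma.
input_clean <- gsub("[$,]", "", input)

# Convert the cleaned text to numbers so it can be used in calculations.
amounts <- as.numeric(input_clean)

sum(amounts)
```

Using one character class keeps the cleanup to a single gsub call; two separate calls, one per symbol, would work equally well.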

Handling Missing Values in R

The path you choose will vary depending on the type of data, the story you are trying to tell, or the answer you're trying to find. Therefore, we need to have several options available for handling data. In the following sections, we'll look at ways to handle missing data in R.

First, let's look at a data set that is missing values.

Sample Data

In this example, we will be working with public-domain population data, which is available from the World Bank. After downloading the full data set, extract the zip folder and save it to a location you will remember, e.g., C:\R.

The data set we will analyze is the PovStatsData.csv file. Before loading it into R, open the file in Excel or another tool and get a sense of the data. Notice that there are many missing values for many of the year columns.

In order to load this into a data frame, we will use the read.csv function in R. Save this to a new data frame:

my_data <- read.csv("C:/R/PovStatsData.csv", header=TRUE)

If you look at the data in R, you'll notice all the missing values became NA. It becomes a bit tricky to do any analysis on these columns if they are NA instead of 0. Again, what is the story you want to tell? Should you ignore all empty values, or assume that they are 0, so they can be accounted for in statistical analyses?
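Before deciding, it helps to see where the missing values are. The sketch below does not use the World Bank file; it reads a tiny, hypothetical CSV supplied inline through the text argument of read.csv, so it runs anywhere. The column names (day, precip, station) and the data frame name rain are assumptions made for the example.

```r
# Hypothetical four-row CSV, inlined so the sketch runs without any file.
csv_text <- "day,precip,station
1,1.98,A
2,,B
3,,
4,2.05,A"

# Blank numeric fields become NA by default; na.strings = c("", "NA")
# also treats blank text fields as missing.
rain <- read.csv(text = csv_text, header = TRUE, na.strings = c("", "NA"))

is.na(rain$precip)               # TRUE/FALSE flag for each row
which(is.na(rain$precip))        # row positions of the missing values
colSums(is.na(rain))             # number of missing values per column
anyNA(rain)                      # TRUE if anything at all is missing

mean(rain$precip)                # NA: missing values propagate
mean(rain$precip, na.rm = TRUE)  # ignore the NA values instead
```

Many summary functions, such as mean and sum, accept na.rm = TRUE, which ignores missing values for that one calculation without changing the data.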

First, let's look at a way to recode these values to 0.

Recode the Values

One way to update the data set is to convert the NA values to 0. This is done with the is.na function, which returns TRUE wherever a value is missing. Wherever a value of NA is encountered in the data, set it to 0:

my_data[is.na(my_data)] <- 0

Now when you display the data, all NA values are 0. Applied to the rainfall measurements from earlier, the unrecorded days would count as 0 cm, which means we can't say it has rained every day!
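Recoding is not a neutral choice, because it changes the statistics. The sketch below uses the rainfall measurements from the start of the lesson, assuming the days without a value are days 2, 3, 5, 6, 7, and 10; the vector names are introduced here for illustration.

```r
# Ten days of rainfall (cm); NA marks the days with no measurement.
precip <- c(1.98, NA, NA, 2.05, NA, NA, NA, 1.06, 0.005, NA)

# Average over the 4 recorded days only.
mean(precip, na.rm = TRUE)

# Same recode pattern as above, applied to a copy of the vector.
precip_zero <- precip
precip_zero[is.na(precip_zero)] <- 0

# Average over all 10 days, with missing days counted as 0 cm.
mean(precip_zero)

# Number of days with rain once missing days are treated as dry.
sum(precip_zero > 0)
```

The two averages differ considerably, which is why the choice between ignoring and recoding missing values should follow from what a blank entry actually means in your data.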

Another option in R is to exclude all missing values entirely.

Exclude the Values

If we exclude all rows with an NA in the PovStatsData sample, we won't get any records. This is because every row contains at least one column with an NA. In order to test the exclusion, we'll enter a very small data set manually.

Enter the following R code to create a small sample data set.

my_df <- data.frame(Column1 = c(1:3, NA),
 Column2 = c('Text', NA, 'is', 'here'),
 Column3 = c(TRUE, FALSE, TRUE, TRUE),
 Column4 = c(0.5, 2.2, 32.2, NA),
 stringsAsFactors = FALSE)

Next, print the result:

my_df

The results should be as follows:

  Column1 Column2 Column3 Column4
1       1    Text    TRUE     0.5
2       2    <NA>   FALSE     2.2
3       3      is    TRUE    32.2
4      NA    here    TRUE      NA

In order to pull only complete rows, we use the complete.cases() function in R. Note that this function alone does not return the rows themselves; it only returns TRUE or FALSE for each row it evaluates. Thus, the following code:

complete.cases(my_df)

Results in this:

[1]  TRUE FALSE  TRUE FALSE

What this means is that row 1 is complete, row 2 is not, row 3 is complete, and row 4 is not.

In order to actually get the data that is complete, use the result of complete.cases as a row index. The TRUE/FALSE values select which rows of the my_df data frame to keep:

my_df[complete.cases(my_df), ]

The result is then:

  Column1 Column2 Column3 Column4
1       1    Text    TRUE     0.5
3       3      is    TRUE    32.2
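The whole exclusion step can be reproduced with the self-contained sketch below. It rebuilds the sample data frame and also shows na.omit, a base R function that drops the same incomplete rows. The names complete_rows, omitted, and partial are introduced here for illustration only.

```r
# Rebuild the sample data frame so this block runs on its own.
my_df <- data.frame(Column1 = c(1:3, NA),
                    Column2 = c('Text', NA, 'is', 'here'),
                    Column3 = c(TRUE, FALSE, TRUE, TRUE),
                    Column4 = c(0.5, 2.2, 32.2, NA),
                    stringsAsFactors = FALSE)

# Keep only the rows with no NA in any column.
complete_rows <- my_df[complete.cases(my_df), ]

# na.omit() drops the same rows in a single call.
omitted <- na.omit(my_df)

nrow(complete_rows)            # rows kept
sum(!complete.cases(my_df))    # rows dropped

# Require completeness in selected columns only (here Column1 and Column4).
partial <- my_df[complete.cases(my_df[, c("Column1", "Column4")]), ]
partial
```

Restricting the check to the columns an analysis actually uses can preserve far more rows, which matters for wide data sets where nearly every row has an NA somewhere.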
