Detecting Data Errors in R Programming

Instructor: Alexis Kypridemos

Alexis is a technical writer for an IT company and has worked in publishing as a writer, editor and web designer. He has a BA in Communication.

In this lesson, we demonstrate through code examples how to detect and correct errors in R as part of the data cleaning process, using functions such as str() and readLines() and arguments such as warn.

Data Errors vs. Technically Correct Data

Before we can explain what data errors are and how to find and correct them, we need to define the opposite: technically correct data. After all, technically correct data is what we want to produce in the first stage of data cleaning. We'll discuss the stages of data cleaning below.

Data are technically correct when they have been imported into R in tabular format, and each observation that measures the real-world occurrence we're studying is listed in a separate row. Each value in an observation measures a specific attribute of the real-world occurrence and is stored in a separate column. Each value in a column should be of the same data type, such as numeric, character, etc.

For example, when importing data collected from a survey about how many leisure activities participants engage in and the average time they spend on these activities, we would expect data in this format:

How many activities    Time spent on activities (hrs/day)
3                      1
2                      2
5                      0.5

The above data is tabular, and each measured attribute ('number of activities' and 'time spent on activities') forms a separate column. All the values in each column are of the same type: the values in the first column are integers, and those in the second column are numeric.
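As a minimal sketch, the table above could be built directly in R as a data frame and its column types checked. The names activity, activities and hours are illustrative assumptions, not names taken from the survey data.

```r
# Hypothetical object and column names, chosen only for this illustration
activity <- data.frame(
  activities = c(3L, 2L, 5L),   # the L suffix stores whole numbers as integers
  hours      = c(1, 2, 0.5)     # decimal values are stored as numeric (double)
)

# One class per column: every value in a column shares the same type
column_types <- sapply(activity, class)
print(column_types)
```

Each row holds one observation and each column holds one attribute, which is the layout described above.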

Data errors are values that do not conform to the above standards. Some common data errors are values imported with the wrong data type, multiple values stored in a single column, and column headers mistaken for observation values. Many errors arise because the data were recorded incorrectly to begin with, but they can also occur because the data is 'misread' when it is imported into R.
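One way such an error can show up is a numeric column that R imports as text. The sketch below simulates this with made-up CSV content; the stray entry "two" and the column names are assumptions for illustration only.

```r
# Simulated CSV lines; the entry "two" is invented for this sketch
raw_lines <- c(
  "activities,hours",
  "3,1",
  "2,two",
  "5,0.5"
)

bad <- read.csv(text = raw_lines, stringsAsFactors = FALSE)

# A single non-numeric entry makes R import the whole column as character
hours_type <- class(bad$hours)
print(hours_type)

# Converting to numeric turns unreadable entries into NA, which locates them
hours_num <- suppressWarnings(as.numeric(bad$hours))
problem_rows <- which(is.na(hours_num))
print(problem_rows)
```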

Data Cleaning, Errors and Inconsistencies

We mentioned earlier that in this first stage of data cleaning, we are concerned with identifying and correcting data errors, so that we produce data that is technically correct. Following this, there is an additional data cleaning stage, where we check for data inconsistencies. These are values that may be of the correct data type, but cannot be accurate for the real-world object we're studying. In the activities example, a data inconsistency would be a value of -1 for 'time spent on activities'. The value is of the correct type, numeric, but cannot be accurate, as it is impossible to spend negative time on anything.
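The difference between the two stages can be sketched in a few lines. The data frame below is invented for illustration, with a -1 placed in the hours column: the type check passes, while a simple range check flags the row.

```r
# Hypothetical data containing one inconsistent value
activity <- data.frame(
  activities = c(3L, 2L, 5L),
  hours      = c(1, -1, 0.5)   # -1 is numeric, so the type is correct
)

# Stage one (technical correctness): the column has the expected type
is_type_ok <- is.numeric(activity$hours)
print(is_type_ok)

# Stage two (consistency): time spent cannot be negative
negative_rows <- which(activity$hours < 0)
print(negative_rows)
```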

Missing Values

One of the most common issues with data is missing values, represented as NA in R. These cross the line between data error and inconsistency. In another lesson devoted to detecting and correcting data inconsistencies, we discuss how to correct missing values.
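Although correcting missing values is left to that other lesson, locating them is straightforward. The following sketch uses an invented data frame and base R's is.na() and complete.cases() to count and locate NA values.

```r
# Hypothetical data with two missing values
activity <- data.frame(
  activities = c(3L, NA, 5L),
  hours      = c(1, 2, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(activity))
print(na_per_column)

# TRUE for rows that contain no missing values at all
complete <- complete.cases(activity)
print(complete)
```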

Errors on Import

Let's assume the activity data in the table shown earlier is stored in a CSV file named activity.csv in R's working directory. However, the file includes only the values, and not the column headers. When importing this into R using the read.csv() function, we'd get the following result:

activity <- read.csv("activity.csv")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'activity.csv'

The incomplete final line message is a warning, not an error. It alerts us to the fact that the last line of the CSV file does not end with a newline character. A simple way to correct this would be to open the file, navigate to the end of the last line of data, press 'Enter' to create a new line and save the changes. However, when dealing with many files, this can quickly become impractical.
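For many files, the manual fix could be scripted. The sketch below is one possible approach, not the method this lesson goes on to use: the helper names ends_with_newline() and fix_final_line() are hypothetical, and the demonstration runs on a temporary file rather than on activity.csv.

```r
# Hypothetical helper: TRUE when the file's last byte is a line feed
ends_with_newline <- function(path) {
  size <- file.size(path)
  if (is.na(size) || size == 0) return(FALSE)
  bytes <- readBin(path, what = "raw", n = size)
  bytes[length(bytes)] == as.raw(10)   # 10 is the line feed character
}

# Hypothetical helper: append a newline only where one is missing
fix_final_line <- function(path) {
  if (!ends_with_newline(path)) cat("\n", file = path, append = TRUE)
  invisible(path)
}

# Demonstration on a temporary file written without a final newline
tmp <- tempfile(fileext = ".csv")
cat("3,1\n2,2\n5,0.5", file = tmp)

before <- ends_with_newline(tmp)
fix_final_line(tmp)
after <- ends_with_newline(tmp)
print(c(before = before, after = after))
```

Note that this modifies files in place, so it is best tried on copies first. It could be applied to a whole folder by looping over the result of list.files().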

Let's see how to work around this problem by including the readLines() function inside read.csv().

activity <- read.csv(text = readLines("activity.csv", warn = FALSE))

With this configuration, the data is imported without the warning. Let's see what we did.

In the read.csv() function, we supplied the text argument instead of a file path, and set its value to the result of the readLines() function. In readLines(), we entered two arguments:

  • the path to the CSV file
  • warn = FALSE, which suppresses the warning readLines() would otherwise give about the missing newline at the end of the file
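To see what is passed between the two functions, the sketch below recreates the situation with a temporary file (an assumption standing in for activity.csv) and inspects the intermediate result of readLines().

```r
# Temporary stand-in for activity.csv, written without a final newline
tmp <- tempfile(fileext = ".csv")
cat("3,1\n2,2\n5,0.5", file = tmp)

# readLines() returns a character vector with one element per line of the file
lines <- readLines(tmp, warn = FALSE)
n_lines <- length(lines)
print(lines)

# read.csv() then parses that character vector instead of reading the file itself
activity <- read.csv(text = lines)
print(class(activity))
```

Because read.csv() only sees the character vector, it never encounters the incomplete final line in the file.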

Checking for Errors After Import

With the data now imported, let's check for errors.

Whenever importing or creating a data frame in R, always perform initial checks by using these three functions:
