Inconsistent Data in R Programming

Instructor: Alexis Kypridemos

Alexis is a technical writer for an IT company and has worked in publishing as a writer, editor and web designer. He has a BA in Communication.

This lesson demonstrates through a code example how to use R programming to identify inconsistent data and bring it into the correct format for analysis.

Data Inconsistencies

Inconsistencies in a data set may arise for many reasons. In a survey, for example, participants are likely to fill in the same field in different ways. When asked how much time they devote to a specific activity, some may enter "2 hours", while others enter "two hours" or "2 hr". It is up to data scientists to identify such inconsistencies and bring all values into a single, correct format before analyzing the data.
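As a rough illustration of what bringing values into one format can involve, the sketch below normalizes the three example responses to numeric hours. The vector `raw`, the lookup table `words`, and the unit pattern are assumptions made for this example, not part of the lesson's data; a real survey would need a longer lookup and more unit variants.

```r
# Hypothetical raw survey responses, taken from the example in the text.
raw <- c("2 hours", "two hours", "2 hr")

# Hypothetical lookup for spelled-out numbers; extend as needed.
words <- c(one = "1", two = "2", three = "3")

# Lower-case and trim so that matching is not affected by case or stray spaces.
normalized <- tolower(trimws(raw))

# Replace each spelled-out number with its digit form.
for (w in names(words)) {
  normalized <- gsub(paste0("\\b", w, "\\b"), words[[w]], normalized, perl = TRUE)
}

# Strip the trailing unit so that only the number remains, then convert.
hours <- as.numeric(sub("\\s*(hours?|hrs?)$", "", normalized, perl = TRUE))
hours
```

All three responses end up as the same numeric value, which is the point of the exercise: the column can then be treated as a single numeric variable.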

Inconsistencies vs. Outliers

Data inconsistencies can be hard to define and to distinguish from errors and outliers. An outlier is a value that lies significantly outside the range of similar values. In the survey example above, if most participants answer that they spend between 1 and 2 hours on a given activity and only one participant answers 8 hours, that response would be considered an outlier. This does not make it an inconsistency or an error. The fact that a single response lies outside the normal range may well be important for later analysis.

Inconsistencies vs. Errors

Inconsistencies include some errors but not others. The first stage of data cleaning identifies and corrects certain types of errors, but does not go as far as dealing with all inconsistencies. In this first stage, we are concerned with making sure the data is technically correct. In R, this means arriving at a result where the data set has been imported as a data frame in which each measurable attribute, like "time spent on activity", is represented by a column. Each value in that column should be of the same type, such as numeric or character.

So in the activity example, the values in the column would be numeric, measuring the number of hours spent on the activity. A value like "two hours", which is a character value in R, is the type of data error identified in the first stage of data cleaning. Conversely, a value like -1 is of the correct type but is an obvious error, since it is impossible to spend negative time on anything. This is the type of error identified and corrected when searching for inconsistencies.
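The two kinds of problem described above can be told apart with a short check. This is a minimal sketch; the vector `raw` is a hypothetical stand-in for a column as it might arrive from a survey export, and the names `type_errors` and `impossible` are introduced only for illustration.

```r
# Hypothetical raw column, read in as character values.
raw <- c("1", "3", "two hours", "-1")

# Stage 1 (technical correctness): values that cannot be parsed as numbers
# become NA. suppressWarnings() hides the coercion warning R would print.
hours <- suppressWarnings(as.numeric(raw))
type_errors <- raw[is.na(hours)]

# Stage 2 (consistency): values of the correct type that are impossible,
# such as negative time.
impossible <- hours[!is.na(hours) & hours < 0]

type_errors
impossible
```

Here "two hours" is caught in the first stage because it is not numeric, while -1 passes the type check and is only caught by the range check.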

Example of Identifying and Correcting Data Inconsistencies

Consider the following data frame, called activity:

time spent on activity
1
3
0.5
2
-1
NA

In this sample of six observations, there are examples of two common inconsistencies: an obvious error (-1), and a missing value (NA). Let's see how to identify and correct such issues in R.
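To follow along, the data frame can be recreated and inspected as sketched below. The column name `time.spent.on.activity` matches the one used in the lesson's code; the variable names `missing_rows` and `negative_rows` are assumptions for this example.

```r
# Recreate the six observations from the table above.
activity <- data.frame(time.spent.on.activity = c(1, 3, 0.5, 2, -1, NA))

# Row positions of missing values.
missing_rows <- which(is.na(activity$time.spent.on.activity))

# Row positions of impossible (negative) values. which() skips the NA
# produced by comparing a missing value with 0.
negative_rows <- which(activity$time.spent.on.activity < 0)

missing_rows
negative_rows
```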

By default, missing values propagate through arithmetic and summary functions: if any input is NA, the result is NA as well. For example, if we tried to calculate the mean value of the time spent on activity column, we would receive the following:

mean(activity$time.spent.on.activity)
[1] NA
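The same behavior can be seen in a quick check on the sample data. This is only a sketch; the variable names `has_missing` and `total` are chosen for this example.

```r
activity <- data.frame(time.spent.on.activity = c(1, 3, 0.5, 2, -1, NA))

# anyNA() reports whether the column contains at least one missing value.
has_missing <- anyNA(activity$time.spent.on.activity)

# Like mean(), sum() returns NA as soon as one input is missing.
total <- sum(activity$time.spent.on.activity)

has_missing
total
```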

To correct this, we will perform an imputation. Imputation is the substitution of a missing value with a value derived from the valid values in the data set. We will accomplish this using the ifelse() function. This function packs the logic of an if-else conditional structure into a single, vectorized call: the condition is checked for every element of a vector. It takes three arguments:

  • the condition to check
  • the value to return where the condition is met
  • the value to return where the condition is not met
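A small standalone example may help before applying ifelse() to the data frame. The vector `x` and the replacement value 0 are arbitrary choices for illustration and are not taken from the lesson.

```r
# A hypothetical numeric vector with one negative value.
x <- c(4, -2, 7)

# For each element: if it is negative, return 0; otherwise keep the element.
result <- ifelse(x < 0, 0, x)
result
```

The call returns a vector of the same length as `x`, with only the element that met the condition replaced.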

In this case, to remove the NA value, we will configure the ifelse() function as follows:

activity$time.spent.on.activity = ifelse(is.na(activity$time.spent.on.activity), mean(activity$time.spent.on.activity, na.rm = TRUE), activity$time.spent.on.activity)

Let's explain what we did here. We apply the ifelse() function to one column of a data frame at a time, in this case activity$time.spent.on.activity, and assign the result back to that same column.

Next, for the function's first argument, we use another function, is.na(), to check whether each value in the column is a missing value: ifelse(is.na(activity$time.spent.on.activity).

The second argument performs the imputation, substituting the missing value with the column's mean. The mean is calculated over all values in the column except the missing one. This is accomplished by passing the argument na.rm = TRUE, where rm stands for remove: mean(activity$time.spent.on.activity, na.rm = TRUE).

For the third and last argument, the 'else' condition, when the value is not missing, we leave the existing value in place: activity$time.spent.on.activity).

Having imputed the missing value, we can now calculate the mean:
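The lesson's own output is not available here, so the following is a sketch that reruns the imputation on the sample data and then takes the mean. It uses `<-` for assignment, which behaves the same as `=` in this context; the variable name `imputed_mean` is an assumption for this example.

```r
activity <- data.frame(time.spent.on.activity = c(1, 3, 0.5, 2, -1, NA))

# Replace the missing value with the mean of the non-missing values.
activity$time.spent.on.activity <- ifelse(
  is.na(activity$time.spent.on.activity),
  mean(activity$time.spent.on.activity, na.rm = TRUE),
  activity$time.spent.on.activity
)

# With no NA left in the column, mean() returns a number.
imputed_mean <- mean(activity$time.spent.on.activity)
imputed_mean
```

Note that at this point the impossible value -1 is still in the column, so it contributes to both the imputed value and the resulting mean.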
