Practical Application for Programming in R: Data Processing in R

Instructor: Alexis Kypridemos

Alexis is a technical writer for an IT company and has worked in publishing as a writer, editor and web designer. He has a BA in Communication.

This lesson gives you practice with data processing in R: it provides sample code that detects outliers in a matrix and asks you to modify the program in several ways.

Lesson Overview & Knowledge Required

In this lesson, program code is provided that detects the outliers in the built-in R data set airquality. You will then be asked to modify the code to substitute the outlier values with the median value of the column where the outlier is found, as well as perform other operations.

To complete the tasks, you need to be familiar with key R data structures: vectors, data frames and, in particular, matrices. You also need to understand R data types such as numeric, integer and character, and how to convert data from one type to another. Finally, you need basic programming concepts such as conditional statements and loops.
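As a quick self-check of these prerequisites, the short sketch below touches each of them on made-up data. The data frame df and its columns are hypothetical and are not part of the lesson's data set.

```r
# A data frame can mix column types; a matrix holds a single type.
df <- data.frame(id = 1:3, score = c(2.5, 3.7, 4.1))
m <- as.matrix(df)            # both columns become numeric
storage_type <- typeof(m)     # "double", because score has decimals

# Converting numeric to integer truncates the decimal part.
score_int <- as.integer(df$score)

# A loop with a conditional: label each score as "high" or "low".
labels <- character(nrow(df))
for (i in seq_len(nrow(df))) {
  if (df$score[i] > 3) {
    labels[i] <- "high"
  } else {
    labels[i] <- "low"
  }
}
print(labels)
```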

Program Code

Below is R code that converts the built-in R data set airquality into a matrix, detects the outlier values in the set, and stores them in a separate vector. This code makes use of the outliers package. Installing the package is included in the first lines of code. Note that the finding of outliers is limited to the first four columns of the airquality data set, as the last two columns contain date information.
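To see what the detection step computes without depending on the package, here is a minimal base-R sketch. It assumes that an outlier means the value farthest from the column mean, which is how the outliers package documents outlier(). The helper farthest_from_mean is a hypothetical stand-in, not part of the package, and the commented lines are an assumption about how the package-based calls would look.

```r
# With the outliers package installed, the detection step described in the
# text would read roughly as follows (shown as comments, not run here):
#   install.packages("outliers")
#   library(outliers)
#   ol <- outlier(aqm[, 1:4])

# airquality ships with R; convert the data frame to a numeric matrix.
aqm <- as.matrix(airquality)

# Hypothetical base-R stand-in: the value farthest from the mean, ignoring NA.
farthest_from_mean <- function(v) {
  v <- v[!is.na(v)]
  v[which.max(abs(v - mean(v)))]
}

# Only the first four columns are measurements; the last two hold the date.
ol <- apply(aqm[, 1:4], 2, farthest_from_mean)
print(ol)
```

The result is a named vector with one candidate outlier per column, which matches the "separate vector" the text describes.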


Code Application

Modify the above code so that the outlier values in the aqm matrix are replaced with the median value of the column where each outlier is found. Tip: you can read more about the outlier() function and the arguments it takes in its help page, which ?outlier opens once the package is loaded.

Follow-up Questions

  1. Once you've performed the above task, convert all the values in the output matrix to integer.
  2. Add an additional column to the aqm matrix that combines the values from the last two columns, Month and Day. It is more than likely that the matrix will need to be converted to a data frame to accomplish this, which is fine.
  3. Convert the values in the newly created column to a date data type.
  4. Having performed these individual manipulations, combine all the code snippets into a complete program.
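The kinds of conversion these questions call for can be tried first on a small stand-in matrix. The sketch below is one possible approach, not the lesson's official answer. The values in m are hypothetical, and the year 1973 is an assumption taken from the airquality help page, which describes the data as covering May to September 1973.

```r
# Stand-in for the cleaned matrix (hypothetical values).
m <- cbind(Ozone = c(41.5, 36, 12), Month = c(5, 5, 6), Day = c(1, 2, 15))

# 1. Convert every value to integer while keeping the matrix shape.
storage.mode(m) <- "integer"

# 2. A matrix holds one type, so switch to a data frame before adding text.
df <- as.data.frame(m)
df$MonthDay <- sprintf("%02d-%02d", df$Month, df$Day)

# 3. Convert the combined column to Date, prefixing the assumed year.
df$MonthDay <- as.Date(paste0("1973-", df$MonthDay))
print(df)
```

Using a data frame for steps 2 and 3 avoids the coercion that would turn every value in a matrix into text as soon as a character column is added.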

Answer Key

Code Application

Here is the code that replaces the outlier values with the median of the column where each of those values has been detected:

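ol <- outlier(aqm[, 1:4], logical = TRUE)
x <- 1:nrow(aqm)  # row indices
y <- 1:4          # column indices: the four measurement columns
aqm[x, y] <- ifelse(ol, apply(aqm[, y], 2, median, na.rm = TRUE)[col(ol)], aqm[x, y])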

Let's explain what we did here.

The first line of code is very similar to the last line in the provided program code. It uses the outlier() function to detect outlier values in the first four columns of aqm and stores the result in ol. This time, however, the call includes the logical=TRUE argument, so ol is a matrix of logical values rather than the outlier values themselves. For each outlier detected in aqm, ol contains TRUE at the same index (row and column coordinates). Every other entry is FALSE, or NA where the corresponding value in aqm is missing.
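To make the shape of such a logical matrix concrete, the sketch below builds one by hand on made-up data. The helper flag_farthest is a hypothetical base-R approximation of what outlier() returns with logical=TRUE, assuming an outlier is the value farthest from the column mean.

```r
# Small stand-in matrix with one missing value (hypothetical data).
m <- cbind(a = c(1, 2, 3, 50), b = c(10, NA, 12, -40))

# Hypothetical helper: TRUE where a value is the one farthest from the
# column mean, NA where the value is missing, FALSE otherwise.
flag_farthest <- function(v) {
  dist <- abs(v - mean(v, na.rm = TRUE))
  dist == max(dist, na.rm = TRUE)
}

mask <- apply(m, 2, flag_farthest)
print(mask)
```

The mask has the same dimensions as m, so the entry at a given row and column describes the value of m at the same position.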

With ol configured this way, it is easy to use it as a reference in a conditional statement that replaces the outlier values in aqm, which is exactly what the following lines of code do.

To keep the replacement to as few lines of code as possible, we create two numeric vectors, x and y. x holds the row indices of the matrices (1 through the number of rows) and y holds the column indices. As mentioned above, we're only concerned with the first four columns of the aqm matrix, so y runs from 1 to 4. If you're wondering how to find a matrix's dimensions, dim() returns them directly, and str() reports them along with other details.
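A minimal sketch of measuring the matrix and building the two index vectors follows. The names dims, n_rows, n_cols and sub are illustrative and do not appear in the lesson.

```r
aqm <- as.matrix(airquality)

dims <- dim(aqm)          # c(rows, columns)
n_rows <- nrow(aqm)
n_cols <- ncol(aqm)

x <- seq_len(n_rows)      # row indices 1, 2, ..., n_rows
y <- 1:4                  # first four columns only

# Indexing with both vectors selects the block the replacement works on.
sub <- aqm[x, y]
print(dim(sub))
```

Deriving x from nrow() rather than typing the row count keeps the code valid if rows are added or removed.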

Finally, the last line of code uses the ifelse() function to substitute (impute) each outlier with its column's median. ifelse() is a vectorized form of an if-else statement: it evaluates the condition for every element of its first argument and returns a result of the same shape, so no explicit nested for loop over the rows x and columns y is needed. Here it checks every value in the ol matrix. Where the value is TRUE, the corresponding value in the aqm matrix is replaced.

The ifelse() function takes three arguments:

  • the condition to check; in this case, whether the value at each index of ol is TRUE
  • the value to return where the condition is met; here, the median of the column where the outlier sits, ignoring any NA values contained in the column
  • the value to return where the condition is not met; here, the current value of aqm, left as is
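The three arguments can be seen at work on a small made-up matrix. This is a sketch under stated assumptions: m and mask are hypothetical, and col_medians and median_grid are illustrative names. Note that median() applied to several columns at once returns a single pooled value, so the sketch computes one median per column and pairs each cell with the median of its own column.

```r
# Hypothetical data and a matching logical mask marking the outliers.
m <- cbind(a = c(1, 2, 3, 50), b = c(10, 11, 12, -40))
mask <- cbind(a = c(FALSE, FALSE, FALSE, TRUE),
              b = c(FALSE, FALSE, FALSE, TRUE))

# One median per column, then repeated so that every cell lines up with
# the median of its own column (col(m) gives each cell's column number).
col_medians <- apply(m, 2, median, na.rm = TRUE)
median_grid <- col_medians[col(m)]

# Vectorized replacement: the median where mask is TRUE, the original elsewhere.
cleaned <- ifelse(mask, median_grid, m)
print(cleaned)
```

Because ifelse() returns a result shaped like its first argument, cleaned is a matrix with the same dimensions as mask.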
