Transforming Data to Data Frame in R Programming

Instructor: Alexis Kypridemos

Alexis is a technical writer for an IT company and has worked in publishing as a writer, editor and web designer. He has a BA in Communication.

Raw data is generally converted into data frames as part of the data cleaning process. This lesson explains how to transform raw data into data frames in R programming, and why this is important for analysis.

Working with Data

Before discussing the specifics of transforming 'unruly' data into data frames in R, it is useful to briefly look at the overall process of working with data, and why it is important to 'clean' the data.

Working with data can be broken down into four basic stages:

  1. Collection, also called Acquisition
  2. Cleaning
  3. Analysis
  4. Reporting

This lesson will focus on data cleaning.

Data Cleaning

Data cleaning, also called data cleansing, is the process of finding and correcting inaccurate or inconsistent records and bringing the data into a form that is easier to analyze. If this step is skipped, the data is likely to contain errors, and the analysis results will be skewed.

To stress the importance of data cleaning, data scientists frequently refer to the 2014 New York Times article 'For Big-Data Scientists, Janitor Work Is Key Hurdle to Insights', in which data experts say that 50 to 80% of their time is spent on cleaning data, as opposed to analyzing it.

Data Acquisition, Analysis and Reporting

While not the focus of this lesson, it is worth briefly explaining the other stages of working with data. Data acquisition, often abbreviated to DAQ, is the process of taking measurements from the real world, like temperature measurements, and digitizing them for easy analysis on a computer.

Data analysis forms the core work of data science: interpreting data taken from measurements or other sources to produce a meaningful result.

Reporting is the process of conveying these results to the appropriate audience.

Transforming Data to a Data Frame

One useful step in cleaning data is to transform raw data into an R data frame, which brings it into a tabular format, with rows and columns, like a spreadsheet.

This is best explained through an example.

The United States National Weather Service has recorded the dates on which temperatures of 100 degrees Fahrenheit or above were reached in Central Park, New York City, between 1869 and 2012. These measurements are published as a publicly available PDF document at the URL: https://www.weather.gov/media/okx/Climate/CentralPark/100DegreeDays.pdf.

Let's assume that we want to take the data from that PDF and import it into R for analysis.

Here is a sample row of data from that PDF, for the year 2011:

2011 Jul 22/ 104 Jul 23/ 100

While easy for humans to read, the above cannot be typed into R as-is, because text values like Jul 22 are not quoted and the individual values are not separated consistently. For brevity's sake, the data has been brought into a 'friendlier' format for R for the purposes of this lesson.

Now in a compatible format for R, the same data looks like this:

2011, 'Jul 22', 104, 'Jul 23', 100
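The lesson reformats the row by hand. As a rough sketch of how that step might be automated, the base R code below pulls the year, dates and temperatures out of the raw text with regular expressions. The variable names (raw_line, pattern and so on) and the assumed "Mon DD/ TTT" layout are illustrative only and are based on the single sample row shown above; other rows in the PDF may need different handling.

```r
# The sample row exactly as it appears in the PDF
raw_line <- "2011 Jul 22/ 104 Jul 23/ 100"

# Assumption: the year is the four digits at the start of the line
year <- as.numeric(sub("^([0-9]{4}).*$", "\\1", raw_line))

# Assumption: each observation looks like "Jul 22/ 104"
pattern <- "([A-Z][a-z]{2} [0-9]{1,2})/ *([0-9]+)"
matches <- regmatches(raw_line, gregexpr(pattern, raw_line))[[1]]

# Split each match into its date and temperature parts
dates <- sub(pattern, "\\1", matches)
temps <- as.numeric(sub(pattern, "\\2", matches))

print(year)
print(dates)
print(temps)
```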

Let's assume the end goal of the data cleaning task is to convert this data into a data frame, where one column will hold the year (2011), another the date ('Jul 22'), and a third the recorded temperature (104).

Because this row of data contains multiple data types, as a first step let's import it into R as a list, since a list is an R data structure whose elements can be of different data types. We'll use the list() function to do this, and name the list 'weatherData':

weatherData <- list(2011, 'Jul 22', 104, 'Jul 23', 100)
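To see why a list suits mixed data, the short sketch below checks the type of each element and contrasts it with a plain vector built with c(), which coerces every value to a single type. The names element_types and as_vector are introduced here for illustration only.

```r
# Recreate the list from the lesson
weatherData <- list(2011, 'Jul 22', 104, 'Jul 23', 100)

# Each list element keeps its own type
element_types <- sapply(weatherData, class)
print(element_types)

# For contrast, c() coerces all values to one type (character here)
as_vector <- c(2011, 'Jul 22', 104, 'Jul 23', 100)
print(class(as_vector))
```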

Next, let's convert that list into a data frame with the data.frame() function:

weatherData <- data.frame(weatherData)

The above code has created a data frame, but it is not yet in an optimal format for analysis, as it contains a single row with five columns.
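One way to confirm this shape is to inspect the data frame directly. The sketch below repeats the two lines from the lesson, then checks the dimensions and column names. Because the list was unnamed, R generates column names automatically; the replacement names used at the end (year, date1, temp1, date2, temp2) are an assumption made for readability, not part of the original lesson.

```r
# Repeat the lesson's steps
weatherData <- list(2011, 'Jul 22', 104, 'Jul 23', 100)
weatherData <- data.frame(weatherData)

# dim() returns the number of rows, then the number of columns
print(dim(weatherData))

# Column names were generated by R because the list had none
print(names(weatherData))

# Assign descriptive names (hypothetical, chosen for this sketch)
names(weatherData) <- c('year', 'date1', 'temp1', 'date2', 'temp2')
str(weatherData)
```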

Wide vs. Long Data

In the example data frame, there are two measurements (Jul 22 and Jul 23) for a single year (2011) in one row. When there are multiple observations on a single row, this is referred to as 'wide' data. The opposite, when there is a single observation on each row, is termed 'long' data. While 'wide' data is not inherently bad or incorrect, most analysis in R is performed on long data.
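The difference can be illustrated with the 2011 row. The following is a minimal base R sketch, not necessarily the method the full lesson goes on to use: it builds the wide form with hypothetical column names, then stacks each date and temperature pair into its own row to produce the long form.

```r
# Wide form: one row holding both 2011 observations
# (column names are hypothetical, chosen for this sketch)
wide <- data.frame(
  year  = 2011,
  date1 = 'Jul 22', temp1 = 104,
  date2 = 'Jul 23', temp2 = 100,
  stringsAsFactors = FALSE
)

# Long form: each (date, temperature) pair becomes its own row
long <- data.frame(
  year        = rep(wide$year, 2),
  date        = c(wide$date1, wide$date2),
  temperature = c(wide$temp1, wide$temp2),
  stringsAsFactors = FALSE
)

print(long)
```

For larger data sets, stacking columns by hand becomes tedious; base R's reshape() function or the pivot_longer() function from the tidyr package are commonly used for the same job.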
