Categorical Variables in Regression Analysis

Instructor: Bob Bruner

Bob is a software professional with 24 years in the industry. He has a bachelor's degree in Geology, and also has extensive experience in the Oil and Gas industry.

A categorical variable is one that takes on non-numeric values, such as a person's gender or race. In this lesson, we look at how to code categorical variables as numeric dummy variables so that this data can be used in regression analysis.

Categorical Variables and Statistics

Suppose you need to run a marketing campaign that is targeted to a specific group, perhaps men or women or maybe homeowners vs. renters. If you have data available that contains this type of categorical information, how do you go about applying statistical measures to validate your plans?

Categorical variables describe items as belonging to exactly one of a set of categories. Quite often these are binary variables that take on only one of two possible values, such as male or female, or yes or no. Other categorical variables have more than two categories. For example, your co-workers might get to work in one of four ways: they walk, ride a bike, drive in a car, or take public transportation.

Categorical variables by themselves cannot be used directly in a regression analysis, which is a useful statistical tool for highlighting trends and making predictions from measured data. A regression analysis requires numerical data as the basis of its computations. There are various ways to get around this issue when dealing with categorical variables. In this lesson, we explore one of them: coding the categorical variables so that this type of data can be included in a regression analysis.

Coding Categorical Variables

In coding a categorical variable, we assign a unique number to each of the possible category values. This process is often referred to as using a dummy variable, as we are substituting a dummy value in place of the actual category designations. These dummy variables are often referred to as 'binary,' or 'zero-one indicator variables,' when we have two values to code.

The term 'zero-one indicator variable' is quite descriptive of the actual values typically used in the coding process. One category takes on the value of 0, and the other takes on the value of 1. When a variable has more than two categories, we code the categories as separate zero-one dummy variables and include them in the regression analysis as needed.
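The coding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own procedure: the function names and the sample data are hypothetical, and leaving out one reference category when there are several values is a common convention (it avoids duplicating the information carried by the intercept) that the lesson does not spell out.

```python
def zero_one_code(values, category_of_interest):
    """Return 1 where a value matches the category of interest, else 0."""
    return [1 if v == category_of_interest else 0 for v in values]


def dummy_columns(values, reference):
    """Build one 0/1 column per category, skipping the reference category."""
    levels = sorted(set(values) - {reference})
    return {level: zero_one_code(values, level) for level in levels}


# Hypothetical sample data, invented for illustration only.
gender = ["female", "male", "male", "female"]
commute = ["walk", "bike", "car", "transit", "car"]

# Two categories: a single zero-one indicator is enough.
gender_dummy = zero_one_code(gender, "male")

# Four categories: one dummy per category, with "walk" as the reference.
commute_dummies = dummy_columns(commute, reference="walk")

print(gender_dummy)
for level, column in commute_dummies.items():
    print(level, column)
```

A co-worker who walks is represented by 0 in all three commute columns, which is what makes walking the comparison group.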

The use of 0 and 1 makes regression evaluation relatively simple and is useful in deciding whether to reject the null hypothesis made about the relationships found in the data. Exactly which categorical value is assigned a specific value is arbitrary, but by convention, 0 is used for the reference or comparison value, and the characteristic of interest takes on the value of 1.

Did You Like the Movie?

Let's start with a simple example. Moviegoers have been asked to rate a movie they have just seen from 1 to 5 stars. We record their responses and also note whether they are male or female.

Figure: Histogram of Scores by Gender

It appears that women liked the movie a bit more than men. Can this be evaluated statistically? To do this, first create a dummy variable where female = 0 and male = 1, and use this in place of the categorical variable gender in the regression analysis.

Recall that in linear regression we have a dependent variable and one or more independent variables. In this case, we want to see if movie scores are dependent on gender, so score is set as the dependent variable. Our coded dummy variable value can then be plugged in as the independent variable. Doing this, we obtain the following equation from the actual regression analysis.

Score = 3.28 - (0.64 x Dummy)

Plug in the dummy variable accordingly:

Males Score = 3.28 - (0.64 x 1) = 2.64

Females Score = 3.28 - (0.64 x 0) = 3.28

You might notice that in this very simple example where we have only one independent variable coded as a dummy variable, these numbers are essentially the same as the average scores in each category. However, we do get to obtain additional statistics from the regression analysis.
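The link between the regression coefficients and the group averages can be checked with a small sketch. The ratings below are hypothetical (they are not the lesson's survey data), and `simple_ols` is an illustrative helper that uses the standard closed-form least-squares formulas for one independent variable.

```python
def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)


def simple_ols(x, y):
    """Return (intercept, slope) for the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical ratings, invented for illustration: female = 0, male = 1.
dummy = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [4, 3, 5, 4, 2, 3, 3, 4]

intercept, slope = simple_ols(dummy, scores)
female_mean = mean([s for d, s in zip(dummy, scores) if d == 0])
male_mean = mean([s for d, s in zip(dummy, scores) if d == 1])

print("intercept:", intercept, "female mean:", female_mean)
print("intercept + slope:", intercept + slope, "male mean:", male_mean)
```

With a single zero-one variable, the intercept matches the average of the group coded 0, and the slope matches the difference between the two group averages.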

R² = 0.085 (8.5%)

T-statistic of dummy = -2.11 (the coefficient is 2.11 standard errors below zero)

P-value of dummy = 0.039 (3.9%)

Our null hypothesis is that gender is not related to the movie scores. The p-value indicates that, if the null hypothesis were true, there would be only a 3.9% chance of seeing a difference at least this large through random variation alone. Because this is below the conventional 5% significance level, we reject the null hypothesis and conclude that, statistically speaking, gender and score are related. The R² value indicates that gender explains 8.5% of the variation in scores, so most of the variation is left unexplained by gender.
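For readers who want to see where these statistics come from, here is a minimal sketch that computes R² and the t-statistic of the dummy coefficient using the usual least-squares formulas. The data are hypothetical and much smaller than a real survey, so the numbers will not match the lesson's results, and `fit_summary` is an illustrative name rather than a library function. Converting the t-statistic into a p-value requires the t distribution with n - 2 degrees of freedom, which a statistics package would normally supply; that step is left out here.

```python
import math


def fit_summary(x, y):
    """Fit y = intercept + slope * x and return the slope, R-squared, and t-statistic."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    # Standard error of the slope, using n - 2 degrees of freedom.
    std_error = math.sqrt(ss_res / (n - 2) / sxx)
    t_stat = slope / std_error
    return {"slope": slope, "r_squared": r_squared, "t_stat": t_stat}


# Hypothetical ratings, invented for illustration: female = 0, male = 1.
dummy = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [4, 3, 5, 4, 2, 3, 3, 4]

summary = fit_summary(dummy, scores)
print(f"slope = {summary['slope']:.2f}")
print(f"R-squared = {summary['r_squared']:.3f}")
print(f"t-statistic = {summary['t_stat']:.2f}")
```

A negative t-statistic here has the same reading as in the lesson: the group coded 1 scored lower on average than the reference group coded 0.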
