Summarizing Categorical Data
We will use the births data set to summarize and visualize categorical variables using the base R approach.
Categorical data sorts observations into categories or groups, such as hair color or education level. It can be further divided into nominal data, which has no inherent order (for example, hair color), and ordinal data, which has a defined order (for example, education level).
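As a minimal sketch of the nominal versus ordinal distinction in base R, the toy vectors below (hypothetical values, not taken from the births data) are stored as an unordered factor and an ordered factor:

```r
# Toy values for illustration only; they are not from the births data.
hair_color <- factor(c("brown", "black", "blond", "brown"))

# Ordinal data: list the levels from lowest to highest and set ordered = TRUE.
education <- factor(
  c("college", "high school", "graduate", "college"),
  levels = c("high school", "college", "graduate"),
  ordered = TRUE
)

is.ordered(hair_color)  # FALSE: nominal, no inherent order
is.ordered(education)   # TRUE: ordinal
education < "graduate"  # comparisons are only meaningful for ordered factors
```

Declaring the order explicitly matters because R otherwise sorts factor levels alphabetically.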
Single categorical variable
One way to extract all the character columns is to combine the Filter() and is.character() functions. The same approach works for factor columns, using is.factor() instead.
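The column selection described above can be sketched on a small data frame. The data frame toy and its columns are hypothetical stand-ins for birth_dat, used only so the example runs on its own:

```r
# Hypothetical toy data frame standing in for birth_dat.
toy <- data.frame(
  gender = c("Male", "Female", "Male"),
  weight = c(7.1, 6.8, 8.0),
  habit  = factor(c("NonSmoker", "Smoker", "NonSmoker")),
  stringsAsFactors = FALSE
)

# Filter() keeps the columns for which the predicate returns TRUE.
names(Filter(is.character, toy))  # "gender"
names(Filter(is.factor, toy))     # "habit"
```

Because a data frame is a list of columns, Filter() applies the predicate to each column in turn and returns a data frame with only the matching ones.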
Below are the first five rows of all the character columns in birth_dat, as returned by Filter(is.character, birth_dat):
  Gender Premie   Marital Racemom Racedad   Hispmom   Hispdad     Habit
1   Male     No   Married   White   White   Mexican   Mexican NonSmoker
2   Male     No Unmarried   White Unknown   NotHisp   Unknown    Smoker
3   Male     No   Married   White   White OtherHisp OtherHisp NonSmoker
4   Male     No   Married   White   White   Mexican   Mexican NonSmoker
5 Female     No Unmarried   Black Unknown   NotHisp   Unknown NonSmoker
  MomPriorCond BirthDef    DelivComp BirthComp
1         None     None         None      None
2 At Least One     None         None      None
3         None     None At Least One      None
4         None     None At Least One      None
5         None     None         None      None
The names of the character columns can be obtained by wrapping the above statement in the colnames() function:
colnames(Filter(is.character, birth_dat))
[1] "Gender" "Premie" "Marital" "Racemom" "Racedad"
[6] "Hispmom" "Hispdad" "Habit" "MomPriorCond" "BirthDef"
[11] "DelivComp" "BirthComp"
We will only consider the Hispmom variable from our data set to demonstrate methods for summarizing and visualizing a character variable.
First, we’ll save the values from the Hispmom column into a separate variable so we can compute several categorical summaries:
hispanic_mom <- birth_dat$Hispmom
The table() function in R can be used to quickly create frequency tables.
table(hispanic_mom)
hispanic_mom
  Mexican   NotHisp OtherHisp
      216      1697        85
From the above frequency table we observe that 216 moms were Mexican, 1697 were non-Hispanic, and 85 were another Hispanic ethnicity. We can easily convert the frequency table into a table of proportions using prop.table(). The input for prop.table() is a table created using table().
prop.table(table(hispanic_mom))
hispanic_mom
Mexican NotHisp OtherHisp
0.10810811 0.84934935 0.04254254
Now we observe that roughly 10.81% of moms were Mexican, 84.93% were non-Hispanic, and 4.25% were another Hispanic ethnicity. Note that the proportions should add up to 1.
sum(prop.table(table(hispanic_mom)))
[1] 1
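The percentages can be recomputed directly from the counts. The sketch below rebuilds the counts from the frequency table as a named vector (an assumption made so the example runs without birth_dat) and converts the proportions to rounded percentages:

```r
# Counts copied from the frequency table of hispanic_mom shown earlier.
counts <- c(Mexican = 216, NotHisp = 1697, OtherHisp = 85)

# prop.table() divides each count by the total; scale to percent and round.
pct <- round(100 * prop.table(counts), 2)
pct

# The unrounded proportions sum to 1 (up to floating-point error).
all.equal(sum(prop.table(counts)), 1)
```

Rounding is applied only at the end, so the proportions themselves stay exact for any later calculation.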
To plot a single categorical variable we can use barplot(). When dealing with categorical data, the input for barplot() is a table, like the ones we created above:
barplot(table(hispanic_mom))
Instead of the frequency counts, we can plot proportions by passing in a table of proportions.
barplot(prop.table(table(hispanic_mom)),
        main = 'Ethnicity Proportions of Moms',
        col = '#d59cdb')
Two categorical variables
For this example, we consider the following two character variables: Hispdad and Habit. Hispdad records whether the father of the baby was Hispanic; in particular, whether he was Mexican, non-Hispanic, or another Hispanic ethnicity. Habit records whether or not the subject had a smoking habit.
When dealing with two categorical variables we can create a two-way table using table(v1,v2). Below is the frequency table for Habit and Hispdad.
Note: We save the table as a variable so we can use it later.
smoker_hispanic_dad <- table(birth_dat$Habit, birth_dat$Hispdad)
smoker_hispanic_dad
            Mexican NotHisp OtherHisp Unknown
                  0       4         0       2
  NonSmoker     184    1236        78     307
  Smoker          5     117         2      63
From the above table of counts, notice that there were 184 Mexican dads who were non-smokers, 5 Mexican dads who were smokers, 1236 non-Hispanic dads who were non-smokers, and 117 non-Hispanic dads who were smokers. Similar interpretations can be made for the remaining cells. The unlabeled first row appears to hold the records whose Habit value is blank.
We can obtain a table of proportions using prop.table():
prop.table(smoker_hispanic_dad)
                Mexican     NotHisp   OtherHisp     Unknown
            0.000000000 0.002002002 0.000000000 0.001001001
  NonSmoker 0.092092092 0.618618619 0.039039039 0.153653654
  Smoker    0.002502503 0.058558559 0.001001001 0.031531532
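By default prop.table() divides every cell by the grand total. If the share of smokers within each ethnicity is of more interest, the margin argument of prop.table() can be used instead. The sketch below is one possible approach; it rebuilds the counts by hand as tab (a hypothetical name) and leaves out the unlabeled row, both assumptions made so the example runs without birth_dat:

```r
# Counts copied from the two-way table above; the unlabeled row is omitted.
tab <- as.table(rbind(
  NonSmoker = c(Mexican = 184, NotHisp = 1236, OtherHisp = 78, Unknown = 307),
  Smoker    = c(Mexican = 5,   NotHisp = 117,  OtherHisp = 2,  Unknown = 63)
))

# margin = 2 divides each cell by its column total,
# so the proportions within every ethnicity sum to 1.
col_props <- prop.table(tab, margin = 2)
round(col_props, 3)

# Check: each column total is 1.
colSums(col_props)
```

Using margin = 1 instead would divide by the row totals, giving the ethnicity breakdown within each smoking category.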
Now, let’s plot the results of our table using the default barplot() settings:
barplot(smoker_hispanic_dad)
It is difficult to understand the meaning of the black and gray filled sections of the barplot. Although we may have a general understanding that the gray portion represents smokers and the black portion represents non-smokers based on the accompanying table, we should not assume that the reader will automatically make this connection.
We can add a legend by using the argument legend.text=TRUE, and barplot() will use the row names of our table to make the legend. We also add appropriate axis labels to our plot:
barplot(smoker_hispanic_dad,
        legend.text = TRUE,
        xlab = 'Ethnicity',
        ylab = 'Counts')
The above figure shows a stacked bar plot. If we want the bars next to each other, rather than on top of each other, we can use the argument beside=TRUE.
barplot(smoker_hispanic_dad,
        legend.text = TRUE,
        beside = TRUE,
        xlab = 'Ethnicity',
        ylab = 'Counts')
It is evident that the number of non-smokers exceeds that of smokers across all ethnicities. However, we may be able to obtain a more comprehensive understanding of the data by altering the grouping order of the bars. Specifically, we should examine which ethnic group has a higher count for each smoking category.
We can change the order of our table by taking the transpose; that is, we swap the columns and rows. In R, we can transpose any table-like object using the function t():
t(smoker_hispanic_dad)
              NonSmoker Smoker
  Mexican   0       184      5
  NotHisp   4      1236    117
  OtherHisp 0        78      2
  Unknown   2       307     63
From this point of view, we can observe the counts in each smoking habit category for each ethnicity. For example, there were 184 Mexican fathers who were non-smokers and 5 Mexican fathers who were smokers. Similar interpretations can be made for the other ethnic groups.
We can now use barplot() on this new transposed table:
barplot(t(smoker_hispanic_dad),
        legend.text = TRUE,
        beside = TRUE,
        xlab = 'Smoking Habit',
        ylab = 'Counts')
We can clearly see that non-Hispanic fathers make up the highest counts among both non-smokers and smokers. While the default color palette is color-blind friendly, it can be hard to distinguish the categories based on these colors.
With a quick Google search for “four color palettes” you can find great palettes for 4 categories. For example, the following color palette was obtained from colorhunt.co:
barplot(t(smoker_hispanic_dad),
        legend.text = TRUE,
        col = c('#4E6E81', '#F9DBBB', '#FF0303', '#2E3840'),
        beside = TRUE,
        xlab = 'Smoking Habit',
        ylab = 'Counts')
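One optional refinement, sketched here under stated assumptions, is to keep each color tied to its category with a named vector and to position the legend through the args.legend argument of barplot(). The objects tab and pal are hypothetical names; the counts are copied by hand from the transposed table, with the unlabeled column left out, so the example runs without birth_dat:

```r
# Counts copied from the transposed table; the unlabeled column is omitted.
tab <- rbind(
  Mexican   = c(NonSmoker = 184,  Smoker = 5),
  NotHisp   = c(NonSmoker = 1236, Smoker = 117),
  OtherHisp = c(NonSmoker = 78,   Smoker = 2),
  Unknown   = c(NonSmoker = 307,  Smoker = 63)
)

# A named vector makes the color-to-category mapping explicit.
pal <- c(Mexican = '#4E6E81', NotHisp = '#F9DBBB',
         OtherHisp = '#FF0303', Unknown = '#2E3840')

# args.legend passes extra arguments on to legend().
mids <- barplot(tab,
                beside = TRUE,
                col = unname(pal[rownames(tab)]),
                legend.text = TRUE,
                args.legend = list(x = 'topright', title = 'Ethnicity'),
                xlab = 'Smoking Habit',
                ylab = 'Counts')
```

barplot() also returns the bar midpoints (stored here in mids), which can be useful for adding text labels above the bars.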