4. Descriptive assessment of datasets • Text as Data Resources

Overview

In this Recipe we will build on our R coding skills to implement descriptive assessment of datasets.

Setup

Packages

First let’s load some packages that will help us manipulate and create summaries of data.

library(tidyverse)  # to manipulate and visualize datasets
library(janitor)  # to create (cross-)tabulations
library(skimr)  # to provide statistical summaries

Dataset

We will be using a dataset drawn from the Barcelona English Language Corpus (Carmen Muñoz 2007; Carme Muñoz 2006). The corpus includes a series of tasks that aim to provide data on the effects of age on the acquisition of English as a foreign language (EFL). The task selected here is the “Written composition” which comprises written essays by students from Barcelona ranging from 10 to 17 years of age.

Let’s read the curated dataset¹.

# read the `belc` dataset from an .rds file
belc <- read_rds(file = "data/rds/belc.rds")

Informational values

In R there is a set of fundamental vector types which are used for distinct informational value types and these vector types are associated with either categorical- (character and logical) and continuous-types (integer and double). A couple things to note: first, character and logical vectors can be re-typed as a more complex vector type called a factor. Factors allow us to encode order to the character vectors (say for ordinal variables) and assign numeric values to the character vector to allow for us to do mathematical operations. We will how factors work later. Second, both integer and doubles are also called numeric vectors. The difference between integer and doubles is that doubles allow for decimal places, whereas integers are whole numbers.

The belc dataset reflects a summary of each of the essays by each of the students and includes the number of word tokens (num_tokens), the number of word types (num_types) and the ratio of tokens to types (ttr). The structure of the dataset (belc) is seen below.

## Rows: 79
## Columns: 6
## $ participant_id <chr> "L02", "L05", "L10", "L11", "L12", "L16", "L22", "L27",…
## $ age_group      <chr> "10-year-olds", "10-year-olds", "10-year-olds", "10-yea…
## $ sex            <chr> "female", "female", "female", "female", "female", "fema…
## $ num_tokens     <int> 12, 18, 36, 10, 41, 13, 47, 8, 84, 53, 18, 16, 7, 24, 3…
## $ num_types      <int> 12, 15, 26, 8, 23, 12, 30, 8, 34, 34, 13, 9, 5, 18, 23,…
## $ ttr            <dbl> 1.000, 0.833, 0.722, 0.800, 0.561, 0.923, 0.638, 1.000,…

As we see from the overview of the belc dataset, we have character, integer and double vector types in the data frame. Therefore this dataset contains three categorical (participant_id, age_group, and sex) and three continuous variables (num_tokens, num_types, and ttr).

Summaries

Now let’s look at approaching descriptive summaries using the belc dataset. First we will look at single vector summaries and then we will look at multiple vector summaries. The type of summary that we will apply will depend on the type of values that a variable contains, that is whether the values are categorical or continuous.

To prepare to work with the categorical data, let’s re-type the categorical variables (participant_id, age_group and sex) as factors. To apply the as.factor() function to all the character vectors I make use of the mutate_if() function which allows me to target only the vectors in the dataset that are of type character. Since the age_group variable is ordinal we will explicitly encode that order. Again, the mutate() function is called but only targets the age_group vector.

belc <- 
  belc %>% 
  mutate_if(is.character, as.factor) %>% # create factors from character variables
  mutate(age_group = factor(age_group, ordered = TRUE)) # create ordered variable for age_group

Now let’s take a look at the dataset.

glimpse(belc)  # dataset overview

## Rows: 79
## Columns: 6
## $ participant_id <fct> L02, L05, L10, L11, L12, L16, L22, L27, L28, L29, L41, …
## $ age_group      <ord> 10-year-olds, 10-year-olds, 10-year-olds, 10-year-olds,…
## $ sex            <fct> female, female, female, female, female, female, female,…
## $ num_tokens     <int> 12, 18, 36, 10, 41, 13, 47, 8, 84, 53, 18, 16, 7, 24, 3…
## $ num_types      <int> 12, 15, 26, 8, 23, 12, 30, 8, 34, 34, 13, 9, 5, 18, 23,…
## $ ttr            <dbl> 1.000, 0.833, 0.722, 0.800, 0.561, 0.923, 0.638, 1.000,…

Now we have factors instead of character vectors for our categorical variables.

Single vector

Categorical

To get a descriptive summary of the categorical variables I will use the skimr package (Waring et al. 2022) and call the skim() function and then pipe these results to yank() and target only the factor variables.

belc %>%
    skim() %>%
    yank("factor")

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
participant_id	1	FALSE	36	L05: 3, L10: 3, L11: 3, L12: 3
age_group	1	TRUE	4	10-: 24, 16-: 24, 12-: 16, 17-: 15
sex	1	FALSE	2	fem: 48, mal: 31

From this output we see a host of descriptive information about our categorical variables. What is important to note is that categorical variables are summarized by counts. The most common value for a categorical variable is called the mode. If we want to look at a specific variable, say age_group we can use the tabyl() function from the janitor package (Firke 2021).

belc %>%
    tabyl(age_group)  # create a tabulation of the `age_group` variable

age_group	n	percent
10-year-olds	24	0.304
12-year-olds	16	0.203
16-year-olds	24	0.304
17-year-olds	15	0.190

We see that the tabyl() function provides the counts but also the proportions of each of the values of the age_group variable. Note that when working with factors the values are often called ‘levels’. We can see that this variable has two levels with the same count. This characterizes what is called a bimodal distribution as there are two most frequent levels.

Tabular summaries are often the most effective way to assess categorical variables, but let’s set the stage for working with plotting in R.

Among other package, the tidyverse package load the ggplot2 package (Wickham et al. 2021) which is a powerful package for creating plots in R. The ‘gg’ in ggplot2 refers to the use of “The Grammar of Graphics” approach to building plots. There are three basic elements for all ggplot2 plots: (1) data, (2) mappings, and (3) geometries. The data is, of course, the dataset that we want to use. The mappings used to select the variables to be used in the plot and how the variables are to be mapped to the plotting space. Finally the geometries specify how the mappings are organized.²

Let’s create a simple plot for the age_group variable. First we pass the datset to ggplot(). The ggplot() function then requires that we provide the aesthetic mappings aes() in this case we have one variable and so we will want this variable to appear on the x axis. The geom_bar() function by default will then count the levels of our age_group factor variable and plot them on the y axis.

belc %>% # dataset
  ggplot(aes(x = age_group)) + # map age_group to x
  geom_bar() # create a barplot

There we go. Not a particularly informative plot, given we are only looking at a single categorical variable, but we will build on this basic formula to creat more informative graphics.

Note that the equivalent of the %>% for building ggplot2 plots is the + operator. This can be confusing, but it is important to recognize this distinction as it is easily overlooked and can cause unexpected errors.

Continuous

Now let’s turn to continuous variables. Where tabular summaries of categorical variables makes sense, this is not the case for continuous variables as by definition a continuous variable is not count-based but rather the values range along a continuum. Let’s look at what the basic descriptives are for out continuous variables with skim() (this time selecting only the numeric variables).

belc %>%
    skim() %>%
    yank("numeric")

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num_tokens	1	66.23	43.90	1.00	29.00	55.00	90.00	185	▇▇▇▃▁
num_types	1	40.25	22.80	1.00	22.00	38.00	54.00	97	▆▇▇▃▂
ttr	1	0.67	0.13	0.41	0.57	0.64	0.73	1	▂▇▅▂▂

Here we see that the type of summary information is not count-based, rather we have a new set of descriptives. The mean and sd (standard deviation) are easy to identify and straightforward. The summaries prefixed with p represent the percentiles, in this case we have five percentile points (0, 25, 50, 75, 100), which slice the percentile space in four ranges therefore we call these the quartiles. These values are often called the ‘five-number summary’. The 50th percentile is also known as the median. The five number summary provides a numerical view of the distribution of a continuous variable. Another use of these quartiles. is to calculate the range between the 25th and 75th quantile, (50% of the values), known as the Interquartile Range (IQR). This gives us a more precise estimate of the distribution as it does not include extreme values (above or below the IQR).

We can calculate this measure manually, or just apply the IQR() function.

IQR(belc$num_types)  # calculate iqr for `num_types`

## [1] 32

Where with categorical variables tabular formats are often the most informative way to understand a variable, for continuous variables plots are the most informative. Let’s now create some plots which provide views of the distribution of the num_types variable.

Picking up with the quantiles, we can create an Empirical Cumulative Distribution Frequency (ECDF) plot which will give us an understanding of the proportions of the values along a continuous percentile range.

belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  stat_ecdf(geom = "step") # generate the cumulative distribution

Here we can graphically inspect the points that intersect on the x and y axis to estimate the percentile of the values that have some number of unique words. We can also get a specific determination by using the following functions. Let’s say we want to know how many number of unique words are in the lower 10% of the written compositions from BELC.

ecdf(belc$num_tokens) %>% # calculate the cumulative distribution
  quantile(.1) # pull the value for the 10th percentile

##  10% 
## 15.4

Now let’s move towards looking at distributions. This is done by creating either a histogram or a density plot. Let’s plot both here. I will assign each plot to an object and then have them output side-by-side using the gridExtra package (Auguie 2017)

p1 <- 
  belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  geom_histogram() # create histogram

p2 <- 
  belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  geom_density() # create density plot

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns

As you can see, a histogram also provides count information, like we saw with categorical variables. However, the counts here are based on binned groups, that is, a range of values are calculated that span the entire value space and then values that fall within one of these bins (ranges) are counted. The size of the bin range can be adjusted, but we’ve just gone with the default (in this case bins = 30). The density plot uses proportions to provide an more continuous view of the distribution. Both plot types have their advantages. In the case of histograms it can be easier to identify outliers while density plots can help us determine more easily if the distribution is normal or skewed (left or right). As we will see, identifying outliers and determining the type of distribution we are working with will be useful downstream in certain types of analysis approaches.

On the topic of normal distributions, let’s look at a useful plot for assessing the extent to which a continuous variable is normally distributed –the Quantile-Quantile plots (QQ Plot).

belc %>% # dataset
  ggplot(aes(sample = num_types)) + # map `num_types` to sample
  stat_qq() + # calculate the sample and theoretical quantiles points
  stat_qq_line() # plot the theoretical line

In QQ-plots the more the points diverge from the line, the less likely that the distribution is normal.

Multiple vector

Now we turn our attention to working with multiple variables. We will first look at variables of the same type, and then look at describing variables of distinct types.

Categorical

Just was with descriptions of single categorical variables, tabular summaries are very useful. When there are multiple categorical variables, we cross-tabulate. That is, that each of the values of one variable are tabulated for the each value of the other variable(s).

Let’s do a cross-tabulation of the age_group and sex variables. Again we will use the tabyl() function but this time with two variables.

belc %>% # dataset
  tabyl(age_group, sex) # cross-tab of `age_group` and `sex`

age_group	female	male
10-year-olds	14	10
12-year-olds	11	5
16-year-olds	13	11
17-year-olds	10	5

We can also add proportions to this cross-tabulation by adding the adorn_percentages() function. I’ve also rounded the output with adorn_rounding(). Both these functions are part of the janitor package.

belc %>% # dataset
  tabyl(age_group, sex) %>%  # cross-tab of `age_group` and `sex`
  adorn_percentages() %>% # add percentages (row by default)
  adorn_rounding(2) # round the output

age_group	female	male
10-year-olds	0.58	0.42
12-year-olds	0.69	0.31
16-year-olds	0.54	0.46
17-year-olds	0.67	0.33

The order of the variables in the tabyl() function can allow you rotate the output. This may be desirable depending on the number of levels in a particular categorical variable.

belc %>% # dataset
  tabyl(sex, age_group) %>%  # cross-tab of `age_group` and `sex`
  adorn_percentages() %>% # add percentages (row by default)
  adorn_rounding(2) # round the output

sex	10-year-olds	12-year-olds	16-year-olds	17-year-olds
female	0.29	0.23	0.27	0.21
male	0.32	0.16	0.35	0.16

Now we can visualize this relationship in a bar plot as well. Let’s create two bar plots, in fact. One for the counts and the second for proportions.

p1 <- 
  belc %>% # dataset
  ggplot(aes(x = sex, fill = age_group)) + # map sex to x and age_group to y
  geom_bar() + # generate bar plot with counts
  labs(y = "Count") # add labels

p2 <- 
  belc %>% # dataset
  ggplot(aes(x = sex, fill = age_group)) + # map sex to x and age_group to y
  geom_bar(position = "fill") + # generate bar plot with proportions
  labs(y = "Proportion") # add labels

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns

The proportions provide a apples-to-apples comparison allowing us to see relative sizes of the age group levels for each level of sex.

I’ve added another function labs() to the plot to change the y axis label to a custom label. With the labs() function you can change the labels of any of axis as well as other mapping aesthetics as well as the title.

Continuous

For summaries of continuous variables we can generate correlation statistics as well as visualize relationships. Let’s start with building a plot to visualize the relationship between num_tokens and ttr. To plot points where continuous variables coincide we use the geom_point() function. If we want to include a trend line we use the geom_smooth() function. If we want that trend line to be linear, then the argument method = "lm" is included.

p1 <- 
  belc %>% # dataset
  ggplot(aes(x = num_tokens, y = ttr)) + # map num_tokens to x, ttr to y
  geom_point() + # plot x/y points
  labs(x = "Number of tokens", y = "Type-Token Ratio")
p2 <- 
  belc %>% # dataset
  ggplot(aes(x = num_tokens, y = ttr)) + # map num_tokens to x, ttr to y
  geom_point() + # plot x/y points
  geom_smooth(method = "lm") + # add a linear trend line
  labs(x = "Number of tokens", y = "Type-Token Ratio")

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns

To calculate a statistical summary of a relationship between two continuous variables (correlation) we can use the cor() function from base R’s stats package. We select the variables we want to explore and assign them to x and y and then select the appropriata method for the correlation assessment. For normally distributed continuous variables, we set method = to ‘pearson’ and for non-normal distributions ‘kendall’.

cor(x = belc$num_tokens, y = belc$ttr, method = "kendall")  # correlation stat

## [1] -0.563

Correlation statistics range from 1 to -1. The closer to either of these values means that the relationship is strong. A value closer to 0 means the correlation is weak and if it is 0 or near 0 there is no correlation.

Remember that to determine if a continuous variable conforms to the normal distribution we can apply the Shapiro-Wilk Normality Test using the shapiro.test() function. A significant \(p\)-value means that the distribution is not normal.

Mixed

The final scenario is one in which we are interested in assessing the relationship between a categorical variable and a continuous variable. We can perform a summary using group_by() and the summarise() functions. First we group the dataset by the categorical variable and then we create a new variable which is the result of the summary statistic that we want to calculate. In this case, let’s look at the mean number of tokens by each level of the learner age group.

belc %>% # dataset
  group_by(age_group) %>% # group dataset by age_group
  summarise(mean_num_tokens = mean(num_tokens)) # calculate the mean num_tokens

age_group	mean_num_tokens
10-year-olds	27.8
12-year-olds	57.4
16-year-olds	81.7
17-year-olds	112.4

The statistic(s) that we want to calculate are up to us and we can create multiple statistics by adding other functions to the summarise() function.

belc %>% # dataset
  group_by(age_group) %>% # group dataset by age_group
  summarise(mean_num_tokens = mean(num_tokens), # calculate mean
            sd_num_tokens = sd(num_tokens), # calculate standard deviation 
            median_num_tokens = median(num_tokens), # calculate median
            iqr_num_tokens = IQR(num_tokens)) # calculate the interquartile range score

age_group	mean_num_tokens	sd_num_tokens	median_num_tokens	iqr_num_tokens
10-year-olds	27.8	18.1	23.5	21.8
12-year-olds	57.4	28.5	55.5	38.0
16-year-olds	81.7	30.1	84.5	29.2
17-year-olds	112.4	50.3	119.0	77.0

The numeric summaries are helpful for reporting, but a visual can be much easier to interpret. To assess a categorical variable and a continuous variable we turn to box plots.

belc %>% # dataset
  ggplot(aes(x = age_group, y = num_tokens)) + # map age_group to x and num_tokens to y
  geom_boxplot() # create box plot

Summary

In this recipe we covered various common strategies for descriptively assessing variables in a dataset. We worked with single variables of both categorical and continuous types discussing the relationship between R’s vector types and informational values as well as looking a descriptive stats and visualizations. We also looked at strategies for assessing multiple variables either of the same type or mixed.

References

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.

Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://github.com/sfirke/janitor.

Muñoz, Carme. 2006. Age and the Rate of Foreign Language Learning. Clevedon: Multilingual Matters.

Muñoz, Carmen. 2007. “SLABank Catalan-English Barcelona Corpus.” TalkBank. https://doi.org/10.21415/T5S89C.

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

To inspect the process by which this dataset was created, you can look at the source file for this recipe.↩︎
For reference visit the the ggplot2 website ↩︎