8. Dataset manipulation: tokenization and joining datasets
Source: vignettes/recipe_8.Rmd
Overview
In this Recipe we will look at two primary types of transformations: tokenization and joins. Tokenization is the process of recasting textual units as smaller textual units (e.g. comments into sentences or words). Joining incorporates other datasets to augment or filter the dataset of interest.
We will first look at a sample dataset to explore the strategies associated with tokenization and joins and then we will put these into practice with a more practical example.
Let’s load the packages that we will use for this Recipe.
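Based on the functions used throughout (e.g. `read_csv()`, the dplyr joins, `unnest_tokens()`), the packages in question are the tidyverse and tidytext; a minimal setup would be:

library(tidyverse) # read_csv(), dplyr verbs and joins, stringr
library(tidytext) # unnest_tokens(), get_sentiments(), get_stopwords()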
Coding strategies
To illustrate the relevant coding strategies I’ve created a curated dataset of the “Big Data Set from RateMyProfessor.com for Professors’ Teaching Evaluation” (He 2020).
Let’s take a look at the curated dataset and get oriented to its structure.
rmp <- read_csv(file = "recipe_8/data/derived/rate_my_professor_sample/rmp_curated.csv") # read curated dataset
glimpse(rmp) # preview structure
## Rows: 10
## Columns: 4
## $ rating_id <dbl> 89, 91, 5770, 3350, 1763, 4918, 2462, 982, 4734, 7903
## $ online <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1
## $ student_star <dbl> 1, 5, 5, 5, 5, 5, 5, 1, 5, 1
## $ comments <chr> "By far the most condescending, mean, disrespectful, and …
We see that there are 10 observations and four columns.
There is a data dictionary associated with the `rmp` curated dataset. Let’s read it and show it in a human-readable format.
read_csv(file = "recipe_8/data/derived/rate_my_professor_sample/rmp_curated_data_dictionary.csv") %>% # read data dictionary
knitr::kable(booktabs = TRUE,
caption = "Rate My Professor curated sample data dictionary.") # show preview table
variable_name | name | description |
---|---|---|
rating_id | Rating ID | Unique ID for each student course rating |
online | Online Course | Was the course online or not: 1 is TRUE and 0 is FALSE |
student_star | Student Rating | Scalar rating provided by the student |
comments | Student Comments | Student comments provided for the course |
Now let’s look at this small curated sample in its current form.
rmp %>% # dataset
knitr::kable(booktabs = TRUE,
caption = "Rate My Professor curated sample preview.") # show dataset preview
rating_id | online | student_star | comments |
---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure |
5770 | 0 | 5 | It's been “awhile” since law school…but, I thought Professor Frey was GREAT…he was my all time favorite law professor! He was always available…and willing to answer questions. He really cares about his students and wants them to succeed. Marty ROCKS!!!!!!!!!!! |
3350 | 0 | 5 | Amazing professor! One of the few professors that has left an everylasting impression. I learned so much in his class. He makes everything interesting! However, the class does require a lot of assignments but you'll do them and not care. He is just that awesome. You have labs where you need to prepare teaches (intimidating) but it's a wonderful exp |
1763 | 0 | 5 | outstanding. she may be tough but i think she helped me figure out why i wanted to teach phys ed.all you have to do is pay attention,speek upin class, pass the test and just show that you care about this major and you will be fine.she is tough but its to help you get ready for the real world |
4918 | 1 | 5 | This was one of the best courses I've ever taken. He puts so much effort into his lectures and assignments, you really learn everything inside and out. You must put in the effort and follow his requirements in the syllabus to do well. Waiting until the last day to do the assignments is not a good idea. Not an easy A, but highly recommend. |
2462 | 1 | 5 | I took PSYCH 337 Comm. & Society w/ Prof. Milburn (online). The material he provides is very interesting and I really enjoyed the class. He is extremely responsive if you ask him anything which is great. The class is very straight-forward and what is expected of you is laid out clearly. The work load is fair, & if you do the work you will do well. |
982 | 1 | 1 | Would not recommend anyone to take his online course. According to Hassan, “As far as I know, this course has never Audio or Vedio lectures because they are not developed by the departmentschool. So, we have to read the text book and then post any questions on any concept or methods here,and I will address it.” TA worked pretty hard. |
4734 | 1 | 5 | Let me start by saying this is not an easy class. You will have tons of homework to do each week. That being said I think almost all Algebra classes have the same work. Mr. G is a good guy and wants people to pass. If you do the homework and study his review you will be fine. May not get an A, but you will pass. I would recommend him |
7903 | 1 | 1 | Never actually explains what he wants. He will fail your assignments and not give clear feedback as to what he actually expects from you. Quiz, 2 journals, and groupwork due each week. Rude and extremely unhelpful. |
From this orientation to the dataset we can see that there are four columns: `rating_id`, `online`, `student_star`, and `comments`. The first three are metadata associated with the text in `comments`. We also see from the `online` column that five of the ratings are for in-person courses and five are for online courses.
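A quick tally (a small check beyond the original walkthrough) confirms this split:

rmp %>% # dataset
  count(online) # tally ratings by course mode (0 = in person, 1 = online)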
Tokenization
The very helpful `unnest_tokens()` function from the tidytext package is the most efficient way to recast a column of text into various smaller textual units, all while maintaining the metadata structure from the curated dataset. In this way, our transformation will maintain a tidy data format.
Let’s consider some of the key options for tokenization that are provided through the `unnest_tokens()` function. First let’s look at the arguments using the `args()` function.
args(unnest_tokens) # view the arguments
## function (tbl, output, input, token = "words", format = c("text",
## "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE,
## collapse = NULL, ...)
## NULL
In order of appearance in the function: the `tbl` argument takes a data frame; `output` is a character vector giving the desired name of the output column after tokenization; `input` is a character vector naming the column that contains the textual information to be tokenized; the `token` argument is where we specify what type of token we would like to generate from the `input` column; the `format` argument is usually left as the default ‘text’, as more often than not we are working with plain text; the `to_lower` argument lets us decide whether to lowercase the text when it is tokenized; the `drop` argument, `TRUE` by default, drops the `input` column from the tokenized dataset; the `collapse` argument allows for grouping the tokenization output and is usually left as `NULL` (the default); and finally we have the `...` argument, which leaves open the possibility of adding arguments that are relevant for some of the token options, specifically ‘ngrams’ and ‘character_shingles’.
Let’s see `unnest_tokens()` in action, starting first with the most common tokenization unit (and therefore the default): ‘words’.
Words
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments") %>% # input column to tokenize
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
We now see from this preview of the first 10 observations that the words from the comments have been tokenized. `unnest_tokens()` returns each of these tokens on its own row and maintains the metadata from the original dataset (dropping the input `comments` column). We also see that the tokens have been lowercased; this is the default behavior.
Let’s change the `drop =` and `to_lower =` arguments from their defaults (`TRUE`).
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments", # input column to tokenize
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | word |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | far |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | the |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | most |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | condescending |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | mean |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | disrespectful |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | and |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | downright |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | rude |
Note that if the textual input has punctuation, the `unnest_tokens()` function will strip this punctuation when tokenizing into words.
Sentences
If we specify that the tokenized unit is `sentences`, then the punctuation is not stripped.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "sentences", # tokenize to sentences
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff. |
If we take a close look at the output of the sentence tokens in this case, we see that some rows contain multiple sentences in the same observation. This appears to be due to the fact that students sometimes opted not to capitalize the beginning of the next sentence. This suggests that the algorithm `unnest_tokens()` uses relies on sentence punctuation followed by a capitalized word to segment/tokenize sentences.
It is important to review the output of the tokenization to catch these types of anomalies and not assume that the algorithm will be perfectly accurate.
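One way to surface these anomalies (a diagnostic sketch, not part of the original Recipe) is to flag tokenized ‘sentences’ that still contain internal sentence-final punctuation:

rmp %>% # dataset
  unnest_tokens(output = "sentence", # tokenized output column
                input = "comments", # input column to tokenize
                token = "sentences", # tokenize to sentences
                to_lower = FALSE) %>% # do not lowercase
  filter(str_detect(sentence, "[.!?]\\s+\\S")) # punctuation followed by more text suggests a missed split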
If the tokenization defaults (`words`, `sentences`, etc.) do not produce the desired result, we can set the `token =` argument to `"regex"`. This allows us to specify a regular expression pattern to do the tokenization in the added argument `pattern =`.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "[.!?]\\s",
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff |
Note that when the pattern used to segment the text is matched, the match itself is removed. We can use some regular expression magic with the ‘positive lookbehind’ operator `(?<=...)` to require a pattern without consuming it as part of the match. If we apply this to the punctuation part of our original regex, we can preserve the sentence punctuation and still segment the sentences.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "(?<=[.!?])\\s",
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff. |
Ngrams
Now let’s turn to `ngrams` tokenization. An ngram is a sequence of words, where \(n\) is the length of the sequence desired in the output. Word tokenization is sometimes called unigram tokenization. To get ngrams larger than one word, we set `token =` to `"ngrams"`. Then we need to add the argument `n =` and set the number of words in the sequences we want to tokenize: `n = 2` produces bigrams, `n = 3` trigrams, and so on.
So let’s see this in action by creating bigrams.
rmp %>% # dataset
unnest_tokens(output = "bigram", # tokenized output column
input = "comments", # input column to tokenize
token = "ngrams", # tokenize ngram sequences
n = 2, # two word sequences
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | bigram |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | far the |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | the most |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | most condescending |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | condescending mean |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | mean disrespectful |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | disrespectful and |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | and downright |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | downright rude |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | rude professor |
Great. We now have two-word sequences (bigrams) as our tokens. But if we look at the full output we see that the bigram tokenization includes sequences that span sentence boundaries (e.g. ‘teacher wouldnt’). This is due to the fact that we used the original input (`comments`), which contains all the text. In some cases we may not want to capture these cross-sentential word sequences. To avoid this we can first tokenize our `comments` by sentences (with the regular expression approach), then pass this result to our bigram tokenization.
rmp %>% # dataset
# Tokenize by sentences
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "(?<=[.!?])\\s",
to_lower = FALSE) %>% # do not lowercase
# Add a sentence_id to the dataset
group_by(rating_id) %>% # group the comments
mutate(sentence_id = row_number()) %>% # add a sentence id to index the individual sentences for each comment
ungroup() %>% # remove grouping attribute
# Tokenize the sentences by bigrams
unnest_tokens(output = "bigram", # tokenized output column
input = "sentence", # input column to tokenize
token = "ngrams", # tokenize by ngrams
n = 2, # create bigrams
to_lower = FALSE) %>% # do not lowercase
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | sentence_id | bigram |
---|---|---|---|---|
89 | 0 | 1 | 1 | By far |
89 | 0 | 1 | 1 | far the |
89 | 0 | 1 | 1 | the most |
89 | 0 | 1 | 1 | most condescending |
89 | 0 | 1 | 1 | condescending mean |
89 | 0 | 1 | 1 | mean disrespectful |
89 | 0 | 1 | 1 | disrespectful and |
89 | 0 | 1 | 1 | and downright |
89 | 0 | 1 | 1 | downright rude |
89 | 0 | 1 | 1 | rude professor |
So by applying first the sentence tokenization and then the ngram tokenization, we avoid cross-sentential word sequences.
Note that I added a `sentence_id` column to make sure that the sentence from which each bigram comes is documented in the dataset.
With this overview of the options and strategies for tokenizing textual input, I will now create a word-based tokenization of the `rmp` dataset, lowercasing the text in preparation for the next strategy to cover: joins.
rmp_words <-
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments") # input column to tokenize
rmp_words %>%
slice_head(n = 10) %>%
knitr::kable(booktabs = TRUE,
caption = "Preview of the `rmp_words` dataset.")
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
Joining datasets
The dplyr package, loaded as part of the tidyverse, contains a number of functions aimed at joining datasets. These functions are of two main types: mutating joins and filtering joins.
In both cases, a join relates two datasets that share a column (or columns) with overlapping values. For mutating joins, the shared column(s) serve as the key that connects the two datasets, effectively widening the result by adding the columns from the second dataset wherever the key values match. For filtering joins, the shared column is used to filter the rows of one dataset based on whether their values appear in the other; a filtering join may keep only the matching rows or exclude them. Let’s look at these two types of joins to get a better sense of their behavior.
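Before applying these joins to our data, a toy example with made-up values may help clarify the mechanics (`x` and `y` here are hypothetical):

x <- tibble(id = 1:3, word = c("good", "bad", "okay")) # dataset of interest
y <- tibble(word = c("good", "bad"), # lookup dataset sharing the `word` column
            sentiment = c("positive", "negative"))

left_join(x, y, by = "word") # mutating: adds `sentiment`, keeps all rows of x
semi_join(x, y, by = "word") # filtering: keeps only rows of x with a match in y
anti_join(x, y, by = "word") # filtering: keeps only rows of x without a match in y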
Mutating joins
As a demonstration, let’s consider a dataset included in the tidytext package which provides a list of words and a sentiment value for each word.
get_sentiments() %>%
group_by(sentiment) %>%
slice_head(n = 5)
word | sentiment |
---|---|
2-faces | negative |
abnormal | negative |
abolish | negative |
abominable | negative |
abominably | negative |
abound | positive |
abounds | positive |
abundance | positive |
abundant | positive |
accessable | positive |
We can see that the `get_sentiments()` function returns a dataset with two columns (`word` and `sentiment`). I’ve only provided the first five word-sentiment pairs for the ‘negative’ and ‘positive’ sentiments; however, the full dataset contains 6786 words.
We can see how many are listed as positive and negative.
get_sentiments() %>%
count(sentiment)
sentiment | n |
---|---|
negative | 4781 |
positive | 2005 |
We can see that negative words outnumber the positive-labeled words.
With this information, we can now see that our `rmp_words` dataset and the dataset from `get_sentiments()` share a column called `word`. More importantly, the columns share the same type of values, i.e. words. If we want to augment our `rmp_words` dataset with the sentiment labels from `get_sentiments()`, we will want to use a mutating join. The idea will be to create a data frame with the following structure:
tribble(
~rating_id, ~online, ~student_star, ~word, ~sentiment,
84, 0, 5, "good", "positive",
84, 0, 5, "teacher", NA,
2802, 1, 1, "worst", "negative",
NA, NA, NA, "...", "..."
)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
84 | 0 | 5 | good | positive |
84 | 0 | 5 | teacher | NA |
2802 | 1 | 1 | worst | negative |
NA | NA | NA | … | … |
In this structure we want all of the observations (words) from `rmp_words` to appear, and those words with matches in `get_sentiments()` should also get a corresponding sentiment value. To do this we use the `left_join()` function. This function takes two primary arguments, `x` and `y`, where `x` is the dataset whose observations are all to be included and `y` is the dataset whose values are added wherever they match. Note that since we do not supply a `by =` argument below, the join detects the shared `word` column automatically and reports it in a message.
left_join(rmp_words, get_sentiments()) %>%
slice_head(n = 10)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
89 | 0 | 1 | by | NA |
89 | 0 | 1 | far | NA |
89 | 0 | 1 | the | NA |
89 | 0 | 1 | most | NA |
89 | 0 | 1 | condescending | negative |
89 | 0 | 1 | mean | NA |
89 | 0 | 1 | disrespectful | negative |
89 | 0 | 1 | and | NA |
89 | 0 | 1 | downright | NA |
89 | 0 | 1 | rude | negative |
Note that `left_join()` keeps all of the rows from the `x` dataset, in this case `rmp_words`. If, for example, we wanted to do a mutating join and remove words from `x` that do not have a match in `y`, then we can turn to `inner_join()`.
inner_join(rmp_words, get_sentiments()) %>%
slice_head(n = 10)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
89 | 0 | 1 | condescending | negative |
89 | 0 | 1 | disrespectful | negative |
89 | 0 | 1 | rude | negative |
89 | 0 | 1 | worst | negative |
89 | 0 | 1 | unclear | negative |
89 | 0 | 1 | ridiculous | negative |
89 | 0 | 1 | favor | positive |
89 | 0 | 1 | regret | negative |
91 | 0 | 5 | best | positive |
91 | 0 | 5 | incredibly | positive |
`inner_join()` is in essence a mutating join with a filtering side effect. If we want to simply filter a dataset based on the values in another dataset, we turn to the filtering joins.
Filtering joins
To look at filtering joins, let’s consider another dataset also included with the tidytext package, accessed with `get_stopwords()`.
get_stopwords() %>%
slice_head(n = 10)
word | lexicon |
---|---|
i | snowball |
me | snowball |
my | snowball |
myself | snowball |
we | snowball |
our | snowball |
ours | snowball |
ourselves | snowball |
you | snowball |
your | snowball |
Stopwords are words that are considered to have little semantic content (they roughly correspond to pronouns, prepositions, conjunctions, etc.). In some research cases we will want to remove these words from a dataset. To remove these words we can use the filtering join called `anti_join()`, which, as you can imagine, returns all the rows in `x` that do not have a match in `y`.
anti_join(rmp_words, get_stopwords()) %>%
slice_head(n = 10)
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | far |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
89 | 0 | 1 | professor |
89 | 0 | 1 | ever |
89 | 0 | 1 | classes |
89 | 0 | 1 | taken |
We see now that the stopwords have been removed from the `rmp_words` dataset.
Now if we want the inverse operation, keeping only the stopwords in `rmp_words`, we can use the `semi_join()` function.
semi_join(rmp_words, get_stopwords()) %>%
slice_head(n = 10)
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | and |
89 | 0 | 1 | of |
89 | 0 | 1 | all |
89 | 0 | 1 | the |
89 | 0 | 1 | i |
89 | 0 | 1 | have |
89 | 0 | 1 | in |
One last case worth including here is a filtering join against a character vector rather than a data frame. Combined with `filter()`, the `%in%` operator can act like a `semi_join()`, keeping the matching values in `x`, or, when negated, like an `anti_join()`, removing the matching values from `x`.
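For example, to keep only the rows of `rmp_words` whose `word` value appears in a character vector, we filter with `%in%` (a sketch reconstructing the chunk that produces the output below, using the words ‘very’ and ‘teacher’):

rmp_words %>% # dataset
  filter(word %in% c("very", "teacher")) # keep rows with matching words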
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | very |
91 | 0 | 5 | teacher |
2462 | 1 | 5 | very |
2462 | 1 | 5 | very |
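And negating the condition removes those values instead:

rmp_words %>% # dataset
  filter(!word %in% c("very", "teacher")) %>% # remove rows with matching words
  slice_head(n = 10) # preview first 10 observations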
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
Note that in all filtering joins, no new columns are added, only rows are affected.
Case
Let’s now turn to a practical case and see tokenization and joins in action, using the Love On The Spectrum curated dataset that we have worked with previously.
# Read the curated dataset for Love on the Spectrum Season 1
lots <- read_csv(file = "recipe_8/data/derived/love_on_the_spectrum/lots_curated.csv")
glimpse(lots)
## Rows: 3
## Columns: 4
## $ series <chr> "Love On The Spectrum", "Love On The Spectrum", "Love On The …
## $ season <chr> "01", "01", "01"
## $ episode <chr> "01", "02", "03"
## $ dialogue <chr> "It'll be like a fairy tale. A natural high, I suppose? Gets …
The aim will be to tokenize the dataset by words and then join it with an imported dataset of word frequencies calculated on a corpus of TV/film transcripts, the SUBTLEXus word frequency list. I’ll read in this dataset and clean up the columns so that we keep only those relevant to our transformational goals.
word_frequencies <-
read_tsv(file = "recipe_8/data/original/word_frequency_list/SUBTLEXus.tsv")
word_frequencies <-
word_frequencies %>% # dataset
select(word, word_freq = SUBTLWF) # select columns
word_frequencies %>%
slice_head(n = 10)
word | word_freq |
---|---|
the | 29449 |
to | 22678 |
a | 20415 |
you | 41857 |
and | 13388 |
it | 18896 |
s | 20731 |
of | 11577 |
for | 6895 |
I | 39971 |
The result of this transformation aims to produce the following dataset structure:
tribble(
~series, ~season, ~episode, ~word, ~word_freq,
"Love On The Spectrum", "01", "01", "it", 18896
)
series | season | episode | word | word_freq |
---|---|---|---|---|
Love On The Spectrum | 01 | 01 | it | 18896 |
Tokenize
The first step is to tokenize the `lots` curated dataset into word tokens.
# Tokenize dialogue into words
lots_words <-
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "words") # tokenized unit
lots_words %>%
slice_head(n = 10)
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it’ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high |
Love On The Spectrum | 01 | 01 | i |
One thing I notice from this preview is that words like “it’ll” are treated as one token, not two (i.e. ‘it’ and ‘ll’). Let’s use `%in%` to filter (i.e. search) the `word_frequencies` dataset to see how words like “it’ll” are listed.
word_frequencies %>% # dataset
filter(word %in% c("it'll", "it", "ll")) # search for it'll, it, and ll
word | word_freq |
---|---|
it | 18896 |
ll | 4394 |
It appears that ‘it’ and ‘ll’ are treated as separate words. Therefore we want to make sure that our tokenization of the `lots` dataset reflects this too. Our original tokenization using the default `token = "words"` did not do this, so let’s create a regular expression that does.
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
slice_head(n = 10) # preview
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it |
Love On The Spectrum | 01 | 01 | ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale. |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high, |
This works, but there is a side effect: the punctuation has not been stripped. To get rid of the punctuation we can normalize the `word` column, removing punctuation.
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
mutate(word = str_remove(word, pattern = "[:punct:]")) %>% # remove punctuation
slice_head(n = 10) # preview
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it |
Love On The Spectrum | 01 | 01 | ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high |
This appears to look good. Note that `str_remove()` removes only the first punctuation match in each token; if tokens could carry more than one punctuation mark, `str_remove_all()` would be the safer choice. Let’s now assign this output to an object so we can move on to joining this dataset with the `word_frequencies` dataset.
lots_words <-
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
mutate(word = str_remove(word, pattern = "[:punct:]")) # remove punctuation
Join
Now it is time to join `lots_words` and `word_frequencies`, keeping all the observations in `x` and adding the `word_freq` column for words that match in `x` and `y`. So we will turn to the `left_join()` function.
left_join(lots_words, word_frequencies) %>%
slice_head(n = 10)
series | season | episode | word | word_freq |
---|---|---|---|---|
Love On The Spectrum | 01 | 01 | it | 18896.3 |
Love On The Spectrum | 01 | 01 | ll | 4394.1 |
Love On The Spectrum | 01 | 01 | be | 5746.8 |
Love On The Spectrum | 01 | 01 | like | 3999.0 |
Love On The Spectrum | 01 | 01 | a | 20415.3 |
Love On The Spectrum | 01 | 01 | fairy | 16.7 |
Love On The Spectrum | 01 | 01 | tale | 12.0 |
Love On The Spectrum | 01 | 01 | a | 20415.3 |
Love On The Spectrum | 01 | 01 | natural | 42.4 |
Love On The Spectrum | 01 | 01 | high | 195.0 |
This looks good, so let’s assign this operation to a new object, `lots_words_freq`.
lots_words_freq <- left_join(lots_words, word_frequencies)
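With the frequencies attached, downstream summaries are straightforward. For example, a quick sketch (a step beyond this Recipe) of the mean corpus frequency of the words in each episode:

lots_words_freq %>% # dataset
  group_by(season, episode) %>% # group by episode
  summarize(mean_word_freq = mean(word_freq, na.rm = TRUE), # average SUBTLEXus frequency
            .groups = "drop") # drop the grouping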
Document
The final step in the process is to write the transformed dataset to disk and document it with a data dictionary.
write_csv(lots_words_freq, file = "recipe_8/data/derived/love_on_the_spectrum/lots_words_freq.csv")
Using our `data_dic_starter()` function, we can create the data dictionary template that we can then open in a spreadsheet and document.
data_dic_starter(data = lots_words_freq, file_path = "recipe_8/data/derived/love_on_the_spectrum/lots_words_freq_data_dictionary.csv")
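For reference, `data_dic_starter()` was defined in an earlier Recipe. A minimal sketch of what such a function can look like (an approximation matching the dictionary structure above, not the original definition):

data_dic_starter <- function(data, file_path) {
  tibble(variable_name = names(data), # one row per column in the dataset
         name = "", # human-readable name, to be filled in manually
         description = "") %>% # prose description, to be filled in manually
    write_csv(file = file_path) # write the template to disk
}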