8. Dataset manipulation: tokenization and joining datasets
Source: vignettes/recipe_8.Rmd
Overview
In this Recipe we will look at two primary types of transformations: tokenization and joins. Tokenization is the process of recasting textual units as smaller textual units (e.g. comments into sentences or words). Joining incorporates other datasets to augment or filter the dataset of interest.
We will first look at a sample dataset to explore the strategies associated with tokenization and joins and then we will put these into practice with a more practical example.
Let’s load the packages that we will use for this Recipe.
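Based on the functions used throughout (e.g. `read_csv()`, the dplyr joins, `unnest_tokens()`), the packages in question are the tidyverse and tidytext; a minimal setup would be:

library(tidyverse) # read_csv(), dplyr verbs and joins, stringr
library(tidytext) # unnest_tokens(), get_sentiments(), get_stopwords()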
Coding strategies
To illustrate the relevant coding strategies I’ve created a curated dataset of the “Big Data Set from RateMyProfessor.com for Professors’ Teaching Evaluation” (He 2020).
Let’s take a look at the curated dataset and get oriented to its structure.
rmp <- read_csv(file = "recipe_8/data/derived/rate_my_professor_sample/rmp_curated.csv") # read curated dataset
glimpse(rmp) # preview structure
## Rows: 10
## Columns: 4
## $ rating_id <dbl> 89, 91, 5770, 3350, 1763, 4918, 2462, 982, 4734, 7903
## $ online <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1
## $ student_star <dbl> 1, 5, 5, 5, 5, 5, 5, 1, 5, 1
## $ comments <chr> "By far the most condescending, mean, disrespectful, and …
We see that there are 10 observations and four columns.
There is a data dictionary associated with the `rmp` curated dataset. Let’s read it and show it in a human-readable format.
read_csv(file = "recipe_8/data/derived/rate_my_professor_sample/rmp_curated_data_dictionary.csv") %>% # read data dictionary
knitr::kable(booktabs = TRUE,
caption = "Rate My Professor curated sample data dictionary.") # show preview table
variable_name | name | description |
---|---|---|
rating_id | Rating ID | Unique ID for each student course rating |
online | Online Course | Was the course online or not: 1 is TRUE and 0 is FALSE |
student_star | Student Rating | Scalar rating provided by the student |
comments | Student Comments | Student comments provided for the course |
Now let’s look at this small curated sample in its current form.
rmp %>% # dataset
knitr::kable(booktabs = TRUE,
caption = "Rate My Professor curated sample preview.") # show dataset preview
rating_id | online | student_star | comments |
---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure |
5770 | 0 | 5 | It's been “awhile” since law school…but, I thought Professor Frey was GREAT…he was my all time favorite law professor! He was always available…and willing to answer questions. He really cares about his students and wants them to succeed. Marty ROCKS!!!!!!!!!!! |
3350 | 0 | 5 | Amazing professor! One of the few professors that has left an everylasting impression. I learned so much in his class. He makes everything interesting! However, the class does require a lot of assignments but you'll do them and not care. He is just that awesome. You have labs where you need to prepare teaches (intimidating) but it's a wonderful exp |
1763 | 0 | 5 | outstanding. she may be tough but i think she helped me figure out why i wanted to teach phys ed.all you have to do is pay attention,speek upin class, pass the test and just show that you care about this major and you will be fine.she is tough but its to help you get ready for the real world |
4918 | 1 | 5 | This was one of the best courses I've ever taken. He puts so much effort into his lectures and assignments, you really learn everything inside and out. You must put in the effort and follow his requirements in the syllabus to do well. Waiting until the last day to do the assignments is not a good idea. Not an easy A, but highly recommend. |
2462 | 1 | 5 | I took PSYCH 337 Comm. & Society w/ Prof. Milburn (online). The material he provides is very interesting and I really enjoyed the class. He is extremely responsive if you ask him anything which is great. The class is very straight-forward and what is expected of you is laid out clearly. The work load is fair, & if you do the work you will do well. |
982 | 1 | 1 | Would not recommend anyone to take his online course. According to Hassan, “As far as I know, this course has never Audio or Vedio lectures because they are not developed by the departmentschool. So, we have to read the text book and then post any questions on any concept or methods here,and I will address it.” TA worked pretty hard. |
4734 | 1 | 5 | Let me start by saying this is not an easy class. You will have tons of homework to do each week. That being said I think almost all Algebra classes have the same work. Mr. G is a good guy and wants people to pass. If you do the homework and study his review you will be fine. May not get an A, but you will pass. I would recommend him |
7903 | 1 | 1 | Never actually explains what he wants. He will fail your assignments and not give clear feedback as to what he actually expects from you. Quiz, 2 journals, and groupwork due each week. Rude and extremely unhelpful. |
From this orientation to the dataset we can see that there are four columns: `rating_id`, `online`, `student_star`, and `comments`. The first three are metadata associated with the text in `comments`. We also see from the `online` column that five of the ratings are for in-person courses and five are for online courses.
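A quick tally (a small check beyond the original walkthrough) confirms this split:

rmp %>% # dataset
  count(online) # tally ratings by course mode (0 = in person, 1 = online)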
Tokenization
The very helpful `unnest_tokens()` function from the tidytext package is the most efficient way to recast a column of text into various smaller textual units, all while maintaining the metadata structure from the curated dataset. In this way, our transformation will maintain a tidy data format.
Let’s consider some of the key options for tokenization that are provided through the `unnest_tokens()` function. First let’s look at the arguments using the `args()` function.
args(unnest_tokens) # view the arguments
## function (tbl, output, input, token = "words", format = c("text",
## "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE,
## collapse = NULL, ...)
## NULL
In order of appearance in the function: the `tbl` argument takes a data frame; `output` is a character vector giving the desired name of the output column after tokenization; `input` is a character vector naming the column that contains the textual information to be tokenized; the `token` argument is where we specify what type of token we would like to generate from the `input` column; the `format` argument is usually left as the default ‘text’, as more often than not we are working with plain text; the `to_lower` argument lets us decide whether to lowercase the text when it is tokenized; the `drop` argument, `TRUE` by default, drops the `input` column from the tokenized dataset; the `collapse` argument allows for grouping the tokenization output and is usually left as `NULL` (the default); and finally we have the `...` argument, which leaves open the possibility of adding arguments that are relevant for some of the token options, specifically ‘ngrams’ and ‘character_shingles’.
Let’s see `unnest_tokens()` in action, starting first with the most common tokenization unit (and therefore the default): ‘words’.
Words
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments") %>% # input column to tokenize
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
We now see from this preview of the first 10 observations that the words from the comments have been tokenized. `unnest_tokens()` returns each of these tokens on its own row and maintains the metadata from the original dataset (dropping the input `comments` column). We also see that the tokens have been lowercased; this is the default behavior.
Let’s change the `drop =` and `to_lower =` arguments from their defaults (`TRUE`).
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments", # input column to tokenize
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | word |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | far |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | the |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | most |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | condescending |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | mean |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | disrespectful |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | and |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | downright |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | rude |
Note that if the textual input has punctuation, the `unnest_tokens()` function will strip this punctuation when tokenizing into words.
Sentences
If we specify that the tokenized unit is `sentences`, then the punctuation is not stripped.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "sentences", # tokenize to sentences
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff. |
If we take a close look at the output of the sentence tokens in this case, we see that some rows contain multiple sentences in the same observation. This appears to be due to the fact that students sometimes opted not to capitalize the beginning of the next sentence. This suggests that the algorithm `unnest_tokens()` uses relies on sentence punctuation followed by a capitalized word to segment/tokenize sentences.
It is important to review the output of the tokenization to catch these types of anomalies and not assume that the algorithm will be perfectly accurate.
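One way to surface these anomalies (a diagnostic sketch, not part of the original Recipe) is to flag tokenized ‘sentences’ that still contain internal sentence-final punctuation:

rmp %>% # dataset
  unnest_tokens(output = "sentence", # tokenized output column
                input = "comments", # input column to tokenize
                token = "sentences", # tokenize to sentences
                to_lower = FALSE) %>% # do not lowercase
  filter(str_detect(sentence, "[.!?]\\s+\\S")) # punctuation followed by more text suggests a missed split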
If the tokenization defaults (`words`, `sentences`, etc.) do not produce the desired result, we can set the `token =` argument to `"regex"`. This allows us to specify a regular expression pattern to do the tokenization in the added argument `pattern =`.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "[.!?]\\s",
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff |
Note that when the pattern used to segment the text is matched, the match itself is removed. We can use some regular expression magic with the ‘positive lookbehind’ operator `(?<=...)` to require a pattern without consuming it as part of the match. If we apply this to the punctuation part of our original regex, we can preserve the sentence punctuation and still segment the sentences.
rmp %>% # dataset
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "(?<=[.!?])\\s",
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | sentence |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far the most condescending, mean, disrespectful, and downright rude professor ever. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Of all the classes I have taken in my college career, this was the absolute worst. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Very unclear instructions not to mention ridiculous amounts of outdated readings required. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | Do yourself a favor and do not take any classes by her. |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | You will regret it. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Dr. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | White is by far the best teacher I've had at UT. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | If you come to class and pay attention, she will be incredibly helpful, especially in office hours. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. |
91 | 0 | 5 | Dr. White is by far the best teacher I've had at UT. If you come to class and pay attention, she will be incredibly helpful, especially in office hours. She gives pop quizzes but also gives at least 3 opportunities to replace a bad quiz grade if you come to class. Overall, she's helpful, to the point, and knows her stuff. Take this class for sure | Overall, she's helpful, to the point, and knows her stuff. |
Ngrams
Now let’s turn to `ngrams` tokenization. An ngram is a sequence of words, where \(n\) is the length of the sequence desired in the output. Word tokenization is sometimes called unigram tokenization. To get ngrams larger than one word, we set `token =` to `"ngrams"`. Then we need to add the argument `n =` and set the number of words in the sequences we want to tokenize: `n = 2` produces bigrams, `n = 3` trigrams, and so on.
So let’s see this in action by creating bigrams.
rmp %>% # dataset
unnest_tokens(output = "bigram", # tokenized output column
input = "comments", # input column to tokenize
token = "ngrams", # tokenize ngram sequences
n = 2, # two word sequences
to_lower = FALSE, # do not lowercase
drop = FALSE) %>% # do not drop input column
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | comments | bigram |
---|---|---|---|---|
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | By far |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | far the |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | the most |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | most condescending |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | condescending mean |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | mean disrespectful |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | disrespectful and |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | and downright |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | downright rude |
89 | 0 | 1 | By far the most condescending, mean, disrespectful, and downright rude professor ever. Of all the classes I have taken in my college career, this was the absolute worst. Very unclear instructions not to mention ridiculous amounts of outdated readings required. Do yourself a favor and do not take any classes by her. You will regret it. | rude professor |
Great. We now have two-word sequences (bigrams) as our tokens. But if we look at the full output we see that the bigram tokenization includes sequences that span sentence boundaries (e.g. ‘teacher wouldnt’). This is due to the fact that we used the original input (`comments`), which contains all the text. In some cases we may not want to capture these cross-sentential word sequences. To avoid this we can first tokenize our `comments` by sentences (with the regular expression approach), then pass this result to our bigram tokenization.
rmp %>% # dataset
# Tokenize by sentences
unnest_tokens(output = "sentence", # tokenized output column
input = "comments", # input column to tokenize
token = "regex", # tokenize by a regex pattern
pattern = "(?<=[.!?])\\s",
to_lower = FALSE) %>% # do not lowercase
# Add a sentence_id to the dataset
group_by(rating_id) %>% # group the comments
mutate(sentence_id = row_number()) %>% # add a sentence id to index the individual sentences for each comment
ungroup() %>% # remove grouping attribute
# Tokenize the sentences by bigrams
unnest_tokens(output = "bigram", # tokenized output column
input = "sentence", # input column to tokenize
token = "ngrams", # tokenize by ngrams
n = 2, # create bigrams
to_lower = FALSE) %>% # do not lowercase
slice_head(n = 10) # preview first 10 observations
rating_id | online | student_star | sentence_id | bigram |
---|---|---|---|---|
89 | 0 | 1 | 1 | By far |
89 | 0 | 1 | 1 | far the |
89 | 0 | 1 | 1 | the most |
89 | 0 | 1 | 1 | most condescending |
89 | 0 | 1 | 1 | condescending mean |
89 | 0 | 1 | 1 | mean disrespectful |
89 | 0 | 1 | 1 | disrespectful and |
89 | 0 | 1 | 1 | and downright |
89 | 0 | 1 | 1 | downright rude |
89 | 0 | 1 | 1 | rude professor |
So by applying first the sentence tokenization and then the ngram tokenization, we avoid cross-sentential word sequences.
Note that I added a `sentence_id` column to make sure that the sentence from which each bigram comes is documented in the dataset.
With this overview of the options and strategies for tokenizing textual input, I will now create a word-based tokenization of the `rmp` dataset, lowercasing the text in preparation for the next strategy to cover: joins.
rmp_words <-
rmp %>% # dataset
unnest_tokens(output = "word", # tokenized output column
input = "comments") # input column to tokenize
rmp_words %>%
slice_head(n = 10) %>%
knitr::kable(booktabs = TRUE,
caption = "Preview of the `rmp_words` dataset.")
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
Joining datasets
The dplyr package, loaded as part of the tidyverse, contains a number of functions aimed at joining datasets. These functions are of two main types: mutating joins and filtering joins.
In both cases, a join relates two datasets that share a column (or columns) with overlapping values. For mutating joins, the shared column(s) serve as the key that connects the two datasets, effectively widening the result by adding the columns from the second dataset wherever the key values match. For filtering joins, the shared column is used to filter the rows of one dataset based on whether their values appear in the other; a filtering join may keep only the matching rows or exclude them. Let’s look at these two types of joins to get a better sense of their behavior.
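Before applying these joins to our data, a toy example with made-up values may help clarify the mechanics (`x` and `y` here are hypothetical):

x <- tibble(id = 1:3, word = c("good", "bad", "okay")) # dataset of interest
y <- tibble(word = c("good", "bad"), # lookup dataset sharing the `word` column
            sentiment = c("positive", "negative"))

left_join(x, y, by = "word") # mutating: adds `sentiment`, keeps all rows of x
semi_join(x, y, by = "word") # filtering: keeps only rows of x with a match in y
anti_join(x, y, by = "word") # filtering: keeps only rows of x without a match in y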
Mutating joins
As a demonstration, let’s consider a dataset included in the tidytext package which provides a list of words and a sentiment value for each word.
get_sentiments() %>%
group_by(sentiment) %>%
slice_head(n = 5)
word | sentiment |
---|---|
2-faces | negative |
abnormal | negative |
abolish | negative |
abominable | negative |
abominably | negative |
abound | positive |
abounds | positive |
abundance | positive |
abundant | positive |
accessable | positive |
We can see that the `get_sentiments()` function returns a dataset with two columns (`word` and `sentiment`). I’ve only provided the first five word-sentiment pairs for the ‘negative’ and ‘positive’ sentiments; however, the full dataset contains 6786 words.
We can see how many are listed as positive and negative.
get_sentiments() %>%
count(sentiment)
sentiment | n |
---|---|
negative | 4781 |
positive | 2005 |
We can see that negative words outnumber the positive-labeled words.
With this information, we can now see that our `rmp_words` dataset and the dataset from `get_sentiments()` share a column called `word`. More importantly, the columns share the same type of values, i.e. words. If we want to augment our `rmp_words` dataset with the sentiment labels from `get_sentiments()`, we will want to use a mutating join. The idea will be to create a data frame with the following structure:
tribble(
~rating_id, ~online, ~student_star, ~word, ~sentiment,
84, 0, 5, "good", "positive",
84, 0, 5, "teacher", NA,
2802, 1, 1, "worst", "negative",
NA, NA, NA, "...", "..."
)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
84 | 0 | 5 | good | positive |
84 | 0 | 5 | teacher | NA |
2802 | 1 | 1 | worst | negative |
NA | NA | NA | … | … |
In this structure we want all of the observations (words) from `rmp_words` to appear, and those words with matches in `get_sentiments()` should also get a corresponding sentiment value. To do this we use the `left_join()` function. This function takes two primary arguments, `x` and `y`, where `x` is the dataset whose observations are all to be included and `y` is the dataset whose values are added wherever they match. Note that since we do not supply a `by =` argument below, the join detects the shared `word` column automatically and reports it in a message.
left_join(rmp_words, get_sentiments()) %>%
slice_head(n = 10)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
89 | 0 | 1 | by | NA |
89 | 0 | 1 | far | NA |
89 | 0 | 1 | the | NA |
89 | 0 | 1 | most | NA |
89 | 0 | 1 | condescending | negative |
89 | 0 | 1 | mean | NA |
89 | 0 | 1 | disrespectful | negative |
89 | 0 | 1 | and | NA |
89 | 0 | 1 | downright | NA |
89 | 0 | 1 | rude | negative |
Note that `left_join()` keeps all of the rows from the `x` dataset, in this case `rmp_words`. If, for example, we wanted to do a mutating join and remove words from `x` that do not have a match in `y`, then we can turn to `inner_join()`.
inner_join(rmp_words, get_sentiments()) %>%
slice_head(n = 10)
rating_id | online | student_star | word | sentiment |
---|---|---|---|---|
89 | 0 | 1 | condescending | negative |
89 | 0 | 1 | disrespectful | negative |
89 | 0 | 1 | rude | negative |
89 | 0 | 1 | worst | negative |
89 | 0 | 1 | unclear | negative |
89 | 0 | 1 | ridiculous | negative |
89 | 0 | 1 | favor | positive |
89 | 0 | 1 | regret | negative |
91 | 0 | 5 | best | positive |
91 | 0 | 5 | incredibly | positive |
`inner_join()` is in essence a mutating join with a filtering side effect. If we want to simply filter a dataset based on the values in another dataset, we turn to the filtering joins.
Filtering joins
To look at filtering joins, let’s consider another dataset also included with the tidytext package, accessed with `get_stopwords()`.
get_stopwords() %>%
slice_head(n = 10)
word | lexicon |
---|---|
i | snowball |
me | snowball |
my | snowball |
myself | snowball |
we | snowball |
our | snowball |
ours | snowball |
ourselves | snowball |
you | snowball |
your | snowball |
Stopwords are words that are considered to have little semantic content (they roughly correspond to pronouns, prepositions, conjunctions, etc.). In some research cases we will want to remove these words from a dataset. To remove these words we can use the filtering join called `anti_join()`, which, as you can imagine, returns all the rows in `x` that do not have a match in `y`.
anti_join(rmp_words, get_stopwords()) %>%
slice_head(n = 10)
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | far |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
89 | 0 | 1 | professor |
89 | 0 | 1 | ever |
89 | 0 | 1 | classes |
89 | 0 | 1 | taken |
We see now that the stopwords have been removed from the `rmp_words` dataset.
Now if we want the inverse operation, keeping only the stopwords in `rmp_words`, we can use the `semi_join()` function.
semi_join(rmp_words, get_stopwords()) %>%
slice_head(n = 10)
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | and |
89 | 0 | 1 | of |
89 | 0 | 1 | all |
89 | 0 | 1 | the |
89 | 0 | 1 | i |
89 | 0 | 1 | have |
89 | 0 | 1 | in |
One last case worth including here is a filtering join against a character vector rather than a data frame. Combined with `filter()`, the `%in%` operator can act like a `semi_join()`, keeping the matching values in `x`, or, when negated, like an `anti_join()`, removing the matching values from `x`.
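For example, to keep only the rows of `rmp_words` whose `word` value appears in a character vector, we filter with `%in%` (a sketch reconstructing the chunk that produces the output below, using the words ‘very’ and ‘teacher’):

rmp_words %>% # dataset
  filter(word %in% c("very", "teacher")) # keep rows with matching words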
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | very |
91 | 0 | 5 | teacher |
2462 | 1 | 5 | very |
2462 | 1 | 5 | very |
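And negating the condition removes those values instead:

rmp_words %>% # dataset
  filter(!word %in% c("very", "teacher")) %>% # remove rows with matching words
  slice_head(n = 10) # preview first 10 observations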
rating_id | online | student_star | word |
---|---|---|---|
89 | 0 | 1 | by |
89 | 0 | 1 | far |
89 | 0 | 1 | the |
89 | 0 | 1 | most |
89 | 0 | 1 | condescending |
89 | 0 | 1 | mean |
89 | 0 | 1 | disrespectful |
89 | 0 | 1 | and |
89 | 0 | 1 | downright |
89 | 0 | 1 | rude |
Note that in all filtering joins, no new columns are added, only rows are affected.
Case
Let’s now turn to a practical case and see tokenization and joins in action, using the Love On The Spectrum curated dataset that we have worked with previously.
# Read the curated dataset for Love on the Spectrum Season 1
lots <- read_csv(file = "recipe_8/data/derived/love_on_the_spectrum/lots_curated.csv")
glimpse(lots)
## Rows: 3
## Columns: 4
## $ series <chr> "Love On The Spectrum", "Love On The Spectrum", "Love On The …
## $ season <chr> "01", "01", "01"
## $ episode <chr> "01", "02", "03"
## $ dialogue <chr> "It'll be like a fairy tale. A natural high, I suppose? Gets …
The aim will be to tokenize the dataset by words and then join it with an imported dataset of word frequencies calculated on a corpus of TV/film transcripts, the SUBTLEXus word frequency list. I’ll read in this dataset and clean up the columns so that we keep only those relevant to our transformational goals.
word_frequencies <-
read_tsv(file = "recipe_8/data/original/word_frequency_list/SUBTLEXus.tsv")
word_frequencies <-
word_frequencies %>% # dataset
select(word, word_freq = SUBTLWF) # select columns
word_frequencies %>%
slice_head(n = 10)
word | word_freq |
---|---|
the | 29449 |
to | 22678 |
a | 20415 |
you | 41857 |
and | 13388 |
it | 18896 |
s | 20731 |
of | 11577 |
for | 6895 |
I | 39971 |
The result of this transformation aims to produce the following dataset structure:
tribble(
~series, ~season, ~episode, ~word, ~word_freq,
"Love On The Spectrum", "01", "01", "it", 18896
)
series | season | episode | word | word_freq |
---|---|---|---|---|
Love On The Spectrum | 01 | 01 | it | 18896 |
Tokenize
The first step is to tokenize the `lots` curated dataset into word tokens.
# Tokenize dialogue into words
lots_words <-
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "words") # tokenized unit
lots_words %>%
slice_head(n = 10)
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it’ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high |
Love On The Spectrum | 01 | 01 | i |
One thing I notice from this preview is that words like “it’ll” are treated as one token, not two (i.e. ‘it’ and ‘ll’). Let’s use `%in%` to filter (i.e. search) the `word_frequencies` dataset to see how words like “it’ll” are listed.
word_frequencies %>% # dataset
filter(word %in% c("it'll", "it", "ll")) # search for it'll, it, and ll
word | word_freq |
---|---|
it | 18896 |
ll | 4394 |
It appears that ‘it’ and ‘ll’ are treated as separate words. Therefore we want to make sure that our tokenization of the `lots` dataset reflects this too. Our original tokenization using the default `token = "words"` did not do this, so let’s create a regular expression that does.
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
slice_head(n = 10) # preview
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it |
Love On The Spectrum | 01 | 01 | ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale. |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high, |
This works, but there is a side effect: the punctuation has not been stripped. To get rid of the punctuation we can normalize the `word` column, removing punctuation.
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
mutate(word = str_remove(word, pattern = "[:punct:]")) %>% # remove punctuation
slice_head(n = 10) # preview
series | season | episode | word |
---|---|---|---|
Love On The Spectrum | 01 | 01 | it |
Love On The Spectrum | 01 | 01 | ll |
Love On The Spectrum | 01 | 01 | be |
Love On The Spectrum | 01 | 01 | like |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | fairy |
Love On The Spectrum | 01 | 01 | tale |
Love On The Spectrum | 01 | 01 | a |
Love On The Spectrum | 01 | 01 | natural |
Love On The Spectrum | 01 | 01 | high |
This appears to look good. Note that `str_remove()` removes only the first punctuation match in each token; if tokens could carry more than one punctuation mark, `str_remove_all()` would be the safer choice. Let’s now assign this output to an object so we can move on to joining this dataset with the `word_frequencies` dataset.
lots_words <-
lots %>% # dataset
unnest_tokens(output = "word", # output column
input = "dialogue", # input column
token = "regex", # regex tokenization
pattern = "(\\s|')") %>% # regex pattern
mutate(word = str_remove(word, pattern = "[:punct:]")) # remove punctuation
Join
Now it is time to join `lots_words` and `word_frequencies`, keeping all the observations in `x` and adding the `word_freq` column for words that match in `x` and `y`. So we will turn to the `left_join()` function.
left_join(lots_words, word_frequencies) %>%
slice_head(n = 10)
series | season | episode | word | word_freq |
---|---|---|---|---|
Love On The Spectrum | 01 | 01 | it | 18896.3 |
Love On The Spectrum | 01 | 01 | ll | 4394.1 |
Love On The Spectrum | 01 | 01 | be | 5746.8 |
Love On The Spectrum | 01 | 01 | like | 3999.0 |
Love On The Spectrum | 01 | 01 | a | 20415.3 |
Love On The Spectrum | 01 | 01 | fairy | 16.7 |
Love On The Spectrum | 01 | 01 | tale | 12.0 |
Love On The Spectrum | 01 | 01 | a | 20415.3 |
Love On The Spectrum | 01 | 01 | natural | 42.4 |
Love On The Spectrum | 01 | 01 | high | 195.0 |
This looks good, so let’s assign this operation to a new object, `lots_words_freq`.
lots_words_freq <- left_join(lots_words, word_frequencies)
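With the frequencies attached, downstream summaries are straightforward. For example, a quick sketch (a step beyond this Recipe) of the mean corpus frequency of the words in each episode:

lots_words_freq %>% # dataset
  group_by(season, episode) %>% # group by episode
  summarize(mean_word_freq = mean(word_freq, na.rm = TRUE), # average SUBTLEXus frequency
            .groups = "drop") # drop the grouping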
Document
The final step in the process is to write the transformed dataset to disk and document it with a data dictionary.
write_csv(lots_words_freq, file = "recipe_8/data/derived/love_on_the_spectrum/lots_words_freq.csv")
Using our `data_dic_starter()` function, we can create the data dictionary template that we can then open in a spreadsheet and document.
data_dic_starter(data = lots_words_freq, file_path = "recipe_8/data/derived/love_on_the_spectrum/lots_words_freq_data_dictionary.csv")
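For reference, `data_dic_starter()` was defined in an earlier Recipe. A minimal sketch of what such a function can look like (an approximation matching the dictionary structure above, not the original definition):

data_dic_starter <- function(data, file_path) {
  tibble(variable_name = names(data), # one row per column in the dataset
         name = "", # human-readable name, to be filled in manually
         description = "") %>% # prose description, to be filled in manually
    write_csv(file = file_path) # write the template to disk
}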