A dataset containing 1,155,866 tokenized words across 15 genre categories of a sample of American English.
Format
A data frame with 223,506 rows and 11 variables:
- document_id: ID for each corpus document
- category: Label code for each of the 15 corpus categories
- category_description: Descriptive label for the corpus category
- words: Tokenized words from the corpus
- pos: Part-of-speech label for each word in the corpus