A dataset containing 1,155,866 tokenized words from a sample of American English, drawn from 15 genre categories.

Usage

brown
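
As a rough sketch of how a data frame with this shape might be explored, the snippet below builds a small stand-in and runs two common summaries on it. The package that ships `brown` is not named on this page, so the example does not load the real dataset; the object `brown_toy` and every value in it are hypothetical, and only the column names follow the Format section.

```r
# Toy stand-in with five of the documented columns. The values are invented
# for illustration and are NOT taken from the real `brown` dataset.
brown_toy <- data.frame(
  document_id = c("d01", "d01", "d01", "d02", "d02"),
  category = c("a", "a", "a", "b", "b"),
  category_description = c("genre one", "genre one", "genre one",
                           "genre two", "genre two"),
  words = c("The", "jury", "said", "It", "is"),
  pos = c("DET", "NOUN", "VERB", "PRON", "VERB"),
  stringsAsFactors = FALSE
)

# Count how many tokens fall under each category code.
tokens_per_category <- table(brown_toy$category)

# Rebuild the running text of each document by joining its tokens in row order.
text_by_document <- tapply(
  brown_toy$words,
  brown_toy$document_id,
  paste,
  collapse = " "
)

print(tokens_per_category)
print(text_by_document)
```

If the real dataset is attached, the same two calls should work with `brown` substituted for `brown_toy`, assuming the column names match those listed under Format.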

Format

A data frame with 223,506 rows and 11 variables:

document_id

ID for each corpus document

category

Label code for each of the 15 corpus categories

category_description

Descriptive label for each corpus category

words

Tokenized words from the corpus

pos

Part-of-speech label for each word in the corpus