A dataset containing 1,155,866 tokenized words across 15 genre categories of a sample of American English.
Format
A data frame with 223,506 rows and 11 variables:
- document_id: ID for each corpus document
- category: Label code for each of the 15 corpus categories
- category_description: Descriptive label for the corpus category
- words: Tokenized words from the corpus
- pos: Part-of-speech label for each word in the corpus