Skip to contents
library(tidyverse)        # data manipulation
library(tidycensus)       # access US Census data
library(sf)               # manipulate spatial objects

Overview

In this guide I will provide an overview of how to coordinate US Census demographic information with Twitter status posts collected from the Twitter API that include geolocation information (i.e. latitude and longitude coordinates).

Access to the US Census

The first step is to acquire a US Census API Key. You can do so by following this link. Once you have your API key, you will now want to add it to RStudio –but you want to add it to the R environment, not in a script in your project directory. To do this we can run the tidycensus::census_api_key() function and use your API key as a character vector as the first argument then set install = TRUE.

tidycensus::census_api_key(key = "<your-api-key-here>", install = TRUE)

The install = TRUE argument will make this API key available across your RStudio sessions.

Query US Census data

We use the tidycensus package (Walker and Herman 2022) to query the US Census. In particular we will look at querying the American Community Survey (ACS) data which is a rolling survey over 5 years (“acs5”). To make it easier to find the codes for the various variables that the survey provides, we can acess the variables table with the load_variables() function.

acs_5 <- 
  load_variables(year = 2019, # select last year of 5-year summary
                 dataset = "acs5", # select the dataset
                 cache = TRUE) # cache the results (to speed subsequent queries)

View(acs_5) # view in RStudio editor pane

Assigning it to an object and then calling View() allows us to peruse the variables and search them with the filter tool in RStudio, as seen in Figure 1.

ACS 5 survey variable codes and descriptions

Figure 1: ACS 5 survey variable codes and descriptions

We use the get_acs() function to pull the demographic information. We can specify the state(s) to query and the type of geography (block, tract, county, state) that we want along with the variable(s) that we want. In addition we can pull the geometry of the geography so we can have the geographic limits of these geographies.

Let’s use the variable B19013_001 which is the median household income in the past 12 months for the counties in Arizona. We will also request the geometry of the counties for graphic visualization.

get_acs() defaults to 2019 ACS 5-year sample.

az_county_sf <- 
  get_acs(state = "AZ", # state
          geography = "county", # census geography
          variables = "B19013_001", # median income
          geometry = TRUE) # get simple features (sf) geography

Let’s take a look at the structure of the dataset that is returned.

glimpse(az_county_sf)  # preview
## Rows: 15
## Columns: 6
## $ GEOID    <chr> "04027", "04001", "04017", "04013", "04012", "04005", "04021"…
## $ NAME     <chr> "Yuma County, Arizona", "Apache County, Arizona", "Navajo Cou…
## $ variable <chr> "B19013_001", "B19013_001", "B19013_001", "B19013_001", "B190…
## $ estimate <dbl> 48790, 33967, 43140, 67799, 34956, 59000, 60968, 46907, 47686…
## $ moe      <dbl> 1981, 2108, 2357, 444, 5708, 1900, 1223, 3169, 1355, 3071, 11…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-115 32.5, ..., MULTIPOLYGON (((…

This dataset has 15 observations corresponding to the 15 counties in Arizona. The columns include the GEOID, NAME (county name in this case), variable (median income), estimate (median income values), moe (margin of error of the sample), and the geometry.

az_county_sf %>%
    slice_head(n = 5)
GEOID NAME variable estimate moe geometry
04027 Yuma County, Arizona B19013_001 48790 1981 MULTIPOLYGON (((-115 32.5, …
04001 Apache County, Arizona B19013_001 33967 2108 MULTIPOLYGON (((-110 37, -1…
04017 Navajo County, Arizona B19013_001 43140 2357 MULTIPOLYGON (((-111 35, -1…
04013 Maricopa County, Arizona B19013_001 67799 444 MULTIPOLYGON (((-113 33.6, …
04012 La Paz County, Arizona B19013_001 34956 5708 MULTIPOLYGON (((-115 33.3, …

Let’s visualize median income in counties in Arizona with ggplot() and the geom_sf() function.

az_county_sf %>% 
  ggplot(aes(fill = estimate)) + # mapping
  geom_sf() + # plot the county geometries
  theme_minimal() + # use a minimal ggplot theme
  labs(title = "Arizona: Median income by county", 
       fill = "Median income") # labels

We can also query multiple states or variables. Let’s look into education which is the B07009 category. The sample total is B07009_001 and the estimate of the sample which have less than a high school education is B07009_002.

az_county_sf <- 
  get_acs(state = "AZ", # state
          geography = "county", # census geography
          variables = c("B07009_001", "B07009_002"), # edu: total, < high school
          geometry = TRUE) # get simple features (sf) geography
glimpse(az_county_sf)  # preview structure
## Rows: 30
## Columns: 6
## $ GEOID    <chr> "04027", "04027", "04001", "04001", "04017", "04017", "04013"…
## $ NAME     <chr> "Yuma County, Arizona", "Yuma County, Arizona", "Apache Count…
## $ variable <chr> "B07009_001", "B07009_002", "B07009_001", "B07009_002", "B070…
## $ estimate <dbl> 134749, 34280, 45607, 8459, 71918, 11620, 2952370, 342904, 16…
## $ moe      <dbl> 297, 1420, 109, 410, 76, 609, NA, 5371, 194, 473, 157, 600, 1…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-115 32.5, ..., MULTIPOLYGON (((…

In this dataset we have the same columns as before, yet we have double the observations. This is due to the fact that each variable is by default added in tidy format as seen below.

az_county_sf %>%
    slice_head(n = 5)  # preview data frame
GEOID NAME variable estimate moe geometry
04027 Yuma County, Arizona B07009_001 134749 297 MULTIPOLYGON (((-115 32.5, …
04027 Yuma County, Arizona B07009_002 34280 1420 MULTIPOLYGON (((-115 32.5, …
04001 Apache County, Arizona B07009_001 45607 109 MULTIPOLYGON (((-110 37, -1…
04001 Apache County, Arizona B07009_002 8459 410 MULTIPOLYGON (((-110 37, -1…
04017 Navajo County, Arizona B07009_001 71918 76 MULTIPOLYGON (((-111 35, -1…

For variables in which we want may want to compare the particular variable to the variable’s total estimate to get a percentage, it is often better to have the data in wide format. We can run a similar query and add the argument output = "wide".

az_county_sf <- 
  get_acs(state = "AZ", # state
          geography = "county", # census geography
          variables = c("B07009_001", "B07009_002"), # edu: total, < high school
          geometry = TRUE, # get simple features (sf) geography
          output = "wide") # get wide output
glimpse(az_county_sf)  # preview structure
## Rows: 15
## Columns: 7
## $ GEOID       <chr> "04027", "04001", "04017", "04013", "04012", "04005", "040…
## $ NAME        <chr> "Yuma County, Arizona", "Apache County, Arizona", "Navajo …
## $ B07009_001E <dbl> 134749, 45607, 71918, 2952370, 16600, 84314, 311415, 39740…
## $ B07009_001M <dbl> 297, 109, 76, NA, 194, 157, 128, 39, 130, 105, 131, 123, 1…
## $ B07009_002E <dbl> 34280, 8459, 11620, 342904, 3245, 7597, 40308, 5063, 22045…
## $ B07009_002M <dbl> 1420, 410, 609, 5371, 473, 600, 1602, 577, 1325, 389, 913,…
## $ geometry    <MULTIPOLYGON [°]> MULTIPOLYGON (((-115 32.5, ..., MULTIPOLYGON …

In this dataset we have the same number of observations, but additional columns. The estimate (E) and the margin of error (M) appear for each variable.

az_county_sf %>%
    slice_head(n = 5)  # preview data frame
GEOID NAME B07009_001E B07009_001M B07009_002E B07009_002M geometry
04027 Yuma County, Arizona 134749 297 34280 1420 MULTIPOLYGON (((-115 32.5, …
04001 Apache County, Arizona 45607 109 8459 410 MULTIPOLYGON (((-110 37, -1…
04017 Navajo County, Arizona 71918 76 11620 609 MULTIPOLYGON (((-111 35, -1…
04013 Maricopa County, Arizona 2952370 NA 342904 5371 MULTIPOLYGON (((-113 33.6, …
04012 La Paz County, Arizona 16600 194 3245 473 MULTIPOLYGON (((-115 33.3, …

Now with the multiple variables for education in wide format we can use the total estimate (B07009_001E) and the less than high school estimate (B07009_002E) to create a percentage of the county’s population with less than a high school education. We use our data manipulation skills to do this. I will use the mutate() to calculate the percentage and add it to our data frame.

az_county_sf <- 
  az_county_sf %>% 
  mutate(perc_less_hs = (B07009_002E / B07009_001E) * 100) # calculate percent of county with less than high school edu

az_county_sf %>% 
  select(!ends_with("M")) %>% # filter out margin of error columns
  slice_head(n = 5) # preview
GEOID NAME B07009_001E B07009_002E perc_less_hs geometry
04027 Yuma County, Arizona 134749 34280 25.4 MULTIPOLYGON (((-115 32.5, …
04001 Apache County, Arizona 45607 8459 18.5 MULTIPOLYGON (((-110 37, -1…
04017 Navajo County, Arizona 71918 11620 16.2 MULTIPOLYGON (((-111 35, -1…
04013 Maricopa County, Arizona 2952370 342904 11.6 MULTIPOLYGON (((-113 33.6, …
04012 La Paz County, Arizona 16600 3245 19.5 MULTIPOLYGON (((-115 33.3, …

And again we can visualize this information with ggplot().

az_county_sf %>% 
  ggplot(aes(fill = perc_less_hs)) + # mapping
  geom_sf() + # plot the county geometries
  theme_minimal() + # use a minimal ggplot theme
  labs(title = "Arizona: Less than high school by county", 
       subtitle = "Percentage",
       fill = "Less than high school") # labels

Twitter

Now let’s look at how to join the information from the US Census with Twitter status posts (tweets) with geolocation information (i.e. latitude and longitude coordinates).

Orientation

As an example, I will read in a dataset collected from the Twitter API via rtweet (Kearney 2020). The collection includes various regionalisms (faucet, spigot, frying pan, skillet, pail, bucket, coke, pop, soda, you guys, yall). We will focus on the terms “y’all” (and “yall”) and “you guys”. I will also remove all tweets with no geolocation coordinate information.

typ_df <- 
  read_csv(file = "guide_2/data/derived/tusr_curated.csv") %>% # read dataset
  mutate(user_id = as.character(user_id), # convert to character
         status_id = as.character(status_id)) %>%  # convert to character
  filter(search_term %in% c("yall", "you guys")) %>% # only keep you plural search terms
  mutate(search_term = factor(search_term)) %>%  # make search term a factor
  filter(lat != "") # remove tweets with no geolocation information

glimpse(typ_df) # preview
## Rows: 3,018
## Columns: 6
## $ user_id     <chr> "29038175", "191434404", "1383299286373265408", "127876171…
## $ status_id   <chr> "1452684447824887808", "1452684402715201536", "14526843251…
## $ search_term <fct> you guys, you guys, you guys, you guys, you guys, you guys…
## $ text        <chr> "@WingfieldNFL @ckparrot Can you guys tell Grier and Co, s…
## $ lat         <dbl> 41.8, 43.6, 45.1, 43.0, 49.9, 49.9, 41.8, 41.8, 41.8, 33.9…
## $ lng         <dbl> -71.4, -79.3, -93.3, -80.6, -97.2, -97.2, -72.8, -72.8, -7…

We have 3018 observations and 6 variables.

Now let’s take a look at the data dictionary for this dataset.

Table 1: Data diciontary for the Twitter US regionalisms dataset.
variable_name name description
user_id User ID Twitter user id
status_id Tweet ID ID for specific tweet
search_term Search Term Term used to collect this tweet
text Tweet Text Tweet text in status post
lat Latitude Latitude coordinate (if geoenabled)
lng Longitude Longitude coordinate (if geoenabled)

Since we have geolocation coordinates, we can map these using ggplot() with a base map from the map_data() function.

states_map <- map_data("state") # get US map of states

p <- ggplot() + # base plot
  geom_polygon(data = states_map, # map data
               aes(x = long, y = lat, group = group), fill = "white", color = "black") + # set background/ lines
  labs(title = "Tweets in the USA", subtitle = "Regional terms") + # labels
  theme_minimal()

p + # add to previous base plot
  geom_point(data = typ_df, # tweet data with lat and lng coordinates and languages
             aes(x = lng, y = lat, group = 1, color = search_term), # map lat and lng and color for language names
             alpha = 1/4, size = 1.5) + # transparency and size of points
  labs(color = "Regional terms") # labels

Trends towards “Y’all” in the South and eastern metropolitan areas. “You guys” is interspersed with “Y’all” to some degree but appears to be more common in the Mid-West and northern states.

The overlap appears to be quite extensive in some states. Let’s look at California in particular and explore how we can associate demographic variables with the tweets emanating from tracts (sub-divisions of counties) in this state.

US Census

Let’s pull the demographic estimates for median income, as an example, for all the tracts in California.

ca_tract_sf <- 
  get_acs(state = "CA", # state
          geography = "tract", # census geography
          variables = c("B19013_001"), # median income
          geometry = TRUE) # get simple features (sf) geography
glimpse(ca_tract_sf)  # preview ACS tracts for CA
## Rows: 9,129
## Columns: 6
## $ GEOID    <chr> "06077005127", "06077003406", "06077004402", "06077005124", "…
## $ NAME     <chr> "Census Tract 51.27, San Joaquin County, California", "Census…
## $ variable <chr> "B19013_001", "B19013_001", "B19013_001", "B19013_001", "B190…
## $ estimate <dbl> 87125, 33750, 65850, 54034, 31098, 55956, 54754, 46696, 88897…
## $ moe      <dbl> 15157, 6659, 8817, 6463, 6802, 14313, 6871, 9628, 13887, 1033…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-121 37.8, ..., MULTIPOLYGON (((…

The dataset has 9129 observations corresponding to the number of tracts in California and 6 columns –the same we saw earlier for the single variable, tidy version of the ACS query.

For fun, let’s plot the census information for tract median income in California.

ca_tract_sf %>% 
  ggplot(aes(fill = estimate)) + # mapping
  geom_sf() + # plot the county geometries
  theme_minimal() + # use a minimal ggplot theme
  labs(title = "California: Median income by tract", 
       fill = "Median income") # labels

Now let’s add the appropriate CRS (coordinate reference system) scheme to the lng and lat variables in the tweet dataset (typ_df). We use the st_as_sf() function to do this. ‘sf’ stands for spatial features object and we map the lng and lat to coords and set the crs to the CRS of the ca_tract_sf object.

typ_sf <- 
  typ_df %>% # original dataset
  st_as_sf(coords = c("lng", "lat"), # map x/long, y/lat values to coords
           crs = st_crs(ca_tract_sf)) # align coordinate reference systems

glimpse(typ_sf) # preview
## Rows: 3,018
## Columns: 5
## $ user_id     <chr> "29038175", "191434404", "1383299286373265408", "127876171…
## $ status_id   <chr> "1452684447824887808", "1452684402715201536", "14526843251…
## $ search_term <fct> you guys, you guys, you guys, you guys, you guys, you guys…
## $ text        <chr> "@WingfieldNFL @ckparrot Can you guys tell Grier and Co, s…
## $ geometry    <POINT [°]> POINT (-71.4 41.8), POINT (-79.3 43.6), POINT (-93.3…

We now see that our data frame has converted the latitude and longitude to a geometry vector. We can now use the st_join() function to associate those geolocation points from our tweet data with the demographic information from the US Census.

typ_census_sf <- st_join(typ_sf, ca_tract_sf)  # join tweets and census by geometry

glimpse(typ_census_sf)  # preview
## Rows: 3,018
## Columns: 10
## $ user_id     <chr> "29038175", "191434404", "1383299286373265408", "127876171…
## $ status_id   <chr> "1452684447824887808", "1452684402715201536", "14526843251…
## $ search_term <fct> you guys, you guys, you guys, you guys, you guys, you guys…
## $ text        <chr> "@WingfieldNFL @ckparrot Can you guys tell Grier and Co, s…
## $ geometry    <POINT [°]> POINT (-71.4 41.8), POINT (-79.3 43.6), POINT (-93.3…
## $ GEOID       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "06037650502", NA, NA,…
## $ NAME        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Census Tract 6505.02,…
## $ variable    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "B19013_001", NA, NA, …
## $ estimate    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 122240, NA, NA, NA, NA…
## $ moe         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 13636, NA, NA, NA, NA,…

We have combined the tweets and census data resulting in 3018 observations. Notice, however, that there are various observations with NA values. These are tweets that do not emanate from within California. So we can eliminate them from the dataset for our purposes here.

typ_census_ca_sf <- typ_census_sf %>%
    filter(GEOID != "")  # keep observations with GEOIDs

glimpse(typ_census_ca_sf)  # preview
## Rows: 344
## Columns: 10
## $ user_id     <chr> "1087968900", "15723895", "38469445", "61715298", "4398000…
## $ status_id   <chr> "1452683638881984512", "1452677571825467392", "14526762309…
## $ search_term <fct> you guys, you guys, you guys, you guys, you guys, you guys…
## $ text        <chr> "@melissawisz @StevieLynnne @delanieewisz @wiszrealestate …
## $ geometry    <POINT [°]> POINT (-118 33.9), POINT (-118 34), POINT (-118 34),…
## $ GEOID       <chr> "06037650502", "06037271803", "06037271803", "06059088801"…
## $ NAME        <chr> "Census Tract 6505.02, Los Angeles County, California", "C…
## $ variable    <chr> "B19013_001", "B19013_001", "B19013_001", "B19013_001", "B…
## $ estimate    <dbl> 122240, 83226, 83226, 62273, 83226, 132950, 83226, 155074,…
## $ moe         <dbl> 13636, 6367, 6367, 10934, 6367, 27282, 6367, 96400, NA, 17…

After filtering out tweets not originating from California we now have 344 observations.

References

Kearney, Michael W. 2020. Rtweet: Collecting Twitter Data. https://CRAN.R-project.org/package=rtweet.
Walker, Kyle, and Matt Herman. 2022. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.