19  Joins

Today is going to be a few examples riffing on class Group Projects

19.1 Load libraries

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE

19.2 California Place Boundaries

All of the groups are using data from SoCal. We could grab a dataset from SCAG. This data is from 2019 though, so boundaries may not be up-to-date.

However, instead we’re going to learn a different approach and import the data using the tigris package. Hat tip and extra credit to Anna for using this in her assignment.

First, we need to install the new package.

Then, once installed, we load the library.

To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.

As always, there’s a bunch of cool functions. The one we will use is called places(). It pulls cities and census designated places geospatial information.

In this example, I

#read the data
CA_places <- places(state = 'CA', cb = T, #pull for California, lower detail level
                    year = 2022, 
                    progress_bar = FALSE) |> 
  st_transform(crs = 4326)
  
head(CA_places)
Simple feature collection with 6 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -120.5913 ymin: 33.05765 xmax: -116.9939 ymax: 35.22005
Geodetic CRS:  WGS 84
  STATEFP PLACEFP  PLACENS         AFFGEOID   GEOID         NAME
1      06   55156 02411359 1600000US0655156 0655156     Palmdale
2      06   22804 02410455 1600000US0622804 0622804    Escondido
3      06   02924 02409738 1600000US0602924 0602924        Arvin
4      06   16350 02410232 1600000US0616350 0616350       Corona
5      06   43224 02410875 1600000US0643224 0643224 Los Alamitos
6      06   31414 02410672 1600000US0631414 0631414    Guadalupe
           NAMELSAD STUSPS STATE_NAME LSAD     ALAND AWATER
1     Palmdale city     CA California   25 274705334 629569
2    Escondido city     CA California   25  96726218 276445
3        Arvin city     CA California   25  12482061      0
4       Corona city     CA California   25 103320711  53926
5 Los Alamitos city     CA California   25  10380819 161830
6    Guadalupe city     CA California   25   3394103  10948
                        geometry
1 MULTIPOLYGON (((-118.2877 3...
2 MULTIPOLYGON (((-117.0237 3...
3 MULTIPOLYGON (((-118.8511 3...
4 MULTIPOLYGON (((-117.673 33...
5 MULTIPOLYGON (((-118.093 33...
6 MULTIPOLYGON (((-120.5912 3...

Figure 19.1 shows a map of Cities using leaflet.

leaflet() |> 
  addTiles() |> 
  addPolylines(data = CA_places)
Figure 19.1: California cities and census designated place boundaries

That’s just too much information. For the Fontana and Bloomington projects, we probably want to just show those two communities. Let’s use the %in% operator to filter() those two places from the list.

Figure 19.2 shows a leaflet map of the cities, zoomed into the main valley.

FontanaBloomington <- CA_places |> 
  filter(NAME %in% c('Fontana', 'Bloomington'))

  leaflet() |> 
  addTiles() |> 
  addPolygons(data = FontanaBloomington,
              color = 'black',
              weight = 1,
              fillOpacity = 0.3,
              label = ~NAME)
Figure 19.2: Fontana and Bloomington on a leaflet map

19.3 Spatial Joins

Spatial joins use geospatial information to identify overlapping geometries in space. Thus, if one has two geospatial datasets, one can quickly identify intersections of interest.

19.3.1 Example 1 - Spatial Joins

Let’s pick all the cities and census designated places in Los Angeles County.

Step 1 - pull Los Angeles County spatial boundaries using the county

#read the data
LA_county <- counties(state = 'CA', cb = T, progress_bar = FALSE) |>  #pull for California, lower detail level, year = 2022 ) 
  st_transform(crs = 4326) |> 
  filter(NAME == 'Los Angeles') # Just pick LA County
Retrieving data for the year 2022

Check if it worked by making a basic leaflet map. Figure 19.3

leaflet() |> 
  addTiles() |> 
  addPolygons(data = LA_county,
              color = 'black')
Figure 19.3: Los Angeles County on a leaflet map

Great!

Now let’s spatially filter the Los Angeles places using st_filter() from the sf package.

The syntax is to provide the basic dataset to filter first, then the geospatial polygon to filter on second.

LA_places <- CA_places |> 
  st_filter(LA_county, #
            .predicate = st_covered_by) # This adjective includes fully 'covered' places - not ones that merely intersect partially, 
# which is the default

head(LA_places)
Simple feature collection with 6 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -118.6682 ymin: 33.70515 xmax: -117.9162 ymax: 34.76915
Geodetic CRS:  WGS 84
  STATEFP PLACEFP  PLACENS         AFFGEOID   GEOID            NAME
1      06   55156 02411359 1600000US0655156 0655156        Palmdale
2      06   71806 02411897 1600000US0671806 0671806    Sierra Madre
3      06   19990 02410361 1600000US0619990 0619990          Duarte
4      06   44000 02410877 1600000US0644000 0644000     Los Angeles
5      06   45400 02411020 1600000US0645400 0645400 Manhattan Beach
6      06   40130 02411620 1600000US0640130 0640130       Lancaster
              NAMELSAD STUSPS STATE_NAME LSAD      ALAND   AWATER
1        Palmdale city     CA California   25  274705334   629569
2    Sierra Madre city     CA California   25    7643252    10258
3          Duarte city     CA California   25   17376217        0
4     Los Angeles city     CA California   25 1218634998 80417875
5 Manhattan Beach city     CA California   25   10189493 16389932
6       Lancaster city     CA California   25  244189279   651713
                        geometry
1 MULTIPOLYGON (((-118.2877 3...
2 MULTIPOLYGON (((-118.0682 3...
3 MULTIPOLYGON (((-117.9902 3...
4 MULTIPOLYGON (((-118.6682 3...
5 MULTIPOLYGON (((-118.4228 3...
6 MULTIPOLYGON (((-118.3252 3...

Ok, let’s make a map to see if these are just the correct cities.

Figure 19.4 shows the place and city boundaries in blue, and LA County in red.

leaflet() |> 
  addTiles() |> 
  addPolylines(data = LA_county,
              color = 'red') |> 
  addPolygons(data = LA_places,
              color = 'blue',
              fillOpacity = 0.2,
              label = ~NAME,
              weight = 2)
Figure 19.4: Los Angeles County and its cities and places on a leaflet map.

19.4 Database Joins

Dataset CA_places has the geospatial information on city and census designated place boundaries. Some of the other datasets you are using have the information you would like to compare. How can we merge or join these disparate data layers for visualization and analysis?

There are two types of joins that can help to merge disparate datasets.

The first type of join is based on unique records in both datasets. If you have a column in dataset A and it has some matching records in dataset B, there are ways to join the two datasets.

Here’s all the permutations. - inner_join() - keep only records present in both dataset A and B. - left_join() - keep all records from the dataset A and only matching variables from B. Fill missing with NA. - right_join() - keep all records from the dataset B and only matching variables from A. Fill missing with NA. - full_join() - keep all records from A and B - fill NA in any records where dataset is missing from one of the datasets. - anti_join() - keep only records from A that don’t occur in B.

The example below shows Venn diagrams of all the permutations.

Join Venn Diagrams - credit Tavareshugo github

19.4.1 Example #2 - population data

The California Department of Finance provides city and place information on households and populations. I’d like to map population of cities and places using the latest and greatest data.

First, let me scrape the data using a new package for reading MS Excel spreadsheets - readxl.

We need to install the new package which reads Excel files.

install.packages('openxlsx')

Then, once installed, we load the library.

Scrape the data using the read.xlsx() function from openxlsx

Tbl_E1 <- read.xlsx(xlsxFile = 'https://dof.ca.gov/wp-content/uploads/sites/352/Forecasting/Demographics/Documents/E-1_2024_InternetVersion.xlsx',                   sheet = 'E-1 CityCounty2024', #multiple sheets, we need to specify the sheetname
                    rows = 5:599) |> # this is filtering for rows with decent data
  rename(place = 'State/County/City',
         pop2023 = '1/1/2023',
         pop2024 = '1/1/2024')

head(Tbl_E1)
       place  pop2023  pop2024     Change
1 California 39061058 39128162  0.1717926
2    Alameda  1650656  1641869 -0.5323338
3    Alameda    77237    78071  1.0797934
4     Albany    20354    20325 -0.1424781
5   Berkeley   125181   125327  0.1166311
6     Dublin    72681    72917  0.3247066

Alright, we have population data and place names. Let’s see if we can show city populations using a join function.

LA_places_pop <- LA_places |> 
  left_join(Tbl_E1, by = c('NAME' = 'place')) |> # this tells it what columns to join the two tables by
  filter(!is.na(pop2024)) |> # remove census places with no population data in this table
  filter(NAME != 'Los Angeles') # remove Los Angeles City because it is too big

palPop <- colorNumeric(domain = LA_places_pop$pop2024,
                       palette = 'Spectral', reverse = T)

Time for a map. Figure 19.5 shows population of cities in Los Angeles County other than the city of Los Angeles.

leaflet() |> 
  addProviderTiles(provider = providers$CartoDB.Positron) |> 
  addPolygons(data = LA_places_pop,
              color = ~palPop(pop2024),
              fillOpacity = 0.8,
              label = ~str_c(NAME, ', ', pop2024), # add concatenated string label with city name and population
              weight = 2) |> 
  addLegend(data = LA_places_pop,
            title = '2024 population',
            pal = palPop,
            values = ~pop2024)
Figure 19.5: Los Angeles city populations other than leaflet map.