12  Importing Data - Part 1

Today we will focus on the practice of importing data - better than last time.

Our framework for the workflow of data visualization is shown in Figure 12.1.

Figure 12.1: Tidyverse framework again

Acquiring and importing data is the most complicated part of this course and of data visualization in general. This unit comes now, rather than at the beginning, because it is difficult and painful while providing little of the immediate satisfaction of a cool map or graphic. In my experience, data import and manipulation are 80% or more of the work when creating visualizations, so they need to be covered at least nominally in any course on data visualization.

12.1 Load and Install Packages

As always, we should load the packages we need to import the data. There are many specialized data import packages, but tidyverse and sf are a good start and can handle many standard tables and geospatial data files. Remember, you can confirm that a package is loaded in your R session by going to the Files, Plots, and Packages panel, clicking on the Packages tab, and scrolling down to tidyverse and sf to make sure they are checked.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
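
The same check can be done from the console. The sketch below is one way to do it, assuming the three package names shown (leaflet is included because the map later in this chapter uses it). The install line is left commented out so nothing is installed by accident.

```r
# Packages this chapter relies on (leaflet is an assumption based on the map code later on)
pkgs <- c("tidyverse", "sf", "leaflet")

# TRUE for each package that is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install anything that is missing (uncomment to run)
# install.packages(pkgs[!installed])

# TRUE for each package that is attached to the current session,
# i.e. the ones that would show a check mark in the Packages tab
attached <- paste0("package:", pkgs) %in% search()
names(attached) <- pkgs
print(attached)
```

A package can be installed but not attached. In that case, run library() on it before using its functions.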

12.2 Option 1. Point and Click Download, File Save, Read

The basic way to acquire data is the Point and Click method. This is a step-by-step instruction for doing that.

12.2.1 Find and Download Data

Go to CalEnviroScreen

Download the Zipped Shapefile shown in the screenshot in Figure 12.2

Figure 12.2: CalEnviroScreen Shapefile Location

You can skip the next step if you directly save the zip file to your working directory.

12.2.2 Move the Zipped Shapefile to the R Working Directory

By default, downloads are often placed in a Downloads directory, although you may have changed that on your local machine. If that happened with your download, either (a) move the zipped file to the R working directory or (b) identify the file path of the default download directory and work with the file from there.

For today, I will only show option (a) because it is good data science practice to keep the data in a directory associated with the visualization.

  1. Identify the directory where the zipped shapefile was downloaded. On my machine, this is a Downloads folder which can be accessed through my web browser after the file download is complete. The name of the file is calenviroscreen40shpf2021shp.zip.

  2. Identify the R working directory on your machine using the getwd() function.

[1] "C:/Dev/EA078_Fall2024"
wd <- getwd()
  3. Move calenviroscreen40shpf2021shp.zip from the default download directory to the R working directory. Either drag it, copy and paste it, or cut and paste it. On a Mac, use Finder. Here’s a YouTube video on how to move a file - start at 0:27. On a PC, use File Explorer.

  4. Check your Files, Plots, and Packages panel to see that the zipped file is recognized by RStudio. See the example in Figure 12.3.
Figure 12.3: Files, Plots, and Packages Panel

If you see the calenviroscreen40shpf2021shp.zip in the directory on your machine, congratulations! You are a winner!
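
The move in step 3 can also be done from R instead of by dragging. This is only a sketch: the helper name copy_to_wd() is made up for illustration, and "~/Downloads" is an assumed location that may not match your machine. It copies rather than moves, so the original stays in place.

```r
# Hypothetical helper: copy a downloaded file into the R working directory.
# from_dir = "~/Downloads" is an assumption; change it to your own download folder.
copy_to_wd <- function(file_name, from_dir = "~/Downloads") {
  src  <- file.path(path.expand(from_dir), file_name)
  dest <- file.path(getwd(), file_name)

  if (!file.exists(src)) {
    message("Not found: ", src)
    return(invisible(FALSE))
  }

  # overwrite = FALSE protects a copy that is already in the working directory
  invisible(file.copy(src, dest, overwrite = FALSE))
}

# Usage, with the file name from step 1
copy_to_wd("calenviroscreen40shpf2021shp.zip")

# Confirm the result the same way the Files panel does
file.exists("calenviroscreen40shpf2021shp.zip")
```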

12.2.3 Unzip the data - Two Ways

Although the data is in the right place, it is not directly readable while zipped.

12.2.3.1 Point and Click Unzip

I think the process is basically the same for Mac and PC, but we will confirm this in class.

  • On a Mac, double-click the .zip file. The unzipped item appears in the same folder as the .zip file.

  • On a PC, right-clicking on a zipped file will bring up a menu that includes an Extract All option. Choosing the Extract All option brings up a pathname to extract the file to. The default is to extract the zip file to a subfolder named after the zip file.

Again, go to the Files, Plots, and Packages panel and check if there is a folder called calenviroscreen40shpf2021shp as shown in Figure 12.4.

Figure 12.4: Shapefile folder
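
The unzip step can also be done in code with base R's unzip() function. The sketch below wraps it in a small helper, unzip_here(), whose name is made up for illustration. It assumes the zip file sits in the working directory and extracts into a folder named after the zip file, mirroring the PC default described above. Whether the archive already contains its own top-level folder is not something this sketch checks.

```r
# Hypothetical helper: unzip a file into a folder named after it.
unzip_here <- function(zip_name,
                       exdir = tools::file_path_sans_ext(zip_name)) {
  if (!file.exists(zip_name)) {
    message("Not found in working directory: ", zip_name)
    return(invisible(character(0)))
  }
  # unzip() returns the paths of the extracted files
  unzip(zip_name, exdir = exdir)
}

# Usage, with the file name used in this chapter
extracted <- unzip_here("calenviroscreen40shpf2021shp.zip")

# List what landed in the folder (empty if the zip file was not found)
list.files("calenviroscreen40shpf2021shp")
```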

The sf library is used to import geospatial data. As before, st_read() is the function used to import geospatial files.

Shapefiles are the Esri proprietary geospatial format and are very common.

The CalEnviroScreen data are in the shapefile format, which is a bunch of individual files organized in a folder directory. In the calenviroscreen40shpf2021shp directory, there are 8 individual files with 8 different file extensions. We can ignore that and just point st_read() at the directory and it will do the rest. The dsn = argument stands for data source name, which can be a directory, a file, or a database.

wd <- getwd()
directory <- 'calenviroscreen40shpf2021shp'
CalEJ <- sf::st_read(dsn = directory) |> 
  sf::st_transform(crs = 4326)
Reading layer `CES4 Final Shapefile' from data source 
  `C:\Dev\EA078_Fall2024\calenviroscreen40shpf2021shp' using driver `ESRI Shapefile'
Simple feature collection with 8035 features and 66 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -373976.1 ymin: -604512.6 xmax: 539719.6 ymax: 450022.5
Projected CRS: NAD83 / California Albers

Check the Environment panel after running this code. Is there a CalEJ object with 8035 observations of 67 variables (the 66 fields plus the geometry column)?

If so, success is yours! Let’s make a map of Pesticide census tract percentiles to celebrate with Figure 12.5!

12.2.4 Visualize the data

The whole California map is too big, so I am just going to show the southernmost counties here by using the filter function for a small subset of counties. We’ll also remove the tracts with no pesticide information (-999).

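library(leaflet)  # provides leaflet(), colorNumeric(), addPolygons(), addLegend()

# Keep four southern counties and drop tracts flagged -999 (no pesticide data)
CalEJ2 <- CalEJ |>
  filter(County %in% c('Imperial', 'Riverside', 'San Diego', 'Orange')) |>
  filter(PesticideP >= 0) |>
  st_transform(crs = 4326)  # leaflet expects longitude/latitude (EPSG:4326)

# Grey-scale palette spanning the pesticide percentiles in the subset
palPest <- colorNumeric(palette = 'Greys', domain = CalEJ2$PesticideP)

leaflet(data = CalEJ2) |>
    addTiles() |>
    addPolygons(color = ~palPest(PesticideP),
                fillOpacity = 0.5,
                weight = 2,
                label = ~ApproxLoc) |>
    addLegend(pal = palPest,
              title = 'Pesticide (%)',
              values = ~PesticideP)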
Figure 12.5: Pesticide percentile census tracts in California
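
The two filter() steps can be tried on a toy table first. This is a sketch with made-up values: only the column names County and PesticideP and the -999 flag come from the text above. The na_if() line is an optional alternative that keeps the row but marks the value as missing. It is not what the chapter's code does.

```r
library(dplyr)

# Toy stand-in for the CalEJ attribute table (values are invented)
toy <- tibble(
  County     = c("Imperial", "Kern", "Orange"),
  PesticideP = c(85.2, 40.1, -999)
)

south <- c("Imperial", "Riverside", "San Diego", "Orange")

# Same logic as the map code: keep the listed counties, drop the -999 flag
kept <- toy |>
  filter(County %in% south, PesticideP >= 0)
print(kept)

# Alternative: convert the flag to NA instead of dropping the row
flagged <- toy |>
  mutate(PesticideP = na_if(PesticideP, -999))
print(flagged)
```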

12.2.5 Option 2 - Directly Read the Dataset

This example uses monthly average methane concentrations measured from flask samples.

Today I am selecting Mauna Loa (MLO) in Hawaii. Methane’s chemical formula is CH4. Therefore, I will assign the URL to an object named URL.MLO.CH4.

URL.MLO.CH4 <- file.path( 'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_mlo_surface-flask_1_ccgg_month.txt')

We did this before for Alert, so let’s reuse the code that worked, which relies on the read_table() function. Note that when I follow the link, the first line of the dataset says there are 71 header lines.

MLO.CH4 <- read_table(URL.MLO.CH4, skip = 71)

── Column specification ────────────────────────────────────────────────────────
cols(
  MLO = col_character(),
  `1984` = col_double(),
  `10` = col_double(),
  `1673.79` = col_double()
)
head(MLO.CH4)
# A tibble: 6 × 4
  MLO   `1984`  `10` `1673.79`
  <chr>  <dbl> <dbl>     <dbl>
1 MLO     1984    11     1676.
2 MLO     1984    12     1671.
3 MLO     1985     1     1662.
4 MLO     1985     2     1665.
5 MLO     1985     3     1677.
6 MLO     1985     4     1674.
headers <- c('site', 'year', 'month', 'value')
colnames(MLO.CH4) <- headers

head(MLO.CH4)
# A tibble: 6 × 4
  site   year month value
  <chr> <dbl> <dbl> <dbl>
1 MLO    1984    11 1676.
2 MLO    1984    12 1671.
3 MLO    1985     1 1662.
4 MLO    1985     2 1665.
5 MLO    1985     3 1677.
6 MLO    1985     4 1674.

This is better.
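
One side effect is worth noting in the output above: the column specification lists MLO, 1984, 10, and 1673.79 as column names, which suggests the first data row was read as a header and then overwritten when the columns were renamed. One way to avoid that is to pass the names through the col_names argument. The sketch below is minimal and assumes readr 2.0 or later (for I()). It uses a made-up miniature stand-in, not the NOAA file.

```r
library(readr)

# Hypothetical miniature file: two header lines, then two data rows (toy values)
txt <- "# header line 1\n# header line 2\nMLO 1984 10 1673.79\nMLO 1984 11 1676.10\n"

headers <- c("site", "year", "month", "value")

# Supplying col_names tells read_table() that the first line after the
# skipped header is data, so no observation is used up as column names
demo <- read_table(I(txt), skip = 2, col_names = headers,
                   show_col_types = FALSE)
print(demo)
```

With the real file, the same idea would be read_table(URL.MLO.CH4, skip = 71, col_names = headers). Whether 71 is still the right number to skip depends on the header count stated in the file itself.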

We can now visualize the data in Figure 12.6.

MLO.CH4 |> 
  mutate(decimal.Date = (year + month/12)) |> 
  ggplot(aes(x = decimal.Date, y = value)) +
  geom_point() +
  geom_line(alpha = 0.6) +
  geom_smooth() +
  theme_bw() +
  labs(x = 'Year', y = 'Methane concentration (ppb)',
       title = 'Mauna Loa - methane trend')
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Figure 12.6: Trend in Methane concentrations (ppb) at Mauna Loa, Hawaii
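
A small detail in the plotting code: year + month/12 places December at the start of the following year (1984 + 12/12 = 1985). At the scale of this figure the shift is hard to see, but a variant that anchors each month at its start is sketched below. This is an alternative under that assumption, not a correction the source data requires.

```r
# Three consecutive months spanning a year boundary
year  <- c(1984, 1984, 1985)
month <- c(11, 12, 1)

# As written in the chapter: December 1984 lands exactly on 1985
as_written <- year + month / 12

# Alternative: subtract 1 so January sits at the start of its own year
start_of_month <- year + (month - 1) / 12

print(data.frame(year, month, as_written, start_of_month))
```

Either version gives evenly spaced points. The difference is a constant one-month offset on the x axis.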

12.2.6 Advanced data visualization

Now that we have Mauna Loa, I want to add the Alert dataset to it using the code we developed last week. This code downloads the Alert dataset and renames its headers.

URL.ALT.CH4 <- file.path( 'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_alt_surface-flask_1_ccgg_month.txt')
ALT.CH4 <- read_table(URL.ALT.CH4, skip = 70)

── Column specification ────────────────────────────────────────────────────────
cols(
  ALT = col_character(),
  `1986` = col_double(),
  `10` = col_double(),
  `1774.12` = col_double()
)
colnames(ALT.CH4) <- headers

Now we can put the datasets together to make a combined visualization. The bind_rows() function from dplyr (part of the tidyverse) lets us stack the datasets together since they have the same headers. Then we can use the color argument to aes() to get two separate time series as shown in Figure 12.7. I also mapped site to the shape of the symbol to ensure that the two datasets are distinguishable.

CH4 <- bind_rows(ALT.CH4, MLO.CH4)

CH4 |> 
  mutate(decimal.Date = (year + month/12)) |> 
  ggplot(aes(x = decimal.Date, y = value, color = site, shape = site)) +
  geom_point() +
  geom_line(alpha = 0.6) +
  #geom_smooth(se = FALSE) +
  theme_bw() +
  labs(x = 'Year', y = 'Methane concentration (ppb)',
       title = 'Methane trend')
Figure 12.7: Trend in Methane concentrations (ppb) at Mauna Loa, Hawaii and Alert, Canada
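
Because bind_rows() matches columns by name, a quick count per site after stacking is a cheap sanity check. The sketch below uses two one-row toy tables with invented values in place of the real downloads.

```r
library(dplyr)

# Toy stand-ins for ALT.CH4 and MLO.CH4 (values are invented)
alt <- tibble(site = "ALT", year = 1986, month = 11, value = 1700)
mlo <- tibble(site = "MLO", year = 1984, month = 11, value = 1676)

# Rows are stacked; columns line up because the names are identical
both <- bind_rows(alt, mlo)
print(both)

# One row per site here; with the real data this shows how many months each site has
print(count(both, site))
```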

12.2.7 Downloading secured zip files

I have not yet found a reliable method to get this to work every time on Macs and PCs. Stay tuned.

12.3 Exercise 1.

  1. Go to the Environmental Justice Index Accessibility Tool.
  2. Pick a state from the dropdown menu.
  3. Press the Apply button
  4. An Actions button should appear; Figure 12.8 shows where that is. Press the Actions button, select Export All and choose Export to geoJSON.

Figure 12.8: ActionButton
  5. A file named Environmental Justice Index 2022 result.geojson should appear in your default download folder.
  6. Move the Environmental Justice Index 2022 result.geojson file to the working directory.
  7. Check the Files panel. Is Environmental Justice Index 2022 result.geojson there?
  8. Read in the file using read_sf() or st_read(). The dsn argument can point directly to the file name for this type of file. Assign it a name that incorporates EJI and the state abbreviation.
  9. Check the Environment panel. Did it import?
  10. Make a visualization. It does not have to be a map, because projections can be wonky.
[1] "C:/Dev/EA078_Fall2024"
CO_EJI_raw <- sf::st_read(dsn = 'Environmental Justice Index 2022 result.geojson')  |> 
  sf::st_transform(crs = 4326) |> 
  mutate(DSLPM = as.numeric(EPL_DSLPM)) # for some reason all the values are importing as character values
Reading layer `Environmental Justice Index 2022 result' from data source 
  `C:\Dev\EA078_Fall2024\Environmental Justice Index 2022 result.geojson' 
  using driver `GeoJSON'
Simple feature collection with 100 features and 119 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -106.0394 ymin: 37.35623 xmax: -103.7057 ymax: 40.00146
Geodetic CRS:  WGS 84
  # this code converts one column to numeric

Figure 12.9 shows Diesel PM from the EJI environmental indicators layer for the Colorado tracts imported above.

ggplot(data = CO_EJI_raw) +
  geom_sf(aes(fill = DSLPM), linewidth = 0) +
  theme_bw()
Figure 12.9: Diesel PM indicator in Colorado census tracts from the EJI tool
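
The comment in the import code notes that values arrive as character, and only EPL_DSLPM is converted. If several columns need the same treatment, across() can convert them in one step. The sketch below uses a toy table: EPL_DSLPM is the column name from the code above, while EPL_OTHER and the idea that related columns share the EPL_ prefix are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for the imported attribute table (values are invented)
toy <- tibble(
  EPL_DSLPM = c("0.12", "0.87"),
  EPL_OTHER = c("0.40", "0.55"),   # hypothetical second indicator column
  name      = c("tract A", "tract B")
)

# Convert every column whose name starts with "EPL_" and leave the rest alone
toy_num <- toy |>
  mutate(across(starts_with("EPL_"), as.numeric))

print(toy_num)
```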