SEAI 2022 - R - Lab 3

Intro to R

Vincenzo Nardelli - vincnardelli@gmail.com - https://github.com/vincnardelli

Lab structure

Loading and plotting spatial data

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data = read.csv("data/dataNUTS3/dataNUTS3.csv")
head(data)

##   NUTS_ID year alpine            category pcgdp employ_1564 employ_1564_male
## 1   AT113 1977     No Predominantly rural    NA          NA               NA
## 2   AT112 1977     No Predominantly rural    NA          NA               NA
## 3   AT111 1977     No Predominantly rural    NA          NA               NA
## 4   AT111 1978     No Predominantly rural    NA          NA               NA
## 5   AT112 1978     No Predominantly rural    NA          NA               NA
## 6   AT113 1978     No Predominantly rural    NA          NA               NA
##   employ_1564_female employ_5564 unempl educ_2564 fertility pop pop_dens
## 1                 NA          NA     NA        NA        NA  NA       NA
## 2                 NA          NA     NA        NA        NA  NA       NA
## 3                 NA          NA     NA        NA        NA  NA       NA
## 4                 NA          NA     NA        NA        NA  NA       NA
## 5                 NA          NA     NA        NA        NA  NA       NA
## 6                 NA          NA     NA        NA        NA  NA       NA
##   ppopGE65 ppopLT15 research_exp state
## 1       NA       NA           NA    AT
## 2       NA       NA           NA    AT
## 3       NA       NA           NA    AT
## 4       NA       NA           NA    AT
## 5       NA       NA           NA    AT
## 6       NA       NA           NA    AT

data <- data %>% 
  filter(year == 2012)

head(data)

##   NUTS_ID year alpine            category pcgdp employ_1564 employ_1564_male
## 1   AT111 2012     No Predominantly rural 20600        70.4             75.9
## 2   AT113 2012     No Predominantly rural 22900        70.4             75.9
## 3   AT112 2012     No Predominantly rural 28500        70.4             75.9
## 4   AT127 2012     No Predominantly urban 43200        72.6             77.2
## 5   AT124 2012     No Predominantly rural 25300        72.6             77.2
## 6   AT125 2012     No Predominantly rural 19000        72.6             77.2
##   employ_1564_female employ_5564 unempl educ_2564 fertility    pop pop_dens
## 1               64.8        38.7    4.6      13.9      1.30  37561     54.1
## 2               64.8        38.7    4.6      13.9      1.30  97721     67.4
## 3               64.8        38.7    4.6      13.9      1.30 150500     99.0
## 4               68.1        42.9    4.6      17.2      1.49 320945    223.1
## 5               68.1        42.9    4.6      17.2      1.49 219354     48.1
## 6               68.1        42.9    4.6      17.2      1.49 123374     51.6
##    ppopGE65  ppopLT15 research_exp state
## 1 0.2108836 0.1269934           NA    AT
## 2 0.2007347 0.1280994           NA    AT
## 3 0.1879601 0.1367442           NA    AT
## 4 0.1863808 0.1476172           NA    AT
## 5 0.2069577 0.1352015           NA    AT
## 6 0.1995396 0.1324752           NA    AT

summary(data$category)

##    Length     Class      Mode 
##      1474 character character

data <- data %>% 
  mutate(category = as.factor(category), 
         state = as.factor(state))
summary(data$category)

##        Intermediate Predominantly rural Predominantly urban                NA's 
##                 450                 416                 261                 347

summary(data$state)

##  AT  BE  BG  CH  CY  CZ  DE  DK  EE  EL  ES  FI  FR  HR  HU  IE  IS  IT  LI  LT 
##  35  44  28  26   1  14 396  11   5  52  59  19 101  21  20   8   2 110   1  10 
##  LU  LV  ME  MK  MT  NL  NO  PL  PT  RO  SE  SI  SK  TR  UK 
##   1   6   1   8   2  40  19  72  25  42  21  12   8  81 173

library(ggplot2)

data <- data %>% 
  mutate(pop_dens_log = log(pop_dens), 
         pcgdp_log = log(pcgdp))

ggplot(data) + 
  geom_point(aes(pop_dens_log, pcgdp_log))

## Warning: Removed 236 rows containing missing values (geom_point).

ggplot(data) + 
  geom_point(aes(pop_dens_log, pcgdp_log, color=state), alpha=0.3)

## Warning: Removed 236 rows containing missing values (geom_point).

Areal data

sf package Link

Package sf represents simple features as records in a data.frame or tibble with a geometry list-column and natively in R all 17 simple feature types for all dimensions (XY, XYZ, XYM, XYZM)

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.

The standard is widely implemented in spatial databases (such as PostGIS), commercial GIS (e.g., ESRI ArcGIS) and forms the vector data basis for libraries such as GDAL. A subset of simple features forms the GeoJSON standard.

R has well-supported classes for storing spatial data (sp) and interfacing to the above mentioned environments (rgdal, rgeos), but has so far lacked a complete implementation of simple features, making conversions at times convoluted, inefficient or incomplete. The package sf tries to fill this gap, and aims at succeeding sp in the long term.

library(sf)

## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE

df_sf <- read_sf("data/dataNUTS3/NUTS_RG_10M_2013.shp")
df_sf

## Simple feature collection with 1951 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -63.08826 ymin: -21.38731 xmax: 55.83663 ymax: 71.17354
## Geodetic CRS:  ETRS89
## # A tibble: 1,951 × 5
##    NUTS_ID STAT_LEVL_ SHAPE_AREA SHAPE_LEN                              geometry
##    <chr>        <int>      <dbl>     <dbl>                    <MULTIPOLYGON [°]>
##  1 AT               0    10.0        23.9  (((15.15733 48.99178, 15.16025 48.94…
##  2 AT1              1     2.85       11.2  (((15.15733 48.99178, 15.16025 48.94…
##  3 AT11             2     0.473       5.82 (((17.09271 48.09965, 17.06741 48.03…
##  4 AT111            3     0.0866      1.23 (((16.64622 47.4466, 16.57575 47.406…
##  5 AT112            3     0.212       2.81 (((17.1608 48.00666, 17.09466 47.970…
##  6 AT113            3     0.175       2.30 (((16.43376 47.35292, 16.48374 47.28…
##  7 AT12             2     2.33        9.92 (((15.15733 48.99178, 15.16025 48.94…
##  8 AT121            3     0.404       3.75 (((15.50768 48.3196, 15.52095 48.284…
##  9 AT122            3     0.401       3.41 (((16.38889 47.8816, 16.35772 47.866…
## 10 AT123            3     0.148       2.37 (((15.99565 48.14571, 15.91586 48.08…
## # … with 1,941 more rows

df_sf <- df_sf %>% 
  left_join(data, by="NUTS_ID") 

df_sf

## Simple feature collection with 1951 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -63.08826 ymin: -21.38731 xmax: 55.83663 ymax: 71.17354
## Geodetic CRS:  ETRS89
## # A tibble: 1,951 × 24
##    NUTS_ID STAT_LEVL_ SHAPE_AREA SHAPE_LEN                        geometry  year
##    <chr>        <int>      <dbl>     <dbl>              <MULTIPOLYGON [°]> <int>
##  1 AT               0    10.0        23.9  (((15.15733 48.99178, 15.16025…    NA
##  2 AT1              1     2.85       11.2  (((15.15733 48.99178, 15.16025…    NA
##  3 AT11             2     0.473       5.82 (((17.09271 48.09965, 17.06741…    NA
##  4 AT111            3     0.0866      1.23 (((16.64622 47.4466, 16.57575 …  2012
##  5 AT112            3     0.212       2.81 (((17.1608 48.00666, 17.09466 …  2012
##  6 AT113            3     0.175       2.30 (((16.43376 47.35292, 16.48374…  2012
##  7 AT12             2     2.33        9.92 (((15.15733 48.99178, 15.16025…    NA
##  8 AT121            3     0.404       3.75 (((15.50768 48.3196, 15.52095 …  2012
##  9 AT122            3     0.401       3.41 (((16.38889 47.8816, 16.35772 …  2012
## 10 AT123            3     0.148       2.37 (((15.99565 48.14571, 15.91586…  2012
## # … with 1,941 more rows, and 18 more variables: alpine <chr>, category <fct>,
## #   pcgdp <dbl>, employ_1564 <dbl>, employ_1564_male <dbl>,
## #   employ_1564_female <dbl>, employ_5564 <dbl>, unempl <dbl>, educ_2564 <dbl>,
## #   fertility <dbl>, pop <dbl>, pop_dens <dbl>, ppopGE65 <dbl>, ppopLT15 <dbl>,
## #   research_exp <dbl>, state <fct>, pop_dens_log <dbl>, pcgdp_log <dbl>

ggplot(df_sf) +
  geom_sf(aes(fill=pop_dens))

ggplot(df_sf) +
  geom_sf(aes(fill=pop_dens), lwd=0.1) +
  xlim(-11, 30) +
  ylim(34, 70)

library(tmap)
df_sf <- st_make_valid(df_sf)
tm_shape(df_sf) +
  tm_polygons("pop_dens")

tm_shape(df_sf, bbox=tmaptools::bb(matrix(c(-11,30,34,70),2,2))) +
    tm_polygons(col = "pop_dens", style = "quantile")

Interactive mapping

# map <- tm_shape(df_sf, bbox=tmaptools::bb(matrix(c(-11,30,34,70),2,2))) +
#     tm_polygons(col = "pop_dens", style = "quantile")
# 
# tmap_leaflet(map)

Point data

kc = read_sf("data/kingcounty/kc_house.shp")
kc

## Simple feature collection with 21613 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.519 ymin: 47.1559 xmax: -121.315 ymax: 47.7776
## Geodetic CRS:  WGS 84
## # A tibble: 21,613 × 22
##           id date   price bedrooms bathrooms sqft_liv sqft_lot floors waterfront
##        <dbl> <chr>  <dbl>    <dbl>     <dbl>    <dbl>    <dbl>  <dbl>      <dbl>
##  1    7.13e9 2014… 2.22e5        3         1     1180     5650      1          0
##  2    6.41e9 2014… 5.38e5        3         2     2570     7242      2          0
##  3    5.63e9 2015… 1.8 e5        2         1      770    10000      1          0
##  4    2.49e9 2014… 6.04e5        4         3     1960     5000      1          0
##  5    1.95e9 2015… 5.1 e5        3         2     1680     8080      1          0
##  6    7.24e9 2014… 1.23e6        4         4     5420   101930      1          0
##  7    1.32e9 2014… 2.58e5        3         2     1715     6819      2          0
##  8    2.01e9 2015… 2.92e5        3         1     1060     9711      1          0
##  9    2.41e9 2015… 2.30e5        3         1     1780     7470      1          0
## 10    3.79e9 2015… 3.23e5        3         2     1890     6560      2          0
## # … with 21,603 more rows, and 13 more variables: view <dbl>, condition <dbl>,
## #   grade <dbl>, sqft_above <dbl>, sqft_basmt <dbl>, yr_built <dbl>,
## #   yr_renov <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>, sqft_liv15 <dbl>,
## #   sqft_lot15 <dbl>, geometry <POINT [°]>

Coordinate Reference Systems

A coordinate reference system (CRS) then defines how the two-dimensional, projected map in your GIS relates to real places on the earth. The decision of which map projection and CRS to use depends on the regional extent of the area you want to work in, on the analysis you want to do, and often on the availability of data.

https://datacarpentry.org/organization-geospatial/03-crs/

st_crs(kc)

## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

Let’s try to do the same starting from non-projected non-spatial dataset from a csv file!

kc = read.csv("data/kingcounty/kc_house_data.csv")
head(kc)

##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

kc <- kc %>% 
  st_as_sf(coords = c("long", "lat"), crs = 4326)

ggplot(kc) +
  geom_sf(aes(color=price))

tm_shape(kc) +
  tm_bubbles(col="price", scale=0.5, alpha=0.8, style="quantile", 
             border.lwd=0)

# map <- kc %>% 
#   head(1000) %>% 
#   tm_shape() +
#   tm_bubbles(col="price", scale=0.2, alpha=0.8, style="quantile", 
#              border.lwd=0)
# 
# tmap_leaflet(map)

Computation of W matrix

spdep package https://r-spatial.github.io/spdep/

W matrix for regular grid data

library(spdep)

## Loading required package: sp

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

n<-4 #the dimension of the grid is n x n
nb_grid <- cell2nb(n, n)  
summary(nb_grid)

## Neighbour list object:
## Number of regions: 16 
## Number of nonzero links: 48 
## Percentage nonzero weights: 18.75 
## Average number of links: 3 
## Link number distribution:
## 
## 2 3 4 
## 4 8 4 
## 4 least connected regions:
## 1:1 4:1 1:4 4:4 with 2 links
## 4 most connected regions:
## 2:2 3:2 2:3 3:3 with 4 links

xyc <- attr(nb_grid, "region.id")
xy <- matrix(as.integer(unlist(strsplit(xyc, ":"))), ncol=2, byrow=TRUE)
plot(nb_grid, xy)

W matrix for irregular data

In the case of irregular data, it is possible to import a GAL (.gal).

#nb <- read.gal("path_to_file")

W matrix from sf

columbus <- read_sf("data/columbus/columbus.shp")

nb<-poly2nb(columbus, queen=T)
nb

## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 236 
## Percentage nonzero weights: 9.829238 
## Average number of links: 4.816327

summary(nb)

## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 236 
## Percentage nonzero weights: 9.829238 
## Average number of links: 4.816327 
## Link number distribution:
## 
##  2  3  4  5  6  7  8  9 10 
##  5  9 12  5  9  3  4  1  1 
## 5 least connected regions:
## 1 6 42 46 47 with 2 links
## 1 most connected region:
## 20 with 10 links

nb[[1]]

## [1] 2 3

columbus_sp <- as(columbus, 'Spatial')

nb_sf <- as(nb2lines(nb, coords = coordinates(columbus_sp)), 'sf')

## Warning in CRS(proj4string): CRS: projargs should not be NULL; set to NA

nb_sf <- st_set_crs(nb_sf, st_crs(columbus))

ggplot(columbus) + 
  geom_sf(fill = 'gray', color = 'white') +
  geom_sf(data = nb_sf) +
  theme_minimal() +
  ylab("Latitude") +
  xlab("Longitude")

List W object it resume a W matrix, saving space. A listw object has 3 components: 1) a nb object 2)list of n numeric vectors, each of the same length as the corresponding non-zero vectors in the nb object. These give the values of the spatial weights for each i-j neighbour pair. 3)the style of W as a character code: “B” for binary weights taking values zero or one, “W” for row-standardized matrix,“C” is globally standardised

listw <- nb2listw(nb, style="W")
listw

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 236 
## Percentage nonzero weights: 9.829238 
## Average number of links: 4.816327 
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 49 2401 49 22.75119 203.7091

nb2listw(nb, style="B")

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 236 
## Percentage nonzero weights: 9.829238 
## Average number of links: 4.816327 
## 
## Weights style: B 
## Weights constants summary:
##    n   nn  S0  S1   S2
## B 49 2401 236 472 5304

W matrix for point data

boston <- read_sf("data/boston/boston.shp")

coord <- cbind(boston$LON,boston$LAT)
head(coord)

##          [,1]    [,2]
## [1,] -70.9550 42.2550
## [2,] -70.9500 42.2875
## [3,] -70.9360 42.2830
## [4,] -70.9280 42.2930
## [5,] -70.9220 42.2980
## [6,] -70.9165 42.3040

KNN

knn = knearneigh(coord, k=5)
nb = knn2nb(knn)
plot(nb, coord)

listw <- nb2listw(nb)
listw

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 2530 
## Percentage nonzero weights: 0.9881423 
## Average number of links: 5 
## Non-symmetric neighbours list
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0  S1      S2
## W 506 256036 506 178 2087.92

Distance

distM<- as.matrix(dist(coord))
distM[1:5, 1:5]

##            1          2          3          4          5
## 1 0.00000000 0.03288237 0.03383785 0.04661545 0.05420332
## 2 0.03288237 0.00000000 0.01470544 0.02267708 0.02990401
## 3 0.03383785 0.01470544 0.00000000 0.01280625 0.02051828
## 4 0.04661545 0.02267708 0.01280625 0.00000000 0.00781025
## 5 0.05420332 0.02990401 0.02051828 0.00781025 0.00000000

W1<- 1/(distM)
diag(W1) <- 0


W2<- 1/(1+distM)^2
diag(W2) <- 0 

W3<- exp(-distM^2)
diag(W3) <- 0 

mat2listw(W1)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 255530 
## Percentage nonzero weights: 99.80237 
## Average number of links: 505 
## 
## Weights style: M 
## Weights constants summary:
##     n     nn      S0        S1           S2
## M 506 256036 4006196 414055665 149609290044

Threshold distance

nb <- dnearneigh(coord, 0, 1.5, longlat = T)
nb

## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 6648 
## Percentage nonzero weights: 2.59651 
## Average number of links: 13.13834 
## 49 regions with no links:
## 1 41 55 56 65 66 67 68 196 197 198 199 200 201 203 204 205 253 254 255
## 256 257 274 284 285 286 287 288 290 292 300 302 303 304 331 336 342 343
## 344 345 348 349 350 351 352 353 354 355 356

plot(nb, coord)

nb <- dnearneigh(coord, 0, 5, longlat = T)
nb

## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 51400 
## Percentage nonzero weights: 20.0753 
## Average number of links: 101.581

listw <- nb2listw(nb)
listw

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 51400 
## Percentage nonzero weights: 20.0753 
## Average number of links: 101.581 
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0      S1       S2
## W 506 256036 506 29.0168 2042.721

plot(nb, coord)