Mapping mackerels
Notes: Used all of these resources, especially Bob Rudis’s guide. Invaluable stuff.
- https://rud.is/b/2015/05/14/geojson-hexagonal-statebins-in-r/
- https://geocompr.robinlovelace.net/adv-map.html
- https://www.r-graph-gallery.com/328-hexbin-map-of-the-usa.html
- https://source.opennews.org/articles/choosing-right-map-projection/
Ohio for the win!
Ohio leads all other states with 11,342 “mackerels” (words that share no letters with exactly one state’s name).
## # A tibble: 10 x 3
##    state       state_abbr mackerels
##    <chr>       <chr>          <dbl>
##  1 Ohio        OH             11342
##  2 Alabama     AL              8274
##  3 Utah        UT              6619
##  4 Mississippi MS              4863
##  5 Hawaii      HI              1763
##  6 Kentucky    KY              1580
##  7 Wyoming     WY              1364
##  8 Tennessee   TN              1339
##  9 Alaska      AK              1261
## 10 Nevada      NV              1229
Code
Note: Lots of code on this one, available on GitHub
The approach is very similar to a lot of text analytics.
- We start by representing the words and states as vectors, in this case with 26 boolean elements to indicate the presence or absence of each letter. (We don’t care about the number of occurrences.)
- Then, we can use a simple matrix multiplication to see if any of the letters are shared.
- Using a sparse matrix format with boolean entries helps speed up the calculations a bit.
import pandas as pd
import scipy as sp
from scipy.sparse import csr_matrix
import string


def any_shared_letters(states, words):
    """
    For a list of states and words, find whether the state uses any letter
    in the word.

    1. Construct a matrix from the list of tokens. For each token, construct
       a vector of booleans to indicate whether each letter is in the token.
       Then stitch together these vectors row-wise into a sparse matrix.
    2. Multiply the word matrix by the transpose of the state matrix. The
       result will have a row for each word, and a column for each state.
       For each cell, we will have an indicator of whether the word shares
       any letter with the state.
    """
    state_matrix = sp.sparse.csr_matrix([
        [
            letter in state
            for letter in string.ascii_lowercase
        ]
        for state in states
    ], dtype=bool)
    word_matrix = sp.sparse.csr_matrix([
        [
            letter in word
            for letter in string.ascii_lowercase
        ]
        for word in words
    ], dtype=bool)
    prod = word_matrix * state_matrix.T
    pdf = pd.DataFrame(
        data=prod.A,
        index=words,
        columns=states)
    return pdf
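As a sanity check, the same letter-overlap test can be written directly with Python sets. The words and states below are just illustrative toy inputs, not rows from the real run:

```python
# Toy check of the letter-overlap logic using plain sets.
states = ["ohio", "utah"]
words = ["mackerel", "fuzz"]

# True wherever a word and a state name share at least one letter.
shared = {
    (word, state): bool(set(word) & set(state))
    for word in words
    for state in states
}

print(shared[("mackerel", "ohio")])  # False: no letters in common
print(shared[("mackerel", "utah")])  # True: they share an "a"
```

The sparse-matrix version computes the same thing, just for every word–state pair at once.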
# Run Python routine via reticulate
pdf <- py$any_shared_letters(tolower(state.name), as.vector(words$word))
mackerels <- pdf %>%
  rownames_to_column(var = "keyword") %>%
  mutate(num_states = rowSums(subset(., select = -keyword))) %>%
  filter(num_states == 49) %>%
  select(-num_states) %>%
  gather(key = "state", value = "any_shared_letters", -keyword) %>%
  filter(!any_shared_letters)
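In plain Python, that filter amounts to: keep a word when exactly one state shares no letters with it. A sketch with a toy three-state list (not the full fifty):

```python
# Sketch of the mackerel filter: a word is a "mackerel" for a state when
# that state is the only one sharing no letters with the word.
def find_mackerels(words, states):
    mackerels = []
    for word in words:
        # States whose names have no letters in common with the word.
        disjoint = [s for s in states if not set(word) & set(s)]
        if len(disjoint) == 1:
            mackerels.append((word, disjoint[0]))
    return mackerels

states = ["ohio", "utah", "iowa"]  # toy list, not all fifty states
print(find_mackerels(["mackerel", "fuzz"], states))
# [('mackerel', 'ohio')] -- "fuzz" misses both "ohio" and "iowa", so it drops out
```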
states_df <- tibble(state = state.name)
mackerel_state_summary <- mackerels %>%
  mutate(state = str_to_title(state)) %>%
  count(state, sort = TRUE) %>%
  right_join(states_df, by = "state") %>%
  mutate(n = replace(n, is.na(n), 0)) %>%
  mutate(state_abbr = state.abb[match(state, state.name)])
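The same count-and-zero-fill summary can be sketched in pandas. The rows and the three-state list below are toy data; the column names `keyword`, `state`, and `n` just mirror the R objects:

```python
import pandas as pd

# Toy mackerel rows standing in for the real table.
mackerels = pd.DataFrame({
    "keyword": ["zyzzyva", "fuzz", "recession"],
    "state": ["ohio", "ohio", "utah"],
})
all_states = ["Iowa", "Ohio", "Utah"]  # toy state list

summary = (
    mackerels["state"].str.title()      # match the title-cased state names
    .value_counts()
    .reindex(all_states, fill_value=0)  # states with no mackerels get n = 0
    .rename_axis("state")
    .reset_index(name="n")
)
print(summary)
```

The `reindex` with `fill_value=0` plays the role of the `right_join` plus `replace(n, is.na(n), 0)` above.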
Less promising viz
Attempt to use the statebins package
It worked fine, but the hexbin approach looked better. Might have been able to improve it with more effort, though.
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
Straight map using Albers projection
Raises the question: “Why use a map at all?”
Top mackerel
Notes:
- Looking for something like Paula Scher’s maps
- Could we find better words using word frequency? (Gutenberg data?)
- Is mapping this on a physical map a good rendering? Just nouns? Just good nouns?
first_mackerel <- mackerels %>%
  mutate(str_len = str_length(keyword)) %>%
  group_by(state) %>%
  arrange(desc(str_len)) %>%
  slice(1)

first_mackerel %>%
  arrange(desc(str_len))
## # A tibble: 32 x 4
## # Groups:   state [32]
##    keyword                 state       any_shared_letters str_len
##    <chr>                   <chr>       <lgl>                <int>
##  1 counterproductivenesses alabama     FALSE                   23
##  2 hydrochlorofluorocarbon mississippi FALSE                   23
##  3 overscrupulousnesses    hawaii      FALSE                   20
##  4 microelectrophoretic    kansas      FALSE                   20
##  5 transcendentalnesses    ohio        FALSE                   20
##  6 expressionlessnesses    utah        FALSE                   20
##  7 psychophysiologists     nevada      FALSE                   19
##  8 intersubjectivities     oklahoma    FALSE                   19
##  9 spectrophotometers      indiana     FALSE                   18
## 10 biobibliographical      tennessee   FALSE                   18
## # … with 22 more rows
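The `group_by`/`slice` step boils down to keeping the longest word per state. In Python it could be sketched like this, with made-up rows rather than the actual table:

```python
# Keep the longest mackerel per state (toy rows, not the real table).
mackerels = [
    ("zyzzyva", "ohio"),
    ("fuzz", "ohio"),
    ("recession", "utah"),
]

longest = {}
for word, state in mackerels:
    if state not in longest or len(word) > len(longest[state]):
        longest[state] = word

print(longest)  # {'ohio': 'zyzzyva', 'utah': 'recession'}
```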