Mapping mackerels
Notes: Used all of these resources, especially Bob Rudis’s guide. Invaluable stuff.
- https://rud.is/b/2015/05/14/geojson-hexagonal-statebins-in-r/
- https://geocompr.robinlovelace.net/adv-map.html
- https://www.r-graph-gallery.com/328-hexbin-map-of-the-usa.html
- https://source.opennews.org/articles/choosing-right-map-projection/
Ohio for the win!
Ohio leads all other states with 11,342 “mackerels” (words that share no letters with exactly one state’s name).
## # A tibble: 10 x 3
##    state       state_abbr mackerels
##    <chr>       <chr>          <dbl>
##  1 Ohio        OH             11342
##  2 Alabama     AL              8274
##  3 Utah        UT              6619
##  4 Mississippi MS              4863
##  5 Hawaii      HI              1763
##  6 Kentucky    KY              1580
##  7 Wyoming     WY              1364
##  8 Tennessee   TN              1339
##  9 Alaska      AK              1261
## 10 Nevada      NV              1229
Code
Note: Lots of code on this one, available on GitHub
The approach is very similar to a lot of text analytics.
- We start by representing the words and states as vectors, in this case with 26 boolean elements to indicate the presence or absence of each letter. (We don’t care about the number of occurrences.)
- Then, we can use a simple matrix multiplication to see if any of the letters are shared.
- Using a sparse matrix format with boolean entries helps speed up the calculations a bit.
import pandas as pd
import scipy as sp
from scipy.sparse import csr_matrix
import string


def any_shared_letters(states, words):
    """
    For a list of states and words, find whether the state uses any letter
    in the word.

    1. Construct a matrix from the list of tokens. For each token, construct
       a vector of booleans to indicate whether each letter is in the token.
       Then stitch together these vectors row-wise into a sparse matrix.
    2. Multiply the word matrix by the transpose of the state matrix. The
       result will have a row for each word, and a column for each state.
       For each cell, we will have an indicator of whether the word shares
       any letter with the state.
    """
    state_matrix = sp.sparse.csr_matrix([
        [
            letter in state
            for letter in string.ascii_lowercase
        ]
        for state in states
    ], dtype=bool)
    word_matrix = sp.sparse.csr_matrix([
        [
            letter in word
            for letter in string.ascii_lowercase
        ]
        for word in words
    ], dtype=bool)
    prod = word_matrix * state_matrix.T
    pdf = pd.DataFrame(
        data=prod.A,
        index=words,
        columns=states)
    return pdf
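As a sanity check, the same letter-overlap test can be written directly with Python sets. The words and states below are just illustrative toy inputs, not rows from the real run:

```python
# Toy check of the letter-overlap logic using plain sets.
states = ["ohio", "utah"]
words = ["mackerel", "fuzz"]

# True wherever a word and a state name share at least one letter.
shared = {
    (word, state): bool(set(word) & set(state))
    for word in words
    for state in states
}

print(shared[("mackerel", "ohio")])  # False: no letters in common
print(shared[("mackerel", "utah")])  # True: they share an "a"
```

The sparse-matrix version computes the same thing, just for every word–state pair at once.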
# Run Python routine via reticulate
pdf <- py$any_shared_letters(tolower(state.name), as.vector(words$word))
mackerels <- pdf %>%
  rownames_to_column(var = "keyword") %>%
  mutate(num_states = rowSums(subset(., select = -keyword))) %>%
  filter(num_states == 49) %>%
  select(-num_states) %>%
  gather(key = "state", value = "any_shared_letters", -keyword) %>%
  filter(!any_shared_letters)
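In plain Python, that filter amounts to: keep a word when exactly one state shares no letters with it. A sketch with a toy three-state list (not the full fifty):

```python
# Sketch of the mackerel filter: a word is a "mackerel" for a state when
# that state is the only one sharing no letters with the word.
def find_mackerels(words, states):
    mackerels = []
    for word in words:
        # States whose names have no letters in common with the word.
        disjoint = [s for s in states if not set(word) & set(s)]
        if len(disjoint) == 1:
            mackerels.append((word, disjoint[0]))
    return mackerels

states = ["ohio", "utah", "iowa"]  # toy list, not all fifty states
print(find_mackerels(["mackerel", "fuzz"], states))
# [('mackerel', 'ohio')] -- "fuzz" misses both "ohio" and "iowa", so it drops out
```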
states_df <- tibble(state = state.name)
mackerel_state_summary <- mackerels %>%
  mutate(state = str_to_title(state)) %>%
  count(state, sort = TRUE) %>%
  right_join(states_df, by = "state") %>%
  mutate(n = replace(n, is.na(n), 0)) %>%
  mutate(state_abbr = state.abb[match(state, state.name)])
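The same count-and-zero-fill summary can be sketched in pandas. The rows and the three-state list below are toy data; the column names `keyword`, `state`, and `n` just mirror the R objects:

```python
import pandas as pd

# Toy mackerel rows standing in for the real table.
mackerels = pd.DataFrame({
    "keyword": ["zyzzyva", "fuzz", "recession"],
    "state": ["ohio", "ohio", "utah"],
})
all_states = ["Iowa", "Ohio", "Utah"]  # toy state list

summary = (
    mackerels["state"].str.title()      # match the title-cased state names
    .value_counts()
    .reindex(all_states, fill_value=0)  # states with no mackerels get n = 0
    .rename_axis("state")
    .reset_index(name="n")
)
print(summary)
```

The `reindex` with `fill_value=0` plays the role of the `right_join` plus `replace(n, is.na(n), 0)` above.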
Less promising viz
Attempt to use the statebins package
It worked fine, but the hexbin approach looked better. Might have been able to improve it with more effort, though.
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
Straight map using Albers projection
Raises the question: “Why use a map at all?”
Top mackerel
Notes:
- Looking for something like Paula Scher’s maps
- Could we find better words using word frequency? (Gutenberg data?)
- Is mapping this on a physical map a good rendering? Just nouns? Just good nouns?
first_mackerel <- mackerels %>%
  mutate(str_len = str_length(keyword)) %>%
  group_by(state) %>%
  arrange(desc(str_len)) %>%
  slice(1)

first_mackerel %>%
  arrange(desc(str_len))
## # A tibble: 32 x 4
## # Groups:   state [32]
##    keyword                 state       any_shared_letters str_len
##    <chr>                   <chr>       <lgl>                <int>
##  1 counterproductivenesses alabama     FALSE                   23
##  2 hydrochlorofluorocarbon mississippi FALSE                   23
##  3 overscrupulousnesses    hawaii      FALSE                   20
##  4 microelectrophoretic    kansas      FALSE                   20
##  5 transcendentalnesses    ohio        FALSE                   20
##  6 expressionlessnesses    utah        FALSE                   20
##  7 psychophysiologists     nevada      FALSE                   19
##  8 intersubjectivities     oklahoma    FALSE                   19
##  9 spectrophotometers      indiana     FALSE                   18
## 10 biobibliographical      tennessee   FALSE                   18
## # … with 22 more rows
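The `group_by`/`slice` step boils down to keeping the longest word per state. In Python it could be sketched like this, with made-up rows rather than the actual table:

```python
# Keep the longest mackerel per state (toy rows, not the real table).
mackerels = [
    ("zyzzyva", "ohio"),
    ("fuzz", "ohio"),
    ("recession", "utah"),
]

longest = {}
for word, state in mackerels:
    if state not in longest or len(word) > len(longest[state]):
        longest[state] = word

print(longest)  # {'ohio': 'zyzzyva', 'utah': 'recession'}
```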