Here is a next-step analysis jumping off of the web scraping and mapping post from a couple of months back. Showing points on a map is great for big picture visualizations, but what if we need to know a little more? Maybe we’re interested in relationships between points. Or specifically, how far are these point from each other? We can calculate euclidean distances between our points, but this doesn’t really translate to the real world. On a map, no 2 distances are ever really equal. For example, it takes about 30 minutes to drive from the Inwood, the most northern neighborhood in Manhattan, to Battery Park in the south. In the same amount of time it takes to cover those 13 miles in NYC, one can cover about double the distance driving from Grand Rapids to the city of Holland in Western Michigan.

With my mapping project, we needed a way to classify points based on their driving distance from a group of key addresses, which were already existing outdoor specialty accounts. We wanted to expand distribution with as little effect on our already established core business partners. I came across the gmapdistance package, developed and maintained by economist Rodrigo Azuero. There were 3 big appeals for me -

  • It returns a list of 3 values: status of the query, driving distance in meters, and driving time in seconds.
  • It can run on 2 vectors of locations, returning a list of matrices covering each combination of inputs.
  • It takes longitude and latitude as inputs, which we’ve already retrieved in the last exercise.

With my vectors of scraped Dillard’s stores and outdoor specialty locations, I set out to find an optimal group for expansion.

Setup

To get started, we need the below libraries.

library(tidyverse)
library(ggmap)
library(gmapsdistance)
library(knitr)

Similar to the the coordinate retriever, we will need to set up our Google Maps API key. You can use this video to guide you through set up. The set.api.key sort of applies our key to the environment, allowing us to call the API without having to specify it each time.

google_api_key <- "your_key"
register_google(key = google_api_key)
set.api.key(google_api_key)

Data Collection

Because this was covered in this post, I will simply load in my addresses. This can be done with any vector of addresses. They don’t even need to be in the formal street-city-state-postal format either! Anything that you’d search for in Google Maps will function.


The gmapdistance Package

The main function in this package is gmapdistance. It needs 3 minimum arguments: origin, destination, and mode. The last of which takes “bicycling”, “walking”, “transit”, or “driving” as a string. Note that we need to use + anywhere we’d have a space.

gmapsdistance(origin = "inwood+manhattan",
              destination = "battery+park",
              mode = "driving")
## $Time
## [1] 1882
## 
## $Distance
## [1] 20896
## 
## $Status
## [1] "OK"

The function returns a named list of length 3:

  • Time - number of seconds to complete the specified trip.
  • Distance - number of meters for the specified trip.
  • Status - returns “OK” if retrieval did not cause an error.

So, it takes about 30 minutes (1803 seconds / 60) to go from the Inwood neighborhood to Battery Park in Manhattan. Let’s check Grand Rapids to Holland, MI!

gmapsdistance(origin = "grand+rapids+mi",
              destination = "holland+mi",
              mode = "driving")
## $Time
## [1] 2336
## 
## $Distance
## [1] 52443
## 
## $Status
## [1] "OK"

Almost the same amount of time driving, yet we cover so much more ground! These sets of points are proving to be similar in terms of proximity - something we wold not have discovered using euclidean distance.

Vector Inputs

The real power of this function comes in its ability to work with vectors of locations.

origins <- c("inwood+manhattan", "grand+rapids+mi", "washington+memorial")
destinations <- c("battery+park+ny", "holland+mi", "white+house+dc")

vector_inputs <- gmapsdistance(origin = origins,
                               destination = destinations,
                               mode = "driving")

vector_inputs$Time %>% kable()
or Time.battery+park+ny Time.holland+mi Time.white+house+dc
inwood+manhattan 1882 42383 14337
grand+rapids+mi 40724 2336 35732
washington+memorial 13608 36996 310

Similarly, the function returns a list of length 3, but each entry is a data frame with dimensions dependent on the input vectors. Looking at the times above, we see the ~30 minute trips from Inwood to Battery Park and Grand Rapids to Holland, while the Washington Memorial is just 5 minutes from the White House.

Processing Considerations

In the above example, we see that 9 trips were retrieved from the API. Each of these takes time and a long vector can make a script cumbersome. See below run time using system.time:

system.time({gmapsdistance(origin = origins,
                           destination = destinations,
                           mode = "driving")})
##    user  system elapsed 
##    0.28    0.00   10.72

Run time hovers around 10 - 12 seconds. From my experience, a retrieval will be timed at about 1 second a piece. It’s important to hone in on your interested list before running any vectors of locations. There is a combinations argument that will compute either all combinations (“all”) or just corresponding entries (“pairwise”). This is useful if you’ve got specific pairs of locations you want to investigate.

Other Arguments for Tuning

The function has additional default arguments that can be modified including traffic_model (“optimistic” vs “pessimistic”), avoid (“tolls”, “highways”), and even arrival or departure times.


Using Longitude and Latitude as Inputs

Because I was storing longitude and latitude features in order to plot the locations on a map, I used another useful feature of the gmapdistance function - it takes coordinates, too! Specifically in “lat+lon” format. Using our sample vectors:

origins %>%
  data_frame(locations = .) %>%
  mutate_geocode(locations) %>%
  mutate(dist_input = paste0(lat, "+", lon)) %>%
  kable()
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
locations lon lat dist_input
inwood+manhattan -73.92120 40.86771 40.8677145+-73.9212019
grand+rapids+mi -85.66809 42.96336 42.9633599+-85.6680863
washington+memorial -77.03528 38.88948 38.8894838+-77.0352791
destinations %>%
  data_frame(locations = .) %>%
  mutate_geocode(locations) %>%
  mutate(dist_input = paste0(lat, "+", lon)) %>%
  kable()
locations lon lat dist_input
battery+park+ny -74.01703 40.70328 40.7032775+-74.0170279
holland+mi -86.10893 42.78752 42.7875235+-86.1089301
white+house+dc -77.03653 38.89768 38.8976763+-77.0365298

The dist_input column is now ready to be used as an input to gmapdistance!

Finding the best Dillards

The above method was used to convert the lists of potential Dillards locations and current Outdoor Specialty accounts into 2 input vectors in the dist_input format. As mentioned above, the vectors of just a few hundred entries would have produced a massive data frame of retrievals. To cut back on run time, I prefiltered the vectors based on state and then ran them through the gmapdistance function. Below sample was used to build a function that applied to a list of states.

# Prefilter both groups to cut on processing
dillards_fl <- dillards_dist %>% filter(state == "FL")
outdoor_fl <- outdoor_dist %>% filter(state == "FL")

# Run function
output_fl <- gmapsdistance(dillards_fl$dist_input, outdoor_fl$dist_input, mode = "driving")

# Add address field to line up coordinate inputs with readable locations
output_fl$Time <- dillards_fl$address

# Store time only
time_matrix_fl <- output_fl$Time

# Function
get_distances <- function(..state, group_1, group_2){
  group_1_filt <- group_1 %>% filter(state == ..state)
  group_2_filt <- group_2 %>% filter(state == ..state)
  output <- gmapsdistance(group_1_filt$dist_input, group_2_filt$dist_input, mode = "driving")  
  output$Time <- group_1_filt$address
  return(output$Time)
}

Using tidyr::gather to Analyze

The dataframe output is useful at a glance, but we are better off using a long table to systematically analyze. Using gather allows us to see the travel time for specific pairs of locations. For better readability, we can add in a column to show the time in minutes using mutate.

vector_inputs$Time %>%
  gather(destination, time, -or) %>%
  mutate(time_minutes = round(time / 60.0, digits = 0)) 
##                    or          destination  time time_minutes
## 1    inwood+manhattan Time.battery+park+ny  1882           31
## 2     grand+rapids+mi Time.battery+park+ny 40724          679
## 3 washington+memorial Time.battery+park+ny 13608          227
## 4    inwood+manhattan      Time.holland+mi 42383          706
## 5     grand+rapids+mi      Time.holland+mi  2336           39
## 6 washington+memorial      Time.holland+mi 36996          617
## 7    inwood+manhattan  Time.white+house+dc 14337          239
## 8     grand+rapids+mi  Time.white+house+dc 35732          596
## 9 washington+memorial  Time.white+house+dc   310            5

Using various time thresholds, I was able to grade each individual Dillard’s location by how many Outdoor Specialty stores were within that many minutes of driving.

Final Map

After agreeing on driving time thresholds, the driving distances between pairs were filtered and counted against each Dillard location. For example, Store in Mall X has 2 outdoor specialty accounts within a 30 minute drive. Is that acceptable?

After internal discussions, we decided on a selection of stores and ran the below map showing mall names and locations to communicate with our partners.

# Plot malls on US map with labels showing mall name.
ggmap(mapped_region) +
  geom_point(aes(x = lon, y = lat), alpha = 0.8, data = final_list_coord) +
  geom_label(aes(x = lon, y = lat, label = Mall), 
             position = position_jitter(width = 0.2, height = 0.2),
             label.padding = unit(0.1, "lines"),
             alpha = 0.8, size = 1, data = final_list_coord) +
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top",
        legend.title = element_blank()) +
  ggtitle(label = "Dillards Locations for Consideration",
          subtitle = "Based on Driving Distance to Outdoor Specialty Analysis")