Web Mining bankrate.com

Introduction

The purpose of this blog post is to demonstrate how to web mine the bankrate.com, primarily focusing on extracting and graphing with the R programming language the APY and minimum deposits for 1-year1, 3-year2, and 5-year3 CD Rates.

Setup

Before the analysis, some necessary libraries will be loaded. First, tidyverse4 and magrittr5 for their data management functions; second, flextable6 for table formatting; third, rvest7 for web mining; and fourth, plotly8 to display interactive graphs.

libs <- c('tidyverse', 'magrittr', 'flextable', 
          'rvest', 'plotly')

# If the library has not been installed,
## install and load it.
for (i in libs) {
  
  if (!require(i, character.only = TRUE)) {
    
    install.packages(i)
    library(i, character.only = TRUE)
    
  }
  
}

Web Scraping

Now, we can begin the web scraping by setting up a data frame of URLs from which to scrape.

links <- data.frame(category = c('1-Year CD Rates', 
                                 '3-Year CD Rates', 
                                 '5-Year CD Rates'),
                    url = c('https://www.bankrate.com/banking/cds/best-1-year-cd-rates/',
                            'https://www.bankrate.com/banking/cds/best-3-year-cd-rates/',
                            'https://www.bankrate.com/banking/cds/best-5-year-cd-rates/'),
                    stringsAsFactors = FALSE)

To scrape the websites, I use the rvest functions read_html(), html_node(), and html_table(). The first imports the URL into R; the second searches for a specific node (in this case, a table); and the third converts the information into a data frame. Then I use purrr’s set_names() to rename the key columns for easier referencing, followed by creating a new column based on the category column in the links data frame.

links_df <- with(links, map2_df(url, category, ~ {
  
  read_html(.x) %>%
    html_node('table') %>% 
    html_table() %>%
    .[, 1:3] %>% # Only these three columns are necessary.
    set_names(c('Institution', 'APY', 'Min_Deposit')) %>% 
    mutate(Category = .y)
  
}))

To know what the dataset looks like at this point, we employ the flextable() function from flextable.

links_df %>%
  arrange(Category, Institution) %>%
  flextable() %>%
  set_caption(caption = 'Table 1: Initial Dataset') %>%
  flextable::footnote(part = 'header', 
                      value = as_paragraph('Source: bankrate.com')) %>%
  autofit()
Table 1: Initial Dataset

Institution1

APY1

Min_Deposit1

Category1

Ally Bank

1.50%

$0

1-Year CD Rates

Amerant Bank

1.65%

$10,000

1-Year CD Rates

BMO Harris Bank

1.65%

$5,000

1-Year CD Rates

Capital One

1.50%

$0

1-Year CD Rates

CIT Bank

1.70%

$1,000

1-Year CD Rates

Citizens Access

1.50%

$5,000

1-Year CD Rates

Comenity Direct

1.70%

$1,500

1-Year CD Rates

Discover Bank

1.50%

$2,500

1-Year CD Rates

Limelight Bank*

1.65%

$1,000

1-Year CD Rates

Live Oak Bank

1.90%

$2,500

1-Year CD Rates

Marcus by Goldman Sachs

1.60%

$500

1-Year CD Rates

Synchrony Bank

1.50%

$2,000

1-Year CD Rates

TIAA Bank

1.50%

$5,000

1-Year CD Rates

Ally Bank

1.55% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$0

3-Year CD Rates

American Express National Bank

1.55% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$0

3-Year CD Rates

Comenity Direct

1.75%

$1,500

3-Year CD Rates

Delta Community Credit Union

1.75%

$1,000

3-Year CD Rates

Discover Bank

1.60% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$2,500

3-Year CD Rates

First Internet Bank

1.61% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$1,000

3-Year CD Rates

Marcus by Goldman Sachs

1.60%

$500

3-Year CD Rates

SchoolsFirst Federal Credit Union

1.65%

$20,000 minimum for this APY

3-Year CD Rates

TIAA Bank

1.60% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$5,000

3-Year CD Rates

American Express National Bank

1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$0

5-Year CD Rates

CIT Bank

1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$1,000

5-Year CD Rates

Comenity Direct

1.80%

$1,500

5-Year CD Rates

Delta Community Credit Union

2.00%

$1,000

5-Year CD Rates

First Internet Bank of Indiana

1.71%

$1,000

5-Year CD Rates

Golden 1 Credit Union

1.80%

$1,000

5-Year CD Rates

Marcus by Goldman Sachs

1.65%

$500

5-Year CD Rates

SchoolsFirst Federal Credit Union

1.85%

$20,000

5-Year CD Rates

TIAA Bank

1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date.

$5,000

5-Year CD Rates

1Source: bankrate.com

Data Cleaning

Because the non-numeric symbols in APY and Min_Deposit make those columns character instead of numeric, we need to conduct some data cleaning so that they are numeric for the plotting.

# Specific hard-coding for this particular institution.
links_df$Min_Deposit[links_df$Min_Deposit == '$20,000 minimum for this APY'] <- 20000

# Make APY and Min_Deposit numeric
links_df[, c('APY', 'Min_Deposit')] %<>%
  map_df(~ as.numeric(gsub('%|\\$|,', '', .x)))
## Warning in .f(.x[[i]], ...): NAs introduced by coercion

The following table is our new, cleaned dataset.

links_df %>%
  arrange(Category, Institution) %>%
  flextable() %>%
  set_caption(caption = 'Table 2: Cleaned Dataset') %>%
  flextable::footnote(part = 'header', 
                      value = as_paragraph('Source: bankrate.com')) %>%
  autofit()
Table 2: Cleaned Dataset

Institution1

APY1

Min_Deposit1

Category1

Ally Bank

1.50

0

1-Year CD Rates

Amerant Bank

1.65

10000

1-Year CD Rates

BMO Harris Bank

1.65

5000

1-Year CD Rates

Capital One

1.50

0

1-Year CD Rates

CIT Bank

1.70

1000

1-Year CD Rates

Citizens Access

1.50

5000

1-Year CD Rates

Comenity Direct

1.70

1500

1-Year CD Rates

Discover Bank

1.50

2500

1-Year CD Rates

Limelight Bank*

1.65

1000

1-Year CD Rates

Live Oak Bank

1.90

2500

1-Year CD Rates

Marcus by Goldman Sachs

1.60

500

1-Year CD Rates

Synchrony Bank

1.50

2000

1-Year CD Rates

TIAA Bank

1.50

5000

1-Year CD Rates

Ally Bank

NA

0

3-Year CD Rates

American Express National Bank

NA

0

3-Year CD Rates

Comenity Direct

1.75

1500

3-Year CD Rates

Delta Community Credit Union

1.75

1000

3-Year CD Rates

Discover Bank

NA

2500

3-Year CD Rates

First Internet Bank

NA

1000

3-Year CD Rates

Marcus by Goldman Sachs

1.60

500

3-Year CD Rates

SchoolsFirst Federal Credit Union

1.65

20000

3-Year CD Rates

TIAA Bank

NA

5000

3-Year CD Rates

American Express National Bank

NA

0

5-Year CD Rates

CIT Bank

NA

1000

5-Year CD Rates

Comenity Direct

1.80

1500

5-Year CD Rates

Delta Community Credit Union

2.00

1000

5-Year CD Rates

First Internet Bank of Indiana

1.71

1000

5-Year CD Rates

Golden 1 Credit Union

1.80

1000

5-Year CD Rates

Marcus by Goldman Sachs

1.65

500

5-Year CD Rates

SchoolsFirst Federal Credit Union

1.85

20000

5-Year CD Rates

TIAA Bank

NA

5000

5-Year CD Rates

1Source: bankrate.com

Graphs

Before we plot the results, a function will be written to generalize the desired graphing procedure: APY on the x-axis, minimum deposit on the y-axis, Institutions labeled on the graph, and colors being noted by their CD rates category (1-, 3-, and 5-year). I aim for a more minimalistic type graph with theme_light() and removing specific gridlines with panel.grid.minor and panel.grid.major.x within theme().

apy_v_md <- function(data, title) {
  
  ggplot(data) +
  aes(x = APY, 
      y = Min_Deposit, 
      label = Institution, 
      col = Category) + 
  geom_text(position = position_jitter()) + 
  labs(x = 'APY', 
       y = 'Minimum Deposit', 
       col = '',
       title = title,
       caption = 'Source: bankrate.com') +
  theme_light() + 
  theme(panel.grid.minor   = element_blank(),
        panel.grid.major.x = element_blank())
  
}

We make the graph interactive with ggplotly() from plotly.

g1 <- apy_v_md(links_df, title = 'APY vs. Minimum Deposits')

ggplotly(g1)

The outliers–i.e., those with a minimum deposit of 10,000 or above–seem to be masking the underlying relationship between APY and minimum deposits. Filtering them out of the graph reveals a downward trend, as well as the observation that the 3- and 5-year CD rates tend to have higher APY and lower minimum deposits

g2 <- links_df %>%
  filter(Min_Deposit < 10^4) %>%
  apy_v_md(title = 'APY vs. Minimum Deposits (Outliers Removed)')

ggplotly(g2)

Conclusion

In conclusion, we web mined bankrate.com using the rvest functions read_html(), html_node(), and html_table(). The results were presented in tables via flextable and graphs via ggplot2 and plotly. I have observed that there is a general inverse relationship between APY and minimum deposits, where 3- and 5-year CD rates tend to have higher of the former and lower of the latter.

Feel free to modify and expand the code presented for your own projects!