Introduction
The purpose of this blog post is to demonstrate how to web mine the bankrate.com, primarily focusing on extracting and graphing with the R programming language the APY and minimum deposits for 1-year1, 3-year2, and 5-year3 CD Rates.
Setup
Before the analysis, some necessary libraries will be loaded. First, tidyverse
4 and magrittr
5 for their data management functions; second, flextable
6 for table formatting; third, rvest
7 for web mining; and fourth, plotly
8 to display interactive graphs.
libs <- c('tidyverse', 'magrittr', 'flextable',
'rvest', 'plotly')
# If the library has not been installed,
## install and load it.
for (i in libs) {
if (!require(i, character.only = TRUE)) {
install.packages(i)
library(i, character.only = TRUE)
}
}
Web Scraping
Now, we can begin the web scraping by setting up a data frame
of URLs from which to scrape.
links <- data.frame(category = c('1-Year CD Rates',
'3-Year CD Rates',
'5-Year CD Rates'),
url = c('https://www.bankrate.com/banking/cds/best-1-year-cd-rates/',
'https://www.bankrate.com/banking/cds/best-3-year-cd-rates/',
'https://www.bankrate.com/banking/cds/best-5-year-cd-rates/'),
stringsAsFactors = FALSE)
To scrape the websites, I use the rvest functions read_html()
, html_node()
, and html_table()
. The first imports the URL into R; the second searches for a specific node (in this case, a table); and the third converts the information into a data frame. Then I use purrr
’s set_names()
to rename the key columns for easier referencing, followed by creating a new column based on the category
column in the links
data frame.
links_df <- with(links, map2_df(url, category, ~ {
read_html(.x) %>%
html_node('table') %>%
html_table() %>%
.[, 1:3] %>% # Only these three columns are necessary.
set_names(c('Institution', 'APY', 'Min_Deposit')) %>%
mutate(Category = .y)
}))
To know what the dataset looks like at this point, we employ the flextable()
function from flextable
.
links_df %>%
arrange(Category, Institution) %>%
flextable() %>%
set_caption(caption = 'Table 1: Initial Dataset') %>%
flextable::footnote(part = 'header',
value = as_paragraph('Source: bankrate.com')) %>%
autofit()
Institution1 | APY1 | Min_Deposit1 | Category1 |
Ally Bank | 1.50% | $0 | 1-Year CD Rates |
Amerant Bank | 1.65% | $10,000 | 1-Year CD Rates |
BMO Harris Bank | 1.65% | $5,000 | 1-Year CD Rates |
Capital One | 1.50% | $0 | 1-Year CD Rates |
CIT Bank | 1.70% | $1,000 | 1-Year CD Rates |
Citizens Access | 1.50% | $5,000 | 1-Year CD Rates |
Comenity Direct | 1.70% | $1,500 | 1-Year CD Rates |
Discover Bank | 1.50% | $2,500 | 1-Year CD Rates |
Limelight Bank* | 1.65% | $1,000 | 1-Year CD Rates |
Live Oak Bank | 1.90% | $2,500 | 1-Year CD Rates |
Marcus by Goldman Sachs | 1.60% | $500 | 1-Year CD Rates |
Synchrony Bank | 1.50% | $2,000 | 1-Year CD Rates |
TIAA Bank | 1.50% | $5,000 | 1-Year CD Rates |
Ally Bank | 1.55% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $0 | 3-Year CD Rates |
American Express National Bank | 1.55% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $0 | 3-Year CD Rates |
Comenity Direct | 1.75% | $1,500 | 3-Year CD Rates |
Delta Community Credit Union | 1.75% | $1,000 | 3-Year CD Rates |
Discover Bank | 1.60% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $2,500 | 3-Year CD Rates |
First Internet Bank | 1.61% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $1,000 | 3-Year CD Rates |
Marcus by Goldman Sachs | 1.60% | $500 | 3-Year CD Rates |
SchoolsFirst Federal Credit Union | 1.65% | $20,000 minimum for this APY | 3-Year CD Rates |
TIAA Bank | 1.60% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $5,000 | 3-Year CD Rates |
American Express National Bank | 1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $0 | 5-Year CD Rates |
CIT Bank | 1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $1,000 | 5-Year CD Rates |
Comenity Direct | 1.80% | $1,500 | 5-Year CD Rates |
Delta Community Credit Union | 2.00% | $1,000 | 5-Year CD Rates |
First Internet Bank of Indiana | 1.71% | $1,000 | 5-Year CD Rates |
Golden 1 Credit Union | 1.80% | $1,000 | 5-Year CD Rates |
Marcus by Goldman Sachs | 1.65% | $500 | 5-Year CD Rates |
SchoolsFirst Federal Credit Union | 1.85% | $20,000 | 5-Year CD Rates |
TIAA Bank | 1.70% Rate as of 04/16/2020. This rate is updated periodically and may become temporarily out of date. | $5,000 | 5-Year CD Rates |
1Source: bankrate.com |
Data Cleaning
Because the non-numeric symbols in APY and Min_Deposit
make those columns character instead of numeric, we need to conduct some data cleaning so that they are numeric for the plotting.
# Specific hard-coding for this particular institution.
links_df$Min_Deposit[links_df$Min_Deposit == '$20,000 minimum for this APY'] <- 20000
# Make APY and Min_Deposit numeric
links_df[, c('APY', 'Min_Deposit')] %<>%
map_df(~ as.numeric(gsub('%|\\$|,', '', .x)))
## Warning in .f(.x[[i]], ...): NAs introduced by coercion
The following table is our new, cleaned dataset.
links_df %>%
arrange(Category, Institution) %>%
flextable() %>%
set_caption(caption = 'Table 2: Cleaned Dataset') %>%
flextable::footnote(part = 'header',
value = as_paragraph('Source: bankrate.com')) %>%
autofit()
Institution1 | APY1 | Min_Deposit1 | Category1 |
Ally Bank | 1.50 | 0 | 1-Year CD Rates |
Amerant Bank | 1.65 | 10000 | 1-Year CD Rates |
BMO Harris Bank | 1.65 | 5000 | 1-Year CD Rates |
Capital One | 1.50 | 0 | 1-Year CD Rates |
CIT Bank | 1.70 | 1000 | 1-Year CD Rates |
Citizens Access | 1.50 | 5000 | 1-Year CD Rates |
Comenity Direct | 1.70 | 1500 | 1-Year CD Rates |
Discover Bank | 1.50 | 2500 | 1-Year CD Rates |
Limelight Bank* | 1.65 | 1000 | 1-Year CD Rates |
Live Oak Bank | 1.90 | 2500 | 1-Year CD Rates |
Marcus by Goldman Sachs | 1.60 | 500 | 1-Year CD Rates |
Synchrony Bank | 1.50 | 2000 | 1-Year CD Rates |
TIAA Bank | 1.50 | 5000 | 1-Year CD Rates |
Ally Bank | NA | 0 | 3-Year CD Rates |
American Express National Bank | NA | 0 | 3-Year CD Rates |
Comenity Direct | 1.75 | 1500 | 3-Year CD Rates |
Delta Community Credit Union | 1.75 | 1000 | 3-Year CD Rates |
Discover Bank | NA | 2500 | 3-Year CD Rates |
First Internet Bank | NA | 1000 | 3-Year CD Rates |
Marcus by Goldman Sachs | 1.60 | 500 | 3-Year CD Rates |
SchoolsFirst Federal Credit Union | 1.65 | 20000 | 3-Year CD Rates |
TIAA Bank | NA | 5000 | 3-Year CD Rates |
American Express National Bank | NA | 0 | 5-Year CD Rates |
CIT Bank | NA | 1000 | 5-Year CD Rates |
Comenity Direct | 1.80 | 1500 | 5-Year CD Rates |
Delta Community Credit Union | 2.00 | 1000 | 5-Year CD Rates |
First Internet Bank of Indiana | 1.71 | 1000 | 5-Year CD Rates |
Golden 1 Credit Union | 1.80 | 1000 | 5-Year CD Rates |
Marcus by Goldman Sachs | 1.65 | 500 | 5-Year CD Rates |
SchoolsFirst Federal Credit Union | 1.85 | 20000 | 5-Year CD Rates |
TIAA Bank | NA | 5000 | 5-Year CD Rates |
1Source: bankrate.com |
Graphs
Before we plot the results, a function will be written to generalize the desired graphing procedure: APY on the x-axis, minimum deposit on the y-axis, Institutions labeled on the graph, and colors being noted by their CD rates category (1-, 3-, and 5-year). I aim for a more minimalistic type graph with theme_light()
and removing specific gridlines with panel.grid.minor
and panel.grid.major.x
within theme()
.
apy_v_md <- function(data, title) {
ggplot(data) +
aes(x = APY,
y = Min_Deposit,
label = Institution,
col = Category) +
geom_text(position = position_jitter()) +
labs(x = 'APY',
y = 'Minimum Deposit',
col = '',
title = title,
caption = 'Source: bankrate.com') +
theme_light() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
}
We make the graph interactive with ggplotly()
from plotly
.
g1 <- apy_v_md(links_df, title = 'APY vs. Minimum Deposits')
ggplotly(g1)
The outliers–i.e., those with a minimum deposit of 10,000 or above–seem to be masking the underlying relationship between APY and minimum deposits. Filtering them out of the graph reveals a downward trend, as well as the observation that the 3- and 5-year CD rates tend to have higher APY and lower minimum deposits
g2 <- links_df %>%
filter(Min_Deposit < 10^4) %>%
apy_v_md(title = 'APY vs. Minimum Deposits (Outliers Removed)')
ggplotly(g2)
Conclusion
In conclusion, we web mined bankrate.com using the rvest
functions read_html()
, html_node()
, and html_table()
. The results were presented in tables via flextable
and graphs via ggplot2
and plotly
. I have observed that there is a general inverse relationship between APY and minimum deposits, where 3- and 5-year CD rates tend to have higher of the former and lower of the latter.
Feel free to modify and expand the code presented for your own projects!