How to Screen Scrape Using R
STEP 1: BACKGROUND
Screen scraping is an effective technique for gathering data from web pages. Typically the data is gathered for further analysis or aggregation.
STEP 2: GATHER THE DATA
Install and load packages
R packages contain collections of R functions, data, and code that we can use to perform our analysis. We need to install them once and load them into our environment each session so that we can call on them later.
# Install packages - do this one time
install.packages("rvest")
install.packages("plyr")
install.packages("ggplot2")
install.packages("treemap")
install.packages("data.table")

# Load the relevant libraries - do this every time
library(rvest)
library(plyr)
library(ggplot2)
library(treemap)
library(data.table)
Create a screen scraping function and then call the function for all review pages.
Screen scraping is a very in-depth topic, and it can get incredibly complicated depending on how the page you would like to scrape is formatted. Luckily, the pages we are trying to scrape let us reference the data objects we want relatively easily. Since there are eight pages total that we would like to scrape (all with the same format), I made a function that we can call for every page; creating a function helps prevent repeated blocks of code. This function takes two input parameters: the URL of the website you want the data from and the category name you would like to assign to this data set. It then retrieves each app's title, rating count, download count, content rating (mature, teen, etc.), and write-up (description), and assigns the category you provided. Finally, it returns a data frame with all of the information nicely packed up.
########### CREATE Function for Google Screen Scrape ################
scrapeGoogleReviews <- function(url, categoryName){
  # Read the desired web page into memory
  webpage <- read_html(url)
  # Pull each field out of the page by its CSS class and build a data frame
  df <- data.frame(
    app_title = html_text(html_nodes(webpage, '.id-app-title')),
    rating_count = html_text(html_nodes(webpage, '.rating-count')),
    download_count = html_text(html_nodes(webpage, '.download-count')),
    content_rating = html_text(html_nodes(webpage, '.content-rating-title')),
    write_up = html_text(html_nodes(webpage, '.editorial-snippet')),
    category = categoryName)
  return(df)
}
########### Call Function for Screen Scrape ################
#Call the function just created (scrapeGoogleReviews) for every "Best Apps" List
#Each web page hosts its own "Best of 2017 App" list, such as "Best Social", "Most Influential", "Best for Kids", etc. The function is called for each web page and the results are placed in their own data frame. We then put all the data frames together into one combined data frame with the rbind command.
df1 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_bestof2017&hl=en','Winner')
df2 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_entertaining_bestof2017&hl=en','Most Entertaining')
df3 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_social_bestof2017&hl=en','Best Social')
df4 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_productivity_bestof2017&hl=en','Daily Helper')
df5 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_innovative_bestof2017&hl=en','Most Innovative')
df6 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_Hiddengem_bestof2017&hl=en','Hidden Gem')
df7 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_kids_bestof2017&hl=en','Best for Kids')
df8 <-scrapeGoogleReviews('https://play.google.com/store/apps/topic?id=campaign_editorial_3002de9_apps_bestof_mostpopular2017&hl=en','Most Popular')
#Combine all of the data frames
fulldf <- rbind(df1,df2, df3, df4, df5, df6, df7, df8)
#Peek at the data frame
head(fulldf)
STEP 3: FORMAT YOUR DATA
Download the full data set
Screen scraping is incredibly fragile because web page structure can change at any time. To ensure that those following the tutorial can continue on to the visualizations, download the full data set from GitHub:
fulldf <- fread('https://raw.githubusercontent.com/lgellis/GoogleBestOf2017AppsAnalysis/master/GoogleBestApps.csv')
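Because the live pages may have changed since this post was written, it can help to guard each scrape so one broken page does not stop the whole script. Here is a minimal sketch of that idea; the safeScrape wrapper is my addition (not part of the original post) and assumes the scrapeGoogleReviews function defined above.
# Hedged sketch: wrap the scrape in tryCatch so a changed or missing
# page returns NULL (plus a warning) instead of halting the script.
safeScrape <- function(url, categoryName) {
  tryCatch(
    scrapeGoogleReviews(url, categoryName),
    error = function(e) {
      warning(paste("Could not scrape", url, ":", conditionMessage(e)))
      NULL
    }
  )
}

# Usage: collect results in a list and drop any NULLs before combining
results <- list(safeScrape('https://play.google.com/store/apps/topic?id=campaign_editorial_apps_bestof2017&hl=en', 'Winner'))
fulldf_scraped <- do.call(rbind, Filter(Negate(is.null), results))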
Convert to numeric
The downfall of screen scraping is that we often have to reformat the data to suit our needs. For example, our data frame has two variables that should be numeric: download_count and rating_count. As much as they look like numbers in the preview above, they are actually text with some pesky commas that make conversion to numeric slightly more complicated. Below we create two new columns holding the numeric versions of these variables. The conversion first removes any non-numeric characters with the gsub function and then converts the result to numeric with the as.numeric function.
########### Extra formatting ################
# Remove commas and convert to numeric
fulldf$rating_count_numeric <- as.numeric(gsub("[^0-9]", "", fulldf$rating_count))
fulldf$download_count_numeric <- as.numeric(gsub("[^0-9]", "", fulldf$download_count))
attach(fulldf)
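If the regular expression looks opaque, here is a quick illustrative example of the two-step conversion (the sample string is made up for demonstration):
# Illustrative only: strip everything that isn't a digit, then convert
gsub("[^0-9]", "", "1,000,000")              # "1000000" (still text)
as.numeric(gsub("[^0-9]", "", "1,000,000"))  # 1000000 (numeric)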
Create helper variables for easier analysis and visualization
There are some things that, right off the bat, I know we will want for visualization. To start, it would be nice to have each app's percent of overall downloads within the data set.
# Add percent downloads
totalDownload <- sum(download_count_numeric)
fulldf$percentDownloadApp <- round(download_count_numeric/totalDownload * 100, 2)
attach(fulldf)
Next, we want to bin our download and rating totals. "Binning" is a way of grouping values within a particular range into the same group. This is an easy way for us to compare volumes at a more aggregated level.
# Binning by downloads
breaks <- c(0, 10000, 1000000, 10000000, 100000000)
fulldf$download_total_ranking = findInterval(download_count_numeric, breaks)

# Binning by rating totals
breaks2 <- c(10, 100, 1000, 100000, 10000000)
fulldf$rating_total_ranking = findInterval(rating_count_numeric, breaks2)
attach(fulldf)

# Peek at the data
head(fulldf)
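If findInterval is unfamiliar, a quick illustrative call (with made-up values) shows how each number maps to the bin whose lower bound it has passed:
# Illustrative only: which bin does each value fall into?
findInterval(c(5000, 250000, 50000000), c(0, 10000, 1000000, 10000000, 100000000))
# returns 1 2 4, i.e., the 0-10K, 10K-1M, and 10M-100M bins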
STEP 4: ANALYZE YOUR DATA
Create a treemap to represent all app download volumes
Treemaps are perfect for visualizing volumes in a creative way. The treemap function offers the added benefit of not including titles for categories with minimal data.
# Treemap without category
treemap(fulldf,                           # Your data frame object
        index = c("app_title"),           # Variable to tile by
        vSize = "download_count_numeric", # Tile size
        type = "index",
        palette = "Blues",
        title = "Treemap of App Download Volumes",
        fontsize.title = 14)
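Since the data frame also carries the category column, a hierarchical variant is possible. This sketch is my addition (not from the original post); it nests app tiles within their category:
# Illustrative variant: nest app tiles inside their category
treemap(fulldf,
        index = c("category", "app_title"), # Hierarchy: category, then app
        vSize = "download_count_numeric",
        type = "index",
        palette = "Blues",
        title = "App Download Volumes by Category",
        fontsize.title = 14)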
Create a bubble chart showing the top downloaded apps and the number of ratings they received.
We use the ggplot function to create this bubble plot. The size of each bubble represents the volume of downloads, and the color represents the number of ratings. Given the disparity in download volumes, we only look at the top apps, filtering the data set to include only apps with more than 10,000,000 downloads.
ggplot(data = fulldf[download_count_numeric > 10000000, ],
       mapping = aes(x = category, y = rating_count_numeric)) +
  geom_point(aes(size = download_count_numeric), alpha = 1/3) +
  geom_point(aes(colour = rating_count_numeric)) +
  geom_text(aes(label = app_title), size = 2.5, vjust = -2) # vjust moves the labels up
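To keep a copy of the chart, ggplot2's ggsave writes the most recently displayed plot to disk; the file name and dimensions below are just placeholders:
# Save the most recent plot (file name and size are illustrative)
ggsave("top_apps_bubble_chart.png", width = 10, height = 6)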