Web Scraping and Text Mining. An intuition.

library(tidyverse)
library(rvest)
library(RColorBrewer)
library(wordcloud)
library(tidytext)

Table of Contents

  • 1 Introduction
  • 2 What is Web Scraping?
  • 3 Annotation
  • 4 The web scraping process
  • 5 A few graphical analyses
  • 6 A brief insight into text mining
  • 7 Conclusion

1 Introduction

The amount of data available on the Internet is growing rapidly, which opens up new possibilities for data scientists, such as web scraping. Much of the data we need is already online; what keeps some people from using it is not knowing how to access it. This post shows how to overcome that barrier.

2 What is Web Scraping?

Web scraping is a technique for converting data that exists on the web in an unstructured format (HTML tags) into a structured format that can easily be accessed and used. In this post, the most popular feature films of 2016 will be scraped from the IMDb website.
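
To make the idea concrete, here is a minimal sketch that turns a small HTML fragment into a data frame. The HTML snippet and its class names (`movie`, `title`, `year`) are invented for illustration and are not IMDb's markup.

```r
library(rvest)

# A tiny, invented HTML document (not IMDb's real markup)
html <- '<html><body>
  <div class="movie"><span class="title">Movie A</span><span class="year">2016</span></div>
  <div class="movie"><span class="title">Movie B</span><span class="year">2015</span></div>
</body></html>'

# read_html() also accepts literal HTML, so no network access is needed here
page <- read_html(html)

# Pull the text out of the tags, one vector per field
titles <- html_text(html_nodes(page, ".movie .title"))
years  <- as.numeric(html_text(html_nodes(page, ".movie .year")))

# Structured result: one row per movie
toy_df <- data.frame(Title = titles, Year = years)
toy_df
```

The same three steps (read the page, select nodes, extract and clean their text) are what the IMDb example in this post relies on.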

3 Annotation

For users who are not very familiar with HTML and CSS, I recommend the open source tool SelectorGadget, which lets you find the CSS selector for a page element by clicking on it. That is more than enough for most web scraping tasks. SelectorGadget is available as a browser extension.
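
For readers who want to see what such selectors do, the sketch below applies the four kinds of CSS selector that appear later in this post (class, descendant, adjacent sibling `+`, and `:nth-child()`) to a made-up HTML fragment. The markup and class names are assumptions chosen for the example, not IMDb's.

```r
library(rvest)

# Invented markup that mimics the patterns the selectors target
html <- '<html><body>
  <div class="item">
    <h3 class="header"><span class="rank">1.</span><a href="#">Movie A</a></h3>
    <div class="ratings">7.3</div>
    <p class="muted">Plot of movie A.</p>
    <p class="votes"><span>Votes:</span><span>1,234</span></p>
  </div>
</body></html>'
page <- read_html(html)

# Class selector: every element with class "rank"
rank <- html_text(html_nodes(page, ".rank"))

# Descendant selector: <a> elements inside an element with class "header"
title <- html_text(html_nodes(page, ".header a"))

# Adjacent sibling selector: a ".muted" element directly after a ".ratings" element
plot <- html_text(html_nodes(page, ".ratings + .muted"))

# :nth-child(2): a <span> that is the second child of its parent
votes_raw <- html_text(html_nodes(page, ".votes span:nth-child(2)"))
votes <- as.numeric(gsub(",", "", votes_raw))  # "1,234" becomes 1234
```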

4 The web scraping process

# Search results page: 100 feature films released in 2016
url <- "http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature"
webpage <- read_html(url)

# Rank: extract the text and convert it to a number
rank_data_html <- html_nodes(webpage, '.text-primary')
rank_data <- html_text(rank_data_html)
rank_data <- as.numeric(rank_data)

# Title
title_data_html <- html_nodes(webpage, '.lister-item-header a')
title_data <- html_text(title_data_html)

# Description: strip the newline characters
description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted')
description_data <- html_text(description_data_html)
description_data <- gsub("\n", "", description_data)

# Runtime: drop the " min" suffix and convert to a number
runtime_data_html <- html_nodes(webpage, '.text-muted .runtime')
runtime_data <- html_text(runtime_data_html)
runtime_data <- gsub(" min", "", runtime_data)
runtime_data <- as.numeric(runtime_data)

# Genre: remove whitespace and keep only the first listed genre
genre_data_html <- html_nodes(webpage, '.genre')
genre_data <- html_text(genre_data_html)
genre_data <- gsub("\n", "", genre_data)
genre_data <- gsub(" ", "", genre_data)
genre_data <- gsub(",.*", "", genre_data)
genre_data <- as.factor(genre_data)

# IMDb rating
rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
rating_data <- html_text(rating_data_html)
rating_data <- as.numeric(rating_data)

# Votes: remove the thousands separators and convert to a number
votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
votes_data <- html_text(votes_data_html)
votes_data <- gsub(",", "", votes_data)
votes_data <- as.numeric(votes_data)

# Director: first link in the credits paragraph
directors_data_html <- html_nodes(webpage, '.text-muted+ p a:nth-child(1)')
directors_data <- html_text(directors_data_html)
directors_data <- as.factor(directors_data)

# Actor: first link after the separator in the credits paragraph
actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a')
actors_data <- html_text(actors_data_html)
actors_data <- as.factor(actors_data)




# Combine all scraped vectors into one data frame (one row per movie)
movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Description = description_data, Runtime = runtime_data,
                        Genre = genre_data, Rating = rating_data,
                        Votes = votes_data, Director = directors_data, Actor = actors_data)


glimpse(movies_df)
## Observations: 100
## Variables: 9
## $ Rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Title       <fct> Phantastische Tierwesen und wo sie zu finden sind,...
## $ Description <fct>     The adventures of writer Newt Scamander in New...
## $ Runtime     <dbl> 133, 106, 123, 117, 108, 116, 147, 133, 105, 116, ...
## $ Genre       <fct> Adventure, Adventure, Action, Horror, Action, Dram...
## $ Rating      <dbl> 7.3, 7.4, 6.1, 7.3, 8.0, 7.0, 7.8, 7.8, 5.8, 7.9, ...
## $ Votes       <dbl> 338250, 233381, 505590, 298641, 780515, 293013, 53...
## $ Director    <fct> David Yates, Jon Favreau, David Ayer, M. Night Shy...
## $ Actor       <fct> Eddie Redmayne, Neel Sethi, Will Smith, James McAv...
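
One caveat: each field is scraped into its own vector, so `data.frame()` only works when every movie has every field. If one movie lacked a runtime, for example, the vectors would differ in length. A possible safeguard, sketched here on invented markup (the class names `item`, `title` and `runtime` are hypothetical), is to select one container per movie first and then call `html_node()` inside each container, which yields `NA` where a field is missing.

```r
library(rvest)

# Invented markup: the second movie has no runtime
html <- '<html><body>
  <div class="item"><a class="title">Movie A</a><span class="runtime">133 min</span></div>
  <div class="item"><a class="title">Movie B</a></div>
  <div class="item"><a class="title">Movie C</a><span class="runtime">106 min</span></div>
</body></html>'
page <- read_html(html)

# One node per movie
items <- html_nodes(page, ".item")

# html_node() (singular) returns exactly one result per item;
# items without a match give NA once html_text() is applied
titles      <- html_text(html_node(items, ".title"))
runtime_raw <- html_text(html_node(items, ".runtime"))
runtimes    <- as.numeric(gsub(" min", "", runtime_raw))

# All vectors have one entry per item, so the rows stay aligned
safe_df <- data.frame(Title = titles, Runtime = runtimes)
safe_df
```

The trade-off is that every selector has to be written relative to the container, but the rows can no longer silently shift against each other.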

5 A few graphical analyses

Now we can analyze the scraped data. Below are some examples:

# Histogram of runtimes, filled by genre
ggplot(movies_df, aes(x = Runtime, fill = Genre)) +
  geom_histogram(bins = 30)

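# Rating against runtime; point size shows the number of votes
ggplot(movies_df, aes(x = Runtime, y = Rating)) +
  geom_point(aes(size = Votes, col = Genre))

# Votes against runtime; point size shows the rating
ggplot(movies_df, aes(x = Runtime, y = Votes)) +
  geom_point(aes(size = Rating, col = Genre))

# Number of movies per genre
movies_df_2 <- movies_df %>% count(Genre)
ggplot(movies_df_2, aes(x = Genre, y = n)) +
  geom_col(width = 0.5) + coord_flip()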

6 A brief insight into text mining

I am interested in the words used within the movie description.

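# The descriptions are stored as a factor; convert them to plain text
movies_df$Description <- as.character(movies_df$Description)
text <- movies_df %>% select(Description)
# Split every description into single words (one row per word)
tidy_text <- text %>% unnest_tokens(word, Description)
# Count how often each word occurs
tidy_text %>% count(word, sort = TRUE)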
## # A tibble: 1,254 x 2
##    word      n
##    <chr> <int>
##  1 a       163
##  2 the     161
##  3 to       91
##  4 of       89
##  5 in       66
##  6 and      58
##  7 his      31
##  8 with     28
##  9 is       26
## 10 an       25
## # ... with 1,244 more rows
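
To see what `unnest_tokens()` does in isolation, here is a small sketch on two made-up sentences (the `toy` tibble is purely illustrative). By default the function lowercases the text and strips punctuation, which is why "The" and "the" are counted together.

```r
library(dplyr)
library(tidytext)

# Two invented descriptions
toy <- tibble(Description = c("A hero saves the world.",
                              "The world needs a hero!"))

# One row per word, lowercased, punctuation removed
toy_tokens <- toy %>% unnest_tokens(word, Description)

# Word frequencies, most frequent first
toy_counts <- toy_tokens %>% count(word, sort = TRUE)
toy_counts
```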

Ok, we see some unwanted stop words… Let’s fix this:

data("stop_words")

tidy_text <- tidy_text %>% anti_join(stop_words)
## Joining, by = "word"
tidy_text %>% count(word, sort = TRUE)
## # A tibble: 1,056 x 2
##    word         n
##    <chr>    <int>
##  1 world       12
##  2 life        10
##  3 american     7
##  4 father       7
##  5 journey      7
##  6 city         6
##  7 save         6
##  8 story        6
##  9 death        5
## 10 home         5
## # ... with 1,046 more rows
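
The mechanism behind this step can be sketched on a handful of words taken from the two tables above. `stop_words` is a data frame with the columns `word` and `lexicon`, and `anti_join()` keeps only the rows of the left table that have no match in it. Passing `by = "word"` explicitly also suppresses the "Joining, by" message. The `toy_words` tibble is an assumption made for this example.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# A few words from the tables above: four stop words and three content words
toy_words <- tibble(word = c("the", "world", "is", "a", "journey", "of", "life"))

# Keep only the words that do not appear in the stop word list
kept <- toy_words %>% anti_join(stop_words, by = "word")
kept

# stop_words combines several lexicons; a single one can be used instead
snowball <- stop_words %>% filter(lexicon == "snowball")
kept_snowball <- toy_words %>% anti_join(snowball, by = "word")
```

Using a single lexicon removes fewer words, which can be preferable when the combined list turns out to be too aggressive for a given text.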

We can also visualize this:

tidy_text %>%
  count(word, sort = TRUE) %>%
  filter(n > 4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

tidy_text %>% anti_join(stop_words) %>% count(word) %>% with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

7 Conclusion

One might wonder what web scraping is actually needed for. The answer is short and simple: the possibilities of web scraping are almost unlimited.