Messing around with STM – Part I, Web Scraping

Introduction

As the second semester of my Master’s degree is finally wrapping up (two more to go), I suddenly found myself with some spare time, which I decided to employ to mess around a bit with the Structural Topic Model (STM) and its R implementation, aptly named stm.

As I mentioned in my previous post, it is time to think about the dissertation, and although nothing is yet settled, I will probably focus on possible implementations of STM.

STM is basically a text mining technique, initially conceived for the analysis of political texts, which has been extensively adopted in the social sciences (here you can find a list of the main publications that have adopted STM). Like other topic models, such as Latent Dirichlet Allocation, it identifies the abstract “topics” that occur in a collection of documents; compared to other models, however, it also allows the analysis of relationships with document metadata, in the form of covariates, both in terms of the degree of association of a document with a topic (topical prevalence) and in terms of the association of a word with a topic (topical content). As an example, it is possible to take a bunch of posts published on different political blogs in the months before an election and see which topics were prevalent in the posts of blogs of a certain political leaning (in this case, the political leaning of the blog is used as a covariate for topical prevalence), or to see how the words associated with the treatment of a specific topic change depending on the political leaning (in this case, it is used as a covariate for topical content – you can refer to the R package vignette for more details and references to the original papers).
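Just to make the two kinds of covariates concrete, this is roughly what such a model would look like in stm. The following is only a sketch with made-up names: the posts data frame, its text and leaning columns, and the choice of ten topics are all hypothetical.

library(stm)
#'posts' is a hypothetical data frame with a 'text' column (the blog posts)
#and a 'leaning' column (the political leaning of each blog)
processed <- textProcessor(posts$text, metadata = posts)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
#the leaning enters the model both as a prevalence and as a content covariate
fit <- stm(documents = out$documents, vocab = out$vocab, K = 10,
           prevalence = ~ leaning, content = ~ leaning, data = out$meta)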

Like many topics in data analysis, it is easier to do than to explain, and something that can really be understood only when you get your hands dirty with some data, which is what prompted me to try this. In this and the following posts I will try my hand at STM and post the results of my attempts, more specifically testing the model on some job offers extracted from Indeed UK. As this is a work in progress, I will post the results of my work as I get through them, so I am not really sure where this will lead, but I hope to have some fun along the way.

As I will stress throughout, since the sample is totally arbitrary, the results will not have any statistical validity whatsoever. This is simply meant to be an attempt to explore a technique (and an R package) which can have several potential applications, both for analytical purposes and for information visualisation (allowing us, for example, to get past the used and abused word cloud).

Part I – web scraping

In this first post I will not really get my hands on STM yet, but I will illustrate how I obtained the textual data I will use in the rest of the work. As mentioned above, I decided to focus on job offers: there is no specific reason for this, other than that I considered them a good example of texts that come with some metadata to incorporate in the analysis (type of job, salary, location), and whose topic identification could be a good test for the model. The choice fell on indeed.co.uk for no specific reason either, and I stuck with it after noticing that scraping it was relatively easy (although the quality of the metadata, as we will see later on, is not the best we could have hoped for).

Obviously, if I had access to the indeed.co.uk API this whole process would probably have been quicker, but since I don’t, scraping the site was the only feasible option.

The first step was to create three accessory functions to obtain the information needed from each offer page. This was relatively easy thanks to htmlParse from the XML package (yes, I know this is not a nice way to incorporate code snippets on a webpage, but my present WordPress plan doesn’t allow for fancy stuff, so bear with me; a version as an R Jupyter notebook is also available on my GitHub profile):

#load the necessary libraries
library(rvest)
library(xml2)
library(XML)
library(stringr)
library(dplyr)
library(naniar)

#Scrape the info from pages:
#metadata (location, type of job, salary, when present)
getmetadataindeed <- function(url) {
  meta <- read_html(url) %>% as.character() %>% htmlParse(asText = TRUE) %>%
    xpathSApply("//*[contains(@class,'jobsearch-JobMetadataHeader-iconLabel')]", xmlValue)
  if (is.list(meta)) {meta <- NA} #no match: xpathSApply returns an empty list
  meta
}

#job description, collapsed into a single string
getjobdescription <- function(url) {
  read_html(url) %>% as.character() %>% htmlParse(asText = TRUE) %>%
    xpathSApply("//*[contains(@id,'jobDescriptionText')]", xmlValue) %>%
    paste(collapse = ', ') %>%
    str_replace_all("\n", " ")
}

#job title
getjobtitle <- function(url) {
  tit <- read_html(url) %>% as.character() %>% htmlParse(asText = TRUE) %>%
    xpathSApply("//*[contains(@class, 'icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title')]", xmlValue)
  if (is.list(tit)) {tit <- NA} #no match: return NA
  tit
}
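Once defined, these functions just need the URL of a single job offer page to return the title, the description and the metadata as character vectors (the address below is only a placeholder for any Indeed job offer link):

offerurl <- "https://www.indeed.co.uk/viewjob?jk=xxxxxxxxxxxxxxxx" #placeholder URL
getjobtitle(offerurl) #the job title
getjobdescription(offerurl) #the full text of the offer, as a single string
getmetadataindeed(offerurl) #location, type of job and salary, when present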

As these functions need to be fed the specific URLs of the job offer webpages, and doing this by hand for hundreds of offers is not really an option, I created a second function to scrape the URLs directly from the results page of a search:

getlinks <- function(urlres) {
  linksb <- read_html(urlres) %>% as.character() %>% htmlParse(asText = TRUE) %>% xmlRoot() %>%
    xpathSApply("//*[contains(@class,'title')]", xmlGetAttr, 'href')
  linksb[sapply(linksb, is.null)] <- NULL #drop the entries without an href attribute
  linksb <- as.character(linksb)
  linksb <- paste("https://www.indeed.co.uk", linksb, sep = "") #the links are relative, so prepend the domain
  linksb
}

Interestingly, as some sponsored links are always present in the results page, the total number of job offer URLs scraped is slightly higher than the expected number, which for our purposes doesn’t really create any particular issues. The final step was to put everything together:

scrapeindeed <- function(urlres) {
  linksbb <- getlinks(urlres) #all the job offer URLs found in the results page
  jobtitles <- lapply(linksbb, getjobtitle) %>% plyr::ldply(rbind) %>% mutate_if(is.factor, as.character)
  jobsdesc <- lapply(linksbb, getjobdescription) %>% plyr::ldply(rbind) %>% mutate_if(is.factor, as.character)
  jobmeta <- lapply(linksbb, getmetadataindeed) %>% plyr::ldply(rbind) %>% mutate_if(is.factor, as.character)
  tobemoved <- grepl("£", jobmeta[, 2]) #as salary can fall in the second column if location is missing, single out all the salary entries…
  jobmeta[tobemoved, 3] <- jobmeta[tobemoved, 2] #...to move them to the third column
  jobmeta[tobemoved, 2] <- NA #leaving an NA in their place in the second column
  final <- cbind(jobtitles, jobsdesc, jobmeta) %>% `colnames<-`(c("Title", "Description", "Location", "Type", "Salary"))
  final
}

This function, fed with the URL of a results page, extracts all the data needed and stores them in a dataframe of five columns. The last step was to feed the function with all the relevant results pages, which I did (in this case manually) for all the job offers published in the last three days before Saturday 18th May within 25 miles of the postcode NE18 (Newcastle Upon Tyne). The results were then merged together; the data are available here in .txt format.
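For reference, the merging step boils down to something like the sketch below; the query-string parameters in the results URLs (l, radius, fromage, start) and the paging step of ten results are my assumptions about how Indeed builds its search pages, and the file name is arbitrary:

#assumed structure of the results URLs: l = postcode, radius = miles,
#fromage = days since posting, start = pagination offset (10 results per page)
respages <- paste0("https://www.indeed.co.uk/jobs?l=NE18&radius=25&fromage=3&start=",
                   seq(0, 90, by = 10))
alloffers <- lapply(respages, scrapeindeed) %>% dplyr::bind_rows() #scrape and merge every results page
write.table(alloffers, "indeed_ne18_offers.txt", sep = "\t", row.names = FALSE) #save for the next step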
As you can see, there is still much to do in terms of data cleaning before we can start working with the model, which is what we will see in the next post.
