Messing around with STM – Part II – Data cleaning

This is the second in a series of posts where I explore the application of structural topic models and their R implementation. The first post is here, and here is the github repository with the jupyter notebook containing the contents of the posts, the code chunks, and some of the functions I used.

At the end of the last post we ended up with a rather messy dataframe requiring some data cleaning before being ready to use. While the description column can be fed more or less as-is to STM or to some other package specifically meant for textual data (like quanteda), the metadata require further massaging. Here I focus mostly on the location column (as I am interested in trying to perform some kind of spatial data analysis) and on the salary column, which contains quantitative information that is basically impossible to analyse in its present form.

Location

The location metadata are rather vague and not uniform (for most offers the municipality is provided, but other entries mention the region instead, and only a very few report the postcode), so what we can actually do with them is rather limited. However, as this is an experiment to have fun and explore, we can try to make the most of them nevertheless, testing what could then be achieved with better data.
What I try here is basically to retrieve the coordinates corresponding to the location of each offer, so that it can easily be geo-referenced. This can be achieved relatively easily with nominatim_osm, a sweet little function by Dmitry Kisler which exploits the OpenStreetMap API (there is an alternative based on the google geocoding api implemented in ggmap, but it has a limit of 2500 requests per day). The function works best with full addresses, but the municipality name is enough to get its coordinates (as long as it is present in the OSM database). On the other hand, it doesn't really like postcodes, so we first remove them from the location column, keeping only the digit-free portion of each string (for convenience I have loaded the dataset prepared before, with the offers published in the last three days before Saturday 18th May within 25 miles of the postcode NE18):

library(stringr) # for str_extract

totaljobs <- read.table("totaljobs.txt") # recover the data previously scraped
totaljobs <- totaljobs[!duplicated(totaljobs), ] # get rid of duplicates
totaljobs$Location <- str_extract(totaljobs$Location, "(\\b[^\\d]+\\b)") # keep only the words without digits
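
As a quick check, this is what the extraction does to a hypothetical location string containing a postcode (the string is made up for illustration):

str_extract("Newcastle upon Tyne NE1 8QH", "(\\b[^\\d]+\\b)")
# [1] "Newcastle upon Tyne " (the postcode is dropped; a trailing space may remain)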

Now we can feed the dataframe directly to the nominatim_osm function (the version I use here has only a small modification to ignore NA values; note that it might be a bit time consuming depending on connection speed and the size of the dataset):

nominatim_osmMod <- function(address = NULL) {
  # slightly modified from the original to deal with NA instead of NULL
  if (suppressWarnings(is.na(address)))
    return(data.frame())
  tryCatch(
    d <- jsonlite::fromJSON(
      gsub('\\@addr\\@', gsub('\\s+', '\\%20', address),
           'http://nominatim.openstreetmap.org/search/@addr@?format=json&addressdetails=0&limit=1')
    ),
    error = function(c) return(data.frame())
  )
  if (length(d) == 0) return(data.frame())
  return(data.frame(lon = as.numeric(d$lon), lat = as.numeric(d$lat)))
}
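
For a single address the function returns a one-row dataframe with longitude and latitude, or an empty dataframe if the lookup fails; for example (the exact values depend on the current state of the OSM database):

nominatim_osmMod("Newcastle upon Tyne")
#         lon      lat
# 1 -1.613157 54.97385

We can now apply it to the whole location column and merge the coordinates back into the dataframe: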

library(plyr)  # for ldply
library(purrr) # for is_empty

JobsCoord <- lapply(totaljobs$Location, nominatim_osmMod)
filter <- lapply(JobsCoord, is_empty)
JobsCoord[unlist(filter)] <- 0 # replace empty results so that ldply keeps a row for them

JobsCoorddf <- ldply(JobsCoord, data.frame)
JobsCoorddf[3] <- 1:nrow(JobsCoorddf) # ordinal index for the coordinates

totaljobs[, 6] <- 1:nrow(totaljobs) # ordinal index for totaljobs

names(totaljobs)[6] <- "OrdIndex"
names(JobsCoorddf)[3] <- "OrdIndex"

totaljobsCoord <- merge(totaljobs, JobsCoorddf, by = "OrdIndex")

The resulting dataset contains the coordinates of the entity in the location column, looking something like this:

              Location Type                   Salary       lon      lat
23 Newcastle upon Tyne   NA £21,414 – £22,658 a year -1.613157 54.97385

There is a final issue to be noted: in case of homonyms, the OSM API returns the most relevant result, which might not be the one relevant in our case. In the data we are using, for example, there is a "Washington" which happens to be a town in Tyne and Wear, but is unlikely to be ranked as the most relevant "Washington" by OSM. As the entries in question are relatively few, I opted for manual intervention in most cases; an alternative workaround could be to append ", UK" to each location cell, although that solution is not devoid of side effects.
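
In code, that workaround could look something like the minimal sketch below (untested on the full dataset, and, as said, not devoid of side effects):

# append ", UK" to the non-missing locations to bias the lookup towards UK results
LocationUK <- ifelse(is.na(totaljobs$Location), NA,
                     paste0(totaljobs$Location, ", UK"))
JobsCoordUK <- lapply(LocationUK, nominatim_osmMod)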

Salary

In its present form, the salary column is basically useless for any practical purpose: rates are given by different units (from hours to years), and in some cases a single value is given, in others a range.
The first step I took to sort this out is to extract the numeric values from the column. This can be done with gregexpr and regmatches, which are conveniently combined in the Numextract function by Ramnath Vaidyanathan (here I use the dataframe with the coordinate columns, but of course the procedure works on the original dataframe as well):

library(dplyr) # for the pipe and select()

totaljobsCoord$Salary <- as.character(totaljobsCoord$Salary)
totaljobsCoord$Salary <- gsub(",", "", totaljobsCoord$Salary) # remove commas from digits in thousands

Numextract <- function(string){
  unlist(regmatches(string, gregexpr("[[:digit:]]+\\.*[[:digit:]]*", string)))
} # courtesy of http://stla.github.io/stlapblog/posts/Numextract.html

totaljobsCoord$rate <- lapply(totaljobsCoord$Salary, Numextract) %>% plyr::ldply(rbind)
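
On a single salary string (commas already removed), Numextract simply returns all the numbers it finds, as characters:

Numextract("£21414 - £22658 a year")
# [1] "21414" "22658"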

The results are stored in a dataframe within a dataframe, which we had better split into two distinct columns for ease of use:

totaljobsCoord$minrate <- as.numeric(as.character(unlist(totaljobsCoord$rate[1]))) # min rate
totaljobsCoord$maxrate <- as.numeric(as.character(unlist(totaljobsCoord$rate[2]))) # max rate

totaljobsCoord$maxrate[!complete.cases(totaljobsCoord$maxrate)] <- totaljobsCoord$minrate[!complete.cases(totaljobsCoord$maxrate)] # when a single value is given instead of a range, it is repeated in both columns

totaljobsCoordRates <- select(totaljobsCoord, -rate) # get rid of the original nested column

The last step is to create a column with the factor by which the rate is computed, which can be done relatively easily with str_extract. We can first check which factors to look for:

print(unique(gsub("\\b\\d+\\b", "", totaljobsCoord$Salary))) # find out the possible factors

totaljobsCoordRates$rateby <- str_extract(totaljobsCoord$Salary, "\\b(year|month|week|day|hour)\\b")
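
As an optional sanity check, we can tabulate the extracted factors and see whether any entry failed to match:

table(totaljobsCoordRates$rateby, useNA = "ifany")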

This makes the data more easily comparable:

               Salary minrate maxrate rateby
        £28283 a year 28283.0 28283.0   year
£8.50 – £8.70 an hour     8.5     8.7   hour
        £18000 a year 18000.0 18000.0   year

Depending on the type of analysis we want to perform, there is still further cleaning and massaging to be done, but for now this can be considered a workable starting point. In the next posts I will start to use STM to do some actual analysis on the data.
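
As an example of such further massaging, the rates could be brought onto a common annual basis (a minimal sketch; the hour/day/week multipliers are rough assumptions of mine, not something derived from the data, and the minannual column name is just illustrative):

# hypothetical annualisation: assumes a 37.5-hour week, a 5-day week
# and 52 paid weeks per year
multiplier <- c(hour = 37.5 * 52, day = 5 * 52, week = 52, month = 12, year = 1)
totaljobsCoordRates$minannual <- totaljobsCoordRates$minrate * multiplier[totaljobsCoordRates$rateby]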

For now, I invite you to take a look at the jupyter notebook, which hopefully makes the code a bit easier to follow. In the github repository you will also find the procedures I used here packaged as standalone functions (one for retrieving the location, one for formatting the salary column, and one combining both). As usual, comments and suggestions are welcome!
