Using xG data from FBref

Using the acciotables API, and some minor data transformation in R, it’s possible to get Statsbomb/FBref expected goals data for every EPL match since 2017-18.

This standalone guide will produce a dataset and visualisation to use as a starting point for more detailed analysis and graphics. You don’t need your own data, just an installation of R and RStudio.

Getting data with the acciotables API

FBref stores a summary of all 380 matches in a season in a single HTML table. We just need the page_url and selector_id, and acciotables will return that table as clean HTML.

Looking in the page source gives the selector_id: %23sched_ks_3232_1 (%23 is the URL-encoded form of #, so this points at the element with the id sched_ks_3232_1).

page_url <- "https://fbref.com/en/comps/9/3232/schedule/"
selector_id <- "%23sched_ks_3232_1"
url <- paste0("http://acciotables.herokuapp.com/?page_url=",page_url,"&content_selector_id=",selector_id)

cat(paste0("API url: ",url))
## API url: http://acciotables.herokuapp.com/?page_url=https://fbref.com/en/comps/9/3232/schedule/&content_selector_id=%23sched_ks_3232_1

Check that the API is working by opening this URL in your browser.
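
The URL construction above can be wrapped in a small function, which is handy when fetching several seasons. This is only a sketch: build_accio_url is a hypothetical helper, not part of acciotables, and it assumes the selector is passed with a literal # that needs escaping.

```r
# Hypothetical helper (not part of acciotables): builds the API url
# from a page url and a CSS id selector such as "#sched_ks_3232_1".
build_accio_url <- function(page_url, selector) {
  # reserved = TRUE makes URLencode escape "#" as "%23"
  selector_id <- utils::URLencode(selector, reserved = TRUE)
  paste0(
    "http://acciotables.herokuapp.com/?page_url=", page_url,
    "&content_selector_id=", selector_id
  )
}

url <- build_accio_url(
  "https://fbref.com/en/comps/9/3232/schedule/",
  "#sched_ks_3232_1"
)
```

This should give the same API url as the paste0 call above, without having to remember the %23 escape.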

Importing into R

There’s a guide in the acciotables readme.

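library(tidyverse) # %>%, dplyr, tidyr, readr, forcats, ggplot2
library(rvest)     # read_html(), html_table()
library(magrittr)  # extract2()
library(gt)        # gt() tables
library(ggtext)    # element_markdown() in the plot

matches_import <- url %>%
  read_html() %>%
  html_table() %>%
  extract2(1) # unnest the data_frame from the list

head(matches_import)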
##   Wk Day       Date          Time           Home  xG Score  xG            Away
## 1  1 Fri 2019-08-09 20:00 (19:00)      Liverpool 1.7   4–1 1.0    Norwich City
## 2  1 Sat 2019-08-10 12:30 (11:30)       West Ham 0.8   0–5 3.1 Manchester City
## 3  1 Sat 2019-08-10 15:00 (14:00)        Burnley 0.7   3–0 0.8     Southampton
## 4  1 Sat 2019-08-10 15:00 (14:00)        Watford 0.9   0–3 0.7        Brighton
## 5  1 Sat 2019-08-10 15:00 (14:00)    Bournemouth 1.0   1–1 1.0   Sheffield Utd
## 6  1 Sat 2019-08-10 15:00 (14:00) Crystal Palace 0.7   0–0 1.0         Everton
##   Attendance                 Venue        Referee Match Report Notes
## 1     53,333               Anfield Michael Oliver Match Report      
## 2     59,870        London Stadium      Mike Dean Match Report      
## 3     19,784             Turf Moor   Graham Scott Match Report      
## 4     20,245 Vicarage Road Stadium   Craig Pawson Match Report      
## 5     10,714      Vitality Stadium   Kevin Friend Match Report      
## 6     25,151         Selhurst Park  Jonathan Moss Match Report
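
To see what the import pipeline does without calling the API, here is a minimal sketch on a made-up HTML string. The markup is a stand-in, not the real API response, and toy_html, tables and toy_matches are illustrative names.

```r
library(rvest)

# A tiny stand-in for the page returned by the API (made-up markup,
# not the real response)
toy_html <- "<table>
  <tr><th>Home</th><th>Score</th><th>Away</th></tr>
  <tr><td>Liverpool</td><td>4-1</td><td>Norwich City</td></tr>
  <tr><td>West Ham</td><td>0-5</td><td>Manchester City</td></tr>
</table>"

# html_table() on a whole document returns a list, one element per <table>
tables <- html_table(read_html(toy_html))

# Pulling out the first element is what extract2(1) does in the pipeline
toy_matches <- tables[[1]]
```

The list step is why the pipeline needs extract2(1): even when the page holds a single table, it comes back wrapped in a list.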

Tidy the data

There are a few things to sort out to make the raw data more usable.

  • There are two columns called xG. That is definitely going to make something go wrong.
names(matches_import) <-
  names(matches_import) %>%
  make.unique(sep="_")

matches_tidy1 <-
  matches_import %>%
  rename("HomexG"="xG","AwayxG"="xG_1")

gt(head(matches_tidy1))
Wk  Day  Date        Time           Home            HomexG  Score  AwayxG  Away             Attendance  Venue                  Referee         Match Report  Notes
1   Fri  2019-08-09  20:00 (19:00)  Liverpool       1.7     4–1    1.0     Norwich City     53,333      Anfield                Michael Oliver  Match Report
1   Sat  2019-08-10  12:30 (11:30)  West Ham        0.8     0–5    3.1     Manchester City  59,870      London Stadium         Mike Dean       Match Report
1   Sat  2019-08-10  15:00 (14:00)  Burnley         0.7     3–0    0.8     Southampton      19,784      Turf Moor              Graham Scott    Match Report
1   Sat  2019-08-10  15:00 (14:00)  Watford         0.9     0–3    0.7     Brighton         20,245      Vicarage Road Stadium  Craig Pawson    Match Report
1   Sat  2019-08-10  15:00 (14:00)  Bournemouth     1.0     1–1    1.0     Sheffield Utd    10,714      Vitality Stadium       Kevin Friend    Match Report
1   Sat  2019-08-10  15:00 (14:00)  Crystal Palace  0.7     0–0    1.0     Everton          25,151      Selhurst Park          Jonathan Moss   Match Report
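
To see what make.unique does with the duplicated xG name, here is a minimal sketch on a toy vector of column names (col_names and unique_names are illustrative names):

```r
# Toy column names with "xG" appearing twice, as in the imported table
col_names <- c("Home", "xG", "Score", "xG", "Away")

# The first occurrence is left alone; later duplicates get a numbered suffix
unique_names <- make.unique(col_names, sep = "_")

print(unique_names)
```

The second xG becomes xG_1, which is the name the rename step then maps to AwayxG.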
  • The dataset has some non-data rows: repeated header rows (where Wk is literally "Wk") and blank rows. There should be 380 rows of data, one for each match in the season.
matches_tidy2 <-
  matches_tidy1 %>%
  filter(Wk!="Wk",Wk!="")

cat(paste0("rows: ",dim(matches_tidy2)[1],"\n","columns: ",dim(matches_tidy2)[2]))
## rows: 380
## columns: 14
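
The same filter on a toy data frame shows which rows get dropped. This is a sketch with made-up spacer and header rows; toy and toy_clean are illustrative names.

```r
library(dplyr)

# Two real rows, plus a blank spacer row and a repeated header row
toy <- data.frame(
  Wk   = c("1", "", "Wk", "1"),
  Home = c("Liverpool", "", "Home", "West Ham"),
  stringsAsFactors = FALSE
)

# Keep only rows where Wk holds an actual week number
toy_clean <- toy %>%
  filter(Wk != "Wk", Wk != "")
```

Only the Liverpool and West Ham rows survive, since both conditions must be true for a row to be kept.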
  • We don’t need the attendance, referee, etc., so drop those columns.
matches_tidy3 <-
  matches_tidy2 %>%
  select(-c("Attendance":"Notes"))

gt(head(matches_tidy3))
Wk  Day  Date        Time           Home            HomexG  Score  AwayxG  Away
1   Fri  2019-08-09  20:00 (19:00)  Liverpool       1.7     4–1    1.0     Norwich City
1   Sat  2019-08-10  12:30 (11:30)  West Ham        0.8     0–5    3.1     Manchester City
1   Sat  2019-08-10  15:00 (14:00)  Burnley         0.7     3–0    0.8     Southampton
1   Sat  2019-08-10  15:00 (14:00)  Watford         0.9     0–3    0.7     Brighton
1   Sat  2019-08-10  15:00 (14:00)  Bournemouth     1.0     1–1    1.0     Sheffield Utd
1   Sat  2019-08-10  15:00 (14:00)  Crystal Palace  0.7     0–0    1.0     Everton
  • The Score column looks something like 4–1, with an en dash rather than a hyphen as the separator. We want two columns, home goals and away goals, so we split on any punctuation character.
matches_tidy4 <-
  matches_tidy3 %>%
  separate("Score",c("HomeGls","AwayGls"),sep="[:punct:]",fill="right")

gt(head(matches_tidy4))
Wk  Day  Date        Time           Home            HomexG  HomeGls  AwayGls  AwayxG  Away
1   Fri  2019-08-09  20:00 (19:00)  Liverpool       1.7     4        1        1.0     Norwich City
1   Sat  2019-08-10  12:30 (11:30)  West Ham        0.8     0        5        3.1     Manchester City
1   Sat  2019-08-10  15:00 (14:00)  Burnley         0.7     3        0        0.8     Southampton
1   Sat  2019-08-10  15:00 (14:00)  Watford         0.9     0        3        0.7     Brighton
1   Sat  2019-08-10  15:00 (14:00)  Bournemouth     1.0     1        1        1.0     Sheffield Utd
1   Sat  2019-08-10  15:00 (14:00)  Crystal Palace  0.7     0        0        1.0     Everton
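
A stricter variation on the split, as a sketch: if you are happy to assume the separator is always an en dash (U+2013), you can name it explicitly instead of matching any punctuation. This is an alternative to the code above, not what it does; scores and goals are illustrative names.

```r
library(tidyr)

# Toy scores written with an en dash, as in the FBref table
scores <- data.frame(
  Score = c("4\u20131", "0\u20135"),
  stringsAsFactors = FALSE
)

goals <- separate(
  scores, Score, c("HomeGls", "AwayGls"),
  sep = "\u2013",  # assumption: the separator is always an en dash
  convert = TRUE   # convert the new columns from character to integer
)
```

With convert=TRUE the two new columns come out as integers straight away, although in this guide type_convert handles that for every column in the next step.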
  • Every column is still of character type. type_convert will handily auto-detect which ones should be numeric, date, etc.
matches_tidy5 <-
  matches_tidy4 %>%
  type_convert()
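
To check what type_convert guessed, it helps to look at the column types afterwards. A minimal sketch on a toy data frame (raw and converted are illustrative names):

```r
library(readr)

# Everything starts as character, as in the scraped table
raw <- data.frame(
  Date   = c("2019-08-09", "2019-08-10"),
  Home   = c("Liverpool", "West Ham"),
  HomexG = c("1.7", "0.8"),
  stringsAsFactors = FALSE
)

# type_convert() re-guesses the type of every character column
converted <- type_convert(raw)

# Show the type of each column after conversion
str(converted)
```

Running str(matches_tidy5) in the same way is a quick check that Date has become a date and the xG and goals columns have become numeric.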

This is much easier to work with, but there’s one more change that will help a lot later.

Transform into long data

Tidy data has one observation per row. Each line in this table has two independent(ish) observations: home score and away score. By transforming to long (as opposed to wide) format, working with the data becomes much easier.

What’s happening here:

  • pivot_longer to separate home and away into two rows.
  • The original matches data frame joined back to the new data - a bit of a trick to get a sort of metadata, for filtering or labelling. Don’t worry too much about this if it doesn’t make sense.
  • Then ifelse to sort out the rest of the home/away data.
  • relocate the columns used in the final plot to the left to make it easier to see.

This is a useful transformation, so it’s all in a function called make_long_matches.

make_long_matches <- function(matches){
  
  long_matches <-
    matches %>%
    pivot_longer(cols=c(Home,Away),
                 names_to="HA",
                 values_to="Squad") %>%
    left_join(matches) %>% # join the old data frame to the new one
    mutate(
      Opposition=ifelse(HA=="Home",Away,Home),
      GlsF=ifelse(HA=="Home",HomeGls,AwayGls),
      GlsA=ifelse(HA=="Home",AwayGls,HomeGls),
      xGF=ifelse(HA=="Home",HomexG,AwayxG),
      xGA=ifelse(HA=="Home",AwayxG,HomexG)
    ) %>%
    relocate("Squad","Opposition":"xGA","HA")
  
  return(long_matches)
}

matches_long <- make_long_matches(matches_tidy5)

gt(head(matches_long))
Squad            Opposition       GlsF  GlsA  xGF  xGA  HA    Wk  Day  Date        Time           HomexG  HomeGls  AwayGls  AwayxG  Home       Away
Liverpool        Norwich City     4     1     1.7  1.0  Home  1   Fri  2019-08-09  20:00 (19:00)  1.7     4        1        1.0     Liverpool  Norwich City
Norwich City     Liverpool        1     4     1.0  1.7  Away  1   Fri  2019-08-09  20:00 (19:00)  1.7     4        1        1.0     Liverpool  Norwich City
West Ham         Manchester City  0     5     0.8  3.1  Home  1   Sat  2019-08-10  12:30 (11:30)  0.8     0        5        3.1     West Ham   Manchester City
Manchester City  West Ham         5     0     3.1  0.8  Away  1   Sat  2019-08-10  12:30 (11:30)  0.8     0        5        3.1     West Ham   Manchester City
Burnley          Southampton      3     0     0.7  0.8  Home  1   Sat  2019-08-10  15:00 (14:00)  0.7     3        0        0.8     Burnley    Southampton
Southampton      Burnley          0     3     0.8  0.7  Away  1   Sat  2019-08-10  15:00 (14:00)  0.7     3        0        0.8     Burnley    Southampton
cat(paste0("rows: ",dim(matches_long)[1],"\n","columns: ",dim(matches_long)[2]))
## rows: 760
## columns: 17

Now there are 760 rows, one for each team in each match.
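
The join inside make_long_matches matches on every column the two data frames share, so it relies on those columns identifying a match uniquely. Here is a minimal sketch of one alternative, on toy data: copy Home and Away into helper columns before pivoting, so no join is needed. HomeTeam and AwayTeam are hypothetical names, not part of the original function.

```r
library(dplyr)
library(tidyr)

# Two toy matches in wide format
matches <- data.frame(
  Date    = as.Date(c("2019-08-09", "2019-08-10")),
  Home    = c("Liverpool", "West Ham"),
  Away    = c("Norwich City", "Manchester City"),
  HomeGls = c(4, 0),
  AwayGls = c(1, 5),
  stringsAsFactors = FALSE
)

long <- matches %>%
  # keep copies of the team names, because pivot_longer consumes Home and Away
  mutate(HomeTeam = Home, AwayTeam = Away) %>%
  pivot_longer(cols = c(Home, Away), names_to = "HA", values_to = "Squad") %>%
  mutate(
    Opposition = ifelse(HA == "Home", AwayTeam, HomeTeam),
    GlsF = ifelse(HA == "Home", HomeGls, AwayGls),
    GlsA = ifelse(HA == "Home", AwayGls, HomeGls)
  )
```

Each match still produces two rows, one per team, and the row count is exactly double the input regardless of whether any fixtures share a date and scoreline.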

Filter

Next the data can be filtered to get a subset for analysis. In this case it’s the data for all matches played so far by a single team.

team_name <- "Liverpool"

matches_team <-
  matches_long %>%
  filter(Squad==team_name) %>% # filter team
  filter(!is.na(HomeGls)) # only matches which have been played

Some preparation for the plot: create a new column to use as an X axis label, and use fct_reorder to get the X axis in date order.

matches_team <-
  matches_team %>%
  mutate(Match=paste0(Opposition," ",HA," ",GlsF,"-",GlsA)) %>% # make X axis labels
  mutate(Match=fct_reorder(Match, Date)) # order by date
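
Without fct_reorder, ggplot would sort the labels alphabetically. A minimal sketch with made-up labels and dates shows the effect (match_labels, match_dates and match_factor are illustrative names):

```r
library(forcats)

# Toy labels whose alphabetical order differs from their date order
match_labels <- c("Team A Home 1-0", "Team B Away 2-2", "Team C Away 0-0")
match_dates  <- as.Date(c("2020-01-15", "2020-01-01", "2020-01-08"))

# Reorder the factor levels by date rather than alphabetically
match_factor <- fct_reorder(match_labels, match_dates)

levels(match_factor)
```

The levels come out in date order (Team B, Team C, Team A), and that level order is what ggplot uses for a discrete X axis.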

Plot

There are lots of ways to do this; here’s a ggplot with some lines, some points, a theme, and some slightly better labels made with ggtext.

matches_team %>%
  ggplot(aes(x=Match,group=1)) +
  geom_point(aes(y=xGF),size=2,colour="black",fill="darkred",shape=21) +
  geom_smooth(aes(y=xGF),colour="darkred",se=FALSE) +
  geom_point(aes(y=xGA),size=2,colour="black",fill="royalblue",shape=21) +
  geom_smooth(aes(y=xGA),colour="royalblue",se=FALSE) +
  theme_bw() +
  theme(
    plot.title=element_markdown(),
    axis.title.y=element_markdown(),
    axis.text.x=element_text(size=6,angle=60,hjust=1)
  ) +
  labs(
    title=paste0(team_name," <b style='color:darkred'>attack</b> / <b style='color:royalblue'>defence</b> xG trend"),
    x=NULL, # no X axis title
    y="Expected goals <b style='color:darkred'>for</b> / <b style='color:royalblue'>against</b>"
  ) +
  scale_x_discrete(expand=expansion(add=c(0.5))) +
  scale_y_continuous(limits=c(0,NA),expand=expansion(add=c(0,0.1)))

Further reading / references / acknowledgements

One more thing

It’s friendly to give credit to the data providers at Statsbomb and FBref. Statsbomb’s media guide is here, and if you want to add their logos to your images, you can use the cowplot package for that.

Written on October 28, 2020