The suspension of the tennis season has given tennis fans time for a number of side projects. In this post, I share the story of my side endeavour to make a central database of the results of all Grand Slam singles matches ever played. Thanks to Wikipedia and weeks of data wrangling, I can now share the first analysis on this blog that covers the entire history of the majors: an analysis of the longest matches of all time.
Many weeks ago I naively embarked on the project of scraping all Grand Slam draws from the pages of Wikipedia. Thanks to Wikipedia’s Grand Slam Project, there is a page with singles match results for every men’s and women’s major, from the first match played at the 1877 Wimbledon Championship through to the finals of the 2020 Australian Open.
Changes in tournament structure over time and peculiarities in how editors tabulated those results across the event pages made scraping a more onerous task than I had hoped. It wasn’t a pretty process and I thought about throwing in the towel a number of times. But, if the editors of Wikipedia could put in the time to put this data together, I could soldier on for the sake of tennis history.
The result of those efforts is the repo wikidraws. There you can find both the scraping programs and the current datasets, organized by event. In addition to the event data, the datasets include the games won by each player as well as tiebreak points, where tiebreaks were played.
I took some pains to validate the match data. There were multiple instances where my checks revealed mistakes on the Wiki pages that, after verifying against the primary source, I corrected manually. I feel quite confident in the information that is there, but I suspect there may still be some errors or incomplete information. So all suggested corrections are welcome.
Now, you might ask why anyone needs the wikidraws database when we already have the information on Wikipedia. The main reason is that without having all of that information in one place, where it can be simultaneously queried, the kinds of questions we can ask are pretty limited.
Most of us know, for example, that the Van Alen tiebreaker was introduced at the Grand Slams in the 1970s. But how many of us knew that, for several of the years when the tiebreak was first played at Wimbledon, it wasn’t triggered until 8-all in the set? Or how many of us knew that a match at the 1969 Wimbledon Championships between Pancho Gonzales and Charlie Pasarell, which lasted 112 games, was the main impetus for adopting the tiebreak?
The wikidraws dataset isn’t the only way we could make these discoveries, but it can make it easier to find out about these and other curiosities of tennis history.
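To give a flavour of the kind of query this makes possible, here is a minimal Python sketch that infers the score at which a tiebreak was triggered from the final game score of a set. The record layout (`event`, `year`, `sets`) and the rows themselves are made up for illustration and are not the actual wikidraws format. The heuristic assumes every set was completed, so it would misread a set cut short by a retirement.

```python
from collections import defaultdict

def tiebreak_trigger(games_a, games_b):
    """Return the tied game score that triggered a tiebreak, or None.

    A completed set decided by a single game (7-6, 9-8, ...) can only
    have ended in a tiebreak, which started at the loser's game count.
    """
    if abs(games_a - games_b) == 1 and max(games_a, games_b) >= 7:
        return min(games_a, games_b)
    return None

def triggers_by_year(matches):
    """Collect the distinct tiebreak triggers seen per (event, year)."""
    seen = defaultdict(set)
    for match in matches:
        for games_a, games_b in match["sets"]:
            trigger = tiebreak_trigger(games_a, games_b)
            if trigger is not None:
                seen[(match["event"], match["year"])].add(trigger)
    return {key: sorted(values) for key, values in seen.items()}

# Hypothetical rows, invented for this example only.
toy_matches = [
    {"event": "Event A", "year": 1, "sets": [(9, 8), (6, 4), (6, 3)]},
    {"event": "Event A", "year": 2, "sets": [(7, 6), (6, 7), (7, 5)]},
]

print(triggers_by_year(toy_matches))
```

With every event in one table, the same few lines can be run across all majors and years at once, which is the point of having the data in one place.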
So, in keeping with the theme of match lengths, I have put together a chart of the longest 1% of men’s and women’s matches, in terms of games played, for each of the majors. As a reference, the boxplot in grey shows the five-number summary of the games played in all completed matches. The points show the outliers at each event. Pass your cursor over any match to see the details of the year and players involved.
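For readers who want to reproduce the numbers behind a chart like this, the calculation can be sketched roughly as follows. This is not the code used for the post: the record layout (`event`, `sets`) and the function names are assumptions for illustration, and the quartile method is one of several reasonable choices. The two set-by-set scores at the end are the real ones for the Isner-Mahut and Gonzales-Pasarell matches.

```python
import math
from collections import defaultdict
from statistics import quantiles

def total_games(sets):
    """Total games played, given a list of (games_a, games_b) per set."""
    return sum(games_a + games_b for games_a, games_b in sets)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of game counts."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

def longest_matches(matches, fraction=0.01):
    """Keep the longest `fraction` of matches at each event (at least one)."""
    by_event = defaultdict(list)
    for match in matches:
        by_event[match["event"]].append(match)
    top = {}
    for event, rows in by_event.items():
        ranked = sorted(rows, key=lambda m: total_games(m["sets"]), reverse=True)
        keep = max(1, math.ceil(len(ranked) * fraction))
        top[event] = ranked[:keep]
    return top

# Set-by-set scores of the two Wimbledon epics mentioned in this post.
isner_mahut = [(6, 4), (3, 6), (6, 7), (7, 6), (70, 68)]
gonzales_pasarell = [(22, 24), (1, 6), (16, 14), (6, 3), (11, 9)]

print(total_games(isner_mahut))        # 183
print(total_games(gonzales_pasarell))  # 112
```

Ranking within each event, rather than across all four, keeps one tournament's history from crowding out the others.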
There is a clear right skew to the distribution at all of the events. But it is interesting to see how clumped together most of the longest 1% of matches still are. Few matches have come close to the Gonzales-Pasarell or Isner-Mahut epics at Wimbledon, and scanning across those that come closest we see that many of them were pre-Open Era.
A chart like this also allows some interesting gender comparisons. The most extreme women’s matches, in terms of match length, have still been no more than one-third the length of the record-breaking Isner-Mahut marathon. Yet past years that included best-of-five matches for women show that many women have played matches as long as the top 25% of men’s matches at the Grand Slams.
This chart puts a spotlight on the rare group of matches lasting 100 games or more. By my count, there have been just four, three of them at Wimbledon. Can you find each? At which event was the fourth played?
I’ll leave these questions for the interested reader to explore and hope even more discoveries are to come.
Special thanks to Yan Holtz for the d3 boxplot code that was the starting point for this post’s chart.