Dinesh Vatvani

Analysis of a WhatsApp chat log

2020-10-25T16:00:00+00:00

Introduction

With the ever-increasing role of technology and software services in our modern lives, we’re passively creating an increasingly large digital footprint. Our browsers store our history by default. Our favourite map apps store our location history by default. Our banks keep records of our transactions. Our phones and smart-watches keep track of the number of footsteps we take every day, along with heart rate and various other bits of biometric data. YouTube, Netflix, Spotify, Amazon Video and similar services all keep track of our media consumption history. There are also many other apps and services that keep track of a myriad of other interesting habits and behaviours. I personally enjoy downloading copies of my digital footprint and analysing them. I think there’s a lot that can be learned from examining my own behaviours and patterns to help me reflect on some unconscious choices I’m making and obtain a better and more objective understanding of myself. It also serves as a useful digital diary, since I can automatically collate data from several sources nightly to give me a fairly good data-driven summary of what I was doing on any given day. To that end, I’ve written several scripts over the past 6 years to download and analyse my digital footprint from a range of different sources.

In this blog post, I’ll be talking specifically about the output from a script I wrote to analyse WhatsApp chat logs. You can run this analysis on your own chat logs by running the Python Notebook which can be found here. If all that looks or sounds too technical for you, I created a WebApp where you can drag and drop a WhatsApp chat log to run a slightly more limited version of this analysis (I’ve had to remove some of the more memory-intensive parts due to memory constraints in the cloud platform I’m using to host this). There are more detailed instructions on how to replicate this work on your own chat logs towards the end of this blog post.

Sample analysis results

Here are the results from an analysis of an anonymised chat log I’m part of (with permission from the relevant group chat members) to showcase some interesting things that could be done with WhatsApp chat log data and the types of insights that could be gained relatively easily with some basic Natural Language Processing (NLP).

Summary table

4,955 total messages from 4 people, from 2016-12-07 to 2020-10-20

	Isambard	Lysander	Perseus	Seraphina
Contribution
Total N messages	1,921	942	1,035	1,057
Total N words	19,979	12,325	14,062	11,390
Total N characters	101,607	63,472	72,766	57,491
Message type
Text	94.4%	96.3%	92.2%	94.1%
Media	2.2%	2.7%	5.1%	3.8%
Link	3.4%	1.0%	2.7%	2.1%
Message content
Sentences per message	1.40	1.57	1.38	1.41
Words per message	11.1	13.6	14.8	11.6
Characters per message	54.1	69.2	74.1	56.5
Messages containing emoji	2.9%	0.2%	0.2%	2.5%
Messages containing profanity	1.9%	0.0%	5.6%	0.4%

The summary table is already a pretty useful overview of the chat log, but we can visualise the data and delve slightly deeper into the patterns in the chat log.

Message types

We can start with some plots of the types of messages sent, to visualise who favours media messages (audio, video, images, or GIFs) and who tends to share external links in the chat.

Message contribution

We can plot the overall message contribution from each person.

The last plot above contains a few notable literary works to put the word counts in context. The code that generates this plot includes a longer list of reference works, so if you re-run this analysis on your own chat logs you will likely see a different set of reference works, appropriate to the magnitude of the word counts in the chat log being analysed.

Contribution over time

We can also look at when each person has contributed to the conversation. This can be done as either a calendar view…

…or can also be viewed as a timeseries

The last plot can be represented as a relative plot rather than an absolute plot if we want to see who has been contributing more or less relative to everyone else over time.

Another thing we can do with the timeseries data that could be of interest is group the activity by day of week and time of day to look at daily and weekly patterns.

Note that the curves in the Activity by time of day plot are measured every minute across the 24 hours and smoothed with a Gaussian convolution. The smooth curves are not an artifact of interpolation.
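As a rough sketch of what per-minute counting followed by Gaussian smoothing could look like (this is not the notebook’s actual code; the function name, the 30-minute kernel width and the wrap-around at midnight are all assumptions made for illustration):

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60


def smooth_activity(minutes_of_day, sigma_minutes=30.0):
    """Return a smoothed messages-per-minute curve over a 24-hour day.

    minutes_of_day: list of ints in [0, 1440), one per message.
    sigma_minutes: width of the Gaussian kernel (an assumed value).
    """
    # Count messages in each of the 1,440 one-minute bins.
    counts = np.bincount(np.asarray(minutes_of_day), minlength=MINUTES_PER_DAY)
    counts = counts.astype(float)

    # Gaussian kernel centred on offset zero, normalised to sum to 1 so that
    # smoothing preserves the total number of messages.
    offsets = np.arange(MINUTES_PER_DAY) - MINUTES_PER_DAY // 2
    kernel = np.exp(-0.5 * (offsets / sigma_minutes) ** 2)
    kernel /= kernel.sum()

    # Circular convolution via FFT (an assumption): activity just before
    # midnight spills over into the minutes just after midnight.
    kernel = np.fft.ifftshift(kernel)
    return np.real(np.fft.ifft(np.fft.fft(counts) * np.fft.fft(kernel)))


# Three messages: two at 00:00 and one at 23:59.
curve = smooth_activity([0, 0, 1439])
```

Because the counts are taken at every minute before smoothing, the resulting curve has a value for each of the 1,440 minutes rather than being interpolated between a handful of hourly points.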

Conversation Dynamics

We can gain some insight into the group conversation dynamics by looking at who tends to respond to each user. This doesn’t take into account the content of the message to infer which message is being responded to. It is based purely on who the previous message was from every time a message is sent. As such, it can include replies to oneself. Contiguous messages from one person have been excluded if they are posted within 3 minutes of the previous message, as this is deemed to effectively be a single message split across multiple messages.
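The counting rule described above can be sketched in a few lines. This is an illustrative version only: the `(timestamp, sender)` tuple format and the function name are assumptions, and "within 3 minutes" is interpreted here as a gap of at most 3 minutes.

```python
from collections import Counter
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(minutes=3)


def count_responses(messages):
    """Count (responder, previous_sender) pairs.

    messages: list of (timestamp, sender) tuples sorted by timestamp
    (a hypothetical format chosen for this sketch).
    """
    responses = Counter()
    for (prev_time, prev_sender), (time, sender) in zip(messages, messages[1:]):
        # A follow-up from the same person within the merge window is treated
        # as a continuation of their previous message, not as a reply.
        if sender == prev_sender and time - prev_time <= MERGE_WINDOW:
            continue
        responses[(sender, prev_sender)] += 1
    return responses


example = [
    (datetime(2020, 1, 1, 10, 0), "A"),
    (datetime(2020, 1, 1, 10, 1), "A"),   # continuation of A's message
    (datetime(2020, 1, 1, 10, 2), "B"),   # B responds to A
    (datetime(2020, 1, 1, 10, 30), "B"),  # B responds to B (28 minutes later)
    (datetime(2020, 1, 1, 10, 31), "A"),  # A responds to B
]
response_counts = count_responses(example)
```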

In a similar vein, we can also examine the response times for each user.

Grammatical and linguistic preferences

Most of the analysis above looks at patterns in when people message and a very rough overview of the nature of the messages. It can also be insightful to examine the content of the messages more closely to identify people’s grammatical and linguistic preferences. Below are a few examples of the types of insights that can be obtained by parsing and analysing the content of the messages.

Let’s start by looking at the use of punctuation, emoji and profanity.

Note: Profanity detection is done using the profanity-check library

We can also look at the distribution of word lengths from each person.

In this instance it turns out not to vary all that much across users, largely due to the composition of the group, but that is not always the case.

An interesting analysis we can perform is to compare the relative usage frequency of different words from each user against the natural prevalence rate of those words in the English language (based on occurrence in the web on websites in English) to find the words that each user uses disproportionately often.
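One simple way to express this comparison is as a ratio of each person’s relative word frequency to the baseline frequency. The sketch below assumes a hypothetical `baseline_freq` lookup table (word to relative frequency in general English); the post does not say which table or which exact scoring was used, so treat the names and the `min_count` cut-off as illustrative.

```python
from collections import Counter


def overused_words(user_words, baseline_freq, min_count=2, top_n=5):
    """Rank words by how much more often a user uses them than the baseline.

    user_words: list of lower-cased word tokens from one person.
    baseline_freq: dict mapping word -> relative frequency in general English
        (a hypothetical lookup table for this sketch).
    """
    counts = Counter(user_words)
    total = sum(counts.values())
    ratios = {}
    for word, count in counts.items():
        # Skip rare words (noisy ratios) and words missing from the baseline.
        if count < min_count or word not in baseline_freq:
            continue
        user_freq = count / total
        ratios[word] = user_freq / baseline_freq[word]
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]


# Toy data: the baseline frequencies below are made up for illustration.
words = ["the", "the", "the", "meeple", "meeple", "a"]
baseline = {"the": 0.05, "meeple": 1e-7, "a": 0.02}
top_words = overused_words(words, baseline)
```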

Using the same concept of words’ natural prevalence rate in English-language websites above, we can determine the average log(natural prevalence frequency) of all words used by each person. This is a measure of how obscure or niche the words used by each person are, with a lower average log(frequency) indicating more frequent use of rare words. The plot above can be interpreted as some proxy for vocabulary complexity/specificity. An interesting measure to accompany that is vocabulary breadth. To do that, we can plot the cumulative unique word count against cumulative total word count for each user.

Early in the conversation, we expect most words used to be new. As the conversation continues, we expect the number of unique words to slowly tail-off as many of the words used in the conversation will be repeated words. Theoretically, if the conversation goes on infinitely long and the topics of conversation covered in the chat are exhaustive, we expect this curve to asymptote towards a value that represents some approximation of the scope of each person’s total vocabulary size. The plot above displays the early part of that curve. In practice, most WhatsApp conversations will be far too small to reach the vocabulary size asymptote, and will often have limited topics of conversation covered. Moreover, many people will likely modulate the tone and complexity of the language they choose to use in casual WhatsApp conversations in ways that make it unrepresentative of their true vocabulary breadth. These curves have been left in the analysis because I believe they are interesting, but it is important to emphasize that they are only broadly indicative of each person’s vocabulary scope specifically as observed in the particular chat log which, for many reasons, will not be representative of their true overall vocabulary size.
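Both measures discussed above are straightforward to compute. The following is a minimal sketch rather than the notebook’s code: `baseline_freq` is the same kind of hypothetical word-frequency table as before, and base-10 logarithms are an assumption since the post does not state the base.

```python
import math


def mean_log_frequency(words, baseline_freq):
    """Average log10 of the baseline frequency of each word used.

    Words missing from the (hypothetical) baseline table are ignored.
    """
    logs = [math.log10(baseline_freq[w]) for w in words if w in baseline_freq]
    return sum(logs) / len(logs)


def vocabulary_growth(words):
    """Return the cumulative unique word count after each word is seen."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


# Plotting growth[i] against i + 1 gives the curve described above.
growth = vocabulary_growth(["hi", "there", "hi", "you", "there"])
```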

Finally, we can look into how similar the observed vocabulary is between the participants of the group. This can be done by taking all unique words used by each person and comparing them to the unique words used by every other person. The vocabulary similarity between 2 people can be defined as the number of words they use in common divided by the total number of distinct words used by either person, and is known as the Jaccard index (in set notation, it’s defined by |A∩B|/|A∪B|, where A and B are the sets of words used by each person).
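In Python, the Jaccard index maps directly onto set operations. A minimal sketch (the function name and the handling of two empty vocabularies are choices made here, not taken from the notebook):

```python
def jaccard_index(words_a, words_b):
    """Return |A ∩ B| / |A ∪ B| for the sets of words used by two people."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(a & b) / len(union)


# Two words in common ("b", "c") out of four distinct words overall.
similarity = jaccard_index(["a", "b", "c"], ["b", "c", "d"])
```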

Potential areas for future development

NLP is a rich and expansive field and there is plenty more that could be done with this dataset using NLP tools and techniques. Some ideas for potential future developments could be:

Automatic topic of conversation detection, to pick out who tends to favour discussions on particular topics and mapping when in time different topics get discussed
The comparison of vocabulary similarity between people could be improved (currently using Jaccard Similarity based on the unique words used by each person)
Linguistic style profiles could be added by assessing vocabulary similarity or prose style similarity to external text datasets e.g. gossip magazines, tabloid newspapers, broadsheet newspapers, tech magazines, Victorian novels, scientific journal publications, legal documents, etc.

I may come back to implement these in the future, but if anyone would like to try to train any of these models, please let me know in the comments or DM me on Twitter. I’d be more than happy to collaborate or review pull requests on GitHub.

Generate your own chat log reports

If you’d like to analyse your own chat logs and create plots like the ones above, then follow the instructions below.

Extracting a WhatsApp chat log

In order to analyse your own chat logs, you’ll first need to export your own chat log from WhatsApp. These instructions are for exporting a chat log from an Android device. Doing it from an iOS device should be similar, but if in doubt, I’m sure there will be other instructions online telling you how to export a WhatsApp chat log from iOS.

Open any conversation on WhatsApp, click on the kebab icon (3 vertical dots icon on the top right of the conversation screen), then More... and Export chat. This should bring up a prompt asking whether you want to export with or without media. Export the chat log without media. When the export is ready, it should bring up another prompt asking which app to use to handle the files. Either select a file browser to save the files, or select any email client app and send the exported chat log to yourself. The export may contain multiple files. The one of interest will be named something like “WhatsApp chat with {group_name}.txt”. We’ll be using this in the next step.

Analysing the chat log

If you’re comfortable with running Python scripts and notebooks, then the Jupyter (Python) notebook that I used for the analysis above can be found here. Clone the repository, install the dependencies (listed in the requirements.txt file), then place your chat log in the chat_logs folder and run the notebook. Remember to update the chat log file name in the notebook to the one you just added there. Once the script has finished running, the plots will be visible in-line in the notebook, but all plots are also saved in the outputs folder under a subfolder name that corresponds to the chat log name.

If you’re NOT comfortable with running Python scripts on your own, then I’ve set up a minimalist WebApp where you can upload or drag and drop your WhatsApp chat log and have it processed for you. It’s hosted on the free tier of a cloud platform and shared among all readers, so it may be quite slow, depending on how many people are using it at any given time. I’ve had to remove a couple of the plots for this WebApp due to memory constraints in the cloud platform. None of your data will be stored, so save a local copy of the output if you’d like to keep hold of it or want to share it with anyone. Attempting to share a link to the results by copying the URL will not work (since that would require a copy of the output to be saved).

Thanks to:

[REDACTED] : For letting me use an anonymised version of our group chat in this blog post.

An analysis of board games: Part III - Mapping the board game landscape

2020-09-04T00:10:00+01:00

This is part III in my series on analysing BoardGameGeek data. Other parts can be found here:

Part I: Introduction and general trends
Part II: Complexity bias in BGG
Part III: Mapping the board game landscape

Introduction

Previous posts in this series cover how we generated a dataset from BoardGameGeek, explored general trends in the tabletop games landscape over time and looked at the complexity bias inherent in the BGG dataset. This post compares game ratings at an individual user level to determine which games are similar, and uses that to create a map of the board games landscape.

Data collection

Before we are able to perform any user-level ratings analysis, we need to collect a dataset that contains game ratings at an individual user level, since the previous dataset in parts I and II used an average rating for each game. Extracting individual-account-level information from BoardGameGeek is possible using their XML API, but is more challenging and time-consuming than extracting game-level aggregates due to some constraints in the API (e.g. limited to 100 user-level ratings per request). As such, obtaining a comprehensive list of all game ratings by each user for all games in the BGG database was not considered a viable approach. Instead, the individual user level ratings were obtained for 500 of the most popular (by Ownership) games on BGG, with an additional 53 hand-picked to sample some of the more recent successful titles, including Wingspan, Res Arcana, etc. Those criteria bring down the total number of user-level ratings to be collected considerably, but that still amounts to 7.5 million individual game ratings at a user level. Those 7.5 million user-level game ratings covering 553 successful games were collected and were found to contain ratings from 265,374 unique BGG accounts.

The dataset is currently in a SQLite database. If anyone would like a copy of the data, please let me know.

User-Driven Similarity

Having collected individual game ratings per BGG user, we can take any pair of games, find the users that have provided ratings for both of these games and see how the ratings across games are related. There are a few examples below showing that BGG users who tend to like Monopoly tend to also like Risk. Similarly, users who like Yahtzee tend to also like UNO. On the other hand, users that like Monopoly aren’t any more likely to enjoy Twilight Struggle, and users liking UNO tells us nothing about their affinity for Scythe. The extent to which ratings of games are correlated indicates how likely it is that users will like one game if they like the other. It’s important to highlight that when we say a user “likes” a game here, we are always talking in relative terms. It means that users that like game A more than average are likely to enjoy game B more than average too if their ratings are positively correlated.
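To make the pairwise comparison concrete, here is a minimal sketch that correlates the ratings of two games over the users who rated both. It assumes a Pearson correlation and a `{user: rating}` dictionary per game; the post does not specify either, and the `min_shared` threshold is purely illustrative.

```python
import numpy as np


def rating_correlation(ratings_a, ratings_b, min_shared=3):
    """Pearson correlation of two games' ratings over users who rated both.

    ratings_a, ratings_b: dicts mapping user name -> rating for each game
    (a hypothetical data layout for this sketch).
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < min_shared:
        return float("nan")  # too few shared raters to say anything
    x = np.array([ratings_a[u] for u in shared], dtype=float)
    y = np.array([ratings_b[u] for u in shared], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Toy ratings, made up for illustration. Users u1-u3 rated both games.
game_a = {"u1": 8, "u2": 4, "u3": 6, "u4": 9}
game_b = {"u1": 7, "u2": 3, "u3": 5, "u5": 10}
r = rating_correlation(game_a, game_b)
```

Because a correlation is computed on deviations from each game’s own mean, it captures the relative sense of "liking" described above: a user who rates game A above its average also tending to rate game B above its average.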

The correlation of user-level ratings between games can be interpreted as some form of similarity between games. After all, if the users who tend to like one tend to also like the other, there will presumably be something similar between the games. However, the similarity between the games may not be obvious based on a traditional board games classification taxonomy. What the correlation captures is essentially games that “scratch the same itch” or tap into a similar core appeal. This could be the feeling of solving an abstract puzzle, a thematic appeal, the social-component, the rewarding feeling of building an elegant engine, the feeling of cooperating with friends, or any other. The games may have very different mechanics, themes, complexity levels or even overall average ratings, but likely tap into a similar core appeal, and that core appeal will resonate with some groups of BGG users more than others.

Scaling up the comparisons

Now that we’ve introduced the concept of game similarity based on user-rating correlations, we can calculate the pairwise correlations for all 152,628 unique pairs of games in our dataset. The user-level correlation approach knows nothing about the games’ type, genre, mechanics, complexity level, rating, designers, or anything else tangible about the games. Despite that, it is able to identify that remakes or alternate versions of the same game are very similar (e.g. Codenames, Codenames: Pictures and Codenames: Duet, or Brass: Lancashire and Brass: Birmingham). This approach also finds, rather reassuringly, that games that we would intuitively class as being broadly similar tend to have high user-level rating correlations as well. For example, One Night Ultimate Werewolf, Secret Hitler, Coup, and The Resistance are all light party games based on communication and deception. They all end up with high correlations with one another. Similarly, word games like Boggle, Scrabble, Taboo, Scattergories, Pictionary and Bananagrams also group together in the same way. Another example is the “Easy to learn. Hard to master” strategy cluster of Chess, Go and Diplomacy. These correlations and their general alignment with games that we’d intuitively consider similar allow us to build a simple recommendation system that displays the most similar games to any other game (refer to the Dashboard below for an implementation of this).
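The pair count follows from choosing 2 games out of 553, and a basic "most similar games" lookup only needs a sort over one row of the correlation matrix. The sketch below is illustrative: the function name is hypothetical and the correlation values in the toy matrix are invented, not taken from the dataset.

```python
from math import comb

import numpy as np

# 553 games give 553 * 552 / 2 unique pairs.
assert comb(553, 2) == 152_628


def most_similar(game, names, corr, k=3):
    """Return the k games whose ratings correlate most with `game`."""
    i = names.index(game)
    order = np.argsort(corr[i])[::-1]  # highest correlation first
    return [names[j] for j in order if j != i][:k]


names = ["Codenames", "Codenames: Duet", "Chess", "Go"]
# Invented values, for illustration only.
corr = np.array([
    [1.00, 0.80, 0.10, 0.05],
    [0.80, 1.00, 0.00, 0.10],
    [0.10, 0.00, 1.00, 0.60],
    [0.05, 0.10, 0.60, 1.00],
])
recommendations = most_similar("Chess", names, corr, k=1)
```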

Mapping the board game landscape

The full grid of 152,628 game similarities is non-trivial to visualise in its native form. To accurately display the similarities between all games in that matrix as distances between points, we would need a 552-dimensional (N-1) graph. Obviously, that’s not really a tractable solution. Fortunately for us, there are machine learning techniques that provide us with an adequate solution to this problem. A technique known as t-Distributed Stochastic Neighbour Embedding (commonly abbreviated as t-SNE) allows us to create a lower-dimensionality projection, or more correctly, a manifold, of the 553 x 553 correlation matrix that attempts to keep points that are close together in the high-dimensionality space close together in the reduced-dimensionality space too. What this means is that we can obtain a set of points in 2D that best preserves adjacency between points close together in high-dimensional space, therefore keeping similar games together. Below is an interactive visualisation of the results using this approach. You can hover over any point to get more information on it and a list of its most similar games. There is also an interactive dashboard (see next section) where you can search for individual games to highlight them in the plot.

We can see that the games that we mentioned as being similar above are close to each other in this visualisation. This visualisation also shows that games perceived to be good “Gateway games” such as Catan, Carcassonne and Ticket to Ride are also in close proximity to each other (bottom of the light blue group), despite not having many common themes or mechanics between them. Similarly, many pre-1960s traditional family games such as Monopoly, Risk, Battleship or Clue cluster together as well (bottom of the red group). Navigating the plot reveals several groups of games that are intuitively grouped together e.g. economic games, visual party games, communication-based party games, hidden information card games. Interestingly, I found 2 game designers whose games tend to cluster together: Vlaada Chvátil (near the top right of the orange area) and Reiner Knizia (top left of dark blue area). It’s also interesting that in both of these cases, despite there being a very distinct cluster for their games, they each have games that do not belong in their own cluster e.g. Codenames does not appear to belong with the other Vlaada Chvátil games. Similarly, The Quest for El Dorado does not belong with the other Reiner Knizia games. There are many other interesting features in the plot, but they are best left for the readers to explore and discover.
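For readers who want to try a similar projection, one possible way to feed a correlation matrix to t-SNE with scikit-learn is sketched below. It is a sketch under assumptions rather than the method used for the plot above: converting correlation to a distance as 1 - correlation, passing it as a precomputed metric, and the parameter values are all choices made here.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_games(corr, perplexity=30.0, seed=0):
    """Project an N x N game correlation matrix to N points in 2D."""
    # Turn similarity into dissimilarity (an assumption of this sketch):
    # highly correlated games get a small distance.
    dist = 1.0 - np.asarray(corr, dtype=float)
    np.fill_diagonal(dist, 0.0)
    dist = np.clip(dist, 0.0, None)  # guard against tiny negative values

    # scikit-learn requires init="random" when the metric is precomputed.
    tsne = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",
        perplexity=perplexity,
        random_state=seed,
    )
    return tsne.fit_transform(dist)


# Random stand-in data: 40 "games" rated by 200 "users".
rng = np.random.default_rng(0)
fake_ratings = rng.normal(size=(40, 200))
coords = embed_games(np.corrcoef(fake_ratings), perplexity=5.0)
```

The perplexity must be smaller than the number of points, and different seeds give different layouts, so the absolute positions in such a map carry no meaning; only the neighbourhoods do.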

Interactive Dashboard

I’ve built a basic interactive dashboard with more control over the visualisation of the BG landscape seen above, as well as a basic recommendation system that lists the most similar games to any game of interest. It can be found using the link below:

Link to dashboard

Closing remarks

I hope that the framework presented here helps nudge the discussion around tabletop games and their classification towards a consideration of the games’ core appeal rather than a classification based on some of the games’ trappings and mechanics e.g. “Wargame”, “Thematic game”, “Hex and Counter game”, etc. This analysis also had a useful byproduct of allowing us to create a rudimentary game recommender system based on user-level ratings correlations (recommender available in the dashboard) that will hopefully be useful to some people, despite the limited scope of 553 games.

Thanks to:

Elizabeth Hargrave for suggesting that it might be interesting to do a gender-level analysis on the BGG dataset following my previous analysis on board games. That motivated the collection of a user-rating-level dataset, which eventually sparked this idea.
Colm Seeley for introducing me to the world of modern board games, countless discussions and ideas on interesting things to do with the dataset, and for helping me identify and name many of the clusters in the mapped board game landscape.
Yihui Fan for suggesting some interesting neural-network-based analysis ideas that could be performed on this data.

Making aesthetically pleasing dot density Venn diagrams

2019-04-14T20:00:00+01:00

Introduction

Venn diagrams are a very common and intuitive way to visualise sets and relative population sizes of different cuts of data. From a data visualisation perspective, Venn diagrams are used in several different ways to present data:

Euler diagrams: A qualitative overview of which sets overlap with others, and which sets are subsets of others (Euler diagrams are technically not Venn diagrams, but I have included them here because these types of diagrams are colloquially still referred to by many as Venn diagrams)

Source: Wikipedia

Labelled population sizes in the diagram: These are a straightforward way to present the data, but from a perceptual standpoint, our brains aren’t very good at intuitively processing this. It’s only marginally better than presenting the data in the form of a table

source: Geckoboard

Area-proportional or scaled Venn diagram: These aim to scale the area of different regions of a Venn diagram so that they are proportional to the population of that segment. This can be quite a useful way to convey relative population sizes of the regions of the Venn or Euler diagrams, but geometric restrictions mean that this can’t be accurately done with circles for cases with more than 2 overlapping sets (the number of degrees of freedom from altering relative size and distance between circles will be lower than the number of distinct regions in the Venn diagram for all cases with n>2). There are ways around this problem using triangles or irregular shapes for the 3-set or higher case, but it is likely that you will run into geometric limitations when presenting information in this way

source: StackOverflow post

Dot density Venn diagram: Another way to present more quantitative information is by populating the regions of the Venn diagram with icons or dots that represent the relative population of the region of the Venn diagram. This is a flexible way to present quantitative information that is also perceptually easy to process.

source: Robert Allison’s website

I generally like the last of these as a visualisation approach because of its flexibility and perceptual interpretability. However, it is typically done with randomly sampled points for each region, or with points placed manually in arbitrary locations within a region. I have always thought that these could look nicer if the points within a region were approximately evenly spaced, so this blog post is my attempt at solving that problem.

Lloyd’s algorithm for pseudo-random sampling

Lloyd’s algorithm is designed to generate roughly evenly spaced points in space, so I’ll be using this as the key process for the pseudo-random sampling to create evenly distributed points. The way it works is heavily reliant on Voronoi tessellation. If you want to learn more about Voronoi tessellation, I can recommend this DataGenetics post introducing the concept.

Lloyd’s algorithm starts with a set of randomly distributed points, and then repeatedly generates the Voronoi cells for that set of points and moves the points to the centroids of the Voronoi cells. Each iteration of this process increases the uniformity of the spacing between points. Each step is visualised below:

Start with a set of random points
Determine the Voronoi tessellation for that set of points
Move each point (orange) to the centroid (blue) of its Voronoi cell

We can see that this process increases the distance between points that are close together.

This process can be repeated to keep increasing the distance between the points that are closest together until the system reaches an equilibrium point, thereby generating an approximately uniformly distributed set of points. The animation below shows the effect of cycling through 30 iterations of Lloyd’s algorithm.
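The loop above can be approximated without computing exact Voronoi cells. The sketch below is a discrete stand-in for Lloyd’s algorithm, not the code from the notebook: the region is represented by a dense set of sample locations, each sample is assigned to its nearest dot (which approximates the dot’s Voronoi cell), and each dot moves to the mean of its samples. The function name and the unit-circle example are assumptions for illustration.

```python
import numpy as np


def lloyd_relax(points, region_samples, iterations=30):
    """Approximate Lloyd's algorithm inside an arbitrary region.

    points: (n, 2) array of initial dot positions.
    region_samples: (m, 2) array of densely sampled locations covering the
        region; these stand in for the continuous area.
    """
    points = np.array(points, dtype=float)
    for _ in range(iterations):
        # Voronoi step: assign every sample to its nearest dot. The samples
        # assigned to a dot approximate that dot's Voronoi cell.
        diff = region_samples[:, None, :] - points[None, :, :]
        nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
        # Centroid step: move each dot to the mean of its cell's samples.
        for i in range(len(points)):
            cell = region_samples[nearest == i]
            if len(cell):  # leave a dot in place if its cell is empty
                points[i] = cell.mean(axis=0)
    return points


# Example region: a unit circle sampled on an 80 x 80 grid.
axis = np.linspace(-1.0, 1.0, 80)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
disc = grid[(grid ** 2).sum(axis=1) <= 1.0]

rng = np.random.default_rng(0)
start = disc[rng.choice(len(disc), size=25, replace=False)]
relaxed = lloyd_relax(start, disc)
```

Restricting the samples to one region of a Venn diagram keeps the dots' cells inside that region. Note that for non-convex regions (such as the crescent left over when two circles overlap) a centroid can land outside the region, which this sketch does not handle.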

This approach can be applied to all regions in a dot density Venn diagram to turn the figure on the left into the figure on the right.

That looks much nicer to me and it doesn’t lose any perceptual accuracy. I think this might become my default choice for visualising population sizes in sets in the future.

If you’re interested in generating similar graphs, the code I wrote to generate the Lloyd-relaxed dot density Venn diagram can be found here in the form of a Jupyter Notebook (Python).

An analysis of board games: Part II - Complexity bias in BGG

2018-12-09T02:30:00+00:00

This is part II in my series on analysing BoardGameGeek data. Other parts can be found here:

Part I: Introduction and general trends
Part II: Complexity bias in BoardGameGeek
Part III: Mapping the board game landscape

Introduction

In Part I, I described how I generated a dataset from BoardGameGeek and explored general trends in the rate of release, ratings and complexity. I also looked at the prevalence of different mechanics and themes throughout the hobby and how this has changed in the past 30 years. In this post, we’ll investigate complexity bias in BGG ratings.

Complexity bias in ratings

BoardGameGeek’s top 100 list is a very visible “beacon” for the hobby and many players will use this list to make decisions about which games to try or buy. It is comparable to the IMDb top 250 in the role it plays in shaping what the community perceives as the apex of Board Game experiences. However, one of the problems with the BGG top 100 is that it is disproportionately dominated by big and complex games. This makes it less useful for a sizeable majority of board game players looking for good new games to play, since many of the games on that list will look inaccessible and daunting. The relationship between a game’s complexity and how highly rated it is on BGG is not just limited to the top 100. In fact, there is a pretty clear correlation between how complex a game is and how highly rated it is on BoardGameGeek, as shown below.

Note: The above graph only includes games with > 100 votes for game weight

The existence of this correlation in the BGG dataset makes it easier to understand why the top 100 is disproportionately populated with big, complex games.

It is worth making a couple of comments based on the graph above:

This graph does not necessarily mean that more complex board games are inherently better. While the graph above does show a clear (and statistically significant) relationship between perceived complexity and overall rating, we need to appreciate that there is a strong sampling bias present in our dataset that leads to this result, i.e. complex board games disproportionately appeal to the BGG user base.
A curious feature of the graph above is the tail of games of low complexity and low ratings at the bottom left of the plot. This “tail of spite” consists of relatively old mass-appeal games. Every single game in the tail of spite was released pre-1980, with many being considerably older than that. The games that form the tail of spite are shown in the table below:

Name	Avg. rating	Avg. weight	Year published
Tic-Tac-Toe	2.6	1.11	-1300
Monopoly	4.4	1.68	1933
Trouble	3.7	1.05	1965
Pay Day	4.7	1.23	1975
Checkers	4.9	1.79	1150
Pachisi	4.5	1.21	400
Sorry!	4.5	1.17	1929
Battleship	4.5	1.23	1931
Mouse Trap	4.1	1.12	1963
Connect Four	4.8	1.20	1974
The Game of Life	4.1	1.19	1960
Operation	4.0	1.08	1965
Guess Who?	4.8	1.12	1979
Candy Land	3.2	1.05	1949
Snakes and Ladders	2.8	1.00	-200
Twister	4.6	1.09	1966
Pick Up Sticks	4.2	1.05	1850
Bingo	2.7	1.02	1530
Memory	4.7	1.16	1959

Correcting for the complexity bias

Since the regression in the graph above reveals how games’ ratings are related to complexity within the BGG dataset, we can artificially correct for the correlation by adjusting the game ratings to penalize complex games and reward simpler games. For the more mathematically inclined among you, I’m referring to the residuals of the regression between rating and complexity.
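As a minimal sketch of a residual-based correction (assuming a straight-line least-squares fit, and re-centring on the mean rating so that the numbers stay on the familiar scale; both are assumptions, as is the function name):

```python
import numpy as np


def complexity_agnostic_ratings(weights, ratings):
    """Remove the linear trend of rating against complexity (weight)."""
    weights = np.asarray(weights, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    # Fit rating = slope * weight + intercept by least squares.
    slope, intercept = np.polyfit(weights, ratings, deg=1)
    # Residual: how far each game sits above or below the trend line.
    residuals = ratings - (slope * weights + intercept)
    # Re-centre on the overall mean rating (a choice made for this sketch).
    return residuals + ratings.mean()


# Toy numbers, made up for illustration.
toy_weights = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_ratings = [5.0, 6.5, 6.0, 7.5, 8.0]
corrected = complexity_agnostic_ratings(toy_weights, toy_ratings)
```

By construction, the corrected ratings have no remaining linear correlation with complexity: a game is rewarded for sitting above the trend line and penalised for sitting below it, regardless of how complex it is.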

A short illustration goes a long way to intuitively explain what the process does.

Applying that artificial correction gives us a “complexity-agnostic” rating for all games. Below is an interactive plot showing the rating vs complexity after the rating correction. Hover over any point to see the name of the game and the game’s new BGG rank and rating.

Hover your mouse over (or tap if you’re on mobile) any point for more information about the game

We can use these corrected ratings to re-rank all of the games and obtain a complexity-agnostic top 100 list. Note that BGG use something called a Bayesian mean to rank their games instead of taking the raw average ratings. What this does is effectively give each game a certain number of additional “average” rating votes. This is designed to push games with a very low number of ratings towards the average to prevent the top games list being dominated by games with only a couple of perfect score ratings. I’ve used a similar approach, using the same Bayesian prior as BGG (Bayesian prior of about 5.5 with a weight of around 1,000 ratings). As a result, there may be some cases where a game with a higher average rating ends up having a lower rank than a game with a slightly lower average rating if the second has significantly more rating votes. The re-ranked BGG list using these corrected ratings has the complex games evenly spread throughout the ranked list of games rather than disproportionately skewed towards the top, thereby allowing some of the great, but less complex, games to shine through to the top 100.
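The Bayesian mean described above amounts to adding a fixed number of "average" votes to every game before taking the mean. A minimal sketch, using the approximate prior quoted above (5.5 with a weight of 1,000 votes); the function name and the example games are hypothetical:

```python
def bayesian_mean(avg_rating, n_ratings, prior_mean=5.5, prior_weight=1000):
    """Average rating after adding `prior_weight` votes of `prior_mean`."""
    total = prior_mean * prior_weight + avg_rating * n_ratings
    return total / (prior_weight + n_ratings)


# Hypothetical games: a slightly better average with few votes versus a
# slightly worse average with many votes.
few_votes = bayesian_mean(avg_rating=8.2, n_ratings=500)      # 9600 / 1500
many_votes = bayesian_mean(avg_rating=8.0, n_ratings=20000)   # 165500 / 21000
```

Here the game with the higher raw average (8.2) ends up with the lower Bayesian mean, because its 500 votes are outweighed by the 1,000 prior votes pulling it towards 5.5.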

I have applied the complexity-bias correction to all games with over 30 rating votes. Below is an interactive table that allows you to navigate the full list. It also includes a search function to find the impact of the complexity-bias correction on specific games.

Note: This table only includes games with >= 30 rating votes

Some of the games experienced a fairly substantial push up/down the rankings ladder as a result of the complexity bias correction. Some of the games that benefitted the most from this rating correction and have risen to the top 100 are Skull, BANG! The Dice Game, Love Letter: Batman, No Thanks!, Time's Up!, Spyfall and Sushi Go!. Conversely, some of the games that have been penalized the most are Twilight Imperium (Third Edition), Alchemists, War of the Ring (first edition), A Game of Thrones: The Board Game (Second Edition), Through the Ages: A Story of Civilization and Caverna: The Cave Farmers.

Looking at the revised top 100 from the list above, I still have some reservations about it, but it looks much more reasonable to me than the original BGG top 100 list. I suspect that for most board game players looking to try out new good games, this list would look far more approachable, while still being filled with excellent games.

I hope that you’ve enjoyed learning about the complexity bias inherent in the BoardGameGeek dataset and how we can correct for it. The discussion on whether or not complex games really are better is far from over, but hopefully people looking for some of the lighter great games to play will find this more welcoming take on the BGG top 100 useful.

The code I wrote for this analysis can be found here in the form of a Jupyter Notebook (Python).

Thanks to:

Colm Seeley for co-authoring this work with me
Catherine Maddox for great feedback on the writing and presentation of the post
Quintin Smith (Quinns) from Shut Up & Sit Down for allowing me to use material from one of his talks in a presentation of this analysis
GitHub user TheWeatherman for creating the BGG scraper that I modified to collect the data used for this analysis.
GitHub user vividvilla for building the useful CSVtoTable tool

If you enjoyed reading this, you may also enjoy:

An analysis of board games: Part I - Introduction and general trends

2018-12-08T03:30:00+00:00

This is part I in my series on analysing BoardGameGeek data. Other parts can be found here:

Part I: Introduction and general trends
Part II: Complexity bias in BGG
Part III: Mapping the board game landscape

Introduction

Over the last few years, board games have become one of my favoured pastimes. My journey of discovery in this space has been very enjoyable, but the deeper I delve down the rabbit hole, the more it makes me wonder about the board game landscape as a whole, particularly about the genres I haven’t tried, the different types of mechanics I’ve not been exposed to, games that have an unusual pairing of mechanics and how the board game landscape as a whole has evolved over time. I found a few different bits of analysis on Kaggle, in forums and blogs that scratched the surface of these topics, but not enough to relieve the itch of my curiosity, so I decided to get my hands dirty and rummage through the data mine myself.

Data collection and description

BoardGameGeek is a fantastic database for board game information, so it seemed like a no-brainer to me to use this as the main source of the data for my analysis. There are pre-scraped and ready-to-use BGG datasets on Kaggle and GitHub, but neither of those suited my purpose: the Kaggle dataset is limited to the top 5000 board games on BGG, and the GitHub dataset is 2 years old and is missing some data fields that I was interested in, such as a list of mechanics for each board game. I decided to re-run a modified version of the scraper I found on GitHub to fetch additional fields (mechanics, categories and designers) that were not collected by the original scraper, and so obtain a slightly richer and more up-to-date board games dataset. This generated a dataset containing 76,597 board games and 13,675 board game expansions. The modified scraper and the scraped dataset can both be found here.

For the analysis in this post, we’ll be focusing on base board games only, not expansions.

A note on sampling bias in the dataset

Before we delve into any serious analysis, we should highlight that any patterns or observations found here reflect patterns observed within the boardgamegeek dataset. Depending on the context in the analysis, these observations may or may not be reflective of the board game industry as a whole, since the perspective and behaviour of boardgamegeek users will not always accurately represent all board game players. People who have a boardgamegeek account and actively log plays and rate games are very likely to be more invested and informed on board games compared to the average board game player. A good demonstration of this bias can be seen in the list of most owned games on boardgamegeek.

Whilst the exact sales figures are hard to come by, it is generally agreed that the most popular (by ownership, not rating) board games include Chess, Monopoly, Risk, Scrabble, Pictionary, Cluedo, Trivial Pursuit, etc. (sources: 1, 2, 3, 4). All of these games are under-represented in the BGG dataset due to the aforementioned bias. There will be many other cases in the analysis where this bias is likely having an effect, but I’ll address them as they come.

A golden age of board games

There has been a lot of discussion suggesting that we are currently in a board game golden age (sources: 1, 2, 3). I thought it would be interesting to see if the data supports this view.

Board game publication rate over time

Historically, there has been a broadly exponential increase in board games coming out each year.
Based on the exponential growth of board game publications observed so far, we expect the number of board games published over the course of a year to double every 12.6 years. This is the board games analogue of Moore’s law.
We are currently observing the release of around 3,500 new board games every year, and that number is increasing by around 5.7% each year.
The growth of the industry appears to have stagnated between the mid 1980s and late 1990s. There was, however, a disproportionate surge in growth of number of board games released per year from 1999 to 2005 that made up for the stagnation observed in the previous years. It’s not entirely clear to me why the stagnation or surge occurred during those years, but given that the transition between the stagnation and surge aligns with the release of the boardgamegeek website (first available in 2000), it’s possible that these changes in new games published per year are an artefact in the data due to the availability of boardgamegeek (i.e. obscure board games before the existence of boardgamegeek may have been lost to the sands of time, whereas after the existence of boardgamegeek, it’s more likely that obscure games will still make it to the database).
It is worth remembering that this is just referring to new games, and it doesn’t even include expansions!
Overall, whilst the number of new board games released per year currently appears to be very high, it is currently in near perfect alignment with what’s expected given the historic growth trends of the board game industry. There is nothing particularly unique or different about the rate of release of new board games to support the view that we’re currently in a board game golden age. However, the rate of release of board games is just one of many aspects that could lead people to believe that we’re in a board game golden age. We can have a look at some other factors too.
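The doubling time and the annual growth rate quoted above are two ways of expressing the same exponential trend. A small sketch of the conversion between them (the function names are my own, and this is not the fitting code from the notebook):

```python
import math


def doubling_time(annual_growth):
    """Years needed to double at a constant fractional growth per year."""
    return math.log(2) / math.log(1 + annual_growth)


def annual_growth(doubling_years):
    """Constant fractional growth per year implied by a doubling time."""
    return 2 ** (1 / doubling_years) - 1


print(f"{annual_growth(12.6):.1%}")   # growth implied by a 12.6 year doubling time
print(f"{doubling_time(0.057):.1f}")  # doubling time implied by 5.7% growth
```

A doubling time of 12.6 years corresponds to growth of roughly 5.7% per year. Going the other way from the rounded 5.7% figure gives about 12.5 years, with the small difference coming from rounding.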

Board game ratings by publication year

This data suggests that board games have been getting steadily better since around 2002, but that there’s been a disproportionate improvement in game ratings in the last couple of years. While the last couple of years certainly appear to have seen the release of great games such as Pandemic Legacy, Gloomhaven, Scythe, etc., it’s not clear to me what caused the games to get disproportionately better in that period. Perhaps the consumer market for board games increased in size noticeably, leading to more resources poured into game development, but unfortunately, I don’t have the data to test this.

It’s also worth noting that there may be an element of ratings inflation present in this data i.e. the baseline rating for an average game has increased, because people might have perceived a rating of 5 to be average a decade ago, but now they might perceive a rating of 7 to be average, so average ratings could be increasing over time despite games not necessarily getting better.

Other industry trends over time

Complexity

The “complexity” of a board game is a relatively loosely defined term, since it encompasses different types of learning and decision making characteristics involved in learning how to play as well as playing a game. To give a quick example, Chess and Go are relatively simple games in terms of their rules. In both cases, the rules can all be concisely explained and understood in a few minutes. However, getting a full grasp of all the strategies and tactics made possible by these simple rules can take a very long time, as there is a considerable amount of complexity born from the number of different moves possible each turn as well as the fact that every move affects the available possible moves in future turns (i.e. turns are not independent). Boardgamegeek contains a “weight” score for board games (rated by users) that provides a reduced, all-encompassing sense of the complexity of a game, based on users’ perception. We can look at how the complexity scores of board games have evolved based on when games were released. I’ve focussed on games released after 1995, since the set of games with enough weight ratings starts to get a bit thin before then.

It appears that board games have not only been getting better in terms of ratings, but also more complex since the mid 1990s. The trends in the complexity mirror the trends we see in overall ratings, with games appearing to have gotten steadily more complex since the early 2000s, and the last couple of years exhibiting a disproportionate growth in complexity.

The parallels between the trends in overall ratings and complexity beg for a more direct comparison between them, but that’s a fairly substantial topic in itself, so I’ll address that in a future post of this series where the analysis has more room to breathe.

Mechanics

Mechanics are the basic constructs of rules or methods that allow you to interact with a board game to allow gameplay. These can be simple things such as dice rolling or drawing (e.g. Pictionary), to slightly more involved cases such as Card drafting or Route/Network building. I’ve listed the mechanics on BGG below by how often they’re found in board games. Many of the mechanics’ names are intuitive, but some require more explanation. Descriptions and examples of all mechanics can be found here.

Themes

The BGG classification taxonomy for games is a little odd. At the highest level, they contain a game Type classification (e.g. Strategy, Thematic, Wargame, Party, etc.). There aren’t very many classification types, and more than 75% of games on BGG don’t even contain a Type classification, so I won’t be analysing it in this post, but if you’re interested in looking at the most prevalent game types, etc., they can be found in the raw analysis Jupyter notebook file. The next level in the taxonomy is Category. A cursory look at the values in this field will show that it’s a disorganised mix of themes, game “types” and mechanics (as an example, Codenames contains the Categories: Party Game, Card Game, Word Game, Deduction, Spies/Secret Agents). I’ve manually filtered the list of tags in the Category classification level down to only include themes, since they were the elements that appeared most consistently under the Category classification. BGG also contains further levels of classification taxonomy, but I’ll only be looking into themes derived from tags in the Category field for this analysis.

The most popular themes in BGG are:

Similar to the mechanics, we can also see which themes have become more/less popular over the last couple of decades.

Looking at the most popular themes, there is a big shift occurring in the dominant themes present in games. In fact, 8 out of the top 10 most popular themes are either in the top 5 most rapidly rising or top 5 most rapidly declining lists, suggesting a strong and clear shift in the themes that engage the current generation of board game players. The themes that are on the rise include Fantasy, Science fiction and Fighting, whereas Trivia, Movies / TV / Radio theme, Sports, Racing and Economic are all on the decline. This seems to suggest that the themes that capture our imagination today are slightly less grounded in reality (at least compared to 20 years ago).

More analysis on the BGG dataset can be found in part II (Coming soon) of this series of analysis posts, in which we explore the relationship between mechanics, themes and game ratings.

I’ve left out a lot of smaller bits of analysis that didn’t fit into the structure of this write-up, but if you’re interested in learning more about the board game dataset, you can find all of my analysis, including the code, here in the form of a Jupyter Notebook (Python).

Thanks to:

Colm Seeley for introducing me to the world of modern board games, countless discussions and ideas on interesting things to do with the dataset, helping me structure the analysis and for providing some manually collected data. I look forward to co-presenting this analysis with him in the future.
Catherine Maddox for great feedback on the writing and presentation of the post
Yihui Fan and Hugh Thompson for helpful feedback on the clarity and aesthetics of the graphs.
GitHub user TheWeatherman for creating the BGG scraper that I modified for this analysis.

If you enjoyed reading this, you may also enjoy:

Boardgamegeek dataset on Kaggle, with multiple users’ analysis on the dataset
Are Boardgames Getting Better? An Empirical Analysis by Opinionated Gamers
By the Numbers - BGG Rank Data + Analysis by Oliver Kiley

TV show episode ratings

2016-05-28T22:00:00+01:00

This post is about a simple visualisation of the episode ratings of TV shows. The idea behind this is heavily borrowed from Graph TV. I use that site often and really like it, but the plots it generates are based on IMDb rating data. I’ve always wanted something similar but using Trakt.tv rating data instead, so I decided to write a script to do just that.

Below are the episode ratings for the top 10 most popular shows, according to Trakt.tv. The plots are interactive. You can hover over a point to get more information on the episode or pan/zoom on the data using the tools on the bottom left of each plot.

I will likely create a small web app to make it easier to generate the plots online for any tv show at some stage in the future, but if anyone is interested in generating similar plots for other shows now, the Python code to generate the plots is available here on GitHub. A Jupyter notebook with the code can also be found here.

A Song of Ice and Fire : Chapter ratings

2016-04-10T23:12:00+01:00

This post relates to Game of Thrones, or more specifically to the series of books the show is based on: A Song of Ice and Fire.

The website Tower of the Hand contains ratings for each chapter in the series of books. Chapters’ ratings are generated by users. Each chapter has ratings from typically around 150 people so there will still be a reasonable amount of uncertainty around each chapter rating, but there is still enough information in here to give us broad ideas about the overall progression in how interesting the books are, the most interesting books and the most interesting POV characters.

Let’s start by having a quick overview of the progression of chapter ratings across the entire series.

Chapter ratings in the entire series

This overview suggests that:

A Game of Thrones is fairly consistent in its chapter ratings
The final quarter of A Clash of Kings is comparatively dull
A Storm of Swords gets better as the book progresses
A Feast for Crows is not as good as the other books, but it gets better as the book progresses
A Dance with Dragons is the most inconsistent in terms of chapter ratings

Having read the books, I’m inclined to agree with the overview provided by the ratings so far.

Let’s break the chapters down by book to have a slightly closer look at the ratings.

Chapter ratings by book

Chapter ratings by POV character

It can be interesting to break down the chapter ratings by the point of view characters to see how the various plot lines progress in terms of maintaining reader interest.

Daenerys : With the exception of one strong chapter, Daenerys’ chapters in the final book are not very good.
Ned : As we all know, he was a short lived character, but his chapters were consistently great
Brienne : Many people complain about Brienne’s chapters in AFFC. It’s interesting to see that Brienne’s chapters start out being dull, but appear to get more interesting as the book progresses.
Tyrion : Goes from having several strong chapters in the previous books to having a weak showing in ADWD. The drop in the quality of his storyline is particularly jarring considering the strength of his chapters at the end of ASOS.
Theon/Reek : One of the few consistently solid POV characters in ADWD

Chapter rating distributions by book

Book	mean	std. dev.
AGOT	8.21	0.56
ACOK	7.75	0.70
ASOS	7.99	0.63
AFFC	7.55	0.52
ADWD	8.03	0.69

If we rank the books by the average ratings of the chapters in each book, they rank in the order AGOT > ADWD ≈ ASOS > ACOK > AFFC. The overall book ratings on Goodreads, however, suggest that ASOS > AGOT > ACOK > ADWD > AFFC. Personally, my views on the quality of the books are more aligned with the Goodreads ratings, but it’s likely because the overall experience of a book is not well represented by the average of its chapters.
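The ranking by mean chapter rating can be reproduced directly from the table above. This is just a quick sketch using the tabulated values; the variable names are my own.

```python
# (mean, standard deviation) of chapter ratings per book, copied from the table
chapter_stats = {
    "AGOT": (8.21, 0.56),
    "ACOK": (7.75, 0.70),
    "ASOS": (7.99, 0.63),
    "AFFC": (7.55, 0.52),
    "ADWD": (8.03, 0.69),
}

# Sort the books from highest to lowest mean chapter rating
ranked = sorted(chapter_stats, key=lambda book: chapter_stats[book][0], reverse=True)
print(" > ".join(ranked))

# Gap between ADWD and ASOS, small compared with either standard deviation
gap = chapter_stats["ADWD"][0] - chapter_stats["ASOS"][0]
print(round(gap, 2))
```

The gap between ADWD and ASOS is far smaller than the spread of chapter ratings within either book, which is why the two are treated as roughly equal in the ranking.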

Chapter rating distributions by POV character

We can also have a look at the distributions of chapter ratings for each POV character to see which of the POV characters have the better chapters.

The distributions are ranked by average chapter rating, with the highest average on the left. The top few POV characters are all characters with a single POV chapter so far. Of the characters that have multiple POV chapters, Ned Stark has the most interesting chapters, which goes some way towards explaining why he’s such a fan-favourite character.

If you have any ideas about what might be interesting to do with this dataset, let me know in the comments. The Jupyter notebook that was used to generate all the plots in this blog post can be found here.

Solving the 8 Queens problem with python

2016-03-28T00:45:00+01:00

This is my approach to solving the 8 Queens puzzle with Python.

For anyone unfamiliar with the 8 Queens puzzle, it is the problem of placing eight queens on a standard (8x8) chessboard such that no queen is in a position that can attack any other. This post will have the solutions to the puzzle, so if you’d like to attempt to solve it on your own, now would be a good time to stop reading this post.

I was first made aware of the existence of this puzzle in a pub one evening with some friends. One of my friends started trying to solve the puzzle manually and found a solution in about 10 minutes. This inspired me to attempt to tackle the problem with Python to see if I would have been able to find a solution faster. It took me around 15 minutes to solve the puzzle using Python, but the script found 92 solutions (there are 12 if you eliminate symmetrically related solutions).

The original code I wrote to solve the problem looked like this:

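from itertools import permutations, combinations

# Board size n (enter 8 for the classic puzzle)
text = input('How big is your chess board?')
n = int(text)
x = range(1, n+1)  # row numbers 1..n, one queen per row

def is_diagonal(point1, point2):
    # True if the two squares lie on the same diagonal,
    # i.e. the line joining them has a gradient of +1 or -1
    x1 = point1[0]
    y1 = point1[1]
    x2 = point2[0]
    y2 = point2[1]
    # x1 never equals x2 here, since every queen sits on its own row
    gradient = (y2-y1)/(x2-x1)
    if gradient == -1 or gradient == 1:
        return(True)
    else:
        return(False)

# Build every arrangement with one queen per row and one per column:
# each permutation gives the column (y) of the queen on each row (x)
list_of_permutations = []

for permutation in permutations(range(1, n+1)):
    y = permutation
    all_permutations = list(zip(x,y))
    list_of_permutations.append(all_permutations)

# Print only the arrangements in which no pair of queens shares a diagonal
for possible_solution in list_of_permutations:
    solutions = []
    for piece1, piece2 in combinations(possible_solution, 2):
        solutions.append(is_diagonal(piece1, piece2))

    if True not in solutions:
        print(possible_solution)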

I’ve since expanded it to make it easier to understand, abstracted some useful functions and added some code to remove equivalent solutions and help visualise the solutions, but the code above contains the main logic that runs at the heart of the approach I took. The expanded version of the code can be found here.

Let’s break it down a little bit to explain what’s happening.

We know that no two queens can attack each other. This means that there must be 1 queen per row. Similarly, there must be 1 queen per column. In this approach, we’re going to take 8 queens, assign them to the rows 1-8 and determine what columns they must each be in for the puzzle criteria to be satisfied. Since there are 8 queens and 8 column positions, there are 40,320 (nPr with n=r=8) ways to arrange 8 queens on a chessboard such that there is 1 queen per row and 1 queen per column. Since we already know that none of the queens will be attacking any other horizontally or vertically, all we need to do is to check each of the 40,320 arrangements to see if any queen is diagonally threatening any other. This takes about a second to run in total (1.06 seconds on my mid-range 5-year-old desktop computer) for all 40,320 possible queen arrangements and returns 92 solutions that fit the criteria of having no queen attacking any other. Some of these will be symmetrically related. For example, here are 8 solutions from the set of 92 that are related to each other through 90 or 180 degree rotations, or through mirror planes (i.e. they are horizontal, vertical or diagonal mirror images of each other).
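The counts quoted above can be checked with a compact sketch of the same idea. This is not the original script: it swaps the gradient test for an equivalent comparison of row and column differences (which avoids floating point division), and the function names are my own.

```python
from itertools import permutations, combinations
from math import factorial


def on_same_diagonal(queen1, queen2):
    """Two (row, column) squares share a diagonal when the row gap equals the column gap."""
    (row1, col1), (row2, col2) = queen1, queen2
    return abs(row1 - row2) == abs(col1 - col2)


def count_solutions(n):
    """Count arrangements with one queen per row and column and no diagonal clashes."""
    count = 0
    for columns in permutations(range(1, n + 1)):
        queens = list(enumerate(columns, start=1))  # (row, column) pairs
        if not any(on_same_diagonal(a, b) for a, b in combinations(queens, 2)):
            count += 1
    return count


print(factorial(8))        # number of one-per-row, one-per-column arrangements
print(count_solutions(8))  # number of arrangements with no diagonal attacks
```

For n = 8 this checks 8! = 40,320 arrangements and finds 92 solutions, matching the numbers above.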

When we remove the solutions that are related, we are left with the 12 unique solutions for the 8x8 board case, shown below:

The Jupyter notebook containing the current version of the code is available here.
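For readers who want the gist without opening the notebook, here is one possible way to remove the symmetrically related solutions. It is a sketch under my own assumptions, not necessarily how the notebook does it: each solution is a tuple giving the column of the queen in each row (0-indexed), and each solution is reduced to the smallest of its 8 rotated and mirrored variants.

```python
from itertools import permutations, combinations


def solve(n):
    """All n-queens solutions as tuples, where cols[row] is that row's queen column."""
    return [
        cols for cols in permutations(range(n))
        if all(abs(c1 - c2) != abs(r1 - r2)
               for (r1, c1), (r2, c2) in combinations(enumerate(cols), 2))
    ]


def symmetries(cols):
    """The 8 variants of a solution: 4 rotations, each with its left-right mirror image."""
    n = len(cols)
    points = list(enumerate(cols))
    variants = []
    for _ in range(4):
        points = [(c, n - 1 - r) for r, c in points]    # rotate by 90 degrees
        mirrored = [(r, n - 1 - c) for r, c in points]  # mirror left-right
        for pts in (points, mirrored):
            # Sorting by row turns the point list back into a column tuple
            variants.append(tuple(c for row, c in sorted(pts)))
    return variants


def unique_solutions(n):
    """One representative (the smallest variant) per family of related solutions."""
    return sorted({min(symmetries(solution)) for solution in solve(n)})


print(len(solve(8)), len(unique_solutions(8)))
```

For the 8x8 board this reduces the 92 solutions to the 12 unique ones. Picking the minimum variant is just a convenient way to make every member of a family map to the same representative.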

Thanks to my friends:

Daniele Tomerini for introducing me to this puzzle
Hugh Thompson, whose attempts at solving this puzzle manually inspired me to tackle it using python