Analysis of a WhatsApp chat log

Posted on Sun 25 October 2020 in Analysis

Introduction

With the ever-increasing role of technology and software services in our modern lives, we’re passively creating an increasingly large digital footprint. Our browsers store our history by default. Our favourite map apps store our location history by default. Our banks keep records of our transactions. Our phones and smart-watches keep track of the number of footsteps we take every day, along with heart rate and various other bits of biometric data. YouTube, Netflix, Spotify, Amazon video and similar services all keep track of our media consumption history. There are also many other apps or services that keep track of a myriad of other interesting habits and behaviours.

I personally enjoy downloading copies of my digital footprint and analysing them. I think there’s a lot that can be learned from examining my own behaviours and patterns to help me reflect on some unconscious choices I’m making and help me obtain a better and more objective understanding of myself. It also serves as a useful digital diary since I can automatically collate data from several sources nightly to give me a fairly good data-driven summary of what I was doing on any given day. To that end, I’ve written several scripts across the past 6 years to download and analyse my digital footprint from a range of different sources.

In this blog post, I’ll be talking specifically about the output from a script I wrote to analyse WhatsApp chat logs. You can run this analysis on your own chat logs by running the Python Notebook which can be found here. If all that looks or sounds too technical for you, I created a WebApp where you can drag and drop a WhatsApp chat log to run a slightly more limited version of this analysis (I’ve had to remove some of the more memory-intensive parts due to memory constraints in the cloud platform I’m using to host this). There are more detailed instructions on how to replicate this work on your own chat logs towards the end of this blog post.

Sample analysis results

Here are the results from an analysis of an anonymised chat log I’m part of (with permission from the relevant group chat members) to showcase some interesting things that could be done with WhatsApp chat log data and the types of insights that could be gained relatively easily with some basic Natural Language Processing (NLP).

Summary table

4,955 total messages from 4 people, from 2016-12-07 to 2020-10-20

Metric                             Isambard   Lysander    Perseus  Seraphina
Contribution
  Total N messages                    1,921        942      1,035      1,057
  Total N words                      19,979     12,325     14,062     11,390
  Total N characters                101,607     63,472     72,766     57,491
Message type
  Text                                94.4%      96.3%      92.2%      94.1%
  Media                                2.2%       2.7%       5.1%       3.8%
  Link                                 3.4%       1.0%       2.7%       2.1%
Message content
  Sentences per message                1.40       1.57       1.38       1.41
  Words per message                    11.1       13.6       14.8       11.6
  Characters per message               54.1       69.2       74.1       56.5
  Messages containing emoji            2.9%       0.2%       0.2%       2.5%
  Messages containing profanity        1.9%       0.0%       5.6%       0.4%

The summary table is already a pretty useful overview of the chat log, but we can visualise the data and delve slightly deeper into the patterns in the chat log.

Message types

We can start with some plots on the types of messages sent to visualise who favours media messages (audio, video, images, or GIFs) and who tends to share external links in the chat.

Message contribution

We can plot the overall message contribution from each person.



The last plot above contains a few notable literary works to contextualise word counts better. I’ve included a longer list of reference literary works in the code that generates this plot, so if you re-run this analysis on your own chat logs, you will likely see a different set of reference works, chosen to suit the magnitude of the word counts in the chat log being analysed.

Contribution over time

We can also look at when each person has contributed to the conversation. This can be done as either a calendar view…

…or can also be viewed as a timeseries


The last plot can be represented as a relative plot rather than an absolute plot if you want to see who has been contributing more or less relative to everyone else over time.

Another thing we can do with the timeseries data that could be of interest is group the activity by day of week and time of day to look at daily and weekly patterns.

Note that the curves in the Activity by time of day plot are measured every minute across the 24 hours and smoothed with a Gaussian convolution. The smooth curves are not an artifact of interpolation.
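To illustrate the smoothing step, here is a minimal sketch rather than the notebook's actual code. It assumes each message has already been reduced to its minute of the day (0 to 1439). The function name, the 30-minute kernel width and the wrap-around at midnight are all illustrative assumptions.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60


def smoothed_daily_activity(minutes_of_day, sigma_minutes=30.0):
    """Count messages per minute of the day, then smooth with a Gaussian.

    sigma_minutes is an assumed kernel width, not a value from the post.
    """
    minutes = np.asarray(minutes_of_day, dtype=int) % MINUTES_PER_DAY
    counts = np.bincount(minutes, minlength=MINUTES_PER_DAY).astype(float)

    # Build the kernel on circular distance so 23:59 and 00:00 are neighbours.
    offsets = np.arange(MINUTES_PER_DAY)
    circular_distance = np.minimum(offsets, MINUTES_PER_DAY - offsets)
    kernel = np.exp(-0.5 * (circular_distance / sigma_minutes) ** 2)
    kernel /= kernel.sum()  # normalise so the total message count is preserved

    # Circular convolution of the counts with the kernel, computed via FFT.
    return np.real(np.fft.ifft(np.fft.fft(counts) * np.fft.fft(kernel)))
```

Because the counts are taken at every minute before smoothing, the resulting curve has a value at each of the 1,440 minutes without any interpolation between sparse points.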

Conversation Dynamics

We can gain some insight into the group conversation dynamics by looking at who tends to respond to each user. This doesn’t take into account the content of the message to infer which message is being responded to. It is based purely on who the previous message was from every time a message is sent. As such, it can include replies to oneself. Contiguous messages from one person have been excluded if they are posted within 3 minutes of the previous message, as this is deemed to effectively be a single message split across multiple messages.
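The counting rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: messages are assumed to be (timestamp, sender) pairs already sorted by time, "within 3 minutes" is read as less than or equal to 3 minutes, and the function and parameter names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_responses(messages, merge_window=timedelta(minutes=3)):
    """Count who responds to whom.

    messages: list of (timestamp, sender) tuples sorted by timestamp.
    Returns a Counter keyed by (responder, previous_sender).
    """
    responses = Counter()
    previous = None  # (timestamp, sender) of the immediately preceding message
    for timestamp, sender in messages:
        if previous is not None:
            prev_time, prev_sender = previous
            # A quick follow-up from the same person is treated as a
            # continuation of one message, so it is not counted as a response.
            is_continuation = (
                sender == prev_sender and timestamp - prev_time <= merge_window
            )
            if not is_continuation:
                responses[(sender, prev_sender)] += 1
        previous = (timestamp, sender)
    return responses


# Made-up example data, purely for illustration.
start = datetime(2020, 10, 20, 9, 0)
example = [
    (start, "A"),
    (start + timedelta(minutes=1), "A"),   # continuation, not counted
    (start + timedelta(minutes=2), "B"),   # B responds to A
    (start + timedelta(minutes=10), "B"),  # B responds to B (self-reply)
    (start + timedelta(minutes=11), "A"),  # A responds to B
]
example_counts = count_responses(example)
```

Note that a later message from the same person still counts as a reply to oneself once it falls outside the 3-minute window.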

In a similar vein, we can also examine the response times for each user.

Grammatical and linguistic preferences

Most of the analysis above looks at patterns in when people message and a very rough overview of the nature of the messages. It can also be insightful to examine the content of the messages more closely to identify people’s grammatical and linguistic preferences. Below are a few examples of the types of insights that can be obtained by parsing and analysing the content of the messages.

Let’s start by looking at the use of punctuation, emoticons (Emoji) and profanity.

Note: Profanity detection is done using the profanity-check library.

We can also look at the distribution of word lengths from each person.

In this instance it turns out not to vary all that much across users, largely due to the composition of the group, but that is not always the case.

An interesting analysis we can perform is to compare the relative usage frequency of different words from each user against the natural prevalence rate of those words in the English language (based on occurrence in the web on websites in English) to find the words that each user uses disproportionately often.
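A minimal sketch of this comparison is shown below. It is not the notebook's implementation: the reference frequencies are passed in as a plain dictionary (the values in the example are made-up placeholders, not real prevalence data), and the function name, the minimum-count filter and the floor for unknown words are all illustrative assumptions.

```python
from collections import Counter


def overused_words(user_words, reference_freq, min_count=2, floor=1e-8):
    """Rank a user's words by how far their usage exceeds a reference rate.

    user_words: list of word tokens from one person.
    reference_freq: dict mapping word -> relative frequency in general English.
    floor: assumed frequency for words missing from the reference.
    """
    counts = Counter(word.lower() for word in user_words)
    total = sum(counts.values())
    ratios = {
        word: (n / total) / reference_freq.get(word, floor)
        for word, n in counts.items()
        if n >= min_count  # ignore words seen only once or twice by chance
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Placeholder frequencies, invented for this example only.
toy_reference = {"the": 0.05, "quokka": 1e-7, "cat": 1e-4}
toy_words = ["the", "the", "the", "quokka", "quokka", "cat"]
ranked = overused_words(toy_words, toy_reference)
```

A ratio well above 1 means the person uses the word far more often than it appears in the reference corpus, which is what makes it a candidate for their "signature" vocabulary.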

Using the same concept of words’ natural prevalence rate in English-language websites above, we can determine the average log(natural prevalence frequency) of all words used by each person. This is a measure of how obscure/niche the words used by each person are, with a lower average log(frequency) indicating more frequent use of rare words.
The plot above can be interpreted as some proxy for vocabulary complexity/specificity. An interesting measure to accompany that is vocabulary breadth. To do that, we can plot the cumulative unique word count against cumulative total word count for each user.

Early in the conversation, we expect most words used to be new. As the conversation continues, we expect the number of unique words to slowly tail-off as many of the words used in the conversation will be repeated words. Theoretically, if the conversation goes on infinitely long and the topics of conversation covered in the chat are exhaustive, we expect this curve to asymptote towards a value that represents some approximation of the scope of each person’s total vocabulary size. The plot above displays the early part of that curve. In practice, most WhatsApp conversations will be far too small to reach the vocabulary size asymptote, and will often have limited topics of conversation covered. Moreover, many people will likely modulate the tone and complexity of the language they choose to use in casual WhatsApp conversations in ways that make it unrepresentative of their true vocabulary breadth. These curves have been left in the analysis because I believe they are interesting, but it is important to emphasize that they are only broadly indicative of each person’s vocabulary scope specifically as observed in the particular chat log which, for many reasons, will not be representative of their true overall vocabulary size.
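The curve described above is straightforward to compute. The following is a minimal sketch, assuming each person's messages have already been split into a chronological list of word tokens; the function name is hypothetical and the tokenisation used in the notebook may differ.

```python
def vocabulary_growth(words):
    """Return the cumulative unique word count after each word.

    The index of each entry (plus one) is the cumulative total word count,
    so plotting the list against range(1, len(words) + 1) gives the curve.
    """
    seen = set()
    curve = []
    for word in words:
        seen.add(word.lower())  # treat "Hi" and "hi" as the same word
        curve.append(len(seen))
    return curve


growth = vocabulary_growth(["hi", "there", "hi", "Hi", "friend"])
```

The flattening of the list's values as repeated words accumulate is the tail-off discussed above.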

Finally, we can look into how similar the observed vocabulary is between the participants of the group. This can be done by taking all the unique words used by each person and comparing them to the unique words used by every other person. The vocabulary similarity between two people can be defined as the number of distinct words they have in common divided by the total number of distinct words used by either person. This is known as the Jaccard index (in set notation, it’s defined by |A∩B|/|A∪B|, where A and B are the sets of words used by each person).
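As a small worked sketch of the formula above (the function names and the choice to return 0.0 for two empty vocabularies are assumptions for illustration, not taken from the notebook):

```python
from itertools import combinations


def jaccard_index(words_a, words_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention when both vocabularies are empty
    return len(set_a & set_b) / len(union)


def pairwise_similarity(vocab_by_person):
    """Jaccard index for every pair of people in a {name: words} mapping."""
    return {
        (a, b): jaccard_index(vocab_by_person[a], vocab_by_person[b])
        for a, b in combinations(sorted(vocab_by_person), 2)
    }


# Two people sharing 2 of 4 distinct words have a similarity of 2 / 4 = 0.5.
similarity = jaccard_index({"tea", "cake", "rain"}, {"cake", "rain", "sun"})
```

The index ranges from 0 (no words in common) to 1 (identical vocabularies), and is symmetric in the two people being compared.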

Potential areas for future development

NLP is a rich and expansive field and there is plenty more that could be done with this dataset using NLP tools and techniques. Some ideas for potential future developments could be:

  • Automatic topic of conversation detection, to pick out who tends to favour discussions on particular topics and mapping when in time different topics get discussed
  • The comparison of vocabulary similarity between people could be improved (currently using Jaccard similarity based on the unique words used by each person)
  • Linguistic style profiles could be added by assessing vocabulary similarity or prose style similarity to external text datasets e.g. gossip magazines, tabloid newspapers, broadsheet newspapers, tech magazines, Victorian novels, scientific journal publications, legal documents, etc.

I may come back to implement these in the future, but if anyone would like to try to train any of these models, please let me know in the comments or DM me on Twitter. I’d be more than happy to collaborate or review pull requests on GitHub.

Generate your own chat log reports

If you’d like to analyse your own chat logs and create plots like the ones above, then follow the instructions below.

Extracting a WhatsApp chat log

In order to analyse your own chat logs, you’ll first need to export your own chat log from WhatsApp. These instructions are for exporting a chat log from an Android device. Doing it from an iOS device should be similar, but if in doubt, I’m sure there will be other instructions online telling you how to export a WhatsApp chat log from iOS.

Open any conversation on WhatsApp, click on the kebab icon (3 vertical dots icon on the top right of the conversation screen), then More... and Export chat. This should bring up a prompt on whether you want to export with or without media. Export the chat log without media. When the export is ready, it should bring up another selection of which app to use to deal with the files. Either select a file browser to save the files or any email client app and send the exported chat log to yourself. The export may contain multiple files. The one of interest will be named something like “WhatsApp chat with {group_name}.txt”. We’ll be using this in the next step.
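For the curious, here is a rough sketch of how such a text file could be read into structured messages. It is not the parser used in the notebook, and it rests on an assumption about the export layout: each message starts with a line shaped like "DD/MM/YYYY, HH:MM - Sender: text". The actual layout depends on the device, locale and WhatsApp version, so the pattern will likely need adjusting for your own export.

```python
import re
from datetime import datetime

# Assumed line layout: "DD/MM/YYYY, HH:MM - Sender: message text".
# This varies by locale and app version, so treat the pattern as a placeholder.
PREFIX = r"^(\d{1,2}/\d{1,2}/\d{4}), (\d{1,2}:\d{2}) - "
MESSAGE_LINE = re.compile(PREFIX + r"([^:]+): (.*)$")
TIMESTAMPED_LINE = re.compile(PREFIX)


def parse_chat_log(text):
    """Parse exported chat text into a list of message dictionaries."""
    messages = []
    for line in text.splitlines():
        match = MESSAGE_LINE.match(line)
        if match:
            date_part, time_part, sender, body = match.groups()
            timestamp = datetime.strptime(
                f"{date_part} {time_part}", "%d/%m/%Y %H:%M"
            )
            messages.append(
                {"timestamp": timestamp, "sender": sender, "text": body}
            )
        elif TIMESTAMPED_LINE.match(line):
            # Timestamped line with no "Sender:" part, e.g. a system notice.
            # (System notices that happen to contain a colon are not caught.)
            continue
        elif messages:
            # No timestamp at all: continuation of a multi-line message.
            messages[-1]["text"] += "\n" + line
    return messages
```

Calling parse_chat_log on the contents of the exported .txt file would give a list that the later analysis steps could work from, one dictionary per message.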

Analysing the chat log

If you’re comfortable with running Python scripts and notebooks, then the Jupyter (Python) notebook that I used for the analysis above can be found here. Clone the repository, install the dependencies (listed in the requirements.txt file), then place your chat log in the chat_logs folder and run the notebook. Remember to update the chat log file name in the notebook to the one you just added there. Once the script has finished running, the plots will be visible in-line in the notebook, but all plots are also saved in the outputs folder under a subfolder name that corresponds to the chat log name.

If you’re NOT comfortable with running Python scripts on your own, then I’ve set up a minimalist WebApp where you can upload or drag and drop your WhatsApp chat log and have it processed for you. It’s hosted on the free tier of a cloud platform and shared among all readers, so it may be quite slow, depending on how many people are using it at any given time. I’ve had to remove a couple of the plots for this WebApp due to memory constraints in the cloud platform. None of your data will be stored, so save a local copy of the output if you’d like to keep hold of it or want to share it with anyone. Attempting to share a link to the results by copying the URL will not work (since that would require a copy of the output to be saved).

Thanks to:

  • [REDACTED] : For letting me use an anonymised version of our group chat in this blog post.