My First Data Science Project: A Word Cloud Comparing the 2020 Presidential Debates

Tory White · Published in The Startup · 7 min read · Nov 8, 2020

As a social scientist, I am drawn to the language people choose to express themselves. Language can implicitly reveal values, group identities, and even geographical location.

But can language reveal political affiliation too?

This question sparked my first data science project.

Of course, if I were actually trying to answer this question like a real scientist, I would gather a large sample of Republicans, Democrats, and non-affiliated voters and analyze their conversations based on the same prompts.

Instead, I thought it would be more fun to create a visual representation, aka a Word Cloud, and sample the language used by the two people nominated in 2020 to represent their parties on a national stage: Democrat Joe Biden and Republican Donald Trump.

Biden and Trump face off in the first presidential debate of 2020.

Before I begin, let me outline the path I took:

  1. Gathering the Data (aka the Transcript)
  2. Cleaning up the Data (Parsing through the Text)
  3. The Word Cloud
  4. The Results

Gathering the Data

I found a website called Rev.com that transcribes most political speeches in America. The quality of these transcripts is excellent: they are both accurate and reliable. I made sure of it by listening to clips of the speech on YouTube and reading through the text myself.

Next, I had to make the transcript "usable". This meant that the text had to be "parse-able" and in a format I could use. I originally planned on using HTML tags to find and download the text, but when I actually looked at the transcript, it was easier to just copy & paste the text into a text file on my computer.

HTML file of transcript
Text file of transcript
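For the HTML-tag route I ended up skipping, a standard-library sketch could have looked something like this (the markup below is invented for illustration, not Rev.com's actual structure):

```python
from html.parser import HTMLParser

class TranscriptParser(HTMLParser):
    """Collect the text inside <p> tags. The tag structure is an assumption."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that sits inside a paragraph tag.
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

page = "<div><p>Chris Wallace: Good evening.</p><p>Joe Biden: How you doing, man?</p></div>"
parser = TranscriptParser()
parser.feed(page)
```

In the end, copy & paste was simpler, but a parser like this would make the download repeatable for future debates.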

Cleaning up the Data

Next, I took a good look at the text so I could find patterns, spot anomalies, and get a general understanding of my dataset. (This is always a good step to take in order to avoid major errors later.)

I knew I only wanted words, so I took out punctuation and extra spaces.

transcript = transcript.lower()
transcript = transcript.replace('"', ' ')
transcript = transcript.replace(",", " ")
transcript = transcript.replace('?', " ")
transcript = transcript.replace('!', " ")
transcript = transcript.replace(".", "")
transcript = transcript.replace("-", "")
transcript = transcript.replace("'", "")   # apostrophes
transcript = transcript.replace("\\", "")  # stray backslashes
transcript = transcript.split("\n")        # one entry per paragraph of the transcript
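As an aside, the whole chain of replace() calls could be collapsed into a single regular expression that swaps every character that is not a letter, digit, or whitespace for a space:

```python
import re

# One-pass alternative to the chain of replace() calls:
# lowercase, then replace all punctuation with a space.
raw = 'Well, the economy -- and I mean this -- is issue one!'
cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
words = cleaned.split()
```

Note that this drops apostrophes with a space too, so "it's" becomes two tokens; whether that matters depends on how you want contractions counted.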

After scanning the text file, I noticed that each time a candidate spoke, the text started with their name (i.e. “President Donald J. Trump” or “Vice President Joe Biden”).

Transcript with highlighted candidate names

Thus, all I had to do was find the candidate names to extract their speeches. My code loops over each paragraph of the transcript and checks whether it begins with a candidate's name; if it does, that entire paragraph (the candidate's turn) is kept. (I extracted Chris Wallace's speech as well so it did not get mixed into a candidate's speech.)

speaker_list = []
namelen = len(name)
for x in transcript:
    if x[0:namelen + 1] == (' ' + name) or x[0:namelen] == name:
        speaker_list.append(x)

Each extracted line of text was then added to a new list.

Then, I joined the kept paragraphs together and split the result so that my code could work word-by-word rather than paragraph-by-paragraph.

speaker_list = ' '.join(speaker_list).split()
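To make the extraction concrete, here is a toy run on an invented mini-transcript (already lowercased and stripped of punctuation, as above):

```python
# A toy, invented mini-transcript: one entry per paragraph.
transcript = [
    "vice president joe biden folks the fact is",
    "president donald j trump wrong wrong",
    "vice president joe biden here is the deal",
]

def find_speaker_words(name):
    # Keep each paragraph that begins with the speaker's name,
    # then split the kept text back into individual words.
    kept = [x for x in transcript if x.startswith(name)]
    return ' '.join(kept).split()

words = find_speaker_words("vice president joe biden")
```

The two Biden paragraphs (eight words each) come back as a single sixteen-word list, ready for counting.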

Next, I picked a stopwords list I found through Google. A stopwords list is a catalog of all the "fluff" words ("the", "an", "are") that do not carry much meaning by themselves. Since they hold little meaning, I did not want to count them for my Word Cloud.

So using a stopwords list, I looped over each word in the candidates' text and filtered out the stopwords. If a word in the candidates' text was not in the stopwords list, it was added to a new list called "stripped_speech".

stripped_speech = []
for i in speaker_list:
    if i not in stopwords:
        stripped_speech.append(i)
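One pitfall worth flagging: if the stopwords file is read in as one big string, the `in` test above does substring matching, so short words like "he" get dropped just because they appear inside stopwords like "the". Loading the list into a set gives true whole-word membership tests. A small sketch with an invented stopword set:

```python
# A tiny illustrative stopword set; a real list has a few hundred entries.
stopwords = {"the", "an", "are", "is", "a", "of", "and"}

speaker_list = ["the", "economy", "is", "an", "issue", "of", "importance"]
# Set membership checks whole words, not substrings.
stripped_speech = [w for w in speaker_list if w not in stopwords]
```

A set also makes each lookup constant-time, which adds up over a full debate transcript.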

Then, I iterated over the “stripped_speech” list and counted how often each word appeared in the list.

speaker_count = {}
for word in stripped_speech:
    if word in speaker_count:
        speaker_count[word] = speaker_count[word] + 1
    else:
        speaker_count[word] = 1
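The same tally can be produced in one call with collections.Counter, which is also what the next step uses:

```python
from collections import Counter

# Counter performs the whole counting loop in one call.
stripped_speech = ["jobs", "economy", "jobs", "covid", "jobs", "economy"]
speaker_count = Counter(stripped_speech)
top_two = speaker_count.most_common(2)
```

most_common(n) then returns the n highest-frequency (word, count) pairs, sorted from most to least frequent.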

Finally, I printed only the 30 most common words from the new list.

d = collections.Counter(speaker_count)
tot = len(speaker_list)  # total word count, taken before stopword removal

top_30 = d.most_common(30)

print(name + " spoke a total number of " + str(tot) + " words.")
for word, count in top_30:
    print(word, ": ", count)

The Word Cloud

Luckily, there is already a Python library called wordcloud that does the heavy lifting (see the code below).

wc = WordCloud(background_color="white", max_words=1000, mask=char_mask,
               stopwords=stopwords, contour_width=3, contour_color='lightgrey')

This code spits out a generic Word Cloud that looks fine, but I wanted to add an artistic flair. I decided I wanted my Word Cloud to be pasted over the profiles of Donald Trump and Joe Biden.

In order to do this, the image backgrounds had to be pure white (r=255, g=255, b=255). Word Cloud treats pure-white pixels as masked-out background, so it needs them to distinguish the image outline.
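As I understand the wordcloud docs, a mask pixel is "masked out" only when every channel equals 255. A quick NumPy sketch of that check, on a toy 2x2 image invented for illustration:

```python
import numpy as np

# Toy 2x2 RGB "image": top row pure white (masked out by wordcloud),
# bottom row dark (words get drawn inside this region).
char_mask = np.array([
    [[255, 255, 255], [255, 255, 255]],
    [[30, 30, 30], [30, 30, 30]],
], dtype=np.uint8)

# A pixel counts as background only when all three channels are 255.
is_background = np.all(char_mask == 255, axis=-1)
```

Running a check like this on the Photoshopped profiles is a quick way to confirm the background really is (255, 255, 255) and not an off-white that the mask would treat as part of the shape.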

Since I could not find an image of Biden or Trump on a white background, I used Adobe Photoshop to extract their profiles.

Candidate profiles with white backgrounds

Then I converted the images into an array to allow the code to read the image.

char_mask = np.array(Image.open(picture))

Finally, I printed off the Word Clouds.

text = ' '.join(stripped_speech)
wc.generate(text)

plt.figure(figsize=[30, 20])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud of the First Presidential Debate 2020

The Results

As with any tool, a Word Cloud is limited in its application. Word Clouds are great visual representations of text and useful for preliminary exploration of data. Most of all, they can help you see a broad picture of your data. However, Word Clouds cannot take the place of deeper linguistic analysis, and they cannot explain the "why".

With that being said, these Word Clouds did prove a very interesting point. Just look at the difference between the 1st presidential debate and the 2nd presidential debate:

First Presidential Debate
Second Presidential Debate

Just from comparing these two debates, one can see that the 1st debate's Word Cloud looks less substantial. In other words, the 1st debate had fewer "meaningful" words, or fewer words in general.

The reason is that the Commission on Presidential Debates decided to mute the mics during portions of the 2nd debate.

These Word Clouds captured this change beautifully.

import collections
import numpy as np
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
#/////////////////////////////////
def find_speaker(name):
    # Keep every paragraph of the transcript that begins with this speaker's name.
    speaker_list = []
    namelen = len(name)
    for x in transcript:
        if x[0:namelen + 1] == (' ' + name) or x[0:namelen] == name:
            speaker_list.append(x)

    # Re-join and re-split so the rest of the code works word-by-word.
    speaker_list = ' '.join(speaker_list).split()

    # Filter out the stopwords.
    stripped_speech = []
    for i in speaker_list:
        if i not in stopwords:
            stripped_speech.append(i)

    # Tally how often each remaining word appears.
    speaker_count = {}
    for word in stripped_speech:
        if word in speaker_count:
            speaker_count[word] = speaker_count[word] + 1
        else:
            speaker_count[word] = 1

    d = collections.Counter(speaker_count)
    tot = len(speaker_list)  # total word count, taken before stopword removal

    top_30 = d.most_common(30)

    print(name + " spoke a total number of " + str(tot) + " words.")
    for word, count in top_30:
        print(word, ": ", count)

    # WORD CLOUD
    picture = name + ".jpg"
    char_mask = np.array(Image.open(picture))

    # Create a word cloud image
    wc = WordCloud(background_color="white", max_words=1000, mask=char_mask,
                   stopwords=stopwords, contour_width=3, contour_color='lightgrey')
    # Generate a wordcloud
    text = ' '.join(stripped_speech)
    wc.generate(text)
    # Show
    plt.figure(figsize=[30, 20])
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()
#/////////////////////////////////
# Program starts here
doc = open('FirstDebateTranscript.txt', 'r')
transcript = doc.read()
doc.close()

s = open('stopwords.txt', 'r')
stopwords = set(s.read().split())  # a set, so `in` tests whole words
s.close()

transcript = transcript.lower()
transcript = transcript.replace('"', ' ')
transcript = transcript.replace(",", " ")
transcript = transcript.replace('?', " ")
transcript = transcript.replace('!', " ")
transcript = transcript.replace(".", "")
transcript = transcript.replace("-", "")
transcript = transcript.replace("'", "")   # apostrophes
transcript = transcript.replace("\\", "")  # stray backslashes
transcript = transcript.split("\n")        # one entry per paragraph of the transcript
#//////////////////////////////////////
find_speaker("chris wallace")
