In [1]:
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

State of the Union Addresses

In [2]:
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()

Recall that the speeches are separated by ***, so we can separate them by splitting the text on three asterisks.

In [3]:
records = text.split("***")

After the ***, a speech is formatted as


State of the Union Address
George Washington
January 8, 1790

Fellow-Citizens of the Senate and House of Representatives:

I embrace with great satisfaction the opportunity which now presents itself...

If we split each record on newlines (\n), we can extract the name and date easily. The remaining lines can be joined back together into one string containing the text of the speech.
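
As a minimal sketch of this idea, the toy string below mimics the layout shown above. The body text is invented placeholder content and `parse_record` is a hypothetical helper name, not something from the notebook. It assumes that, once surrounding whitespace is stripped, each record begins with the title, the name, and the date on consecutive lines.

```python
# Toy input that mimics the file layout; the speech bodies are placeholders.
sample = (
    "preamble\n***\n\n"
    "State of the Union Address\nGeorge Washington\nJanuary 8, 1790\n\n"
    "Fellow-Citizens of the Senate:\n\nFirst paragraph.\n***\n\n"
    "State of the Union Address\nJohn Adams\nNovember 22, 1797\n\n"
    "Gentlemen of the Senate:\n\nSecond speech.\n"
)

def parse_record(record):
    """Return (name, date, body) from one '***'-delimited record."""
    lines = record.strip().split("\n")
    # After stripping, line 0 is the title, line 1 the name, line 2 the date.
    name, date = lines[1].strip(), lines[2].strip()
    body = "\n".join(lines[3:]).strip()
    return name, date, body

# The first piece is whatever precedes the first '***', so skip it.
parsed = [parse_record(r) for r in sample.split("***")[1:]]
for name, date, body in parsed:
    print(name, "|", date, "|", body[:20])
```

Stripping before indexing makes the helper less sensitive to how many blank lines follow the separator than fixed positions would be.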

Trying with Regex

In [4]:
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()
print("Speeches", len(re.findall(r"Fellow-Citizens[\s\S]*?\*\*\*", text)))
Speeches 56
In [5]:
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()
print("Number of ***", len(re.findall(r"\*\*\*", text)))
Number of *** 227

String Munging

In [6]:
with open("data/stateoftheunion1790-2017.txt", "r") as f:
    text = f.read()
    
records = text.split("***")    

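def extract_parts(record):
    """Split one '***'-delimited record into [name, date, speech text]."""
    parts = record.split("\n")
    # The name and date sit on the fourth and fifth lines of each record,
    # after the leading blank lines and the title line.
    name = parts[3].strip()
    date = parts[4].strip()
    # Everything after the date is the body of the speech.
    body = "\n".join(parts[5:]).strip()
    return [name, date, body]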

df = pd.DataFrame([extract_parts(l) for l in records[1:]], 
                  columns=["Name", "Date", "Text"])
print("Length:", len(df))
df.head()
Length: 227
Out[6]:
Name Date Text
0 George Washington January 8, 1790 Fellow-Citizens of the Senate and House of Rep...
1 George Washington December 8, 1790 Fellow-Citizens of the Senate and House of Rep...
2 George Washington October 25, 1791 Fellow-Citizens of the Senate and House of Rep...
3 George Washington November 6, 1792 Fellow-Citizens of the Senate and House of Rep...
4 George Washington December 3, 1793 Fellow-Citizens of the Senate and House of Rep...

We perform a few simple text cleaning tasks. We convert characters to lower case, replace newlines with spaces, and replace every character that is not a letter or whitespace (punctuation and digits) with a space.

In [7]:
df['clean text'] = (
    df['Text']
    .str.replace("\n", " ")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
)
df.head()
Out[7]:
Name Date Text clean text
0 George Washington January 8, 1790 Fellow-Citizens of the Senate and House of Rep... fellow citizens of the senate and house of rep...
1 George Washington December 8, 1790 Fellow-Citizens of the Senate and House of Rep... fellow citizens of the senate and house of rep...
2 George Washington October 25, 1791 Fellow-Citizens of the Senate and House of Rep... fellow citizens of the senate and house of rep...
3 George Washington November 6, 1792 Fellow-Citizens of the Senate and House of Rep... fellow citizens of the senate and house of rep...
4 George Washington December 3, 1793 Fellow-Citizens of the Senate and House of Rep... fellow citizens of the senate and house of rep...
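
To make the effect of the cleaning steps concrete, here is a small sketch that applies the same three operations to a single invented string using the standard `re` module. The function name `clean` is a hypothetical helper, and the sample sentence is made up for illustration.

```python
import re

def clean(text):
    """Lowercase, turn newlines into spaces, and blank out non-letters."""
    text = text.replace("\n", " ").lower()
    # Anything that is not a-z or whitespace (punctuation, digits) becomes a space.
    return re.sub(r"[^a-z\s]", " ", text)

cleaned = clean("Fellow-Citizens of the Senate:\nIn 1790, we met.")
print(cleaned)
print(cleaned.split())
```

Because each removed character is replaced by a space instead of being deleted, hyphenated words such as "Fellow-Citizens" become two separate tokens, and digits such as years disappear from the vocabulary.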

We can then use sklearn to create a word vector for each speech, which contains the counts of all words in a speech. By all words, we mean the set of all unique words used across all 227 speeches. We can think of each word vector as a record, so we have 227 records and thousands of variables (word counts).

We can try to examine the relationship between speeches by reducing the dimensionality of the data. We take an approach that is a kind of Principal Component Analysis for word vectors. Specifically, we compare speeches via their word vectors after a TF-IDF weighting, which normalizes each count by the rarity of the word so that words appearing in many speeches contribute less.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['clean text'])
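
As a rough illustration of what the vectorizer produces, the sketch below fits `TfidfVectorizer` with its default settings on three invented mini "speeches". The documents and the names `toy_vec` and `toy_tfidf` are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented mini "speeches"; the real input is df['clean text'].
docs = [
    "the union is strong",
    "the economy is strong",
    "the treaty with spain",
]
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(docs)

# One row per document, one column per distinct word in the corpus.
print(toy_tfidf.shape)

vocab = toy_vec.vocabulary_  # maps each word to its column index
row = toy_tfidf[0].toarray().ravel()
# "the" appears in every document, so it gets a lower weight than "union".
print(row[vocab["the"]] < row[vocab["union"]])
```

The result is a sparse matrix, which matters for the full corpus because most speeches use only a small fraction of the overall vocabulary.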

This gives us a sparse matrix with one row per speech and one column per word in the vocabulary, holding the TF-IDF weights. Then we use SVD to decompose the matrix and plot the first two column vectors of the resulting decomposition (these are similar in nature to the first two principal components).

In [9]:
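np.random.seed(42)
# Import svds explicitly; `import scipy as sp` alone may not load scipy.sparse.linalg
from scipy.sparse.linalg import svds
# Truncated SVD of the sparse TF-IDF matrix, keeping two singular vectors
(u, s, vt) = svds(tfidf, k=2)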
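
The following sketch shows the same truncated SVD call on a small invented matrix, so the shapes of the outputs are easy to inspect. The matrix values and the `_toy` names are made up for illustration. Since the SciPy documentation does not guarantee the order of the singular values returned by `svds`, the sketch sorts them explicitly.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Small invented document-term matrix: 5 "documents" by 6 "terms".
toy = csr_matrix(np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0, 2.0, 2.0],
]))

# k must be smaller than both dimensions of the matrix.
u_toy, s_toy, vt_toy = svds(toy, k=2)

# Sort the components so the largest singular value comes first.
order = np.argsort(s_toy)[::-1]
u_toy, s_toy, vt_toy = u_toy[:, order], s_toy[order], vt_toy[order, :]

print(u_toy.shape)  # one 2-D coordinate per document
print(s_toy)
```

Each row of `u_toy` gives a two-dimensional coordinate for one document, which is the role the columns of `u` play when they are plotted as x and y.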
In [10]:
df['Year'] = df['Date'].str[-4:].astype('int')
df['x'] = u[:,0]
df['y'] = u[:,1]
In [11]:
df['clean text'].head()
Out[11]:
0    fellow citizens of the senate and house of rep...
1    fellow citizens of the senate and house of rep...
2    fellow citizens of the senate and house of rep...
3    fellow citizens of the senate and house of rep...
4    fellow citizens of the senate and house of rep...
Name: clean text, dtype: object
In [13]:
sns.lmplot(x = 'x', y = 'y', data = df, hue='Year', legend=False, 
           fit_reg=False, palette="Blues", aspect=1.6)
# Use separate names so the SVD factor `u` is not overwritten
(trump_x, trump_y) = df[df['Name'] == "Donald J. Trump"][['x', 'y']].values[0]
plt.plot(trump_x, trump_y, '*', markersize=20, color="orange")
plt.savefig("SOTUspeeches.pdf")
In [14]:
!pip install plotly
Requirement already satisfied: plotly in /Users/nolan/anaconda3/lib/python3.7/site-packages (4.2.1)
Requirement already satisfied: retrying>=1.3.3 in /Users/nolan/anaconda3/lib/python3.7/site-packages (from plotly) (1.3.3)
Requirement already satisfied: six in /Users/nolan/anaconda3/lib/python3.7/site-packages (from plotly) (1.12.0)
In [15]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
In [16]:
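# seaborn returns RGB floats in [0, 1]; rgba() strings expect 0-255 channel values
colors = ["rgba({},{},{},1)".format(*(int(round(255 * ch)) for ch in c))
          for c in sns.color_palette("Blues", len(df))]
colors[-1] = "rgba(252,128,51,1)"  # highlight the most recent speech in orange
py.iplot([go.Scatter(x = df['x'], y = df['y'], mode='markers', marker=dict(color=colors), text=df['Name'])])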

Each point represents a speech. Notice that the speeches are roughly aligned chronologically. Older speeches are more similar to one another than recent speeches are. The most unusual speech is by Herbert Hoover. George Bush also has a few unusual speeches. Trump's speech is close to earlier speeches by Ronald Reagan, George Bush, and Bill Clinton.
