import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
Recall that the speeches are delimited by ***, so we can separate them by splitting the text on three asterisks.
records = text.split("***")
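Note that everything before the first *** is front matter rather than a speech, which is why we skip records[0] when building the table below. A quick check:

print("Chunks:", len(records))   # should be the number of speeches plus one
print(records[0][:60])           # peek at the leading front-matter chunk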
After the ***, a speech is formatted as follows:
State of the Union Address
George Washington
January 8, 1790
Fellow-Citizens of the Senate and House of Representatives:
I embrace with great satisfaction the opportunity which now presents itself...
If we split each record on newlines (\n), then we can extract the name and date easily. The remaining lines can be joined back into one string containing the text of the speech.
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
import re
print("Speeches", len(re.findall(r"Fellow-Citizens[\s\S]*?\*\*\*", text)))
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
import re
print("Number of ***", len(re.findall(r"\*\*\*", text)))
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
records = text.split("***")
def extract_parts(record):
    # After the "***", each record contains the title line ("State of the
    # Union Address"), the speaker's name, and the date, followed by the
    # text of the speech.
    parts = record.split("\n")
    name = parts[3].strip()
    date = parts[4].strip()
    text = "\n".join(parts[5:]).strip()
    return [name, date, text]

df = pd.DataFrame([extract_parts(r) for r in records[1:]],
                  columns=["Name", "Date", "Text"])
print("Length:", len(df))
df.head()
We perform a few simple text-cleaning tasks: we replace newlines with spaces, convert characters to lower case, and replace anything that is not a letter or whitespace (punctuation, digits, etc.) with a space.
df['clean text'] = (
    df['Text']
    .str.replace("\n", " ")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)  # keep only letters and whitespace
)
df.head()
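As a quick sanity check of the cleaning, we can compare a snippet of the raw and cleaned text for the first speech:

print(df.loc[0, 'Text'][:80])        # raw: mixed case, punctuation, newlines
print(df.loc[0, 'clean text'][:80])  # cleaned: lower-case letters and spaces only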
We can then use sklearn to create a word vector for each speech, containing the counts of all words in that speech. By all words, we mean the set of unique words used across all 226 speeches. We can think of each word vector as a record, so we have 226 records and thousands of variables (word counts).
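To make "word vector" concrete, here is a minimal sketch on a two-document toy corpus (our own illustration; the actual pipeline below uses TF-IDF weights rather than raw counts):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the senate and the house", "the house of representatives"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)        # sparse matrix: one row per document
print(cv.get_feature_names_out())     # shared vocabulary across both documents
print(counts.toarray())               # raw word counts, one row per document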
We can try to examine the relationships between speeches by reducing the dimensionality of the data. We take an approach that is akin to Principal Component Analysis for word vectors. Specifically, we compare speeches through their word vectors, weighting each word count by the rarity of the word (the TF-IDF weighting) so that very common words do not dominate the comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['clean text'])
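We can confirm the shape of the result: one row per speech, one column per word in the shared vocabulary.

print(tfidf.shape)  # (number of speeches, vocabulary size)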
This gives us a 226 by V matrix of TF-IDF weights, where V is the number of distinct words across all speeches. Then we use SVD to decompose this matrix and plot the first two columns of the left singular matrix (these are similar in nature to the first two principal components).
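In symbols, the rank-2 truncated SVD factors the 226 x V TF-IDF matrix X as X ≈ U_2 Σ_2 V_2^T, where U_2 is 226 x 2; each row of U_2 supplies a speech's plotting coordinates (x, y).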
np.random.seed(42)
from scipy.sparse.linalg import svds

# Rank-2 truncated SVD of the TF-IDF matrix; each speech gets a 2-d coordinate.
(u, s, vt) = svds(tfidf, k=2)
df['Year'] = df['Date'].str[-4:].astype('int')
df['x'] = u[:, 0]
df['y'] = u[:, 1]
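One caveat: scipy's svds returns the singular values in ascending order (the opposite of np.linalg.svd), so u[:, 0] pairs with the smaller of the two retained singular values; for this two-dimensional scatter that only affects which direction ends up on which axis.

print(s)  # singular values from svds, in ascending order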
sns.lmplot(x='x', y='y', data=df, hue='Year', legend=False,
           fit_reg=False, palette="Blues", aspect=1.6)

# Highlight the most recent speech (Donald J. Trump) with an orange star.
(trump_x, trump_y) = df[df['Name'] == "Donald J. Trump"][['x', 'y']].values[0]
plt.plot(trump_x, trump_y, '*', markersize=20, color="orange")
plt.savefig("SOTUspeeches.pdf")
!pip install plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
# CSS rgba channels run 0-255, so scale seaborn's 0-1 palette values.
colors = np.array(["rgba({:.0f},{:.0f},{:.0f},1)".format(*(255 * np.array(c)))
                   for c in sns.color_palette("Blues", len(df))])
colors[-1] = "rgba(252,128,51,1)"  # highlight the most recent speech in orange
py.iplot([go.Scatter(x = df['x'], y = df['y'], mode='markers', marker=dict(color=colors), text=df['Name'])])
Each point represents a speech. Notice that the speeches roughly align chronologically: older speeches are more similar to one another than to recent ones. The most unusual speech is by Herbert Hoover, and George Bush also has a few unusual speeches. Trump's speech is close to earlier speeches by Ronald Reagan, George Bush, and Bill Clinton.