import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
Recall that the speeches are delimited by ***, so we can separate them by splitting the text on three asterisks.
records = text.split("***")
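Note that everything before the first *** is front matter rather than a speech, which is why we skip records[0] when building the table below. A quick check:

print("Chunks:", len(records))   # should be the number of speeches plus one
print(records[0][:60])           # peek at the leading front-matter chunk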
After the ***, a speech is formatted as follows:
State of the Union Address
George Washington
January 8, 1790
Fellow-Citizens of the Senate and House of Representatives:
I embrace with great satisfaction the opportunity which now presents itself...
If we split each record on newlines (\n), then we can extract the name and date easily. The remaining lines can be joined back into one string containing the text of the speech.
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
import re
print("Speeches", len(re.findall(r"Fellow-Citizens[\s\S]*?\*\*\*", text)))
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
import re
print("Number of ***", len(re.findall(r"\*\*\*", text)))
with open("data/stateoftheunion1790-2017.txt", "r") as f:
text = f.read()
records = text.split("***")
def extract_parts(record):
    # After the "***", each record contains the title line ("State of the
    # Union Address"), the speaker's name, and the date, followed by the
    # text of the speech.
    parts = record.split("\n")
    name = parts[3].strip()
    date = parts[4].strip()
    text = "\n".join(parts[5:]).strip()
    return [name, date, text]

df = pd.DataFrame([extract_parts(r) for r in records[1:]],
                  columns=["Name", "Date", "Text"])
print("Length:", len(df))
df.head()
We perform a few simple text-cleaning tasks: we replace newlines with spaces, convert characters to lower case, and replace anything that is not a letter or whitespace (punctuation, digits, etc.) with a space.
df['clean text'] = (
    df['Text']
    .str.replace("\n", " ")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)  # keep only letters and whitespace
)
df.head()
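As a quick sanity check of the cleaning, we can compare a snippet of the raw and cleaned text for the first speech:

print(df.loc[0, 'Text'][:80])        # raw: mixed case, punctuation, newlines
print(df.loc[0, 'clean text'][:80])  # cleaned: lower-case letters and spaces only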
We can then use sklearn to create a word vector for each speech, containing the counts of all words in that speech. By all words, we mean the set of unique words used across all 226 speeches. We can think of each word vector as a record, so we have 226 records and thousands of variables (word counts).
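To make "word vector" concrete, here is a minimal sketch on a two-document toy corpus (our own illustration; the actual pipeline below uses TF-IDF weights rather than raw counts):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the senate and the house", "the house of representatives"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)        # sparse matrix: one row per document
print(cv.get_feature_names_out())     # shared vocabulary across both documents
print(counts.toarray())               # raw word counts, one row per document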
We can try to examine the relationships between speeches by reducing the dimensionality of the data. We take an approach that is akin to Principal Component Analysis for word vectors. Specifically, we compare speeches through their word vectors, weighting each word count by the rarity of the word (the TF-IDF weighting) so that very common words do not dominate the comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
tfidf = vec.fit_transform(df['clean text'])
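We can confirm the shape of the result: one row per speech, one column per word in the shared vocabulary.

print(tfidf.shape)  # (number of speeches, vocabulary size)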
This gives us a 226 by V matrix of TF-IDF weights, where V is the number of distinct words across all speeches. Then we use SVD to decompose this matrix and plot the first two columns of the left singular matrix (these are similar in nature to the first two principal components).
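In symbols, the rank-2 truncated SVD factors the 226 x V TF-IDF matrix X as X ≈ U_2 Σ_2 V_2^T, where U_2 is 226 x 2; each row of U_2 supplies a speech's plotting coordinates (x, y).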
np.random.seed(42)
from scipy.sparse.linalg import svds

# Rank-2 truncated SVD of the TF-IDF matrix; each speech gets a 2-d coordinate.
(u, s, vt) = svds(tfidf, k=2)
df['Year'] = df['Date'].str[-4:].astype('int')
df['x'] = u[:, 0]
df['y'] = u[:, 1]
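One caveat: scipy's svds returns the singular values in ascending order (the opposite of np.linalg.svd), so u[:, 0] pairs with the smaller of the two retained singular values; for this two-dimensional scatter that only affects which direction ends up on which axis.

print(s)  # singular values from svds, in ascending order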
sns.lmplot(x='x', y='y', data=df, hue='Year', legend=False,
           fit_reg=False, palette="Blues", aspect=1.6)

# Highlight the most recent speech (Donald J. Trump) with an orange star.
(trump_x, trump_y) = df[df['Name'] == "Donald J. Trump"][['x', 'y']].values[0]
plt.plot(trump_x, trump_y, '*', markersize=20, color="orange")
plt.savefig("SOTUspeeches.pdf")
!pip install plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
# CSS rgba channels run 0-255, so scale seaborn's 0-1 palette values.
colors = np.array(["rgba({:.0f},{:.0f},{:.0f},1)".format(*(255 * np.array(c)))
                   for c in sns.color_palette("Blues", len(df))])
colors[-1] = "rgba(252,128,51,1)"  # highlight the most recent speech in orange
py.iplot([go.Scatter(x = df['x'], y = df['y'], mode='markers', marker=dict(color=colors), text=df['Name'])])
Each point represents a speech. Notice that the speeches roughly align chronologically: older speeches are more similar to one another than to recent ones. The most unusual speech is by Herbert Hoover, and George Bush also has a few unusual speeches. Trump's speech is close to earlier speeches by Ronald Reagan, George Bush, and Bill Clinton.