1/85
Introduction to Text Mining
Alliance Summer School 2019
Elliott Ash
2/85
Social Science meets Data Science
◮ We are seeing a revolution in social science:
◮ new datasets: administrative data, digitization of text archives, social media ◮ new methods: natural language processing, machine learning
◮ In particular:
◮ many important human behaviors consist of text – millions and millions of lines of it. ◮ we cannot read these texts – somehow we must teach machines to read them for us.
4/85
Readings
◮ Google Developers Guide to Text Classification:
◮ https://developers.google.com/machine-learning/ guides/text-classification/
◮ “Analyzing polarization in social media: Method and application to tweets on 21 mass shootings” (2019).
◮ Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky
◮ Natural Language Processing in Python ◮ Hands-on Machine Learning with Scikit-learn & TensorFlow 2.0
5/85
Programming
◮ Python is ideal for text data and machine learning.
◮ I recommend Anaconda 3.6: continuum.io/downloads
◮ For relatively small corpora, R is also fine:
◮ see the quanteda package.
6/85
Text as Data
◮ Text data consists of sequences of characters, called documents. ◮ The set of documents is the corpus. ◮ Text data is unstructured:
◮ the information we want is mixed together with (lots of) information we don’t. ◮ How to separate the two?
8/85
Dictionary Methods
◮ Dictionary methods use a pre-selected list of words or phrases to analyze a corpus. ◮ Corpus-specific
◮ count words related to your analysis
◮ General
◮ e.g. LIWC (liwc.wpengine.com) has lists of words across categories. ◮ Sentiment Analysis: count sets of positive and negative words (doesn’t work very well)
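A minimal sketch of a dictionary count in Python; the positive/negative word lists below are illustrative stand-ins, not the LIWC dictionaries.

```python
# Minimal dictionary-method sketch: count positive/negative words per document.
# The word lists are illustrative placeholders, not a published lexicon.
import re

positive = {"good", "gain", "improve", "success", "benefit"}
negative = {"bad", "loss", "fail", "crisis", "risk"}

def sentiment_counts(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos, neg, len(tokens)

pos, neg, n = sentiment_counts("The crisis caused a loss, but markets may improve.")
print((pos - neg) / n)  # crude sentiment score, normalized by document length
```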
9/85
Measuring uncertainty in macroeconomy
Baker, Bloom, and Davis
◮ Baker, Bloom, and Davis measure economic policy uncertainty using Boolean search of newspaper articles. (See http://www.policyuncertainty.com/). ◮ For each paper on each day since 1985, submit the following query:
◮ 1. Article contains “uncertain” OR “uncertainty”, AND ◮ 2. Article contains “economic” OR “economy”, AND ◮ 3. Article contains “congress” OR “deficit” OR “federal reserve” OR “legislation” OR “regulation” OR “white house”
◮ Normalize resulting article counts by total newspaper articles that month.
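A sketch of the Boolean screen in Python; the example articles are placeholders, and in practice the query runs over a full newspaper archive.

```python
# Sketch of the Baker-Bloom-Davis Boolean screen applied to raw article text.
# `articles` is a toy placeholder for a newspaper archive.
UNCERTAINTY = ("uncertain", "uncertainty")
ECONOMY = ("economic", "economy")
POLICY = ("congress", "deficit", "federal reserve", "legislation",
          "regulation", "white house")

def is_epu_article(text):
    t = text.lower()
    return (any(w in t for w in UNCERTAINTY)
            and any(w in t for w in ECONOMY)
            and any(w in t for w in POLICY))

articles = ["Uncertainty about the deficit weighed on the economy...",
            "Local team wins championship."]
epu_count = sum(is_epu_article(a) for a in articles)
print(epu_count / len(articles))  # normalize by total articles in the period
```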
10/85
Measuring uncertainty in macroeconomy
Baker, Bloom, and Davis
12/85
Goals of Featurization
◮ The goal: produce features that are
◮ predictive in the learning task ◮ interpretable by human investigators ◮ tractable enough to be easy to work with
13/85
Pre-processing
◮ Standard pre-processing steps:
◮ drop capitalization, punctuation, numbers, stopwords (e.g. “the”, “such”) ◮ stem words to remove endings (e.g., “taxes” and “taxed” become “tax”); a short sketch follows below.
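One possible pre-processing pipeline, sketched with NLTK; the stopword list and stemmer are common choices but are assumptions, not the only option.

```python
# Pre-processing sketch: lowercase, strip punctuation/numbers, drop stopwords, stem.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
stops = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())      # drop caps, punctuation, numbers
    tokens = [t for t in tokens if t not in stops]   # drop stopwords
    return [stemmer.stem(t) for t in tokens]         # "taxes", "taxed" -> "tax"

print(preprocess("Such taxes were taxed twice in 1999."))
# -> ['tax', 'tax', 'twice']
```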
14/85
Parts of speech
◮ Parts of speech (POS) tags provide useful word categories corresponding to their functions in sentences:
◮ Content: noun (NN), verb (VB), adjective (JJ), adverb (RB) ◮ Function: determiner (DT), preposition (IN), conjunction (CC), pronoun (PR).
◮ Parts of speech vary in their informativeness for various functions:
◮ For categorizing topics, nouns are usually most important ◮ For sentiment, adjectives are usually most important.
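A POS-tagging sketch with spaCy; the model choice (en_core_web_sm, installed via `python -m spacy download en_core_web_sm`) is an assumption.

```python
# POS-tagging sketch: tag a sentence and keep content words (nouns, verbs, adjectives).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee raised interest rates amid strong economic growth.")
for token in doc:
    print(token.text, token.tag_)               # Penn Treebank tags: NN, VB, JJ, ...

content = [t.text for t in doc if t.tag_[:2] in ("NN", "VB", "JJ")]
print(content)                                  # content words only
```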
15/85
N-grams
◮ N-grams are phrases, sequences of words up to length N.
◮ bigrams, trigrams, quadgrams, etc.
◮ capture information and familiarity from local word order.
◮ e.g. “estate tax” vs “death tax”
16/85
Filtering the Vocabulary
◮ N-grams will blow up your feature space: filtering out uninformative n-grams is necessary.
◮ Google Developers recommend vocab size m = 20,000; I have gotten good performance from m = 2,000.
1. Drop phrases that appear in few documents, or in almost all documents, using tf-idf weights (a small numeric sketch follows this list):
   tf-idf(w) = (1 + log c_w) × log(N / d_w)
   ◮ c_w = count of phrase w in the corpus, N = number of documents, d_w = number of documents where w appears.
2. Filter on parts of speech (keep nouns, adjectives, and verbs).
3. Filter on pointwise mutual information to get collocations (Ash JITE 2017, pg. 2).
4. Supervised feature selection: select phrases that are predictive of the outcome.
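A small numeric sketch of the corpus-level tf-idf weight from item 1, computed by hand on toy documents; the tiny corpus and vocabulary size m are illustrative.

```python
# Score each phrase with tfidf(w) = (1 + log c_w) * log(N / d_w), keep the top m.
import numpy as np
from collections import Counter

docs = [["estate", "tax", "repeal"], ["death", "tax", "repeal"], ["school", "funding"]]
N = len(docs)
corpus_counts = Counter(w for d in docs for w in d)     # c_w: corpus count of w
doc_freq = Counter(w for d in docs for w in set(d))     # d_w: documents containing w

def tfidf(w):
    return (1 + np.log(corpus_counts[w])) * np.log(N / doc_freq[w])

m = 4  # toy vocabulary size; Google suggests ~20,000, smaller often works
vocab = sorted(corpus_counts, key=tfidf, reverse=True)[:m]
print(vocab)
```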
17/85
A decent baseline for featurization
◮ Tag parts of speech: keep nouns, verbs, and adjectives. ◮ Drop stopwords, capitalization, punctuation. ◮ Run snowball stemmer to drop word endings. ◮ Make bigrams from the tokens. ◮ Take top 10,000 bigrams based on tf-idf weight. ◮ Represent documents as tf-idf frequencies over these bigrams.
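A sketch of the baseline featurization with scikit-learn; POS filtering is omitted for brevity (see the tagging sketch earlier), and note that sklearn's max_features keeps the most frequent bigrams, a common proxy for the tf-idf filter described above.

```python
# Baseline featurization sketch: stem tokens, form bigrams, keep top bigrams,
# represent documents as tf-idf vectors. Library and parameter choices are assumptions.
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def stem_tokens(doc):
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", doc.lower())]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokens,
    ngram_range=(2, 2),    # bigrams built from the stemmed tokens
    max_features=10_000,   # keep the most frequent bigrams
)

docs = ["We will repeal the estate tax.", "The death tax hurts family farms."]
X = vectorizer.fit_transform(docs)    # sparse document-by-bigram tf-idf matrix
print(X.shape, vectorizer.get_feature_names_out()[:5])
```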
19/85
Cosine Similarity
cos_sim(v_1, v_2) = (v_1 · v_2) / (‖v_1‖ ‖v_2‖), where v_1 and v_2 are vectors representing documents (e.g., IDF-weighted frequencies). ◮ each document is a non-negative vector in an m-space (m = size of dictionary):
◮ closer vectors form smaller angles: cos(0) = +1 means identical documents. ◮ furthest vectors are orthogonal: cos(π/2) = 0 means no words in common.
◮ For n documents, this gives n ×(n −1) similarities.
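A short sketch of pairwise cosine similarities over tf-idf document vectors; the three toy documents are placeholders.

```python
# Pairwise cosine similarities between tf-idf document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the estate tax was repealed",
        "congress repealed the death tax",
        "the committee discussed school funding"]
X = TfidfVectorizer().fit_transform(docs)
S = cosine_similarity(X)     # n x n matrix; entries lie in [0, 1] for tf-idf vectors
print(S.round(2))
```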
20/85
Text analysis of patent innovation
Kelly, Papanikolau, Seru, and Taddy (2018)
“Measuring technological innovation over the very long run” ◮ Data:
◮ 9 million patents since 1840, from U.S. Patent Office and Google Scholar Patents. ◮ date, inventor, backward citations ◮ text (abstract, claims, and description)
◮ Text pre-processing:
◮ drop HTML markup, punctuation, numbers, capitalization, and stopwords. ◮ remove terms that appear in fewer than 20 patents. ◮ 1.6 million words in the vocabulary.
21/85
Measuring Innovation
Kelly, Papanikolau, Seru, and Taddy (2018)
◮ Backward IDF weighting of word w in patent i: BIDF(w, i) = log[ (# of patents prior to i) / (1 + # of patents prior to i that include w) ]
◮ down-weights words that appeared frequently before a patent.
◮ For each patent i:
◮ compute cosine similarity ρij to all future patents j, using BIDF of i.
◮ 9m×9m similarity matrix = 30TB of data.
◮ enforce sparsity by setting similarity < .05 to zero (93.4% of pairs).
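A toy numpy illustration of the backward-IDF weight from the previous slide; the counts and date ordering are made up, and the real pipeline runs on sparse, blocked matrices at much larger scale.

```python
# Toy illustration of BIDF(w, i) = log(#prior patents / (1 + #prior patents containing w)).
import numpy as np

# rows = patents in date order, columns = vocabulary terms (toy counts)
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [0, 3, 0],
                   [1, 1, 2]])

i = 3                                   # compute weights as of the latest patent
prior = counts[:i]
d_w = (prior > 0).sum(axis=0)           # prior patents containing each term
bidf = np.log(len(prior) / (1 + d_w))   # terms common before patent i get low weight
print(d_w, bidf.round(2))
# Cosine similarity rho_ij to each future patent j is then computed on
# BIDF-weighted term vectors, and pairs with rho < .05 are set to zero.
```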
22/85
Novelty, Impact, and Quality
Kelly, Papanikolau, Seru, and Taddy (2018)
◮ “Novelty” is defined by dissimilarity (negative similarity) to previous patents: Novelty_j = − Σ_{i∈B(j)} ρ_ij, where B(j) is the set of previous patents (in, e.g., the last 20 years).
◮ “Impact” is defined as similarity to subsequent patents: Impact_i = Σ_{j∈F(i)} ρ_ij, where F(i) is the set of future patents (in, e.g., the next 100 years).
◮ A patent has high quality if it is novel and impactful: log Quality_k = log Impact_k + log Novelty_k
23/85
Validation
Kelly, Papanikolau, Seru, and Taddy (2018)
◮ For pairs with higher ρij, patent j is more likely to cite patent i. ◮ Within technology class (assigned by the patent office), similarity is higher than across classes. ◮ Higher quality patents get more cites:
24/85
Most Innovative Firms
Kelly, Papanikolau, Seru, and Taddy (2018)
25/85
Breakthrough patents: citations vs quality
Kelly, Papanikolau, Seru, and Taddy (2018)
26/85
Breakthrough patents and firm profits
Kelly, Papanikolau, Seru, and Taddy (2018)
28/85
Topic Models in Social Science
◮ Topic models developed in computer science and statistics:
◮ summarize unstructured text using words within document ◮ useful for dimension reduction
◮ Social scientists use topics as a form of measurement
◮ how observed covariates drive trends in language ◮ tell a story not just about what, but how and why ◮ topic models are more interpretable than other methods, e.g. principal components analysis.
29/85
Latent Dirichlet Allocation (LDA)
◮ Idea: documents exhibit each topic in some proportion.
◮ Each document is a distribution over topics. ◮ Each topic is a distribution over words.
◮ Latent Dirichlet Allocation (e.g. Blei 2012) is the most popular topic model in this vein because it is easy to use and (usually) provides great results.
◮ Maintained assumptions: bag of words/phrases, number of topics fixed ex ante.
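A minimal LDA sketch with scikit-learn; the corpus and number of topics are toy choices (gensim's LdaModel is a common alternative).

```python
# Minimal LDA sketch: bag-of-words counts -> document-topic and topic-word distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["interest rates inflation policy",
        "inflation expectations and rates",
        "school funding education reform",
        "education policy and school budgets"]
X = CountVectorizer().fit_transform(docs)          # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)        # document-topic proportions (each row sums to 1)
print(theta.round(2))
print(lda.components_.shape)    # topic-word weights: (K, vocabulary size)
```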
30/85
A statistical highlighter
31/85
Topic modeling Federal Reserve Bank transcripts
Hansen, McMahon, and Prat (QJE 2017)
◮ Use LDA to analyze speech at the FOMC (Federal Open Market Committee).
◮ private discussions among committee members at the Federal Reserve (the U.S. central bank) ◮ transcripts: 150 meetings, 20 years, 26,000 speeches, 24,000 unique words.
◮ Pre-processing:
◮ drop stopwords, stem words, etc. ◮ drop words with low tf-idf weight
32/85
LDA Training
Hansen, McMahon, and Prat (QJE 2017)
◮ K = 40 topics selected for interpretability / topic coherence.
◮ the “statistically optimal” choice was K = 70, but those topics were less interpretable.
◮ hyperparameters α = 50/K and η = .025 to promote sparse word distributions (and more interpretable topics).
33/85
34/85
Pro-Cyclical Topics
Hansen, McMahon, and Prat (QJE 2017)
35/85
Counter-Cyclical Topics
Hansen, McMahon, and Prat (QJE 2017)
36/85
FOMC Topics and Policy Uncertainty
Hansen, McMahon, and Prat (QJE 2017)
37/85
Effect of Transparency
Hansen, McMahon, and Prat (QJE 2017)
◮ In 1993, there was an unexpected transparency shock where transcripts became public. ◮ Increasing transparency results in:
◮ higher discipline / technocratic language (probably beneficial) ◮ higher conformity (probably costly)
◮ Highlights tradeoffs from transparency in bureaucratic organizations.
38/85
Structural Topic Model = LDA + Metadata
◮ STM provides two ways to include contextual information:
◮ Topic prevalence can vary by metadata
◮ e.g. Republicans talk about military issues more than Democrats
◮ Topic content can vary by metadata
◮ e.g. Republicans talk about military issues differently from Democrats.
◮ The stm package in R provides easy syntax and a complete workflow, from raw texts to figures.
40/85
What is machine learning?
◮ In classical computer programming, humans input the rules and the data, and the computer provides answers. ◮ In machine learning, humans input the data and the answers, and the computer learns the rules.
41/85
A baseline for machine learning using text
1. Take tf-idf-weighted, POS-filtered bigrams (from above) as inputs X.
2. Train a machine learning model to predict outcome y (see the sketch after this list):
   2.1 For classification, use regularized logistic regression (or a gradient boosted classifier).
   2.2 For regression, use elastic net (or a gradient boosted regressor).
3. Use cross-validation grid search in the training set to select model hyperparameters.
4. Evaluate the model in a held-out test set:
   4.1 For classification, use the F1 score and confusion matrix.
   4.2 For regression, use R squared and a calibration plot.
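A sketch of the classification branch of this baseline with scikit-learn; the texts, labels, and hyperparameter grid are toy placeholders.

```python
# tf-idf features -> regularized logistic regression, hyperparameters chosen by
# cross-validated grid search, evaluation on a held-out test set.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, confusion_matrix

texts = ["cut the estate tax", "fund public schools", "repeal the death tax",
         "invest in education", "lower taxes now", "more teachers and schools"]
y = ["tax", "education", "tax", "education", "tax", "education"]

X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.33,
                                                    random_state=0, stratify=y)
pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=2, scoring="f1_macro")
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print(f1_score(y_test, pred, average="macro"))
print(confusion_matrix(y_test, pred))
```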
42/85
Predicting Policy Topics from Text
Ash, Morelli, and Osnabruegge (2018)
◮ Comparative Manifesto Project:
◮ 44,020 annotated English-language political statements ◮ hundreds of political party platforms from English-speaking countries.
◮ Each statement gets a CMP code, e.g. “decentralization”, “education”
◮ We want to classify text to one of 19 topics.
43/85
Regularized Logistic Regression
Ash, Morelli, and Osnabruegge (2018)
◮ N rows, M text features, K policy topics
◮ Probability model for policy topic Y_i:
  P(Y_i = c) = exp(β_c X_i) / Σ_{k=1}^K exp(β_k X_i),
  where c ∈ 1, ..., K are the topic labels, X_i is the vector of tf-idf frequencies for statement i, and β is an M × K matrix of parameters (with columns β_1, ..., β_K).
◮ Cost function:
  J(β) = −(1/N) Σ_{i=1}^N Σ_{k=1}^K 1{Y_i = k} log[ exp(β_k X_i) / Σ_{l=1}^K exp(β_l X_i) ] + γ Σ_{j=1}^M Σ_{k=1}^K β_{jk}²
◮ γ = strength of L2 penalty
◮ γ∗ = 1/2 selected by 3-fold cross-validation grid search.
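A direct numpy transcription of the cost function above, on toy dimensions; the random data and parameter values are purely illustrative.

```python
# Numpy sketch of the L2-penalized multinomial logistic cost J(beta).
import numpy as np

N, M, K = 5, 4, 3
rng = np.random.default_rng(0)
X = rng.random((N, M))                      # tf-idf feature matrix (N x M)
Y = rng.integers(0, K, size=N)              # topic labels in {0, ..., K-1}
beta = rng.normal(size=(M, K))              # parameter matrix (M x K)
gamma = 0.5                                 # strength of the L2 penalty

scores = X @ beta                                   # N x K
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
log_lik = np.log(probs[np.arange(N), Y])            # log P(Y_i | X_i)
J = -log_lik.mean() + gamma * (beta ** 2).sum()     # negative log-likelihood + L2
print(J)
```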
44/85
Agriculture/Education Topics: Predictive Features
Ash, Morelli, and Osnabruegge (2018)
45/85
Model Accuracy
Ash, Morelli, and Osnabruegge (2018)
◮ Given a chunk of text, the logistic model computes a probability distribution over policy topics.
◮ harnesses expert knowledge about political topics from Manifesto Project
◮ Validation of accuracy: predict the CMP code in a held-out sample of manifesto corpus statements
◮ Out-of-sample accuracy = 53%
◮ quite good given there are 19 policy areas – choosing randomly would be correct 5% of the time; choosing top category (other topic) would be correct 15% of the time.
46/85
Confusion Matrix
Ash, Morelli, and Osnabruegge (2018)
[Confusion matrix: rows are true CMP topic labels (Administration, Agriculture, Culture, Economics, Education, Freedom, Internationalism, Law and Order, National Way of Life, Other, Party Politics, Quality of Life, Target Groups, Technology, Welfare), columns are predicted labels; 13,014 statements in total.]
47/85
Experiment: Electoral Reform in New Zealand
Ash, Morelli, and Osnabruegge (2018)
◮ A 1993 reform in New Zealand moved from majoritarian to
proportional representation: ◮ Majoritarian (first past the post): two parties, single party controls parliament. ◮ Proportional representation: many minority parties, coalition governments.
◮ How did it affect speech topics in the New Zealand Parliament?
48/85
Change in Parliament Attention due to Reform
Ash, Morelli, and Osnabruegge (2018)
[Figure: estimated change in parliamentary attention by topic (education, no topic, administration, internationalism, infrastructure, culture, target groups, agriculture, national way of life, law and order, quality of life, freedom, economics, party politics), plotted on a horizontal scale from roughly −2 to 3.]
49/85
Example “Party Politics” Speech
Ash, Morelli, and Osnabruegge (2018)
“I have seen seven Opposition leaders in my time, but I have never seen a leader as relentlessly negative as Helen Clark. . . . How could anybody be so negative, day in, day out? It could get into the Guinness Book of Records. She does not have a positive word to say about anything. It is all negative, negative, negative.” ◮ Parliamentarian Richard Prebble, 15 Feb 1999
51/85
Word Embedding: Language as Data
Word Embedding is a technology from computational linguistics that represents words and phrases as vectors in a geometric space, where locations and directions encode meaning.
52/85
Why word vectors?
◮ Once words are represented as vectors, we can use linear algebra to understand the relationships between words:
◮ Words that are geometrically close to each other are similar: e.g. “student” and “pupil.” ◮ More intriguingly, word2vec algebra can depict conceptual, analogical relationships between words.
◮ Consider the analogy: man is to king as woman is to ____ ◮ With word2vec, we have vec(king)−vec(man)+vec(woman) ≈ vec(queen)
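A sketch of the analogy with pretrained vectors via gensim; the model name downloads a small GloVe model on first use, and any pretrained word2vec/GloVe vectors work similarly.

```python
# Analogy and similarity queries on pretrained word vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # downloads on first use
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# the top neighbour is typically "queen"
print(vectors.similarity("student", "pupil"))
```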
53/85
How does it work?
◮ How does word2vec learn the meaning of the word “fox”? “The quick brown fox jumps over the lazy dog.” ◮ It reads every example of the word “fox” and tries to predict which other words appear in the context window.
◮ the prediction weights on these other words (after dimension reduction) are the word vectors.
54/85
Most similar words to dog, depending on window size
◮ Small windows pick up substitutable words; large windows pick up topics.
55/85
Measuring Emotionality in Politician Speeches
Gennaro and Ash (2019)
◮ Dictionary: a new domain-appropriate list of words for:
◮ Cognition Processing (“thinking”): insight, causation, discrepancy, tentativeness, certainty, inhibition, inclusion, and exclusion ◮ Emotion Processing (“feeling”): positive and negative emotions, pleasure, pain, happiness, anxiety, anger, and sadness.
◮ Use word embeddings to get the “direction” in language space corresponding to emotionality (most emotional, least cognitive).
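A hedged sketch of one simple variant of this idea: average vectors of emotion vs. cognition seed words, then score a sentence by which centroid it is closer to. The seed lists and pretrained model are illustrative, not the paper's exact dictionaries or estimator.

```python
# Emotionality "direction" sketch: difference of cosine similarities to an
# emotion centroid and a cognition centroid (simplified variant).
import numpy as np
import gensim.downloader as api

vec = api.load("glove-wiki-gigaword-50")
emotion_seeds = ["joy", "anger", "fear", "sadness", "love"]          # illustrative
cognition_seeds = ["think", "because", "therefore", "evidence", "consider"]

def centroid(words):
    return np.mean([vec[w] for w in words if w in vec], axis=0)

emo_c, cog_c = centroid(emotion_seeds), centroid(cognition_seeds)

def emotionality(sentence):
    words = [w for w in sentence.lower().split() if w in vec]
    v = np.mean([vec[w] for w in words], axis=0)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos(v, emo_c) - cos(v, cog_c)     # higher = more emotional language

print(emotionality("with joy in his heart he sang"))
print(emotionality("the evidence suggests the amendment is correct"))
```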
56/85
Cognition Language
◮ Top cognition sentences:
◮ “In my judgment, neither is true in the case of this amendment.”
◮ “Is that correct?”
◮ “R. 15 contains a provision that is similar but, in fact, broader in scope.”

Emotion Language
◮ Top emotion sentences:
◮ “There is nothing to trouble any heart, nothing to hurt at all; death is only a quiet door, in an old garden wall.”
◮ “With joy in his heart and a smile on his face he graced practically every social occasion with a song.”
◮ “We Democrats may disagree, but we love our fellow men and we never hate them.”
57/85
Emotion language in Congress, 1914-2010
58/85
Analyzing Gender Bias with Word Embeddings
Garg, Schiebinger, Jurafsky, and Zou (2018)
Women’s occupation relative percentage vs. embedding bias in Google News vectors.
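A rough sketch in the spirit of this embedding-bias measure (not the paper's exact estimator): compare each occupation word's distance to an average "women" vector versus an average "men" vector. The word lists and pretrained model are illustrative.

```python
# Embedding-bias sketch: distance of occupation words to gendered centroid vectors.
import numpy as np
import gensim.downloader as api

vec = api.load("glove-wiki-gigaword-50")
women = ["she", "her", "woman", "female"]                     # illustrative lists
men = ["he", "his", "man", "male"]
occupations = ["nurse", "engineer", "teacher", "carpenter", "librarian"]

avg_w = np.mean([vec[w] for w in women], axis=0)
avg_m = np.mean([vec[w] for w in men], axis=0)

for occ in occupations:
    bias = np.linalg.norm(vec[occ] - avg_w) - np.linalg.norm(vec[occ] - avg_m)
    print(occ, round(float(bias), 3))   # negative = closer to the "women" vector
```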
59/85
Ethnic groups ↔ Occupations
Garg, Schiebinger, Jurafsky, and Zou (2018)
The top 10 occupations most closely associated with each ethnic group in the Google News embedding.
61/85
Dependency Structure
◮ Dependency structures represent grammatical relations between words in a sentence:
◮ head-dependent relations (directed arcs) ◮ functional categories (arc labels)
62/85
Extracting Information from Legal Texts
◮ Syntactic dependency parsers allow computers to read texts and parse subjects, actions, and other useful information.
◮ In particular, modal verbs shall, must, will, may, and can encode obligations and entitlements in legal language.
63/85
Ash, MacLeod, and Naidu (2019)
◮ Data:
◮ new corpus of 30,000 collective bargaining agreements from Canada, covering 1986 through 2015
◮ Agent (subject) categories:
◮ worker, union, company, and manager.
◮ Encode contract statements as (subject, modal, action).
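A sketch of extracting (subject, modal, verb) tuples with spaCy's dependency parser; the matching rules are a simplification of the full pipeline, and the small English model is an assumption (install with `python -m spacy download en_core_web_sm`).

```python
# Extract (subject, modal, verb) tuples from contract-style sentences.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The company shall pay overtime. The employee may request leave.")

for token in doc:
    if token.tag_ == "MD":                       # modal verb: shall, may, must, ...
        verb = token.head                        # the verb the modal attaches to
        subjects = [c for c in verb.children if c.dep_ in ("nsubj", "nsubjpass")]
        for subj in subjects:
            print((subj.text.lower(), token.text.lower(), verb.lemma_))
# typically prints ('company', 'shall', 'pay') and ('employee', 'may', 'request')
```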
64/85
Most Frequent Subject-Modal-Verb Tuples
Ash, MacLeod, and Naidu (2019)
Subject_Modal_Verb tuples (alphabetical): agreement_shall_be, arbitrator_shall_have, board_shall_have, case_may_be, committee_shall_meet, company_shall_pay, company_shall_provide, company_will_pay, company_will_provide, decision_shall_be, employee_may_request, employee_shall_be, employee_shall_be_allowed, employee_shall_be_considered, employee_shall_be_entitled, employee_shall_be_given, employee_shall_be_granted, employee_shall_be_laid_off, employee_shall_be_paid, employee_shall_be_required, employee_shall_continue, employee_shall_lose, employee_shall_receive, employee_shall_retain, employee_will_be, employee_will_be_allowed, employee_will_be_entitled, employee_will_be_given, employee_will_be_granted, employee_will_be_paid, employee_will_be_required, employee_will_have, employer_shall_grant
65/85
Determinants of Relative Worker Control
Ash, MacLeod, and Naidu (2019)
◮ All specifications use within-firm, within-industry-year variation:
◮ Personal Income Tax (Non-Wage Compensation) ↑ ◮ Unemployment Rate (Bargaining Power) ↓ ◮ Number of Employers (Labor Market Competition) ↑
67/85
Analyzing polarization in social media: Method and application to tweets on 21 mass shootings
Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky
◮ Research Object:
◮ use NLP to understand four “dimensions” of social media polarization: topic choice, framing, affect, modality.
◮ Context:
◮ tweets in response to mass shooting events.
◮ Research question:
◮ does political partisanship manifest in polarized responses to violent/polarizing events?
68/85
Dataset
◮ 21 mass shooting events, 2015-2018, from Gun Violence Archive ◮ tweets about those events, identified by:
◮ location keywords (e.g. chattanooga, roseburg, san bernardino, fresno, etc.) ◮ event keywords (lemmas): shoot, gun, kill, attack, massacre, victim ◮ filter out retweets and tweets from deactivated accounts ◮ N = 10,000 (out of 4.4 million tweets from the firehose archive).
69/85
Identifying party affiliation of Twitter users
◮ Party affiliation is identified by whether a user follows more Democratic or Republican accounts, from a list of Twitter accounts associated with legislators, presidential candidates, and party organizations (Volkova et al 2014).
◮ at least 51% of tweets for each event can be assigned partisanship this way.
◮ For geolocated users this matches up pretty well with party vote shares by state (R2 = .82):
70/85
Partisanship
◮ Leave-one-out estimator from Gentzkow et al (2019), applied to each shooting event:
  π = (1/2) [ (1/|D|) Σ_{i∈D} q̂_i · ρ̂_{−i} + (1/|R|) Σ_{i∈R} q̂_i · (1 − ρ̂_{−i}) ]
◮ q̂_i = token frequencies for user i, drawn from the set of Democrats D and the set of Republicans R
◮ ρ̂_{−i} has elements ρ̂_{−i,w} = q̂^D_{−i,w} / (q̂^D_{−i,w} + q̂^R_{−i,w}), the empirical posterior probabilities computed from all other users.
◮ π is an estimate of the expected posterior probability that a Bayesian observer would correctly predict party after observing one randomly sampled token.
◮ consistency assumes tokens are drawn from a multinomial logit.
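A simplified numpy transcription of the estimator above, on toy token counts for two users per party; real applications run over full tweet-token matrices.

```python
# Leave-one-out partisanship estimator sketch (toy counts).
import numpy as np

C_D = np.array([[4, 1, 0], [3, 2, 1]], dtype=float)   # token counts, Democrat users
C_R = np.array([[0, 1, 5], [1, 0, 4]], dtype=float)   # token counts, Republican users

def user_freqs(C):
    return C / C.sum(axis=1, keepdims=True)            # q_i: each row sums to 1

def rho_loo(counts_D, counts_R, leave_party, leave_idx):
    D, R = counts_D.copy(), counts_R.copy()
    if leave_party == "D":
        D = np.delete(D, leave_idx, axis=0)             # leave user i out
    else:
        R = np.delete(R, leave_idx, axis=0)
    qD = D.sum(axis=0) / D.sum()                        # aggregate Democrat frequencies
    qR = R.sum(axis=0) / R.sum()
    return qD / (qD + qR)                               # posterior Pr(Democrat | token)

qD_users, qR_users = user_freqs(C_D), user_freqs(C_R)
pi_D = np.mean([qD_users[i] @ rho_loo(C_D, C_R, "D", i) for i in range(len(C_D))])
pi_R = np.mean([qR_users[i] @ (1 - rho_loo(C_D, C_R, "R", i)) for i in range(len(C_R))])
pi = 0.5 * (pi_D + pi_R)
print(round(pi, 3))     # 0.5 = no partisan signal; higher = more polarized
```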
71/85
Tweets about mass shootings are polarized
◮ comparable to π = .53 in Congressional speeches (GST 2019). ◮ The increase in polarization over time is not statistically significant.
72/85
Tweet Embeddings for Topic Assignment
1. Make a new vocabulary:
   1.1 Sample 10,000 tweets from each event.
   1.2 Keep stemmed words occurring at least ten times in at least three events (N = 2,000).
2. Train GloVe embeddings on random samples of tweets from each event.
3. Create Arora et al (2017) tweet embeddings (sketched below):
   3.1 For each tweet t, compute the average of its word vectors, v_t, weighting each word by its inverse frequency.
   3.2 Remove the first principal component of the matrix whose rows are the v_t.
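A sketch of step 3 in the style of Arora et al (2017): frequency-weighted averaging, then removal of the first principal component. The word vectors, frequencies, and smoothing constant are toy stand-ins.

```python
# Arora-style sentence embeddings: weighted average of word vectors, minus the
# projection onto the first principal direction.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
vocab = ["gun", "control", "prayers", "victims", "policy", "shooting"]
vectors = {w: rng.normal(size=50) for w in vocab}           # stand-in word vectors
freq = dict(zip(vocab, [.02, .01, .03, .02, .01, .05]))     # stand-in word frequencies
a = 1e-3                                                    # smoothing constant

def tweet_vector(tokens):
    vs = [a / (a + freq[w]) * vectors[w] for w in tokens if w in vectors]
    return np.mean(vs, axis=0)

tweets = [["gun", "control", "policy"], ["prayers", "victims"], ["shooting", "victims"]]
V = np.vstack([tweet_vector(t) for t in tweets])

svd = TruncatedSVD(n_components=1).fit(V)                   # first principal direction
u = svd.components_[0]
V_adj = V - np.outer(V @ u, u)                              # remove its projection
print(V_adj.shape)
```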
73/85
Topics = Embedding Clusters
1. Cluster the embeddings using k-means (see the sketch after this list).
   ◮ k-means clustering separates documents into k groups based on distance in embedding space.
   ◮ different from a topic model because a document belongs to a single topic, rather than a distribution across topics.
2. Drop hard-to-classify tweets:
   2.1 Compute the ratio of the distance to the closest topic to the distance to the second-closest topic.
   2.2 Drop tweets above the 75th percentile of this ratio.
◮ Validation using Amazon Mechanical Turk:
◮ Identify word intruder: five from one cluster, one from another cluster. ◮ Identify tweet intruder: three from one cluster, and one from another cluster.
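A sketch of steps 1 and 2 with scikit-learn; the embeddings here are random placeholders and k = 8 echoes the choice reported below.

```python
# k-means over tweet embeddings, then drop ambiguous tweets by distance ratio.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
V = rng.normal(size=(200, 50))                 # placeholder tweet embeddings

k = 8
km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(V)
dists = km.transform(V)                        # distance of each tweet to each centroid
sorted_d = np.sort(dists, axis=1)
ratio = sorted_d[:, 0] / sorted_d[:, 1]        # close to 1 = ambiguous assignment
keep = ratio <= np.quantile(ratio, 0.75)       # drop the hardest-to-classify quartile

labels = km.labels_[keep]
print(keep.sum(), "tweets kept;", np.bincount(labels, minlength=k))
```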
74/85
Topic Content
◮ The embedding method resulted in more coherent topics (better MTurk validation for words and tweets) than a topic model. k = 8 gave the best coherence.
◮ Appendix reports samples of tweets for each topic.
75/85
Between-topic vs within-topic polarization
◮ Within-topic polarization: compute π separately within each tweet cluster. ◮ Between-topic polarization: compute π using cluster counts rather than token counts.
76/85
Within-topic polarization
◮ Most polarized topics: shooter’s identity & ideology (.55), laws & policy (.54)
77/85
Partisanship of Topics, by Race of Shooter
78/85
Partisan Framing Devices: Words
◮ Partisanship of phrases from the GST model: ◮ The partisan valence of “terrorist” and “crazy” flips depending on the race of the shooter (these words have the largest racial difference in the joint vocabulary).
79/85
Partisan Framing Devices: Events
◮ Partisanship of keywords for previous events from GST model: ◮ Democrats invoke white shooters, Republicans invoke POC shooters.
80/85
Affect
◮ Starting point: Emotion lexicon from Mohammad and Turney (2013), available at saifmohammad.com.
◮ 14,182 words assigned to sentiment (positive/negative) and emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust).
◮ Domain propagation:
◮ pick 5-11 representative words per emotion category (Appendix E) ◮ for each word in the vocabulary, compute the average distance to the members of each category; take the 30 closest words as that category's lexicon (see the sketch below).
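A sketch of this domain-propagation step: expand a few seed words per affect category to their nearest neighbours in an embedding space. The seed lists and embedding model are illustrative choices, and taking neighbours of the seed-set mean approximates the average-distance rule above.

```python
# Expand affect seed words to a 30-word lexicon per category via embedding neighbours.
import gensim.downloader as api

vec = api.load("glove-wiki-gigaword-50")
seeds = {
    "anger": ["angry", "outrage", "furious", "rage", "hate"],       # illustrative
    "sadness": ["sad", "grief", "mourning", "tears", "sorrow"],
}

lexicon = {}
for category, words in seeds.items():
    present = [w for w in words if w in vec]
    neighbours = vec.most_similar(positive=present, topn=30)        # 30 nearest words
    lexicon[category] = [w for w, _ in neighbours]

print(lexicon["anger"][:10])
```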
81/85
Partisanship of Affect Categories
◮ Compute GST partisanship scores using affect-category counts: ◮ Disgust affect flips along partisan lines depending on race of shooter.
82/85
Modality
◮ Count the four most frequent necessity modals in the data: should, must, have to, need to.
◮ in this context, they are used as calls to action.
◮ Democrats use modals more than Republicans; Republicans are more fatalistic.
84/85
Text as Data
◮ Unstructured text is and will be an important data source going forward in the social sciences. ◮ But these new data and methods are not a substitute for a good research question or a good research design – both are still necessary.
85/85