Introduction to Text Mining. Alliance Summer School 2019. Elliott Ash.



SLIDE 1

1/85

Introduction to Text Mining

Alliance Summer School 2019 Elliott Ash

SLIDE 2

2/85

Social Science meets Data Science

◮ We are seeing a revolution in social science:

◮ new datasets: administrative data, digitization of text archives, social media
◮ new methods: natural language processing, machine learning

◮ In particular:

◮ many important human behaviors consist of text, millions and millions of lines of it.
◮ we cannot read all these texts; somehow we must teach machines to read them for us.

SLIDE 3

4/85

Readings

◮ Google Developers Guide to Text Classification:

◮ https://developers.google.com/machine-learning/guides/text-classification/

◮ “Analyzing polarization in social media: Method and application to tweets on 21 mass shootings” (2019).

◮ Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky

◮ Natural Language Processing in Python
◮ Hands-on Machine Learning with Scikit-learn & TensorFlow 2.0

SLIDE 4

5/85

Programming

◮ Python is ideal for text data and machine learning.

◮ I recommend Anaconda 3.6: continuum.io/downloads

◮ For relatively small corpora, R is also fine:

◮ see the quanteda package.

SLIDE 5

6/85

Text as Data

◮ Text data is a sequence of characters; the units of analysis are called documents.
◮ The set of documents is the corpus.
◮ Text data is unstructured:

◮ the information we want is mixed together with (lots of) information we don’t.
◮ How to separate the two?

SLIDE 6

8/85

Dictionary Methods

◮ Dictionary methods use a pre-selected list of words or phrases to analyze a corpus.
◮ Corpus-specific:

◮ count words related to your analysis

◮ General

◮ e.g. LIWC (liwc.wpengine.com) has lists of words across categories.
◮ Sentiment analysis: count sets of positive and negative words (doesn’t work very well).
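A dictionary method can be sketched in a few lines of Python. The word lists below are invented for illustration; they are not taken from LIWC or any published lexicon:

```python
# A minimal dictionary-method sketch: count occurrences of hand-picked
# word lists in each document. The word lists here are toy examples.
import re
from collections import Counter

POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "terrible", "sad"}

def dictionary_score(document):
    """Return (positive count, negative count) for one document."""
    tokens = re.findall(r"[a-z']+", document.lower())
    counts = Counter(tokens)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos, neg

corpus = ["The economy is in great, great shape.",
          "A terrible and sad day for the markets."]
print([dictionary_score(doc) for doc in corpus])  # [(2, 0), (0, 2)]
```

This is the corpus-specific variant: for a general dictionary one would swap in a published word list.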

SLIDE 7

9/85

Measuring uncertainty in the macroeconomy

Baker, Bloom, and Davis

◮ Baker, Bloom, and Davis measure economic policy uncertainty using Boolean searches of newspaper articles (see http://www.policyuncertainty.com/).
◮ For each paper on each day since 1985, submit the following query:

1. Article contains “uncertain” OR “uncertainty”, AND
2. Article contains “economic” OR “economy”, AND
3. Article contains “congress” OR “deficit” OR “federal reserve” OR “legislation” OR “regulation” OR “white house”.

◮ Normalize resulting article counts by total newspaper articles that month.
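The three-part Boolean query above can be sketched as a simple Python filter. This is a toy version (crude substring matching on two invented sentences), not the actual index construction:

```python
# A sketch of the Baker-Bloom-Davis Boolean filter, using the term sets
# from the slide. An article counts toward the index only if it matches
# all three sets. Substring matching is crude but mirrors a Boolean search.
UNCERTAINTY = {"uncertain", "uncertainty"}
ECONOMY = {"economic", "economy"}
POLICY = {"congress", "deficit", "federal reserve", "legislation",
          "regulation", "white house"}

def is_epu_article(text):
    t = text.lower()
    return (any(w in t for w in UNCERTAINTY)
            and any(w in t for w in ECONOMY)
            and any(w in t for w in POLICY))

print(is_epu_article(
    "Uncertainty over new regulation weighs on the economy."))  # True
print(is_epu_article("The economy grew last quarter."))          # False
```

The monthly index is then the count of matching articles divided by the total article count that month.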

SLIDE 8

10/85

Measuring uncertainty in the macroeconomy

Baker, Bloom, and Davis

SLIDE 9

12/85

Goals of Featurization

◮ The goal: produce features that are

◮ predictive in the learning task
◮ interpretable by human investigators
◮ tractable enough to be easy to work with

SLIDE 10

13/85

Pre-processing

◮ Standard pre-processing steps:

◮ drop capitalization, punctuation, numbers, and stopwords (e.g. “the”, “such”)
◮ stem words to drop their endings (e.g., “taxes” and “taxed” both become “tax”)
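These steps can be sketched with the standard library alone. The stopword list and suffix rules below are toy stand-ins; in practice one would use e.g. NLTK's stopword list and its SnowballStemmer:

```python
# A minimal pre-processing sketch: lowercase, strip punctuation and
# numbers, drop stopwords, and apply a crude suffix-stripping stemmer.
import re

STOPWORDS = {"the", "such", "a", "an", "of", "and", "in", "is", "were"}

def crude_stem(token):
    """Toy stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    text = document.lower()                  # drop capitalization
    tokens = re.findall(r"[a-z]+", text)     # drop punctuation and numbers
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("The taxes were taxed in 1985!"))  # ['tax', 'tax']
```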

SLIDE 11

14/85

Parts of speech

◮ Parts of speech (POS) tags provide useful word categories corresponding to their functions in sentences:

◮ Content: noun (NN), verb (VB), adjective (JJ), adverb (RB)
◮ Function: determiner (DT), preposition (IN), conjunction (CC), pronoun (PR).

◮ Parts of speech vary in their informativeness for various functions:

◮ For categorizing topics, nouns are usually most important.
◮ For sentiment, adjectives are usually most important.

SLIDE 12

15/85

N-grams

◮ N-grams are phrases, sequences of words up to length N.

◮ bigrams, trigrams, quadgrams, etc.

◮ capture information and familiarity from local word order.

◮ e.g. “estate tax” vs “death tax”
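Extracting n-grams from a token list is a one-liner. This sketch joins the words with underscores, a common convention for phrase features:

```python
# Build n-grams from a token list by sliding a window of length n.
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["repeal", "the", "death", "tax", "now"]
print(ngrams(tokens, 2))
# ['repeal_the', 'the_death', 'death_tax', 'tax_now']
```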

SLIDE 13

16/85

Filtering the Vocabulary

◮ N-grams will blow up your feature space: filtering out uninformative n-grams is necessary.

◮ Google Developers recommend a vocabulary size of m = 20,000; I have gotten good performance from m = 2,000.

1. Drop phrases that appear in few documents, or in almost all documents, using tf-idf weights:

tf-idf(w) = (1 + log(c_w)) × log(N / d_w)

◮ c_w = count of phrase w in the corpus, N = number of documents, d_w = number of documents where w appears.

2. Filter on parts of speech (keep nouns, adjectives, and verbs).
3. Filter on pointwise mutual information to get collocations (Ash, JITE 2017, pg. 2).
4. Supervised feature selection: select phrases that are predictive of the outcome.
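The tf-idf weighting in step 1 can be computed directly from the slide's formula. The three-document corpus below is invented for illustration:

```python
# tf-idf per the slide: tf-idf(w) = (1 + log(c_w)) * log(N / d_w),
# where c_w is the corpus count of w and d_w its document count.
import math
from collections import Counter

corpus = [["tax", "cut", "tax"],
          ["tax", "reform"],
          ["school", "reform"]]

N = len(corpus)
c = Counter(w for doc in corpus for w in doc)        # corpus counts c_w
d = Counter(w for doc in corpus for w in set(doc))   # document counts d_w

tfidf = {w: (1 + math.log(c[w])) * math.log(N / d[w]) for w in c}
print({w: round(v, 3) for w, v in tfidf.items()})
```

Words appearing in every document (d_w = N) get weight zero, which is exactly the "drop phrases in almost all documents" criterion.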
SLIDE 14

17/85

A decent baseline for featurization

◮ Tag parts of speech; keep nouns, verbs, and adjectives.
◮ Drop stopwords, capitalization, punctuation.
◮ Run the Snowball stemmer to drop word endings.
◮ Make bigrams from the tokens.
◮ Take the top 10,000 bigrams based on tf-idf weight.
◮ Represent documents as tf-idf frequencies over these bigrams.

SLIDE 15

19/85

Cosine Similarity

cos_sim(v1, v2) = (v1 · v2) / (‖v1‖ ‖v2‖)

where v1 and v2 are vectors representing documents (e.g., IDF-weighted frequencies).

◮ each document is a non-negative vector in an m-dimensional space (m = size of the dictionary):
◮ closer vectors form smaller angles: cos(0) = +1 means identical documents.
◮ the furthest vectors are orthogonal: cos(π/2) = 0 means no words in common.
◮ For n documents, this gives n × (n − 1) similarities.
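A minimal implementation of cosine similarity over count vectors, with a toy four-word vocabulary:

```python
# Cosine similarity between two document vectors: dot product divided
# by the product of the vector norms.
import math

def cos_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

doc1 = [1, 2, 0, 0]   # counts over a shared 4-word vocabulary
doc2 = [2, 4, 0, 0]   # same direction as doc1, so similarity = 1
doc3 = [0, 0, 3, 1]   # no words in common with doc1, so similarity = 0

print(cos_sim(doc1, doc2))  # 1.0
print(cos_sim(doc1, doc3))  # 0.0
```

Because the vectors are non-negative, similarity ranges from 0 (orthogonal, no shared words) to 1 (identical direction), matching the bullets above.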

SLIDE 16

20/85

Text analysis of patent innovation

Kelly, Papanikolau, Seru, and Taddy (2018)

“Measuring technological innovation over the very long run”

◮ Data:

◮ 9 million patents since 1840, from the U.S. Patent Office and Google Scholar Patents.
◮ date, inventor, backward citations
◮ text (abstract, claims, and description)

◮ Text pre-processing:

◮ drop HTML markup, punctuation, numbers, capitalization, and stopwords.
◮ remove terms that appear in fewer than 20 patents.
◮ 1.6 million words remain in the vocabulary.

SLIDE 17

21/85

Measuring Innovation

Kelly, Papanikolau, Seru, and Taddy (2018)

◮ Backward IDF weighting of word w in patent i:

BIDF(w, i) = log[ (# of patents prior to i) / (1 + # of patents prior to i that include w) ]

◮ down-weights words that appeared frequently before a patent.

◮ For each patent i:

◮ compute cosine similarity ρij to all future patents j, using BIDF of i.

◮ 9m×9m similarity matrix = 30TB of data.

◮ enforce sparsity by setting similarity < .05 to zero (93.4% of pairs).

SLIDE 18

22/85

Novelty, Impact, and Quality

Kelly, Papanikolau, Seru, and Taddy (2018)

◮ “Novelty” is defined by dissimilarity (negative similarity) to previous patents:

Novelty_j = − Σ_{i ∈ B(j)} ρij

where B(j) is the set of previous patents (in, e.g., the last 20 years).

◮ “Impact” is defined as similarity to subsequent patents:

Impact_i = Σ_{j ∈ F(i)} ρij

where F(i) is the set of future patents (in, e.g., the next 100 years).

◮ A patent has high quality if it is novel and impactful:

log Quality_k = log Impact_k + log Novelty_k
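The backward and forward sums can be sketched on a toy example. The similarity values below are invented, and the sketch sums over all earlier/later patents rather than a 20- or 100-year window:

```python
# Toy sketch of the novelty/impact sums: rho[i][j] holds pairwise
# similarities between patents, with integer ids ordered by filing date.
def novelty_and_impact(rho, k):
    """Return (novelty, impact) for patent k: novelty is minus the sum of
    backward similarities; impact is the sum of forward similarities."""
    backward = sum(rho[i][k] for i in rho if i < k)
    forward = sum(rho[k][j] for j in rho[k] if j > k)
    return -backward, forward

# Hypothetical similarities among four patents (0 is oldest, 3 newest).
rho = {0: {1: 0.25, 2: 0.125, 3: 0.0},
       1: {2: 0.5, 3: 0.25},
       2: {3: 0.125}}
print(novelty_and_impact(rho, 1))  # (-0.25, 0.75)
```

Patent 1 here is dissimilar to what came before (high novelty) but similar to what came after (high impact), the signature of a breakthrough in the paper's terms.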

SLIDE 19

23/85

Validation

Kelly, Papanikolau, Seru, and Taddy (2018)

◮ For pairs with higher ρij, patent j is more likely to cite patent i.
◮ Within a technology class (assigned by the patent office), similarity is higher than across classes.
◮ Higher-quality patents get more cites:

SLIDE 20

24/85

Most Innovative Firms

Kelly, Papanikolau, Seru, and Taddy (2018)

SLIDE 21

25/85

Breakthrough patents: citations vs quality

Kelly, Papanikolau, Seru, and Taddy (2018)

SLIDE 22

26/85

Breakthrough patents and firm profits

Kelly, Papanikolau, Seru, and Taddy (2018)

SLIDE 23

28/85

Topic Models in Social Science

◮ Topic models developed in computer science and statistics:

◮ summarize unstructured text using the words within documents
◮ useful for dimension reduction

◮ Social scientists use topics as a form of measurement

◮ how observed covariates drive trends in language
◮ tell a story not just about what, but how and why
◮ topic models are more interpretable than other dimension-reduction methods, e.g. principal components analysis.

SLIDE 24

29/85

Latent Dirichlet Allocation (LDA)

◮ Idea: documents exhibit each topic in some proportion.

◮ Each document is a distribution over topics.
◮ Each topic is a distribution over words.

◮ Latent Dirichlet Allocation (e.g. Blei 2012) is the most popular topic model in this vein because it is easy to use and (usually) provides great results.

◮ Maintained assumptions: Bag of words/phrases, fix number of topics ex ante.
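The generative story behind LDA (not its inference procedure) can be simulated directly: draw a topic for each word position, then draw a word from that topic. The two topics and their vocabularies below are invented for illustration:

```python
# A sketch of LDA's generative model: each document has a topic mixture,
# each word position draws a topic, then a word from that topic's
# distribution. Real LDA inverts this process to infer the distributions.
import random

random.seed(0)

topics = {
    "economy": {"tax": 0.5, "budget": 0.3, "growth": 0.2},
    "education": {"school": 0.5, "teacher": 0.3, "student": 0.2},
}

def generate_document(topic_mixture, n_words):
    words = []
    names = list(topic_mixture)
    for _ in range(n_words):
        topic = random.choices(names,
                               weights=[topic_mixture[t] for t in names])[0]
        vocab = topics[topic]
        words.append(random.choices(list(vocab),
                                    weights=list(vocab.values()))[0])
    return words

# A document that is 80% "economy" and 20% "education".
print(generate_document({"economy": 0.8, "education": 0.2}, 10))
```

In full LDA the topic mixtures and topic-word distributions are themselves drawn from Dirichlet priors; fixing them here keeps the sketch short.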

SLIDE 25

30/85

A statistical highlighter

SLIDE 26

31/85

Topic modeling Federal Reserve Bank transcripts

Hansen, McMahon, and Prat (QJE 2017)

◮ Use LDA to analyze speech at the FOMC (Federal Open Market Committee).

◮ private discussions among committee members at the Federal Reserve (the U.S. central bank)
◮ transcripts: 150 meetings, 20 years, 26,000 speeches, 24,000 unique words.

◮ Pre-processing:

◮ drop stopwords, stem the words, etc.
◮ drop words with low tf-idf weight

SLIDE 27

32/85

LDA Training

Hansen, McMahon, and Prat (QJE 2017)

◮ K = 40 topics selected for interpretability / topic coherence.

◮ the “statistically optimal” K = 70, but these were less interpretable.

◮ hyperparameters α = 50/K and η = .025 to promote sparse word distributions (and more interpretable topics).

SLIDE 28

33/85

SLIDE 29

34/85

Pro-Cyclical Topics

Hansen, McMahon, and Prat (QJE 2017)

SLIDE 30

35/85

Counter-Cyclical Topics

Hansen, McMahon, and Prat (QJE 2017)

SLIDE 31

36/85

FOMC Topics and Policy Uncertainty

Hansen, McMahon, and Prat (QJE 2017)

SLIDE 32

37/85

Effect of Transparency

Hansen, McMahon, and Prat (QJE 2017)

◮ In 1993, there was an unexpected transparency shock: transcripts became public.
◮ Increasing transparency results in:

◮ higher discipline / more technocratic language (probably beneficial)
◮ higher conformity (probably costly)

◮ Highlights tradeoffs from transparency in bureaucratic organizations.
SLIDE 33

38/85

Structural Topic Model = LDA + Metadata

◮ STM provides two ways to include contextual information:

◮ Topic prevalence can vary by metadata

◮ e.g. Republicans talk about military issues more than Democrats

◮ Topic content can vary by metadata

◮ e.g. Republicans talk about military issues differently from Democrats.

◮ stm package in R provides easy syntax, complete workflow, going from raw texts to figures.

SLIDE 34

40/85

What is machine learning?

◮ In classical computer programming, humans input the rules and the data, and the computer provides answers.
◮ In machine learning, humans input the data and the answers, and the computer learns the rules.

SLIDE 35

41/85

A baseline for machine learning using text

1. Take the tf-idf-weighted, POS-filtered bigrams (from above) as inputs X.
2. Train a machine learning model to predict the outcome y:

2.1 For classification, use regularized logistic regression (or a gradient boosted classifier).
2.2 For regression, use elastic net (or a gradient boosted regressor).

3. Use cross-validation grid search in the training set to select model hyperparameters.
4. Evaluate the model in a held-out test set:

4.1 For classification, use the F1 score and a confusion matrix.
4.2 For regression, use R squared and a calibration plot.
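Step 4.1 can be illustrated with a hand-computed F1 score; the confusion-matrix counts below are hypothetical:

```python
# F1 score from confusion-matrix counts: the harmonic mean of precision
# (tp / (tp + fp)) and recall (tp / (tp + fn)).
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical held-out results: 40 true positives, 10 false positives,
# 10 false negatives -> precision = recall = 0.8, so F1 = 0.8.
print(f1_score(40, 10, 10))
```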

SLIDE 36

42/85

Predicting Policy Topics from Text

Ash, Morelli, and Osnabruegge (2018)

◮ Comparative Manifesto Project:

◮ 44,020 annotated English-language political statements
◮ hundreds of political party platforms from English-speaking countries.

◮ Each statement gets a CMP code, e.g. “decentralization”, “education”

◮ We want to classify text to one of 19 topics.

SLIDE 37

43/85

Regularized Logistic Regression

Ash, Morelli, and Osnabruegge (2018)

◮ N rows, M text features, K policy topics.
◮ Probability model for policy topic Yi:

P(Yi = c) = exp(βc · Xi) / Σ_{k=1}^{K} exp(βk · Xi)

where c ∈ {1, ..., K} are the topic labels, Xi is the vector of tf-idf frequencies for row i, and β is an M × K matrix of parameters.

◮ Cost function:

J(β) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} 1{Yi = k} log[ exp(βk · Xi) / Σ_{l=1}^{K} exp(βl · Xi) ] + γ Σ_{j=1}^{M} Σ_{k=1}^{K} β_{jk}²

◮ γ = strength of the L2 penalty

◮ γ∗ = 1/2 selected by 3-fold cross-validation grid search.
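The probability model and penalized cost can be written out directly. The tiny β, X, and y below are invented for illustration (in practice one would use e.g. scikit-learn's multinomial LogisticRegression):

```python
# Softmax probabilities and the L2-penalized negative log-likelihood
# for a multinomial logistic model, as on the slide.
import math

def softmax_probs(betas, x):
    """betas: list of K weight vectors; x: feature vector. Returns P(Y=c)."""
    scores = [math.exp(sum(b * xj for b, xj in zip(beta, x)))
              for beta in betas]
    total = sum(scores)
    return [s / total for s in scores]

def cost(betas, X, y, gamma):
    """Average negative log-likelihood plus gamma * sum of squared betas."""
    nll = 0.0
    for x, label in zip(X, y):
        nll -= math.log(softmax_probs(betas, x)[label])
    penalty = gamma * sum(b * b for beta in betas for b in beta)
    return nll / len(X) + penalty

betas = [[1.0, 0.0], [0.0, 1.0]]   # K = 2 topics, M = 2 features
X = [[2.0, 0.0], [0.0, 2.0]]       # two toy documents
y = [0, 1]                         # their topic labels
print(softmax_probs(betas, X[0]))  # first topic gets most of the mass
print(cost(betas, X, y, gamma=0.5))
```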

SLIDE 38

44/85

Agriculture/Education Topics: Predictive Features

Ash, Morelli, and Osnabruegge (2018)

SLIDE 39

45/85

Model Accuracy

Ash, Morelli, and Osnabruegge (2018)

◮ Given a chunk of text, the logistic model computes a probability distribution over policy topics.

◮ harnesses expert knowledge about political topics from Manifesto Project

◮ Validation of accuracy: predict the CMP code in a held-out sample of manifesto corpus statements

◮ Out-of-sample accuracy = 53%

◮ quite good given there are 19 policy areas – choosing randomly would be correct 5% of the time; choosing top category (other topic) would be correct 15% of the time.

SLIDE 40

46/85

Confusion Matrix

Ash, Morelli, and Osnabruegge (2018)

[Table, flattened in extraction: a 15 × 15 confusion matrix of true policy topics (rows, "Total True") against predicted topics (columns, "Total Predicted"), over the categories Administration, Agriculture, Culture, Economics, Education, Freedom, Internationalism, Law and Order, National Way of Life, Other, Party Politics, Quality of Life, Target Groups, Technology, and Welfare; 13,014 statements in total. The diagonal entries are the largest in most rows (e.g. Economics 961, Welfare 1,454).]

SLIDE 41

47/85

Experiment: Electoral Reform in New Zealand

Ash, Morelli, and Osnabruegge (2018)

◮ A 1993 reform in New Zealand moved from majoritarian to proportional representation:

◮ Majoritarian (first past the post): two parties, a single party controls parliament.
◮ Proportional representation: many minority parties, coalition governments.

◮ How did it affect speech topics in the New Zealand Parliament?

SLIDE 42

48/85

Change in Parliament Attention due to Reform

Ash, Morelli, and Osnabruegge (2018)

[Figure: estimated change in parliamentary attention to each topic after the reform (education, no topic, administration, internationalism, infrastructure, culture, target groups, agriculture, national way of life, law and order, quality of life, freedom, economics, party politics), on a scale from −2 to 3.]

SLIDE 43

49/85

Example “Party Politics” Speech

Ash, Morelli, and Osnabruegge (2018)

“I have seen seven Opposition leaders in my time, but I have never seen a leader as relentlessly negative as Helen Clark. . . . How could anybody be so negative, day in, day out? It could get into the Guinness Book of Records. She does not have a positive word to say about anything. It is all negative, negative, negative.” ◮ Parliamentarian Richard Prebble, 15 Feb 1999

SLIDE 44

51/85

Word Embedding: Language as Data

Word Embedding is a technology from computational linguistics that represents words and phrases as vectors in a geometric space, where locations and directions encode meaning.

SLIDE 45

52/85

Why word vectors?

◮ Once words are represented as vectors, we can use linear algebra to understand the relationships between words:

◮ Words that are geometrically close to each other are similar: e.g. “student” and “pupil.”
◮ More intriguingly, word2vec algebra can depict conceptual, analogical relationships between words.

◮ Consider the analogy: man is to king as woman is to ____
◮ With word2vec, we have vec(king) − vec(man) + vec(woman) ≈ vec(queen)
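The analogy arithmetic can be illustrated with hand-made 3-dimensional vectors; real embeddings are learned from text and have hundreds of dimensions, and these toy coordinates are invented:

```python
# Toy word-vector algebra: compute vec(king) - vec(man) + vec(woman)
# and find the nearest remaining word by cosine similarity.
import math

vectors = {
    "king":  [0.9, 0.8, 0.1],   # royalty, male
    "queen": [0.9, 0.1, 0.8],   # royalty, female
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],   # unrelated distractor
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```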

SLIDE 46

53/85

How does it work?

◮ How does word2vec learn the meaning of the word “fox”?

The quick brown fox jumps over the lazy dog

◮ It reads in every example of the word “fox” and tries to predict which other words appear in the context window.

◮ the prediction weights on these other words (after dimension reduction) are the word vectors.

SLIDE 47

54/85

Most similar words to dog, depending on window size

◮ Small windows pick up substitutable words; large windows pick up topics.

SLIDE 48

55/85

Measuring Emotionality in Politician Speeches

Gennaro and Ash (2019)

◮ Dictionary: a new domain-appropriate list of words for:

◮ Cognition processing (“thinking”): insight, causation, discrepancy, tentativeness, certainty, inhibition, inclusion, and exclusion
◮ Emotion processing (“feeling”): positive and negative emotions, pleasure, pain, happiness, anxiety, anger, and sadness.

◮ Use word embeddings to get the “direction” in language space corresponding to emotionality (most emotional, least cognitive).

SLIDE 49

56/85

Cognition Language

◮ Top cognition sentences:

◮ “In my judgment, neither is true in the case of this amendment.”
◮ “Is that correct?”
◮ “R. 15 contains a provision that is similar but, in fact, broader in scope.”

Emotion Language

◮ Top emotion sentences:

◮ “There is nothing to trouble any heart, nothing to hurt at all; death is only a quiet door, in an old garden wall.”
◮ “With joy in his heart and a smile on his face he graced practically every social occasion with a song.”
◮ “We Democrats may disagree, but we love our fellow men and we never hate them.”

SLIDE 50

57/85

Emotion language in Congress, 1914-2010

SLIDE 51

58/85

Analyzing Gender Bias with Word Embeddings

Garg, Schiebinger, Jurafsky, and Zou (2018)

Women’s occupation relative percentage vs. embedding bias in Google News vectors.

SLIDE 52

59/85

Ethnic groups ↔ Occupations

Garg, Schiebinger, Jurafsky, and Zou (2018)

The top 10 occupations most closely associated with each ethnic group in the Google News embedding.

SLIDE 53

61/85

Dependency Structure

◮ Dependency structures represent grammatical relations between words in a sentence:

◮ head-dependent relations (directed arcs)
◮ functional categories (arc labels)

SLIDE 54

62/85

Extracting Information from Legal Texts

◮ Syntactic dependency parsers allow computers to read texts and parse subjects, actions, and other useful information.

◮ In particular, modal verbs shall, must, will, may, and can encode obligations and entitlements in legal language.

SLIDE 55

63/85

Ash, MacLeod, and Naidu (2019)

◮ Data:

◮ a new corpus of 30,000 collective bargaining agreements from Canada, 1986 through 2015

◮ Agent (subject) categories:

◮ worker, union, company, and manager.

◮ Encode contract statements as (subject, modal, action).

SLIDE 56

64/85

Most Frequent Subject-Modal-Verb Tuples

Ash, MacLeod, and Naidu (2019)

agreement_shall_be, arbitrator_shall_have, board_shall_have, case_may_be, committee_shall_meet, company_shall_pay, company_shall_provide, company_will_pay, company_will_provide, decision_shall_be, employee_may_request, employee_shall_be, employee_shall_be_allowed, employee_shall_be_considered, employee_shall_be_entitled, employee_shall_be_given, employee_shall_be_granted, employee_shall_be_laid_off, employee_shall_be_paid, employee_shall_be_required, employee_shall_continue, employee_shall_lose, employee_shall_receive, employee_shall_retain, employee_will_be, employee_will_be_allowed, employee_will_be_entitled, employee_will_be_given, employee_will_be_granted, employee_will_be_paid, employee_will_be_required, employee_will_have, employer_shall_grant

SLIDE 57

65/85

Determinants of Relative Worker Control

Ash, MacLeod, and Naidu (2019)

◮ All specifications use within-firm, within-industry-year variation:

◮ Personal Income Tax (Non-Wage Compensation) ↑
◮ Unemployment Rate (Bargaining Power) ↓
◮ Number of Employers (Labor Market Competition) ↑

SLIDE 58

67/85

Analyzing polarization in social media: Method and application to tweets on 21 mass shootings

Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky

◮ Research Object:

◮ use NLP to understand four “dimensions” of social media polarization: topic choice, framing, affect, modality.

◮ Context:

◮ tweets in response to mass shooting events.

◮ Research question:

◮ does political partisanship manifest in polarized responses to violent/polarizing events?

SLIDE 59

68/85

Dataset

◮ 21 mass shooting events, 2015-2018, from Gun Violence Archive ◮ tweets about those events, identified by:

◮ location keywords (e.g. chattanooga, roseburg, san bernardino, fresno, etc.)
◮ event keywords (lemmas): shoot, gun, kill, attack, massacre, victim
◮ filter out retweets and tweets from deactivated accounts
◮ N = 10,000 (out of 4.4 million tweets from the firehose archive).

SLIDE 60

69/85

Identifying party affiliation of Twitter users

◮ Party affiliation is identified from whether a user follows more Democrats or Republicans, using a list of Twitter accounts associated with legislators, presidential candidates, and party organizations (Volkova et al 2014).

◮ at least 51% of tweets for each event can be assigned partisanship this way.

◮ For geolocated users this matches up pretty well with party vote shares by state (R² = .82):

SLIDE 61

70/85

Partisanship

◮ Leave-one-out estimator from Gentzkow et al (2019), applied to each shooting event:

π = (1/2) [ (1/|D|) Σ_{i∈D} q̂i · ρ̂−i + (1/|R|) Σ_{i∈R} q̂i · (1 − ρ̂−i) ]

◮ q̂i = token frequencies for user i, drawn from the set of Democrats D and the set of Republicans R.
◮ ρ̂−i has elements ρ−i = qD / (qD + qR), the empirical posterior probabilities computed from all other users (leaving out user i).
◮ π is an estimate of the expected posterior probability that a Bayesian observer would correctly predict party after observing one randomly sampled token.
◮ consistency assumes tokens are drawn from a multinomial logit.
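The leave-one-out estimator can be sketched on a toy example of per-user token counts. All counts below are invented, and the sketch applies the formula directly rather than reproducing the paper's code:

```python
# Leave-one-out polarization sketch: for each user, compare their token
# frequencies q_i to the Democrat posterior per token computed from all
# other users, then average the two party-level terms.
from collections import Counter

dem_users = [Counter({"gun": 3, "control": 1}), Counter({"control": 2})]
rep_users = [Counter({"gun": 1, "rights": 3}), Counter({"rights": 2})]

def rho_loo(user, party):
    """Democrat posterior per token, leaving out `user`'s own counts."""
    d, r = Counter(), Counter()
    for u in dem_users:
        d.update(u)
    for u in rep_users:
        r.update(u)
    (d if party == "D" else r).subtract(user)
    return {w: d[w] / (d[w] + r[w])
            for w in set(d) | set(r) if d[w] + r[w] > 0}

def pi_estimate():
    total_d = 0.0
    for u in dem_users:
        rho, n = rho_loo(u, "D"), sum(u.values())
        total_d += sum((c / n) * rho[w] for w, c in u.items())
    total_r = 0.0
    for u in rep_users:
        rho, n = rho_loo(u, "R"), sum(u.values())
        total_r += sum((c / n) * (1 - rho[w]) for w, c in u.items())
    return 0.5 * (total_d / len(dem_users) + total_r / len(rep_users))

print(pi_estimate())  # 0.75: well above the 0.5 of non-partisan speech
```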

SLIDE 62

71/85

Tweets about mass shootings are polarized

◮ comparable to π = .53 in Congressional speeches (GST 2019).
◮ The increase in polarization over time is not statistically significant.

SLIDE 63

72/85

Tweet Embeddings for Topic Assignment

1. Make a new vocabulary:

1.1 Sample 10,000 tweets from each event.
1.2 Keep stemmed words occurring at least ten times in at least three events (N = 2000).

2. Train GloVe embeddings on random samples of tweets from each event.
3. Create Arora et al (2017) tweet embeddings:

3.1 For each tweet t, compute a weighted average v_t of its word vectors, weighted by inverse word frequency.
3.2 Remove the first principal component of the matrix whose rows are the v_t.

SLIDE 64

73/85

Topics = Embedding Clusters

1. Cluster the embeddings using k-means:

◮ k-means clustering separates documents into k groups based on distance in embedding space.
◮ different from a topic model, because each document belongs to a single topic rather than a distribution across topics.

2. Drop hard-to-classify tweets:

2.1 Compute the ratio of the distance to the closest topic to the distance to the second-closest topic.
2.2 Drop tweets above the 75th percentile of that ratio.

◮ Validation using Amazon Mechanical Turk:

◮ Identify the word intruder: five words from one cluster, one from another cluster.
◮ Identify the tweet intruder: three tweets from one cluster, and one from another cluster.

SLIDE 65

74/85

Topic Content

◮ The embedding method resulted in more coherent topics (better MTurk validation for words and tweets) than a topic model; k = 8 gave the best coherence.
◮ The Appendix reports samples of tweets for each topic.

SLIDE 66

75/85

Between-topic vs within-topic polarization

◮ Within-topic polarization: compute π separately within each tweet cluster.
◮ Between-topic polarization: compute π using cluster counts rather than token counts.

SLIDE 67

76/85

Within-topic polarization

◮ Most polarized topics: shooter’s identity & ideology (.55), laws & policy (.54)

SLIDE 68

77/85

Partisanship of Topics, by Race of Shooter

SLIDE 69

78/85

Partisan Framing Devices: Words

◮ Partisanship of phrases from the GST model:
◮ The partisan valence of “terrorist” and “crazy” flips depending on the race of the shooter (these words have the largest racial difference in the joint vocabulary).

SLIDE 70

79/85

Partisan Framing Devices: Events

◮ Partisanship of keywords for previous events from the GST model:
◮ Democrats invoke white shooters, Republicans invoke POC shooters.

SLIDE 71

80/85

Affect

◮ Starting point: Emotion lexicon from Mohammad and Turney (2013), available at saifmohammad.com.

◮ 14,182 words assigned to sentiment (positive/negative) and emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust).

◮ Domain propagation:

◮ pick 5-11 representative words per emotion category (Appendix E)
◮ for each word in the vocabulary, compute the average distance to the members of each category; take the 30 closest words as the lexicon.

SLIDE 72

81/85

Partisanship of Affect Categories

◮ Compute GST partisanship scores using affect-category counts:
◮ The disgust affect flips along partisan lines depending on the race of the shooter.

SLIDE 73

82/85

Modality

◮ Count the four most frequent necessity modals in the data: should, must, have to, need to.

◮ in this context, they are used as calls to action.

◮ Democrats use modals more than Republicans; Republicans are more fatalistic.

SLIDE 74

84/85

Text as Data

◮ Unstructured text is and will be an important data source going forward in the social sciences.
◮ But these new data and methods are not a substitute for a good research question or a good research design; both are still necessary.

SLIDE 75

85/85

◮ Please email me if you would like to help out on a project or discuss research ideas.
◮ Thanks!