Predicting Document Creation Times in News Citation Networks



SLIDE 1

Predicting Document Creation Times in News Citation Networks

Andreas Spitz (1), Jannik Strötgen (2), and Michael Gertz (1)
April 23, 2018, TempWeb 2018, Lyon

(1) Database Systems Research Group, Heidelberg University, Germany
(2) Bosch Center for Artificial Intelligence, Germany

SLIDE 2

Hm, when did this happen again?

SLIDE 3

News Citation Networks

SLIDE 4

News Citation Network Extraction

SLIDE 5

News Citation Network Overview

News articles from RSS feeds:

◮ Politics and business feeds
◮ 34 English news outlets (USA, UK, AUS, CAN, GER, CHN, QAT)
◮ 2 years (Nov 2015 - Oct 2017)
◮ 244.6 thousand articles
◮ 367.2 thousand edges

Used data:

◮ Hyperlinks in the article body
◮ Publication dates
◮ Temporal expressions
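
The slide describes a network whose edges come from hyperlinks in the article body. A minimal sketch of that construction follows. It assumes the articles are already available as a mapping from URL to the list of links found in the body, and that only links to other collected articles become edges. The function name, the input layout, and the example URLs are hypothetical and not taken from the authors' implementation.

```python
from collections import defaultdict

def build_citation_network(articles):
    """Return (out_edges, in_edges) adjacency maps for a news citation network.

    `articles` maps an article URL to the list of hyperlink targets found in
    its body. Only links pointing at another article in the collection become
    edges; self-links and duplicate links are dropped.
    """
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for source, links in articles.items():
        for target in links:
            if target in articles and target != source:
                out_edges[source].add(target)
                in_edges[target].add(source)
    return dict(out_edges), dict(in_edges)

# Hypothetical toy input (URLs are made up for illustration).
articles = {
    "a.example/1": ["b.example/7", "https://elsewhere.example/x"],
    "b.example/7": [],
    "c.example/3": ["a.example/1", "b.example/7", "c.example/3"],
}
out_edges, in_edges = build_citation_network(articles)
```

Keeping separate incoming and outgoing maps makes it straightforward to later distinguish referencing articles (incoming edges) from referenced articles (outgoing edges).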

SLIDE 6

News Outlet Statistics (sample)

short  news outlet            days  articles  temp exp  other in  other out
AT     The Atlantic            334       7.2      10.5      16.7       50.6
BBC    British Bc. Corp.       730       8.1       6.5      19.1        8.0
DW     Deutsche Welle          334       1.2       6.1      48.1        5.9
FOX    Fox News                548       2.7       9.8       0.0        0.0
NPR    National Public Radio   334       0.4       8.4      63.6       58.5
NY     The New Yorker          548       3.0      13.2      33.5       30.6
NYT    New York Times          669      23.8      10.7      26.8        4.7
SMH    Sydney Morn. Herald     548       2.3       7.0       3.0       51.9
WP     Washington Post         548      62.7       9.4      13.7        5.1

SLIDE 7

Evolution of Network Metrics

[Figure: clustering coefficient, average path length, average degree, and undirected diameter over time (2016-01 to 2017-07), shown for the aggregated, politics, and business networks. Axes: days, measure value.]

SLIDE 8

Exploring Citation Chains

SLIDE 9

Article Publication Time Prediction

SLIDE 10

Task Definition: Publication Time Prediction

SLIDE 12

Available News Citation Network Data

Predict article publication times from:

◮ Citation network topology
◮ Publication dates of adjacent articles
◮ Temporal expressions in adjacent articles
◮ Not the metadata of the article itself
◮ Not the article content

SLIDE 13

Feature Extraction

SLIDE 16

Network Topology Features

Node degree-based features:

◮ Incoming degree
◮ Outgoing degree
◮ Undirected degree

Density-based features:

◮ Undirected local clustering coefficient

Centrality-based features:

◮ Betweenness centrality
◮ Incoming closeness centrality
◮ Outgoing closeness centrality
◮ PageRank centrality
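
As a rough illustration of the degree-based and density-based features listed above, the sketch below computes them from a plain edge list. It assumes that "undirected degree" means the number of distinct neighbors once edge direction is ignored, and it sets the clustering coefficient to 0 for nodes with fewer than two neighbors. Both choices are assumptions, as are the function and key names. The centrality features would normally come from a graph library and are not shown.

```python
def topology_features(edges):
    """Compute degree features and the undirected local clustering coefficient.

    `edges` is an iterable of (source, target) pairs of a directed graph.
    Node identifiers must be mutually comparable (e.g. all strings).
    """
    out_nb, in_nb, und_nb = {}, {}, {}
    for u, v in edges:
        for node in (u, v):
            out_nb.setdefault(node, set())
            in_nb.setdefault(node, set())
            und_nb.setdefault(node, set())
        if u == v:
            continue  # ignore self-loops
        out_nb[u].add(v)
        in_nb[v].add(u)
        und_nb[u].add(v)
        und_nb[v].add(u)

    features = {}
    for node, nbrs in und_nb.items():
        k = len(nbrs)
        # Count undirected edges among the neighbors of `node` (each pair once).
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in und_nb[a])
        # Assumption: nodes with fewer than two neighbors get coefficient 0.
        cc = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        features[node] = {
            "deg_in": len(in_nb[node]),
            "deg_out": len(out_nb[node]),
            "deg_all": k,
            "cc": cc,
        }
    return features

# Hypothetical toy graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]
topo = topology_features(edges)
```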

SLIDE 17

Temporal Network Features

SLIDE 18

Temporal Expression Features

Correlation of temporal expressions:

◮ good with publication dates of referencing articles (incoming edges)
◮ bad with publication dates of referenced articles (outgoing edges)

SLIDE 21

Missing Features and Imputation

Missing features

◮ 30.8% of feature values are missing
◮ 89.6% of articles are missing at least one feature

Imputation of missing values

◮ Column mean of the feature
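
Column-mean imputation can be sketched in a few lines. The version below assumes the feature matrix is a list of rows with `None` marking a missing value, and it falls back to 0.0 for a column with no observed values. The fallback and the function name are assumptions for illustration only.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Hypothetical feature matrix with one missing value per row.
rows = [
    [1.0, None, 3.0],
    [None, 4.0, 5.0],
    [3.0, 8.0, None],
]
imputed = impute_column_means(rows)
```

Each missing cell receives the mean of the other two values in its column, so the imputed rows stay on the same scale as the observed data.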

SLIDE 22

Evaluation

SLIDE 23

Regression Methods

Used regression methods:

◮ BASE: Baseline (average publication date of adjacent articles)
◮ LR: Linear regression
◮ BAY: Bayesian ridge regression (Laplace model)
◮ RF: Random forest
◮ GB: Gradient boosting (Laplace distribution, decision trees)
◮ SVM: Support vector machine (radial kernel)
◮ NN: Neural network (feedforward, one hidden layer)
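
The BASE method and the error measure used in the evaluation are simple enough to sketch directly. The code below assumes dates are averaged via their day ordinals and rounded to the nearest day, and that an article without adjacent articles gets no baseline prediction. These details, the function names, and the example dates are hypothetical.

```python
from datetime import date

def baseline_prediction(neighbor_dates):
    """Predict a publication date as the mean of the adjacent articles' dates."""
    if not neighbor_dates:
        return None  # no adjacent articles, so there is nothing to average
    mean_ordinal = sum(d.toordinal() for d in neighbor_dates) / len(neighbor_dates)
    return date.fromordinal(round(mean_ordinal))

def mean_absolute_error_days(true_dates, predicted_dates):
    """Mean absolute difference in days between true and predicted dates."""
    errors = [abs((t - p).days) for t, p in zip(true_dates, predicted_dates)]
    return sum(errors) / len(errors)

# Hypothetical example: two articles, each with known neighbor dates.
neighbors = [
    [date(2017, 3, 1), date(2017, 3, 11)],
    [date(2016, 12, 30), date(2017, 1, 1), date(2017, 1, 3)],
]
truth = [date(2017, 3, 4), date(2017, 1, 1)]
predictions = [baseline_prediction(ns) for ns in neighbors]
mae = mean_absolute_error_days(truth, predictions)
```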

SLIDE 24

Evaluation Results: Mean Absolute Error (days)

         BASE     LR    BAY     NN     RF     GB    SVM
all     66.72  60.46  59.61  26.88  24.98  22.66  26.19
in      88.88  66.48  87.55  34.03  32.25  27.49  32.29
out     87.32  59.54  40.24  32.52  30.10  26.68  30.77
in+out  18.68  55.45  54.95  12.62  11.23  12.76  14.31

SLIDE 25

Distribution of Absolute Errors

[Figure: distribution of absolute error (days) by regression method (BASE, LR, BAY, NN, RF, GB, SVM). Panels: out, in+out, all, in.]

SLIDE 26

Recall by Varying Absolute Error

[Figure: recall (percentage of predictions < absolute error) against absolute error (days) for each method (BASE, LR, BAY, NN, RF, GB, SVM). Panels: out, in+out, all, in.]
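
The recall measure on this slide, the percentage of predictions whose absolute error is below a given number of days, can be sketched as follows. The function name, the error values, and the thresholds are made up for illustration.

```python
def recall_at_threshold(abs_errors, threshold_days):
    """Percentage of predictions whose absolute error is below the threshold."""
    if not abs_errors:
        return 0.0
    hits = sum(1 for e in abs_errors if e < threshold_days)
    return 100.0 * hits / len(abs_errors)

# Hypothetical absolute errors in days for eight predictions.
abs_errors = [2, 5, 9, 14, 30, 61, 120, 200]
curve = {t: recall_at_threshold(abs_errors, t) for t in (10, 20, 40, 60)}
```

Evaluating the function over a range of thresholds gives one curve per regression method.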

SLIDE 27

Feature Importance: Random Forest

[Figure: relative importance of each feature in the random forest model, log scale from 10^-3 to 10^0, colored by feature type (network topology, temporal expression, temporal network).]

Features in the order shown:

max(T_out), min(T_in), µ(T_out), µ(T_in), min(T_out), max(T_in), max(X_in), µ(X_in), c_pr, σ(T_out), σ(X_in), c_cl,out, span(T_out), σ(T_in), span(T_in), min(X_in), span(X_in), c_cl,in, min(Dist), deg_out, µ(Dist), deg_in, deg_all, max(Dist), c_btw, cc, σ(Dist)
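
The feature labels combine a statistic (min, max, µ, σ, span) with a set of values from adjacent articles. The sketch below shows one way to compute such aggregates. It assumes the values are day offsets, that σ is the population standard deviation, and that an empty neighbor set yields missing values for later imputation. The slides do not spell out these details, so all of them, and the function name, are assumptions.

```python
from statistics import mean, pstdev

def aggregate(values):
    """Summarize a list of day offsets with min, max, mean, std, and span."""
    if not values:
        # No neighbors on this side: every statistic is missing.
        return {"min": None, "max": None, "mean": None, "std": None, "span": None}
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
        "span": max(values) - min(values),
    }

# Hypothetical day offsets of three referencing articles, and no referenced ones.
agg_in = aggregate([10, 12, 20])
agg_out = aggregate([])
```

Returning `None` for empty neighbor sets is one plausible source of missing feature values for articles that lack incoming or outgoing edges.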

SLIDE 28

Feature Importance: Gradient Boosting

[Figure: relative importance of each feature in the gradient boosting model, log scale from 10^-5 to 10^0, colored by feature type (network topology, temporal expression, temporal network).]

Features in the order shown:

max(T_out), min(T_in), deg_out, µ(T_out), min(Dist), deg_in, c_pr, σ(T_out), σ(T_in), µ(T_in), deg_all, span(T_in), max(T_in), µ(X_in), min(T_out), µ(Dist), c_btw, max(X_in), max(Dist), span(X_in), σ(X_in), span(T_out), min(X_in), c_cl,out, σ(Dist), c_cl,in, cc

SLIDE 29

Summary & Resources

SLIDE 30

Summary

News citation networks:

◮ Focus on anchored links inside the article body
◮ Constructed like a citation network between articles

Publication date prediction:

◮ Can be framed as a regression problem
◮ Average prediction error of 3 weeks
◮ Temporal network features are most discriminative

SLIDE 31

Resources

Data and implementation are available online:

◮ [data] News citation network (including URLs)
◮ [data] Temporal annotations
◮ [code] Publication date prediction

https://dbs.ifi.uni-heidelberg.de/resources/data/
