Linking People in Videos with Their Names Using Coreference Resolution

Dec 26, 2023


SLIDE 1

Linking People in Videos with Their Names Using Coreference Resolution

Vignesh Ramanathan, Armand Joulin, Percy Liang, and Li Fei-Fei Stanford University

Images from Ramanathan et al. (2014)

Yukun Zhu CSC2523 1 / 17

SLIDE 2

Task

Example script: "Missy points to the larger kid. The big kid walks off. Other kids jeer."
No labelled instances; the script is the only source of supervision
Names include nominal expressions and pronouns

SLIDE 4

Previous Approach

On person naming:
  Multiple instance learning, using proper names from the script
  Treat videos and scripts as bags of face tracks and names
  Unidirectional information flow from text to vision
On coreference resolution:
  One of the core tasks in NLP
  Can operate on language alone
  Not accurate enough

SLIDE 10

Problem Setup

Input:
  Videos with detected human tracks
  Script roughly aligned with video segments
  Names (including pronouns and nominals) from the script
  Cast names

SLIDE 13

Problem Setup

Output:
  Name assignment to human tracks in video
  Name assignment to human mentions in text

SLIDE 15

Proposed Method

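$$C = \gamma_t C_{\text{track}}(Y) + \gamma_m C_{\text{mention}}(Z, R) + C_{\text{align}}(A, Y, Z)$$
Name-track assignment: $Y \in \{0, 1\}^{T \times P}$
Name-mention assignment: $Z \in \{0, 1\}^{M \times P}$
Antecedent matrix: $R \in \{0, 1\}^{M \times M}$
Alignment matrix: $A \in \{0, 1\}^{T \times M}$
Here $T$ is the number of tracks, $M$ the number of mentions, and $P$ the number of names.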
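As a rough illustration of the four variables on this slide, the sketch below builds toy binary matrices with NumPy. The sizes and all values are made up for illustration, and the orientation of R (row = mention, column = its antecedent) is an assumption, not something the slides specify.

```python
import numpy as np

T, M, P = 3, 4, 2  # hypothetical sizes: 3 tracks, 4 mentions, 2 cast names

# Y[t, p] = 1 if track t is assigned name p (one name per track).
Y = np.array([[1, 0],
              [0, 1],
              [1, 0]])

# Z[m, p] = 1 if mention m is assigned name p (one name per mention).
Z = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 0]])

# R[m, m2] = 1 if mention m2 is the antecedent of mention m (assumed convention).
R = np.zeros((M, M), dtype=int)
R[1, 0] = 1  # e.g. a pronoun (mention 1) referring back to mention 0

# A[t, m] = 1 if track t is aligned with mention m.
A = np.zeros((T, M), dtype=int)
A[0, 0] = A[0, 1] = A[1, 2] = A[2, 3] = 1

# In this toy case every aligned track and mention carry the same name,
# so A^T Y reproduces Z exactly.
names_agree = np.array_equal(A.T @ Y, Z)
```

In this toy configuration the alignment is perfectly consistent, so the track names propagated through A match the mention names.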

SLIDE 16

$C_{\text{track}}(Y)$

Cost of assigning names to tracks
Based on video features only
Formulated as the cost function of regression-based clustering:
$$C(Y; X, \lambda) = \min_W \|Y - XW\|_F^2 + \lambda \|W\|_F^2 = \operatorname{tr}\left(Y^\top \Pi(X, \lambda)\, Y\right)$$
Constraints:
  Each track is assigned to exactly one name
  A speaker should be aligned to at least one track
  A name not mentioned in a scene won't be aligned
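The equality between the ridge-regression minimum and the trace form can be checked numerically. The sketch below assumes the standard closed form $\Pi(X, \lambda) = I - X (X^\top X + \lambda I)^{-1} X^\top$ for a model without a bias term or normalization; the slide does not spell out $\Pi$, so this form, the function names, and the random data are all illustrative assumptions.

```python
import numpy as np

def ridge_cost(Y, X, lam):
    """Minimum over W of ||Y - XW||_F^2 + lam * ||W||_F^2, via the closed-form W."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    return np.sum((Y - X @ W) ** 2) + lam * np.sum(W ** 2)

def pi_matrix(X, lam):
    """Assumed form: Pi = I - X (X^T X + lam I)^{-1} X^T (no bias term)."""
    n, d = X.shape
    return np.eye(n) - X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # hypothetical features: 6 tracks, 3 dims
Y = np.eye(2)[rng.integers(0, 2, 6)]   # hypothetical one-hot assignment to 2 names
lam = 0.1

direct = ridge_cost(Y, X, lam)
trace_form = np.trace(Y.T @ pi_matrix(X, lam) @ Y)
```

Because $\Pi$ depends only on the features, the cost is a quadratic function of Y, which is what makes a quadratic-programming step over Y possible.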

SLIDE 17

$C_{\text{mention}}(Z, R)$

Depends on text only
Proper mentions (68%) are trivial to map
Pronouns and nominals alone are not informative
Apply regression-based clustering to predict R
Constraints:
  Each mention has at most one antecedent
  Each mention is assigned to one name
  Gender consistency; no self-association of pronouns
  Connection constraint: $R_{m,m'} = 1 \Rightarrow Z_m = Z_{m'}$
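The constraints on Z and R can be made concrete with a small checker. This is a minimal sketch, not the paper's formulation: the function name is hypothetical, R[m, m2] = 1 is assumed to mean that m2 is the antecedent of m, "no self-association" is read here simply as a zero diagonal of R, and gender consistency is left out because it needs annotations the slides do not show.

```python
import numpy as np

def check_mention_constraints(Z, R):
    """Return True if binary Z (M x P) and R (M x M) satisfy the listed constraints.

    Assumed convention: R[m, m2] = 1 means mention m2 is the antecedent of m.
    Gender consistency is omitted (it needs extra annotations).
    """
    Z, R = np.asarray(Z), np.asarray(R)
    one_name = np.all(Z.sum(axis=1) == 1)                 # one name per mention
    at_most_one_antecedent = np.all(R.sum(axis=1) <= 1)   # at most one antecedent
    no_self_link = np.all(np.diag(R) == 0)                # no mention links to itself
    # Connection constraint: linked mentions must share the same name row.
    linked = np.argwhere(R == 1)
    connected = all(np.array_equal(Z[m], Z[m2]) for m, m2 in linked)
    return bool(one_name and at_most_one_antecedent and no_self_link and connected)

Z = [[1, 0], [1, 0], [0, 1]]
R_ok = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]    # mention 1 -> mention 0, same name
R_bad = [[0, 0, 0], [0, 0, 0], [1, 0, 0]]   # mention 2 -> mention 0, names differ

ok = check_mention_constraints(Z, R_ok)
bad = check_mention_constraints(Z, R_bad)
```

Here `R_ok` passes because the linked mentions share a name, while `R_bad` violates the connection constraint.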

SLIDE 18

$C_{\text{align}}(A, Y, Z)$

Intuition:
  An aligned track and mention should be assigned to the same name
  Tracks and mentions are ordered sequences through time
  Tracks and mentions are roughly aligned in time
Formulation:
  Soft connection penalty: $\min \|A^\top Y - Z\|_F^2$
  Monotonic constraint
  Mention mapping constraint
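The soft connection penalty is easy to evaluate directly. The following sketch computes $\|A^\top Y - Z\|_F^2$ for two hand-made alignments; the function name and the toy matrices are illustrative assumptions, and the monotonic and mention mapping constraints are not enforced here.

```python
import numpy as np

def align_penalty(A, Y, Z):
    """Soft connection penalty ||A^T Y - Z||_F^2."""
    A, Y, Z = (np.asarray(mat, dtype=float) for mat in (A, Y, Z))
    return float(np.sum((A.T @ Y - Z) ** 2))

Y = np.array([[1, 0], [0, 1]])            # 2 tracks, 2 names
Z = np.array([[1, 0], [0, 1], [0, 1]])    # 3 mentions, 2 names

# Track 0 aligned with mention 0, track 1 with mentions 1 and 2.
A_good = np.array([[1, 0, 0], [0, 1, 1]])
# The same alignment with the two tracks swapped.
A_bad = np.array([[0, 1, 1], [1, 0, 0]])

good = align_penalty(A_good, Y, Z)
bad = align_penalty(A_bad, Y, Z)
```

The consistent alignment incurs no penalty, while the swapped one is penalized for every mention whose aligned track carries a different name.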

SLIDE 23

Optimization

$$\min \; \gamma_t C_{\text{track}}(Y) + \gamma_m C_{\text{mention}}(Z, R) + C_{\text{align}}(A, Y, Z) \quad \text{s.t.} \quad Y \in \mathcal{C}_Y,\; (Z, R) \in \mathcal{C}_{Z,R},\; A \in \mathcal{C}_A$$
Relax Y, R, Z to take values in [0, 1]
Slack constraints on Y, Z
Block coordinate descent:
  Quadratic programming to optimize Y
  Quadratic programming to optimize Z, R
  Dynamic time warping to optimize A
Round Y, Z to integer matrices
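To give a feel for the alignment step, here is a simplified dynamic-programming sketch in the spirit of dynamic time warping. It is a stand-in under stated assumptions, not the paper's exact procedure: each mention is assigned to exactly one track, track indices must not decrease along the mention sequence, and the cost of pairing track t with mention m is taken to be the squared distance between their (possibly relaxed) name vectors. The function name is hypothetical.

```python
import numpy as np

def monotone_alignment(Y, Z):
    """Assign each mention to one track so that track indices never decrease.

    Simplified stand-in for the dynamic-time-warping step: minimizes
    sum_m ||Y[t(m)] - Z[m]||^2 subject to t(0) <= t(1) <= ... .
    Returns a binary alignment matrix A of shape (T, M).
    """
    Y, Z = np.asarray(Y, dtype=float), np.asarray(Z, dtype=float)
    T, M = Y.shape[0], Z.shape[0]
    # cost[t, m] = squared distance between track t and mention m name vectors.
    cost = ((Y[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)

    D = np.empty((T, M))                  # D[t, m]: best cost with mention m on track t
    back = np.zeros((T, M), dtype=int)    # predecessor track for mention m - 1
    D[:, 0] = cost[:, 0]
    for m in range(1, M):
        best, best_t = np.inf, 0
        for t in range(T):
            # Keep the cheapest predecessor among tracks 0..t.
            if D[t, m - 1] < best:
                best, best_t = D[t, m - 1], t
            D[t, m] = cost[t, m] + best
            back[t, m] = best_t

    # Backtrack from the cheapest final state.
    A = np.zeros((T, M), dtype=int)
    t = int(np.argmin(D[:, M - 1]))
    for m in range(M - 1, -1, -1):
        A[t, m] = 1
        t = back[t, m]
    return A

Y = np.array([[1, 0], [0, 1]])            # 2 tracks
Z = np.array([[1, 0], [0, 1], [0, 1]])    # 3 mentions
A = monotone_alignment(Y, Z)
```

In a block coordinate descent loop, a step like this would alternate with the updates of Y and of (Z, R) while the other blocks are held fixed.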

SLIDE 24

Dataset

SLIDE 25

Quantitative Results

Name assignment to tracks in video.

Random: randomly picks a name based on crude alignment
Cour: weakly supervised method for name assignment
BOJ: $\min C_{\text{track}}$ without scene constraint
OurUnidir: $\min C_{\text{track}}$ with scene constraint
OurUnicor: $\min C_{\text{track}}$ with coreference constraints
OurUnif: all tracks given equal values in the alignment matrix
OurBidir: full model

SLIDE 26

Quantitative Results

Name assignment to mentions in text.

SLIDE 27

Qualitative Results

SLIDE 28

Errors

Missing or low-resolution faces
Errors in coreference resolution

SLIDE 30

Summary

Contribution:
  Joint person naming and coreference resolution
  New dataset
  State-of-the-art performance on both the visual and textual sides
Future work:
  Actions/attributes for alignment

SLIDE 31

V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking People in Videos with “Their” Names Using Coreference Resolution. In Computer Vision – ECCV 2014, pages 95–110. Springer International Publishing, Cham, Sept. 2014.
