SLIDE 1 Distributional Embedding Approach for Relational Knowledge Representation
Dissertation Proposal Supervisor: Dr. Tao
Pace University March 2017
SLIDE 2
Contents overview
Part 1: Brief introduction
◮ Topic, Issue, and Solution idea
Part 2: Details
◮ Methods, Related work, and Proposed work
SLIDE 3
PART ONE
SLIDE 4 Introduction
Overview
◮ Relational Learning through Knowledge Base Representation
◮ Relational Knowledge Representation
◮ Knowledge ≈ entities + their relationships
Motivation and importance
◮ Relationships between entities are a rich source of information
SLIDE 5
Knowledge Bases (KBs)
Web-scale extracted KBs provide a structured representation of world knowledge
◮ Large quantities of knowledge are publicly available in relational form, interlinked across different domains
The ability to learn from relational data has a significant impact on many applications
SLIDE 6
Applications and use cases
World Wide Web and Semantic Web
◮ linkage of related documents, and semantically structured data
Bioinformatics and Molecular Biology
◮ gene-disease association
Social Networks
◮ relationships between persons
Question Answering
◮ link prediction in knowledge base queries
SLIDE 7 Example KB: FreeBase1
1info source: Deep Learning Lectures (Bordes, 2015)
SLIDE 8 Example KB: WordNet2
2info source: Deep Learning Lectures (Bordes, 2015)
SLIDE 9
Problem
Collectively, KBs contain over 60 billion published facts, and they keep growing (Nickel, 2013)
KBs have very large dimensions, and thus they are ...
◮ Hard to manipulate
◮ Sparse (with few valid links)
◮ Noisy and/or incomplete
Tackling these issues is key to automatically understanding and utilizing the structure of large-scale knowledge bases
SLIDE 10 Idea
Modeling Knowledge Bases
◮ KBs Embeddings (inspired by Word Embeddings)
How ?
- 1. Encode (embed) KBs into a low-dimensional vector space s.t. similar entities/relations are represented by similar "nearby" vectors
- 2. Use these representations:
◮ to complete/visualize KBs
◮ as KB data in text applications
(Bordes et al., 2013)
SLIDE 11
Example use case: link prediction
◮ Question Answering Systems
◮ Assessing the validity of results from Information Retrieval Systems
An example fragment of a KB.
SLIDE 12 Word Embeddings
The most successful approach to representing word meanings (now standard)
Two main components . . .
- 1. Neural Language Modeling “NLM”
◮ neural networks approach for text representations
- 2. Distributional Semantics Hypothesis3
◮ words which are similar in meaning occur in similar contexts
(Rubenstein and Goodenough, 1965)
3 One of the most successful ideas of modern statistical NLP. As described in: Deep Learning for NLP (Socher, 2016)
SLIDE 13
Relations Embeddings
The current state of the art in relation embeddings exploits (only): Neural Language Modeling
However, it does not exploit: the Distributional Semantics Hypothesis
As a result,
The performance of current KB modeling methods is far from being useful to leverage in real-world applications
SLIDE 14
Current Approach
Current relation embedding approaches do not make use of distributional similarities over KB relations
SLIDE 15
Proposal Approach
The proposed approach (inspired by word embeddings) brings the entire experience of word representations to relation embeddings by incorporating Distributional Similarity
SLIDE 16
PART TWO
SLIDE 17
Knowledge Base
What is a Knowledge Base?
Knowledge bases (KBs) store factual information about the real world in the form of binary relations between entities (e.g., FreeBase, NELL, WordNet, YAGO).
In KBs, facts are expressed as triplets ("binary relations" between entities): a triplet of tokens (subject, verb, object) with the values (entity_i, relation_k, entity_j)
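For illustration only, a minimal Python sketch (all names and facts are taken from the sample tables that follow; the helper is hypothetical) of how such triplets can be stored and queried:

    # A toy in-memory triple store: facts are (head, relation, tail) tuples.
    facts = [
        ("radical", "part_of", "molecule"),
        ("molecule", "has_part", "atom"),
        ("motorola", "acquiredBy", "google"),
    ]

    def query(head=None, relation=None, tail=None):
        """Return all triplets matching the given (possibly partial) pattern."""
        return [(h, r, t) for (h, r, t) in facts
                if (head is None or h == head)
                and (relation is None or r == relation)
                and (tail is None or t == tail)]

    print(query(relation="part_of"))   # -> [('radical', 'part_of', 'molecule')]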
SLIDE 18 Sample fragments of KBs
Table 1: Sample KB triplets for “molecule” entity from WordNet18 (Miller, 1995)
head                    relation                   tail
__radical_NN_1          _part_of                   __molecule_NN_1
__physics_NN_1          _member_of_domain_topic    __molecule_NN_1
__molecule_NN_1         _has_part                  __atom_NN_1
__unit_NN_5             _hyponym                   __molecule_NN_1
__chemical_chain_NN_1   _part_of                   __molecule_NN_1
__molecule_NN_1         _hypernym                  __unit_NN_5
Table 2: Sample triplets from NELL4 KB
head                relation                  tail
action_movies       is_a                      movie
action_movies       is_a_generalization_of    die_hard
leonardo_dicaprio   is_an                     actor
akiva_goldsman      directedMovie             batman_forever
leonard_nimoy       StarredIn                 star_trek
motorola            acquiredBy                google
david_beckham       playSport                 soccer
4 http://rtw.ml.cmu.edu/rtw/kbbrowser/
SLIDE 19
Introduction: Knowledge Graphs
Knowledge Graphs are graph structured knowledge bases (KBs).
◮ The multi-relational data (of KBs) can form directed graphs (of knowledge) whose nodes correspond to entities and whose edges correspond to relations between entities.
◮ Multigraph Structure (see the sketch below)
Entity = Node
Relation Type = Edge Type
Fact = Edge
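A minimal sketch of this multigraph view, here using the networkx library purely as one possible illustration (the facts are from the NELL sample above):

    import networkx as nx

    # Entities are nodes; each fact (head, relation, tail) is a directed edge
    # whose key is the relation type, so parallel edges (a multigraph) are allowed.
    kg = nx.MultiDiGraph()
    kg.add_edge("leonardo_dicaprio", "actor", key="is_an")
    kg.add_edge("akiva_goldsman", "batman_forever", key="directedMovie")
    kg.add_edge("motorola", "google", key="acquiredBy")

    # All outgoing relations of an entity:
    print(list(kg.out_edges("motorola", keys=True)))
    # -> [('motorola', 'google', 'acquiredBy')]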
SLIDE 20 Word Embeddings: Semantic Theory
Distributional similarity representation
◮ Distributional Hypothesis
“You shall know a word by the company it keeps.” (Firth, 1957)
Examples 5
◮ It was found in the banks of the Amoy River ..
◮ I was seated in my office at the bank when a card . . .
◮ with a plunge, like the swimmer who leaves the bank . . .
◮ through the issue of bank notes, the money capital . . .
◮ settlements were on the north bank of the Ohio River . . .
5https://youtu.be/T1O3ikmTEdA?t=16m29s
SLIDE 21 Word Embeddings: Neural Language Model
Vector Space Models (VSMs)
Distributed representations of words to address the dimensionality problem.
VSM Approaches:
- 1. count-based: Latent Semantic Analysis “LSA”
- 2. prediction-based: Neural probabilistic language models (Bengio et al., 2003)
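A rough sketch of the count-based route (toy corpus and window definition are invented for illustration): build a word-context co-occurrence matrix, then reduce it with a truncated SVD, as in LSA.

    import numpy as np
    from itertools import combinations

    corpus = [["dog", "barks", "loudly"], ["cat", "meows", "loudly"], ["dog", "chases", "cat"]]
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences within each sentence (a crude "context window").
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for w1, w2 in combinations(sent, 2):
            M[index[w1], index[w2]] += 1
            M[index[w2], index[w1]] += 1

    # LSA-style dimensionality reduction via truncated SVD.
    U, S, _ = np.linalg.svd(M)
    embeddings = U[:, :2] * S[:2]          # 2-dimensional word vectors
    print(embeddings[index["dog"]])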
SLIDE 22
Word Embeddings: Sparse Representations
The sparsity of Symbolic Representations makes them suffer from the "curse of dimensionality"
◮ Word order is lost
◮ No account of semantics
Hypothetical example: Symbolic Representations of the terms Dog and Cat
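For concreteness, a toy sketch of such symbolic (one-hot) vectors over a hypothetical four-word vocabulary: every word gets its own dimension, so "dog" and "cat" are orthogonal and their similarity is zero.

    import numpy as np

    vocab = ["dog", "cat", "car", "tree"]            # hypothetical vocabulary
    one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

    print(one_hot["dog"])                   # [1. 0. 0. 0.]
    print(one_hot["cat"])                   # [0. 1. 0. 0.]
    print(one_hot["dog"] @ one_hot["cat"])  # 0.0 -- no notion of similarity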
SLIDE 23
Word Embeddings: Distributed Representation
Distributed Representations can address the sparsity issue
◮ Low-dimensional
◮ Induce a rich similarity space
Hypothetical example: Distributed Representations of the terms Dog and Cat
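A toy sketch with made-up dense vectors (the values are invented, not learned): related words get nearby vectors, so cosine similarity becomes meaningful.

    import numpy as np

    # Hypothetical low-dimensional embeddings, for illustration only.
    dog = np.array([0.8, 0.3, -0.1])
    cat = np.array([0.7, 0.4, -0.2])
    car = np.array([-0.5, 0.9, 0.6])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(dog, cat))   # high: similar animals
    print(cosine(dog, car))   # low: unrelated concepts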
Question: How can we generate such rich vector representations ?
SLIDE 24
Word Embeddings: Neural Word Embedding
word2vec (Mikolov et al., 2013): The most successful example of modeling semantic (and syntactic) similarities of words.
It trains (generates) word vectors (embeddings) by leveraging the Distributional Hypothesis to predict the following/previous words in a given sequence.
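As a side illustration (not part of the proposal), the gensim library can train such vectors from raw sentences; the corpus is a toy one and the parameter values are arbitrary (parameter names follow recent gensim releases).

    from gensim.models import Word2Vec

    sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]   # toy corpus

    # sg=1 selects the skip-gram model; vector_size/window are hypothetical choices.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["dog"])                     # the learned 50-dimensional vector
    print(model.wv.similarity("dog", "cat"))   # cosine similarity between embeddings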
SLIDE 25 Word Embeddings: Word2vec
In word2vec's skip-gram model, the goal is to maximize the average log-likelihood over all T training words taken as targets:
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)

where

p(w_c \mid w_t) = \frac{\exp(w_t^{\top} w_c)}{\sum_{i=1}^{V} \exp(w_t^{\top} w_i)}
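A numeric sketch of the softmax above (toy vocabulary size, random vectors, all hypothetical): the probability of a context word is the normalized exponentiated dot product between the target and context vectors.

    import numpy as np

    rng = np.random.default_rng(0)
    V, k = 10, 4                          # toy vocabulary size and embedding dimension
    W_target = rng.normal(size=(V, k))    # "input" vectors w_t
    W_context = rng.normal(size=(V, k))   # "output" vectors w_c

    def p_context_given_target(c, t):
        scores = W_context @ W_target[t]          # dot products with every word
        scores -= scores.max()                    # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[c]

    print(p_context_given_target(c=3, t=7))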
SLIDE 26 Word Embeddings: Example architecture6
Word2vec leverages the Distributional Hypothesis ("contexts") to estimate word embeddings
The probability of a target word is estimated based on its context words
6http://bit.ly/2eIMHR7
SLIDE 27 Word Embeddings: Example VSM
Words Represented in Vector Space7
7http://projector.tensorflow.org/
SLIDE 28 Word Embeddings: Example Usage8
Representing words as vectors allows easy computation of similarity
Spain is to Madrid as Italy is to ?
Arithmetic operations can be performed on word embeddings (e.g. to find similar words)
8https://www.tensorflow.org/tutorials/word2vec
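A toy sketch of the analogy arithmetic with invented 2-d vectors: the offset Madrid - Spain approximates a "capital-of" direction, so adding it to Italy lands near Rome.

    import numpy as np

    # Hypothetical 2-d embeddings, chosen only to illustrate the offset idea.
    vec = {
        "Spain":  np.array([2.0, 1.0]),
        "Madrid": np.array([2.0, 3.0]),
        "Italy":  np.array([4.0, 1.1]),
        "Rome":   np.array([4.0, 3.1]),
        "Paris":  np.array([1.0, 3.2]),
    }

    query = vec["Madrid"] - vec["Spain"] + vec["Italy"]
    answer = min((w for w in vec if w not in {"Madrid", "Spain", "Italy"}),
                 key=lambda w: np.linalg.norm(vec[w] - query))
    print(answer)   # -> "Rome"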
SLIDE 29
Relation Embeddings: Related Work
TransE
State of the art: Translating Embeddings for Modeling Multi-relational Data "TransE" (Bordes et al., 2013)
◮ Learning objective: h + l ≈ t when (h, l, t) holds.
In other words, score(R_l(h, t)) = -dist(h + l, t), where dist is the L1-norm or L2-norm and h, l, t ∈ R^k
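A minimal numpy sketch of the TransE scoring function (the embedding values are made up for illustration): a triplet scores close to zero when the translated head h + l lies near the tail t.

    import numpy as np

    # Made-up embeddings in R^k (here k = 3); values are for illustration only.
    h = np.array([0.2, 0.5, -0.1])   # head entity
    l = np.array([0.3, -0.2, 0.4])   # relation
    t = np.array([0.5, 0.3, 0.3])    # tail entity

    def transe_score(h, l, t, norm=1):
        """score(R_l(h, t)) = -dist(h + l, t) with dist the L1 (or L2) norm."""
        return -np.linalg.norm(h + l - t, ord=norm)

    print(transe_score(h, l, t))   # 0.0: h + l lands exactly on t, a plausible fact
    print(transe_score(t, l, h))   # negative: a corrupted triplet scores worse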
SLIDE 30
Proposed approach: example scenario
Table 3: An example training dataset.
training examples - triplets
(e3, r1, e2)
(e1, r2, e3)
(e1, r3, e4)
(e2, r2, e5)
(e6, r2, e4)
(e3, r1, e6)
(e5, r3, e3)
We have E = (e1, e2, e3, e4, e5, e6) and R = (r1, r2, r3). Assuming the current training target is (e1, r2, e3) and the window size is 1, the target's context would be the triplets (e6, r2, e4) and (e2, r2, e5).
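A rough sketch of how this example's context could be computed, assuming (as the example suggests) that a triplet's context consists of nearby triplets sharing the same relation; the helper and window handling are hypothetical.

    # Training triplets from Table 3 (in order).
    triplets = [("e3", "r1", "e2"), ("e1", "r2", "e3"), ("e1", "r3", "e4"),
                ("e2", "r2", "e5"), ("e6", "r2", "e4"), ("e3", "r1", "e6"),
                ("e5", "r3", "e3")]

    def context(target, window=1):
        """Triplets sharing the target's relation (assumed notion of context)."""
        same_rel = [t for t in triplets if t[1] == target[1] and t != target]
        return same_rel[:2 * window]

    print(context(("e1", "r2", "e3")))
    # -> [('e2', 'r2', 'e5'), ('e6', 'r2', 'e4')]  -- the two context triplets above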
SLIDE 31 Proposed approach: model
With that being said, a triplet in our approach is treated just like a word in the word2vec model:

\frac{1}{R} \sum_{r=1}^{R} \log p(t^r_j \mid t^r_i)

where a triplet t is a compositional vector t^r = d(e_h, r_l, e_t), and

p(t^r_j \mid t^r_i) = \frac{\exp(t^r_i \cdot t^r_j)}{\sum_{k} \exp(t^r_i \cdot t^r_k)}
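A toy sketch of the probability above (the composition function d and all vectors are placeholders; summation is only one possible choice for d): a triplet vector is composed from its entity and relation vectors, and context-triplet probabilities follow a word2vec-style softmax over dot products.

    import numpy as np

    rng = np.random.default_rng(1)
    k = 4
    entity = {e: rng.normal(size=k) for e in ["e1", "e2", "e3", "e4", "e5", "e6"]}
    relation = {"r2": rng.normal(size=k)}

    def d(h, r, t):
        """Placeholder composition d(e_h, r_l, e_t); a sum is just one option."""
        return entity[h] + relation[r] + entity[t]

    triplets = [("e1", "r2", "e3"), ("e2", "r2", "e5"), ("e6", "r2", "e4")]

    def p(tj, ti):
        """p(t_j | t_i): softmax of dot products over the candidate triplets."""
        vi = d(*ti)
        scores = np.array([d(*t) @ vi for t in triplets])
        scores -= scores.max()                    # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[triplets.index(tj)]

    print(p(("e6", "r2", "e4"), ("e1", "r2", "e3")))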
SLIDE 32
Plan of Action
Rough estimate of the planned/implemented tasks throughout the entire PhD program:
SLIDE 33
Timeline
Date                      Work
Summer 2014               Web-technology tools
Fall 2014/Spring 2015     Semantic Web methods
Fall 2015                 Shift focus to future-proof solution
Spring 2016               ML/AI and data collection
Fall 2016                 Re-produce/test related work models
Spring/Summer 2017        Build and evaluate proposed model
Summer/Fall 2017          Write up the dissertation
SLIDE 34 References I
Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787-2795, 2013.
John R. Firth. A synopsis of linguistic theory, 1930-1955. 1957.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv.org, January 2013.
George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.
Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633, 1965.