Tokenization and Lemmatization: Feature Engineering for NLP in Python

SLIDE 1

Tokenization and Lemmatization

FEATURE ENGINEERING FOR NLP IN PYTHON

Rounak Banik

Data Scientist

SLIDE 2

Text sources

  • News articles
  • Tweets
  • Comments

SLIDE 3

Making text machine friendly

  • Dogs, dog
  • reduction, REDUCING, Reduce
  • don't, do not
  • won't, will not

SLIDE 4

Text preprocessing techniques

  • Converting words into lowercase
  • Removing leading and trailing whitespaces
  • Removing punctuation
  • Removing stopwords
  • Expanding contractions
  • Removing special characters (numbers, emojis, etc.)
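
Several of these steps can be sketched with the standard library alone. The example below is a minimal illustration, not the course's implementation: the `basic_clean` function and the tiny `CONTRACTIONS` table are hypothetical names introduced here, and stopword and emoji removal are left out.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table for illustration only.
CONTRACTIONS = {"don't": "do not", "won't": "will not", "i'm": "i am"}

def basic_clean(text):
    """Lowercase, trim, expand a few contractions, drop punctuation and digits."""
    text = text.strip().lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation and digits in one pass.
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(basic_clean("  Don't stop: Top 5 songs!!  "))  # do not stop top songs
```

Contractions are expanded before punctuation is removed; in the opposite order the apostrophe would already be gone and the table lookup would never match.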

SLIDE 5

Tokenization

"I have a dog. His name is Hachi."

Tokens:

["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]

"Don't do this."

Tokens:

["Do", "n't", "do", "this", "."]
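
To make the splitting rule concrete, here is a rough regex-based sketch that reproduces the two examples on this slide. It is an assumption-laden toy, not how spaCy tokenizes: `simple_tokenize` and `TOKEN_RE` are hypothetical names, and other contractions such as "I'm" are not handled the way spaCy handles them.

```python
import re

# "n't" is split off as its own token; every other punctuation mark
# becomes a single-character token; whitespace is discarded.
TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, "n't", and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("I have a dog. His name is Hachi."))
print(simple_tokenize("Don't do this."))  # ['Do', "n't", 'do', 'this', '.']
```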

SLIDE 6

Tokenization using spaCy

    import spacy

    # Load the en_core_web_sm model
    nlp = spacy.load('en_core_web_sm')

    # Initialize string
    string = "Hello! I don't know what I'm doing here."

    # Create a Doc object
    doc = nlp(string)

    # Generate list of tokens
    tokens = [token.text for token in doc]
    print(tokens)

Output:

    ['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']

SLIDE 7

Lemmatization

Convert a word into its base form

  • reducing, reduces, reduced, reduction → reduce
  • am, are, is → be
  • n't → not
  • 've → have
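
The mapping on this slide can be pictured as a lookup from surface form to base form. The sketch below hard-codes only the slide's examples in a hypothetical `LEMMA_TABLE`; a real lemmatizer such as spaCy's relies on much larger resources and on part-of-speech information.

```python
# Hypothetical lookup table built from the examples on this slide.
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce",
    "reduced": "reduce", "reduction": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}

def lookup_lemma(token):
    """Return the base form if known, else the lowercased token unchanged."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)

print([lookup_lemma(t) for t in ["Do", "n't", "REDUCING", "is"]])
# ['do', 'not', 'reduce', 'be']
```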

SLIDE 8

Lemmatization using spaCy

    import spacy

    # Load the en_core_web_sm model
    nlp = spacy.load('en_core_web_sm')

    # Initialize string
    string = "Hello! I don't know what I'm doing here."

    # Create a Doc object
    doc = nlp(string)

    # Generate list of lemmas
    lemmas = [token.lemma_ for token in doc]
    print(lemmas)

Output (spaCy 2.x, which lemmatizes pronouns to '-PRON-'; spaCy 3 returns the pronoun itself):

    ['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']

SLIDE 9

Let's practice!

SLIDE 10

Text cleaning

SLIDE 11

Text cleaning techniques

Things to remove:

  • Unnecessary whitespace and escape sequences
  • Punctuation
  • Special characters (numbers, emojis, etc.)
  • Stopwords

SLIDE 12

isalpha()

    "Dog".isalpha()      # True
    "3dogs".isalpha()    # False
    "12347".isalpha()    # False
    "!".isalpha()        # False
    "?".isalpha()        # False

SLIDE 13

A word of caution

isalpha() also returns False for some tokens that are worth keeping:

  • Abbreviations: U.S.A, U.K, etc.
  • Proper nouns: word2vec and xto10x.

Write your own custom function (using regex) for the more nuanced cases.
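
One possible shape for such a custom function is sketched below. The rule (letters and digits, optionally separated by single periods, with at least one letter) is an assumption chosen to cover the examples on this slide, and `keep_token` and `TOKEN_SHAPE` are hypothetical names. Note that, unlike isalpha(), it would also keep a token such as "3dogs".

```python
import re

# Letters/digits, optionally separated by single periods (U.S.A, word2vec).
TOKEN_SHAPE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*\.?")

def keep_token(token):
    """Keep tokens that fit the shape above and contain at least one letter."""
    return bool(TOKEN_SHAPE.fullmatch(token)) and any(ch.isalpha() for ch in token)

tokens = ["U.S.A", "word2vec", "xto10x", "Dog", "12347", "!", "?"]
print([t for t in tokens if keep_token(t)])
# ['U.S.A', 'word2vec', 'xto10x', 'Dog']
```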

SLIDE 14

Removing non-alphabetic characters

    import spacy

    string = """
    OMG!!!! This is like the best thing ever \t\n.
    Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
    """

    # Generate list of lemmas
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(string)
    lemmas = [token.lemma_ for token in doc]

SLIDE 15

Removing non-alphabetic characters

    ...
    ...
    # Remove tokens that are not alphabetic
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() or lemma == '-PRON-']

    # Print string after text cleaning
    print(' '.join(a_lemmas))

Output:

    'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'

SLIDE 16

Stopwords

Words that occur extremely commonly

  • E.g. articles, be verbs, pronouns, etc.

SLIDE 17

Removing stopwords using spaCy

    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    # Get list of stopwords
    stopwords = STOP_WORDS

    string = """
    OMG!!!! This is like the best thing ever \t\n.
    Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
    """

SLIDE 18

Removing stopwords using spaCy

    ...
    ...
    # Remove stopwords and non-alphabetic tokens
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]

    # Print string after text cleaning
    print(' '.join(a_lemmas))

Output:

    'omg like good thing wow amazing song hooked definitely'
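
The same filter can be tried without loading a model. The sketch below wraps it in a small function and runs it on a hand-written lemma list; the `STOPWORDS` set is a hypothetical stand-in for spaCy's much larger list, and the input list is made up for illustration.

```python
# Hypothetical, tiny stopword set; spaCy's STOP_WORDS is far larger.
STOPWORDS = {"this", "be", "the", "ever", "such", "an", "top"}

def clean_lemmas(lemmas, stopwords=STOPWORDS):
    """Keep lemmas that are purely alphabetic and not stopwords."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

lemmas = ["omg", "!", "this", "be", "like", "the", "good", "thing", "ever",
          "wow", ",", "such", "an", "amazing", "song", "5", "definitely"]
print(" ".join(clean_lemmas(lemmas)))
# omg like good thing wow amazing song definitely
```

Passing the stopword set as a parameter makes it easy to swap in a domain-specific list.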

SLIDE 19

Other text preprocessing techniques

  • Removing HTML/XML tags
  • Replacing accented characters (such as é)
  • Correcting spelling errors
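
The first two techniques can be approximated with the standard library. This is a rough sketch under simplifying assumptions: the regex treats anything between angle brackets as a tag (a real HTML parser is more robust), and `strip_tags` and `strip_accents` are hypothetical helper names. Spelling correction is not covered.

```python
import html
import re
import unicodedata

def strip_tags(text):
    """Drop anything that looks like an HTML/XML tag, then decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(strip_tags("<p>Caf&eacute; r\u00e9sum\u00e9</p>")))
# Cafe resume
```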

SLIDE 20

A word of caution

Always use only those text preprocessing techniques that are relevant to your application.

SLIDE 21

Let's practice!

SLIDE 22

Part-of-speech tagging

SLIDE 23

Applications

  • Word-sense disambiguation
      "The bear is a majestic animal"
      "Please bear with me"
  • Sentiment analysis
  • Question answering
  • Fake news and opinion spam detection

SLIDE 24

POS tagging

Assigning every word its corresponding part of speech.

"Jane is an amazing guitarist."

POS Tagging:

  • Jane → proper noun
  • is → verb
  • an → determiner
  • amazing → adjective
  • guitarist → noun

SLIDE 25

POS tagging using spaCy

    import spacy

    # Load the en_core_web_sm model
    nlp = spacy.load('en_core_web_sm')

    # Initialize string
    string = "Jane is an amazing guitarist"

    # Create a Doc object
    doc = nlp(string)

SLIDE 26

POS tagging using spaCy

    ...
    ...
    # Generate list of tokens and pos tags
    pos = [(token.text, token.pos_) for token in doc]
    print(pos)

Output (newer spaCy models tag 'is' as AUX rather than VERB):

    [('Jane', 'PROPN'), ('is', 'VERB'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]
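
Once the tags are available, one simple way to turn them into features is to count how often each tag occurs. The sketch below assumes a tag list with the shape printed on this slide and copies that output as its input; `pos_counts` is a hypothetical helper name.

```python
from collections import Counter

# Tag list copied from the output printed on the slide.
pos = [('Jane', 'PROPN'), ('is', 'VERB'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('guitarist', 'NOUN')]

def pos_counts(tagged):
    """Count occurrences of each part-of-speech tag."""
    return Counter(tag for _, tag in tagged)

counts = pos_counts(pos)
# A Counter returns 0 for tags that never occur, such as ADV here.
print(counts['PROPN'], counts['NOUN'], counts['ADV'])  # 1 1 0
```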

SLIDE 27

POS annotations in spaCy

  • PROPN → proper noun
  • DET → determiner

spaCy annotations at https://spacy.io/api/annotation

SLIDE 28

Let's practice!

SLIDE 29

Named entity recognition

SLIDE 30

Applications

  • Efficient search algorithms
  • Question answering
  • News article classification
  • Customer service

SLIDE 31

Named entity recognition

Identifying and classifying named entities into predefined categories. Categories include person, organization, country, etc.

"John Doe is a software engineer working at Google. He lives in France."

Named Entities

  • John Doe → person
  • Google → organization
  • France → country (geopolitical entity)

SLIDE 32

NER using spaCy

    import spacy

    string = "John Doe is a software engineer working at Google. He lives in France."

    # Load model and create Doc object
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(string)

    # Generate named entities
    ne = [(ent.text, ent.label_) for ent in doc.ents]
    print(ne)

Output:

    [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
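
For feature engineering it is often handier to have the entities grouped by category. The sketch below assumes an entity list with the shape printed on this slide and copies that output as its input; `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# Entity list copied from the output printed on the slide.
ne = [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]

def group_by_label(entities):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

print(group_by_label(ne))
# {'PERSON': ['John Doe'], 'ORG': ['Google'], 'GPE': ['France']}
```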

SLIDE 33

NER annotations in spaCy

  • More than 15 categories of named entities
  • NER annotations at https://spacy.io/api/annotation#named-entities

SLIDE 34

A word of caution

  • Not perfect
  • Performance dependent on training and test data
  • Train models with specialized data for nuanced cases
  • Language specific

SLIDE 35

Let's practice!