Strings Joan Boone jpboone@email.unc.edu Summer 2020 Slide 1 - - PowerPoint PPT Presentation




Slide 1

Strings

Joan Boone

jpboone@email.unc.edu

Summer 2020

INLS 560

Programming for Information Professionals


Slide 2

Topics

Part 1

  • Basic string operations

Part 2

  • Modify, search, replace, and splitting strings

Part 3

  • Text analysis

Slide 3

Strings are text

Most applications work with text in some format

  • Google Docs, word processors
  • Email
  • Social media
  • Search engines
  • Databases
  • Data and text mining analyze text by deriving patterns and trends

Some familiar Python examples

  • attendees = input('Enter number of people attending: ')
  • print('Bagel cost: ', bagel_cost)
  • steps_file = open('steps.txt', 'r')

Slide 4

Basic String Operations: Iteration

Very similar to list and dictionary iteration: use a for loop

# Count the number of times a letter occurs in a string
def main():
    # Define a counter
    count = 0
    # Get a string from the user.
    input_string = input('Enter a sentence: ')
    # Count occurrences of letter E or e
    for letter in input_string:
        if letter == 'E' or letter == 'e':
            count = count + 1
    print('The letter E appears', count, 'times.')

main()

letter_counter.py


Slide 5

Basic String Operations: Indexing

text = 'Innovation is serendipity'
print(text[3], text[12], text[24])
o s y

Index positions marked in 'Innovation is serendipity': 0 ... 11 ... 14 ... 24

An IndexError exception occurs if an index is out of range for a string.

Common error: looping beyond the end of a string

index = 0
while index < 30:
    print(text[index])
    index = index + 1

How to avoid:

index = 0
while index < len(text):
    print(text[index])
    index = index + 1

string_indexing.py


Slide 6

Basic String Operations: Concatenation

Concatenation is a common operation where one string is appended to the end of another string.

first_name = 'Monty'
last_name = 'Python'
full_name = first_name + last_name
print(full_name)
MontyPython

full_name = first_name + ' ' + last_name
print(full_name)
Monty Python

Rainfall Summary example: use of concatenation for the input prompt

for month in range(1, 13):
    inches = float(input('Enter rainfall for month ' + str(month) + ': '))
    total = total + inches

Enter rainfall for month 1: 5
Enter rainfall for month 2: 10
...


Slide 7

Strings are Immutable

(so are integers and floats)

  • In Python, strings cannot be modified once they are created. Some operations appear to modify a string, but they do not.

Source: Starting Out with Python by Tony Gaddis

  • Takeaway: you cannot use an expression in the form string[index] on the left side of an assignment operator, i.e., you cannot modify a character in a string using an index.

text = 'Innovation is serendipity'
text[14] = 'S'
TypeError: 'str' object does not support item assignment

Correct way to modify a string: assign a new string to the variable

text = 'Innovation is Serendipity'
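As a minimal sketch, here are two ways to produce the capitalized version without item assignment, assuming the same text value used on this slide. Both build a new string and leave the original untouched. The names capitalized and replaced are introduced here only for illustration.

```python
text = 'Innovation is serendipity'

# Option 1: build a new string from the slices on either side of index 14
capitalized = text[:14] + 'S' + text[15:]
print(capitalized)

# Option 2: replace() also returns a new string rather than changing text
replaced = text.replace('serendipity', 'Serendipity')
print(replaced)

# The original string is unchanged
print(text)
```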


Slide 8

Basic String Operations: Slicing

Very similar to list slicing

String slicing: string[start : end]

String slices select a subset of characters in a string. A string slice is also called a substring.

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
weekdays = days[1:6]

python_author = 'Guido van Rossum'
first_name = python_author[:5]
last_name = python_author[6:]
print(first_name, last_name)
Guido van Rossum
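A short sketch of how the start and end positions behave, reusing the python_author string from this slide: a slice includes the character at start and stops just before end. The names middle and last_six are illustrative only.

```python
python_author = 'Guido van Rossum'

# Characters at indexes 6, 7, and 8 (index 9 is not included)
middle = python_author[6:9]
print(middle)

# A negative start counts back from the end of the string
last_six = python_author[-6:]
print(last_six)

# Omitting both positions selects the whole string
print(python_author[:] == python_author)
```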


Slide 9

Testing Strings with in and not in

The in and not in operators return True or False

opening_text = 'It was a dark and stormy night'
if 'stormy' in opening_text:
    print('The string "stormy" was found')
else:
    print('The string "stormy" was not found')

Other String Operations using Methods

  • Testing for values of strings
  • Performing various modifications
  • Searching for substrings and replacing sequences of characters

Source: Starting Out with Python by Tony Gaddis


Slide 10

Methods for Testing Values of Strings

Each method returns True or False, and assumes the string contains at least one character

Method        Description
isalnum()     Returns True if the string contains only alphabetic letters or digits
isalpha()     Returns True if the string contains only alphabetic letters
islower()     Returns True if all of the alphabetic letters in the string are lowercase
isupper()     Returns True if all of the alphabetic letters in the string are uppercase
isnumeric()   Returns True if all characters are numeric, e.g., the digits 0-9
isspace()     Returns True if the string contains only whitespace characters, e.g., newlines (\n) and tabs (\t)

Python documentation for String methods
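To see how these methods behave, the following sketch calls each one on a few sample strings. The samples list is made up for illustration. Note that digits are ignored by islower() and isupper(), and that a space makes isalnum() and isalpha() return False.

```python
samples = ['abc123', 'abc', 'ABC', '123', ' \t\n', 'Hello World']

for s in samples:
    # repr() makes whitespace characters visible in the output
    print(repr(s), s.isalnum(), s.isalpha(), s.islower(),
          s.isupper(), s.isnumeric(), s.isspace())
```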


Slide 11

Testing Values of Strings for Input Validation

To validate an input string, often there are several requirements that must be met for validation to be successful. Here's a general algorithm that uses String methods for validation.

  • Use boolean variables to specify whether a validation requirement has been met (is it True or False?), e.g., is the string numeric, at least 8 characters long, etc.
  • Initially, set all of these variables to False, i.e., assume the validation will fail. If a validation requirement is met, then set its variable to True.
  • Loop through each character of the string, and determine if the requirements are met.
  • After evaluating the string, check to see if all of the boolean variables have been set to True
      – If all are true, then the input string is valid
      – If one or more are false, the input string is invalid


Slide 12

Example: Password Validation

Prompts for a password, and validates it according to these rules:

  • at least 7 characters in length
  • contains at least one uppercase letter
  • contains at least one lowercase letter
  • contains at least one digit

validate_password.py

Source: Starting Out with Python by Tony Gaddis

def valid_password(password):
    # Set the Boolean variables to false.
    correct_length = False
    has_uppercase = False
    has_lowercase = False
    has_digit = False

    # Validate length first
    if len(password) >= 7:
        correct_length = True

    # Test each character
    for character in password:
        if character.isupper():
            has_uppercase = True
        if character.islower():
            has_lowercase = True
        if character.isdigit():
            has_digit = True

    # Are requirements met?
    if correct_length and has_uppercase and has_lowercase and has_digit:
        is_valid = True
    else:
        is_valid = False

    # Return the is_valid variable.
    return is_valid


Slide 13

EXERCISE: Password Validation

Add another validation rule: the first character must be alphabetic.
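One possible solution sketch for this exercise, not necessarily the one intended by the course: it keeps the rules of the password example and adds a starts_with_letter check. The variable name starts_with_letter and the compact boolean expressions are choices made here for illustration.

```python
def valid_password(password):
    # Length rule from the original example
    correct_length = len(password) >= 7

    # New rule: the first character must be alphabetic.
    # The length check avoids an IndexError on an empty string.
    starts_with_letter = len(password) > 0 and password[0].isalpha()

    has_uppercase = False
    has_lowercase = False
    has_digit = False
    for character in password:
        if character.isupper():
            has_uppercase = True
        if character.islower():
            has_lowercase = True
        if character.isdigit():
            has_digit = True

    # Valid only if every requirement is met
    return (correct_length and starts_with_letter and
            has_uppercase and has_lowercase and has_digit)


print(valid_password('Abcdef1'))
print(valid_password('1Abcdef'))
```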



Slide 14

Topics

Part 1

  • Basic string operations

Part 2

  • Modify, search, replace, and splitting strings

Part 3

  • Text analysis

Slide 15

Methods to Modify Strings

Method         Description
lower()        Returns a copy of string with all alphabetic letters converted to lowercase
upper()        Returns a copy of string with all alphabetic letters converted to uppercase
lstrip()       Returns a copy of string with all leading whitespace characters removed
lstrip(char)   Returns a copy of string with all instances of char that appear at the beginning of the string removed
rstrip()       Returns a copy of string with all trailing whitespace characters removed
rstrip(char)   Returns a copy of string with all instances of char that appear at the end of the string removed
strip()        Returns a copy of string with all leading and trailing whitespace characters removed
strip(char)    Returns a copy of string with all instances of char that appear at the beginning and the end of the string removed

Python documentation for String methods
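A brief sketch of the strip family in action. The strings raw and padded are invented for illustration. When an argument is passed, Python treats it as a set of characters to remove from the ends, which matters when the argument is longer than one character.

```python
raw = '   hello world  \n'
print(repr(raw.strip()))    # leading and trailing whitespace removed
print(repr(raw.lstrip()))   # only leading whitespace removed
print(repr(raw.rstrip()))   # only trailing whitespace removed

# With an argument, every leading/trailing character found in the
# argument is removed, so the argument acts as a set of characters.
padded = 'xxhixx'
print(padded.strip('x'))
print('xyxhiyx'.strip('xy'))

# The original string is unchanged because strings are immutable.
print(repr(raw))
```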


Slide 16

Example: Case-insensitive Comparison

# This program makes a case-insensitive comparison
# of a user's response to a prompt
again = 'y'
while again.lower() == 'y':
    print('Hello')
    print('Do you want to see that again?')
    again = input('y = yes, anything else = no: ')

# This program makes a case-insensitive comparison
# of a user's response to a prompt
again = 'y'
while again.upper() == 'Y':
    print('Goodbye')
    print('Do you want to see that again?')
    again = input('y = yes, anything else = no: ')

Source: Starting Out with Python by Tony Gaddis


Slide 17

Methods to Search and Replace Strings

Method                  Description
find(substring)         The substring argument is a string. The method returns the lowest index in the string where substring is found. If substring is not found, the method returns -1.
replace(old, new)       The old and new arguments are both strings. The method returns a copy of the string with all instances of old replaced by new.
startswith(substring)   The substring argument is a string. The method returns True if the string starts with substring.
endswith(substring)     The substring argument is a string. The method returns True if the string ends with substring.

Python documentation for String methods
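The following sketch exercises each of these methods on one sentence. The sentence and the variable names position, missing, and calm are illustrative assumptions, not part of the original slides.

```python
sentence = 'It was a dark and stormy night'

# find() returns the index of the first match, or -1 if there is none
position = sentence.find('dark')
missing = sentence.find('sunny')
print(position, missing)

# replace() returns a new string; sentence itself is not changed
calm = sentence.replace('stormy', 'quiet')
print(calm)

# startswith() and endswith() return True or False
print(sentence.startswith('It'), sentence.endswith('day'))
```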


Slide 18

Splitting a String to create a List

  • The split method returns a list containing the words in the string
  • By default, the method uses whitespace (spaces, tabs, newlines) as the separator

def main():
    # Create a string with multiple words.
    my_string = 'One two three four'
    # Split the string.
    word_list = my_string.split()
    print(word_list)

main()
['One', 'two', 'three', 'four']

  • To specify a different separator, pass it as an argument:

date_string = '10/08/2019'
date_list = date_string.split('/')
print(date_list)
['10', '08', '2019']

Source: Starting Out with Python by Tony Gaddis


Slide 19

Example: Parsing email addresses

  • Suppose you have a list or file of email addresses and you want to extract the domain part of each address
  • One approach is to use string slicing:

email_addr = 'newhire@startup.com'
local_part = email_addr[0:7]
domain_part = email_addr[8:]
print(domain_part)

  • Is there a better approach?

email_addresses.py
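One possible answer to the question above is to split on the '@' character, which works for any address length, whereas the hard-coded slice positions only fit this one address. This is a sketch under the assumption that a valid address contains exactly one '@'. The function name get_domain and the second sample address are illustrative.

```python
def get_domain(email_addr):
    # split('@') returns the parts on either side of the separator;
    # a simple address of the form local@domain yields two parts.
    parts = email_addr.split('@')
    if len(parts) != 2:
        return None   # assumption: treat anything else as invalid
    return parts[1]


addresses = ['newhire@startup.com', 'jpboone@email.unc.edu']
for address in addresses:
    print(get_domain(address))
```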


Slide 20

Example: Phone Number Translator

Exercise 5, Chapter 8

Many companies use phone numbers like 555-GET-FOOD so the number is easier to remember. On a standard phone, the alphabetic letters are mapped to numbers. How would you write a program that prompts the user for a phone number in XXX-XXX-XXXX format and translates any alphabetic characters to their numeric equivalents?

Enter the phone number in the format XXX-XXX-XXXX: 555-GET-FOOD
The phone number is 555-438-3663

phone_number_translator.py
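A minimal sketch of one way to approach this exercise, not the contents of phone_number_translator.py. It assumes the standard keypad layout (ABC = 2 through WXYZ = 9) and leaves the input prompt out so that the translation logic can be run on its own. The names KEYPAD and translate are hypothetical.

```python
# Standard telephone keypad: each group of letters maps to one digit
KEYPAD = {
    'ABC': '2', 'DEF': '3', 'GHI': '4', 'JKL': '5',
    'MNO': '6', 'PQRS': '7', 'TUV': '8', 'WXYZ': '9',
}


def translate(phone_number):
    translated = ''
    # upper() makes the lookup case-insensitive
    for character in phone_number.upper():
        digit = character          # digits and hyphens pass through
        for letters in KEYPAD:
            if character in letters:
                digit = KEYPAD[letters]
        translated = translated + digit
    return translated


print(translate('555-GET-FOOD'))
```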


Slide 21

Topics

Part 1

  • Basic string operations

Part 2

  • Modify, search, replace, and splitting strings

Part 3

  • Text analysis

Slide 22

Text Analysis Overview

Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank, and Mark A. Hall

  • Text mining is about looking for patterns in text
  • Information extracted should be potentially useful
  • Summarizing salient features from a large body of text

Text analysis involves a set of techniques that assist in processing and analyzing text to identify useful, high-quality information. These include
  • Tokenization
  • Normalization, stopword removal, and stemming
  • Word frequencies
  • Document parsing to identify content and structure of text
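Several of the techniques listed above can be sketched with the string methods covered earlier. The example below tokenizes with split(), normalizes by stripping punctuation and lowercasing, removes stopwords, and counts word frequencies in a dictionary. The STOPWORDS set and the sample sentence are tiny, made-up illustrations, and stemming is not shown.

```python
import string

STOPWORDS = {'a', 'and', 'the', 'was', 'it'}   # tiny illustrative list


def word_frequencies(text):
    counts = {}
    for token in text.split():                          # tokenization
        word = token.strip(string.punctuation).lower()  # normalization
        if word == '' or word in STOPWORDS:             # stopword removal
            continue
        counts[word] = counts.get(word, 0) + 1          # word frequencies
    return counts


sample = 'It was a dark and stormy night. The night was dark.'
print(word_frequencies(sample))
```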

Slide 23

Web Pages

HTML defines the structure of a web page and describes the content.

<!DOCTYPE html>
<html>
  <head>
    <title>Title goes here</title>
  </head>
  <body>
    <h1>Main heading</h1>
    <p>Page content goes here</p>
    <h2>Subheading</h2>
    <p>More page content goes here</p>
  </body>
</html>

HTML for a minimal web page


Slide 24

Text Analysis of Web Page Content

  • Finding key words and phrases is a common task when indexing or generating metadata for a document.
  • Analyzing the text of a web page is somewhat simplified because it has structured content. For example, heading tags (<h1>, <h2>, <h3>, etc.) often describe important topics in a web page.
  • The Code4Lib Journal article, Medici 2: A Scalable Content Management System for Cultural Heritage Datasets, is an example of a well-structured web page.
  • medici2_article.txt is a slightly-scrubbed text version of this article.


Slide 25

Exercise: Find the text for every <h2> heading in the article

def main():
    filename = 'medici2_article.txt'
    try:
        article_file = open(filename, 'r', encoding='utf8')
        for line in article_file:
            if '<h2>' in line:
                print(line)
        article_file.close()
    except FileNotFoundError as err:
        ...
    except OSError as err:
        ...
    except ValueError as err:
        ...
    except Exception as err:
        ...

find_heading_text.py

Example heading text: <h2>Current Issue</h2>

Expected output:

Current Issue Previous Issues About For Authors Introduction Architecture Metadata Images 3D Models RTI Extracting 3D Models From RTI Future Directions Conclusion Acknowledgments Notes About the Authors
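A minimal sketch of one way to pull the heading text out of a single line, using find() and slicing from the earlier slides. It assumes each heading sits on one line and that the opening tag is exactly <h2> with no attributes. The function name heading_text is hypothetical.

```python
def heading_text(line):
    # Locate the opening and closing tags; find() returns -1 if absent
    start = line.find('<h2>')
    end = line.find('</h2>')
    if start == -1 or end == -1:
        return None
    # Skip past the characters of '<h2>' and stop before '</h2>'
    return line[start + len('<h2>'):end].strip()


print(heading_text('<h2>Current Issue</h2>\n'))
```

Inside the file loop, a call such as heading_text(line) could replace print(line), printing the result only when it is not None.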


Slide 26

Accessing Web Resources

The urllib module is a Python library for accessing web resources with HTTP and HTTPS URL addresses. The module provides functions for

  • Opening a URL – similar to opening a file
  • Reading the contents
  • Converting the contents from the returned binary format to a string of text

To use the urllib module you must:

  • Import the appropriate modules from urllib
  • Provide handlers for HTTPError and URLError exceptions

Python documentation


Slide 27

Using urllib to read contents of a web page

import urllib.request
import urllib.error
from urllib.error import URLError, HTTPError

def main():
    try:
        doc_url = 'https://journal.code4lib.org/articles/12317'
        # Send HTTP request to open the document URL
        response = urllib.request.urlopen(doc_url)
        # Read HTTP response
        response_in_bytes = response.read()
        # Decode (convert) response from binary to text (UTF8)
        html = response_in_bytes.decode('utf8')
        print(html)
    except HTTPError as err:
        print('Error: Server could not fulfill the request.')
        print(err)
    except URLError as err:
        print('Error: Failed to reach a server.')
        print(err)
    except Exception as err:
        print(err)

main()

find_heading_text_url.py


Slide 28

Exercise: How to find the text for every <h2> heading in the big 'html' string?


Introduction Architecture Metadata Images 3D Models RTI Extracting 3D Models From RTI Future Directions Conclusion Acknowledgments Notes About the Authors Current Issue Previous Issues For Authors

Expected output
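One possible approach, sketched here rather than taken from the course solution: because the whole page is one string, find() can be called repeatedly with a start position to locate each <h2> and </h2> pair. The optional second argument to find() is standard Python but is not shown in the method table earlier. The function name find_h2_headings and the sample_html string are illustrative, and the sketch assumes the opening tags carry no attributes.

```python
def find_h2_headings(html):
    headings = []
    position = 0
    while True:
        # Search for the next opening tag, starting at position
        start = html.find('<h2>', position)
        if start == -1:
            break
        end = html.find('</h2>', start)
        if end == -1:
            break
        headings.append(html[start + len('<h2>'):end].strip())
        # Continue searching after this closing tag
        position = end + len('</h2>')
    return headings


sample_html = ('<h1>Main heading</h1><h2>Introduction</h2>'
               '<p>Text</p><h2>Conclusion</h2>')
print(find_h2_headings(sample_html))
```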


Slide 29

SSL Certificate Problem

  • Secure Sockets Layer (SSL) is a security standard that enables encrypted communication between a client (or web browser) and a web server.
  • Typically, a client verifies a server's certificate, but a server may also require a signed certificate from a client.
  • When you run this program, find_heading_text_url.py, if you see either of these exceptions

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)>

this means your Python installation on a Mac platform does not have the root certificates it needs to verify the certificate presented by the web server.


Slide 30

SSL Certificate Problem: Possible Fix

There is a Stack Overflow question that addresses this issue; a fix appears in the last entry.