Programming in Python Lecture 3: Patterns and Functions Michael - - PowerPoint PPT Presentation




SLIDE 1

Programming in Python

Michael Schroeder Sven Schreiber

sven.schreiber@tu-dresden.de

Updates by Andreas Henschel

Lecture 3: Patterns and Functions

Slides derived from Ian Holmes, Department of Statistics, University of Oxford

SLIDE 2

Overview

  • Patterns (Regular Expressions)
  • Functions and Lambda Functions
SLIDE 3

Patterns

SLIDE 4

What is a pattern?

https://commons.wikimedia.org/wiki/Tree

SLIDE 5

Pattern-matching

  • A logical test that asks whether a string contains a pattern
  • E.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

name = 'YBR007C'
dna = 'TAATAAAAAACGCGTTGTCG'       # 20 bases upstream of the yeast gene YBR007C
if 'ACGCGT' in dna:                # "in" is the membership operator; 'ACGCGT' is the pattern for the MCB binding site
    print('%s has MCB!' % name)

Output:

YBR007C has MCB!

SLIDE 6

Regular expressions

  • Python provides a pattern-matching engine
  • Patterns are called regular expressions
  • They are extremely powerful
  • Often called "regex" for short
  • Available in the module re
  • We already defined a simple pattern: ACGCGT
  • What if we don't care about the 3rd position? => ACGCGT, ACCCGT, ACACGT, and ACTCGT should all match

SLIDE 7

Motivation: N-glycosylation motif

  • Common post-translational modification
  • Attachment of a sugar group
  • Occurs at asparagine residues with the consensus sequence N-X1-X2, where
    – X1 can be anything (but proline inhibits)
    – X2 is serine or threonine
  • Can we detect potential N-glycosylation sites in a protein sequence?

SLIDE 8

Building regexes I: Character Classes

  • Square brackets define a set of alternative characters (character class)
  • E.g. [abc] -> matches a, b, or c
  • Use - to match a range of characters: [A-Z]
  • Negation: [^X] matches any character but X
  • [^A-Z] matches any character but A-Z
  • . matches any character (except a newline, by default)
  • [a] is equivalent to a
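
The character-class rules above can be tried out directly. This is a minimal sketch: the helper name `matches` and the test strings (including the invalid "ACXCGT") are illustrative assumptions, not part of the slides.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# [abc] matches exactly one of a, b, or c
set_result = [matches("[abc]", ch) for ch in "abcd"]

# [A-Z] matches one uppercase letter; [^A-Z] matches one character that is not
range_result = (matches("[A-Z]", "Q"), matches("[A-Z]", "q"))
negated_result = (matches("[^A-Z]", "Q"), matches("[^A-Z]", "q"))

# AC[ACGT]CGT accepts any base at the 3rd position of the MCB site
candidates = ["ACGCGT", "ACCCGT", "ACACGT", "ACTCGT", "ACXCGT"]
mcb_variants = [s for s in candidates if matches("AC[ACGT]CGT", s)]

# . matches any single character except a newline (default behaviour)
dot_result = (matches(".", "x"), matches(".", "\n"))

print(set_result)      # [True, True, True, False]
print(range_result)    # (True, False)
print(negated_result)  # (False, True)
print(mcb_variants)    # ['ACGCGT', 'ACCCGT', 'ACACGT', 'ACTCGT']
print(dot_result)      # (True, False)
```

`re.fullmatch` is used here so that each test string has to match completely; `re.match` would only anchor the start.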
SLIDE 9

Building regexes II: Abbreviations

  • \d matches any decimal digit [0-9]
  • \D matches any non-digit [^0-9]
  • Equivalent syntax for:
    – whitespace (\s and \S)
    – alphanumeric characters and the underscore (\w and \W)
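
A short sketch of the abbreviations in use. The sample line is made up for illustration (loosely modelled on a search-report line); only the `re` calls are standard.

```python
import re

# Hypothetical sample text containing a tab, digits, and an underscore
line = "Raw Bit Score: 165\tE_value: 4"

digits = re.findall(r"\d", line)            # every single decimal digit
non_space_runs = re.findall(r"\S+", line)   # runs of non-whitespace characters
words = re.findall(r"\w+", line)            # runs of letters, digits, and underscores

print(digits)          # ['1', '6', '5', '4']
print(non_space_runs)  # ['Raw', 'Bit', 'Score:', '165', 'E_value:', '4']
print(words)           # ['Raw', 'Bit', 'Score', '165', 'E_value', '4']
```

Note that `\w` keeps "E_value" together because the underscore counts as a word character, while the colon does not.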

SLIDE 10

Building regexes III: Quantifiers

  • * matches zero or more times
    – E.g. ca*t matches ct, cat, caat, caaat, caaaat, ...
  • + matches one or more times
    – E.g. ca+t matches cat, caat, caaat, caaaat, ...
  • ? matches zero times or once
    – E.g. bio-?info matches bioinfo and bio-info
  • {n} matches exactly n times
  • {n,m} matches from n (min) to m (max) times
    – E.g. ab{1,3}c will match abc, abbc, abbbc
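
The quantifier examples above can be checked with a few lines. This is a sketch only: the helper `full` and the extra non-matching candidates are assumptions added for contrast.

```python
import re

def full(pattern, candidates):
    """Keep the candidates that match the pattern completely (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = full("ca*t", ["ct", "cat", "caaat", "cbt"])                    # zero or more a
plus = full("ca+t", ["ct", "cat", "caaat"])                           # one or more a
optional = full("bio-?info", ["bioinfo", "bio-info", "bio--info"])    # zero or one hyphen
exact = full("a{3}", ["aa", "aaa", "aaaa"])                           # exactly three a
bounded = full("ab{1,3}c", ["ac", "abc", "abbc", "abbbc", "abbbbc"])  # one to three b

print(star)      # ['ct', 'cat', 'caaat']
print(plus)      # ['cat', 'caaat']
print(optional)  # ['bioinfo', 'bio-info']
print(exact)     # ['aaa']
print(bounded)   # ['abc', 'abbc', 'abbbc']
```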

SLIDE 11

Using Regular Expressions

  • Compile a regular expression object (pattern) using re.compile
  • pattern has a number of methods
    – match (on success returns a Match object, otherwise None; matches only at the beginning!)
    – search (scans through the whole string looking for a match)
    – findall (returns a list of all matches)

>>> import re
>>> pattern = re.compile('[ACGT]')
>>> if pattern.match("A"): print("A matches")    # successful match
...
A matches
>>> if pattern.match("a"): print("a matches")    # unsuccessful, returns None; case-sensitive by default
...
>>>

Without compiling (shorter, but less performant when a pattern is reused often):

>>> import re
>>> if re.match('[ACGT]', "A"): print("Matched")
...
Matched
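
To make the difference between match, search, and findall concrete, here is a small sketch. The test string is invented for illustration.

```python
import re

pattern = re.compile("[ACGT]+")
text = "xxACGTxxGGA"   # made-up string: DNA letters embedded in other characters

at_start = pattern.match(text)     # None, because the string does not start with A, C, G, or T
anywhere = pattern.search(text)    # first match found anywhere in the string
all_hits = pattern.findall(text)   # list of all non-overlapping matches

print(at_start)                            # None
print(anywhere.group(), anywhere.span())   # ACGT (2, 6)
print(all_hits)                            # ['ACGT', 'GGA']
```

Because match and search return None on failure, their result is usually tested with `if` before calling methods such as `group()`.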

SLIDE 12

Matching alternative strings

  • (this|that) matches "this" or "that"
  • ...and is equivalent to th(is|at)

>>> pattern = re.compile("(this|that|other)", re.IGNORECASE)    # case-insensitive search pattern
>>> pattern.search("Will match THIS")              ## success
<_sre.SRE_Match object at 0x00B52860>
>>> pattern.search("Also THat will be matched")    ## success
<_sre.SRE_Match object at 0x00B528A0>
>>> pattern.search("Will not match ot-her")        ## will return None
>>>

On success, Python returns a description of the match object (its exact wording depends on the Python version).
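
A brief sketch of the alternation rules. The word list is an assumption for illustration; note that the two equivalent patterns accept the same strings but capture different text in their group.

```python
import re

words = ["this", "that", "thus", "other"]   # illustrative test words

long_form = [w for w in words if re.fullmatch("(this|that)", w)]
short_form = [w for w in words if re.fullmatch("th(is|at)", w)]

# The parentheses also capture: group(1) holds what the bracketed part matched
captured_long = re.fullmatch("(this|that)", "that").group(1)
captured_short = re.fullmatch("th(is|at)", "that").group(1)

# re.IGNORECASE makes the match case-insensitive; group() returns the text as written
hit = re.search("(this|that|other)", "Will match THIS", re.IGNORECASE).group()

print(long_form, short_form)           # ['this', 'that'] ['this', 'that']
print(captured_long, captured_short)   # that at
print(hit)                             # THIS
```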

SLIDE 13

Word and string boundaries

  • ^ matches the start of a string
  • $ matches the end of a string
  • \b matches word boundaries

"Escaping" special characters

  • Characters with special meaning:

. ^ $ * + ? { [ ] \ | ( )

  • \ is used to free or "escape" those characters from their special meaning
  • so \[ just matches the character "["
    – if not escaped, "[" signifies the start of a character class, as in [ACGT]
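
Boundaries and escaping can be sketched as follows. The sample text is made up, and `re.escape` is a standard-library helper that the slides do not mention; it is shown here only as one convenient way to escape a whole string.

```python
import re

text = "cat concat cat."   # illustrative sample text

starts_with_cat = bool(re.search(r"^cat", text))   # ^ anchors at the start of the string
ends_with_cat = bool(re.search(r"cat$", text))     # $ anchors at the end; the text ends with "."
whole_words = re.findall(r"\bcat\b", text)         # \b excludes the "cat" inside "concat"

unescaped_dot = re.findall(r"cat.", text)    # . matches any character after "cat"
escaped_dot = re.findall(r"cat\.", text)     # \. matches only a literal dot

# re.escape adds the backslashes needed to treat a string literally
literal_class = re.search(re.escape("[ACGT]"), "the class [ACGT] is written in brackets")

print(starts_with_cat, ends_with_cat)   # True False
print(whole_words)                      # ['cat', 'cat']
print(unescaped_dot)                    # ['cat ', 'cat ', 'cat.']
print(escaped_dot)                      # ['cat.']
print(literal_class.group())            # [ACGT]
```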

SLIDE 14

Substitutions/Match Retrieval

  • Regex can also be used to substitute patterns using re.sub

>>> re.sub("(red|blue|green)", "colored", "blue socks and red shoes")    # regex use without compiling
'colored socks and colored shoes'
>>> e, raw, frm, to = re.findall(r"\d+", \
... "E-value: 4, \
... Raw Bit Score: 165, \
... Match position: 362-419")
>>> print(e, raw, frm, to)
4 165 362 419

  • \ at the end of a line allows multi-line commands
  • Alternatively, construct multi-line strings using triple quotes """ ... """
  • \d+ matches one or more digits (the r prefix makes a raw string, so the backslash is passed to re unchanged)
  • The result, a list of 4 strings, is assigned to 4 variables
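
Two details are worth a sketch here: raw strings and the optional count argument of re.sub. Recent Python versions emit a warning for unrecognized escapes such as "\d" in ordinary string literals, so writing patterns as r"..." is the safer habit. The variable names follow the slide; converting the matches to int is an added assumption.

```python
import re

report = "E-value: 4, Raw Bit Score: 165, Match position: 362-419"

# r"..." keeps the backslash, so \d reaches the regex engine unchanged.
# findall returns strings; int() converts them to numbers.
e, raw, frm, to = (int(n) for n in re.findall(r"\d+", report))

# count limits how many matches re.sub replaces (default: all of them)
first_only = re.sub(r"(red|blue|green)", "colored", "blue socks and red shoes", count=1)

print(e, raw, frm, to)   # 4 165 362 419
print(to - frm)          # 57
print(first_only)        # colored socks and red shoes
```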

SLIDE 15

N-glycosylation site detector

>>> protein = "\
... MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI\
... IIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQ\
... HYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMF\
... KHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYM\
... DEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANAL\
... KPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI"
>>> regex = "N[^P][ST]"    # the main regular expression
>>> for match in re.finditer(regex, protein):
...     print(match.group(), match.span())
...
NGS (26, 29)
NLT (214, 217)
NGT (250, 253)

  • re.finditer provides an iterator over match objects
  • match.group and match.span return the actual matched string and the position tuple
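
One caveat, sketched under assumptions: re.finditer reports non-overlapping matches, so two candidate sites that share residues are reported only once. A lookahead group is one common way to list overlapping candidates as well. The toy sequence below is made up to show the effect and is not taken from the slides.

```python
import re

toy = "AANNTSAA"   # invented sequence: NNT and NTS overlap

# Plain pattern: after matching NNT the scan continues behind it, so NTS is missed
plain = [(m.group(), m.start()) for m in re.finditer(r"N[^P][ST]", toy)]

# Lookahead (?=...) consumes no characters, so every start position is tested;
# the inner group captures the three residues of the candidate site
overlapping = [(m.group(1), m.start()) for m in re.finditer(r"(?=(N[^P][ST]))", toy)]

print(plain)         # [('NNT', 2)]
print(overlapping)   # [('NNT', 2), ('NTS', 3)]
```

The positions are 0-based string indices, as in the slide; add 1 if residue numbering should start at 1.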

SLIDE 16

Another Example:

Courtesy of Chris Bystroff

[KHDAS]DEL

SLIDE 17

Another Example: Zinc finger motif

Von Thomas Splettstoesser (www.scistyle.com) - self-made, based on PDB structure 1A1L, the open source molecular visualization tool PyMol and Cinema 4D, GFDL, https://commons.wikimedia.org/w/index.php?curid=3106866

SLIDE 18

Courtesy of Chris Bystroff

C\w{2,4}C\w{3}[LIVMFYWC]\w{8}H\w{3,5}H

[LIVMFYWC]: hydrophobic

SLIDE 19

Test your Regular Expressions

www.pythex.org

  • Develop regular expressions
  • Test them on examples of your choice

Pattern for PDB IDs: ^[1-9]\w{3}$

Test strings: 2REG 1VSN 1osn 9ins 1a1b
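
The same test can be run locally. In this sketch the first five candidates come from the slide; "0abc", "1abcd", and "1a_b" are made-up counterexamples, and the stricter pattern is an assumption about what a tighter check could look like, not an official definition of PDB IDs.

```python
import re

pdb_id = re.compile(r"^[1-9]\w{3}$")
candidates = ["2REG", "1VSN", "1osn", "9ins", "1a1b", "0abc", "1abcd", "1a_b"]

accepted = [c for c in candidates if pdb_id.search(c)]

# \w also matches the underscore; an explicit class excludes it (illustrative variant)
strict_id = re.compile(r"^[1-9][A-Za-z0-9]{3}$")
strict_accepted = [c for c in candidates if strict_id.search(c)]

print(accepted)          # ['2REG', '1VSN', '1osn', '9ins', '1a1b', '1a_b']
print(strict_accepted)   # ['2REG', '1VSN', '1osn', '9ins', '1a1b']
```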

SLIDE 20

Functions

SLIDE 21

Functions

  • Similar code is often needed in different places of a program
  • but copy/paste code is a bad idea!
  • need to separate those pieces of code and call them from different places
  • Separated code for a self-contained task is called a function
  • Examples of such tasks:
    – cleaning up a sequence (lowercase, strip newlines..)
    – reverse complementing a sequence
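
The two example tasks could be written as functions along these lines. This is a sketch, not the lecture's own code: the function names and the choice of str.translate are assumptions, and the input reuses the promoter sequence from the pattern-matching slide.

```python
def clean_sequence(raw):
    """Lowercase a sequence and remove all whitespace, including newlines."""
    return "".join(raw.split()).lower()

def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA sequence."""
    complement = str.maketrans("acgt", "tgca")   # a<->t, c<->g
    return seq.translate(complement)[::-1]       # complement, then reverse

cleaned = clean_sequence("TAATAAAAAA\nCGCGTTGTCG\n")
rc = reverse_complement(cleaned)

print(cleaned)                        # taataaaaaacgcgttgtcg
print(rc)                             # cgacaacgcgttttttatta
print(reverse_complement("acgcgt"))   # acgcgt (the MCB site is its own reverse complement)
```

Once defined, both functions can be called from anywhere in a program instead of copying the code.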

SLIDE 22

Function Syntax

Syntax:

def <functionname>(<arg1>, <arg2>, ...):
    <block>
    return <something>

Example:

def sum_up_numbers(num1, num2):
    my_sum = num1 + num2
    return my_sum

SLIDE 23

Calling a function

Function definition:

def sum_up_numbers(num1, num2):
    my_sum = num1 + num2
    return my_sum

Function calls:

sum_up_numbers(1, 5)              # positional arguments, returns 6
sum_up_numbers(num1=1, num2=5)    # keyword arguments, returns 6

SLIDE 24

Example: Largest number

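  • Function to find the largest number in a list

def find_max(aList):                  # function declaration
    # function body: start from the first element; indexing (unlike pop) leaves the caller's list unchanged,
    # and the name largest avoids shadowing the built-in max
    largest = aList[0]
    for x in aList:
        if x > largest:
            largest = x
    return largest                    # function result

numbers = [1, 5, 1, 12, 3, 4, 6]
print("Maximum: %i" % find_max(numbers))    # function call

Output:

Maximum: 12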

SLIDE 25

Lambda Functions

SLIDE 26

Lambda Functions

  • Kind of anonymous functions
  • Similar to normal functions but...
    – ...not bound to a name
    – ...different syntax
    – ...can be assigned to variables, passed to functions
    – ...restricted to a single expression (no statements)

Normal function definition:

def calc(x):
    return (x-3)*2

calc(5)     # returns 4

Lambda function:

calc1 = lambda x: (x-3)*2
calc1(5)    # returns 4
calc2 = calc1
calc2(5)    # returns 4
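
A sketch of what "passed to functions" means in practice. The helper `apply_twice` and the list of motifs are illustrative assumptions; `sorted` with a `key` argument is standard Python.

```python
motifs = ["ACGCGT", "KDEL", "NGS"]   # short strings used only as sample data

# The lambda tells sorted which value to compare: here, the length of each string
by_length = sorted(motifs, key=lambda s: len(s))

def apply_twice(func, value):
    """Call func on value, then call it again on the result (hypothetical helper)."""
    return func(func(value))

# (5-3)*2 = 4, then (4-3)*2 = 2
result = apply_twice(lambda x: (x - 3) * 2, 5)

print(by_length)   # ['NGS', 'KDEL', 'ACGCGT']
print(result)      # 2
```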

SLIDE 27

Map, filter, and reduce

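  • Lambda functions can be passed as arguments to functions
  • Powerful in combination with map, filter, and reduce

map / filter / reduce (lambda_function, sequence)

  • lambda_function: the function applied to each element...
  • sequence: ...of the given sequence
  • The function name decides what to do with the result:
    – map -> apply to each element, return the modified elements
    – filter -> return the elements that tested True
    – reduce -> return one element resulting from the computation
  • In Python 3, map and filter return iterators (wrap them in list() to get a list), and reduce must be imported from functools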

SLIDE 28

Examples

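from functools import reduce    # in Python 3, reduce lives in functools

list(map(lambda x: x*3, [1,2,3]))                            # [3, 6, 9]
list(filter(lambda x: x>=1.0, [1.2,0.5,0.7,1.3]))            # [1.2, 1.3]
list(filter(lambda x: x!=0, map(lambda x: x-2, [4,2,5])))    # map yields 2, 0, 3; filter keeps [2, 3]
reduce(lambda x,y: x+y, (1,2,3,4))                           # 10

How reduce proceeds: x, y = 1, 2 -> 3; then x, y = 3, 3 -> 6; then x, y = 6, 4 -> 10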
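
For comparison, the same four results can be written without lambda. This sketch uses list comprehensions and the built-in sum, which many Python programmers find easier to read; it is an alternative style, not a replacement for the technique shown on the slide.

```python
from functools import reduce

# List comprehension instead of map
tripled = [x * 3 for x in [1, 2, 3]]

# Comprehension with a condition instead of filter
kept = [x for x in [1.2, 0.5, 0.7, 1.3] if x >= 1.0]

# Nested generator instead of filter(map(...))
shifted_nonzero = [y for y in (x - 2 for x in [4, 2, 5]) if y != 0]

# Built-in sum instead of reduce with addition
total = sum((1, 2, 3, 4))

# Both styles give the same results
same_map = tripled == list(map(lambda x: x * 3, [1, 2, 3]))
same_reduce = total == reduce(lambda x, y: x + y, (1, 2, 3, 4))

print(tripled, kept, shifted_nonzero, total)   # [3, 6, 9] [1.2, 1.3] [2, 3] 10
print(same_map, same_reduce)                   # True True
```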

SLIDE 29

Summary

  • Regular expressions are powerful tools to detect patterns
  • They allow matching of character classes, repetitions, alternatives, etc.
  • Learn the meaning of the special characters . ^ $ * + ? { [ ] \ | ( )
  • Python offers regexp functions in the re module
    – match, search, findall, finditer etc.
  • Regular expressions can be used to find motifs in sequences
  • Functions are a way to separate self-contained tasks and to structure code
  • Lambda functions combine with map, filter, and reduce for concise list processing