Mining Unstructured Data: Practical Applications
Alyona Medelyan @zelandiya, Anna Divoli @annadivoli
New York London
Problem 1
Images: Ambro / FreeDigitalPhotos.net
How do lawyers scan, file, store & share clients’ case documents efficiently?
slambo_42@flickr Anoto AB@flickr
EHR  EMR  PHR
How do doctors, patients & researchers distribute & share medical records efficiently?
Foreign Financial Institution
with IRS agreement  annual report  30% withholding tax  waiver
with waiver  without waiver
U.S. account holders  U.S. ownership entities
30% withholding tax
Custodian bank
without IRS agreement
The FATCA Legislation
Takes effect 1 January 2013
Problem 3 How can a financial institution find U.S. citizens in masses of paperwork efficiently?
How much time do we actually spend on …
Searching, gathering info: 17
Writing emails: 14
Creating docs: 13
Analyzing info: 10
Reviewing docs: 9
Organizing docs: 7
Creating presentations: 7
Editing images: 6
Entering data: 6
Approving docs: 4
Publishing docs: 4
Translating docs: 1
Translates to annual costs:
Search: 17h / week = $37,000 / year
IDC: Hidden cost of information
average hours / week
introduction
unstructured data
real life problems
unstructured data & text analytics
metadata in legal domain
healthcare records issues
compliance in finance
conclusions
Videos  Emails  Literature  Audio  News  Images  Social Media  Databases  Blogs
Text Mining Natural Language Processing
unstructured data
Opinion Mining Business Intelligence Document Organization Data Extraction Search Machine Learning Text Processing Statistics Linguistics
What can one mine from unstructured data?
sentiment keywords tags genre categories taxonomy terms entities names patterns biochemical entities …
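As a rough illustration of mining one of these items (keywords) from free text, here is a minimal sketch based on plain word frequency. The stopword list, the function name `extract_keywords` and the sample text are assumptions made for this example; they do not come from the slides, and production keyword extractors rely on far richer linguistic and statistical features.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "for", "on", "with", "that", "by", "from"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Ignore stopwords and very short tokens such as "if" or "is".
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]

sample = ("The contract between the client and the supplier covers delivery. "
          "The client may terminate the contract if delivery is late.")
keywords = extract_keywords(sample)
print(keywords)
```

Frequency alone is a crude signal; it is used here only because it keeps the example self-contained.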
People U.S. politicians News about U.S. politicians News
Structured: biological data  Unique identifiers  Literature references  Experts' annotation (free text)
Structured & unstructured data interplay
scan  OCR  metadata  DMS  save
Legal document processing pipeline
Images: Ambro / FreeDigitalPhotos.net
New York London
Assigning metadata
(approximation)
15 docs per day × 3 min per doc = 0.75 h per day
0.75 h per day × 240 working days per year × $200 hourly charge = $36,000 per year per lawyer
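The approximation above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for the example, while the figures are the ones quoted on the slide.

```python
# Figures quoted on the slide; the constant names are illustrative.
DOCS_PER_DAY = 15
MINUTES_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200

# Time spent assigning metadata by hand each day, in hours.
hours_per_day = DOCS_PER_DAY * MINUTES_PER_DOC / 60

# Billable value of that time over a working year, per lawyer.
annual_cost_usd = hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD

print(f"{hours_per_day} h per day, ${annual_cost_usd:,.0f} per year per lawyer")
```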
Keyword extraction
0.0027 min per doc
10 min for a year's worth of docs
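The "10 min" figure follows from the document volume given for manual metadata assignment (15 docs per day, 240 working days per year). A small sketch of that calculation, with illustrative constant names:

```python
# Volume figures from the manual-assignment estimate; names are illustrative.
DOCS_PER_DAY = 15
WORKING_DAYS_PER_YEAR = 240
EXTRACTION_MINUTES_PER_DOC = 0.0027

# Number of documents one lawyer handles in a year.
docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR

# Total machine time to extract keywords from all of them, in minutes.
total_minutes = docs_per_year * EXTRACTION_MINUTES_PER_DOC

print(f"{docs_per_year} docs per year, about {total_minutes:.2f} min in total")
```

The result is a little under 10 minutes, which the slide rounds up.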
jacockshaw@flickr
Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
metadata  DMS
Efficient (legal) document processing pipeline
keywords tags
EHR  EMR  PHR
slambo_42@flickr Anoto AB@flickr
EMR  EHR  PHR
National Alliance for Health Information Technology (NAHIT) definitions
Discontinued?
1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …
http://www.nlm.nih.gov/medlineplus/magazine/
PHI
de-identification process
Medical researchers use patient records for discoveries…
… records with removed PHI: information from structured fields but mostly from free text?
AMIA 2012
www.hcpro.com  siliconangle.com/blog/  www.information-age.com
“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”
“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
PHI: 18 identifiers!
Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #  Medical records #  Health plan beneficiary #  Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
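Some of these identifier types have a regular surface form and can be masked with pattern matching. The sketch below covers only emails, URLs, IP addresses, U.S.-style social security and phone numbers, and MM/DD/YYYY dates (keeping the year). The patterns, labels and the `scrub` function are assumptions for illustration; names, addresses and other free-text identifiers need entity extraction, and nothing here amounts to a validated de-identification procedure.

```python
import re

# Illustrative patterns for a few identifier types (assumptions, not a full PHI scrubber).
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b")),
]

# Dates: assumes MM/DD/YYYY; the year is kept, day and month are masked.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")

def scrub(text):
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return DATE.sub(r"[DATE]/\1", text)

note = ("Patient admitted 03/14/2011. Contact: jane.doe@example.org, "
        "555-123-4567. SSN 123-45-6789.")
print(scrub(note))
```

Masking by pattern is easy to audit but misses anything phrased unexpectedly, which matters because most of the information sits in free text.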
Thanks for discussions:
   Nigam Shah, Stanford
   Eneida Mendonca, UWisconsin, Madison
   Irena Spasic, Cardiff University
keywords tags
slambo_42@flickr Anoto AB@flickr
The FATCA Legislation
Takes effect 1 January 2013
FATCA COMPLIANCE – STEP 1
Detect U.S. citizenship indicators
Recommended Solution from FATCA Legislation:
- “Query an electronic database using standard queries in programming languages”
- “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”
- “Note that information, data, or files are not electronically searchable if they are stored as images”
walmink, thomwatson@flickr
FATCA COMPLIANCE – STEP 2
Contact client for additional info or a waiver
Actual Solution for the FATCA Legislation:
OCR  link analysis  entity extraction  analysis
gather the trail of the client's data
convert all images to text
detect locations, bank numbers
auto-categorize
check
resolve inconsistencies
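Once images have been converted to text, the detection step can be approximated with pattern matching over each document. The sketch below is only an illustration: the indicator names, the regular expressions and the sample documents are hypothetical and are not taken from the FATCA text or from the slides. A real system would combine entity extraction with link analysis across a client's documents.

```python
import re

# Hypothetical indicator patterns, chosen for illustration only.
INDICATORS = {
    "us_phone": re.compile(r"\+1[ -]\d{3}[ -]\d{3}[ -]\d{4}"),
    "us_zip_address": re.compile(r"\b[A-Z]{2} \d{5}(?:-\d{4})?\b"),
    "us_mention": re.compile(r"\b(?:USA|United States)\b"),
}

def find_indicators(documents):
    """Map each document id to the sorted names of the indicators found in its text."""
    hits = {}
    for doc_id, text in documents.items():
        found = sorted(name for name, rx in INDICATORS.items() if rx.search(text))
        if found:
            hits[doc_id] = found
    return hits

# Sample texts standing in for OCR output (made up for this example).
docs = {
    "letter_001": "Please reach me at +1 212 555 0147 or write to 5 Main St, New York, NY 10001.",
    "form_002": "Place of birth: Wellington, New Zealand.",
}
flagged = find_indicators(docs)
print(flagged)
```

Documents that are flagged would then go on to the check and resolve steps, for example contacting the client for additional information.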
Efficient FATCA Compliance
Alyona Medelyan, PhD @zelandiya Anna Divoli, PhD @annadivoli
Natural Language Processing Text Mining Wikipedia Mining Machine Learning
Try out text analytics provided by the Pingar API!
Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api
Biomedical Text Mining Search User Interfaces Human Factors Knowledge Discovery