
SLIDE 1

Web Mining

Based on several presentations found on the web: Shapiro, Ullman, Terziyan, Pedersen ...

What is Web Mining?

 Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))

 Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage data. (Bing Liu, 2007, Web Data Mining, Springer)

What is Web Mining?

 Motivation / Opportunity
 The WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining
 Intelligent Web Search
 Personalization, Recommendation Engines
 Web-commerce applications
 Building the Semantic Web
 Web page classification and categorization
 News classification and clustering
 Information / trend monitoring
 Analysis of online communities
 Web and mail spam filtering

Abundance and authority crisis

 Liberal and informal culture of content generation and dissemination
 Redundancy and non-standard form and content
 Millions of qualifying pages for most broad queries
 Example: java or kayaking
 No authoritative information about the reliability of a site
 Little support for adapting to the background of specific users
 Pages added continuously and average page changes in a few weeks

SLIDE 2


Different from “classical” Data Mining?

 The web is not a relation
 Textual information + linkage structure
 Usage data is huge and growing rapidly
 Google’s usage logs are bigger than their web crawl
 Data generated per day is comparable to largest conventional data warehouses

Size of the Web

 Number of pages
 11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/ www2005)
 Technically, infinite
 Because of dynamically generated content
 Lots of duplication (30-40%)
 Best estimate of “unique” static HTML pages comes from search engine claims
 Yahoo = claimed 19.2 billion in Aug 2005
 Number of unique web sites
 Netcraft survey says 98 million sites

October 2006 Web Server Survey

http://news.netcraft.com/archives/web_server_survey.html

SLIDE 3

from http://www.worldwidewebsize.com/

Another way to estimate the web size

 The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
 The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment

Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
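The two-step estimate above is a product of two sampled quantities. A minimal sketch, assuming an IPv4-sized address space and made-up probe counts (none of the numbers below come from Lawrence and Giles; the function name is hypothetical):

```python
def estimate_web_size(address_space, probes, hits, pages_per_sampled_server):
    """Return (estimated servers, estimated pages) from two samples."""
    # Step 1: fraction of random IP probes that found a web server,
    # scaled up to the whole address space.
    estimated_servers = (hits / probes) * address_space
    # Step 2: mean page count seen when crawling a sample of those servers.
    mean_pages = sum(pages_per_sampled_server) / len(pages_per_sampled_server)
    return estimated_servers, estimated_servers * mean_pages

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
servers, pages = estimate_web_size(
    address_space=2**32,
    probes=1_000_000,
    hits=500,
    pages_per_sampled_server=[100, 200, 300],
)
print(f"{servers:,.0f} servers, {pages:,.0f} pages")
```

The quality of the result depends on both samples being representative, which is the main caveat of this kind of estimate.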

Web Information Retrieval

 According to most predictions, the majority of human information will be available on the Web in ten??? years
 Effective information retrieval can aid in
 Research: Find all papers about web mining
 Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting?
 Travel: Find information on the tropical island of St. Lucia
 Business: Find companies that manufacture digital signal processors
 Entertainment: Find all movies starring Marilyn Monroe between the years 1960 and 1970
 Arts: Find all short stories written by Jhumpa Lahiri

Why is Web Information Retrieval Difficult?

 The Abundance Problem (99% of information of no interest to 99% of people)
 Hundreds of irrelevant documents returned in response to a search query
 Limited Coverage of the Web (Internet sources hidden behind search interfaces)
 Largest crawlers cover less than 18% of Web pages
 The Web is extremely dynamic
 Lots of pages added, removed and changed every day
 Very high dimensionality (thousands of dimensions)
 Limited query interface based on keyword-oriented search
 Limited customization to individual users

SLIDE 4

Search Landscape

[Figure: search engine market share, 2005 and Sept 2009]

http://marketshare.hitslink.com/search-engine-market-share.aspx?qprid=4

Search Engine Web Coverage Overlap

4 searches were defined that returned 141 web pages.

http://www.searchengineshowdown.com/stats/overlap.shtml

Web search basics

[Figure: architecture of a web search engine. Labelled components: User, Search, The Web, Web crawler, Indexer, Indexes, Ad indexes. The example screenshot shows the results page for the query “miele”: web results 1 - 10 of about 7,310,000 (0.12 seconds), together with sponsored links.]

Web Crawling Basics

Start with a “seed set” of to-visit urls

[Figure: crawl loop. Labelled elements: to-visit urls, get next url, get page, Web, extract urls, visited urls, web pages]
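The crawl loop in the diagram can be sketched as a queue of to-visit URLs plus a record of visited ones. This is a minimal illustration, not a production crawler: `fetch_links` is a hypothetical callback standing in for "get page" and "extract urls", and the toy link table replaces real HTTP requests.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl starting from a seed set of URLs."""
    to_visit = deque(seed_urls)      # the "to-visit urls" frontier
    seen = set(seed_urls)            # URLs already queued or visited
    visited = []                     # the "visited urls", in crawl order
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()     # get next url
        links = fetch_links(url)     # get page and extract urls
        visited.append(url)
        for link in links:
            if link not in seen:     # queue each URL at most once
                seen.add(link)
                to_visit.append(link)
    return visited

# Toy in-memory "web" (an assumption for the example, no network access).
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Using a queue gives a breadth-first order; replacing it with a priority queue would let the crawler visit pages judged more important first.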

SLIDE 5

Crawling Issues

 Load on web servers
 E.g., no more than 1 request to the same server every 10 seconds
 Insufficient resources to crawl entire web
 Visit “important” pages first (pagerank, inlinks …)
 How to keep crawled pages “fresh”?
 How often do web pages change? What do we mean by freshness?
 Detecting replicated content e.g., mirrors
 Use document comparison techniques (java manuals)
 Can’t crawl the web from one machine
 Parallelizing the crawl
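The slide does not say which document comparison technique is meant. One common choice, used here purely as an assumption, is to compare sets of word shingles with the Jaccard coefficient; the function names and the shingle size `k=3` are illustrative.

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

page = "the java manual explains the java language"
mirror = "The Java manual explains the Java language"
other = "kayaking trips on the river this summer"

same = jaccard(shingles(page), shingles(mirror))
different = jaccard(shingles(page), shingles(other))
print(same, different)
```

A crawler could treat pages whose similarity exceeds some threshold as replicas and keep only one copy.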

Web Advertising

 Banner ads (1995-2001)
 Initial form of web advertising
 Popular websites charged X$ for every 1000 “impressions” of ad
 Modeled similar to TV, magazine ads
 Low clickthrough rates
 low ROI for advertisers
 Search advertising
 Introduced by Overture around 2000
 Advertisers “bid” on search keywords
 When someone searches for that keyword, the highest bidder’s ad is shown
 Advertiser is charged only if the ad is clicked on
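The keyword bidding rule described above (show the highest bidder's ad, charge only on a click) can be sketched in a few lines. This is a toy model under stated assumptions: the bid table, advertiser names and the pay-your-bid pricing are illustrative, and real ad systems use more elaborate auction rules.

```python
def select_ad(bids, keyword):
    """Return the advertiser with the highest bid for a keyword, or None."""
    candidates = bids.get(keyword, {})
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def charge(bids, keyword, advertiser, clicked):
    """The advertiser pays its bid only if the ad was clicked."""
    return bids[keyword][advertiser] if clicked else 0.0

# Hypothetical bids: keyword -> {advertiser: bid per click}.
bids = {"kayak": {"shop_a": 0.50, "shop_b": 0.80}}
winner = select_ad(bids, "kayak")
print(winner, charge(bids, "kayak", winner, clicked=True))
```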

Web Advertising

 Search advertising is the revenue model
 Multi-billion-dollar industry
 Advertisers pay for clicks on their ads
 Interesting problems
 What ads to show for a search?
 Maximise revenue, each advertiser has a limited budget
 If I’m an advertiser, which search terms should I bid on and how much to bid?

Web Mining Taxonomy

Web Mining
 Web Structure Mining
 Web Content Mining
 Web Usage Mining

SLIDE 6

Web Mining Taxonomy

 Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion
 Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks
 Web usage mining: focuses on techniques to study the user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

Web Content Mining

Examines the content of web pages as well as results of web searching.

Web Content Mining

 Can be thought of as extending the work performed by basic search engines
 Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
 Web Content Mining is: the process of extracting knowledge from web contents

Information Retrieval

 Given:
 A source of textual documents
 A user query (text based)
 Find:
 A set (ranked) of documents that are relevant to the query

[Figure: a documents source and a query feed the IR System, which outputs ranked documents]

SLIDE 7

Semi-Structured Data

 Text content is, in general, semi-structured
 Example:
 Structured attribute/value pairs
 Title
 Author
 Publication_Date
 Length
 Category
 Unstructured
 Abstract
 Content

Structuring Textual Information

 Many methods designed to analyze structured data
 If we can represent documents by a set of attributes we will be able to use existing data mining methods
 How to represent a document?
 Vector based representation
 (referred to as “bag of words” as it is invariant to permutations)
 Use statistics to add a numerical dimension to unstructured text
 Term frequency
 Document frequency
 Document length
 Term proximity

Document Representation

 A document representation aims to capture what the document is about
 One possible approach (boolean representation):
 Each entry describes a document
 Attributes describe whether or not a term appears in the document

Example

Terms: Camera, Digital, Memory, Pixel, …
Document 1: 1 1 1
Document 2: 1 1
…

Document Representation

 Another approach:
 Each entry describes a document
 Attributes represent the frequency with which a term appears in the document

Example: Term frequency table

Terms: Camera, Digital, Memory, Print, …
Document 1: 3 2 1
Document 2: 4 3
…

SLIDE 8

Document Representation

 But a term is mentioned more times in longer documents
 Therefore, use relative frequency (% of document):
 No. of occurrences/No. of words in document

Terms: Camera, Digital, Memory, Print, …
Document 1: 0.03 0.02 0.01
Document 2: 0.004 0.003
…
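The three document representations discussed so far (boolean, term frequency, relative frequency) can be computed side by side. A minimal sketch, assuming whitespace tokenisation and a fixed vocabulary; the sample sentence and the function name are made up for illustration.

```python
from collections import Counter

def represent(doc, vocabulary):
    """Return (boolean, term-frequency, relative-frequency) vectors."""
    words = doc.lower().split()
    counts = Counter(words)
    tf = [counts[term] for term in vocabulary]            # raw counts
    boolean = [1 if count > 0 else 0 for count in tf]     # presence only
    # Relative frequency: occurrences / number of words in the document.
    relative = [count / len(words) if words else 0.0 for count in tf]
    return boolean, tf, relative

vocab = ["camera", "digital", "memory", "print"]
boolean, tf, relative = represent(
    "digital camera with memory card and digital zoom", vocab
)
print(boolean, tf, relative)
```

All three vectors share the same column order, so each document becomes one row of a table like the ones on the slides.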

More on Document Representation

 Stop Word removal: Many words are not informative and thus irrelevant for document representation
 the, and, a, an, is, of, that, …
 Stemming: reducing words to their root form (Reduce dimensionality)
 A document may contain several occurrences of words like
 fish, fishes, fisher, and fishers
 But would not be retrieved by a query with the keyword
 fishing
 Different words share the same word stem and should be represented with its stem, instead of the actual word
 Fish
 For the Portuguese language these techniques are less studied
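Stop word removal and stemming can be illustrated with a toy preprocessor. The stop list is the one from the slide; the suffix rules are a deliberately naive assumption that happens to handle the "fish" example and are not a real stemming algorithm such as Porter's.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
# Toy suffix rules, longest first; real stemmers use many more rules.
SUFFIXES = ("ing", "ers", "er", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

stems = [stem(w) for w in ["fish", "fishes", "fisher", "fishers", "fishing"]]
print(stems, preprocess("the fisher is fishing"))
```

After this step a query for "fishing" and a document mentioning "fishers" both map to the stem "fish" and can be matched.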

Weighting Scheme for Term Frequencies

 TF-IDF weighting: give higher weight to terms that are rare
 TF: term frequency (increases weight of frequent terms)
 IDF: inverse document frequency
 If a term is frequent in lots of documents it does not have discriminative power

For a given term $w_j$ and document $d_i$:

 $n_{ij}$ is the number of occurrences of $w_j$ in document $d_i$
 $|d_i|$ is the number of words in document $d_i$
 $n$ is the number of documents
 $n_j$ is the number of documents that contain $w_j$

$$TF_{ij} = \frac{n_{ij}}{|d_i|}, \qquad IDF_j = \log\frac{n}{n_j}, \qquad x_{ij} = TF_{ij} \times IDF_j$$

There is no compelling motivation for this method but it has been shown to be superior to other methods
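A small sketch of the TF-IDF weighting, written directly from the definitions of term frequency (occurrences divided by document length) and inverse document frequency (log of the number of documents over the number of documents containing the term). The three toy documents and the natural logarithm are assumptions; documents are assumed to be non-empty.

```python
import math

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = {w for words in tokenized for w in words}
    # n_j: number of documents that contain term w_j
    doc_freq = {w: sum(1 for words in tokenized if w in words) for w in vocab}
    weights = []
    for words in tokenized:
        row = {}
        for w in set(words):
            tf = words.count(w) / len(words)     # n_ij / |d_i|
            idf = math.log(n / doc_freq[w])      # log(n / n_j)
            row[w] = tf * idf
        weights.append(row)
    return weights

weights = tf_idf(["digital camera", "digital memory", "digital print print"])
print(weights[0])
```

The term "digital" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, which is the loss of discriminative power mentioned on the slide.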

Locating Relevant Documents

 Given a set of keywords
 Use similarity/distance measure to find similar/relevant documents
 Rank documents by their relevance/similarity

How to determine if two documents are similar?

SLIDE 9

Distance Based Matching

 In order to retrieve documents similar to a given document we need a measure of similarity
 Euclidean distance (example of a metric distance):
 The Euclidean distance between X=(x1, x2, x3,…xn) and Y=(y1, y2, y3,…yn) is defined as:

$$D(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Properties of a metric distance:

 D(X,X)=0
 D(X,Y)=D(Y,X)
 D(X,Z)+D(Z,Y) ≥ D(X,Y)

[Figure: example points A, B, C and D]
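The Euclidean distance and the three metric properties can be checked numerically on a few sample vectors. The vectors below are arbitrary values chosen for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

X, Y, Z = [0, 0], [3, 4], [6, 0]
d_xy = euclidean(X, Y)
print(d_xy)                                           # distance X to Y
print(euclidean(X, X) == 0)                           # D(X,X) = 0
print(euclidean(X, Y) == euclidean(Y, X))             # symmetry
print(euclidean(X, Z) + euclidean(Z, Y) >= d_xy)      # triangle inequality
```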

Angle Based Matching

 Cosine of the angle between the vectors representing the document and the query
 Documents “in the same direction” are closely related.
 Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowest

$$D(X,Y) = \cos(X,Y) = \frac{X^T Y}{\|X\| \, \|Y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$$

[Figure: example vectors A, B, C and D]
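A short sketch of cosine similarity between two term vectors. Returning 0.0 for an all-zero vector is a convention assumed here, since the cosine is undefined in that case; the sample vectors are arbitrary.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0          # assumed convention for empty documents
    return dot / (norm_x * norm_y)

same_direction = cosine([1, 2], [2, 4])   # same direction, different length
orthogonal = cosine([1, 0], [0, 1])       # no terms in common
print(same_direction, orthogonal)
```

Because only the direction matters, a long document and a short one with the same term proportions are judged identical, unlike with the Euclidean distance. For vectors with non-negative term weights the value stays between 0 and 1.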

Performance Measure

 The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
 The quality of a collection can be compared by the two following measures

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

percentage of documents that are relevant to the query and were, in fact, retrieved

[Figure: Venn diagram of all documents, showing the retrieved documents, the relevant documents and their intersection (relevant & retrieved)]
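Precision and recall follow directly from set intersections. A minimal sketch with made-up document identifiers; returning 0.0 when a set is empty is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant & retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).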

Intelligent Web Search

 Combine the intelligent IR tools
 meaning of words
 order of words in the query
 authority of the source
 With the unique web features
 retrieve Hyper-link information
 utilize Hyper-link as input

SLIDE 10

Text Mining

 Data mining in text: find something useful and surprising from a text collection;
 text mining vs. information retrieval;
 data mining vs. database queries.
 Document classification
 Topic hierarchies, spam filters
 Document clustering
 cluster documents by a common author
 cluster documents containing information from a common source (fraud)
 Key-word based association rules

Clustering and automatic clusters’ labeling

http://clusty.com

SLIDE 11

Web Structure Mining

Exploiting Hyperlink Structure
Social network analysis

First generation of search engines

 Early days: keyword based searches
 Keywords: “web mining”
 Retrieves documents with “web” and “mining”
 Later on: cope with
 synonymy problem
 polysemy problem
 stop words
 Common characteristic: Only information on the pages is used

Modern search engines

 Link structure is very important
 Adding a link: deliberate act
 Harder to fool systems using in-links
 Link is a “quality mark”
 A page is important if important pages link to it
 Modern search engines use link structure as important source of information

Central Question:

Which useful information can be derived from the link structure of the web?

SLIDE 12

Some answers

1. Structure of Internet
2. Google
3. HITS: Hubs and Authorities

1. The Web Structure

 A study was conducted on a graph inferred from two large Altavista crawls.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.

 The study confirmed the hypothesis that the number of in-links and out-links to a page approximately follows a Zipf distribution (a particular case of a power-law)

Power Laws

In-Links

SLIDE 13

Out-Links

The Web Structure

 If the web is treated as an undirected graph
 90% of the pages form a single connected component
 If the web is treated as a directed graph
 four distinct components are identified, the four with similar size

General Topology

[Figure: bow-tie structure of the web. Labelled regions: IN (44mil), SCC (56mil), OUT (44mil), Tendrils (44mil), Tubes, Disconnected components]

SCC: set of pages that can be reached by one another
IN: pages that have a path to SCC but not from it
OUT: pages that can be reached by SCC but not reach it
TENDRILS: pages that cannot reach, and cannot be reached from, the SCC pages
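The SCC, IN and OUT definitions translate into two reachability searches from any page known to lie in the core: one following links forward and one following them backward. A sketch on a toy graph; the node names, the `OTHER` bucket (tendrils, tubes and disconnected pages lumped together) and the assumption that `core_node` belongs to the SCC are all illustrative.

```python
from collections import deque

def reachable(graph, start):
    """Set of nodes reachable from start by following edges (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(graph, core_node):
    """Split pages into SCC, IN, OUT and everything else."""
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    reverse = {}
    for u, outs in graph.items():
        for v in outs:
            reverse.setdefault(v, []).append(u)
    forward = reachable(graph, core_node)      # pages the core can reach
    backward = reachable(reverse, core_node)   # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,
    }

toy = {"in1": ["a"], "a": ["b"], "b": ["a", "out1"], "out1": [], "t": ["out1"]}
parts = bow_tie(toy, "a")
print(parts)
```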

Some statistics

 A connecting path exists between only 25% of the pairs of pages
 BUT, if there is a path:
 Directed: average length <17
 Undirected: average length <7 (!!!)
 It’s a “small world” -> between two people only a chain of length 6! (http://en.wikipedia.org/wiki/Small_world_phenomenon)
 Small World Graphs
 High number of relatively small cliques
 Small diameter
 Internet (SCC) is a small world graph

SLIDE 14


Google

 Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
 Intuition: PageRank can be seen as the probability that a “random surfer” visits a page

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117

 A page is important if important pages link to it

http://paspespuyas.com/comunidad/media/pageRank.gif

Google

 Keywords w entered by user
 Select pages containing w and pages which have in-links with caption w
 Anchor text
 Provides more accurate descriptions of Web pages
 Anchors exist for un-indexable documents (e.g., images)
 Font sizes of words in text:
 Words in larger or bolder font are assigned higher weights
 Rank pages according to importance

PageRank

(PageRank) + (Website Content) = Overall Rank in Results

Page Rank: A page is important if many important pages link to it.

 Link i→j:
 i considers j important.
 the more important i, the more important j becomes.
 if i has many out-links: links are less important.
 Initially: all importances p_i = 1. Iteratively, p_i is refined.

$$PageRank(j) = p + (1 - p) \sum_{i \to j} \frac{PageRank(i)}{OutDegree(i)}$$

SLIDE 15

PageRank

 Let OutDegree_i = # out-links of page i
 Adjust p_j:

$$PageRank(j) = p + (1 - p) \sum_{i \to j} \frac{PageRank(i)}{OutDegree(i)}$$

 This is the weighted sum of the importance of the pages referring to P_j
 Parameter p is the probability that the surfer gets bored and starts on a new random page
 (1-p) is the probability that the random surfer follows a link on current page

PageRank

Repeat until pagerank vector converges…
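The iteration can be sketched as repeated application of the update PageRank(j) = p + (1 - p) × (sum over pages i linking to j of PageRank(i) / OutDegree(i)), starting from rank 1 for every page. The toy link graph, the value p = 0.15 and the convergence tolerance are assumptions for the example; pages without out-links are not treated specially, so their rank is not redistributed.

```python
def pagerank(links, p=0.15, tol=1e-10, max_iter=1000):
    """Iterate the PageRank update until the rank vector stops changing."""
    pages = sorted(set(links) | {j for outs in links.values() for j in outs})
    rank = {page: 1.0 for page in pages}          # initially all ranks = 1
    for _ in range(max_iter):
        new_rank = {page: p for page in pages}    # the "bored surfer" term
        for i, outs in links.items():
            for j in outs:
                # page i shares (1 - p) of its rank equally over its out-links
                new_rank[j] += (1 - p) * rank[i] / len(outs)
        delta = max(abs(new_rank[x] - rank[x]) for x in pages)
        rank = new_rank
        if delta < tol:                           # converged
            break
    return rank

# Hypothetical three-page web: a links to b and c, b to c, c to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

In this toy graph page c collects rank from both a and b, so it ends up ranked above b.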

HITS (Hyperlink-Induced Topic Search)

 HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.

 Premise: Sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
 Authorities: highly-referenced pages on a topic
 Hubs: pages that “point” to authorities
 A good authority is pointed to by many good hubs; a good hub points to many good authorities

Hubs

Pages that link to a collection of authoritative pages on a broad topic. Hub pages point to interesting links, i.e. to authorities (= relevant pages).

SLIDE 16

Authorities

Relevant pages of the highest quality on a broad topic

HITS

 Steps for Discovering Hubs and Authorities on a specific topic
 Collect seed set of pages S (returned by search engine)
 Expand seed set to contain pages that point to or are pointed to by pages in seed set (removes links inside a site)
 Iteratively update hub weight h(p) and authority weight a(p) for each page:

$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$

 After a fixed number of iterations, pages with highest hub/authority weights form core of community
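The hub and authority updates can be sketched as follows: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to. The toy graph and the iteration count are assumptions; the normalisation step after each round is an addition that keeps the weights bounded and is not stated on the slide.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) weight dicts for a link graph."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over pages q that link to p
        auth = {page: 0.0 for page in pages}
        for q, outs in links.items():
            for target in outs:
                auth[target] += hub[q]
        # h(p) = sum of a(q) over pages q that p links to
        hub = {page: sum(auth[q] for q in links.get(page, ())) for page in pages}
        # Normalise so the weights do not grow without bound (assumption).
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth

# Hypothetical expanded set: h1 and h2 act as hubs, a1 and a2 as authorities.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```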

Strengths and weaknesses of HITS

 Strength: its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages.
 Weaknesses:
 It is easily spammed. It is in fact quite easy to influence HITS since adding out-links in one’s own page is so easy.
 Topic drift. Many pages in the expanded set may not be on topic.
 Inefficiency at query time: The query time evaluation is slow. Collecting the root set, expanding it and performing eigenvector computation are all expensive operations

Web Usage Mining

analyzing user web navigation

SLIDE 17

Web Usage Mining

 Pages contain information
 Links are “roads”
 How do people navigate over the Internet?
 Web usage mining (Clickstream Analysis)
 Information on navigation paths is available in log files.
 Logs can be examined from either a client or a server perspective.

Website Usage Analysis

Why analyze Website usage?

Knowledge about how visitors use Website could
 Provide guidelines to web site reorganization; Help prevent disorientation
 Help designers place important information where the visitors look for it
 Pre-fetching and caching web pages
 Provide adaptive Website (Personalization)

Questions which could be answered
 What are the differences in usage and access patterns among users?
 What user behaviours change over time?
 How usage patterns change with quality of service (slow/fast)?
 What is the distribution of network traffic over time?

Website Usage Analysis

Data Sources

SLIDE 18

Data Sources

 Server level collection: the server stores data regarding requests performed by the client, thus data generally regard just one source;
 Client level collection: it is the client itself which sends to a repository information regarding the user's behaviour (can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities);
 Proxy level collection: information is stored at the proxy side, thus Web data regards several Websites, but only users whose Web clients pass through the proxy.

Web Usage Mining Process

[Figure: process flow. Labelled elements: Web Server Log, Site Data, Data Preparation, Clean Data, Data Mining, Usage Patterns]

Data Preparation

 Data cleaning
 By checking the suffix of the URL name, for example, all log entries with filename suffixes such as gif, jpeg, etc. can be removed
 User identification
 If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
 Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users
 Transaction identification
 All of the page references made by a user during a single visit to a site
 Size of a transaction can range from a single page reference to all of the page references
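The cleaning and transaction identification steps can be sketched on a toy log. Several choices below are assumptions rather than part of the slides: log entries are plain dicts, users are identified by the (IP address, browser agent) pair, a visit ends after 30 minutes of inactivity, and the suffix list extends the gif/jpeg examples.

```python
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")

def clean(entries):
    """Data cleaning: drop requests for images and other embedded files."""
    return [e for e in entries if not e["url"].lower().endswith(IGNORED_SUFFIXES)]

def sessions(entries, timeout=1800):
    """Group each user's requests into visits split by an inactivity timeout."""
    by_user = {}
    for e in sorted(entries, key=lambda e: e["time"]):
        user = (e["ip"], e["agent"])              # user identification heuristic
        visits = by_user.setdefault(user, [])
        if visits and e["time"] - visits[-1][-1]["time"] <= timeout:
            visits[-1].append(e)                  # same visit continues
        else:
            visits.append([e])                    # a new visit starts
    return by_user

log = [
    {"ip": "1.1.1.1", "agent": "X", "time": 0, "url": "/home"},
    {"ip": "1.1.1.1", "agent": "X", "time": 5, "url": "/logo.gif"},
    {"ip": "1.1.1.1", "agent": "X", "time": 60, "url": "/products"},
    {"ip": "1.1.1.1", "agent": "X", "time": 4000, "url": "/home"},
    {"ip": "2.2.2.2", "agent": "Y", "time": 10, "url": "/home"},
]
visits = sessions(clean(log))
paths = {user: [[e["url"] for e in v] for v in vs] for user, vs in visits.items()}
print(paths)
```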

Analog – Web Log File Analyser

http://www.analog.cx/

 Gives basic statistics such as
 number of hits
 average hits per time period
 what are the popular pages in your site
 who is visiting your site
 what keywords are users searching for to get to you
 what is being downloaded

SLIDE 19

Web Usage Mining

 Commonly used approaches
 Preprocessing data and adapting existing data mining techniques
 For example association rules: does not take into account the order of the page requests
 Developing novel data mining models

Data Mining on Web Transactions

 Association Rules:
 discovers similarity among sets of items across transactions
 X =====> Y where X, Y are sets of items, with confidence P(Y | X) and support P(X ^ Y)
 Examples:
 60% of clients who accessed /products/, also accessed /products/software/webminer.htm.
 30% of clients who accessed /special-offer.html, placed an online order in /products/software/.
 (Actual Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis} (69.7%, .35%)
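Support and confidence of a rule X ===> Y can be computed by counting transactions. A small sketch on made-up transactions, where support is the fraction of transactions containing all items of X and Y, and confidence is that support divided by the support of X; the page names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """Estimated P(Y | X): support of X and Y together over support of X."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x

# Hypothetical transactions: the set of pages accessed in each visit.
visits = [
    {"/products/", "/products/software/"},
    {"/products/"},
    {"/products/", "/products/software/"},
    {"/home"},
]
rule_support = support(visits, {"/products/", "/products/software/"})
rule_confidence = confidence(visits, {"/products/"}, {"/products/software/"})
print(rule_support, rule_confidence)
```

Note that the counts ignore the order in which pages were requested, which is the limitation of association rules for clickstream data.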

Other Data Mining Techniques

 Sequential Patterns:
 30% of clients who visited /products/software/, had done a search in Yahoo using the keyword “software” before their visit
 60% of clients who placed an online order for WEBMINER, placed another online order for software within 15 days
 Clustering and Classification
 clients who often access /products/software/webminer.html tend to be from educational institutions.
 clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.

Path and Usage Pattern Discovery

 Types of Path/Usage Information
 Most Frequent paths traversed by users
 Entry and Exit Points
 Distribution of user session duration
 Examples:
 60% of clients who accessed /home/products/file1.html, followed the path /home ==> /home/whatsnew ==> /home/products ==> /home/products/file1.html
 (Olympics Web site) 30% of clients who accessed sport specific pages started from the Sneakpeek page.
 65% of clients left the site after 4 or fewer references.

SLIDE 20

A Final Question

 Google Tools: search engine, Google news, Google maps, Gmail, Calendar, Google docs, YouTube, Orkut, Google Desktop, Chrome, Android...
 Do You Trust Google to Resist Data Mining Across Services?
 http://www.readwriteweb.com/archives/do_you_trust_google_to_resist_data_mining_across_services.php

Summary

 Web is huge and dynamic
 Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services
 Web content mining
 Web structure mining
 Web usage mining

References

 Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
 Mining the Web - Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan-Kaufmann Publishers

Thank you !!!