Web Mining Overview - Dr Ahmed Rafea - PowerPoint PPT Presentation

SLIDE 1

Web Mining Overview

Dr Ahmed Rafea

SLIDE 2

Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web

  • Web Data
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining
  • Common Web Mining Techniques
  • Research Directions

SLIDE 3

Web Data

  • Web pages
  • Page structures
  • Usage data
  • Supplemental data
    – Profiles
    – Registration information
    – Cookies

SLIDE 4

Web Mining Taxonomy

Modified from [zai01]

SLIDE 5

Web Content Mining (1)

The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult.

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data mining techniques to provide a higher level of organization for semi-structured data available on the Web.

SLIDE 6

Web Content Mining (2)

  • Techniques for Web content mining can be classified into:
    – Agent-Based Approach
      » Intelligent Search Agents, which use domain characteristics
      » Information Filtering/Categorization, which uses information retrieval techniques
      » Personalized Web Agents, which use user preferences
    – Database Approach
      » Multilevel Databases, which extract metadata from lower-level data and organize it in a structured collection
      » Web Query Systems, which use SQL-like languages to query Web document structure and IR techniques for content queries

SLIDE 7

Web Structure Mining (1)

  • Mine structure (links, graph) of the Web
  • Techniques
    – PageRank
    – CLEVER
  • Create a model of the Web organization.
  • May be combined with content mining to more effectively retrieve important pages.

SLIDE 8

Web Structure Mining (2)

  • PageRank
    – Used by Google.
    – Prioritizes pages returned from a search by looking at Web structure.
    – The importance of a page is calculated based on the number of pages which point to it (its backlinks).
    – Weighting is used to give more importance to backlinks coming from important pages.
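
The backlink weighting described on this slide can be sketched as a short power iteration. This is a minimal illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the `toy_web` graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by weighted backlinks.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it
                # links to, so backlinks from important pages count for more.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: C has the most backlinks.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, page A ends up ranked well above B even though each has a single backlink, because A's backlink comes from the highly ranked page C.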

SLIDE 9

Web Structure Mining (3)

  • CLEVER identifies authoritative and hub pages.
    – Authoritative pages:
      » Highly important pages.
      » Best source for requested information.
    – Hub pages:
      » Contain links to highly important pages.
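
The hub/authority distinction can be sketched with the mutually reinforcing iteration used by HITS-style algorithms, which is the kind of computation CLEVER is usually described as building on. The slide itself does not give the algorithm, so the update rule, the normalization, and the `toy_web` graph below are assumptions for illustration.

```python
import math


def hubs_and_authorities(links, iterations=20):
    """Return (hub, authority) scores for a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link pages pointing at two content pages.
toy_web = {
    "portal": ["paper1", "paper2"],
    "blog": ["paper1"],
    "paper1": [],
    "paper2": [],
}
hub, auth = hubs_and_authorities(toy_web)
```

Here `paper1` receives the highest authority score and `portal` the highest hub score, matching the definitions on the slide.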

SLIDE 10

Web Usage Mining (1)

Web usage mining is the automatic discovery of user access patterns from Web servers.

Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs.

Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via CGI scripts.
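
As a rough illustration of the raw material, the sketch below parses one access-log entry. It assumes the server writes the widely used Common/Combined Log Format; the slides do not name a format, and the sample line, host address, and URLs are made up for the example.

```python
import re

# Assumed layout: host, identity, user, [timestamp], "request", status, size,
# optionally followed by "referrer" and "user agent" (Combined Log Format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log entry.
sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /company/products/ HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08"'
)
record = parse_log_line(sample)
```

Each parsed record gives the client, the page requested, and the referring page, which are the inputs that the usage mining techniques on the later slides work from.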

SLIDE 11

Web Usage Mining (2)

  • Techniques for Web usage mining can be classified into:
    – Pattern Discovery Tools, which use techniques from AI, data mining, and information retrieval to mine knowledge from collected data
    – Pattern Analysis Tools, which are needed to understand, visualize, and interpret these patterns

SLIDE 12

Common Web Mining Techniques

  • The common techniques for Web mining are:
    – clustering,
    – classification,
    – association rules,
    – path analysis, and
    – sequential patterns.

SLIDE 13

Clustering

Clustering analysis allows one to group together clients or data items that have similar characteristics.

Clustering of client information or data items in Web transaction logs can facilitate the development and execution of future marketing strategies, both online and off-line, such as:

    – automated return mail to clients falling within a certain cluster, or
    – dynamically changing a particular site for a client, on a return visit, based on past classification of that client.
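
One way to group clients with similar characteristics is k-means, sketched below. The slides do not name a clustering algorithm, so k-means, the choice of the first `k` points as starting centroids, and the per-client visit counts are all assumptions for illustration.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means: returns (cluster label per point, centroids).

    Uses the first k points as starting centroids to keep the sketch
    deterministic; real use would normally pick them more carefully.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        for i, point in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical features per client: (visits to /products, visits to /support).
clients = [(9, 1), (1, 8), (8, 2), (2, 9), (10, 0), (0, 7)]
labels, centers = kmeans(clients, k=2)
```

The two resulting clusters separate product-oriented clients from support-oriented ones; a site could then target return mail or page layout per cluster.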

SLIDE 14

Classification

Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes.

This profile can then be used to classify new data items that are added to the database.

For example, classification on WWW access logs may lead to the discovery of relationships such as the following:

    – clients from state or government agencies who visit the site tend to be interested in the page /company/product1
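
A very simple form of rule discovery is to record, for each value of one attribute, the most common outcome and how often it holds. The sketch below does this for a made-up access log; the single-attribute approach, the field names `client_type` and `page`, and the records are assumptions, not the method used in the slides.

```python
from collections import Counter, defaultdict


def one_attribute_rules(records, attribute, target):
    """For each attribute value, return (most common target, share of records)."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[target]] += 1
    rules = {}
    for value, target_counts in counts.items():
        label, hits = target_counts.most_common(1)[0]
        rules[value] = (label, hits / sum(target_counts.values()))
    return rules


# Hypothetical access-log records joined with client registration data.
log = [
    {"client_type": "government", "page": "/company/product1"},
    {"client_type": "government", "page": "/company/product1"},
    {"client_type": "government", "page": "/company/product2"},
    {"client_type": "university", "page": "/company/product2"},
    {"client_type": "university", "page": "/company/product2"},
]
rules = one_attribute_rules(log, "client_type", "page")


def classify(record):
    """Predict the page of interest for a new client from the learned rules."""
    return rules[record["client_type"]][0]
```

With these records, the rule for government clients points to /company/product1 and holds for two of their three visits, and `classify` applies that profile to a new client.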

SLIDE 15

Association Rules

  • Rules that govern "databases of transactions where each transaction consists of a set of items."
  • This technique is used to predict the correlation of items "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items."
  • For example, prediction of the percentage of clients accessing a particular URL who will place online orders for a certain product
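
The "degree of confidence" of a rule can be computed directly from transaction counts. The sketch below uses the standard support and confidence measures; the sessions and the page names `/products/` and `/order` are made up for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Hypothetical sessions, each treated as a transaction (a set of pages visited).
sessions = [
    {"/products/", "/order"},
    {"/products/", "/order"},
    {"/products/"},
    {"/about"},
]
rule_support = support(sessions, {"/products/", "/order"})
rule_confidence = confidence(sessions, {"/products/"}, {"/order"})
```

In this toy data the rule "/products/ implies /order" has support 0.5 and confidence 2/3: two of the three sessions that reached the product page went on to place an order.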

SLIDE 16

Path Analysis

  • A technique that involves the generation of some form of graph that "represents relation[s] defined on Web pages."
  • This can be the physical layout of a Web site, in which the Web pages are nodes and the hypertext links between these pages are directed edges.
  • Most graphs are involved in determining frequent traversal patterns or large reference sequences from the physical layout, such as the most frequently visited paths in a Web site.
  • For example, what paths do users travel before they go to a particular URL?
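
The closing question on this slide can be answered by counting the page sequences that immediately precede a target URL in each session. This is a minimal sketch assuming sessions have already been reconstructed from the logs as ordered page lists; the sessions and the `length` parameter are illustrative.

```python
from collections import Counter


def paths_leading_to(sessions, target, length=2):
    """Count the sequences of `length` pages visited immediately before target."""
    counts = Counter()
    for pages in sessions:
        for i, page in enumerate(pages):
            if page == target and i >= length:
                counts[tuple(pages[i - length:i])] += 1
    return counts


# Hypothetical sessions: each is the ordered list of pages one user visited.
sessions = [
    ["/", "/products/", "/order"],
    ["/", "/products/", "/order"],
    ["/search", "/products/", "/order"],
    ["/", "/about"],
]
paths = paths_leading_to(sessions, "/order")
```

Calling `paths.most_common(1)` on this data returns the path from the home page through the product page as the most frequent route to the order page.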

SLIDE 17

Sequential Patterns

  • Applied to Web access server transaction logs.
  • The purpose is to discover sequential patterns that indicate user visit patterns over a certain period.
  • For example, "30% of clients who visited /company/products/ had done a search in Yahoo within the past week on keyword W"
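
A pattern of the kind quoted above can be checked by measuring, among clients who visited a page, the share who performed some earlier event within a time window. The sketch below assumes each client's history is available as (day, event) pairs; the event label `search:W`, the day numbers, and the client names are hypothetical.

```python
def fraction_with_prior_event(histories, target, prior, window_days=7):
    """Among clients who reached target, the fraction who did prior
    within window_days before one of their target visits.

    histories maps a client id to a list of (day, event) tuples.
    """
    visitors = 0
    matched = 0
    for events in histories.values():
        target_days = [day for day, event in events if event == target]
        if not target_days:
            continue
        visitors += 1
        prior_days = [day for day, event in events if event == prior]
        if any(0 < t - p <= window_days for t in target_days for p in prior_days):
            matched += 1
    return matched / visitors if visitors else 0.0


# Hypothetical client histories.
histories = {
    "c1": [(1, "search:W"), (3, "/company/products/")],   # search 2 days earlier
    "c2": [(1, "search:W"), (20, "/company/products/")],  # search too long ago
    "c3": [(5, "/company/products/")],                    # no search at all
    "c4": [(2, "search:W")],                              # never visited the page
}
share = fraction_with_prior_event(histories, "/company/products/", "search:W")
```

Only one of the three clients who visited the page had searched within the preceding week, so the measured share for this toy data is one third.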

SLIDE 18

Research Directions

Intelligent integration and correlation of information from diverse sources such as Web server logs, referral logs, registration files, and index server logs can reveal usage information which may not be evident from any one of them.

There is a need to develop mining algorithms that take as input the existing data, the already mined knowledge, and the new data, and develop a new model in an efficient manner.

There is a need to develop tools which incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge.