1
Web Mining Overview - Dr Ahmed Rafea
2
Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
- Web Data
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Common Web Mining Techniques
- Research Directions
3
Web Data
- Web pages
- Page structures
- Usage data
- Supplemental data
  – Profiles
  – Registration information
  – Cookies
4
Web Mining Taxonomy
Modified from [zai01]
5
Web Content Mining (1)
The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult.
In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data mining techniques to provide a higher level of organization for semi-structured data available on the Web.
6
Web Content Mining (2)
- Techniques for Web content mining can be classified into:
  – Agent-Based Approach
    » Intelligent search agents, which use domain characteristics
    » Information filtering/categorization, which uses information retrieval techniques
    » Personalized Web agents, which use user preferences
  – Database Approach
    » Multilevel databases, which extract metadata from lower-level data and organize it in a structured collection
    » Web query systems, which use SQL-like queries to extract Web document structure, and content queries that use IR techniques
7
Web Structure Mining (1)
- Mine the structure (links, graph) of the Web
- Techniques
  – PageRank
  – CLEVER
- Create a model of the Web organization.
- May be combined with content mining to more effectively retrieve important pages.
8
Web Structure Mining (2)
- PageRank
  – Used by Google
  – Prioritizes pages returned from a search by looking at Web structure.
  – The importance of a page is calculated from the number of pages that point to it (backlinks).
  – Weighting gives more importance to backlinks coming from important pages.
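The backlink weighting described on this slide can be sketched as a simple power iteration. This is a minimal illustration, not Google's implementation: the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph C has the most backlinks and ranks highest. A has a single backlink, but it comes from the important page C, so A outranks B and D, which have none.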
9
Web Structure Mining (3)
- CLEVER identifies authoritative and hub pages.
  – Authoritative pages:
    » Highly important pages.
    » Best source for requested information.
  – Hub pages:
    » Contain links to highly important pages.
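Hub and authority scores of this kind are commonly computed with the mutually reinforcing iteration of Kleinberg's HITS algorithm, on which CLEVER builds. The sketch below shows only that basic iteration, not the CLEVER system itself. The page names in `toy_links` and the iteration count are assumptions made for illustration.

```python
import math


def hits(links, iterations=20):
    """Basic hub/authority iteration (HITS-style sketch).

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: "portal" and "index" point at content pages.
toy_links = {"portal": ["docs", "faq"], "index": ["docs"], "docs": [], "faq": []}
hub_scores, auth_scores = hits(toy_links)
```

Here "docs" ends up as the strongest authority because both linking pages point to it, and "portal" is the strongest hub because it links to both content pages.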
10
Web Usage Mining (1)
Web usage mining is the automatic discovery of user access patterns from Web servers.
Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs.
Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via CGI scripts.
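As a rough illustration of the raw material for usage mining, the sketch below parses access-log lines with a regular expression. It assumes the widely used Combined Log Format, which carries the referrer in the same line. The slides do not name a log format, and the three sample lines are invented for the example.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format, optionally followed by the quoted
# referrer and user-agent fields of the Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample entries, for illustration only.
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "http://example.com/start.html" "Mozilla/4.08"',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /company/products/ HTTP/1.0" '
    '200 512 "http://example.com/index.html" "Mozilla/4.08"',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /company/products/ HTTP/1.0" '
    '200 512 "http://example.com/index.html" "Mozilla/4.08"',
]

records = [r for r in map(parse_line, sample_log) if r]
hits_per_page = Counter(r["path"] for r in records)
```

Once each line is a record with `host`, `path` and `referrer` fields, counting page hits or grouping requests by client becomes a dictionary operation.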
11
Web Usage Mining (2)
- Techniques for Web usage mining can be classified into:
  – Pattern discovery tools, which use techniques from AI, data mining, and information retrieval to mine knowledge from the collected data
  – Pattern analysis tools, which are needed to understand, visualize, and interpret these patterns
12
Common Web Mining Techniques
- The common techniques for Web mining are:
  – clustering,
  – classification,
  – association rules,
  – path analysis, and
  – sequential patterns.
13
Clustering
Clustering analysis allows one to group together clients or data items that have similar characteristics.
Clustering of client information or data items in Web transaction logs can facilitate the development and execution of future marketing strategies, both online and off-line, such as:
– automated return mail to clients falling within a certain cluster, or
– dynamically changing a particular site for a client, on a return visit, based on past classification of that client.
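One way to group clients with similar behaviour is k-means, shown here as a small pure-Python sketch. The slides do not prescribe a clustering algorithm, so k-means is an assumption. The per-client visit counts and the choice to seed centroids with the first k points are also made up for illustration.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means; centroids are seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        for i, point in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical features per client: (visits to product pages, visits to support pages).
client_visits = {"c1": (9, 1), "c2": (1, 7), "c3": (8, 0), "c4": (0, 9)}
labels, centers = kmeans(list(client_visits.values()), k=2)
cluster_of = dict(zip(client_visits, labels))
```

In this toy data the product-oriented clients c1 and c3 fall into one cluster and the support-oriented clients c2 and c4 into the other. A cluster label like this is what a site could then use to select a mailing or a page variant.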
14
Classification
Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes.
This profile can then be used to classify new data items that are added to the database.
For example, classification on WWW access logs may lead to the discovery of relationships such as the following:
– clients from state or government agencies who visit the site tend to be interested in the page /company/product1
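A deliberately simple stand-in for a rule learner is sketched below: for each value of one attribute it records the most frequent target value, then uses that table to classify new records. The attribute names, the `gov`/`edu` domain values, and the page `/company/product2` are assumptions invented for the example. Real classifiers use several attributes and measure rule quality.

```python
from collections import Counter, defaultdict


def learn_rules(records, attribute, target):
    """For each attribute value, keep the most frequent target value."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Invented training data: client domain type and the page the client viewed.
visits = [
    {"domain": "gov", "page": "/company/product1"},
    {"domain": "gov", "page": "/company/product1"},
    {"domain": "gov", "page": "/company/product2"},
    {"domain": "edu", "page": "/company/product2"},
    {"domain": "edu", "page": "/company/product2"},
]

rules = learn_rules(visits, attribute="domain", target="page")


def classify(record):
    """Predict the page of interest for a new client, or None if unseen."""
    return rules.get(record["domain"])


prediction = classify({"domain": "gov"})
```

The learned table plays the role of the profile on this slide: a new client from a `gov` domain is predicted to be interested in `/company/product1`.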
15
Association Rules
- Rules that govern "databases of transactions where each transaction consists of a set of items."
- This technique is used to predict the correlation of items "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items."
- For example, prediction of the percentage of clients accessing a particular URL who will place online orders for a certain product
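The "degree of confidence" of a rule X → Y is conventionally the support of X and Y together divided by the support of X. The sketch below computes both measures over a handful of sessions. The session contents, including the `/order` and `/about` pages, are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Invented sessions: each is the set of URLs one client accessed.
sessions = [
    {"/company/product1", "/order"},
    {"/company/product1"},
    {"/company/product1", "/order"},
    {"/about"},
]

rule_support = support(sessions, {"/company/product1", "/order"})
rule_confidence = confidence(sessions, {"/company/product1"}, {"/order"})
```

In this data three of the four sessions access `/company/product1` and two of those also reach `/order`, so the rule has support 0.5 and confidence 2/3.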
16
Path Analysis
- A technique that involves the generation of some form of graph that "represents relation[s] defined on Web pages."
- This can be the physical layout of a Web site in which the Web pages are nodes and the hypertext links between these pages are directed edges.
- Most graphs are involved in determining frequent traversal patterns or large reference sequences from the physical layout, such as the most frequently visited paths in a Web site.
- For example, what paths do users travel before they go to a particular URL?
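The closing question on this slide can be approximated by counting the page sequences that immediately precede a target URL in each session. The sketch below does that for a fixed path length. The sessions, the `/order` and `/search` URLs, and the default length of 2 are assumptions made for the example.

```python
from collections import Counter


def paths_leading_to(sessions, target, length=2):
    """Count the runs of `length` pages visited immediately before target."""
    counts = Counter()
    for pages in sessions:
        for i, page in enumerate(pages):
            if page == target and i >= length:
                counts[tuple(pages[i - length:i])] += 1
    return counts


# Invented sessions: each is the ordered list of pages one user requested.
sessions = [
    ["/", "/company/", "/company/products/", "/order"],
    ["/", "/search", "/company/products/", "/order"],
    ["/", "/company/", "/company/products/"],
    ["/company/", "/company/products/", "/order"],
]

leading = paths_leading_to(sessions, "/order")
most_common_path, times_seen = leading.most_common(1)[0]
```

For these sessions the most frequent two-page path into `/order` is `/company/` followed by `/company/products/`, seen twice.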
17
Sequential Patterns
- Applied to Web access server transaction logs.
- The purpose is to discover sequential patterns that indicate user visit patterns over a certain period.
- For example, "30% of clients who visited /company/products/ had done a search in Yahoo within the past week on keyword W"
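A statement like the one quoted on this slide can be checked by scanning time-stamped events per client. The sketch below computes the fraction of clients who visited a target page and had performed an earlier action within a given number of days. The event encoding (integer day numbers, the label `search:W`) and all data are assumptions made for illustration.

```python
def fraction_preceded_by(events, target, earlier, window_days):
    """Among clients who did `target`, the fraction who did `earlier`
    at most `window_days` days before some `target` event.

    events maps each client to a list of (day, action) pairs.
    """
    visited = 0
    preceded = 0
    for log in events.values():
        target_days = [day for day, action in log if action == target]
        if not target_days:
            continue
        visited += 1
        earlier_days = [day for day, action in log if action == earlier]
        if any(0 <= t - e <= window_days for t in target_days for e in earlier_days):
            preceded += 1
    return preceded / visited if visited else 0.0


# Invented event logs, keyed by client, with integer day numbers.
events = {
    "c1": [(1, "search:W"), (3, "/company/products/")],
    "c2": [(2, "/company/products/")],
    "c3": [(1, "search:W"), (20, "/company/products/")],
    "c4": [(5, "search:W")],
}

share = fraction_preceded_by(events, "/company/products/", "search:W", window_days=7)
```

Three clients visit the products page here, and only c1 searched within the preceding week, so `share` is 1/3. Client c3 searched too early, and c4 never visited the page.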
18
Research Directions
Intelligent integration and correlation of information from diverse sources, such as Web server logs, referral logs, registration files, and index server logs, can reveal usage information that may not be evident from any one of them.
There is a need to develop mining algorithms that take as input the existing data, the already mined knowledge, and the new data, and develop a new model in an efficient manner.
There is a need to develop tools which incorporate