A Graphical representation for identifier structure in application logs


SLIDE 1

UC Berkeley

Ari Rabkin, Wei Xu, Avani Wildani, Armando Fox, David Patterson and Randy Katz SLAML October 3, 2010

A Graphical representation for identifier structure in application logs

SLIDE 2

Motivation & Summary

Log analysis is fundamentally constrained by the information content of the underlying logs

Need tools to help developers spot flaws in their logging

We propose a compact graph-based representation for log structure

Differs from previous work in analyzing logging behavior, not logs of particular executions

SLIDE 3

Focus on identifiers

We focus on identifiers in logs

– Variable fields that refer to entities in a system.
– Can be operationally defined as variable fields with increasingly many possible strings [Xu 09]

Previous work has modeled logs as sets of concurrent state machines [Fu 09, Tan 08]

– Identifiers tie together messages that correlate to the same state machine

SLIDE 4

Some defects

Imagine a transaction processing system.

3:45 Starting transaction t123
3:46 Transaction failed
3:50 Starting transaction t123
3:51 Finished trans that was started at 3:50.

SLIDE 5

Missing IDs

Imagine a transaction processing system.

3:45 Starting transaction t123
3:46 Transaction failed
3:50 Starting transaction t123
3:51 Finished trans that was started at 3:50.

No ID: the 3:46 failure message does not say which transaction failed

SLIDE 6

Inconsistent IDs

Imagine a transaction processing system.

3:45 Starting transaction t123
3:46 Transaction failed
3:50 Starting transaction t123
3:51 Finished trans that was started at 3:50.

Inconsistent identification: the 3:51 message refers to the transaction by its start time rather than by its ID

SLIDE 7

Ambiguous IDs

Imagine a transaction processing system.

3:45 Starting transaction t123
3:46 Transaction failed
3:50 Starting transaction t123
3:51 Finished trans that was started at 3:50.

Ambiguous identification: the ID t123 is used for the transactions started at both 3:45 and 3:50

SLIDE 8

Goals

Seek a compact representation for logs
Make common logging flaws visible
Facilitate comparison across related logs
Not depend on details of particular execution traces

SLIDE 9

A real example

[Figure: identifier graph; labels not recoverable]

Hadoop datanode logs from Yahoo! M45 cluster

SLIDE 10

Definitions

Definitions:

– A log message is a string.
– Each log message is associated with a specific message type.
– All messages of a type are structurally identical (same set of identifier fields).
– Identifiers belong to identifier classes.

SLIDE 11

Assumptions

Assumptions

– Have representative sample of logs
– Can find message type from message
– Can extract identifiers from messages
– Have identifier class for each identifier field in a message type

SLIDE 12

Core structure

Ex: Starting task t123 on node n

[Diagram: message type "Starting task…" connected to identifier classes Task ID and Host name]

Formally: a graph with
V = {identifier classes} ∪ {message types}
E = {(i, m) | message type m includes an identifier of class i}
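The formal definition above can be sketched in a few lines of Python. This is only an illustration, assuming messages have already been parsed into a message type plus a mapping from identifier class to value; the `parsed` sample and the `build_graph` name are hypothetical and do not come from the authors' prototype.

```python
# Hypothetical parsed input: (message type, {identifier class: value}).
parsed = [
    ("Starting task", {"Task ID": "t123", "Host name": "n1"}),
    ("Starting task", {"Task ID": "t124", "Host name": "n2"}),
    ("Task failed", {"Task ID": "t124"}),
]

def build_graph(messages):
    """Return (identifier classes, message types, edges) of the bipartite graph."""
    id_classes, msg_types, edges = set(), set(), set()
    for msg_type, ids in messages:
        msg_types.add(msg_type)
        for id_class in ids:
            id_classes.add(id_class)
            edges.add((id_class, msg_type))  # edge (i, m): m includes class i
    return id_classes, msg_types, edges

id_classes, msg_types, edges = build_graph(parsed)
```

Because edges are stored in a set, the graph depends only on which identifier classes each message type carries, not on how often or in what order messages occurred.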

SLIDE 13

Subsumption

Sometimes, one identifier includes another.

Model this by adding a graph edge between two identifiers if one includes another.

Call this subsumption

– E.g., URLs subsume host names

[Diagram: edge between Host name and URL]
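The slides do not say how subsumption edges are found. One possible heuristic, shown here purely as an assumption, is substring containment over sampled values: class A subsumes class B if every sampled value of A contains some sampled value of B. The `subsumes` function and the sample values are hypothetical.

```python
def subsumes(outer_values, inner_values):
    """True if every outer value contains at least one inner value as a substring."""
    return bool(outer_values) and all(
        any(inner in outer for inner in inner_values) for outer in outer_values
    )

# Hypothetical sampled identifier values per class.
samples = {
    "URL": {"http://n1.example.com/logs", "http://n2.example.com/status"},
    "Host name": {"n1.example.com", "n2.example.com"},
}

# Directed pairs (subsuming class, subsumed class).
subsumption_edges = {
    (outer, inner)
    for outer in samples
    for inner in samples
    if outer != inner and subsumes(samples[outer], samples[inner])
}
```

A containment test like this can produce false positives for short identifiers (a one-digit ID is a substring of almost anything), so a real tool would likely need hand-written rules or a minimum-length guard.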

SLIDE 14

Frequency

Can encode frequency information on diagram

[Legend: Rare, Medium, Common]

Scaled relative to most-frequent message or identifier
γ-correction: scale by sqrt(freq / Max(freq))
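As a small worked example of the scaling rule, the sketch below applies sqrt(freq / Max(freq)) to a set of counts. The counts and the `gamma_scale` name are made up for illustration; how the resulting value maps to a color or shade in the diagram is not specified on the slide.

```python
import math

def gamma_scale(freqs):
    """Scale each count to (0, 1] relative to the most frequent entry."""
    top = max(freqs.values())
    return {name: math.sqrt(count / top) for name, count in freqs.items()}

# Hypothetical message-type counts.
counts = {"Starting task": 400, "Heartbeat": 100, "Task failed": 4}
scale = gamma_scale(counts)
```

The square root compresses the range: a message type that is 100 times rarer than the most frequent one still gets a scale of about 0.1 rather than 0.01, so rare items remain visible.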

SLIDE 15

Ubiquity

Can show information about joint ID-message statistics

Want to distinguish (ab)normal messages

Definition: the ubiquity of identifier class C for message type T is the fraction of identifiers belonging to class C that appear in messages of type T.

Orthogonal to frequency of message
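The definition can be made concrete with a short sketch. It reads "fraction of identifiers" as the fraction of distinct identifier values of class C, which is an interpretation rather than something the slide states; the input format and the `ubiquity` function name are likewise assumptions.

```python
from collections import defaultdict

def ubiquity(messages):
    """Map (identifier class, message type) to the fraction of the class's
    distinct identifiers that appear in messages of that type."""
    all_ids = defaultdict(set)      # class -> every identifier value seen
    ids_in_type = defaultdict(set)  # (class, type) -> values seen in that type
    for msg_type, ids in messages:
        for id_class, value in ids.items():
            all_ids[id_class].add(value)
            ids_in_type[(id_class, msg_type)].add(value)
    return {
        key: len(values) / len(all_ids[key[0]])
        for key, values in ids_in_type.items()
    }

# Hypothetical sample: four tasks start, one of them fails abnormally.
sample = [("Starting task", {"Task ID": f"t{n}"}) for n in range(4)]
sample.append(("Abnormal failure", {"Task ID": "t2"}))
u = ubiquity(sample)
```

Here every task ID appears in a "Starting task" message (ubiquity 1.0) but only one in four appears in "Abnormal failure" (ubiquity 0.25), regardless of how many times each message is logged.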

SLIDE 16

Drawing ubiquity

Line thickness proportional to ubiquity

[Diagram: identifier class Task ID connected to message types "Abnormal failure" and "Starting task…"]

SLIDE 17

Diagramming defects

Missing ID:

[Diagram: Message 1, Message 2]

Inconsistent IDs:

[Diagram: Message 1, Message 2, ID 1, ID 2]

SLIDE 18

Our prototype

Have a prototype that converts logs into .dot files for rendering with GraphViz

Pluggable parsers
Message strings are omitted from the diagram and output alongside it
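A minimal sketch of the kind of output such a tool might produce is shown below. It is not the authors' prototype: the `to_dot` function, the node shapes, and the mapping from ubiquity to the Graphviz `penwidth` attribute are all assumptions, and the sketch assumes names contain no double quotes. Message types are drawn as numbered nodes and the number-to-string legend is returned separately.

```python
def to_dot(edges, ubiquity=None):
    """Render (identifier class, message type) edges as Graphviz DOT text.

    Returns (dot_text, legend), where legend maps node numbers to the
    message strings that are left off the diagram.
    """
    ubiquity = ubiquity or {}
    msg_types = sorted({m for _, m in edges})
    number = {m: n for n, m in enumerate(msg_types, start=1)}
    lines = ["graph log_structure {"]
    for id_class in sorted({i for i, _ in edges}):
        lines.append(f'  "{id_class}" [shape=box];')
    for m in msg_types:
        lines.append(f'  m{number[m]} [label="{number[m]}"];')
    for id_class, m in sorted(edges):
        # Arbitrary mapping from ubiquity in [0, 1] to a line width in [1, 5].
        width = 1 + 4 * ubiquity.get((id_class, m), 0.0)
        lines.append(f'  "{id_class}" -- m{number[m]} [penwidth={width:.2f}];')
    lines.append("}")
    legend = {n: m for m, n in number.items()}
    return "\n".join(lines), legend

dot_text, legend = to_dot(
    {("Task ID", "Starting task"), ("Host name", "Starting task"),
     ("Task ID", "Task failed")},
    {("Task ID", "Task failed"): 0.25},
)
```

The resulting text can be saved to a file and rendered with, for example, `dot -Tpng`.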

SLIDE 19

A real example, part 2

[Figure: identifier graph; labels not recoverable]

Hadoop datanode logs from Yahoo! M45 cluster

SLIDE 20

Inconsistent identifiers

Logs from Chukwa, an open-source log collection system [Boulon 08, Rabkin 10]

[Figure: two identifier graphs labeled Old and New; node labels not recoverable]

SLIDE 21

Ambiguous identifiers

Logs from SCADS, an experimental system at Berkeley

SLIDE 22

Ambiguous identifiers

Logs from SCADS, an experimental system at Berkeley

SLIDE 23

Comparing logs

Comparing Hadoop JobTracker logs

[Figure: two identifier graphs, one for a 15-node cluster at Berkeley and one for the M45 cluster (professional management); annotation: Missing ID/message; node labels not recoverable]

SLIDE 24

Conclusions

Aspects of log structure can be encoded in succinct diagrams.

Our choice of representation captures:

– Missing identifiers, inconsistent identifiers, and ambiguous identifiers
– How much detail about different topics
– Ratio of routine vs peculiar messages + types

Usable on real systems, even with limited understanding of system and logs

No need for temporal information

SLIDE 25

Questions?

SLIDE 26

A note on parsing

I used semi-hand-written parsers.
Wrote rules to tag identifiers:

– e.g., "job_..." is a job ID

Tokenized lines, identified line by token sequence + constants

– Special cases for numbers

Explored using program analysis to extract messages

– Came out ugly, but cleanable.
– Need to fix names
– Need to merge some categories
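The hand-written approach described on this slide might look roughly like the sketch below: tag tokens with rules, treat numbers specially, and use the remaining constant tokens as the message type. Only the "job_" prefix comes from the slide; the `task_` rule, the placeholder syntax, and the `parse_line` name are invented for illustration.

```python
import re

# Hypothetical tagging rules: (pattern, identifier class).
ID_RULES = [
    (re.compile(r"job_\w+"), "Job ID"),
    (re.compile(r"task_\w+"), "Task ID"),
]
NUMBER = re.compile(r"-?\d+(\.\d+)?")

def parse_line(line):
    """Return (message type, {identifier class: value}) for one log line."""
    template, ids = [], {}
    for token in line.split():
        for pattern, id_class in ID_RULES:
            if pattern.fullmatch(token):
                ids[id_class] = token  # keeps only the last value per class
                template.append(f"<{id_class}>")
                break
        else:
            # Special case: numbers vary between messages but are not identifiers.
            template.append("<num>" if NUMBER.fullmatch(token) else token)
    return " ".join(template), ids

msg_type, ids = parse_line("Starting job_2010_0001 with 12 maps")
```

The message type is the token sequence with variable fields replaced by placeholders, so two lines that differ only in their job ID or counts map to the same type.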

SLIDE 27

Related work

Xu 09
State machines
Entropy as metric?
