[PPT] - Text File Layout Inference Reid Phillips, Wingning Li, Craig PowerPoint Presentation

SLIDE 1

Text File Layout Inference

Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu

SLIDE 2

Outline

Review

– Layout inference problem and properties – Prototype

Assumptions
Flow of control
Oracles
Record length analysis
Content identification
Making a decision
Current work

– Recent prototype enhancements – Results and analysis

Future work

SLIDE 3

Problem description

? ? ? ?

Undefined Data Files Well-Defined Data Files Infer File Layout

? ? ? ?

Undefined Data Files Well-Defined Data Files

Layout Engine Layout Engine Layout Engine Layout Engine

SLIDE 4

File layout definition

Character encoding

– EBCDIC – ASCII

Delimiters

– Field – Record – Text quote (string literals)

File type

– Delimited – Fixed – Hybrid

Record length
Record structure

– Field type – Field location – Field length*

SLIDE 5

Input file types and properties

File Types Record Length Record Delimiter Field Length Field Delimiter Delimited Variable Yes Variable Yes Fixed Fixed No Fixed No Hybrid Fixed Yes Fixed No Additional aspects Record Delimiter Field Delimiter Character Encoding Text Quote Current scope CR, LF, CR-LF Comma, Pipe, Tab ASCII EBCDIC Single quote, double quote

SLIDE 6

File layout definition

Character encoding

– EBCDIC – ASCII

Delimiters

– Field { Comma, Pipe, Tab } – Record { CR, LF, CR-LF } – Text quote { “, ‘ }

File type

– Delimited – Fixed – Hybrid

Record length
Record structure

– Field type – Field location – {Field length}

SLIDE 7

Prototype Assumptions

Structured file
U.S. customer data
Name, address, phone, and email content

types.

Allowing errors and blank fields for a

statistically small number of records

Known field delimiters and record

delimiters as shown earlier

ASCII and EBCDIC encoding

SLIDE 8

Prototype process overview

file sampling encoding analysis ascii internal encoding file type analysis length analysis fixed type delimiter fixed field content id Layout (xml) delimited content id Content Oracle Data set up

racles

Key elements: 1. Sampling 2. Encoding analysis 3. File type analysis 4. File length analysis 5. Delimited field content id 6. Fixed field content id 7. Oracles and their applications 8. XML layout

hybrid

SLIDE 9

Encoding and file type analysis

File encoding analysis is based on statistical

difference on character encoding. For ASCII and EBCDIC, the most significant bit is used.

File type analysis is based on the assumed field

delimiters and record delimiters and statistical measurement of their occurrences.

The prototype has performed very well for both

analyses in real data and synthetic data testing.

Future work in this area could include allowing

additional character encoding such as Latin-1, Unicode and additional delimiters.

SLIDE 10

Content types (oracles)

Name prefix First name Last name Name suffix Directional Street number Street name Street suffix Post Office Box Unit Designator Unit number Full name Address one Address two City State Zip code Email Phone number Boolean

Oracle A string Yes No Address Line

SLIDE 11

Record length analysis

For a fully fixed file, record length must be
determined. Once the length is known, the file is

treated as a hybrid file in field content identification.

Start with a initial length and try all possible

length values until a well known content is lined up nicely according to the oracle of the content type.

The prototype has been tested using synthetic

and real data successfully.*

Future work in this area could include examining

multiple content types determining which provide the best evidence.

*Two real files have been used and initially only one returned a record length. After adjusting an appropriate threshold both returned correct results.

SLIDE 12

Record length analysis

First step
Step N
Final step

SLIDE 13

Delimited field content identification

For delimited files, fields are between delimiters.

Once the file analysis is done, all field locations are known.

For a given field in the sample, each content
racle is consulted by sending each string in the

field to it and receiving a “yes” and “no” answer from it. The percentage of yes answers of each

racle is computed and used to identify the

content type of the field.

SLIDE 14

Fixed field content identification

For hybrid files, records are between record

delimiters and have a fixed length. Once the file analysis is done, record length is known.

To determine the length and starting position of

each field, oracles in conjunction with combinatorial and statistical approach are used.

At this point it is possible for ambiguity among

the fields and thus guesses, potential field positions (PFP), are generated.

SLIDE 15

Fixed field content identification (PFP analysis)

First name? First name? First name? First name? First name?

SLIDE 16

Making a decision

SLIDE 17

Recent prototype enhancements

Web service

– All functionality provided by the prototype can be accessed via a single function call.

buildLayout( fileSample: byte[] ): String

– A web service endpoint was defined that invokes the preceding method

buildLayout( parameterData: byte[], fileSample: byte[] ):

String

– This definition assumes that the server running the engine might not have access to the data and thus the engine parameters and the file data sample must be passed to the endpoint as method parameters.

SLIDE 18

Recent prototype enhancements

Configuration file

– User can set each threshold

Example: Setting the value indicating when fields

line up during record length analysis

– User can set each heuristic

Example: The size of the data sample.

– User can set what delimiters to test for – User can specify what content types to identify by setting what oracles are loaded into the prototype

SLIDE 19

Recent prototype enhancements

Header record

– Currently are able to test for existence – Future work would include using the information from the header record as extra evidence about the file

Assign a label to UNSPECIFIED fields
Correspond header information with the

prototype’s results

Assumption: Only fully delimited files have

header records. Can be easily extended to include fixed file types.

SLIDE 20

Recent prototype enhancements

Comparator program

– Compares the XML output of the engine with an XML file representing the correct output – Lists correct and incorrect results separately – Provides simple statistics – Text dump to the console

SLIDE 21

Recent prototype enhancements

Cross reference analysis

– Correlate related fields such as zipcode, city, and state

Vertical analysis

– Examine the sample by columns rather than by record (horizontal) – Ex: Boolean v. Directional

Contextual analysis

– Ex: Middle name and street number fields

All are performed after the decision

methodologies in an attempt to improve the results with extra evidence.

SLIDE 22

Recent prototype enhancements

Decision functionality

– Currently a work in progress – In order, previous decisions were based on the content types, the PFP, and corresponding statistical counts – New logic will reorder the logic to consider the PFP, the content types, and finally the corresponding statistical counts – New logic appears to be a more generic solution to the decision making process

SLIDE 23

Results for a synthetic fixed and hybrid data file

Prototype results

1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 102 3 6 zip code 103 107 5

Actual record structure

1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 101 2 6 zip code 102 107 6

SLIDE 24

Results for three real data files: File 3 (delimited)

Prototype results
Actual record structure

1 Seq # 2 filler 3 name 4-5 filler 6 address line one 7 filler 8 city 9 state 10 zip code 11 zip plus four 12-14 filler 15 name 16-28 filler 29 first name 30 middle name 29 last name 32-56 filler 57 address line one 58-67 filler 68 directional 69 street number 70-71 filler 72 street name 73 filler 74 street suffix 75-81 filler 82 city 83 state 84 zip code 85-87 filler 88 boolean 89 Seq # 1 * UNSPECIFIED * 2 * UNSPECIFIED * 3 name 4-5 * UNSPECIFIED * 6 address line one 7 * UNSPECIFIED * 8 city 9 state 10 zip code 11 zip plus four 12-14 * UNSPECIFIED * 15 name 16-28 * UNSPECIFIED * 29 first name 30 middle name 29 last name 32-56 * UNSPECIFIED * 57 address line one 58-67 * UNSPECIFIED * 68 directional 69 street number 70-71 * UNSPECIFIED * 72 street name 73 * UNSPECIFIED * 74 street suffix 75-81 * UNSPECIFIED * 82 city 83 state 84 zip code 85-87 * UNSPECIFIED * 88 directional 89 street number

SLIDE 25

Results for three real data files: File 2 (hybrid)

Prototype results

1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 * UNSPECIFIED * 96 99 4 7 * UNSPECIFIED * 100 109 10 9 first name 110 129 20 10 last name 130 169 40 11 * UNSPECIFIED * 170 174 5 12 * UNSPECIFIED * 175 185 11 13 boolean 186 186 1

Actual record structure

1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 zip plus four 96 99 4 7 ID 100 109 10 9 first name 110 129 20 10 last name 130 145 16 11 filler 146 169 24 12 zip code 170 174 5 13 dash 175 175 1 14 zip plus four 176 179 4 15 ID 180 185 6 16 flag 186 186 1

SLIDE 26

Results for three real data files: File 6 (hybrid)

Prototype results

1 * UNSPECIFIED * 1 21 21 2 * UNSPECIFIED * 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 88 14 6 * UNSPECIFIED * 89 95 7 7 city 96 123 28 8 state 124 125 2 9 zip code 126 135 10 10 * UNSPECIFIED * 136 143 8 11 address line 144 172 29 12 * UNSPECIFIED * 173 182 10

Actual record structure

1 ID 1 21 21 2 account # 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 95 21 6 city 96 123 28 7 state 124 125 2 8 zip code 126 135 10 9 zip plus four 136 143 8 10 address line 144 182 39

SLIDE 27

Future Work

In addition to all the future work outlined earlier,

the following is a partial list:

– Additional information for a content (e.g., first name last last or last name, first name) – Process the file completely to gather data quality statistics of each content type using oracles or other tools – Incorporate results from (John’s) UALR team – Improve efficiency and system architecture for ease of extension – Refinement of XML – Interface to the DSML project – Develop use cases for batch processing (low performance and high accuracy) and interactive processing with GUI (high performance and very high accuracy with user help)

SLIDE 28

Questions? Discussions ?

SLIDE 29

File type – delimited

SLIDE 30

File type – fixed

SLIDE 31

File type – hybrid

SLIDE 32

Sample delimited text output

SLIDE 33

Text File Layout Inference

Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu

Outline

– Layout inference problem and properties – Prototype

– Recent prototype enhancements – Results and analysis

Problem description

? ? ? ?

? ? ? ?

File layout definition

– Field type – Field location – Field length*

Input file types and properties

File layout definition

– Field type – Field location – {Field length}

Prototype Assumptions

types.

statistically small number of records

delimiters as shown earlier

Prototype process overview

Key elements: 1. Sampling 2. Encoding analysis 3. File type analysis 4. File length analysis 5. Delimited field content id 6. Fixed field content id 7. Oracles and their applications 8. XML layout

Encoding and file type analysis

difference on character encoding. For ASCII and EBCDIC, the most significant bit is used.

delimiters and record delimiters and statistical measurement of their occurrences.

analyses in real data and synthetic data testing.

additional character encoding such as Latin-1, Unicode and additional delimiters.

Content types (oracles)

Name prefix First name Last name Name suffix Directional Street number Street name Street suffix Post Office Box Unit Designator Unit number Full name Address one Address two City State Zip code Email Phone number Boolean

Record length analysis

treated as a hybrid file in field content identification.

length values until a well known content is lined up nicely according to the oracle of the content type.

and real data successfully.*

multiple content types determining which provide the best evidence.

Record length analysis

Delimited field content identification

Once the file analysis is done, all field locations are known.

field to it and receiving a “yes” and “no” answer from it. The percentage of yes answers of each

content type of the field.

Fixed field content identification

delimiters and have a fixed length. Once the file analysis is done, record length is known.

each field, oracles in conjunction with combinatorial and statistical approach are used.

the fields and thus guesses, potential field positions (PFP), are generated.

Fixed field content identification (PFP analysis)

Making a decision

Recent prototype enhancements

– All functionality provided by the prototype can be accessed via a single function call.

– A web service endpoint was defined that invokes the preceding method

String

– This definition assumes that the server running the engine might not have access to the data and thus the engine parameters and the file data sample must be passed to the endpoint as method parameters.

Recent prototype enhancements

– User can set each threshold

line up during record length analysis

– User can set each heuristic

– User can set what delimiters to test for – User can specify what content types to identify by setting what oracles are loaded into the prototype

Recent prototype enhancements

– Currently are able to test for existence – Future work would include using the information from the header record as extra evidence about the file

prototype’s results

header records. Can be easily extended to include fixed file types.

Recent prototype enhancements

– Compares the XML output of the engine with an XML file representing the correct output – Lists correct and incorrect results separately – Provides simple statistics – Text dump to the console

Recent prototype enhancements

– Correlate related fields such as zipcode, city, and state

– Examine the sample by columns rather than by record (horizontal) – Ex: Boolean v. Directional

– Ex: Middle name and street number fields

methodologies in an attempt to improve the results with extra evidence.

Recent prototype enhancements

Results for a synthetic fixed and hybrid data file

Results for three real data files: File 3 (delimited)

Results for three real data files: File 2 (hybrid)

Results for three real data files: File 6 (hybrid)

Future Work

the following is a partial list:

Questions? Discussions ?

File type – delimited

File type – fixed

File type – hybrid

Sample delimited text output

Sample fixed text output