Text File Layout Inference Reid Phillips, Wingning Li, Craig - - PowerPoint PPT Presentation

text file layout inference
SMART_READER_LITE
LIVE PREVIEW

Text File Layout Inference Reid Phillips, Wingning Li, Craig - - PowerPoint PPT Presentation

Text File Layout Inference Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu Outline Review Layout inference problem and properties Prototype Assumptions Flow of control


slide-1
SLIDE 1

Text File Layout Inference

Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu

slide-2
SLIDE 2

Outline

  • Review

– Layout inference problem and properties – Prototype

  • Assumptions
  • Flow of control
  • Oracles
  • Record length analysis
  • Content identification
  • Making a decision
  • Current work

– Recent prototype enhancements – Results and analysis

  • Future work
slide-3
SLIDE 3

Problem description

? ? ? ?

Undefined Data Files Well-Defined Data Files Infer File Layout

? ? ? ?

Undefined Data Files Well-Defined Data Files

Layout Engine Layout Engine Layout Engine Layout Engine

slide-4
SLIDE 4

File layout definition

  • Character encoding

– EBCDIC – ASCII

  • Delimiters

– Field – Record – Text quote (string literals)

  • File type

– Delimited – Fixed – Hybrid

  • Record length
  • Record structure

– Field type – Field location – Field length*

slide-5
SLIDE 5

Input file types and properties

File Types Record Length Record Delimiter Field Length Field Delimiter Delimited Variable Yes Variable Yes Fixed Fixed No Fixed No Hybrid Fixed Yes Fixed No Additional aspects Record Delimiter Field Delimiter Character Encoding Text Quote Current scope CR, LF, CR-LF Comma, Pipe, Tab ASCII EBCDIC Single quote, double quote

slide-6
SLIDE 6

File layout definition

  • Character encoding

– EBCDIC – ASCII

  • Delimiters

– Field { Comma, Pipe, Tab } – Record { CR, LF, CR-LF } – Text quote { “, ‘ }

  • File type

– Delimited – Fixed – Hybrid

  • Record length
  • Record structure

– Field type – Field location – {Field length}

slide-7
SLIDE 7

Prototype Assumptions

  • Structured file
  • U.S. customer data
  • Name, address, phone, and email content

types.

  • Allowing errors and blank fields for a

statistically small number of records

  • Known field delimiters and record

delimiters as shown earlier

  • ASCII and EBCDIC encoding
slide-8
SLIDE 8

Prototype process overview

file sampling encoding analysis ascii internal encoding file type analysis length analysis fixed type delimiter fixed field content id Layout (xml) delimited content id Content Oracle Data set up

  • racles

Key elements: 1. Sampling 2. Encoding analysis 3. File type analysis 4. File length analysis 5. Delimited field content id 6. Fixed field content id 7. Oracles and their applications 8. XML layout

hybrid

slide-9
SLIDE 9

Encoding and file type analysis

  • File encoding analysis is based on statistical

difference on character encoding. For ASCII and EBCDIC, the most significant bit is used.

  • File type analysis is based on the assumed field

delimiters and record delimiters and statistical measurement of their occurrences.

  • The prototype has performed very well for both

analyses in real data and synthetic data testing.

  • Future work in this area could include allowing

additional character encoding such as Latin-1, Unicode and additional delimiters.

slide-10
SLIDE 10

Content types (oracles)

Name prefix First name Last name Name suffix Directional Street number Street name Street suffix Post Office Box Unit Designator Unit number Full name Address one Address two City State Zip code Email Phone number Boolean

Oracle A string Yes No Address Line

slide-11
SLIDE 11

Record length analysis

  • For a fully fixed file, record length must be
  • determined. Once the length is known, the file is

treated as a hybrid file in field content identification.

  • Start with a initial length and try all possible

length values until a well known content is lined up nicely according to the oracle of the content type.

  • The prototype has been tested using synthetic

and real data successfully.*

  • Future work in this area could include examining

multiple content types determining which provide the best evidence.

*Two real files have been used and initially only one returned a record length. After adjusting an appropriate threshold both returned correct results.

slide-12
SLIDE 12

Record length analysis

  • First step
  • Step N
  • Final step
slide-13
SLIDE 13

Delimited field content identification

  • For delimited files, fields are between delimiters.

Once the file analysis is done, all field locations are known.

  • For a given field in the sample, each content
  • racle is consulted by sending each string in the

field to it and receiving a “yes” and “no” answer from it. The percentage of yes answers of each

  • racle is computed and used to identify the

content type of the field.

slide-14
SLIDE 14

Fixed field content identification

  • For hybrid files, records are between record

delimiters and have a fixed length. Once the file analysis is done, record length is known.

  • To determine the length and starting position of

each field, oracles in conjunction with combinatorial and statistical approach are used.

  • At this point it is possible for ambiguity among

the fields and thus guesses, potential field positions (PFP), are generated.

slide-15
SLIDE 15

Fixed field content identification (PFP analysis)

First name? First name? First name? First name? First name?

slide-16
SLIDE 16

Making a decision

slide-17
SLIDE 17

Recent prototype enhancements

  • Web service

– All functionality provided by the prototype can be accessed via a single function call.

  • buildLayout( fileSample: byte[] ): String

– A web service endpoint was defined that invokes the preceding method

  • buildLayout( parameterData: byte[], fileSample: byte[] ):

String

– This definition assumes that the server running the engine might not have access to the data and thus the engine parameters and the file data sample must be passed to the endpoint as method parameters.

slide-18
SLIDE 18

Recent prototype enhancements

  • Configuration file

– User can set each threshold

  • Example: Setting the value indicating when fields

line up during record length analysis

– User can set each heuristic

  • Example: The size of the data sample.

– User can set what delimiters to test for – User can specify what content types to identify by setting what oracles are loaded into the prototype

slide-19
SLIDE 19

Recent prototype enhancements

  • Header record

– Currently are able to test for existence – Future work would include using the information from the header record as extra evidence about the file

  • Assign a label to UNSPECIFIED fields
  • Correspond header information with the

prototype’s results

  • Assumption: Only fully delimited files have

header records. Can be easily extended to include fixed file types.

slide-20
SLIDE 20

Recent prototype enhancements

  • Comparator program

– Compares the XML output of the engine with an XML file representing the correct output – Lists correct and incorrect results separately – Provides simple statistics – Text dump to the console

slide-21
SLIDE 21

Recent prototype enhancements

  • Cross reference analysis

– Correlate related fields such as zipcode, city, and state

  • Vertical analysis

– Examine the sample by columns rather than by record (horizontal) – Ex: Boolean v. Directional

  • Contextual analysis

– Ex: Middle name and street number fields

  • All are performed after the decision

methodologies in an attempt to improve the results with extra evidence.

slide-22
SLIDE 22

Recent prototype enhancements

  • Decision functionality

– Currently a work in progress – In order, previous decisions were based on the content types, the PFP, and corresponding statistical counts – New logic will reorder the logic to consider the PFP, the content types, and finally the corresponding statistical counts – New logic appears to be a more generic solution to the decision making process

slide-23
SLIDE 23

Results for a synthetic fixed and hybrid data file

  • Prototype results

1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 102 3 6 zip code 103 107 5

  • Actual record structure

1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 101 2 6 zip code 102 107 6

slide-24
SLIDE 24

Results for three real data files: File 3 (delimited)

  • Prototype results
  • Actual record structure

1 Seq # 2 filler 3 name 4-5 filler 6 address line one 7 filler 8 city 9 state 10 zip code 11 zip plus four 12-14 filler 15 name 16-28 filler 29 first name 30 middle name 29 last name 32-56 filler 57 address line one 58-67 filler 68 directional 69 street number 70-71 filler 72 street name 73 filler 74 street suffix 75-81 filler 82 city 83 state 84 zip code 85-87 filler 88 boolean 89 Seq # 1 * UNSPECIFIED * 2 * UNSPECIFIED * 3 name 4-5 * UNSPECIFIED * 6 address line one 7 * UNSPECIFIED * 8 city 9 state 10 zip code 11 zip plus four 12-14 * UNSPECIFIED * 15 name 16-28 * UNSPECIFIED * 29 first name 30 middle name 29 last name 32-56 * UNSPECIFIED * 57 address line one 58-67 * UNSPECIFIED * 68 directional 69 street number 70-71 * UNSPECIFIED * 72 street name 73 * UNSPECIFIED * 74 street suffix 75-81 * UNSPECIFIED * 82 city 83 state 84 zip code 85-87 * UNSPECIFIED * 88 directional 89 street number

slide-25
SLIDE 25

Results for three real data files: File 2 (hybrid)

  • Prototype results

1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 * UNSPECIFIED * 96 99 4 7 * UNSPECIFIED * 100 109 10 9 first name 110 129 20 10 last name 130 169 40 11 * UNSPECIFIED * 170 174 5 12 * UNSPECIFIED * 175 185 11 13 boolean 186 186 1

  • Actual record structure

1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 zip plus four 96 99 4 7 ID 100 109 10 9 first name 110 129 20 10 last name 130 145 16 11 filler 146 169 24 12 zip code 170 174 5 13 dash 175 175 1 14 zip plus four 176 179 4 15 ID 180 185 6 16 flag 186 186 1

slide-26
SLIDE 26

Results for three real data files: File 6 (hybrid)

  • Prototype results

1 * UNSPECIFIED * 1 21 21 2 * UNSPECIFIED * 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 88 14 6 * UNSPECIFIED * 89 95 7 7 city 96 123 28 8 state 124 125 2 9 zip code 126 135 10 10 * UNSPECIFIED * 136 143 8 11 address line 144 172 29 12 * UNSPECIFIED * 173 182 10

  • Actual record structure

1 ID 1 21 21 2 account # 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 95 21 6 city 96 123 28 7 state 124 125 2 8 zip code 126 135 10 9 zip plus four 136 143 8 10 address line 144 182 39

slide-27
SLIDE 27

Future Work

  • In addition to all the future work outlined earlier,

the following is a partial list:

– Additional information for a content (e.g., first name last last or last name, first name) – Process the file completely to gather data quality statistics of each content type using oracles or other tools – Incorporate results from (John’s) UALR team – Improve efficiency and system architecture for ease of extension – Refinement of XML – Interface to the DSML project – Develop use cases for batch processing (low performance and high accuracy) and interactive processing with GUI (high performance and very high accuracy with user help)

slide-28
SLIDE 28

Questions? Discussions ?

slide-29
SLIDE 29

File type – delimited

slide-30
SLIDE 30

File type – fixed

slide-31
SLIDE 31

File type – hybrid

slide-32
SLIDE 32

Sample delimited text output

slide-33
SLIDE 33

Sample fixed text output