Text File Layout Inference Reid Phillips, Wingning Li, Craig - - PowerPoint PPT Presentation
Text File Layout Inference Reid Phillips, Wingning Li, Craig - - PowerPoint PPT Presentation
Text File Layout Inference Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu Outline Review Layout inference problem and properties Prototype Assumptions Flow of control
Outline
- Review
– Layout inference problem and properties – Prototype
- Assumptions
- Flow of control
- Oracles
- Record length analysis
- Content identification
- Making a decision
- Current work
– Recent prototype enhancements – Results and analysis
- Future work
Problem description
? ? ? ?
Undefined Data Files Well-Defined Data Files Infer File Layout
? ? ? ?
Undefined Data Files Well-Defined Data Files
Layout Engine Layout Engine Layout Engine Layout Engine
File layout definition
- Character encoding
– EBCDIC – ASCII
- Delimiters
– Field – Record – Text quote (string literals)
- File type
– Delimited – Fixed – Hybrid
- Record length
- Record structure
– Field type – Field location – Field length*
Input file types and properties
File Types Record Length Record Delimiter Field Length Field Delimiter Delimited Variable Yes Variable Yes Fixed Fixed No Fixed No Hybrid Fixed Yes Fixed No Additional aspects Record Delimiter Field Delimiter Character Encoding Text Quote Current scope CR, LF, CR-LF Comma, Pipe, Tab ASCII EBCDIC Single quote, double quote
File layout definition
- Character encoding
– EBCDIC – ASCII
- Delimiters
– Field { Comma, Pipe, Tab } – Record { CR, LF, CR-LF } – Text quote { “, ‘ }
- File type
– Delimited – Fixed – Hybrid
- Record length
- Record structure
– Field type – Field location – {Field length}
Prototype Assumptions
- Structured file
- U.S. customer data
- Name, address, phone, and email content
types.
- Allowing errors and blank fields for a
statistically small number of records
- Known field delimiters and record
delimiters as shown earlier
- ASCII and EBCDIC encoding
Prototype process overview
file sampling encoding analysis ascii internal encoding file type analysis length analysis fixed type delimiter fixed field content id Layout (xml) delimited content id Content Oracle Data set up
- racles
Key elements: 1. Sampling 2. Encoding analysis 3. File type analysis 4. File length analysis 5. Delimited field content id 6. Fixed field content id 7. Oracles and their applications 8. XML layout
hybrid
Encoding and file type analysis
- File encoding analysis is based on statistical
difference on character encoding. For ASCII and EBCDIC, the most significant bit is used.
- File type analysis is based on the assumed field
delimiters and record delimiters and statistical measurement of their occurrences.
- The prototype has performed very well for both
analyses in real data and synthetic data testing.
- Future work in this area could include allowing
additional character encoding such as Latin-1, Unicode and additional delimiters.
Content types (oracles)
Name prefix First name Last name Name suffix Directional Street number Street name Street suffix Post Office Box Unit Designator Unit number Full name Address one Address two City State Zip code Email Phone number Boolean
Oracle A string Yes No Address Line
Record length analysis
- For a fully fixed file, record length must be
- determined. Once the length is known, the file is
treated as a hybrid file in field content identification.
- Start with a initial length and try all possible
length values until a well known content is lined up nicely according to the oracle of the content type.
- The prototype has been tested using synthetic
and real data successfully.*
- Future work in this area could include examining
multiple content types determining which provide the best evidence.
*Two real files have been used and initially only one returned a record length. After adjusting an appropriate threshold both returned correct results.
Record length analysis
- First step
- Step N
- Final step
Delimited field content identification
- For delimited files, fields are between delimiters.
Once the file analysis is done, all field locations are known.
- For a given field in the sample, each content
- racle is consulted by sending each string in the
field to it and receiving a “yes” and “no” answer from it. The percentage of yes answers of each
- racle is computed and used to identify the
content type of the field.
Fixed field content identification
- For hybrid files, records are between record
delimiters and have a fixed length. Once the file analysis is done, record length is known.
- To determine the length and starting position of
each field, oracles in conjunction with combinatorial and statistical approach are used.
- At this point it is possible for ambiguity among
the fields and thus guesses, potential field positions (PFP), are generated.
Fixed field content identification (PFP analysis)
First name? First name? First name? First name? First name?
Making a decision
Recent prototype enhancements
- Web service
– All functionality provided by the prototype can be accessed via a single function call.
- buildLayout( fileSample: byte[] ): String
– A web service endpoint was defined that invokes the preceding method
- buildLayout( parameterData: byte[], fileSample: byte[] ):
String
– This definition assumes that the server running the engine might not have access to the data and thus the engine parameters and the file data sample must be passed to the endpoint as method parameters.
Recent prototype enhancements
- Configuration file
– User can set each threshold
- Example: Setting the value indicating when fields
line up during record length analysis
– User can set each heuristic
- Example: The size of the data sample.
– User can set what delimiters to test for – User can specify what content types to identify by setting what oracles are loaded into the prototype
Recent prototype enhancements
- Header record
– Currently are able to test for existence – Future work would include using the information from the header record as extra evidence about the file
- Assign a label to UNSPECIFIED fields
- Correspond header information with the
prototype’s results
- Assumption: Only fully delimited files have
header records. Can be easily extended to include fixed file types.
Recent prototype enhancements
- Comparator program
– Compares the XML output of the engine with an XML file representing the correct output – Lists correct and incorrect results separately – Provides simple statistics – Text dump to the console
Recent prototype enhancements
- Cross reference analysis
– Correlate related fields such as zipcode, city, and state
- Vertical analysis
– Examine the sample by columns rather than by record (horizontal) – Ex: Boolean v. Directional
- Contextual analysis
– Ex: Middle name and street number fields
- All are performed after the decision
methodologies in an attempt to improve the results with extra evidence.
Recent prototype enhancements
- Decision functionality
– Currently a work in progress – In order, previous decisions were based on the content types, the PFP, and corresponding statistical counts – New logic will reorder the logic to consider the PFP, the content types, and finally the corresponding statistical counts – New logic appears to be a more generic solution to the decision making process
Results for a synthetic fixed and hybrid data file
- Prototype results
1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 102 3 6 zip code 103 107 5
- Actual record structure
1 full name 1 27 27 2 address one 28 54 27 3 address two 55 81 27 4 city 82 99 18 5 state 100 101 2 6 zip code 102 107 6
Results for three real data files: File 3 (delimited)
- Prototype results
- Actual record structure
1 Seq # 2 filler 3 name 4-5 filler 6 address line one 7 filler 8 city 9 state 10 zip code 11 zip plus four 12-14 filler 15 name 16-28 filler 29 first name 30 middle name 29 last name 32-56 filler 57 address line one 58-67 filler 68 directional 69 street number 70-71 filler 72 street name 73 filler 74 street suffix 75-81 filler 82 city 83 state 84 zip code 85-87 filler 88 boolean 89 Seq # 1 * UNSPECIFIED * 2 * UNSPECIFIED * 3 name 4-5 * UNSPECIFIED * 6 address line one 7 * UNSPECIFIED * 8 city 9 state 10 zip code 11 zip plus four 12-14 * UNSPECIFIED * 15 name 16-28 * UNSPECIFIED * 29 first name 30 middle name 29 last name 32-56 * UNSPECIFIED * 57 address line one 58-67 * UNSPECIFIED * 68 directional 69 street number 70-71 * UNSPECIFIED * 72 street name 73 * UNSPECIFIED * 74 street suffix 75-81 * UNSPECIFIED * 82 city 83 state 84 zip code 85-87 * UNSPECIFIED * 88 directional 89 street number
Results for three real data files: File 2 (hybrid)
- Prototype results
1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 * UNSPECIFIED * 96 99 4 7 * UNSPECIFIED * 100 109 10 9 first name 110 129 20 10 last name 130 169 40 11 * UNSPECIFIED * 170 174 5 12 * UNSPECIFIED * 175 185 11 13 boolean 186 186 1
- Actual record structure
1 full name 1 48 48 2 address line 49 72 24 3 city 73 88 16 4 state 89 90 2 5 zip code 91 95 5 6 zip plus four 96 99 4 7 ID 100 109 10 9 first name 110 129 20 10 last name 130 145 16 11 filler 146 169 24 12 zip code 170 174 5 13 dash 175 175 1 14 zip plus four 176 179 4 15 ID 180 185 6 16 flag 186 186 1
Results for three real data files: File 6 (hybrid)
- Prototype results
1 * UNSPECIFIED * 1 21 21 2 * UNSPECIFIED * 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 88 14 6 * UNSPECIFIED * 89 95 7 7 city 96 123 28 8 state 124 125 2 9 zip code 126 135 10 10 * UNSPECIFIED * 136 143 8 11 address line 144 172 29 12 * UNSPECIFIED * 173 182 10
- Actual record structure
1 ID 1 21 21 2 account # 22 37 16 3 first name 38 58 21 4 middle name 59 74 16 5 last name 75 95 21 6 city 96 123 28 7 state 124 125 2 8 zip code 126 135 10 9 zip plus four 136 143 8 10 address line 144 182 39
Future Work
- In addition to all the future work outlined earlier,