Datafile Analyzer

Go application for analyzing and understanding flat data files.

Introduction

Datafile Analyzer is a tool for developers who need to work with datafiles supplied by a 3rd party, such as a client or an online tool. It analyzes the flat file column by column, reports information about each column, and warns of typical data errors if any are found.

The ideas for this tool were formed during my recent experience of having to integrate 10 large CSV files, supplied daily by a 3rd party, into my client's database. The first step I normally take in such a situation is to open the files in Excel and start sorting and filtering in order to gain an understanding of the data: what are the intended data types, which columns relate to others, which columns contain clean data and which need cleaning up, etc. This tool is intended to accelerate that first stage by automating many typical analysis steps.

The tool is intended to run on a developer's local machine. It accepts command line parameters, reads the file into memory, analyzes it, and then automatically starts a local web server on a configurable localhost port to provide a visual display of the results, as shown above. Further usage details and download links are below.
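
To make that flow concrete, here is a minimal sketch in Go of the read-analyze-serve shape just described. The analyzeFile function and its string output are stand-ins invented for illustration; only the port (8000) and the "starting server" message come from the usage notes below:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // analyzeFile stands in for the analysis step; the real tool builds
    // per-column statistics here rather than a single string.
    func analyzeFile(path string) string {
        return fmt.Sprintf("results for %s", path)
    }

    func main() {
        results := analyzeFile("/full/path/to/file")
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprint(w, results)
        })
        log.Println("starting server")
        log.Fatal(http.ListenAndServe(":8000", nil))
    }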

Features

Language and 3rd party libraries

The program is written in Go. It uses several external packages in addition to the standard library.

Download and setup

  1. Download the zipped config and frontend files and unpack them into a directory
  2. Download the appropriate executable for your OS (Linux, Mac, Windows) and move it into the same directory

Usage

  1. Open the text file in a text editor and check whether or not there is a header record and which character is used as the value separator
  2. In the command prompt, cd to the directory containing the executable and the unpacked config and frontend files
  3. If the file has a header record and the separator is a comma, then the target file is the only parameter needed:
    ./datafile_analyzer_linux_amd64-0.1 -file="/full/path/to/file"
  4. If the file does not have a header record, add this flag: -hasHeader=false
  5. If the file has a separator other than comma, e.g. a pipe, add this: -separator="|" (NB: use "\t" for tab-separated files)
  6. For files where a comma, rather than a dot, is used as the decimal point in numbers, add this: -decimalPoint="," (a sketch of how these flags might be declared in Go follows this list)
  7. The file will now be read and processed; when processing is finished, the prompt will say "starting server"
  8. In your browser, go to localhost:8000 to view the results
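
For reference, the flags above could be declared with the standard library's flag package along these lines. This is a sketch based only on the flag names and defaults documented in the steps above, not the tool's actual source:

    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        file := flag.String("file", "", "full path to the data file")
        hasHeader := flag.Bool("hasHeader", true, "whether the first record is a header")
        separator := flag.String("separator", ",", "value separator, e.g. \"|\" or \"\\t\"")
        decimalPoint := flag.String("decimalPoint", ".", "character used as the decimal point in numbers")
        flag.Parse()

        fmt.Printf("analyzing %s (header=%v, separator=%q, decimal point=%q)\n",
            *file, *hasHeader, *separator, *decimalPoint)
    }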

Troubleshooting

Each value is treated as a column / line breaks are not recognized properly

  1. Open the text file in a text editor
  2. Check the line termination of the text file. It needs to be CR/LF or LF, not CR alone. For example, in Sublime Text 3, under View - Line Endings, the correct setting is either "Windows" or "Unix", but not "Mac OS 9".
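
If you cannot change the setting in your editor, a quick one-off fix is to normalize the endings yourself. A minimal Go sketch, assuming the file fits in memory and using an illustrative file name:

    package main

    import (
        "bytes"
        "log"
        "os"
    )

    func main() {
        data, err := os.ReadFile("datafile.csv") // illustrative path
        if err != nil {
            log.Fatal(err)
        }
        // Normalize CRLF first so it does not become a double newline,
        // then convert any remaining lone CRs (Mac OS 9 style) to LF.
        data = bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
        data = bytes.ReplaceAll(data, []byte("\r"), []byte("\n"))
        if err := os.WriteFile("datafile.csv", data, 0644); err != nil {
            log.Fatal(err)
        }
    }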

Interpretation of results

For this explanation I will be working with a file containing some real data on restaurants in the US from Yelp, converted from JSON to CSV using json2csv, and enhanced with some invented columns for demonstration purposes. The real columns are business_id, open, city, review_count, stars and attributes.Attire; the rest are made up.

The results page, as seen at the top of this page, repeats the execution parameters in a single row, and the remainder of the page consists of a fully expanded "accordion" list in which each column in the data file is represented as a collapsible box, one after the other, in the order in which the columns appear in the file.

As a first step on a new file, we can collapse all the columns to their summary rows by clicking any of the column name links on the left side three times in a row. That results in a nice overview of each column, as shown below:

I would make the following remarks about this file based on this view:

  1. All the columns are named, which helps us a lot in working out what is happening. Unnamed columns would appear as "Column 2", "Column 3", etc.
  2. There is a unique column, business_id, which is very good news if we want to integrate the data into a database. Unique columns have a light blue background
  3. The data looks fairly clean. The tool could guess a datatype for each column, and a quick glance over the sample values doesn't throw up any major warning signs
  4. The columns business_id and daily_customers are the only ones with a significant proportion of distinct values. Many of the columns have few enough distinct values that they are treated as "type" columns (more on this below, and see the classification sketch after this list)
  5. The final column is empty. Empty columns have a grey background. We would verify this against the data source just in case the column was not intended to be empty
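
To illustrate how a column might end up classified as unique, "type", data or empty, here is a rough sketch of such a classification in Go. The thresholds are invented for illustration; the tool's actual rules may well differ:

    package main

    import "fmt"

    // classifyColumn is an illustrative guess at how a column could be
    // categorized from its values; the 5% threshold is made up.
    func classifyColumn(values []string) string {
        distinct := make(map[string]struct{})
        nonEmpty := 0
        for _, v := range values {
            if v != "" {
                nonEmpty++
                distinct[v] = struct{}{}
            }
        }
        switch {
        case nonEmpty == 0:
            return "empty" // grey background
        case len(distinct) == len(values):
            return "unique" // light blue background
        case float64(len(distinct)) < 0.05*float64(nonEmpty):
            return "type" // few distinct values: bar chart of most common values
        default:
            return "data" // many distinct values: distribution chart
        }
    }

    func main() {
        fmt.Println(classifyColumn([]string{"a", "b", "c"})) // unique
    }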

Now we can press F5 to return to the default view, which is the fully expanded accordion, and scan down the columns one at a time. This is instantaneous, since the results are stored in program memory and do not change unless the program is re-executed.

Let's start with the first column, business_id, which we already know contains unique values:

  1. Glancing at the string samples, it's clear that these are randomly-generated strings
  2. Using the min and max length numbers on the upper left side, we know that the strings have a fixed length of 22, with no exceptions
  3. The warning on the right side with the yellow background tells us that the strings have no clear case. Roughly half start with a capital letter, giving them title case, and roughly half contain capitals only later in the value, giving them the case "unknown". Both are fine for a random string, but I would seek out the single fully lower case and fully upper case values to make sure nothing strange is happening (a sketch of this kind of case check follows the list)
  4. Finally, with a string like this, I would open the file in Excel and filter for unusual characters such as %, which may indicate that some values are encoded. In a later version of the tool, I would like to do these checks automatically
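
Out of interest, a case check like the one behind this warning might look as follows in Go. This is my sketch of the behaviour described in point 3, not the tool's code:

    package main

    import (
        "fmt"
        "unicode"
    )

    // caseOf reports a rough case classification for a single value,
    // mirroring the categories mentioned above.
    func caseOf(s string) string {
        runes := []rune(s)
        if len(runes) == 0 {
            return "empty"
        }
        hasUpper, hasLower := false, false
        for _, r := range runes {
            if unicode.IsUpper(r) {
                hasUpper = true
            }
            if unicode.IsLower(r) {
                hasLower = true
            }
        }
        switch {
        case hasUpper && !hasLower:
            return "upper"
        case hasLower && !hasUpper:
            return "lower"
        case unicode.IsUpper(runes[0]):
            return "title" // starts with a capital letter
        default:
            return "unknown" // capitals appear only later in the value
        }
    }

    func main() {
        fmt.Println(caseOf("XyZ1aBcD"), caseOf("abc"), caseOf("ABC"))
    }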

Now let's look at the second column, open:

  1. This is a lovely clean column with no issues
  2. The chart makes clear the relative frequency of true and false: in this case, the data is about restaurants and the column is called "open", so it means that the source data mainly concerns restaurants which were open at the time of the data snapshot

Now let's look at the third column, city:

  1. Like other string columns, we have an extended list of sample values
  2. In this case it is clear that the city was entered by hand at the source, and not chosen from a list. There are many spelling variations of the same city, even just in the samples shown; I have highlighted some in pink. This column would need to be tidied up before being entered into a database
  3. Note also the warning in the top right: there are 2 empty values. Given that there are about 86k records and only 2 have no city, this column was clearly intended to be mandatory. I would get in touch with the source and find out what the city is meant to be for the 2 records where it is missing

Now let's look at the fifth column, stars:

  1. If we look at the sample values, we see this is a case where integers like 1, 2, 3 are mixed with floats like 2.5, 3.5 in similar proportion. This meant that a single data type could not be assigned, as reflected in the error in the top right: "The data type could not be inferred"; the numbers in brackets are the counts of values of each inferred type in the column
  2. However, integers and floats mixed in this way appeared so often in my test files that I decided to handle this case by converting the integers to floats and treating the column as a float column. This is shown in the warning on the right: "have converted all to float" (a simplified sketch of this inference follows the list)
  3. The second error on the right, "Data type(s) present other than {type}", will appear whenever a type for the whole column has been inferred, yet data which appears to be of another type is mixed into the column. In this case the information is redundant, but it is useful in other cases, such as a column which contains 95% string values but into which 5% numeric values have slipped, which would be indicative of a data error
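
A simplified version of this inference, covering only the mixed integer/float case, might look like the sketch below. The parsing and the decision rule are my assumptions based on the behaviour described above:

    package main

    import (
        "fmt"
        "strconv"
    )

    // inferNumericType guesses a single type for a column of raw values,
    // converting to float when integers and floats are mixed.
    func inferNumericType(values []string) string {
        ints, floats := 0, 0
        for _, v := range values {
            if _, err := strconv.ParseInt(v, 10, 64); err == nil {
                ints++
            } else if _, err := strconv.ParseFloat(v, 64); err == nil {
                floats++
            }
        }
        switch {
        case ints > 0 && floats > 0:
            return "float" // "have converted all to float"
        case ints > 0:
            return "integer"
        case floats > 0:
            return "float"
        default:
            return "unknown"
        }
    }

    func main() {
        fmt.Println(inferNumericType([]string{"1", "2.5", "3", "3.5"})) // float
    }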

Now just a quick comment on the warning on the sixth column, attributes.Attire:

  1. The warning tells us that this column has 60.7k blanks, and we know the file has 85.9k records. So unlike the city column above, this column is clearly intended to allow empty values, and we would reflect this in the database by either allowing NULL values or by assigning a default value such as "Not specified"

Now let's look at the warning on the seventh column, num_reviews:

  1. When a column is an exact copy of another column, both columns will show the "Redundancy: Duplicate of {column}" warning, as shown above (a naive version of this check is sketched after the list)
  2. In this case I would inform/change the data source to exclude the redundant column. Maybe the column was not intended to be redundant and it is the result of a copy/paste error on the part of the user or developer who created the file
  3. The message in green in the centre is explained below
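
A naive version of the duplicate check is easy to picture; in practice the tool more likely hashes or fingerprints columns, but I have not verified that. A sketch:

    package main

    import "fmt"

    // isDuplicate reports whether two columns are exact copies of each other.
    func isDuplicate(a, b []string) bool {
        if len(a) != len(b) {
            return false
        }
        for i := range a {
            if a[i] != b[i] {
                return false
            }
        }
        return true
    }

    func main() {
        fmt.Println(isDuplicate([]string{"5", "7"}, []string{"5", "7"})) // true
    }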

Now let's look at the eighth column, registration:

  1. This is a clean datetime column except for the 16 empty values noted in the top right. Unlike the other columns shown thus far, this column contains enough distinct values to be treated as a data column rather than a type column, and as such it has a distribution chart rather than a bar chart of the most common values. The distribution is suspiciously even, however, and that is because this data was randomly generated by me rather than sourced from the real world
  2. In the top middle, in green, we see that this column is "linked" to two others: opening_day and daily_customers. This means that the values in these columns change at the same time. Often a text file will aggregate many columns from different sources, and this indicator is useful for figuring out which columns came from the same source system (my guess at such a check is sketched after the list)
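
I read "change at the same time" as: scanning the records in file order, the two columns either both change or both stay the same between consecutive records. A guess at such a check, purely for illustration:

    package main

    import "fmt"

    // linked reports whether two columns change on exactly the same records,
    // under my assumed reading of "linked"; the tool's actual rule may differ.
    func linked(a, b []string) bool {
        n := len(a)
        if len(b) < n {
            n = len(b)
        }
        for i := 1; i < n; i++ {
            if (a[i] != a[i-1]) != (b[i] != b[i-1]) {
                return false
            }
        }
        return true
    }

    func main() {
        fmt.Println(linked([]string{"x", "x", "y"}, []string{"1", "1", "2"})) // true
    }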

Finally, a quick comment about the ninth column, opening_day:

  1. If we look at the sample values above, we see that the values look like datetimes: they show both date and time data. However, the time data is always midnight ("00:00:00"). If this is the case for every value in the column, the error in the top right will appear: "Contains no time data: should be a date column". This is useful for knowing when to treat a datetime as a date in the database, since a date is handled and displayed more easily
  2. The equivalent check is done for float data. If every float value in a column has no decimal data, e.g. 23.0000, 4.0000, etc., then a message will appear indicating that the column would be better treated as an integer (both checks are sketched below)
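
Both checks boil down to a simple scan over the parsed values. A sketch, with the datetime format string being an assumption on my part:

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    // allMidnight reports whether every parseable datetime has a time of
    // exactly 00:00:00, suggesting the column is really a date column.
    func allMidnight(values []string) bool {
        for _, v := range values {
            t, err := time.Parse("2006-01-02 15:04:05", v) // assumed format
            if err != nil {
                continue // skip blanks and unparseable values in this sketch
            }
            h, m, s := t.Clock()
            if h != 0 || m != 0 || s != 0 {
                return false
            }
        }
        return true
    }

    // allWholeFloats is the equivalent check for float columns: every value
    // has no decimal part, suggesting an integer column.
    func allWholeFloats(values []float64) bool {
        for _, f := range values {
            if f != math.Trunc(f) {
                return false
            }
        }
        return true
    }

    func main() {
        fmt.Println(allMidnight([]string{"2016-03-01 00:00:00", ""}))
        fmt.Println(allWholeFloats([]float64{23.0000, 4.0000}))
    }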

Configuration

Configuration of the tool is handled by the file config.toml, which is part of the zipped archive and can be edited with any text editor. This file is read upon each execution of the program. Each setting has an explanatory comment.

To be completed


