Precisely Speaking

General Category => Rants & Raves => Topic started by: precisonline on June 19, 2008, 07:54:40 PM

Title: Out of the Box
Post by: precisonline on June 19, 2008, 07:54:40 PM
I gotta admit, I have this sickness that tends to make me really enjoy taking on projects that may be a little outside of my comfort zone.  As long as there's an opportunity to stretch the horizon a bit, and with the understanding that the deadline isn't so unrealistic that it's, well, unrealistic, it can be quite the joy to learn something new, EVEN IF it's really something quite old. ;)

Take today for example.  The task was to export a bunch of data from one system into a series of flat files.  Depending on the volume of the data I might normally bring the data into Excel to verify that all of the columns have at lease something in them.  With these exports, however, there were millions of rows per file and even if I went through the insane task of transferring these files to my local PC there'd be no way that Excel could open them anyway.  Another thought might be import the data into something else and then check the data that way, but the timeline for this project wasn't conducive to writing both export and import programs.

All I really needed for this first test was to ensure that all of the columns had some data.  All of the rows wouldn't necessarily have all columns, but as I got thinking more about the problem I figured it might be useful to know how many rows had a particular field value so that I could compare that count against the total row count to see if the percentages were realistic.  (I mean hey, if there's 20 million rows and a particular column has 3 hits, likely something's amiss there.)

All these exports were written on an AIX system, so I figured what about this 'awk' thing that's been around since dirt was created?  A quick web search and it looked like this could be just the ticket.  (All cards on the table I've written 1 or 2 awk scripts before, but nothing really to write home - or here - about.)  About 5 minutes later I had the script written and was running the first test.

The script basically goes through each of the lines and a) calculates the number of rows, b) calculates the maximum number of columns, and c) determines the number of hits in each column.  This is then used at the end to generate column by column percentages as mentioned earlier.

Frankly, the script is so simple it's boring, but in case you're wondering:

Code: [Select]
  max_fields = 0;
  records = 0;
  if (NF > max_fields)
    max_fields = NF;
  for(ndx = 1 ; ndx <= NF ; ++ndx)
    if($ndx != "")
      fields[ndx] += 1;
END {\
  printf("Max fields = %d\n",max_fields);
  printf("Records    = %d\n\n",records);
  for(ndx = 1 ; ndx <= max_fields ; ++ndx)
    printf("%d\t%d\t%.2f\n",ndx,fields[ndx],(fields[ndx] / records) * 100);

As a report writing "language", awk is really remarkably simple.  There's stuff you do at the beginning (the BEGIN block), stuff you do for each row, and stuff you do at the end of the job (in the END block).  This is significantly simpler than a lot of other options because there's no real file work (opening, reading, etc.) - that's a handled for you.

Now I'm not saying I'd like to write a bunch of these in the future, but the 5 minutes or so that it took to learn the language well enough to write this script saved me a lot of time and energy today, and that - I think - makes this something worth writing about.