I am an old Unix hacker and know how to use tools like Awk/Sed and whatnot to do text processing. I have managed to carry on even in the Windoze days by installing Cygwin.

More recently I’ve been using Python because I wanted to get an understanding of the language. It has a lot of the same functionality, is more modern, and has much cleaner syntax. I like the shorthand ways of filtering arrays as well, plus lambda functions; the list goes on. Problem is: no-one out there in the real world is gainfully employed using it. Lots of open source stuff but I need to eat.

I’ve been looking for Java work recently and decided that I would start using Java for the pre-processing work because I want to keep my hand in with the language. I had decided against using it a long time ago because you had to do all sorts of nonsense loading 3rd party stuff in the class path if you wanted to split strings and use regular expressions. I steeled myself for a lot of pain after I made this decision and found I was wrong.

Java 2 has most of the things you need built into the String class these days and the process was remarkably painless.

The Problem

I was trying to compress a file full of run time data down to look for long-running functions. It looked like this:

23/08/2006 16:07:47|58067|Entering|FUNC_1
23/08/2006 16:07:47|58068|Leaving|FUNC_1
23/08/2006 16:07:47|58067|Entering|FUNC_2
23/08/2006 16:07:47|58067|Entering|FUNC_1
23/08/2006 16:07:47|58067|Leaving|FUNC_1
23/08/2006 16:07:47|58067|Leaving|FUNC_2

This is the timestamp, the number of seconds in the day, the operation, and the function name. I have millions of lines of this and want to remove the lines next to each other that have zero difference in the seconds field. This is simply to give me somewhere to start in my hunt for long-running functions. I know that any real programming environment would have a profiler but we’re talking PL/SQL here so you have to roll your own.

I wrote a program in Awk that pulled the whole thing into memory and then looked at the fields in the records, but it wouldn’t work. I realised later that I was calling split with the argument |, and I think that this split the whole line up into single characters.

Lost patience with Awk and thought OK, try again in Java.

The Solution

I needed to be able to read a line and break it into its delimited components.

Had to do the usual Java nonsense to get the file open:

        BufferedReader r = new BufferedReader( new FileReader( args[0] )) ;

Then get the current line with

                currLine = r.readLine() ;
                if ( currLine == null ) break ;
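
Putting those two fragments together, the read loop might look something like the sketch below. This is not the original program: the class name ReadLoop and the countLines method are made up for illustration, try-with-resources assumes Java 7 or later, and a StringReader stands in for the FileReader so it runs without a data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLoop {
    // Hypothetical helper: counts lines; real processing would replace the counter.
    static int countLines(Reader source) {
        int count = 0;
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader r = new BufferedReader(source)) {
            String currLine;
            // readLine() returns null once the input is exhausted
            while ((currLine = r.readLine()) != null) {
                count++; // process currLine here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // StringReader stands in for new FileReader(args[0])
        String sample = "23/08/2006 16:07:47|58067|Entering|FUNC_1\n"
                      + "23/08/2006 16:07:47|58068|Leaving|FUNC_1\n";
        System.out.println(countLines(new StringReader(sample)));
    }
}
```

Only one line is held at a time, which matters with millions of records.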

Ho hum. Next split the strings. First I tried:

            currBits = currLine.split("|");

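(currBits is the array of pieces of the string.) This wouldn’t work because split takes a regular expression, and | on its own is the alternation operator with nothing on either side, so it matches the empty string between every pair of characters. I had to use JDB (which is not that bad to use, actually) to work out that the currBits elements were all one character each. Then I changed to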

            currBits = currLine.split("\\|");

This escaped the special character and it worked. The rest of the code was schoolperson stuff using the array of string pieces for comparison with some read ahead to look at the next record and work out if you want to print or not.
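
For the record, the read-ahead comparison could be sketched roughly as follows. This is a reconstruction under assumptions, not the code I actually ran: the names GapFilter, seconds and keepGaps are invented, a line is kept only when the next line has a different seconds value, and the last line is always kept because there is nothing to compare it with. The sample data in main is made up too.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class GapFilter {
    // Field 1 (zero-based) of a pipe-delimited record is the seconds-in-day value.
    static int seconds(String line) {
        return Integer.parseInt(line.split("\\|")[1]);
    }

    // Assumed rule: drop a line when the next line has the same seconds value.
    static List<String> keepGaps(Reader source) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(source)) {
            String currLine = r.readLine();
            while (currLine != null) {
                String nextLine = r.readLine(); // read ahead by one record
                if (nextLine == null || seconds(currLine) != seconds(nextLine)) {
                    kept.add(currLine);
                }
                currLine = nextLine;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data with one five-second gap and one long gap
        String sample = "23/08/2006 16:07:47|58067|Entering|FUNC_1\n"
                      + "23/08/2006 16:07:47|58067|Entering|FUNC_2\n"
                      + "23/08/2006 16:07:52|58072|Leaving|FUNC_2\n"
                      + "23/08/2006 16:10:00|58200|Leaving|FUNC_1\n";
        for (String line : keepGaps(new StringReader(sample))) {
            System.out.println(line);
        }
    }
}
```

For a real run you would print each kept line as you go rather than collect them in a list, so that only two records are ever in memory.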

In essence I can do the simple text processing and line splitting in Java now. Regular expressions are also part of the standard library these days (java.util.regex.Pattern). It’s really moved on from Java 1.1, which was too painful, so I stuck to Awk. There’s a replace function on strings as well.
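
A small sketch of those library pieces, in case it helps. The class name SplitDemo and the fields helper are hypothetical. Pattern.quote is one way to avoid escaping the pipe by hand, and String.replace with two string arguments (Java 5 onwards) treats its arguments literally rather than as a regular expression.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused; Pattern.quote makes "|" a literal, not alternation.
    static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    // Hypothetical helper: breaks one record into its delimited fields.
    static String[] fields(String line) {
        return PIPE.split(line);
    }

    public static void main(String[] args) {
        String line = "23/08/2006 16:07:47|58067|Entering|FUNC_1";

        // Unescaped "|" matches the empty string, so the line falls apart into characters
        System.out.println(line.split("|").length);

        // The quoted pattern yields the four fields
        System.out.println(Arrays.toString(fields(line)));

        // Literal replacement, no regular expression involved
        System.out.println(line.replace("|", ","));
    }
}
```

Precompiling the pattern is probably worth it over millions of lines, since String.split may otherwise recompile the expression on each call.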

Only problem is, strings are immutable, so every intermediate result gets thrown away into the garbage collection void if you do a lot of string manipulation. I wonder if Sun have added the functionality to StringBuffer? I bet not.
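
As far as I can tell, StringBuffer and its unsynchronised sibling StringBuilder (Java 5 onwards) have no split or regular expression methods of their own, only index-based editing. A minimal sketch of what you can do in place, with the class name BuilderDemo and the method pipesToTabs invented for the example:

```java
class BuilderDemo {
    // Swaps every '|' for a tab inside one mutable buffer,
    // so the only new String created is the final result.
    static String pipesToTabs(String line) {
        StringBuilder sb = new StringBuilder(line);
        for (int i = 0; i < sb.length(); i++) {
            if (sb.charAt(i) == '|') {
                sb.setCharAt(i, '\t');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pipesToTabs("23/08/2006 16:07:47|58067|Entering|FUNC_1"));
    }
}
```

Whether this is measurably faster than String.replace for this job is something I have not tested.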


What I want now is a simple Java editor that lets me do debugging and so on WITHOUT having to create a project and spend half a day messing with it. Emacs is OK but I’d have liked code completion and built-in debugging. I know you can get Emacs to do this but it then takes a year to start up as it loads everything, and I’m not a patient guy with my tools: start up quickly and start giving me what I need NOW. This is probably why I don’t like M$ windows, even though I use it every day.