gzopen – Dr John's Tech Talk

Intro
A colleague of mine is an awesome Perl programmer. It may be a dying language but for those of us who know it it can still do some great things in our hands. So this guy, let’s call him Stain, wrote some code to analyze large proxy logs in parallel by using threads. Using threads presumably makes the analysis program run a lot faster since you’re throwing more CPUs at a problem that is CPU bound. It would have taken me years to write but I bet he did it in a couple days because he’s so experienced. I was eager to reap the benefits of his labor but was disappointed with the results I was getting…

The details
What I need the program for is very basic – just looking for a character string in the lnies of a proxy log. In fact my needs are so basic I don’t need a program at all. I can cobble together my analysis with Linux command-line utilities like a

$ gunzip -c proxy_log.gz|grep drjohnstechtalk > /tmp/drjs

And that’s what I was doing before his program came along. when I run his program with eight threads on a server with 16 cores, it should finish a lot faster than my command line, but it wasn’t. So I rolled up my sleeves to look into his code to see what the problem is. Running his code with a single thread showed that it ran much, much slower than my command-line commands. I focused on that problem as the root of the issue.

Here is the essence of his code for the analysis:

#!/usr/bin/perl
# DrJ research labs...
use strict;
use Compress::Zlib;
use threads;
use threads::shared;
use File::Find;
my @searchStrings = ("drjohnstechtalk");
my @matches;
my $lines;
&amp;searchGZip("/tmp/p-02.gz");
#----------------------------------------------------------------------------
# searchGZip
#----------------------------------------------------------------------------
sub searchGZip{
        my $file = shift(@_);
        my $line;
        my $gz;
        my $loglines = 0;
 
        if($gz = gzopen($file, "rb")){
 
                while($gz-&gt;gzreadline($line)){
                        $loglines++;
                        if(&amp;matchLine($line)){
                                push(@matches,$line);
                        }
                }
                $lines += $loglines;
                $gz-&gt;gzclose;
        }else{
                #&amp;addAppLogMsg("Could not open file $file: $gzerrno\n");
                print "Cannot read file $file\n";
        }
}
 
#----------------------------------------------------------------------------
# matchLine
#----------------------------------------------------------------------------
sub matchLine{
        my $line = shift;
 
        foreach(@searchStrings){
 
                return 1 if $line =~ /$_/
        }
        return 0;
}

He’s using functions I never even head of, like gzopen and gzreadline, apprently from the Compress::Zlib package. Looks elegant, right? Yes. But slow as a dog.

Version 2
So I began to introduce more shell commands into the Perl since shell by itself is faster. In this version I’ve gotten rid of those unfamiliar gzopen and gzreadlikne functions in favor of what I know:

#!/usr/bin/perl
# DrJ research labs...
use strict;
use Compress::Zlib;
use threads;
use threads::shared;
use File::Find;
my @searchStrings = ("drjohnstechtalk");
my @matches;
my $lines;
&amp;searchGZip("/tmp/p-02.gz");
#----------------------------------------------------------------------------
# searchGZip
#----------------------------------------------------------------------------
sub searchGZip{
        my $file = shift(@_);
        my $line;
        my $gz;
        my $loglines = 0;
 
        open(PXY,"gunzip -c $file|");
        while() {
                        $loglines++;
        $line = $_;
                        if(&amp;matchLine($line)){
                                push(@matches,$line);
                        }
                $lines += $loglines;
        }
}
 
#----------------------------------------------------------------------------
# matchLine
#----------------------------------------------------------------------------
sub matchLine{
        my $line = shift;
 
        foreach(@searchStrings){
 
                return 1 if $line =~ /$_/
        }
        return 0;
}

Version 3
Not satisfied with the time, I got rid of the function call for each line in this version.

#!/usr/bin/perl
# DrJ research labs...
use strict;
use Compress::Zlib;
use threads;
use threads::shared;
use File::Find;
my @searchStrings = ("drjohnstechtalk");
my @matches;
my $lines;
&amp;searchGZip("/tmp/p-02.gz");
#----------------------------------------------------------------------------
# searchGZip
#----------------------------------------------------------------------------
sub searchGZip{
        my $file = shift(@_);
        my $line;
        my $gz;
        my $loglines = 0;
 
        open(PXY,"gunzip -c $file|");
        while() {
                        $loglines++;
        $line = $_;
           push(@matches,$line) if $line =~ /$searchStrings[0]/;
                $lines += $loglines;
        }
}

That was indeed still faster, but not as fast as the command line.

Version 4
I wondered if I could do better still by also using the shells built-in match command, grep. That leads to this version:

#!/usr/bin/perl
# DrJ reseach labs...
use strict;
use Compress::Zlib;
use threads;
use threads::shared;
use File::Find;
my @searchStrings = ("drjohnstechtalk");
my @matches;
my $lines;
&amp;searchGZip("/tmp/p-02.gz");
#----------------------------------------------------------------------------
# searchGZip
#----------------------------------------------------------------------------
sub searchGZip{
        my $file = shift(@_);
        my $line;
        my $gz;
        my $loglines = 0;
 
        open(PXY,"gunzip -c $file|grep $searchStrings[0]|");
        while() {
        $line = $_;
           push(@matches,$line);
        }
}

Here’s table with the performance summaries.

Version	changes	Time (wall clock)
Original	–	63.2 s
2	gunzip -c instead of gzopen/gzreadline	18.6 s
3	inline instead of function call	14.0 s
4	grep instead of perl match operator	10.8 s

So with these simple improvements the timing on this routine improved from 63.2 s to 10.8 s – a factor of six!!

I incorporated my changes into his original program and now it really is a big help when it runs multi-threaded. I can run eight threads and finish an analysis about six times as quick as searching from the command-line. That’s a significant speed-up which will make us more productive.

Conclusion
Some elegant perl code is taken apart and found to be the cause of a significant slow-down. I show how we rewrote it in a more primitive style that results in a huge performance gain – code that runs six times faster!