This post is archived and probably outdated.

Searching through archive files

2009-09-03 12:35:00

To learn a bit about PHP Gtk I'm working on a small GUI application for reading the PHP manual. It's quite rough, and its sole purpose is to let me play with PHP Gtk. People who know me know that I really love iterators in PHP, so obviously this app uses iterators, too. In this post I want to share an example where iterators are really useful:

As I said, the app is about browsing the PHP manual. The manual is provided to the app as a tar.gz file and I wanted to have a full-text search. For accessing the tar.gz content I'm using phar. Yes, phar is not only for phar files but can work on other kinds of archives (tar.gz, tar.bz2, zip), too.
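To make the "phar on plain archives" point concrete, here is a minimal sketch that builds a small tar archive with PharData and reads it back through the phar:// stream wrapper. The scratch path and the entry names are made up for this demo; they are not part of the manual archive used in this post.

```php
<?php
// Hypothetical scratch archive in the temp directory (demo only).
$path = sys_get_temp_dir() . '/phar_demo_' . uniqid() . '.tar';

// PharData handles non-executable tar/zip archives.
$archive = new PharData($path);
$archive->addFromString('index.html', '<p>hello</p>');
$archive->addFromString('docs/intro.html', '<p>intro</p>');

$contents = [];
foreach (new RecursiveIteratorIterator($archive) as $file) {
    // getPathname() is a phar:// URL, so ordinary stream functions can read it.
    $contents[$file->getFilename()] = file_get_contents($file->getPathname());
}
ksort($contents);
print_r($contents);

// Clean up the scratch archive.
unset($archive);
unlink($path);
```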

So that's my search implementation:

class FullTextSearch extends FilterIterator {
    protected $needle;
 
    public function __construct(PharData $archive, $needle) {
        $flags = RecursiveIteratorIterator::LEAVES_ONLY;
        $it = new RecursiveIteratorIterator($archive, $flags);
        parent::__construct($it);
 
        $this->needle = $needle;
    }
 
    public function accept() {
        $current = $this->current();
        // This is not 100% perfect but should be good enough for this case:
        if (strpos($current->getFilename(), '.htm') === false) {
            return false;
        }
 
        // This is bad for larger files ...
        $content = file_get_contents($current->getPathname());

        return strpos($content, $this->needle) !== false;
    }
}

$needle = 'search';
$archive = new PharData('php_manual_en.tar.gz');

$search = new FullTextSearch($archive, $needle);
foreach ($search as $filename => $fileobject) {
    echo "Found in $filename.\n";
}

The comments in the code mark a few places that would need improvement for general-purpose use, but it shows how nice iterator-based solutions can be.
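One of the marked places is the file_get_contents() call, which loads each file into memory as a whole. A possible alternative, sketched here under the assumption of a hypothetical helper called fileContains() (it is not part of the app), reads the file in chunks and carries over the tail of each chunk so that a match straddling a chunk boundary is not missed:

```php
<?php
/**
 * Hypothetical helper: reports whether $needle occurs in the file at $path,
 * reading it in chunks instead of loading it whole.
 */
function fileContains(string $path, string $needle, int $chunkSize = 8192): bool
{
    if ($needle === '') {
        return true;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    // Keep the last strlen($needle) - 1 bytes so boundary matches are found.
    $overlap = strlen($needle) - 1;
    $tail = '';
    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                break;
            }
            $buffer = $tail . $chunk;
            if (strpos($buffer, $needle) !== false) {
                return true;
            }
            $tail = $overlap > 0 ? (string) substr($buffer, -$overlap) : '';
        }
        return false;
    } finally {
        fclose($handle);
    }
}

// Demo with a tiny chunk size so the needle straddles chunk boundaries.
$tmp = tempnam(sys_get_temp_dir(), 'fts');
file_put_contents($tmp, str_repeat('a', 10) . 'needle' . str_repeat('b', 10));
$found = fileContains($tmp, 'needle', 4);
$missing = fileContains($tmp, 'missing', 4);
unlink($tmp);
var_dump($found, $missing);
```

Inside accept() this could replace the file_get_contents()/strpos() pair, since fopen() accepts the same pathname.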

If you aren't used to iterators you most likely wonder what's going on, so let's look into it:

First we need some basic knowledge. An Iterator in PHP is, basically, an object that can be used in a foreach statement and yields elements one at a time. An ArrayIterator, for instance, walks over an array and returns each of its elements. PharData objects are RecursiveDirectoryIterators. This means you can put the PharData object into a foreach statement and you'll get a list of all files. Oh wait, it's not that easy: you will actually receive only the root elements.

Confused? On the one hand I said it's recursive, but on the other hand it only returns the root elements? Well, being a RecursiveIterator means that the object provides methods to check whether the current element has children (hasChildren()) and to get an iterator over those children (getChildren()). foreach won't call these methods; that's the job of the RecursiveIteratorIterator (RII). The RII is a so-called outer iterator, which means it iterates over the elements of another iterator. In this case it walks over the entries in the archive and checks, for every entry, whether it is a directory. If the current entry is a directory, the RII descends into it, recursively, until all files have been returned.
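The difference between a plain foreach and the RII can be tried without any archive. The following sketch uses a RecursiveArrayIterator over a made-up nested array as a stand-in for the directory tree; the entry names are purely illustrative.

```php
<?php
// Hypothetical nested data standing in for an archive's directory tree.
$tree = new RecursiveArrayIterator([
    'index.html' => 'file',
    'docs' => [
        'intro.html' => 'file',
        'api' => ['phar.html' => 'file'],
    ],
]);

// Plain foreach: only the root level is visited.
$rootKeys = [];
foreach ($tree as $name => $entry) {
    $rootKeys[] = $name;
}

// RecursiveIteratorIterator: descends via hasChildren()/getChildren().
$leafKeys = [];
foreach (new RecursiveIteratorIterator($tree) as $name => $entry) {
    $leafKeys[] = $name;
}

echo implode(', ', $rootKeys), "\n"; // index.html, docs
echo implode(', ', $leafKeys), "\n"; // index.html, intro.html, phar.html
```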

Having this basic knowledge we could write code like this:

$it = new RecursiveIteratorIterator($archive);
foreach ($it as $filename => $fileobject) {
    echo $filename."\n";
}

This gives us a list of all files in the archive. Directories are descended into but not listed themselves, because the RII's default mode is LEAVES_ONLY; with RecursiveIteratorIterator::SELF_FIRST the directory entries would show up, too. The next thing in the first snippet is a FilterIterator. A FilterIterator is, again, an outer iterator that does exactly what the name says: it filters the elements of its inner iterator. To do that it calls accept() whenever it steps to the next element. If accept() returns true, the element is handed to the caller (the foreach statement or another iterator); if it returns false, the element is skipped and the FilterIterator checks the next one. So in this case I only let through elements that contain the needle from my search.
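Stripped of the archive handling, the accept() mechanism looks like this. It is a minimal sketch with a hypothetical HtmlOnlyFilter class and a made-up list of names, not code from the app:

```php
<?php
// Hypothetical filter: keeps only names containing '.htm'.
class HtmlOnlyFilter extends FilterIterator
{
    public function accept(): bool
    {
        // current() is forwarded to the inner iterator's current element.
        return strpos($this->current(), '.htm') !== false;
    }
}

$names = new ArrayIterator(['index.html', 'logo.png', 'faq.htm', 'style.css']);

// Second argument false: re-index instead of preserving the original keys.
$kept = iterator_to_array(new HtmlOnlyFilter($names), false);

print_r($kept); // index.html and faq.htm
```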

With all this there's just one little thing left in the code: treating directories like files and searching through them won't work, and in this case I absolutely don't care about the directories themselves. So I ask the RecursiveIteratorIterator, via the LEAVES_ONLY mode, to step over the directory entries and go directly to their children. LEAVES_ONLY is also the default mode, so passing it explicitly mainly documents the intent.
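To see what the mode argument changes, here is a small comparison of LEAVES_ONLY and SELF_FIRST. It is a sketch over a made-up nested array rather than the real archive, and keysFor() is a hypothetical helper introduced only for this demo:

```php
<?php
// Hypothetical nested data standing in for the archive's directory tree.
$tree = new RecursiveArrayIterator([
    'docs' => [
        'intro.html' => 'file',
        'api' => ['phar.html' => 'file'],
    ],
    'index.html' => 'file',
]);

// Hypothetical helper: collects the keys the RII yields in a given mode.
function keysFor(RecursiveIterator $it, int $mode): array
{
    $keys = [];
    foreach (new RecursiveIteratorIterator($it, $mode) as $key => $value) {
        $keys[] = $key;
    }
    return $keys;
}

// Only the "files": parents are descended into but not returned.
$leaves = keysFor($tree, RecursiveIteratorIterator::LEAVES_ONLY);

// Parents are returned as well, each before its children.
$selfFirst = keysFor($tree, RecursiveIteratorIterator::SELF_FIRST);

echo implode(', ', $leaves), "\n";    // intro.html, phar.html, index.html
echo implode(', ', $selfFirst), "\n"; // docs, intro.html, api, phar.html, index.html
```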