Searching for a String Inside a PDF File Using PHP on Linux

PDF (Portable Document Format) files can contain text, links, buttons, forms, images, graphics and other media, independent of software, hardware and operating system.

If you open a PDF file in a plain text editor such as Notepad, most of the content is unreadable: the text is encoded, and often compressed, inside a layout-oriented structure. So a direct string search on the raw file is generally not possible using PHP.

We need some mechanism that gives us the PDF data in a readable format.

Several Linux tools can convert a PDF into HTML or plain text.

While searching the web, I came across pdftotext:
https://en.wikipedia.org/wiki/Pdftotext

Using pdftotext, we can easily search for any string inside a PDF.

The logic is simple: we convert the PDF file into a text file by running a Linux command from PHP, and then search for the string in that text file. With the help of Linux grep, we can find the matching lines along with their page numbers, and paginate the results.
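To make the idea concrete before bringing in the shell, here is a minimal pure-PHP sketch of the same logic working on an in-memory string. It assumes the text looks like pdftotext output, where pages are separated by a form feed character ("\f"). The function name searchPagedText and the sample text are hypothetical, and stripos stands in for grep -i here.

```php
<?php
// Hypothetical helper: case-insensitive search over text whose pages are
// separated by form feeds, as in pdftotext output (an assumption of this sketch).
function searchPagedText(string $text, string $keyword): array
{
    $hits = array();
    if ($keyword === '') {
        return $hits; // nothing to search for
    }
    // Each chunk between form feeds is treated as one page.
    foreach (explode("\f", $text) as $index => $pageText) {
        foreach (explode("\n", $pageText) as $line) {
            if (stripos($line, $keyword) !== false) {
                $hits[] = array('page' => $index + 1, 'phrase' => trim($line));
            }
        }
    }
    return $hits;
}

// Made-up two-page sample, only for illustration.
$sample = "Invoice summary\nTotal due: 40\n\fPayment terms\nTotal is payable in 30 days\n";
$hits = searchPagedText($sample, 'total');
foreach ($hits as $hit) {
    echo "page {$hit['page']}: {$hit['phrase']}\n";
}
```

This only illustrates the page bookkeeping. For large files, letting grep do the scanning, as described in the rest of this post, avoids loading the whole text into PHP.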

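We can use the exec or shell_exec function to convert the PDF into text. The file name should be passed through escapeshellarg so that spaces and shell metacharacters in it are handled safely. When no output name is given, pdftotext writes a .txt file next to the PDF.

exec("pdftotext " . escapeshellarg($pdf));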
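The one-line exec call above does not report whether the conversion worked. A slightly more defensive sketch might build the command separately and check the exit status. The helper names buildPdftotextCommand and convertPdfToText and the sample paths are hypothetical; the sketch assumes pdftotext is installed and exits with status 0 on success.

```php
<?php
// Hypothetical helper: builds the pdftotext command with both paths shell-quoted.
function buildPdftotextCommand(string $pdf, string $txt): string
{
    return 'pdftotext ' . escapeshellarg($pdf) . ' ' . escapeshellarg($txt);
}

// Hypothetical helper: runs the conversion and returns the path of the
// text file, or null if pdftotext failed or produced no file.
function convertPdfToText(string $pdf): ?string
{
    // Swap a trailing .pdf extension for .txt.
    $txt = preg_replace('/\.pdf$/i', '', $pdf) . '.txt';
    $output = array();
    $status = 1;
    // 2>&1 captures error messages in $output for logging or debugging.
    exec(buildPdftotextCommand($pdf, $txt) . ' 2>&1', $output, $status);
    return ($status === 0 && is_file($txt)) ? $txt : null;
}

// Show the command that would be run for a path containing a space.
echo buildPdftotextCommand('/tmp/annual report.pdf', '/tmp/annual report.txt'), "\n";
```

Keeping the command builder separate makes it easy to inspect or log the exact command line before it is executed.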

Now that we have a text version of the PDF file, let's dig deeper and play with the Linux grep command.

To learn more about the grep command, see the link below:
http://linuxcommand.org/man_pages/grep1.html

The functions below implement the search.



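    // These static methods are meant to live inside a class.
    public static function pdfSearch($file = '', $kw = '', $page = 1, $recordPerPage = 10) {
        // pdftotext writes its output next to the PDF, with a .txt extension.
        $file = preg_replace('/\.pdf$/i', '.txt', $file);
        // Quote both values for the shell; addslashes() is not shell-safe.
        $kwArg = escapeshellarg($kw);
        $fileArg = escapeshellarg($file);
        $recordPerPage = max(1, (int) $recordPerPage);
        $first = (max(1, (int) $page) - 1) * $recordPerPage + 1;

        // Matching lines for the requested page of results, as "file.txt:line:text".
        $scomm = "grep -H -n -i -e $kwArg $fileArg | tail -n +$first | head -n $recordPerPage";

        $results = array();
        exec($scomm, $results);

        // Total number of matching lines (a bare number for a single file).
        $tcomm = "grep -i -c -e $kwArg $fileArg";

        $tresults = array();
        exec($tcomm, $tresults);
        $total = isset($tresults[0]) ? (int) $tresults[0] : 0;

        // Line numbers that contain a form feed, which marks a page break.
        $pcomm = "grep -n -o " . escapeshellarg("\f") . " $fileArg | cut -f1 -d:";

        $presults = array();
        exec($pcomm, $presults);

        return array($total, self::pdfSearchResultParse($results, $presults));
    }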

    public static function pdfSearchResultParse($results, $pages) {
        $outarr = array();

        foreach ($results as $row) {
            // Each row looks like "path/file.txt:LINE:matched text".
            $t1 = explode('.txt:', $row, 2);
            if (!isset($t1[1])) {
                continue;
            }
            // Split only once so that colons inside the matched text are kept.
            $t2 = explode(':', $t1[1], 2);
            if (!isset($t2[1])) {
                continue;
            }
            $line = (int) $t2[0];
            $str = $t2[1];
            $pg = self::getPdfSearchResultPage($line, $pages);
            $temp = array();
            $temp['page'] = $pg;
            $temp['phrase'] = $str;
            $temp['line'] = $line;
            $outarr[] = $temp;
        }
        return $outarr;
    }

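    public static function getPdfSearchResultPage($line, $pages) {
        // $pages lists, in order, the line numbers that contain a page-break form feed.
        foreach ($pages as $p => $pv) {
            if ($line <= $pv) {
                return $p + 1;
            }
        }
        // No form feed at or after this line: it belongs to the page after the last break.
        return count($pages) + 1;
    }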
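pdfSearch returns the total number of matches together with the current page of results. A caller would typically turn that total into paging controls. The following is a small sketch of how that could be done; the helper name paginationInfo and the keys of the array it returns are hypothetical and not part of the functions above.

```php
<?php
// Hypothetical helper: derives paging information from a total match count.
function paginationInfo(int $total, int $page, int $recordPerPage = 10): array
{
    $recordPerPage = max(1, $recordPerPage);
    $totalPages = (int) ceil($total / $recordPerPage);
    // Clamp the requested page into the valid range (page 1 when there are no results).
    $page = min(max(1, $page), max(1, $totalPages));
    return array(
        'page' => $page,
        'totalPages' => $totalPages,
        'hasPrevious' => $page > 1,
        'hasNext' => $page < $totalPages,
    );
}

// Example: 23 matches, 10 per page, currently viewing page 3.
$info = paginationInfo(23, 3, 10);
echo "Page {$info['page']} of {$info['totalPages']}\n";
```

The clamping keeps an out-of-range page number, for example one taken from a query string, from producing an empty result page.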


Hope this helps.
