Read Word documents with PHP (up to Word 2003)

Posted by Danny Herran on Feb 10, 2011 in Backend

Reading Word documents with PHP on a Linux box can be a real pain. The usual approach relies on COM, which is only available on the Windows platform. However, Unkwntech from Stack Overflow wrote quite a nice function that reads Word documents and extracts their text. It only parses text content, but that is enough for most of the tasks we will be doing anyway.

Before we start, let me tell you that this is a pure PHP solution. It doesn’t require any additional applications. However, if you want a more robust way to read Word documents, I would strongly recommend Antiword; it is available for Linux, Mac OS X, Windows and many other platforms.
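If you do go the Antiword route, calling it from PHP could look roughly like the sketch below. It assumes an `antiword` binary is installed and on the PATH, and that running it with just a file name prints the document text to standard output. The function names `buildAntiwordCommand` and `readWordWithAntiword` are made up for this example.

```php
<?php
// Hypothetical helper: builds the shell command. Assumes an `antiword`
// binary is installed and reachable through the PATH.
function buildAntiwordCommand(string $path, string $binary = 'antiword'): string
{
    // escapeshellarg() protects against spaces and shell metacharacters.
    return $binary . ' ' . escapeshellarg($path);
}

// Hypothetical helper: returns the extracted text, or null when the command
// produced no output or could not be run.
function readWordWithAntiword(string $path): ?string
{
    $output = shell_exec(buildAntiwordCommand($path));
    return is_string($output) ? $output : null;
}

// Usage (needs Antiword and a real document, so it is left commented out):
// echo readWordWithAntiword('cv.doc');
```

Keeping the command building in its own function makes the escaping easy to check without having Antiword installed.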

So, let's take a look at Unkwntech's solution, compatible with Word 97/XP/2003:

<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc)
{
    // Read the whole file as raw bytes; give up quietly if it cannot be read.
    $contents = @file_get_contents($userDoc);
    if ($contents === false) {
        return "";
    }

    // Split on carriage returns and keep only non-empty slices without a NUL byte.
    $outtext = "";
    foreach (explode(chr(0x0D), $contents) as $thisline) {
        if ($thisline !== "" && strpos($thisline, chr(0x00)) === false) {
            $outtext .= $thisline . " ";
        }
    }

    // Strip everything except letters, digits, whitespace and basic punctuation.
    return preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', "", $outtext);
}

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;
?>
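
Since the function only makes sense for the old binary .doc format, a small guard before calling it can save some confusion when someone uploads a .docx. The sketch below checks the first eight bytes for the OLE2 compound file signature. Keep in mind that old .xls and .ppt files carry the same signature, so this only rules out obvious mismatches. The function names are invented for this example.

```php
<?php
// Hypothetical helper: true when the bytes start with the OLE2 compound
// file signature used by Word 97-2003 .doc files (and other legacy Office files).
function hasOle2Signature(string $bytes): bool
{
    return strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0;
}

// Hypothetical helper: reads only the first 8 bytes of the file and checks them.
function looksLikeLegacyDoc(string $path): bool
{
    $head = @file_get_contents($path, false, null, 0, 8);
    return is_string($head) && hasOle2Signature($head);
}
```

A .docx file is a ZIP archive and starts with `PK`, so it fails this check and can be rejected before parsing.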

The comment block at the top of the code explains the approach. I tried it with a random Word 2003 document and it worked well. Give it a try and post any feedback in the comments section below.
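
To see the heuristic in isolation, here is a minimal sketch that applies the same steps to a made-up string instead of a real file. The helper name `extractTextFromBuffer` and the sample data are invented for this example; the splitting, NUL rejection and clean-up mirror the function above.

```php
<?php
// Hypothetical helper: the same heuristic as parseWord(), applied to a string
// instead of a file so it can be tried without a real document.
function extractTextFromBuffer(string $data): string
{
    $outtext = "";
    foreach (explode("\r", $data) as $slice) {
        // Keep only non-empty slices that contain no NUL byte.
        if ($slice !== "" && strpos($slice, "\x00") === false) {
            $outtext .= $slice . " ";
        }
    }
    // Same whitelist as the original regular expression.
    return preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', "", $outtext);
}

// Made-up buffer: two text paragraphs around a binary-looking slice.
$sample = "Hello world\r" . "bin" . "\x00" . "ary\r" . "Second paragraph!\r";
$result = extractTextFromBuffer($sample);
```

Here `$result` ends up as "Hello world Second paragraph " (with a trailing space): the slice containing the NUL byte is discarded, and the exclamation mark is removed because it is not in the whitelist. That last part is worth remembering, since accented letters and most punctuation are stripped as well.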

All credit goes to the creator of this function, Unkwntech from Stack Overflow.