Read Word documents with PHP (up to Word 2003)

Posted by Danny Herran on Feb 10, 2011 in Backend | 8 comments

Reading Word documents with PHP on a Linux box can be a real pain. The usual approach requires COM, which is only available on the Windows platform. However, Unkwntech from Stack Overflow wrote quite a nice function that reads Word documents and extracts their text. It only parses text content, but that is enough for most of the tasks we will be doing anyway.

Before we start, let me tell you that this is a pure PHP solution. It doesn’t require any additional applications. However, if you want a more robust solution to read Word documents, I would strongly recommend Antiword; it is available for Linux, Mac OS X, Windows and many other platforms.

So, let's take a look at Unkwntech's solution, compatible with Word 97/XP/2003:

<?php
/*****************************************************************
This approach uses detection of NUL (chr(0)) and carriage return (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc)
{
    // Read the raw bytes of the document; give up if it cannot be read.
    $contents = @file_get_contents($userDoc);
    if ($contents === false) {
        return "";
    }

    // Split on carriage returns (0x0D), which separate the text slices.
    $lines = explode(chr(0x0D), $contents);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty slices that contain no NUL byte.
        if (strlen($thisline) > 0 && strpos($thisline, chr(0x00)) === false) {
            $outtext .= $thisline . " ";
        }
    }

    // Delete every character outside a small ASCII whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s,.\-@\/_()]/", "", $outtext);
    return $outtext;
}

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;
?>

The explanation is pretty clear from the description. I’ve tried it with some random Word 2003 document and it worked great. Give it a try and post any feedback in the comments section below.
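
To make the steps in the header comment concrete, here is a minimal sketch that runs the same slicing logic on an in-memory string instead of a file. The helper name extractTextSlices() and the sample bytes are made up for illustration; a real .doc file contains far more binary data than this.

```php
<?php
// Hypothetical helper: same slicing logic as parseWord(), but it takes
// the raw bytes directly so it can be tried without a .doc file.
function extractTextSlices($bytes)
{
    $kept = array();
    foreach (explode(chr(0x0D), $bytes) as $slice) {
        // Reject empty slices and slices containing a NUL byte.
        if (strlen($slice) > 0 && strpos($slice, chr(0x00)) === false) {
            $kept[] = $slice;
        }
    }
    return $kept;
}

// Made-up sample: a binary-looking header, two paragraphs, a binary trailer.
$sample = "\xD0\xCF\x11\xE0\x00\x00header\x0D"
        . "First paragraph\x0D"
        . "Second paragraph\x0D"
        . "\x00\x01trailer";

print_r(extractTextSlices($sample));
```

Only the two paragraph slices survive. The word "header" is lost because it shares a slice with NUL bytes, so any readable text that sits next to binary data is dropped along with it.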

All the credit goes to the creator of this function Unkwntech from Stack Overflow.

  • Missing Top Line

    Hi,

    I found that this skips the first line of the document. Any idea on how to prevent that?

    Otherwise works perfectly!

    – Missing Top Line.

    • Missing Top Line

      Hi,

      I fixed it with this:

      function parseWord($userDoc)
      {
          $contents = @file_get_contents($userDoc);
          if ($contents === false) {
              return "";
          }
          $lines = explode(chr(0x0D), $contents);
          $outtext = "";

          // Recover the skipped first line: take whatever follows the
          // last NUL byte in $lines[1].
          if (isset($lines[1])) {
              $pos = strrpos($lines[1], chr(0x00));
              if ($pos !== false) {
                  $outtext .= substr($lines[1], $pos + 1);
              }
          }

          foreach ($lines as $thisline) {
              // Keep only non-empty slices that contain no NUL byte.
              if (strlen($thisline) > 0 && strpos($thisline, chr(0x00)) === false) {
                  $outtext .= $thisline;
              }
          }

          return $outtext;
      }
      
  • How would I use the above code with Antiword or AbiWord? Any example code?
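
One way to call Antiword from PHP is to shell out to its command-line tool and capture what it prints. The sketch below assumes the antiword binary is installed and on the PATH, and that shell_exec() is not disabled on the server. The helper names buildAntiwordCommand() and readWordWithAntiword() are made up for this example. Note that it replaces parseWord() rather than building on it.

```php
<?php
// Hypothetical helper: build the shell command for a given .doc path.
// escapeshellarg() guards against spaces and shell metacharacters.
function buildAntiwordCommand($docPath)
{
    return "antiword " . escapeshellarg($docPath);
}

// Hypothetical helper: run Antiword and return the extracted text,
// or an empty string if the file is missing or nothing was printed.
function readWordWithAntiword($docPath)
{
    if (!is_file($docPath)) {
        return "";
    }
    $output = shell_exec(buildAntiwordCommand($docPath));
    return is_string($output) ? $output : "";
}
```

With that in place, echo readWordWithAntiword("cv.doc"); should print the document text as Antiword renders it.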

  • bapu

    It's simply good code, but I am seeing the document without line breaks. Please help me.
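
One way to keep the breaks, sketched under assumptions: join the kept slices with "\n" instead of a space, and let nl2br() turn those into <br /> tags when the text is printed in an HTML page. The function name joinWithLineBreaks() is made up for illustration, and it takes raw bytes rather than a file name so it can be tried without a document.

```php
<?php
// Hypothetical helper: keep NUL-free slices and separate them with "\n".
function joinWithLineBreaks($bytes)
{
    $kept = array();
    foreach (explode(chr(0x0D), $bytes) as $slice) {
        if (strlen($slice) > 0 && strpos($slice, chr(0x00)) === false) {
            // Clean each slice on its own; the join below adds the breaks.
            $kept[] = preg_replace("/[^a-zA-Z0-9 ,.\-@\/_()]/", "", $slice);
        }
    }
    return implode("\n", $kept);
}

// Made-up sample bytes standing in for a document's contents.
$text = joinWithLineBreaks("\x00junk\x0DLine one\x0DLine two\x0D");

// In an HTML page, nl2br() inserts <br /> before each newline.
echo nl2br(htmlspecialchars($text));
```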

  • Ash

    Does anyone have a version that reads Word 2007 documents yet? I badly need this.
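
Word 2007 saves .docx files, which are ZIP archives whose main text lives in word/document.xml, so the NUL and carriage-return trick above does not apply to them. Here is a rough sketch, assuming PHP's zip extension (ZipArchive) is enabled. The function names docxXmlToText() and parseDocx() are made up for illustration, and the tag handling is deliberately crude: it ignores tables, headers, footnotes and the like.

```php
<?php
// Hypothetical helper: turn the XML of word/document.xml into plain text.
function docxXmlToText($xml)
{
    // Each paragraph is a <w:p> element; end it with a newline.
    $xml = str_replace("</w:p>", "\n", $xml);
    // Remove all remaining tags, then decode entities such as &amp;.
    $text = strip_tags($xml);
    return html_entity_decode($text, ENT_QUOTES | ENT_XML1, "UTF-8");
}

// Hypothetical helper: read a .docx file and return its text, or "" on failure.
function parseDocx($docxPath)
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return "";
    }
    $xml = $zip->getFromName("word/document.xml");
    $zip->close();
    return $xml === false ? "" : docxXmlToText($xml);
}
```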

  • Vic

    Hi, first I want to say thank you for the awesome solution. It's helping me a lot, but I have a comment and a question…

    Comment: to add a line break after each line, I added a new variable, ran preg_replace() on it, and appended the cleaned line to $outtext, something like this:

    function parseWord($userDoc){
        $fileHandle = fopen($userDoc, "rb");
        $line = @fread($fileHandle, filesize($userDoc));
        fclose($fileHandle);
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        $pos = strrpos($lines[1], chr(0x00));
        $outtext .= substr($lines[1], $pos) . "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if ($pos === false && strlen($thisline) > 0) {
                // Clean each kept line; the character class needs its [ ] brackets.
                $newline = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline);
                $outtext .= $newline . "";
            }
        }
        return $outtext;
    }

    And now my question…
    I’m getting a lot of trash characters after the good content of my doc, things like: “ !‰HÀüèæ’…?³KÍ3È IÚzó… ”
    Am I doing something wrong? Can anyone help me with this?

    • Vic

      It's not showing my line break "" on the line:

      $outtext.=$newline."";

      It should go where the quotes are, after =$newline.

  • can not open china doc

    Documents in languages other than English cannot be opened.
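
A likely reason, shown as a small sketch: the cleanup pattern in parseWord() deletes everything outside an ASCII whitelist, so non-English characters cannot survive it. The sample string below is made up and assumes the script is saved as UTF-8. In a real .doc file the problem probably starts even earlier, because Word generally stores such text as UTF-16, where ordinary letters carry NUL bytes and their slices get rejected.

```php
<?php
// The cleanup pattern from parseWord(): anything outside this ASCII
// whitelist is deleted, including every byte of a multi-byte character.
$pattern = "/[^a-zA-Z0-9\s,.\-@\/_()]/";

$sample  = "Resume 简历 2011";   // made-up UTF-8 sample text
$cleaned = preg_replace($pattern, "", $sample);

echo $cleaned;   // the Chinese characters are gone, the ASCII text remains
```

For documents like these, Antiword is probably the safer route.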