Use antiword to extract text from .doc files
I know what you're thinking: "Why not just use OpenOffice to get the text you need?" There's a good reason. If you've ever used one word processor to get raw text out of another, you know that stray formatting is often left behind in the text. End-of-line characters and the like can remain, which makes cutting and pasting text from one source to another a problem (especially when going from a .doc file to an HTML end point). This has caused me plenty of issues when I have written articles offline to be pasted into, say, ghacks. I have seen formatting strings left behind, only to have to go back and delete them.
When extracting text with a tool like antiword you won't have this problem. And even though antiword is a command-line-only tool, it isn't complicated to install or use. With this tool you can either extract the text immediately to standard output (the terminal window) or you can extract it to a text file. Both methods are simple, and both are effective.
The installation of antiword can be done two ways: command line or GUI. If you want to use the GUI, fire up your Add/Remove Software utility, search for antiword, select the result, and click Apply. You will also want to install catdoc, which can be installed the same way.
If you are partial to the command line you can open up a console and issue a command similar to:
sudo apt-get install antiword catdoc
yum install antiword catdoc
Use the first on Debian- or Ubuntu-based systems and the second on Fedora or Red Hat-based systems (run yum as root or with sudo).
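To confirm that both tools actually landed on your PATH after installing, a quick check along these lines should do. This is only a sketch: the function name check_tools is made up for illustration and is not part of antiword or catdoc.

```shell
# Hypothetical helper: report whether antiword and catdoc are on the PATH.
# Returns 0 when both are found, 1 when at least one is missing.
check_tools() {
    missing=0
    for tool in antiword catdoc; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_tools after the install; if either line says "missing", the package did not install or is not on your PATH.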
Now, how is this tool used?
The basic structure of the antiword command is:
antiword [OPTIONS] file.doc
When the command structure above is used you will see the text from the .doc file scroll by in the console window. The options are not many, but are useful:
-a [PAPERSIZE] Output in Adobe PDF format. You have to specify the paper size for the document. Valid paper sizes are: a3, a4, a5, b4, b5, executive, folio, legal, letter, note, quarto, statement, or tabloid.
-f Output in formatted text form. This will print bold text like *bold*, italics like /italics/, and underlined text as _underlined_.
-i This defines the image level. 0 = use non-standard Ghostscript extensions. 1 = No images. 2 = PostScript level 2. 3 = PostScript level 3.
-m Which Unicode mapping file to use. You can find a listing of available mapping files in /usr/share/antiword.
So to see the text from file.doc you would issue the command:
antiword -f file.doc
which would quickly scroll the content of the file in the console window. That is not much help unless you only need to copy and paste the final bit, or you can maximize the console to see all of the text. Instead you can redirect the text to a file like so:
antiword -f file.doc > file.txt
This text can now be viewed in any text editor or pager.
Let's say you want to export the text from a .doc document into a .pdf document. Believe it or not, this is simple as well. For this you will need the -a option along with the associated paper size. So let's say we want to export the document into a letter-sized PDF document. To do this, issue the command:
antiword -a letter file.doc > file.pdf
You might run into mapping issues here. If you do, you will most likely need to tell antiword to use the 8859-1 mapping with the command:
antiword -m 8859-1 -a letter file.doc > file.pdf
The file.pdf file will be a readable PDF document you can now use.
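If you convert to PDF often, the command can be wrapped in a small shell function. The following is a minimal sketch, not anything shipped with antiword: the name doc_to_pdf and the default of letter paper are my own assumptions, and it relies only on the -a option from the list above. It derives the output name from the input so the redirect can never land on the .doc file itself.

```shell
# Hypothetical helper: convert one .doc file to PDF with antiword.
# Usage: doc_to_pdf file.doc [papersize]
doc_to_pdf() {
    doc=$1
    paper=${2:-letter}        # assumed default; pass a4, legal, etc. to override
    pdf="${doc%.doc}.pdf"     # report.doc -> report.pdf
    # Write to a separate file; never redirect onto the .doc itself.
    antiword -a "$paper" "$doc" > "$pdf"
}
```

For example, doc_to_pdf file.doc a4 would write an A4-sized file.pdf next to file.doc.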
Obviously this is only the "bare bones" of antiword. Using this command and others you can really get creative and set up automated extraction scripts and much more. If you do much pasting into formats that can't handle carriage returns or end-of-line marks, antiword is the perfect solution for you.
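As a starting point for that kind of automation, here is a rough sketch of a function that extracts the text from every .doc file in a directory. The name extract_all is hypothetical, and the sketch assumes the files end in a lowercase .doc extension and that plain text output (no options) is what you want.

```shell
# Hypothetical helper: extract text from every .doc file in a directory.
# Usage: extract_all [directory]   (defaults to the current directory)
extract_all() {
    dir=${1:-.}
    for doc in "$dir"/*.doc; do
        [ -e "$doc" ] || continue      # skip when nothing matches the glob
        txt="${doc%.doc}.txt"          # notes.doc -> notes.txt
        if antiword "$doc" > "$txt"; then
            echo "wrote $txt"
        else
            echo "failed: $doc" >&2
            rm -f "$txt"               # do not leave a partial file behind
        fi
    done
}
```

Calling extract_all ~/articles would leave a .txt file beside each .doc file in that directory, ready to paste wherever you need it.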