Use antiword to extract text from .doc files - gHacks Tech News

Use antiword to extract text from .doc files

I know what you're thinking: "Why not just use OpenOffice to get the text you need?" There's a good reason. If you've ever used one word processor to get raw text from another, you know that stray formatting is often left behind in the text. End-of-line characters and the like can remain, which makes cutting and pasting text from one source to another a problem (especially when going from a .doc file to an HTML end point). This has caused me plenty of issues when I have written articles offline to be pasted into, say, gHacks. I have seen formatting strings left behind, only to have to go back and delete them.

When extracting text with a tool like antiword you won't have this problem. And even though antiword is a command-line-only tool, it isn't complicated to install or use. With this tool you can either print the text immediately to standard output (the terminal window) or extract it to a text file. Both methods are simple and effective.

Installing antiword

The installation of antiword can be done two ways: command line or GUI. If you want to use the GUI, fire up your Add/Remove Software utility, search for antiword, select the result, and click Apply. You will also want to install catdoc, which can be installed the same way.

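If you are partial to the command line you can open up a console and issue one of the following commands. On Debian or Ubuntu based distributions:

sudo apt-get install antiword catdoc

On Fedora and other yum-based distributions (run as root, or prefix the command with sudo):

yum install antiword catdoc

One of those should install the applications on your machine.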
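If you are not sure which of the two commands applies to your machine, a small shell function can check which package manager is present. This is only a sketch: install_hint is a hypothetical helper name, it covers just the two package managers mentioned above, and it prints the command rather than running it.

```shell
# Print the install command that matches the package manager found on this machine.
# install_hint is a made-up name for illustration; it is not part of antiword.
install_hint() {
    if command -v apt-get >/dev/null 2>&1; then
        echo "sudo apt-get install antiword catdoc"
    elif command -v yum >/dev/null 2>&1; then
        echo "sudo yum install antiword catdoc"
    else
        # Neither tool was found; report on stderr and signal failure.
        echo "no supported package manager found" >&2
        return 1
    fi
}
```

Calling install_hint prints one of the two commands, which you can then copy and run yourself.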

Now, how is this tool used?

Basic usage

The basic structure of the antiword command is:

antiword [OPTIONS] file.doc

When the command structure above is used you will see the text from the .doc file scroll by in the console window. The options are not many, but are useful:

-a [PAPERSIZE] Output in Adobe PDF format. You have to specify the papersize for the document. Valid papersizes are: a3, a4, a5, b4, b5, executive, folio, legal, letter, note, quarto, statement, or tabloid.

-f Output in formatted text form. This will print bold text like *bold*, italics like /italics/, and underlined text as _underlined_.

-i This defines the image level used in PostScript mode. 0 = use non-standard Ghostscript extensions. 1 = No images. 2 = PostScript level 2. 3 = PostScript level 3.

-m Which Unicode mapping file to use. You can find a listing of available mapping files in /usr/share/antiword.

So to see the text from file.doc you would issue the command:

antiword -f file.doc

which would quickly scroll the content of the file past in the console window. That is not much help unless you only need to copy and paste the final bit, or you can maximize the console to see all of the text. Instead you can redirect the text to a file like so:

antiword -f file.doc > file.txt

This text can now be viewed with the command:

less file.txt
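The same redirect can be wrapped in a loop when there is more than one file to convert. The sketch below assumes a POSIX shell and that the files end in .doc; doc_to_txt and convert_all are hypothetical helper names, not part of antiword.

```shell
# Convert one .doc file to a .txt file with the same base name.
# doc_to_txt is a made-up helper name for illustration.
doc_to_txt() {
    src=$1
    out="${src%.doc}.txt"          # file.doc -> file.txt
    # Keep the output only if antiword succeeds; remove partial output otherwise.
    if antiword -f "$src" > "$out"; then
        echo "$out"
    else
        rm -f "$out"
        return 1
    fi
}

# Convert every .doc file in the current directory.
# A file that fails to convert is skipped and the loop carries on.
convert_all() {
    for f in ./*.doc; do
        [ -e "$f" ] || continue    # the glob matched nothing
        doc_to_txt "$f"
    done
}
```

Running convert_all in a directory of .doc files prints the name of each text file it writes.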

PDF format

Let's say you want to export the text from a .doc document into a .pdf document. Believe it or not this is simple as well. For this you will need the -a option along with the associated paper size (the similar -p option produces PostScript rather than PDF). So let's say we want to export the document into a letter-sized PDF document. To do this issue the command:

antiword -a letter file.doc > file.pdf

You might run into mapping issues here. If you do, you will most likely need to tell antiword to use the 8859-1 mapping (the mapping files in /usr/share/antiword have names like 8859-1.txt) with the command:

antiword -m 8859-1.txt -a letter file.doc > file.pdf

Take care to redirect to a new file name: sending the output back to file.doc would overwrite the original document. The file.pdf file will be a readable PDF document you can now use.
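To avoid typing the output name by hand (and accidentally redirecting onto the source file), the PDF conversion can be wrapped in a small function that derives the name itself. This is a sketch under a few assumptions: it uses the -a option listed among the options above for PDF output, doc_to_pdf is a hypothetical helper name, and the optional mapping argument is assumed to be a file name such as 8859-1.txt from /usr/share/antiword.

```shell
# Convert a .doc file to a PDF with the same base name.
# Usage: doc_to_pdf FILE [PAPERSIZE] [MAPPING]
# doc_to_pdf is a made-up helper name for illustration.
doc_to_pdf() {
    src=$1
    paper=${2:-letter}             # default paper size
    mapping=${3:-}                 # optional mapping file, e.g. 8859-1.txt
    out="${src%.doc}.pdf"          # file.doc -> file.pdf, never the source itself

    # Refuse to run on a missing input so no empty PDF is left behind.
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }

    if [ -n "$mapping" ]; then
        antiword -m "$mapping" -a "$paper" "$src" > "$out"
    else
        antiword -a "$paper" "$src" > "$out"
    fi
}
```

For example, doc_to_pdf file.doc a4 would write file.pdf in A4 size, and doc_to_pdf file.doc letter 8859-1.txt would add the mapping option.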

Final thoughts

Obviously this is only the "bare bones" of antiword. Using this command along with others, you can really get creative and set up automated extraction scripts and much more. If you do much pasting into formats that can't handle carriage returns or end-of-line marks, antiword is the perfect solution for you.

Comments

  1. Ralph said on June 8, 2009 at 10:11 pm

    Why not just open the doc in Word and “Save as” DOS Text?

  2. Jack Wallen said on June 9, 2009 at 1:23 am

    If you’re using Linux then 1) you don’t have Word (unless you are running it in Crossover Office) and 2) Even saving a .doc in .txt format in OpenOffice can cause problems in certain web apps (like pasting text into WordPress.)

  3. adam said on July 3, 2009 at 10:06 am

    M$ word needs a valid licence – this costs money also not much use on a server.
