Author Topic: NEED: Microsoft Word (doc/docx) to text converter.  (Read 7986 times)

tn_dang

  • Supporting Member
  • Joined in 2008
  • Posts: 28
NEED: Microsoft Word (doc/docx) to text converter.
« on: August 17, 2010, 02:32 PM »
Hello:

I would like to know if anybody knows of a utility which:
  • converts Microsoft Word documents (doc/docx) to text.
  • is free for commercial use.
  • can be invoked from the command line.
  • runs on Windows XP and later.
  • can be run without Microsoft Word installed on the computer.

Thank you for your time and effort.

Regards,

Thanh
« Last Edit: August 17, 2010, 02:34 PM by tn_dang »

justice

  • Supporting Member
  • Joined in 2006
  • Posts: 1,898
Re: NEED: Microsoft Word (doc/docx) to text converter.
« Reply #1 on: August 17, 2010, 03:14 PM »
WordPad for Windows 7

edit:
[Attached screenshot: 2010-08-18_083722.png]
« Last Edit: August 18, 2010, 02:38 AM by justice »

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,199
Re: NEED: Microsoft Word (doc/docx) to text converter.
« Reply #2 on: August 17, 2010, 04:07 PM »
I don't know about DOCX, but for plain DOC, Antiword for Windows worked well for me.

There's also catdoc, which was more for DOS and Linux, but I could never get it to work.  I see there's now a 32-bit Windows version, catdoc ported to Windows, but the author says it doesn't support DOCX either.  A quick Google search turned up a few leads that might be worth following.
« Last Edit: August 17, 2010, 04:10 PM by rjbull »

mwb1100

  • Supporting Member
  • Joined in 2006
  • Posts: 1,645
Re: NEED: Microsoft Word (doc/docx) to text converter.
« Reply #3 on: August 17, 2010, 04:23 PM »
I don't think it fully meets your requirements, but might get you part way there.  Windows includes a "Generic / Text Only" printer driver that you can install to the FILE: port.

As far as I remember, it does ask several questions when you print to it, but maybe a scripting or AHK guru could automate those.  Also, generally Word needs to be installed to print the document, but a free Viewer/Compatibility Pack (http://support.microsoft.com/kb/891090) may work well enough for you.

Word does have a 'save to text' option that could be scripted, but I suspect that the free Viewer/Compatibility Pack versions don't have that ability.

fenixproductions

  • Honorary Member
  • Joined in 2006
  • Posts: 1,186
Re: NEED: Microsoft Word (doc/docx) to text converter.
« Reply #4 on: August 17, 2010, 06:50 PM »
This may help:
http://www31.ocn.ne....ishida/xdoc2txt.html

Don't be scared by the Japanese :)

Also, anyone skilled with Visual Basic 6 could build something from this:
http://freemind.s57....Plugin/en/index.html

IainB

  • Supporting Member
  • Joined in 2008
  • Posts: 7,540
  • @Slartibartfarst
Re: NEED: Microsoft Word (doc/docx) to text converter.
« Reply #5 on: August 17, 2010, 10:29 PM »
    @tn_dang:
    When Microsoft released their Office Open XML, I believe their MS Office Word documents then conformed to that standard - i.e., including files with extension .docx

    I have noticed that Google Docs can import .docx files and seems to convert them to Google Docs format without any apparent loss of formatting. I think they become something rather like an RTF (Rich Text Format) document. That does not always seem to be the case when converting .doc files, though.
    As an aside, I have not tested whether the .docx-to-Google Docs conversion preserves formatting 100% of the time, but it has worked on the few occasions when I have used it, i.e., when I wanted to share files with people whose software could not read .docx files.

    With the latest changes to Google docs functionality, I think you now have the option to either:
    (a) import .docx files to Google docs, and convert them to Google docs format as they are being imported, or
    (b) import .docx files to Google docs, and convert them to Google docs format after they have been imported.
    Furthermore, I gather that you can do this conversion with single documents, or with batches of documents.

    Once a file has been converted to Google docs format:
    (a) The storage space it takes up does not count against your remaining Google Docs storage space.
    (b) The file can be Exported in these formats:
    • HTML
    • Open Document
    • PDF
    • RTF
    • Plain Text <----------THIS IS WHAT YOU WANTED
    • Microsoft Word

    The criteria you state include "can be invoked at the command line", and this approach I have described does not meet that criterion.
    Anyway, I would suggest that you try it and see for yourself, if you are interested.
    Hope this helps or is of use.[/list] (Why is this editor tacking "list" onto my post?)
    « Last Edit: August 17, 2010, 11:14 PM by IainB »

    fenixproductions

    • Honorary Member
    • Joined in 2006
    • Posts: 1,186
    Re: NEED: Microsoft Word (doc/docx) to text converter.
    « Reply #6 on: August 18, 2010, 02:22 AM »
    With a little programming skill it should be very easy (for a non-lazy person) to write a simple DOCX->TXT converter.

    I did something like that for TC in the past:
    http://www.totalcmd....g/office2007wlx.html
    http://www.totalcmd..../office2007_wdx.html

    In general, only two things need to be done:
    - unpack the DOCX file (it is a ZIP archive, you know),
    - strip the XML tags from "document.xml" (located in the "word" folder).
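The two steps above can be sketched in a few lines of Python. This is only a rough illustration, not the code from the plugins mentioned here: the function name docx_to_text is made up for the example, and instead of literally stripping tags it walks the XML with a parser and keeps the text runs, so paragraph breaks and tabs survive. It assumes the main body lives in "word/document.xml" and ignores headers, footers, footnotes and anything else stored in other parts of the archive.

```python
import zipfile
from xml.etree import ElementTree

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration only.
    """
    # Step 1: a .docx file is a ZIP archive; read the main document part.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: keep only the text, dropping all markup.
    root = ElementTree.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        pieces = []
        for node in para.iter():
            if node.tag == "{%s}t" % W_NS:        # a run of text
                pieces.append(node.text or "")
            elif node.tag == "{%s}tab" % W_NS:    # tab character
                pieces.append("\t")
            elif node.tag in ("{%s}br" % W_NS, "{%s}cr" % W_NS):  # line break
                pieces.append("\n")
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Calling docx_to_text("report.docx") (file name assumed) returns a string that can be written to a .txt file. Wrapped in a small script, that would cover the command-line and no-Word-installed requirements for .docx, though not for the older binary .doc format.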

    P.S. The plugins mentioned above are open source, so anyone can grab them. If I remember correctly, the WDX one already implements a conversion method, which is used for text searching in TC.

    Krishean

    • Honorary Member
    • Joined in 2008
    • Posts: 75
    • I like pie
    Re: NEED: Microsoft Word (doc/docx) to text converter.
    « Reply #7 on: August 18, 2010, 02:34 AM »
    I believe OpenOffice can do what you ask.
    A while ago I had to convert files from one format to another, and this article came in handy: http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
    It may be a little outdated, but the concept still holds true, and OpenOffice 3 can read Microsoft's docx format.

    EDIT: there's also a command-line utility here: http://odf-converter.sourceforge.net/ for converting between Office Open XML formats (docx) and OpenOffice formats (ODF).
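For what it's worth, here is a rough Python sketch of driving an office suite from the command line. It does not reproduce the macro approach from the article linked above. It assumes a LibreOffice-style "soffice" binary on the PATH that accepts the --headless and --convert-to options; older OpenOffice 3 builds may not have them. The function names and the "soffice" default are assumptions for the example.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the argument list for a headless doc/docx -> txt conversion.

    Assumes a soffice binary that supports --headless and --convert-to.
    """
    return [
        soffice,
        "--headless",                  # no GUI
        "--convert-to", "txt:Text",    # target extension and export filter
        "--outdir", str(out_dir),      # where the .txt file is written
        str(doc_path),
    ]


def convert_to_text(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the expected path of the .txt file."""
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".txt")
```

The command is built separately from the call that runs it, so it can be printed and checked before anything is launched.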
    Any sufficiently advanced technology is indistinguishable from magic.

    - Arthur C. Clarke
    « Last Edit: August 18, 2010, 02:37 AM by Krishean »

    IainB

    • Supporting Member
    • Joined in 2008
    • Posts: 7,540
    • @Slartibartfarst
    Re: NEED: Microsoft Word (doc/docx) to text converter.
    « Reply #8 on: August 24, 2010, 10:14 AM »
    @tn_dang: Not sure whether this might be of use/help:

    Text Mining Tool
    Text Mining Tool is a freeware program for extraction of text from files of the types: PDF, DOC, RTF, CHM, HTML
    It apparently works as a converter of PDF, DOC, RTF, CHM and HTML files to text.

    mwb1100

    • Supporting Member
    • Joined in 2006
    • Posts: 1,645
    Re: NEED: Microsoft Word (doc/docx) to text converter.
    « Reply #9 on: August 24, 2010, 01:18 PM »
    An interesting and simple possibility might be to get a Win32 copy of the Unix 'strings' filter program.  The author of "Pro Git" suggests using it to diff versions of Word files: http://progit.org/book/ch7-2.html

    I imagine that formatting is completely shot to hell,  but depending on what you want to do, it might be good enough.

    Maybe the SysInternals version would work: http://technet.micro...ernals/bb897439.aspx
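To give an idea of what a 'strings'-style filter does, here is a minimal Python sketch. It is not the Unix or SysInternals implementation. It pulls out runs of printable ASCII characters, and also runs stored as 16-bit little-endian characters, since binary Word files often hold text that way. The function name extract_strings and the default minimum length of 4 are assumptions for the example.

```python
import re


def extract_strings(data, min_len=4):
    """Return printable text runs found in a bytes object, in file order.

    Scans for plain ASCII runs and for UTF-16LE runs (ASCII character
    followed by a zero byte). Hypothetical helper for illustration only.
    """
    n = str(min_len).encode("ascii")
    ascii_pat = re.compile(rb"[\x20-\x7e]{" + n + rb",}")
    wide_pat = re.compile(rb"(?:[\x20-\x7e]\x00){" + n + rb",}")

    found = [(m.start(), m.group().decode("ascii"))
             for m in ascii_pat.finditer(data)]
    found += [(m.start(), m.group().decode("utf-16-le"))
              for m in wide_pat.finditer(data)]

    # Sort by byte offset so the output follows the order in the file.
    return [text for _, text in sorted(found)]
```

Reading a file with open(path, "rb").read() and passing the bytes to extract_strings gives a list of text fragments. As noted above, formatting is lost, and stray fragments of binary data that happen to look like text will show up too.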