NEED: Microsoft Word (doc/docx) to text converter.

IainB:
@tn_dang:
When Microsoft released its Office Open XML standard, I believe MS Office Word documents saved with the .docx extension then conformed to that standard.

I have noticed that Google Docs can import .docx files and seems to convert them to Google Docs format without any apparent loss of formatting - I think they become something rather like an RTF (Rich Text Format) document. That does not always seem to be the case when converting .doc files, though.
As an aside, I have not tested whether the .docx ---> Google Docs conversion preserves formatting 100% of the time, but it has worked on the few occasions when I have used it, i.e., when I wanted to share files with people whose software could not read .docx files.

With the latest changes to Google docs functionality, I think you now have the option to either:
(a) import .docx files to Google docs, and convert them to Google docs format as they are being imported, or
(b) import .docx files to Google docs, and convert them to Google docs format after they have been imported.
Furthermore, I gather that you can do this conversion with single documents, or with batches of documents.

Once a file has been converted to Google docs format:
(a) The storage space it takes up does not count against your remaining Google Docs storage space.
(b) The file can be Exported in these formats:

* HTML
* Open Document
* PDF
* RTF
* Plain Text <----------THIS IS WHAT YOU WANTED
* Microsoft Word
The criteria you state include "can be invoked at the command line", and this approach I have described does not meet that criterion.
Anyway, I would suggest that you try it and see for yourself, if you are interested.
Hope this helps or is of use.

fenixproductions:
With a little programming skill it should be very easy (for a non-lazy person) to write a simple DOCX->TXT converter.

I did something like that for TC in the past:
http://www.totalcmd.net/plugring/office2007wlx.html
http://www.totalcmd.net/plugring/office2007_wdx.html

In general only two things need to be done:
- unpack the DOCX file (it is a ZIP archive, you know),
- strip the XML tags from "document.xml" (located in the "word" folder).
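The two steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not the code from the TC plugins: the function name `docx_to_text` is made up for the example, and it assumes that the body text lives in `w:t` elements grouped under `w:p` paragraphs in "word/document.xml". It ignores headers, footers, footnotes, tabs and line breaks, so treat the output as a rough text dump.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration only.
    """
    # Step 1: a .docx file is a ZIP archive, so read the main part directly.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: drop the XML markup and keep only the text runs.
    root = ET.fromstring(xml_bytes)
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Calling `print(docx_to_text("report.docx"))` from a small script (the file name is just a placeholder) would make this usable from the command line. Parsing the XML rather than deleting tags with a regular expression keeps paragraph boundaries and decodes entities such as `&amp;` properly.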

P.S. The plugins mentioned above are open source, so anyone can grab them. If I remember correctly, the WDX one already implements a conversion method, which is used for text searching in TC.

Krishean:
I believe OpenOffice can do what you ask.
A while ago I had to convert files from one format to another, and this article came in handy: http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
It may be a little outdated, but the concept still holds true, and OpenOffice 3 supports reading Microsoft's docx formats.

EDIT: There's also a command line utility here: http://odf-converter.sourceforge.net/ for converting between Office Open XML formats (docx) and OpenOffice formats (odf).

IainB:
@tn_dang: Not sure whether this might be of use/help:

Text Mining Tool
Text Mining Tool is a freeware program for extracting text from files of these types: PDF, DOC, RTF, CHM, HTML.
It apparently works as a converter of PDF, DOC, RTF, CHM and HTML files to text.

mwb1100:
An interesting and simple possibility might be to get a Win32 copy of the Unix 'strings' filter program. The author of "Pro Git" suggests using it to be able to diff versions of Word files: http://progit.org/book/ch7-2.html

I imagine that formatting is completely shot to hell, but depending on what you want to do, it might be good enough.

Maybe the SysInternals version would work: http://technet.microsoft.com/en-us/sysinternals/bb897439.aspx
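
For anyone curious what a 'strings'-style filter boils down to, here is a rough Python sketch of the idea. It is not the Unix or SysInternals tool: the function name `extract_strings` and the default minimum length of 4 are assumptions made for the example. It pulls out runs of printable ASCII, plus runs where each printable character is followed by a zero byte (how plain Latin text looks when stored as UTF-16LE, which binary Word files often use).

```python
import re


def extract_strings(data, min_len=4):
    """Return printable text runs found in raw bytes, in file order.

    Hypothetical helper for illustration; real 'strings' tools have more options.
    """
    # First alternative: UTF-16LE style runs (printable byte + zero byte).
    # Second alternative: plain runs of printable ASCII bytes.
    pattern = re.compile(
        rb"(?:[\x20-\x7e]\x00){%d,}|[\x20-\x7e]{%d,}" % (min_len, min_len)
    )
    results = []
    for match in pattern.finditer(data):
        chunk = match.group()
        # Only the UTF-16LE alternative can contain zero bytes.
        encoding = "utf-16-le" if b"\x00" in chunk else "ascii"
        results.append(chunk.decode(encoding))
    return results
```

Reading a document with `open(path, "rb").read()` and passing the bytes to `extract_strings` would give a list of text fragments. As noted above, all formatting is lost, and some fragments will be internal metadata rather than document text.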
