Manually recovering text from corrupt MS Word DOC and DOCX files


Ever get one of these?

Ever get one of these and then have the suggested repair option fail?

Igor has posted his first in a promised series of tutorials over at Dedoimedo that might come in handy some day:

How to recover corrupt Microsoft Word files
Updated: November 11, 2013

This is a hot topic. A very hot topic. What do you do if you have several Word files, either saved as .doc or .docx, which no longer open? You do not have backups, which you should, or perhaps the backups are corrupt, too? How do you gain back the long page after page of valuable text? This article will help you with this unpleasant task.

I will show you two somewhat unusual methods of recovering contents of your files. The success is not guaranteed, and your final formatting might not be preserved. But you will definitely be able to save your text, which ought to be the most important part. Best of all, Linux to the rescue! We will use Linux to do some of the recovery work. Follow me.


Let's align expectations upfront. File recovery is a tedious guesswork. Sometimes, it will work, sometimes it won't. You can make a best effort to retrieve the data, but it may really not be there. If data has been overwritten with nonsense, you will not be able to reassemble what was lost, ever. For instance, if a certain file has zeros accidentally written in its middle, the bytes that represent actual content will be gone.

Moreover, what I am showing you here is an incomplete workaround. There's no exact science, and most likely, no two cases will ever be quite the same. On top of that, some basic expertise is needed, including the ability to use regular expressions to some extent. The Linux requirement can also make it more difficult for most Windows users. However, between the tough choice of losing everything and hopefully recovering 50-70% of your missing stuff, you should definitely give this tutorial a try. It's free, it's non-destructive, so you can always try expensive professional services later on. There's always time to give someone your money.
I will be spending the coming months breaking and corrupting Word files in all sorts of ways, to see if I can find anything that can be of generic use for a wide population of my readers and their friends and family. Be patient, the tutorial shall yet arrive, from out of fire and smoke of despair. Or something like that.


--- End quote ---

Knew some of this. But I also picked up a few things I didn't. Definitely worth the read.

Article link here.



I would like to remark that, sad as it is, .doc and .docx are two wholly different use cases. For about a year in my old job, I was the only one who knew that the contract-writing software didn't in fact accept anything in .docx, so I converted the files behind the scenes.

Broader level:
The "expectations" remark (paragraphs!) is about right.

If you're dealing with dead files that by all rights should open, you're already in trouble. The rest is guessing and hoping you get lucky. Some ideas:

A. Try to save it as a Rich Text (RTF) file. Sometimes that's a middle ground between full formatting and pure text.

B. Change file formats.
Sometimes the native program X refuses to behave, but the "import" function of program Y or Z works. Not to be sloppy, but in this category just save a "scratch" copy to play with, and abuse the scratch copy as aggressively as you like. Since you're looking at Data Zero on the real copy, who cares? Typical tricks include just changing the file extension of the copy to .rtf, ignoring the crap, and saving 38 of the 50 important lines of text! Other tricks include opening the copy in custom or semi-custom software, and more.
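
For anyone who wants to try the "scratch copy, ignore the crap" idea by hand, here is a minimal Python sketch. It is not the method from the article, just one way to do it: the function names and the ".scratch" suffix are made up for illustration, and it only catches plain ASCII and ASCII-range UTF-16LE runs, so accented characters and formatting will be lost.

```python
import re
import shutil
from pathlib import Path


def make_scratch_copy(original, suffix=".scratch"):
    """Copy the damaged file so the original is never touched.

    The ".scratch" suffix is an arbitrary choice for this sketch.
    """
    src = Path(original)
    dst = src.with_name(src.name + suffix)
    shutil.copy2(src, dst)
    return dst


def salvage_text(data, min_len=6):
    """Return readable runs found in raw bytes, in file order.

    Two kinds of runs are searched for: plain 8-bit ASCII, and
    UTF-16LE (an ASCII character followed by a zero byte).
    Runs shorter than min_len characters are treated as noise.
    """
    ascii_run = rb"[\x20-\x7e]{%d,}" % min_len
    utf16_run = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    # Try the UTF-16LE alternative first at each position.
    pattern = re.compile(utf16_run + b"|" + ascii_run)

    pieces = []
    for match in pattern.finditer(data):
        raw = match.group()
        if b"\x00" in raw:
            pieces.append(raw.decode("utf-16-le"))
        else:
            pieces.append(raw.decode("ascii"))
    return pieces
```

Typical use would be something like `salvage_text(make_scratch_copy("broken.doc").read_bytes())`, then weeding out by eye the runs that are font names and other internal junk rather than your own text.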

You should know that the first question must be whether your file is .doc, .docx, or something else.
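
The extension can lie (especially after the renaming tricks above), so one way to answer that question is to look at the first few bytes of the file. A legacy .doc is an OLE2 compound file, a .docx is a ZIP archive, and an RTF file starts with "{\rtf". The sketch below is only a rough check under those assumptions: the function names are made up, other Office formats share the same containers (an .xls is also OLE2, an .xlsx is also ZIP), and a file whose header itself got damaged will come back as "unknown".

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc container
ZIP_MAGIC = b"PK\x03\x04"                        # .docx is a ZIP archive
RTF_MAGIC = b"{\\rtf"                            # Rich Text Format


def sniff_format(header):
    """Guess the container type from the leading bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"   # .doc and other pre-2007 Office files
    if header.startswith(ZIP_MAGIC):
        return "zip"    # .docx and other ZIP-based formats
    if header.startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of a file and guess its container type."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(8))
```

If a file named .doc comes back as "zip" or "rtf", it is worth renaming a scratch copy to the matching extension before trying anything more drastic.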

After reading the tips on the site, an idea popped into my head. There is still one option to try. You have software like BCompare, which is capable of showing differences between 2 or 3 text-based files at once. However, it can also look for differences between .doc(x), .pdf and image file types. It converts such files to text before it starts comparing (without altering the file content). A bonus is that it tries to keep as much of the paragraph structure as possible.

This is quite a powerful feature of BCompare. It isn't free, but there might be free/open source alternatives with similar functionality.

I once recovered a damaged .doc (or .docx?) file that wouldn't open in Word by using a compatible product, SoftMaker Office 2012. I was able to recover the text, but not the full formatting, AFAIR.
Maybe this could work again?

