DonationCoder.com Software > Post New Requests Here

Extracting paragraphs from a word file (doc, rtf, docx)

Lintalist:
I haven't read it too closely, but just a suggestion: convert the RTF to a text file first; from there it is pretty easy, I'd say. Text = good :)

Suggestions:

1. Use an RTF-to-text command-line tool (many are available, including pandoc; search for rtf2txt and you'll find quite a few).

2. Or read the RTF file into the clipboard (in AHK I'd use the WinClip library) and then read the plain-text format from memory, so you again have "text" to work with.
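
A minimal sketch of suggestion 1 in Python, assuming pandoc is installed, on the PATH, and recent enough to read RTF. The function names and file paths are made up for illustration. Treating one or more blank lines as the paragraph separator is also an assumption about how the converted text looks.

```python
import re
import subprocess


def rtf_to_text(rtf_path, txt_path):
    """Convert an RTF file to plain text by calling the pandoc command-line tool.

    Assumes pandoc is installed and supports the "rtf" input format.
    """
    subprocess.run(
        ["pandoc", "--from", "rtf", "--to", "plain", "--wrap=none",
         rtf_path, "--output", txt_path],
        check=True,  # raise an error if pandoc fails
    )


def split_paragraphs(text):
    """Split plain text into paragraphs separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Once the text file exists, picking out the wanted paragraphs is plain string handling, for example `split_paragraphs(open("input.txt", encoding="utf-8").read())`.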

Ath:
You decide. I think there is a lot of possibilities.
-Contro (March 25, 2017, 05:47 AM)
--- End quote ---
Well, I'm not picking this one up, sorry, as it's not my cup of tea.
As Lintalist said, "Text = good": that's quite easy to process. rtf/doc/pdf is doable if only the plain-text parts need handling (like my ScriptLineCounter does); all other stuff is a PITA :tellme:

Any other takers?

Contro:
I am trying.
When done I will put the answer here.
It does not seem to be the easy part...
 :-*

Contro:
Tomos' method, simplified  :P

Splitting a Word file by manipulating its text is almost impossible. XML files are difficult...
It seems easier to try with a PDF file.
So I converted the RTF file to PDF.
Then I can split it into pages.
But it seems better to split by bookmarks.
The original RTF has no bookmarks, though, and with Adobe I get a lot of useless bookmarks.
If I do it manually, I have to create fewer than 10 bookmarks.
So I did.

SourceForge lists several free PDF splitters that split by bookmarks, but all my trials failed.
So I tried PDFsam Basic, which is open source.

I have 9 bookmarks but only get 7 split PDFs. What happened?
What PDFsam (and, I think, the others too) really does is split the PDF at the page that contains the bookmark. If a page has two bookmarks, we only get one PDF...

So splitting a PDF file by bookmark is really a split by ranges of pages. No text of any kind is actually extracted.
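
The behaviour described above can be modelled in a few lines of Python. This is only a sketch of the observed result, not PDFsam's actual code. The function name and the bookmark data are hypothetical. It assumes 1-based page numbers and that each output file runs from a bookmarked page up to the page before the next bookmarked page.

```python
def split_ranges(bookmarks, total_pages):
    """Return (name, first_page, last_page) for each output file.

    bookmarks is a list of (name, page) pairs with 1-based page numbers.
    Bookmarks that share a page collapse into one range, named after
    the first bookmark found on that page.
    """
    first_on_page = {}
    for name, page in sorted(bookmarks, key=lambda b: b[1]):
        first_on_page.setdefault(page, name)  # keep only the first name per page

    starts = sorted(first_on_page)
    ranges = []
    for i, start in enumerate(starts):
        # A range ends just before the next bookmarked page, or at the last page.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((first_on_page[start], start, end))
    return ranges


# Hypothetical example: 9 bookmarks, but two pages carry two bookmarks each.
bookmarks = [("A", 1), ("B", 2), ("C", 2), ("D", 4), ("E", 5),
             ("F", 5), ("G", 7), ("H", 9), ("I", 11)]
for name, first, last in split_ranges(bookmarks, total_pages=12):
    print(name, first, last)
```

With these made-up numbers, 9 bookmarks sit on 7 distinct pages, so only 7 ranges come out. Pages before the first bookmarked page are not covered by this sketch.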

What's next ?

Manipulate the RTF file to make certain each bookmark is on a different page.
Then create the bookmarks and try again.

 :-*

This time it goes well.
I use the bookmark names to identify the split files.

What's next

PDFsam Basic works very well.
In general, I think this:
Splitting a Word file by bookmarks or strings will always be a difficult operation, and the results are unpredictable when inserting the new files into the master documents, especially regarding formatting...
Splitting a PDF file by content (text, bookmarks, strings, ...) is interpreted as splitting at the page that contains the string.
Commercial software specifically offers splitting a big file by recognizing the account number, to separate invoices...

The very expensive engines for combining documents usually use C#, Java, VB, etc. and Visual Studio. But I think they generate native documents rather than modify existing ones, which is my purpose.

So the simplest way, following Ath and Tomos, is to prepare the Word document a little first.

I haven't found an interactive bookmark generator for PDF files, only automatic ones, and depending on the target files the results may not be the expected ones.

I will try the new Action Wizard in Adobe Acrobat DC.

I have observed that if we want to mix documents, it is better to do it in PDF format. But usually, if we have a TOC or bookmarks from the Word file, in the end we lose everything except those from the last merge or combination.

So I need to renumber the final document and generate bookmarks for it.

This has been an exercise with one of the documents to be inserted.
I have to do this with about 10 documents...

Contro:
Now I need.....

Programmatically combine several PDF files into one.

Renumber all the pages of the resulting final PDF.

Create a TOC and bookmarks for the resulting PDF.

Set the final PDF to always open with the bookmarks panel active.

Running to try.

Any help welcome.
