Topic: Extracting paragraphs from a word file (doc, rtf, docx)  (Read 12043 times)

Contro

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 3,940
Extracting paragraphs from a word file (doc, rtf, docx)
« on: March 16, 2017, 06:36 AM »
I need a script to:

1) Store a set of strings to be found in a Word document (and the number of lines to collect for each one).
2) Find those strings in the Word document, select the entire associated paragraph (usually several lines) and save it to a CSV file.

Each string always begins in a unique way, but the rest of its content may vary from one case to another. That's why I need to scan the file.

I need an array of data.

Is it possible?

 :-*

It's like a collector of keys in a Word file.
We can edit the CSV file later and establish other connections...
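
A minimal sketch of the request above, assuming the document has already been converted to plain text lines. The function names, the sample lines and the "Circuit" prefix are hypothetical and only illustrate the idea of a prefix-to-line-count table.

```python
import csv
import io


def collect_blocks(lines, rules):
    """Return (prefix, block_text) pairs for every line that starts with a stored prefix.

    rules maps a prefix string to the number of lines to collect,
    counting the matching line itself.
    """
    rows = []
    for index, line in enumerate(lines):
        for prefix, count in rules.items():
            if line.startswith(prefix):
                block = lines[index:index + count]
                rows.append((prefix, "\n".join(block)))
    return rows


def write_csv(rows, handle):
    """Write the collected blocks as CSV; multi-line blocks are quoted by the csv module."""
    writer = csv.writer(handle)
    writer.writerow(["prefix", "text"])
    writer.writerows(rows)


# Hypothetical sample data, for illustration only.
sample = [
    "Circuit C1: lighting",
    "Power: 1200 W",
    "Length: 25 m",
    "Notes: none",
    "Circuit C2: sockets",
    "Power: 3000 W",
]
rules = {"Circuit": 2}

rows = collect_blocks(sample, rules)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The rules table is the "array" of string beginnings and line counts; it could just as well be read from a small text file. This only works on plain text, so a doc, docx or rtf file would have to be converted first.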




Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,612
Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #1 on: March 16, 2017, 08:10 AM »
Well, I've done a custom text processing app a couple of years back (in fact, exactly 5 years ago today I released version 1.0.0.0...): ScriptLineCounter

It is quite complex to grasp at first, and poorly documented, so do read the linked DC thread, but it may be of help/interest to you 8)

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #2 on: March 17, 2017, 09:17 AM »
I'll try it right away.

Does it work on Word doc documents or on txt files?
I know this is not an easy question, because every so often I come back and ask it again...

 :-\

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #3 on: March 17, 2017, 12:04 PM »
SLC handles txt, doc, docx, odt, pdf and csv files for its specific task (afair); maybe the type of filtering you need can be achieved by, or added to, it.

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #4 on: March 17, 2017, 02:04 PM »

I know! I had a look at the post you sent me. The filter should not extract all the lines, only the ones that match the text search. We'll talk about this. I am very interested, and anything you can do is welcome.

Best Regards


Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #5 on: March 22, 2017, 02:53 PM »
I'm back, Ath.

I'm attaching the file I need to collect paragraphs from. It's an RTF file: a technical document with electrical calculations, in other words an electrical report ("memoria").
My goal is to scan that file for the beginnings of certain strings. When one is found, complete the string to the end of the line and copy the number of lines we want to a CSV file.

I hope you can analyze the file and tell me whether it is possible to use your script for this purpose.

It is always this type of file. I need an array in which to write down the beginnings of the strings and the number of lines to collect from the RTF document.

 :-*

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #6 on: March 22, 2017, 02:55 PM »
Please tell me if you can see the attachment, a zip file containing the RTF.

I see nothing !!!!

tomos

  • Charter Member
  • Joined in 2006
  • ***
  • Posts: 11,959
Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #7 on: March 22, 2017, 05:13 PM »
^ no attachment there
Tom

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #8 on: March 23, 2017, 02:17 AM »
Contro wrote: "I see nothing !!!!"
Me too :o

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #9 on: March 23, 2017, 06:25 AM »
[Screenshot: AsusPortatil - 23_03_2017 , 11_29_02.png] :(

Surely it's my fault.
I have tried several compression formats for the RTF file, but none of them is accepted.



Edited: the attachthumb tag appears in the code before posting, and afterwards there is nothing!
I will try saving the file to the cloud and posting a link.


Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #10 on: March 23, 2017, 06:29 AM »
So I can insert a screenshot, but not the rar or the zip file.

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #11 on: March 23, 2017, 06:30 AM »
Contro wrote: "Trying with various formats of compressing the rtf file that is not accepted ."
Several alternative options:
    • Compress to a zip with a (simple) password
    • Use a different extension for the .rtf file like .artef and then compress into a zip file
And a screenshot is utterly unusable

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #12 on: March 23, 2017, 06:35 AM »
Contro wrote: "So I can insert a screenshot, but not the rar or the zip file."

Maybe you shouldn't try to display the .zip file in the message, only attach it to the message, by NOT clicking the 'insert attachment 1' link?

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #13 on: March 23, 2017, 06:36 AM »
Link to the RTF on Google Drive

https://drive.google...UE0/view?usp=sharing


Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #14 on: March 23, 2017, 06:43 AM »
I added the RTF in a zip file with the password ath




Edited : Don't go.

Edited : You are right . Password ath

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #15 on: March 23, 2017, 07:38 AM »
Got it.

And now to accurately describe what parts to select and extract and illustrate with examples, please.

Btw: This can't be handled by ScriptLineCounter, so I'll have to use another way to extract the desired content, either using common tools (grep/sed/awk, but they won't work with rtf directly) or a specific program I'll have to write.

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #16 on: March 24, 2017, 08:44 AM »
I hope to find the simplest way. As you know, that is not one of my strengths...
I think there may be a lot of ways to do this.
My goal is to mix blocks of text from the RTF file into a doc file.

I have attached the RTF file with the blocks of text I need to select or save for the next step...

I have used different colors where the blocks are contiguous...
 :P

I have found scripts to extract all the comments to a Word file,
or to extract all bookmarks to a new doc file.

I think this is possible, but not easy.

There are also links on the web to C# and other languages (mainly VB) for automating Word, inserting or extracting text in certain places...


https://www.codeproj...om-Microsoft-Word-Fi
https://msdn.microso...SPPError=-2147217396


I think this can be accomplished in several ways. Help me find the simplest one.

Best Regards




Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #17 on: March 24, 2017, 01:25 PM »
I think it may be enough to split the RTF file or the Word file by epigraphs, giving each split file a sequential name.

I know it is possible in Word to split a document by pages. Why not by epigraphs?
But I'm afraid "epigraph" is not the correct word in English.
I mean the organization of the text:
Index
1. Name
2. Calculations
3. Drawings
4. Tables
5. Budget
They are really the bookmarks. Split a Word file by its bookmarks.
 :-\

But in the RTF file they probably don't exist.


1.- MEMORIA DESCRIPTIVA
2.- MEMORIA JUSTIFICATIVA
3.- MÉTODOS DE INSTALACIÓN EMPLEADOS
4.- DEMANDA DE POTENCIA
5.- CUADROS RESUMEN POR CIRCUITOS
6.- CUADROS RESUMEN POR TRAMOS
7.- MEMORIA DETALLADA POR CIRCUITOS
8.- CUADROS RESUMEN DE PROTECCIONES
9.- LISTADO DE MATERIALES

These are the marks or bookmarks, the places where the Word document should be split.

I will then process some of these "components" so they can be inserted into the master document.
 :-*
« Last Edit: March 24, 2017, 01:35 PM by Contro »
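
A rough sketch of splitting by the numbered headings listed above, assuming the document is available as plain text and that every section starts with a line of the form "N.- TITLE". The function names and the file-naming pattern are hypothetical; tables and images would not survive this route.

```python
import re

# Matches headings such as "1.- MEMORIA DESCRIPTIVA" at the start of a line.
HEADING = re.compile(r"^(\d+)\.-\s*(.+?)\s*$")


def split_by_headings(text):
    """Group the lines of a plain-text document under their numbered headings.

    Lines that appear before the first heading are ignored.
    """
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = {
                "number": int(match.group(1)),
                "title": match.group(2),
                "lines": [],
            }
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return sections


def section_filename(section):
    """Build a sequential file name such as 01_memoria_descriptiva.txt."""
    slug = re.sub(r"\W+", "_", section["title"]).strip("_").lower()
    return f"{section['number']:02d}_{slug}.txt"


# Hypothetical body lines, for illustration only.
text = (
    "1.- MEMORIA DESCRIPTIVA\n"
    "first body line\n"
    "2.- MEMORIA JUSTIFICATIVA\n"
    "second body line\n"
    "more"
)
sections = split_by_headings(text)
for section in sections:
    print(section_filename(section), len(section["lines"]))
```

If the document also has an index that repeats the same headings, those lines would match too and produce empty sections, so the index would need to be skipped or removed first.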

Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #18 on: March 24, 2017, 04:17 PM »
Hm, so in fact the file needs to be split by paragraph, and selected by the paragraph-title.

Do the formulas (images) in the document need to be preserved? Because I don't think I can copy those to a csv file, as that doesn't have graphics support. And if you later need to re-assemble the snippets into a new document, they would probably need to be inserted again.
If this is the case, then a nice Word macro, with a matching specialist in creating that, would be a better choice instead of me (I'm kind of allergic to VB and VBS)

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #19 on: March 25, 2017, 05:47 AM »
In that last case, the idea of a CSV is abandoned.
We obtain several split files.
They are renamed according to some pattern I can use later.

If you decide to use the CSV, that is another option.
I can insert the images and some other lines into the master document and do the rest programmatically.
But I think we'll have the same problem with the tables...
So perhaps the best option is to split the Word file into several files by the "epigraphs".

You decide. I think there are a lot of possibilities.

Lintalist

  • Participant
  • Joined in 2015
  • *
  • Posts: 120
Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #20 on: March 25, 2017, 08:23 AM »
I haven't read it too closely, but just a suggestion: convert the RTF to a text file first; from there it is pretty easy, I'd say. Text = good :)

Suggestions:

1. Use an rtf-to-txt command-line tool (many are available, including pandoc, or search for rtf2txt and you'll find quite a few)

2. Or read the rtf file into the clipboard (in AHK I'd use the WinClip library) and then read the plain text format from memory (so you again have "text" to work with).
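
For the first suggestion, a small sketch of calling pandoc from a script, assuming an installed pandoc release that includes an RTF reader (not all versions do). The function names and the file names in the usage note are hypothetical.

```python
import subprocess


def build_pandoc_command(source, target):
    """Build the argument list for converting an RTF file to plain text."""
    return ["pandoc", "--from", "rtf", "--to", "plain", "--output", target, source]


def rtf_to_text(source, target):
    """Run pandoc; raises CalledProcessError if pandoc reports a failure
    and FileNotFoundError if pandoc is not installed."""
    subprocess.run(build_pandoc_command(source, target), check=True)


# Show the command that would be run, without running it.
print(" ".join(build_pandoc_command("input.rtf", "output.txt")))
```

Calling rtf_to_text("input.rtf", "output.txt") would then leave a text file that ordinary tools can search line by line. Formulas, images and table layout are not preserved in plain text.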


Ath

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #21 on: March 25, 2017, 08:56 AM »
Contro wrote: "You decide. I think there is a lot of possibilities."
Well, I'm not picking this one up, sorry, as it's not my cup of tea.
As Lintalist said "Text = good", that's quite easy to process, rtf/doc/pdf is doable if only the plain text parts need handling (like my ScriptLineCounter does), all other stuff is a PITA :tellme:

Any other takers?

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #22 on: March 25, 2017, 01:43 PM »
I am trying.
When done, I will post the answer here.
It does not seem to be the easy part...
 :-*

Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #23 on: March 25, 2017, 03:11 PM »
The Tomos method, simplified  :P

Manipulating text by splitting a Word file is almost impossible. XML files are difficult...
It seems easier to try with a PDF file.
So I converted the RTF file to PDF.
Then I can split it by pages.
But it seems better to try to split by bookmarks.
The original RTF has no bookmarks, though, and with Adobe I get a lot of useless bookmarks.
If I do it manually, I have to create fewer than 10 bookmarks.
So I did.

SourceForge lists several free tools that split PDFs by bookmarks, but all my attempts failed.
So I tried PDFsam Basic, which is open source.

I have 9 bookmarks but only got 7 split PDFs. What happened?
What PDFsam (and, I think, the others too) really does is split the PDF at the page that contains the bookmark. If a page has two bookmarks, we get only one PDF...

So splitting a PDF file by bookmark is really a split by ranges of pages. No text of any kind is actually extracted.
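
The page-range behaviour described above can be illustrated with a short sketch. It does not use PDFsam or any PDF library; it only shows, under that assumption, why bookmarks sharing a page give fewer output files. The bookmark titles and page numbers are made up, and keeping the first title per page is an arbitrary choice.

```python
def page_ranges(bookmarks, total_pages):
    """Turn (title, page) bookmarks into (title, first_page, last_page) ranges.

    Pages are 1-based. Bookmarks that share a page collapse into a single
    range, and pages before the first bookmark are not covered.
    """
    starts = {}
    # sorted() is stable, so for equal pages the first title in the input wins.
    for title, page in sorted(bookmarks, key=lambda item: item[1]):
        starts.setdefault(page, title)
    pages = sorted(starts)
    ranges = []
    for position, first in enumerate(pages):
        if position + 1 < len(pages):
            last = pages[position + 1] - 1
        else:
            last = total_pages
        ranges.append((starts[first], first, last))
    return ranges


# Hypothetical bookmarks: B and C sit on the same page.
bookmarks = [("A", 1), ("B", 3), ("C", 3), ("D", 6)]
ranges = page_ranges(bookmarks, total_pages=8)
print(ranges)
```

Four bookmarks give only three ranges here, which is the same effect as 9 bookmarks giving 7 files when some bookmarks share a page.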

What's next?

Edit the RTF file to make certain each bookmark is on a different page.
Create the bookmarks and try again.

 :-*

This time it went well.
I used the bookmark names to identify the split files.

[Screenshot: AsusPortatil - 25_03_2017 , 20_01_12.png]

What's next

PDFsam Basic works very well.
In general, I think this:
Splitting a Word file by bookmarks or strings will always be a difficult operation, and the results are unpredictable when the new files are inserted into the master documents, especially regarding formatting...
Splitting a PDF file by content (text, bookmarks, strings, ...) is interpreted as splitting at the page that contains the string.
Commercial software mostly offers splitting a big file by recognizing the account number, to separate invoices...

The very expensive engines for combining documents usually use C#, Java, VB, etc. and Visual Studio. But I think they generate native documents rather than modify existing ones, which is my purpose.

So the simplest Ath/Tomos way is to prepare the Word document a little.

I haven't found an interactive bookmark generator for PDF files, only automatic ones, and depending on the target files the results may not be the expected ones.

I will try the new Action Wizard in Adobe Acrobat DC.

I have observed that if we want to mix documents, it is better to do it in PDF format. But usually, if we have a TOC or bookmarks from the Word file, in the end we'll lose everything except those from the last merge or combination.

So I need to renumber the final document and generate bookmarks for it.

This has been an exercise with one of the documents to be inserted.
I have to do this with about 10 documents...


Contro

Re: Extracting paragraphs from a word file (doc, rtf, docx)
« Reply #24 on: March 25, 2017, 03:29 PM »
Now I need...

To programmatically combine several PDF files into one.

To renumber all the pages of the resulting final PDF.

To create a TOC and bookmarks for the resulting PDF.

To set the final PDF to always open with the bookmarks panel active.

I'll try it right away.

Any help is welcome.
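
One part of this list can be sketched without any PDF tool: working out on which page each document starts once the files are combined in order, which is the information a TOC or a set of bookmarks needs. The titles and page counts below are hypothetical, and the merging itself would still need a PDF library or a tool such as PDFsam, which is not shown.

```python
def build_toc(documents):
    """Return ((title, start_page) entries, total_pages) for documents merged in order.

    documents is a list of (title, page_count) pairs; pages are 1-based.
    """
    toc = []
    next_page = 1
    for title, page_count in documents:
        toc.append((title, next_page))
        next_page += page_count
    return toc, next_page - 1


# Hypothetical documents and page counts, for illustration only.
documents = [
    ("Memoria descriptiva", 4),
    ("Cuadros resumen", 10),
    ("Listado de materiales", 3),
]
toc, total_pages = build_toc(documents)
for title, page in toc:
    print(f"{title} ... {page}")
print("Total pages:", total_pages)
```

The same start pages can be used both for a printed TOC page and for the bookmark targets in the final file.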