Topic: Finished Coding Snack: Separate .txt file at any specified punctuation mark  (Read 42427 times)

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
I need to be able to separate text at any specified punctuation mark onto its own line and then insert a blank line between it and the next separation.  If this can be done simply, so much the better (I only use a computer; I have no idea how they do what they do).  I assume this can be done via something like the find & replace box.  I would be happy to make a decent donation if anyone can devise a way of doing this (and tell me how to use it).

Quidnunc  
« Last Edit: November 28, 2011, 09:03 PM by mouser »

rjbull

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 3,205
    • View Profile
    • Donate to Member
Could you give a before-and-after example of what you mean, with dummy text if necessary?

skwire

  • Global Moderator
  • Joined in 2005
  • *****
  • Posts: 5,287
    • View Profile
    • Donate to Member
Also, which text editor are you using?

AbteriX

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 1,149
    • View Profile
    • Donate to Member
You can do that with a regular expression, for example:

FROM:
I need to be able to separate text at any specified punctuation mark onto it's own line and then insert a blank line between it and the next separation.  If this can be done simply (I only use a computer, I have no idea of how they do what they do).  I assume this can be done via something like the find & replace box.  I would be happy to make a decent donation if anyone can devise a way of doing this (and tell me how to use it).

TO:
I need to be able to separate text at any specified punctuation mark onto it's own line and then insert a blank line between it and the next separation.
If this can be done simply (I only use a computer,
I have no idea of how they do what they do).
I assume this can be done via something like the find & replace box.
I would be happy to make a decent donation if anyone can devise a way of doing this (and tell me how to use it).

USE:
[X] Regular Expression

Search for: (.+?)(\.|,)\s*

With EmEditor replace with \1\2\n
With HippoEDIT replace with $1$2\n


Explanation:
(.+?) means: match one or more of any character, non-greedy, and store that in group no. 1
(\.|,) means: match a literal dot "\." OR a comma, and store that in group no. 2
\s* means: match zero or more whitespace characters; we don't store that match but drop them (if any)

Then we replace with what is in group 1 by using \1 or $1; that's the matched sentence.
Then we replace with what is in group 2 by using \2 or $2; that's the matched punctuation mark.
Then we add a line break or two by using \n

Please note that this may not work like that with all editors. The regex implementation is slightly different between different editors.


To get what you want:
I need to be able to separate text at any specified punctuation mark onto it's own line and then insert a blank line between it and the next separation.

If this can be done simply (I only use a computer,

I have no idea of how they do what they do).

I assume this can be done via something like the find & replace box.

I would be happy to make a decent donation if anyone can devise a way of doing this (and tell me how to use it).


Just use two '\n' in the replacement, e.g. \1\2\n\n
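For anyone who wants to try the search-and-replace outside an editor, here is a minimal Python sketch of the same pattern. The function name `split_at_marks` and the `blank_line` switch are made up for illustration; Python's `re` module happens to use the \1 and \2 style of back-reference.

```python
import re

# AbteriX's pattern: lazily capture text, then a dot or a comma,
# then swallow any whitespace that follows the mark.
PATTERN = re.compile(r"(.+?)(\.|,)\s*")

def split_at_marks(text, blank_line=False):
    """Put each matched unit on its own line (hypothetical helper)."""
    # One "\n" ends the line; a second one leaves a blank line between units.
    newline = "\n\n" if blank_line else "\n"
    return PATTERN.sub(r"\1\2" + newline, text)

sample = "If this can be done simply (I only use a computer, I have no idea). I assume so."
print(split_at_marks(sample))
print(split_at_marks(sample, blank_line=True))
```

Text after the last dot or comma is left untouched, because the pattern only matches up to a mark.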




HTH?  :D

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Hi
Thanks for the replies.
AbteriX: You have suggested using Regular Expressions; however, I only have the vaguest idea as to what a regular expression is and how it works. To use it I would guess I would have to have some software installed (you mention EmEditor & HippoEDIT, which no doubt I would have to buy and learn how to use).

RJBull: Here is an example.
Before
A breeze ruffled the neat hedges of Privet Drive, which lay silent and tidy under the inky sky, the very last place you would expect astonishing things to happen. Harry Potter rolled over inside his blankets without waking up. One small hand closed on the letter beside him and he slept on, not knowing he was special, not knowing he was famous, not knowing he would be woken in a few hours' time by Mrs Dursley's scream as she opened the front door to put out the milk bottles, nor that he would spend the next few weeks being prodded and pinched by his cousin Dudley ...He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: To Harry Potter - the boy who lived!'

After
A breeze ruffled the neat hedges of Privet Drive, which lay silent and tidy under the inky sky, the very last place you would expect astonishing things to happen.
 
Harry Potter rolled over inside his blankets without waking up.

I might, on occasion, also need to chop this next long sentence into meaningful grammatical units at each comma.
 
One small hand closed on the letter beside him and he slept on, not knowing he was special, not knowing he was famous, not knowing he would be woken in a few hours' time by Mrs Dursley's scream as she opened the front door to put out the milk bottles, nor that he would spend the next few weeks being prodded and pinched by his cousin Dudley ...
He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: To Harry Potter - the boy who lived!'

I hope that explains what I want to do.

skwire: At the moment I use a concordancer. I also use an online text analyser for counting sentences and other statistical info, and (from this website) a line numberer, all of which are very useful. Before I use these tools I would like to prepare the text by splitting it into individual sentences (or clauses), each on its own line, and, if at all possible, to number them.




timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
It would be extremely easy to roll a tiny program to read a .txt file, split according to AbteriX's regex rules and spit out a new version with the line breaks.

At the moment if your text contained (say) numbers with decimal points, they would also be split by that expression. But it would not be difficult to cater for that, along with perhaps splitting on ?, ! etc.
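A rough Python sketch of how the decimal-point problem might be handled, while also splitting on ? and !. The rule used here (a mark directly followed by a digit belongs to a number) is an assumption for illustration, not necessarily what the finished program does, and `split_keep_numbers` is a made-up name.

```python
import re

# Assumption: a mark directly followed by a digit (3.14, 1,000) is part
# of a number, so the negative lookahead (?!\d) refuses to split there.
# A run of marks such as "..." is kept together by the "+".
SPLIT = re.compile(r"(.+?)([.,?!]+)(?!\d)\s*")

def split_keep_numbers(text):
    """Break after . , ? ! but leave decimals and thousands separators alone."""
    return SPLIT.sub(r"\1\2" + "\n", text)

print(split_keep_numbers("Pi is 3.14, roughly. Really? Yes!"))
```

The lookahead is a simple heuristic; a sentence that ends just before a line starting with a digit and no space would not be split.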
« Last Edit: November 25, 2011, 11:51 AM by timns »

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Hi Timns

By a small program, do you mean something like a dialogue box in which I could paste some text and have it do what I ask?  If so, that would be wonderful.  I know this will sound terrible and I don't want to seem unappreciative, but the less work I have to do the better because, as already stated, I only use a computer; I have very little understanding of the way they work or the associated language.

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
That's exactly what I mean. The heavyweight alternative is that you can already do this in Word and many text editors, but if you want a small snack-sized solution then you shall have one - it's what DC is all about!

Unless anyone else steps up before the weekend, I'll have a go at this. If anyone else DOES decide to have a go, please let me know   :Thmbsup:

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Hi again

I will look forward to seeing this and will be forever grateful to you, and a donation will be made thereafter.  Some members might wonder why I want such a tool; well, I spend a lot of time analysing texts for meaning and understanding (as distinct from just reading them).  I am currently looking at 'Harry Potter and the Philosopher's Stone', building a time-line, discovering inconsistencies etc.

Quidnunc

AbteriX

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 1,149
    • View Profile
    • Donate to Member
Quote from Quidnunc: "AbteriX: You have suggested using Regular Expressions, however, I only have the vaguest idea as to what a regular expression is and how it works."
Just google for 'regular expression wiki'.

Quote from Quidnunc: "To use it I would guess I would have to have some software installed (you mention EmEditor & HippoEDIT which no doubt I would have to buy and learn how to use)."
Not necessary; PSPad can do it too and is freeware.

But a dedicated app would be much more nifty, of course.

BTW, our DonationCoder "Clipboard Help and Spell" can do it as well:
(here with an additional feature idea: 'split long lines without punctuation too at char number n' )

[Screenshot: Clipboard Help and Spell - RegEx use2b.png]

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
AbteriX

As you will no doubt know by now, timns has kindly offered to write a small program to do exactly what I need based on the information I have given (as opposed to using an existing program which has far more features than I need, want or understand).

However I thank you for your time and responses, it is indeed heartening to find that there are people willing and able to devote their time and energies helping a stranger solve a problem.

Quidnunc   

rjbull

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 3,205
    • View Profile
    • Donate to Member
@Quidnunc: If you don't know it already, you might like to check out TextSTAT - Simple Text Analysis Tool (freeware):
Concordance software for Windows, GNU/Linux and MacOS

TextSTAT is a simple programme for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus. The new news-reader, too, puts news messages in a TextSTAT-readable corpus file.
TextSTAT reads MS Word and OpenOffice files. No conversion needed, just add the files to your corpus...
In TextSTAT you can use regular expression which provides you with powerful search possibilities. The programme is multilingual. Because it uses Unicode internally, TextSTAT can cope with many different languages and file encodings.

@timns: are you thinking of making checkboxes for each punctuation mark, so Quidnunc can build regular expressions without having to understand them?

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
I'll definitely have some way of controlling which characters cause a line-break. There are rather a lot of candidates when one starts looking: '.,?!"-:; and of course ♫ 
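As a sketch of what "controlling which characters cause a line-break" could look like in code, the snippet below builds the pattern from whatever marks the user types in. It is only an illustration in Python; the names `build_splitter` and `split_text` are invented, and this is not the actual program's source.

```python
import re

def build_splitter(marks):
    """Compile a pattern that breaks after any character in `marks`."""
    # re.escape keeps characters such as '-' or '?' literal inside the class.
    char_class = "[" + "".join(re.escape(ch) for ch in marks) + "]"
    return re.compile(r"(.+?)(" + char_class + r")\s*")

def split_text(text, marks, separator="\n\n"):
    """Insert `separator` after each chosen mark; no marks means no change."""
    if not marks:
        return text
    pattern = build_splitter(marks)
    # A function replacement avoids the separator being read as a regex escape.
    return pattern.sub(lambda m: m.group(1) + m.group(2) + separator, text)

print(split_text("Wait: what? Oh; fine", ":?;"))
```

Because the marks are escaped one by one, the user can type any mix of characters, including the ♫ mentioned above, without knowing anything about regular expressions.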

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
How's this:

[Screenshot: screenshot0048_2011-11-26.png]

You just have to enter the punctuation or other characters at the top, and dump your text into the upper pane. Clicking go produces output as shown.

I added some convenience buttons for clearing, copying and pasting.

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
Later I'll put this in my coding area, with a proper little write-up, but for now this URL:

http://head-in-the-c...m/game/Scissors.jnlp

... will launch the program for you. You may get a warning that it's signed by an unknown publisher. That's ok, it's me  :Thmbsup:

Instructions:

1. Select and copy the text that is to be re-formatted so it's in your PC's (or Mac or Linux box) clipboard
2. Click the 'Paste' button or use Ctrl+V in the top window
3. Click the 'Go' button
3a. Go "ooh" and "aah" in admiration (optional step)
4. Select a subset of the results, or leave everything unselected (which gets you all text in the results area), and click the 'Copy' button - your formatted text is now available on the clipboard to do whatever-it-is-you-need-to-do with it.

[Screenshot: screenshot0050_2011-11-26.png]
« Last Edit: November 26, 2011, 03:12 PM by timns »

cranioscopical

  • Friend of the Site
  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 4,776
    • View Profile
    • Donate to Member
Nice work, timns, and at that price it's a snip!
 

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Timns, you are a wonderful person and so clever too!  Your program is GREAT and does exactly what I wanted and will save me loads of time.  To show my appreciation, both to you and to the others whose efforts I have taken advantage of, I have made a donation.  I'm not sure how it affects you personally but I hope you get some benefit from it.

My grateful thanks to all those in general who do this work and to you in particular.

Quidnunc

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
It's fun to read threads like this!

ps. In case you haven't figured it out yet, you can click on the gold coin under timns's name to send him some of your donation -- and I encourage you to do so.

anandcoral

  • Honorary Member
  • Joined in 2009
  • **
  • Posts: 783
    • View Profile
    • Free Portable Apps
    • Donate to Member
This is what DC is for  :D

Thanks @Timns for keeping the DC flag high  :Thmbsup:

Regards,

Anand

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
Ah ha! Well that's excellent news and I greatly appreciate the feedback and kind comments...

... and the generous donation!  Thank you kindly :Thmbsup:

If you need any tweaks or twiddles just let me know.

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Timns

If you are happy to do a tweak, then could I ask for this to be included:

After splitting the text into sentences etc., to:
1) number each complete unit (a unit being the separated text, even if it wraps to another line), i.e. only complete units are to be numbered, not wrapped lines.
2) put a space between the number and the start of the text.


Nit-picking I know but if it is possible without too much trouble then it would be even better than it already is.

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
Extremely simple to do, in fact, so no problem.

Literally like this?

1 blah blah blah blah blah,

2 blah BLAH blah blah-blah.

3 bla blah blah Blahhhhhh blah <long line>
blah blah blah.

4 and blah?

How would you like the numbers formatted?

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member
Yes, that is exactly how I would like it to appear but I'm not sure what you mean by 'formatting the numbers'.  Do you mean just as numbers-1 2 3....or in some other way-e.g. Roman numerals?  If so just plain numbers will be fine.

timns

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,211
    • View Profile
    • Donate to Member
Some examples, all of which would be equally easy:

1 sentence
1. sentence
1/ sentence
(1) sentence
001 sentence
001 <tab character> sentence

etc. etc.
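To make the numbering styles listed above concrete, here is a small Python sketch. The style names in `FORMATS` and the function `number_units` are hypothetical labels chosen for this example; each entry corresponds to one of the formats in the list.

```python
# Hypothetical style names mapped to str.format templates; {n} is the unit number.
FORMATS = {
    "plain": "{n} ",           # 1 sentence
    "dot": "{n}. ",            # 1. sentence
    "slash": "{n}/ ",          # 1/ sentence
    "paren": "({n}) ",         # (1) sentence
    "padded": "{n:03d} ",      # 001 sentence
    "padded_tab": "{n:03d}\t", # 001 <tab character> sentence
}

def number_units(units, style="plain"):
    """Prefix each complete unit with its number in the chosen style."""
    template = FORMATS[style]
    return [template.format(n=i) + unit for i, unit in enumerate(units, start=1)]

units = ["blah blah blah,", "blah BLAH blah.", "and blah?"]
print("\n\n".join(number_units(units, style="dot")))
```

The numbers are attached to the units themselves rather than to display lines, so a long unit that wraps on screen still gets only one number.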

Quidnunc

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 16
    • View Profile
    • Donate to Member

001. sentence would be great (as there could be thousands of sentences, scope to accommodate this, say up to five figures, would be good).

Trevor
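One way the requested "001." style could leave room for larger texts is to pad to at least three digits and widen automatically when there are more units. This is only a Python sketch of that idea; `number_with_padding` and `min_width` are made-up names, and the finished program may well do it differently.

```python
def number_with_padding(units, min_width=3):
    """Number units as 001., 002., ... widening the padding when needed."""
    # Use at least min_width digits, or as many as the highest number needs.
    width = max(min_width, len(str(len(units))))
    return [f"{i:0{width}d}. {unit}" for i, unit in enumerate(units, start=1)]

print("\n\n".join(number_with_padding(["First sentence.", "Second sentence."])))
```

With fewer than 1,000 units the numbers stay three digits wide; a text of, say, 12,000 units would be numbered from 00001 upwards, which covers the five figures asked for.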