
### Author Topic: DONE: Delete double lines (all but the first) in a text file  (Read 23285 times)

• Participant
• Joined in 2006
• Posts: 1
##### DONE: Delete double lines (all but the first) in a text file
« on: March 06, 2006, 12:43 PM »
Hello,

Sorry for my poor English.

I am looking for an application that can delete duplicate entries in a large text file. I only found a macro for UltraEdit, but it hangs if the file is larger than 1 MB. I am sure such an app already exists, but I couldn't find it with Google; I only found other people looking for the same kind of software. Sometimes I code little things in VBS, but I am an absolute beginner. I know I have to create two further files:

File 1: already available master file
File 2: temporary file
File 3: results file

cut the first (non-empty) line from file 1 and paste it into file 2
delete all lines in file 1 that are equal to this line
cut line 1 from file 2 and paste it into file 3
etc. etc.
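
The three-file procedure above boils down to keeping the first occurrence of each line and dropping every later repeat. A minimal Python sketch of that idea, assuming the set of distinct lines fits in memory. The name `dedupe_lines` is made up for illustration, and this is not one of the scripts posted in this thread:

```python
def dedupe_lines(lines):
    """Yield each line the first time it appears, preserving order."""
    seen = set()  # lines that have already been passed through
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    sample = ["line one\n", "line one\n", "line two\n", "line one\n"]
    print("".join(dedupe_lines(sample)), end="")
```

Because the function only iterates, it can be fed an open file directly, for example `dst.writelines(dedupe_lines(src))` with `src` and `dst` being open text files, so no temporary file is needed.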

I would appreciate some help.

Many thanks

chrisi
« Last Edit: March 17, 2006, 01:48 PM by brotherS »

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #1 on: March 06, 2006, 01:10 PM »
I made a script similar to this for a recent request.
I'll modify it and post it here so you can test it.
But I have to warn you: it'll be slow, and it'll be limited to a maximum of 64 MB of text.
Anyway, I'll give it a go.
(It's an AHK script. If it were written in C, I'm sure it'd be about a million times faster, but I don't remember C too well, and I can't compile it for Windows.)

#### skrommel

• Fastest code in the west
• Developer
• Joined in 2005
• Posts: 927
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #2 on: March 06, 2006, 05:03 PM »
I really should start reading the whole posts!

Here's one, but it sorts the file, and it's limited to 1 GB.

Skrommel

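;DelDuplicates.ahk
; Removes duplicate lines from a text file
;Skrommel @2006

infile=C:\Temp\in.txt
outfile=C:\Temp\out.txt

#MaxMem 1024
SetBatchLines,-1
; Read the whole input file into a variable
FileRead,file,%infile%
If ErrorLevel=0
{
  ; Sort the lines and drop duplicates (U = unique); the original line order is lost
  Sort,file,U
  FileDelete,%outfile%
  FileAppend,%file%,%outfile%
  file=
}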

Try this one!

Skrommel

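;DelDouble.ahk
; Removes double lines from text files
;Skrommel @2006

fromfile=C:\Temp\in.txt
tofile=C:\Temp\out.txt

SetBatchLines,-1
FileDelete,%tofile%
prevline=
; Assumed loop: skips a line that equals the previous one, so only adjacent doubles are removed
Loop,Read,%fromfile%
{
  If A_LoopReadLine=%prevline%
    Continue
  FileAppend,%A_LoopReadLine%`n,%tofile%
  prevline=%A_LoopReadLine%
}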
« Last Edit: March 06, 2006, 06:21 PM by skrommel »

#### PhilKC

• Charter Member
• Joined in 2005
• Posts: 117
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #3 on: March 06, 2006, 05:23 PM »
Pseudo code:

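// needs: using System; using System.Collections; using System.IO;
StreamReader reader = new StreamReader(inputFile);
string[] lines = reader.ReadToEnd().Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
reader.Close();
ArrayList checker = new ArrayList();
// keep only the first occurrence of each line
for (int i=0;i<lines.Length;i++)
  if (!checker.Contains(lines[i]))
    checker.Add(lines[i]);
StreamWriter writer = new StreamWriter(outputFile);
for (int i=0;i<checker.Count;i++)
  writer.WriteLine(checker[i]);
writer.Close();

That was from memory, so I have no idea if it would compile... (It's C#)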

PhilKC

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #4 on: March 06, 2006, 05:57 PM »
Here's the modified version I mentioned.
Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.
It took 1 minute 45 seconds to find repeated entries in a 9000-line file on my Centrino 2.0 laptop.
Still, it does solve your problem.
It doesn't alter the initial file, but the file it creates doesn't have the repeated entries.
It has a small bug: the progress bar doesn't correspond to the truth. Towards the end of the file, it's way faster than at the beginning. Just a heads-up, in case you start thinking about giving up at the beginning.
It is supposed to be able to handle 64 MB of plain text, according to the AHK reference.

(BTW: the .ahk file needs AutoHotkey to run, and the .exe file only accepts a file called "textfile.txt" as input, and only outputs to a file called "out.txt". Both are in the attached compressed file.)

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:48 PM by jgpaiva »

#### TWmailrec

• Charter Member
• Joined in 2005
• Posts: 128
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #5 on: March 13, 2006, 07:10 PM »
Re: IDEA: Delete double Lines (all but the first) in a Text-File

The solution from jgpaiva (RepeatedEntries.ahk) solves a problem I had, but can it be modified to ignore blank lines (CR only, to aid intelligibility)?

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #6 on: March 14, 2006, 06:32 PM »
Here is a new version that checks for blank lines.
Note: a line that only has spaces or tabs is considered a blank line. I hope this is what you were asking for.

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:48 PM by jgpaiva »

#### TWmailrec

• Charter Member
• Joined in 2005
• Posts: 128
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #7 on: March 15, 2006, 10:53 PM »
Many thanks to jgpaiva for the new program mod.
The repeated strings msgbox now works well, but the output file
did not copy blank lines.
Is there any way to replicate the blank lines in the output file?
I'm new to the AutoHotkey language and can't cope with loops.

TWmailrec

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #8 on: March 16, 2006, 06:39 AM »
EhEh TW..
There are a few "return"s missing, though.
I didn't get what you meant. Do you mean the problem was only in the message box?
You only wanted the msgbox fixed, while still keeping the blank lines in the file?

#### Gerome

• Charter Honorary Member
• Joined in 2006
• Posts: 154
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #9 on: March 16, 2006, 01:42 PM »
Yo !
Quote
Here's the modified version I mentioned.
Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.
It took 1 minute 45 seconds to find repeated entries in a 9000-line file on my Centrino 2.0 laptop.
Still, it does solve your problem.
It doesn't alter the initial file, but the file it creates doesn't have the repeated entries.
It has a small bug: the progress bar doesn't correspond to the truth. Towards the end of the file, it's way faster than at the beginning. Just a heads-up, in case you start thinking about giving up at the beginning.
It is supposed to be able to handle 64 MB of plain text, according to the AHK reference.
(BTW: the .ahk file needs AutoHotkey to run, and the .exe file only accepts a file called "textfile.txt" as input, and only outputs to a file called "out.txt". Both are in the attached compressed file.)

I took your script sources and copied them onto themselves 2520 times: it gave a 3.2 MB text file...
I tested your script under Win2k SP4 with 256 MB RAM and no other program running, and after 1 hour it had only found 50% of the duplicates...
There were only 168,840 lines... and it took 35 MB of RAM trying to aggregate...

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #10 on: March 16, 2006, 01:57 PM »
Quote
Its algorithm is quite good, but AHK is a scripting language, so it takes more time than C, for sure.

I do know that it can't handle a big input file.
And I also know that the same program in C would take about 10 seconds to solve that problem.
Implemented with a hash table in C, it would probably take even less.
And I also know how to implement it in C; I even have the code, because I did it for school.
I could even use the solution that PhilKC presented.
If I wanted to do that search in an efficient way, I'd code it in C and use it under Linux.
My problem is that I've never compiled any C program on Windows, and the original post in this thread asked for something that I thought AHK could solve.
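
To make the hash table point concrete: checking each line against a list of everything kept so far costs a number of comparisons that grows roughly with the square of the number of distinct lines, while a hash-based set needs about one lookup per line. A rough Python sketch of the difference, not the C code mentioned above; both function names are made up, and the list version simply counts the string comparisons it performs:

```python
def dedupe_with_list(lines):
    """Keep first occurrences; membership is tested by scanning a list."""
    kept = []
    comparisons = 0
    for line in lines:
        found = False
        for earlier in kept:
            comparisons += 1
            if earlier == line:
                found = True
                break
        if not found:
            kept.append(line)
    return kept, comparisons


def dedupe_with_set(lines):
    """Keep first occurrences; membership is tested with a hash lookup."""
    seen = set()
    kept = []
    lookups = 0
    for line in lines:
        lookups += 1
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept, lookups


if __name__ == "__main__":
    # 9000 lines with 3000 distinct values, an arbitrary test input
    sample = [f"line {i % 3000}" for i in range(9000)]
    by_list, list_cost = dedupe_with_list(sample)
    by_set, set_cost = dedupe_with_set(sample)
    print(by_list == by_set, list_cost, set_cost)
```

Both functions return the same lines; only the amount of work differs, and the gap widens as the file grows.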

Quote
Many thanks to jgpaiva for the new program mod.

And it did.
The point here is that no one else presented a better solution. I presented mine.
I did the best I can do on Windows. It sure does run faster than any other executable presented in this thread.

#### Gerome

• Charter Honorary Member
• Joined in 2006
• Posts: 154
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #11 on: March 16, 2006, 02:04 PM »
Hey!

If you did it under linux, you can then compile it the same way under windows.
Simply compile with GCC and it'll work same way...

#### TWmailrec

• Charter Member
• Joined in 2005
• Posts: 128
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #12 on: March 16, 2006, 02:31 PM »
To jgpaiva from TW

No, what I meant was the message box now works fine,
but the blank lines are still stripped out of the output file.
I was hoping to preserve the blank lines in the output file (repeated or not). I can't see why they are stripped out, but they are.

TW

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double Lines (all but the first) in a Text-File
« Reply #13 on: March 16, 2006, 02:33 PM »
Quote
To jgpaiva from TW
No, what I meant was the message box now works fine,
but the blank lines are still stripped out of the output file.
I was hoping to preserve the blank lines in the output file (repeated or not). I can't see why they are stripped out, but they are.

They are because I thought that was what you wanted.
I'll make it copy blank lines.

@Gerome: I can't install GCC under Windows, only through Cygwin, and that's not worth the effort.

#### Gerome

• Charter Honorary Member
• Joined in 2006
• Posts: 154
##### Re: IDEA: Delete double lines (all but the first) in a text file
« Reply #14 on: March 16, 2006, 02:41 PM »
Hi,

Quote
@Gerome I can't install gcc under windows, only through cygwin, and that's not worth the effort..

????????????
Install MinGW or the like: DevCPP does this for you excellently...

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: IDEA: Delete double lines (all but the first) in a text file
« Reply #15 on: March 16, 2006, 02:51 PM »
Quote
Install MinGW or alike : DevCPP does this for you excellently...

That's good, I'll give it a go next time I need to compile C. But for now, and for the next 4 months, I only see Lisp, POV-Ray, VRML and Java.
Maybe next semester. But thanks for the pointer, it'll surely be useful!

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #16 on: March 20, 2006, 02:27 PM »
OK, now I got around to updating this script.
The blank lines bug is fixed, and the msgbox is also right. I think it's now as you requested, TW.
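
For reference, the blank-line behaviour settled on here (a line holding only spaces or tabs counts as blank, and blank lines are copied through instead of being treated as duplicates) can be sketched in Python roughly as follows. This is an illustration under those assumptions, not a port of RepeatedEntries.ahk; the function name and the `keep_blank` parameter are made up:

```python
def dedupe_keep_blank(lines, keep_blank=True):
    """Drop repeated lines; blank lines are either all kept or all dropped."""
    seen = set()
    result = []
    for line in lines:
        if line.strip() == "":  # whitespace only counts as blank
            if keep_blank:
                result.append(line)
            continue
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


if __name__ == "__main__":
    sample = ["line one", "line one", "", "line two", " \t", "line one"]
    print(dedupe_keep_blank(sample))
    print(dedupe_keep_blank(sample, keep_blank=False))
```

With `keep_blank=False` the same function gives the earlier behaviour, where blank lines were stripped from the output.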

.exe version
.ahk version
« Last Edit: May 02, 2006, 04:49 PM by jgpaiva »

#### TWmailrec

• Charter Member
• Joined in 2005
• Posts: 128
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #17 on: March 21, 2006, 06:44 PM »
To jgpaiva Charter Member, RepeatedEntries.ahk

Re: DONE: Delete double lines (all but the first) in a text file

Program is now perfect!
(I added back in the %1% check for drag
and drop and command line parameter.)

Many thanks
TW

#### lanux128

• Global Moderator
• Joined in 2005
• Posts: 6,260
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #18 on: April 05, 2006, 12:37 AM »
jgpaiva & TWmailrec:
I liked your script so much that I modified it a bit for my own use and added a GUI...

But I'm not too clear on why your two scripts differ... e.g. jgpaiva's script adds a linefeed to the end of every line, while TWmailrec's deletes all empty lines. Is this on purpose?

In any case, here's the code and a screenshot of the GUI.

the modified code
; Date: Apr. 03, 2006
#Persistent
#SingleInstance force
SetBatchLines,-1
Title=Delete Duplicate Lines

GoSub, ShowMain
Return

ShowMain:
Gui, Add, GroupBox, x6 y6 w360 h172, %Title%
Gui, Font, s8 CDefault, Tahoma
Gui, Add, Text, x16 y25 w180 h20, Original File:
Gui, Add, Button, x326 y45 w30 h20 gSelectFile, ...
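; Edit box for the input path (assumed control: position and size are guesses)
If File =
  Gui, Add, Edit, x16 y45 w300 h20 vFile
Else
  GuiControl,, File, %file%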
;---0---
Gui, Add, Text, x16 y80 w180 h20, Output File:  ;Must not be the same...
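; Edit box for the output path (assumed control: position and size are guesses)
If FileOut =
  Gui, Add, Edit, x16 y100 w340 h20 vFileOut
Else
  GuiControl,, FileOut, %FileOut%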
Gui, Add, Button, x100 y140 w75 h25 gProcess, Process
Gui, Add, Button, x200 y140 w75 h25, Quit
Gui, Show, x270 y110 h185 w375, %Title%
Return

SelectFile:
FileSelectFile, File, 1, %A_MyDocuments%, Select text-file for processing, Text Files (*.csv; *.txt)
If File =   ;user presses Cancel...
Return
GuiControl,, File, %file%
SplitPath, File,CurFile,CurFolder,CurExt,CurFileNoExt,
FileOut=%CurFolder%\%CurFileNoExt%_after.%CurExt%
GuiControl,, FileOut, %FileOut%
Return

Process:
If File =
Return
filetowrite=%FileOut%
;To add check-box option, to overwrite existing output file?
;IfExist,%filetowrite%
;  {
;   FileDelete,%filetowrite%
;  }
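; Load the input file, then split it into lines (`n is the delimiter, `r is stripped)
FileRead,CompleteFile,%File%
StringSplit,index,CompleteFile,`n,`r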
found=
count:=index0
count2:=count
GoSub,CreateGui2
ProgressFlag:=false
loop,%count%
{
GuiControl,2:,bar,%A_Index%
If ProgressFlag
break
position:=A_Index
Word:=index%position%
If word is space
{
FileAppend,%word%`n,%filetowrite%
continue
}
IfInString,found,%Word%
continue
count2-=1
loop, %count2%
{
position2:=position+a_index
Word2:=index%position2%
if Word=%word2%
found=%found% %Word% ,
}
Fileappend,%Word%`n,%filetowrite%
}
if found=
{
Msgbox,, %Title%, No duplicate lines were found.
GoSub, 2GuiEscape
}
else
{
StringTrimRight,found2,found,2
Msgbox,, %Title%, The following strings were repeated: %found2%
GoSub, 2GuiEscape
}

Return

CreateGui2:
Gui, 2:Add,Text,,Now checking for duplicate entries. Press esc to skip.
Gui, 2:Add, Progress,vbar w300 h20 -smooth Range0-%count%,
Gui, 2:Show, ,%Title%
Return

2GuiClose:
GoSub, ShowMain
;exitapp

2GuiEscape:
Gui, 2:destroy
ProgressFlag:=true
GoSub, ShowMain
Return

ButtonQuit:
GuiEscape:
GuiClose:
ExitApp

[Screenshot: the Delete Duplicate Lines GUI]

« Last Edit: April 05, 2006, 01:34 AM by brotherS »

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #19 on: April 05, 2006, 04:43 AM »
Quote
jgpaiva & TWmailrec:
I liked your script so much that I modified it a bit for my own use and added a GUI...
But I'm not too clear on why your two scripts differ... e.g. jgpaiva's script adds a linefeed to the end of every line, while TWmailrec's deletes all empty lines. Is this on purpose?

TW's script is the same one I made before, but with a few modifications he introduced to suit him better. My latest script doesn't remove blank lines because TW asked me not to remove them.
But now the script works well for you, right?

#### lanux128

• Global Moderator
• Joined in 2005
• Posts: 6,260
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #20 on: April 05, 2006, 09:45 PM »
Quote
...My latest script doesn't remove blank lines because TW asked me not to remove them.
But now the script works well for you, right?

Yes, it works for me. So I'm getting a bit ambitious, but with my limited skill in AHK, I need your help...
I want to add a check-box that overwrites the output file (see screenshot). I've managed to add it to the GUI, but I can't get it to work in the code.

Quote
Gui, Add, Checkbox, x16 y115 CheckedGray vOverwrite_File, Overwrite output file?
...
;Refer check-box option, to overwrite existing output file?
If Overwrite_File
IfExist,%filetowrite%
{
FileDelete,%filetowrite%
}

Now, the above code doesn't overwrite the existing file; it only appends to it. Do you know why?
« Last Edit: April 05, 2006, 09:57 PM by lanux128 »

#### f0dder

• Charter Honorary Member
• Joined in 2005
• Posts: 9,100
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #21 on: April 06, 2006, 12:56 AM »
Quote
Install MinGW or alike : DevCPP does this for you excellently...
That's good, I'll give it a go next time I need to compile C. But for now, and for the next 4 months, I only see Lisp, POV-Ray, VRML and Java.
Maybe next semester. But thanks for the pointer, it'll surely be useful!

Or better, install the Microsoft Visual C++ 2003 toolkit. It's a better compiler, and it's free (as in money, not as in source... but who cares, I bet the majority of you haven't made tweaks to gcc or binutils).

#### jgpaiva

• Global Moderator
• Joined in 2006
• Posts: 4,727
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #22 on: April 06, 2006, 03:52 AM »
@lanux: Please try "If Overwrite_File = 1" instead of "If Overwrite_File". I guess that has to do with the fact that the checkbox can have 3 states.
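
The three-state point can be illustrated outside AHK. A checkbox created with CheckedGray can report 1 (checked), 0 (unchecked) or -1 (gray), and in Python a bare truth test treats -1 the same way as 1. A small sketch of the stricter comparison, with made-up names (`should_overwrite`, `prepare_output`) and no claim that this is the actual cause of the problem in the GUI script:

```python
import os

CHECKED, UNCHECKED, GRAY = 1, 0, -1  # the three checkbox states


def should_overwrite(state):
    """Overwrite only when the box is fully checked, not when it is gray."""
    return state == CHECKED


def prepare_output(path, state):
    """Delete an existing output file first if overwriting was requested."""
    if should_overwrite(state) and os.path.exists(path):
        os.remove(path)


if __name__ == "__main__":
    for state in (CHECKED, UNCHECKED, GRAY):
        # bool(state) is True for GRAY as well, so '== 1' is the stricter test
        print(state, bool(state), should_overwrite(state))
```

After `prepare_output` has run, appending to the path starts from an empty file whenever the box was checked, and continues the old file otherwise.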

Quote
Or better, install the Microsoft Visual C++ 2003 toolkit. It's a better compiler, and it's free (as in money, not as in source... but who cares, I bet the majority of you haven't made tweaks to gcc or binutils).

MSVCpp is free? I thought it was paid... It is a full bundle, with IDE included, right?
What's the difference between MSVCpp and DevCpp?

#### f0dder

• Charter Honorary Member
• Joined in 2005
• Posts: 9,100
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #23 on: April 06, 2006, 04:03 AM »
The free VC2003 toolkit is just the compiler + linker + libc. You need to get the Platform SDK (includes + libs for GUI development) too, but that's also free. And yes, it's free, even for commercial or non-Windows development, and it's the full optimizing compiler.

DevCpp is a GUI + the GNU GCC compiler. VC2003 typically produces better code than GCC, and IIRC it's even more C++ conformant than the versions of GCC that have been ported to Win32. If you need a GUI for it, you can check out Code::Blocks... or the (free) Express edition of VC2005 can probably be modified to be used with it.

#### DanD

• Charter Member
• Joined in 2006
• Posts: 8
##### Re: DONE: Delete double lines (all but the first) in a text file
« Reply #24 on: April 14, 2006, 11:59 AM »
I have Perl from http://www.activestate.com/ on my (Windows) PC.  From a command line prompt with Perl you can do something like

D:\Dan\Perl>perl -a -n -e "if (@F) { print unless $h{$_}; $h{$_} = 1 } else { print }" < dc1.txt
line one
line two

line three

line four

D:\Dan\Perl>type dc1.txt
line one
line one
line two

line three

line two
line one
line four
line three

D:\Dan\Perl>

(use output redirection
... > result.txt
to capture the result).

Dan