Author Topic: How to concatenate a number of text files which may have different encoding?  (Read 4448 times)

IainB

I wanted to concatenate all my MBAM (Malwarebytes) text Log files into a single text file, with a view to using the latter as a database that I could search and analyse in some way (maybe as input to a spreadsheet or DB tool).

STEPS:
  • 1. Set up a SCRAP working directory.
  • 2. Copied all MBAM Log files from their folder into the SCRAP folder - there were 87 files, ranging from mbam-log-2010-12-28 (23-38-53).txt to mbam-log-2013-04-17 (04-38-12).txt.
  • 3. Mass renamed the 87 log files to the form xnn.txt (where nn = 01 to 87).
  • 4. Concatenated all 87 filenames into one long string with a "+" sign between each filename - e.g. x01.txt+x02.txt+…
    x01.txt+x02.txt+x03.txt+x04.txt+x05.txt+x06.txt+x07.txt+x08.txt+x09.txt+x10.txt+x11.txt+x12.txt+x13.txt+x14.txt+x15.txt+x16.txt+x17.txt+x18.txt+x19.txt+x20.txt+x21.txt+x22.txt+x23.txt+x24.txt+x25.txt+x26.txt+x27.txt+x28.txt+x29.txt+x30.txt+x31.txt+x32.txt+x33.txt+x34.txt+x35.txt+x36.txt+x37.txt+x38.txt+x39.txt+x40.txt+x41.txt+x42.txt+x43.txt+x44.txt+x45.txt+x46.txt+x47.txt+x48.txt+x49.txt+x50.txt+x51.txt+x52.txt+x53.txt+x54.txt+x55.txt+x56.txt+x57.txt+x58.txt+x59.txt+x60.txt+x61.txt+x62.txt+x63.txt+x64.txt+x65.txt+x66.txt+x67.txt+x68.txt+x69.txt+x70.txt+x71.txt+x72.txt+x73.txt+x74.txt+x75.txt+x76.txt+x77.txt+x78.txt+x79.txt+x80.txt+x81.txt+x82.txt+x83.txt+x84.txt+x85.txt+x86.txt+x87.txt

  • 5. Turned this string into a DOS COPY command: copy [string] /A concatALL.txt /A
  • 6. Ran the COPY command.
  • 7. Examined the output file concatALL.txt

The output file concatALL.txt read fine until right after the end of the content from file 34 (somewhere before the middle of the file), at which point the text became weird-looking, with embedded spaces.
Further inspection of the source files showed that:
  • files 01-34 had been encoded in Windows 1252 Western European.
  • files 35-87 had been encoded in Unicode, UTF-16 little endian.
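A minimal Python sketch of why the join went wrong, assuming COPY did roughly a byte-for-byte join and the result was then read as Windows-1252. The log lines and the helper name raw_concat_as_ansi are made up for illustration.

```python
import codecs


def raw_concat_as_ansi(first: bytes, second: bytes) -> str:
    """Join two byte strings as-is, then read the result as Windows-1252."""
    return (first + second).decode("cp1252", errors="replace")


# Hypothetical log lines: one stored as Windows-1252, one as UTF-16 LE with a BOM.
ansi_part = "Scan A\r\n".encode("cp1252")
utf16_part = codecs.BOM_UTF16_LE + "Scan B\r\n".encode("utf-16-le")

mixed = raw_concat_as_ansi(ansi_part, utf16_part)

# ascii() shows the stray BOM (\xff\xfe) and the NUL byte after every
# character that came from the UTF-16 part.
print(ascii(mixed))
```

The NUL characters are what a text editor tends to display as embedded spaces once it has settled on a single-byte encoding for the whole file.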

So:
  • I concatenated files 01-34 into file concat01.txt
  • I concatenated files 35-87 into file concat02.txt

The latter two files each seemed to read fine in a text editor. It was only when you tried to concatenate them that problems arose.
I then copied/pasted all the text in file concat02.txt onto the end of concat01.txt, then saved and closed the concat01.txt file. Opening it subsequently, the text read fine all the way through. Problem solved.

My question is: What approach could have enabled me to do this quicker, or with more automation?

x16wda

You could try using this command:

   for %i in (mbam*.txt) do type "%i" >> combined.txt

(If you put this in a batch file, you'd need to double the % sign. The quotes around %i are needed because the original log names contain spaces.)  That's the quickie way (for a boring plain ANSI English user), although depending on content you might lose some of the more interesting characters.
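To illustrate the caveat about losing characters, here is a small Python sketch of a lossy conversion to a single-byte code page. It is only a stand-in: the code page that type actually writes depends on the console settings, and cp1252, the helper name to_code_page and the sample detection name are assumptions made for the example.

```python
def to_code_page(text: str, code_page: str = "cp1252") -> str:
    """Round-trip text through a single-byte code page, replacing what will not fit."""
    return text.encode(code_page, errors="replace").decode(code_page)


# Hypothetical detection name containing Cyrillic letters.
sample = "Trojan.Агент"

# Characters with no slot in the code page come back as "?".
print(to_code_page(sample))  # prints Trojan.?????
```

Plain English text survives unchanged, so for logs with only ASCII content the quick method loses nothing.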

4wd

Off the top of my head:

for /r c:\scrap %a in (*.txt) do type "%~a" >> C:\output.txt

Haven't tested it (middle of the night atm), but in theory it recursively lists each text file and appends it to C:\output.txt by using the type command.

If you want the input files sorted by name then:

for /f "usebackq tokens=*" %a in (`dir /on /b *.txt`) do type "%~a" >> C:\output.txt

To sort by date, use /od instead of /on.
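For comparison, the same two orderings can be sketched in Python. The function name order_files and its by parameter are made up for this example; sorting by name ignores case, and sorting by date uses the file's last-modified time, oldest first.

```python
import os


def order_files(paths, by="name"):
    """Return paths sorted by file name (case-insensitive) or by modification time."""
    if by == "name":
        return sorted(paths, key=lambda p: os.path.basename(p).lower())
    if by == "date":
        return sorted(paths, key=os.path.getmtime)  # oldest first
    raise ValueError("by must be 'name' or 'date'")
```

Because the original MBAM log names embed the timestamp as yyyy-mm-dd (hh-mm-ss), sorting them by name also puts them in chronological order.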

Try typing the files in a CLI; if they look OK, then the above should work.

EDIT: Huh! x16wda beat me by seconds  ;D

rjbull

Batch file method should work, but I've sometimes used ports of the Unix cat command:
Usage: cat [OPTION] [FILE]...
Concatenate FILE(s), or standard input, to standard output.

  -A, --show-all           equivalent to -vET
  -b, --number-nonblank    number nonblank output lines
  -e                       equivalent to -vE
  -E, --show-ends          display $ at end of each line
  -n, --number             number all output lines
  -s, --squeeze-blank      never more than one single blank line
  -t                       equivalent to -vT
  -T, --show-tabs          display TAB characters as ^I
  -u                       (ignored)
  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
      --help               display this help and exit
      --version            output version information and exit

With no FILE, or when FILE is -, read standard input.

  -B, --binary             use binary writes to the console device.



AbteriX

Or

copy *.txt concatALL.txt




IainB

Ah, thanks for that. I had forgotten the TYPE command.
I knew there was some way to normalise all the encoding to the output file, but could not recall what it was.

Unfortunately, the COPY command seems to be no good in any shape or form, as there is apparently no way to force it to normalise the encoding to the output file.

What worked: for %i in (x*.txt) do type %i >> combined.txt
I have assumed it took the input files in alphabetical order; the output looks as though it did, anyway.
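If losing characters ever becomes a concern, one alternative (not something anyone in this thread used) is a short Python script that decodes each file according to its byte-order mark and writes everything out as UTF-8. This is a sketch under assumptions: files without a BOM are taken to be Windows-1252, as files 01-34 were, and the names decode_log and concatenate_logs are made up for the example.

```python
import codecs
from pathlib import Path


def decode_log(data: bytes, fallback: str = "cp1252") -> str:
    """Decode one file's bytes, using a BOM if present, else the fallback."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The utf-16 codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16", errors="replace")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig", errors="replace")
    # Assumption: anything without a BOM is in the fallback code page.
    return data.decode(fallback, errors="replace")


def concatenate_logs(folder, pattern, out_path):
    """Append every matching file, in name order, to one UTF-8 output file."""
    out_path = Path(out_path)
    # Skip the output file itself in case it matches the pattern.
    paths = sorted(p for p in Path(folder).glob(pattern)
                   if p.resolve() != out_path.resolve())
    # newline="" keeps the original line endings untouched.
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            text = decode_log(path.read_bytes())
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\r\n")  # keep the next file from starting mid-line
    return len(paths)
```

With the renamed files, a call such as concatenate_logs(r"C:\scrap", "x*.txt", r"C:\scrap\combined.txt") would process x01.txt to x87.txt in name order and return the number of files joined.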