How to concatenate a number of text files which may have different encoding?

IainB:
I wanted to concatenate all my MBAM (Malwarebytes) text Log files into a single text file, with a view to using the latter as a database that I could search and analyse in some way (maybe as input to a spreadsheet or DB tool).

STEPS:

* 1. Set up a SCRAP working directory.
* 2. Copied all MBAM Log files from their folder into the SCRAP folder - there were 87 files, ranging from mbam-log-2010-12-28 (23-38-53).txt to mbam-log-2013-04-17 (04-38-12).txt.
* 3. Mass renamed the 87 log files to the form xnn.txt (where nn = 01 to 87).
* 4. Concatenated all 87 filenames into one long string with a "+" sign between each filename - e.g. x01.txt+x02.txt+…
Spoilerx01.txt+x02.txt+x03.txt+x04.txt+x05.txt+x06.txt+x07.txt+x08.txt+x09.txt+x10.txt+x11.txt+x12.txt+x13.txt+x14.txt+x15.txt+x16.txt+x17.txt+x18.txt+x19.txt+x20.txt+x21.txt+x22.txt+x23.txt+x24.txt+x25.txt+x26.txt+x27.txt+x28.txt+x29.txt+x30.txt+x31.txt+x32.txt+x33.txt+x34.txt+x35.txt+x36.txt+x37.txt+x38.txt+x39.txt+x40.txt+x41.txt+x42.txt+x43.txt+x44.txt+x45.txt+x46.txt+x47.txt+x48.txt+x49.txt+x50.txt+x51.txt+x52.txt+x53.txt+x54.txt+x55.txt+x56.txt+x57.txt+x58.txt+x59.txt+x60.txt+x61.txt+x62.txt+x63.txt+x64.txt+x65.txt+x66.txt+x67.txt+x68.txt+x69.txt+x70.txt+x71.txt+x72.txt+x73.txt+x74.txt+x75.txt+x76.txt+x77.txt+x78.txt+x79.txt+x80.txt+x81.txt+x82.txt+x83.txt+x84.txt+x85.txt+x86.txt+x87.txt


* 5. Turned this string into a DOS COPY command: copy [string] /A concatALL.txt /A
* 6. Ran the COPY command.
* 7. Examined the output file concatALL.txt
Examination of the output file concatALL.txt showed that it was fine up until somewhere before the middle of the file, when the text became weird-looking with embedded spaces, right after the end of the content from file 34.
Further inspection of the source files showed that:

* files 01-34 had been encoded in Windows 1252 Western European.
* files 35-87 had been encoded in Unicode, UTF-16 little endian.
So:

* I concatenated files 01-34 into file concat01.txt
* I concatenated files 35-87 into file concat02.txt
The latter two files each seemed to read fine in a text editor. It was only when you tried to concatenate them that problems arose.
I then copied/pasted all the text in file concat02.txt onto the end of concat01.txt, then saved and closed the concat01.txt file. Opening it subsequently, the text read fine all the way through. Problem solved.
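
For what it's worth, the manual split into two batches could probably be avoided by decoding each file before appending it. Below is a minimal Python sketch, not a tested tool: it assumes the UTF-16 logs start with a byte-order mark and that any file without one is Windows-1252. The names read_log and concat_logs, the UTF-8 output and the mbam-log-*.txt pattern are my own choices for illustration.

```python
import codecs
from pathlib import Path


def read_log(path):
    """Return the text of one log file, guessing its encoding from the BOM."""
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM, picks the byte order and drops the BOM.
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    # Assumption: no BOM means one of the older Windows-1252 logs.
    # Bytes that cp1252 leaves undefined are replaced rather than raising.
    return raw.decode("cp1252", errors="replace")


def concat_logs(folder, out_name="concatALL.txt", pattern="mbam-log-*.txt"):
    """Decode every matching file and write the combined text as UTF-8."""
    folder = Path(folder)
    out_path = folder / out_name
    # Sorting by name also sorts by date here, because the timestamp in the
    # filename is written year-month-day. The output file itself is skipped.
    sources = sorted(p for p in folder.glob(pattern) if p != out_path)
    # newline="" keeps the original line endings instead of translating them.
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for src in sources:
            text = read_log(src)
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\r\n")  # keep the next file from starting mid-line
    return out_path
```

Something like concat_logs(r"C:\SCRAP") would then do the whole job in one pass, with no renaming step (the folder path is just an example). Because every file is decoded to text first, the mixed encodings never meet at the byte level, which is what garbled the plain COPY output.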

My question is: What approach could have enabled me to do this quicker, or with more automation?

x16wda:
You could try using this command:

   for %i in (mbam*.txt) do type "%i" >> combined.txt

(If you put this in a batch file, you'd need to double the % sign.)  That's the quickie way (for a boring plain ANSI English user) although depending on content you might lose some of the more interesting characters.

4wd:
Off the top of my head:

for /r c:\scrap %a in (*.txt) do type "%~a" >> C:\output.txt

Haven't tested it (middle of the night atm), but in theory it recursively lists each text file and appends it to C:\output.txt using the type command.

If you want the input files sorted by name then:

for /f "usebackq tokens=*" %a in (`dir /on /b *.txt`) do type "%~a" >> C:\output.txt

For sorted by date use /od instead of /on
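
If a script is preferred over the for loop, the same two orderings can be sketched in Python. This is only an illustration: list_logs and its by parameter are made-up names, and "date" here means last-modified time, oldest first, which is my understanding of what dir /od does by default.

```python
from pathlib import Path


def list_logs(folder, pattern="*.txt", by="name"):
    """Return matching files sorted by name (like /on) or by date (like /od)."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    if by == "date":
        # Oldest first, using the last-modified timestamp.
        return sorted(files, key=lambda p: p.stat().st_mtime)
    # Case-insensitive name order, as Windows treats filenames.
    return sorted(files, key=lambda p: p.name.lower())
```

The returned list can then be fed to whatever does the appending, so the order of the combined file is decided in one place.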

Try displaying the files with the type command at the command prompt first; if they look OK there, the above should work.

EDIT: Huh! x16wda beat me by seconds  ;D

rjbull:
Batch file method should work, but I've sometimes used ports of the Unix cat command:
Usage: cat [OPTION] [FILE]...
Concatenate FILE(s), or standard input, to standard output.

  -A, --show-all           equivalent to -vET
  -b, --number-nonblank    number nonblank output lines
  -e                       equivalent to -vE
  -E, --show-ends          display $ at end of each line
  -n, --number             number all output lines
  -s, --squeeze-blank      never more than one single blank line
  -t                       equivalent to -vT
  -T, --show-tabs          display TAB characters as ^I
  -u                       (ignored)
  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
      --help               display this help and exit
      --version            output version information and exit

With no FILE, or when FILE is -, read standard input.

  -B, --binary             use binary writes to the console device.



AbteriX:
Or

copy *.txt concatALL.txt


