Author Topic: How do I get rid of hidden characters like †in text file ?  (Read 7899 times)
patteo
« on: January 03, 2009, 09:55:46 AM »

I have clipped some text of a webpage and it appears to contain just text.

However, when I examine it with different utilities like AptEdit or Metapad, I notice that there are hidden characters like â€.

When I load the file into Windows Notepad, I cannot see them.

They are probably some hidden control codes.

Does anyone know of a utility that I can run the file through to clean it of such hidden characters?

It would be nice if the utility also accepted command-line arguments, so that I could run it from a batch file without having to open the file.

40hz
« Reply #1 on: January 03, 2009, 10:52:12 AM »

Grab a copy of PureText and have at it:

Link: http://www.download.com/P...3000-2384_4-10069166.html

Quote
Publisher's description of PureText
From Steve Miller:


Have you ever copied some text from a web page or a document and then wanted to paste it as simple text into another application without getting all the formatting from the original source? PureText makes this simple by adding a new Windows hot-key (default is WINDOWS+V) that allows you to paste unformatted text to any application.

Version 2.0 adds Vista support, a new default hot key combo, optional sound, and various other visual enhancements and bug fixes.

Don't you see? It's turtles all the way down!
PhilB66
« Reply #2 on: January 03, 2009, 11:25:46 AM »

PureText is a wonderful little tool. The home page is at http://www.stevemiller.net/puretext/. Skrommel's PlainPaste and Copy Plain Text (a Firefox add-on) are good alternatives.

f0dder
« Reply #3 on: January 03, 2009, 11:51:46 AM »

Hidden characters?

I'm guessing at UTF-8 encoding. Processing the documents in an editor/tool that doesn't know about non-ascii character encoding will risk ruining the documents.
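
For illustration, here is a minimal Python sketch of how this kind of debris can arise. It assumes the clipped text contained a typographic apostrophe (U+2019) stored as UTF-8, and that the viewing tool read the bytes as Windows-1252 ("ANSI"). The function name is made up for this example.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252.

    This mimics a tool that treats every byte as one character instead
    of decoding multi-byte UTF-8 sequences.
    """
    return text.encode("utf-8").decode("cp1252")


RIGHT_SINGLE_QUOTE = "\u2019"  # the "slanted" apostrophe

# One character becomes three bytes in UTF-8.
print(RIGHT_SINGLE_QUOTE.encode("utf-8"))  # b'\xe2\x80\x99'

# Read byte by byte as Windows-1252, those three bytes become three
# characters: a-circumflex, the euro sign, and the trademark sign.
print(ascii(misread_as_cp1252(RIGHT_SINGLE_QUOTE)))  # '\xe2\u20ac\u2122'
```

A right double quotation mark (U+201D) ends in the byte 0x9D, which Windows-1252 leaves undefined. That may be why only the "â€" part shows up in some editors.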

- carpe noctem
40hz
« Reply #4 on: January 03, 2009, 12:49:06 PM »

Quote from: f0dder
Hidden characters?
I'm guessing at UTF-8 encoding. Processing the documents in an editor/tool that doesn't know about non-ascii character encoding will risk ruining the documents.

I believe the latest iteration of PureText supports Unicode under Windows.

Quote from: PhilB66
PureText is a wonderful little tool. The home page is @ http://www.stevemiller.net/puretext/. Skrommel's PlainPaste and Copy Plain Text (a firefox addon) are good alternatives.

Thanks for posting the direct link to the Steve Miller homepage.

I opted to post the download.com link because I kept timing out every time I tried to go to stevemiller.net. Seems to be working OK now...


tomos
« Reply #5 on: January 03, 2009, 03:17:58 PM »

Staying slightly on-topic:

I've sometimes seen this on web pages:
they can't display an apostrophe, as in ' (I don't mean the accent `),
or they have problems with umlauts; instead they show stuff like â€.
I'm wondering, is that to do with the site in question or with my browser?
(Just idly wondering, to be honest. And what did they use before the euro symbol came along?)

Tom
cyberdiva
« Reply #6 on: January 03, 2009, 06:31:27 PM »

Quote from: patteo
I have clipped some text of a webpage and it appears to contain just text.
However, when I examine it with different utilities like AptEdit or Metapad, I notice that there are hidden characters like â€.

I don't think these are "hidden characters"; it's more likely that the document was encoded in UTF-8, and that AptEdit and Metapad can't handle UTF-8. I had a similar problem with text that looked normal in UltraEdit, but when I tried to import it into a different program, all kinds of strange characters appeared. I went back into UltraEdit and saved the document as ANSI/ASCII rather than UTF-8. Then, when I imported it again into the program that couldn't handle UTF-8, it looked normal.
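
The "save as ANSI" step can also be scripted, which would suit a batch file. Below is a rough Python sketch, assuming that "ANSI" means the Windows-1252 code page and that the input really is UTF-8. The function names and the choice to replace unrepresentable characters with "?" are assumptions for this example, not what UltraEdit does internally.

```python
from pathlib import Path


def utf8_to_ansi(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Windows-1252.

    "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if one
    is present. Characters that Windows-1252 cannot represent are
    replaced with "?" rather than raising an error.
    """
    text = data.decode("utf-8-sig")
    return text.encode("cp1252", errors="replace")


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8 and write a Windows-1252 copy to dst."""
    Path(dst).write_bytes(utf8_to_ansi(Path(src).read_bytes()))
```

Curly quotes survive this conversion because Windows-1252 has its own single-byte codes for them. Characters outside that code page are lost, so it is worth keeping the UTF-8 original.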

patteo
« Reply #7 on: January 04, 2009, 06:48:20 AM »

Thanks all you most helpful donationcoders for your suggestions.

I tried PureText, PlainPaste, and Thornsoft ClipMate (the cleaning part of the utility), but no luck: still the same problem.

Eventually, I tracked down the problem to either " and ' characters in this particular instance and when I deleted all of them, the problem went away.

I suspect it has something to do with UTF-8 encoding as well.

Well, I imagine that something like that would not be too hard to fix but it does take some trial and error.

cyberdiva
« Reply #8 on: January 04, 2009, 08:34:41 AM »

Quote from: patteo
Eventually, I tracked down the problem to either " and ' characters in this particular instance and when I deleted all of them, the problem went away.
I suspect it has something to do with UTF-8 encoding as well.

I've seen this when someone cuts and pastes from MS Word or other programs that use slanted apostrophes and quotation marks rather than the straight up-and-down ' and " . If this occurs in text over which I have some control, I'll change all the slanted stuff to straight up-and-down characters; a simple search-and-replace will usually do the trick. And yes, I think UTF-8 can represent the slanted stuff, but programs that can't handle UTF-8 will show the slanted stuff as strange characters.
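
That search-and-replace could be sketched in Python along these lines. It assumes the file is UTF-8 and that only the four common curly quote characters need replacing. The names `straighten_quotes` and `clean_file` are invented for this example.

```python
# Curly punctuation mapped to its straight ASCII counterpart.
STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace slanted quotes and apostrophes with straight ones."""
    return text.translate(STRAIGHT)


def clean_file(src: str, dst: str) -> None:
    """Read src as UTF-8, straighten its quotes, write the result to dst."""
    with open(src, encoding="utf-8") as infile:
        cleaned = straighten_quotes(infile.read())
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(cleaned)
```

Once only straight quotes remain, the text is plain ASCII as far as quotes are concerned, so it looks the same whether a tool reads it as UTF-8 or as ANSI. Calling `clean_file` with two paths taken from `sys.argv` would make it usable from a batch file.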

Curt
« Reply #9 on: January 04, 2009, 02:34:10 PM »

@ patteo: If you are on Firefox, it is as simple as installing an add-on like Copy Plain Text or Extended Copy Menu (there is also a version for Internet Explorer), or even Auto Context, and then using the new entry in the right-click context menu, "Copy As Text (without formats)".

Edited:
Another way is to change View > Charset > ... before copying.
« Last Edit: January 04, 2009, 02:38:51 PM by Curt »
f0dder
« Reply #10 on: January 04, 2009, 03:26:52 PM »

tomos: when you see that "broken utf-8" on the web, it's probably the HTML document encoding type that hasn't been set properly - all current browsers should be able to render utf-8 just fine.

cyberdiva: some editors do support unicode documents, but fail to auto-detect document encoding if the document doesn't start with a BOM... kinda similar to the broken webpages not specifying document encoding.
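
As a rough illustration of that BOM-only detection, here is a Python sketch of an editor that picks an encoding purely from the first few bytes and otherwise falls back to a legacy code page. The fallback of Windows-1252 and the name `sniff_bom` are assumptions for this example; real editors may use other heuristics.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes, default: str = "cp1252") -> str:
    """Return a codec name based only on a leading byte order mark."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default


# UTF-8 without a BOM falls through to the default, and the apostrophe
# is then decoded as three separate Windows-1252 characters.
no_bom = "don\u2019t".encode("utf-8")
print(sniff_bom(no_bom))                     # cp1252
print(sniff_bom(codecs.BOM_UTF8 + no_bom))   # utf-8-sig
```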

cyberdiva
« Reply #11 on: January 25, 2009, 11:29:45 AM »

Quote from: f0dder
cyberdiva: some editors do support unicode documents, but fail to auto-detect document encoding if the document doesn't start with a BOM... kinda similar to the broken webpages not specifying document encoding.

Thanks for the info, f0dder. AFAIK, the program I had trouble with most recently unfortunately does not support Unicode: askSam 7, a flexible database program that I like a lot (when I don't hate it). There have been a number of messages on their forum complaining about the total lack of Unicode support.

f0dder
« Reply #12 on: January 26, 2009, 02:39:24 AM »

The thing with Unicode is that you really ought to design your programs with it in mind from the start; trying to retrofit support can quickly become a major bother. And if you use third-party components without Unicode support, you might have to include both ANSI and Unicode code paths, massage data into the right formats, etc. Slightly++ messy.