

REQ: Convert/Export Certain Browser's "Groups" as Chrome Bookmarks


Cocoa:
Hey, 4wd

Thanks so much for your continued efforts in helping with and supporting this tool. I got caught up with some unexpected stuff over the weekend, so I didn't get a chance to test the new version until now. I double-clicked the script to run it and ran into an error after choosing the folder for the groups. It seems many of the commands are not supported in my Windows 7 SP1 x64 (Ult). Is there some extra dependency?

(attached: two screenshots of the error messages)

And to answer your earlier question about encoding: the characters in my first post are UTF-8 because they were copied into and displayed on a web page. When viewed in a browser, they display according to the HTML and/or browser settings. Chrome and most modern browsers will automatically select the correct code page for displaying text that isn't Unicode (assuming you have that code page installed), but you can also set the display encoding manually in your browser, and that can likewise produce mojibake when it doesn't match the encoding actually used. Since there can be multiple code pages even for the same language (due to regional differences), it makes for a fairly troublesome affair, which is why most things use Unicode nowadays. Anyway, what you see on a web page is entirely separate from the cgp files, which are probably still in ANSI, the encoding Green Browser uses by default when generating them (and over which I have no control). That mismatch in encoding is also why mojibake appears when the text from the cgp is imported into Chrome/Slimjet.
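Just to illustrate what I mean by the mismatch (a throwaway PowerShell example, not anything taken from the actual files):

--- Code: Text ---
# Sketch: the same GBK-encoded bytes decoded with the right and the wrong code page.
$gbk   = [System.Text.Encoding]::GetEncoding(936)            # GBK, a superset of GB2312
$bytes = $gbk.GetBytes('中文测试')                            # arbitrary Chinese sample text
$gbk.GetString($bytes)                                        # decoded with the matching code page: readable
[System.Text.Encoding]::GetEncoding(1252).GetString($bytes)   # decoded as Western (1252): mojibake
--- End code ---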

Taking a closer look, it seems the HTML generation was actually in the middle of working when that error caused it to fail. I have here the generated HTML file, which contains the full contents of one group as well as partial contents from a second group. I've also uploaded the original cgp files of those two groups. You said the archive/extract process might be messing with the encoding. I know this is possible for the file names, but it's unlikely for the contents of a file. However, just to eliminate all possibilities, I've used an outside hosting site to upload the original files without compression, since this forum's attachment restrictions don't allow cgp or html files.
(*links removed since they're no longer needed)

1. Unfortunately, the encoding conversion still isn't working right. The non-Romance characters still end up as gibberish when imported. Since mojibake from a wrong encoding ends up displaying even more unusual symbols than normal, perhaps it was one such character/symbol that caused the app to generate an error and exit? This appears to be strangely similar to what happened with the converter for cgp to IE favorites.

2. The unique top-level "folder" name needs to be an <H3> tag, not <H4>, or it will not be imported properly into Chrome/Slimjet. Every folder after that also needs to be an <H3> tag, preceded by <DT>, with its contents enclosed by a pair of <DL><p> tags. The importing process doesn't recognize anything else. The only things denoting the boundaries of each folder are the <DL><p> and </DL><p> tag pairs: everything between the <DL><p> tag and the </DL><p> tag, whether it's an <H3> folder or a URL, is treated as being inside that particular "folder". So if you want to create different layers of folders, you only need to make sure the pairs are placed appropriately, which means the "folder" names must always be <H3> tags and MUST be followed immediately by an opening <DL><p> tag. The current html is also missing the <DL><p> tag after the top-level folder. The basic code I included in my second post is what I have tested to work, so it should be fine as long as the format of the generated html closely follows that.
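To make the layout concrete, here's a minimal sketch of the structure I mean (the folder names and URLs are just placeholders, not from my actual groups):

--- Code: Text ---
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>Top-level group</H3>
    <DL><p>
        <DT><A HREF="http://example.com/">My name/note for this link</A>
        <DT><H3>Nested folder</H3>
        <DL><p>
            <DT><A HREF="http://example.org/">Another link</A>
        </DL><p>
    </DL><p>
</DL><p>
--- End code ---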

(attached: screenshot)

3. If it's not too much trouble, it's probably best to have the generated html's file name include the date as well as the time. Otherwise there's still a small but real possibility that it will collide with a previously generated file name. E.g., instead of "CGP2HTML-%hour%%min%%sec%.html", how about "CGP2HTML-%year%%month%%day%.%hour%%min%%sec%.html"?
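For instance, in PowerShell one way to build such a name would be the following (purely illustrative; the format string is just a suggestion and the script itself may well use something else):

--- Code: Text ---
# Sketch: date + time stamp for the output file name.
$stamp   = Get-Date -Format 'yyyyMMdd.HHmmss'
$outFile = "CGP2HTML-$stamp.html"
--- End code ---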

4. One additional issue that occurred to me while looking over my cgp files: the file names of some cgp files can themselves be in non-Romance characters. Let's consider together how to handle this once the other issues have been resolved.

4wd:
It seems many of the commands are not supported in my Windows 7 SP1 x64 (Ult). Is there some extra dependency?-Cocoa (June 23, 2015, 05:25 PM)
--- End quote ---

They're not commands, they're fragments of the url/name that was read - something that wasn't a problem with the abbreviated .cgp in your OP.

You can't convert from ANSI characters to UTF-8 characters from what I've tried and read.  Sure you can convert the file format but that won't fix the original input characters.
(EDIT: I'm stoopid :-[)

Here's the problem: you're not talking about just converting a text file to another text file in a different format/encoding.  You're talking about having a program read each URL, fetch that page, parse it, and then store the page title plus the URL in another bookmark format and character encoding.

This is why both the converter program and command file aren't providing you with what you want - a "format" converter is not what is required.

1. Unfortunately, the encoding conversion still isn't working right. The non-Romance characters still end up as gibberish when imported. Since mojibake from a wrong encoding ends up displaying even more unusual symbols than normal, perhaps it was one such character/symbol that caused the app to generate an error and exit?
--- End quote ---

Exactly - it was the gibberish in the name field. There's no way around that: if it starts out as gibberish, it'll remain gibberish.  The only option is to refetch the page and save the title with the correct encoding.

I'll have a play in Powershell since it supports Unicode.

EDIT: Exactly where are you located and what locale/language is your system set to?

I've been playing around in Powershell and quite frankly, I can't get it to work consistently with some of the sites in your .cgp file.

e.g. reading the page title through Powershell:

--- Code: Text ---
http://www.cnn.co.jp/fringe/35065380.html?tag=rcol;editorSelect
--- End code ---

correctly gives:
CNN.co.jp : NASAの「空飛ぶ円盤」、ハワイで飛行実験


--- Code: Text ---
http://www.excite.co.jp/world/chinese/
--- End code ---

always gives gibberish even though the page is supposedly UTF-8:
中国語翻訳 - エキサイト 翻訳

Not only that but there is mixed encoding, it's not all UTF-8:

--- Code: Text ---
http://www.messe.gr.jp/girls/index.html?category_id=&kw=%2588%25E4%258F%25E3%2598a%2595F&pageID=5
--- End code ---

This page is Shift-JIS, which means the title will be illegible when output as UTF-8.
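For reference, the sort of thing I've been trying is roughly along these lines - a rough sketch rather than the final script, using one of the URLs from your list:

--- Code: Text ---
# Sketch: fetch a page and pull the <title> text out of the raw HTML (PowerShell).
$url = 'http://www.cnn.co.jp/fringe/35065380.html?tag=rcol;editorSelect'
$response = Invoke-WebRequest -Uri $url
if ($response.Content -match '<title>(.*?)</title>') {
    $Matches[1].Trim()
}
# The snag: Content gets decoded using whatever charset the response/page declares (or a default),
# so pages that are really Shift-JIS or GBK can still come out as gibberish.
--- End code ---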

Cocoa:
You're talking about having a program read each URL, fetch that page, parse it, and then store the page title plus the URL in another bookmark format and character encoding.
...
Exactly - it was the gibberish in the name field. There's no way around that: if it starts out as gibberish, it'll remain gibberish.  The only option is to refetch the page and save the title with the correct encoding.

I've been playing around in Powershell and quite frankly, I can't get it to work consistently with some of the sites in your .cgp file.
-4wd (June 23, 2015, 09:28 PM)
--- End quote ---
At first I was very confused by your first sentence, since from the beginning I was talking about just converting text from the cgp files into text stored in the html format, but now I understand where you're coming from. The thing is, fetching titles from the sites themselves wouldn't work, and not just because you have no control over the encoding any particular site chooses to use (yes, some use Unicode, but many use a specific encoding that varies depending on the language). Most importantly, my original "names" for the urls are more than just the pages' titles. I use them more as notes than anything else, summarizing the site's content in a way that makes sense to me, and they often include important reminders of other things. That's why it's crucial to retain the original "name" fields.

I truly appreciate your patience and varied approach in finding a solution to this challenge. ;D

I thought it would be possible to convert directly from ANSI to Unicode, but as someone without any technical background in computers, I don't really know how the process works, so I guess I oversimplified things. You said if it starts out as gibberish, it will remain gibberish. Thinking back on some of the little built-in tools I've occasionally used in other software to fix mojibake (such as the Chacon plugin for foobar2000), they have generally required you to choose a character encoding for the source text and then an output encoding.

In Chacon, which is used to fix tags for audio files, a preview window lets you switch between each encoding and see immediately whether the characters are displayed correctly, i.e. whether the correct source encoding has been chosen, before the conversion is done and the tags themselves are overwritten. It allows non-Unicode characters that were generated on one computer to be displayed correctly on another computer with a different system locale. The preview is useful because not only do you have no idea what system locale was used on the computer the file came from, but even when you know what language the characters are supposed to be in, the encoding doesn't always match the actual language of the characters. With a large enough character set, it's possible to display characters from one language using a character set intended for another: e.g., it's possible to display Japanese characters using a Chinese encoding, and maybe vice versa. (It's also why most European languages share a single "Western" charset.) However, sometimes some characters aren't included, and another important function of the preview is to let you see what might be lost if you choose a certain encoding, which encoding retains the most info, and so on. I didn't think it would be necessary to do it this way for converting from ANSI into Unicode, but I overlooked the fact that Unicode is ultimately still just another encoding, albeit one that is able to include all languages.
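In PowerShell/.NET terms, I imagine the preview idea boils down to something like the following (just an illustration - the file name and candidate code pages here are made up):

--- Code: Text ---
# Sketch: decode the same bytes with several candidate code pages and eyeball which one looks right.
# 936 = GBK, 932 = Shift-JIS, 950 = Big5, 65001 = UTF-8.
$bytes = [System.IO.File]::ReadAllBytes('groups\example.cgp')
foreach ($cp in 936, 932, 950, 65001) {
    "--- code page $cp ---"
    [System.Text.Encoding]::GetEncoding($cp).GetString($bytes)
}
--- End code ---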


I will understand if you consider this too involved to implement. There's always the option to set it aside. I can try manually editing the "UTF-8" field to the right encoding. In your first version of the CGP2HTML tool, the generated html displayed correctly after I edited the "UTF-8" field to the encoding I assume the text was originally saved in. If you just grab the text without attempting to convert it (or however it was done in the first version), I can manually edit this field to whatever I think the proper encoding might be, based on the non-Unicode settings of the system the files were generated on.

Case in point, on my system, the "system locale" or what Windows calls "the language setting for non-Unicode programs" varies, changing from time to time, depending on what language I need to display. It is often set to "Chinese (Simplified, PRC)", and sometimes "Japanese (Japan)". As I said, there are many different code pages even just for Simplified Chinese, and I'm not sure exactly what that setting translates to code page-wise, but when I open a cgp file directly in the Green Browser window as a "webpage", "GB2312" is the encoding it selects automatically, and it appears to display correctly. When I try to open the cgp in Slimjet, I have to select an encoding manually to get it to display properly, and since GB2312 wasn't available, I chose "GBK", which is apparently just an extended version of the GB2312 charset. That also appears to display properly. In the html generated by the first version of CGP2HTML, all I did was change UTF-8 to GBK and it displayed correctly when imported into Slimjet.

4wd:
Case in point, on my system, the "system locale" or what Windows calls "the language setting for non-Unicode programs" varies, changing from time to time, depending on what language I need to display. It is often set to "Chinese (Simplified, PRC)", and sometimes "Japanese (Japan)".-Cocoa (June 24, 2015, 02:37 PM)
--- End quote ---

Bingo!  I wish you'd said that at the start.  OK, let's see how far I get this time :)

I was wrong about the character encoding conversion, oh well ... happens sometimes   ;)  Anyway, I found a program that will do it at the command line, so I can either work it into the command file or you can bulk convert all your .cgp files before running the reformat. One or the other - I'll see what happens.
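For what it's worth, the same conversion can also be sketched directly in PowerShell/.NET - this isn't the program I found, just an illustration with an assumed file name and code page:

--- Code: Text ---
# Sketch: re-save a GBK/GB2312 .cgp as UTF-8 so the generated HTML can stay declared as UTF-8.
$src  = 'groups\example.cgp'                                # assumed path
$gbk  = [System.Text.Encoding]::GetEncoding(936)            # GBK, superset of GB2312
$text = $gbk.GetString([System.IO.File]::ReadAllBytes($src))
[System.IO.File]::WriteAllText('groups\example-utf8.cgp', $text, [System.Text.Encoding]::UTF8)
--- End code ---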

Cocoa:
:-[ Sorry - besides the fact that I switch between different system locales depending on what I need, I didn't think it was necessary to know the initial encoding in order to convert from ANSI to Unicode. I apologize for any unnecessary detours this caused.  :P

I'm glad you found something to simplify the encoding conversion process. :) I think that ultimately all data is stored in binary form, so while it may appear to be gibberish, it isn't actually gibberish as long as it hasn't been corrupted. As I understand it, the data just needs the proper key to be "translated" into human-readable form, which in this case is what the character encodings are for. That's why the characters in the cgp files display consistently in Green Browser on my computer, and also show up correctly in Slimjet once the proper encoding is selected. (They might not show up properly on your system until you have the proper encoding installed, though I'm pretty sure most post-WinXP OS's already have most encodings installed by default.) It's only when the data gets translated with the wrong encoding and then stored in that "translated" form that it becomes corrupted and will likely lose portions, if not all, of the original info, even if translated correctly afterwards. After all, some things are "lost in translation", especially incorrect translation. That's when gibberish will truly stay gibberish.

Please take your time and try things out to see what works. I'll look forward to seeing the result and testing things again.
