Show Posts - jity2

Post New Requests Here / REQ: CLI : basically unzip + run PDFTextChecker and move resulting files

« on: November 14, 2019, 02:41 AM »

Dear all,

I am trying to speed up my monthly manual archiving work by automating some steps. I am on Win 8.1 64-bit Home.

[0) I am using SyncBackPro V8 to periodically move files (zip, rar, pdf, jpg and png) to a folder named "folder1".
When this SyncBackPro profile has run, it can run a program after the above files have been moved ("Run after profile").
I have discovered that I can run a .bat file which itself runs several .bat files (https://stackoverflow.com/questions/1103994/how-to-run-multiple-bat-files-within-a-bat-file). ]

The main.bat file (or powershell?) should do the following :

1) delete duplicate files (MD5 checksum?)
(maybe using this powershell script https://n3wjack.net/2015/04/06/find-and-delete-duplicate-files-with-just-powershell/ ?)

2) unzip+unrar the zip and rar files of "folder1" recursively (and once done, delete the originals)
I thought of using this old coding snack in ahk (RecurUnZip https://www.donationcoder.com/forum/index.php?topic=21424.msg192366#msg192366) but I need something automatic, as my "folder1" path doesn't change.
(I have tried to adapt this powershell code using its comments, alas unsuccessfully: https://superuser.com/questions/620056/recursively-unzip-files-where-they-reside-then-delete-the-archives/620077 !)

3) Reduce file paths to less than 260 characters (because some old programs can't open file paths longer than 260 characters later on). I manually use "Path Scanner" (http://www.parhelia-tools.com/products/pathscanner/Download.aspx)

4) Delete pdf files that are smaller than 2 KB (because they are garbage) (I manually do this by using the freeware Everything and sorting the pdf files by size)

5) Run PDFTextChecker (https://www.donationcoder.com/forum/index.php?topic=27311.msg255322#msg255322) on "folder1" by itself (it creates 2 files, "!Not_Searchable.txt" and "!Searchable.txt"). Move the Not_Searchable pdf files to "folder2" and the Searchable files to "folder3" (maybe using this old coding snack https://www.donationcoder.com/forum/index.php?topic=35339.msg330784#msg330784 ?).
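
A minimal Python sketch of steps 1), 2) and 4) above, under stated assumptions rather than as a finished tool. The function names are hypothetical. Step 2 handles only .zip files, because the Python standard library cannot open .rar archives, and each archive is extracted into a folder named after it. "2 KB" is assumed to mean 2048 bytes. Among duplicates, the file that sorts first by path is the one kept.

```python
import hashlib
import zipfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_duplicates(root: Path) -> list[Path]:
    """Step 1: keep the first file seen for each MD5 checksum, delete the rest."""
    seen: dict[str, Path] = {}
    removed = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        checksum = file_md5(path)
        if checksum in seen:
            path.unlink()
            removed.append(path)
        else:
            seen[checksum] = path
    return removed


def extract_zips(root: Path) -> int:
    """Step 2 (zip only): extract each archive beside itself, then delete it.

    The scan repeats until no .zip is left, so archives nested inside
    other archives are extracted too. Returns the number of archives handled.
    """
    count = 0
    while True:
        archives = [p for p in root.rglob("*.zip") if p.is_file()]
        if not archives:
            return count
        for archive in archives:
            with zipfile.ZipFile(archive) as bundle:
                # Assumption: extract into a folder named after the archive.
                bundle.extractall(archive.with_suffix(""))
            archive.unlink()
            count += 1


def delete_small_pdfs(root: Path, min_bytes: int = 2048) -> list[Path]:
    """Step 4: delete PDFs smaller than min_bytes (assumed 2 KB = 2048 bytes)."""
    removed = []
    for path in list(root.rglob("*.pdf")):
        if path.is_file() and path.stat().st_size < min_bytes:
            path.unlink()
            removed.append(path)
    return removed
```

These functions delete files, so a first run on a copy of "folder1" seems prudent. A corrupt archive raises an exception and stops the run instead of being skipped.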

Then I use ABBYY FineReader 12 Corporate (not the latest version, which limits the number of pages you can OCR per month!), which automatically OCRs "folder2" every day. (Be careful if you follow this process, as it sometimes deletes pdf files without warning! So use the option to keep the original files in a separate folder! Then at the end of the month I manually use the freeware "pathsync" (https://www.cockos.com/pathsync/) to check for differences and find those FineReader bugs!)

Many thanks in advance, ;)
Jity2

Finished Programs / DONE: Mass convert already locally saved html (+htm +mht) files to pdf

« on: March 14, 2019, 12:53 PM »

Dear all,

I would like to convert the html files of my archives into pdf (with searchable text and the locally saved related images included) so I can run keyword searches on them with Google Drive.
I need the related images saved with the html file (usually in a related folder) to be included, as well as the URLs available in the html. I would prefer the tool to work in an offline mode, as all the info I have is already saved in the local html files, so it doesn't spend time trying all the missing URLs.

My archives cover about 20 years (many thousands of files). They were mostly saved with Internet Explorer (Maxthon) for a few years, then mostly with Firefox, and now with Google Chrome (and httrack).

The idea: I choose one big folder "Source" (usually a month of archives). The tool scans on its own all the html, htm and mht files in all the subfolders, then creates in a big folder "Target" all the converted pdf files, with the original names and in the same subfolders.
Example :
C:\Source\2009\2019-04\15\2009_04_15_075256.html
...
C:\Target\2009\2019-04\15\2009_04_15_075256.pdf
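
The Source-to-Target mapping in the example above can be sketched in Python as follows. This is only an illustration of the path logic, not a converter. The function name `target_pdf_path` and the constant `HTML_SUFFIXES` are hypothetical, and `PureWindowsPath` is used so the Windows paths are handled the same way on any system.

```python
from pathlib import PureWindowsPath

# Extensions named in the post: html, htm and mht.
HTML_SUFFIXES = {".html", ".htm", ".mht"}


def target_pdf_path(source_root: str, target_root: str, html_file: str) -> PureWindowsPath:
    """Map a saved page under source_root to its PDF path under target_root.

    The subfolder structure and the file name are kept; only the root
    folder and the extension change.
    """
    source = PureWindowsPath(html_file)
    if source.suffix.lower() not in HTML_SUFFIXES:
        raise ValueError(f"not an html/htm/mht file: {html_file}")
    relative = source.relative_to(PureWindowsPath(source_root))
    return PureWindowsPath(target_root) / relative.with_suffix(".pdf")


print(target_pdf_path(
    r"C:\Source",
    r"C:\Target",
    r"C:\Source\2009\2019-04\15\2009_04_15_075256.html",
))
# prints C:\Target\2009\2019-04\15\2009_04_15_075256.pdf
```

One caveat with this naming: two files in the same folder that differ only by extension (for example page.htm and page.html) would map to the same pdf name.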

I have tried to use wkhtmltopdf, which is based on webkit (Safari https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3163), with the following script (based on an old AHK script found here):

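; Folder that holds wkhtmltopdf.exe, and the command prefix
WorKingDir := "C:\prog\wkhtmltox\bin"
pdParams := "wkhtmltopdf.exe "

; Ask for the source folder; quit if the dialog is cancelled
FileSelectFolder,SourcePath,,0,Select Source Folder
If SourcePath =
    ExitApp

; Ask for the target folder, starting the dialog at the source folder
FileSelectFolder,TargetPath,*%SourcePath%,0,Select Target Folder
If TargetPath =
    ExitApp

If (SourcePath==TargetPath){
    Msgbox 0x40000,, % "SourcePath and TargetPath cant be the same " TargetPath
    ExitApp
}

; Recurse through the source tree and convert each file found
Loop, Files, %SourcePath%\*.htm, R
{
    SplitPath, A_LoopFileFullPath, , , , OutNameNoExt
    ; Builds: wkhtmltopdf.exe "<source file>" "<TargetPath>\<name>_.pdf"
    pCmd := pdParams """" A_LoopFileFullPath """" " " """" TargetPath "\" OutNameNoExt "_.pdf" """"
    RunWait % comspec " /c " pCmd , % WorKingDir
}
ExitApp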

The results seem OK, but here are the problems that I have found:
- The output subfolders are not created: wkhtmltopdf puts all the created pdf files directly into the Target folder. This causes problems when html files have the same name, as wkhtmltopdf overwrites the earlier pdf!
Feature request: create output folders if necessary
https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2421

- Sometimes it creates small unnecessary pdf files (2 KB!). I can delete them later. I think they appear because saving a webpage with CTRL+S also creates some small htm files in the related folder.
Example:
C:\A\save01.html
C:\A\save01\image.gif
C:\A\save01\image.htm
...
so in fact this is normal and ok ! ;)

- I have tried to implement the following trick :
offline mode: does not try to look for missing component online for locally saved html files
https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3294

I have replaced this line of code:

pdParams := "wkhtmltopdf.exe "

with :

pdParams := "wkhtmltopdf.exe --proxy=http://127.0.0.1:0 "

Alas, it didn't work: it still slowly tries to fetch the missing URLs online.

- There is no silent mode: the flashing cmd windows keep me from working on my computer while it runs!
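
A possible way around the missing subfolders and the flashing windows, sketched in Python under stated assumptions. The function names are hypothetical. `--quiet`, `--load-error-handling` and `--load-media-error-handling` are wkhtmltopdf options; the two "ignore" settings only stop a failed load from aborting the conversion. They do not prevent the network attempts, so the offline-mode problem is not solved here. `CREATE_NO_WINDOW` exists only on Windows, hence the `getattr` fallback.

```python
import subprocess
from pathlib import Path


def build_command(exe: str, source: str, target: str) -> list[str]:
    """Build the wkhtmltopdf argument list for one file."""
    return [
        exe,
        "--quiet",                               # less console output
        "--load-error-handling", "ignore",       # keep going if a page load fails
        "--load-media-error-handling", "ignore",  # keep going if an image is missing
        source,
        target,
    ]


def convert_hidden(exe: str, source: str, target: str) -> int:
    """Create the output folder if needed, then run wkhtmltopdf without a console window."""
    Path(target).parent.mkdir(parents=True, exist_ok=True)
    # On Windows this flag suppresses the cmd window; elsewhere it falls back to 0.
    flags = getattr(subprocess, "CREATE_NO_WINDOW", 0)
    completed = subprocess.run(build_command(exe, source, target), creationflags=flags)
    return completed.returncode
```

Passing the arguments as a list avoids going through cmd.exe, so no manual quoting of paths with spaces is needed.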

If I am not using the correct tool, I'd be pleased to try other ideas (maybe based on other rendering engines ?). ;)
I realize that no method is error free so text or images may not be rendered perfectly each time but as long as I have most of them it will be fine. ;)

Many thanks in advance ;)
Jity2

Win 8.1 64bits home

General Software Discussion / [REQ] Web Extension : Silently save complete html + screenshot then close tab

« on: March 08, 2018, 03:21 AM »

Dear all,

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab.

Detailed process:
1) Save the complete webpage as if I had used the shortcut [CTRL+S], and name the page MONTH_DAY_MIN_SEC.html + its related folder ("htmlfilename_files", containing small images etc.)
All of that without prompting for a filename and/or directory.

2) Save a complete screenshot of the page
This can be the equivalent of this extension : ScreenGrab "Save Complete page" :
https://chrome.google.com/webstore/detail/screengrab/fccdiabakoglkihagkjmaomipdeegbpk
(or something like this trick: [CTRL+SHIFT+I], then [CTRL+SHIFT+P], then type "Capture full size screenshot" (see https://zapier.com/blog/full-page-screenshots-in-chrome/). In my tests this doesn't always capture the whole page. But it may be better than nothing!)

3) close the tab
Like when I use [CTRL+W]
(or a webextension like https://chrome.google.com/webstore/detail/close-tab/hpmolahefgjbmidmoifolgpjkmhdmalf )

Optional :
- the button can be a standard icon or a bookmark button. It can change color when the saving process is running.
- it would be great if the saved content were saved in a different folder each day (like: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC.html + folder: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC_files ...)
- the html webpage save can be replaced by "Save Page WE" https://chrome.google.com/webstore/detail/save-page-we/dhhpefjklgkmgeafimnjhojgjamoafof
but it creates only one html file containing the images, a little like a .mht file. It works OK but probably not 100% on all pages.
- save a WARC file of the webpage ("WARC is now recognized by most national library systems as the standard to follow for web archival" https://en.wikipedia.org/wiki/Web_ARChive). Example : WARCreate https://chrome.google.com/webstore/detail/warcreate/kenncghfghgolcbmckhiljgaabnpcaaa
- Force closing the tab if after 10 seconds not everything is saved ?
- maybe saving the screenshot first is faster than saving html first ?
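
The daily-folder naming scheme from the optional list can be sketched like this in Python. It only illustrates how the names could be built; it is not extension code. The function name `save_paths` is hypothetical, and MONTH_DAY_MIN_SEC is read literally as month, day, minute and second.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def save_paths(base: str, when: datetime) -> tuple[PureWindowsPath, PureWindowsPath]:
    """Return (html file, related "_files" folder) for a page saved at `when`."""
    day_folder = PureWindowsPath(base) / when.strftime("%m_%d")  # MONTH_DAY
    stamp = when.strftime("%m_%d_%M_%S")                         # MONTH_DAY_MIN_SEC
    return day_folder / f"{stamp}.html", day_folder / f"{stamp}_files"


html_file, files_folder = save_paths(r"C:\NEWS", datetime(2018, 3, 8, 3, 21, 45))
print(html_file)     # C:\NEWS\03_08\03_08_21_45.html
print(files_folder)  # C:\NEWS\03_08\03_08_21_45_files
```

Read literally, the scheme has no hour field, so two pages saved at the same minute and second of different hours on one day would get the same name. Adding `%H` to the format would avoid that.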

Why am I asking for this ?
With the new Firefox Quantum released a few months ago, I am forced to change my habits. I save a lot of pages daily with one click on a bookmark, and I am very happy with it. But I currently use outdated versions of Firefox and the iMacros add-on in order to save webpages very fast. See https://www.reddit.com/r/DataHoarder/comments/82ir3k/automatically_archiving_all_visited_web_pages/dvbnycf/ and https://www.donationcoder.com/forum/index.php?topic=23782.0
Previously I used a FireGestures shortcut to save webpages. This resulted in too many saved webpages for my taste! So now I prefer one click on a bookmark in the top middle of my screen!
But why save the same webpage in different formats?
Because I realized that sometimes the html page is not saved properly. Then, when I try to open it many years later with the current browser of the day, it does not display well because of some small inserted URLs or something! Too bad for me, as there is a small missing image that I either have to dig out of the related html folder or that was not saved for some reason!
So being able to see a screenshot might not be a bad idea (and in Windows Explorer I can also display screenshots as 'extra large icons'). ;)
Furthermore, now, years later, I use dtSearch locally to index all my content, and Google Drive online, which automatically does OCR on screenshots (even if it is far from perfect!) ;)

Many thanks in advance ;)

General Software Discussion / Screenshots all page content of openned pdf file in sumatraPDF in burst mode ?

« on: February 09, 2017, 04:36 AM »

Dear all,

First some background for the idea :
I have some PDF files which are damaged. My goal is to OCR what can be repaired (I recently tested "Nuance Power PDF Advanced 2". IMHO it can OCR many problem pdf files that other OCR software can't even open. But alas it still has problems with some pdf files.)

I have tried several tools and techniques. The best ones so far being :
- 3-Heights™ PDF Analysis & Repair (they also sell a shell version).
https://www.pdf-tools.com/pdf20/en/products/pdf-converter-validation/pdf-analysis-repair/
(The free version can be used here : https://www.pdf-online.com/osa/repair.aspx )
The problem is that it doesn't repair all defects properly. ;(

- and a batch script using SumatraPDF and the printer Bullzip ( see https://www.donationcoder.com/forum/index.php?topic=42713.msg399623 ).
The problem here is that it takes a lot of time, CPU and memory. For instance, a 100MB pdf uses 16GB of temporary SSD space to finally produce ("print"), after 10 minutes, a 300MB pdf!
Also for several pdf files, the process is done and at the end no pdf file is created ! ;(

So I got this idea :
I realize that the nice thing is that I can open most of the pdf files (that have errors) with SumatraPDF. ;)
So it would be great if some software could, once the pdf is opened in SumatraPDF, take a screenshot of each page in burst mode (one screenshot, then turn to the next page, then repeat). Then I could probably make a pdf from the image files and OCR them very fast?
I did test SCREENSHOT CAPTOR VERSION 4 https://www.donationcoder.com/Software/Mouser/screenshotcaptor/index.html but I wasn't able to do it (the automatic page "down/up" did not work - win8.1 64)!

Thanks in advance ;)

Finished Programs / IDEA: Batch merge many pdf (result : only one big pdf per subfolder)?

« on: June 25, 2016, 08:04 AM »

Dear all,

I need to merge many pdf files located in thousands of subfolders (several levels deep) full of pdf and other files. At the end, I need only one big pdf file per subfolder.

Example: “C:\Main_folder\” contains :

C:\Main_folder\subfolder_wgs\jshhd545.pdf

C:\Main_folder\subfolder_wgs\jshhd545.htm

C:\Main_folder\subfolder_wgs\ejkehe5485.pdf

…

C:\Main_folder\subfolder_ghdfdhd\jdjdhjd5545.pdf

C:\Main_folder\subfolder_ghdfdhd\jdsdjdh44.pdf

…

C:\Main_folder\subfolder_yuege255\uejgd56564\kdfhk5465.txt

C:\Main_folder\subfolder_yuege255\uejgd56564\kdfhk5465.pdf

…etc..

Desired results:

C:\Main_folder\subfolder_wgs\subfolder_wgs.pdf

C:\Main_folder\subfolder_ghdfdhd\subfolder_ghdfdhd.pdf

C:\Main_folder\subfolder_yuege255\uejgd56564\subfolder_yuege255.pdf (or C:\Main_folder\subfolder_yuege255\uejgd56564\uejgd56564.pdf)

etc…

note: It would be great if results could be added into a new big folder (like C:\Main_folder2\ for instance).
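
The grouping part of this request can be sketched in Python as below, under stated assumptions. The function name `plan_merges` is hypothetical. Each merged pdf is named after its immediate folder (the second naming option above), and the output tree mirrors the subfolders under a separate root such as C:\Main_folder2\. The sketch only plans which files go into which output; the actual merging would need a PDF library or command-line tool, which is not shown.

```python
from pathlib import Path


def plan_merges(main_folder: Path, output_root: Path) -> dict[Path, list[Path]]:
    """Map each merged output pdf to the sorted pdf files of one subfolder.

    Only pdf files sitting directly in a subfolder are grouped together.
    Files directly in main_folder itself, and non-pdf files, are ignored.
    """
    plan: dict[Path, list[Path]] = {}
    for folder in sorted(p for p in main_folder.rglob("*") if p.is_dir()):
        pdfs = sorted(f for f in folder.glob("*.pdf") if f.is_file())
        if pdfs:
            relative = folder.relative_to(main_folder)
            plan[output_root / relative / f"{folder.name}.pdf"] = pdfs
    return plan
```

Each entry of the returned dictionary can then be handed to whichever merge tool is chosen, with the list of source files in name order.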

I am on Win8.1 64bits.
Free or open source solutions preferred.
Thanks in advance ;)
