Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - jity2

1
Dear all,

After some tests:
I will keep using ScreenGrab and save mht and (scrolled) png files locally. And when I know in advance that I need the maximum data from a webpage, I will use the Google Chrome addons "Save Page WE" with all options and/or "WARCreate".
Then I use DtSearch to run keyword searches inside them.

For Google Drive (GD), the only workable option for me right now is to convert the mht files into Docx files. But I am not sure that I want to spend time doing this each month. It would be much easier if Google Drive indexed the content of mht files like it does for html files.
I'll keep asking them but I am not very optimistic!
Mht files are after all similar to email files (see the small sketch after the links below). See this interesting old link for several file format options: "What's the best “file format” for saving complete web pages (images, etc.) in a single archive?"  https://stackoverflo...ages-images-etc-in-a
(also : https://gsuite-devel...hable-in-google.html
https://developers.g...ference/files/insert
https://developers.g...dk#drive_integration
more generally:
https://support.goog...s/answer/35287?hl=en
https://www.google.c...ts/file_formats.html
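
By the way, since an mht file really is just a MIME container (the same format as an email), the HTML part can be pulled out of it with a few lines of script. Here is a rough, untested sketch in Python (standard library only; the file paths are just placeholders):

# An .mht file is a MIME (email-like) container, so Python's email module
# can extract the saved page's HTML part from it. Paths are placeholders.
from email import policy
from email.parser import BytesParser

with open(r"C:\saved_pages\example.mht", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

for part in msg.walk():
    if part.get_content_type() == "text/html":
        # get_content() decodes the quoted-printable/base64 body to a string
        with open(r"C:\saved_pages\example.html", "w", encoding="utf-8") as out:
            out.write(part.get_content())
        break

The extracted .html could then be uploaded next to (or instead of) the .mht, so that GD has something it knows how to index.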

Note: for html files saved with "Save Page WE" in GD: the html is indexed, but only the way it is displayed when you preview an html file. The result is awful, as if it were read with a plain text reader. So if you search for two keywords that have some html code garbage between them, GD won't find them. The strange thing is that the thumbnail of those html files is displayed correctly, with images, if you click on "i" for info. It appears on the right part of your screen and displays only the top part of the html. IMHO this thumbnail is using some kind of sandbox?? I thought that GD didn't display html correctly because they were afraid of some worm javascript (or..?) that could cause damage to GD?

From mht to Docx:
I use a Word macro [ALT+F11] to convert mht files with WORD 2013 into Docx (it can also convert them into non-OCRed pdf, etc.).
Here is the code, copied/adapted from http://muzso.hu/2013...-in-microsoft-office :

Option Explicit

#If VBA7 Then
    ' 64-bit / Office 2010+ : the API declaration needs PtrSafe
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

Sub ConvertDocs()
    Dim fs As Object
    Dim oFolder As Object
    Dim tFolder As Object
    Dim oFile As Object
    Dim strDocName As String
    Dim intPos As Integer
    Dim locFolder As String
    Dim fileType As String
    Dim office2007 As Boolean
    Dim lf As LinkFormat
    Dim oField As Field
    Dim oIShape As InlineShape
    Dim oShape As Shape
    On Error Resume Next
    locFolder = InputBox("Enter the path to the folder with the documents to be converted", "File Conversion", "C:\Users\your_username\Documents\mht_to_docx_results\")



  '  If Application.Version >= 12 Then
  '      office2007 = True
        Do
            fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML, DOC, DOCX or PDF", "File Conversion", "DOCX"))
        'Loop Until (fileType = "HTML") 'fileType = "TXT" Or fileType = "RTF" Or   Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX"
        Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "PDF" Or fileType = "DOC" Or fileType = "DOCX")
       
  '  Else
   '     office2007 = False
    '    Do
     '       fileType = UCase(InputBox("Enter one of the following formats (to convert to): TXT, RTF, HTML or DOC", "File Conversion", "TXT"))
     '   Loop Until (fileType = "TXT" Or fileType = "RTF" Or fileType = "HTML" Or fileType = "DOC")
    'End Select
   
  '  End If
   
   
    Application.ScreenUpdating = False
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set oFolder = fs.GetFolder(locFolder)
    Set tFolder = fs.CreateFolder(locFolder & "Converted")
    Set tFolder = fs.GetFolder(locFolder & "Converted")
    For Each oFile In oFolder.Files
        Dim d As Document
        Set d = Application.Documents.Open(oFile.Path)
        ' put the document into print view
     '   If fileType = "RTF" Or fileType = "DOC" Or fileType = "DOCX" Then
     '       With ActiveWindow.View
     '           .ReadingLayout = False
     '           .Type = wdPrintView
     '       End With
     '   End If
        ' try to embed linked images from fields, shapes and inline shapes into the document
        ' (for some reason this does not work for all images in all HTML files I've tested)
       ' If Not fileType = "HTML" Then
       '     For Each oField In d.Fields
       '         Set lf = oField.LinkFormat
       '         If oField.Type = wdFieldIncludePicture And Not lf Is Nothing And Not lf.SavePictureWithDocument Then
       '             lf.SavePictureWithDocument = True
       '             Sleep (2000)
       '             lf.BreakLink()
       '             d.UndoClear()
       '         End If
       '     Next
 '           For Each oShape In d.Shapes
 '               Set lf = oShape.LinkFormat
 '               If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
 '                   lf.SavePictureWithDocument = True
 '                   Sleep (2000)
 '                   lf.BreakLink()
 '                   d.UndoClear()
 '               End If
  '          Next
 '           For Each oIShape In d.InlineShapes
 '               Set lf = oIShape.LinkFormat
 '               If Not lf Is Nothing And Not lf.SavePictureWithDocument Then
 '                   lf.SavePictureWithDocument = True
  '                  Sleep (2000)
  '                  lf.BreakLink()
  '                  d.UndoClear()
  ''              End If
  '          Next
  '      End If
        strDocName = d.Name
        intPos = InStrRev(strDocName, ".")
        strDocName = Left(strDocName, intPos - 1)
        ChangeFileOpenDirectory (tFolder)
        ' Check out these links for a comprehensive list of supported file formats and format constants:
        ' http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdsaveformat.aspx
        ' http://msdn.microsoft.com/en-us/library/office/bb238158.aspx
        ' (In the latter list you can see the values that the constants are associated with.
        '  Office 2003 only supported values up to wdFormatXML(=11). Values from wdFormatXMLDocument(=12)
        '  til wdFormatDocumentDefault(=16) were added in Office 2007, and wdFormatPDF(=17) and wdFormatXPS(=18)
        '  were added in Office 2007 SP2. Office 2010 added the various wdFormatFlatXML* formats and wdFormatOpenDocumentText.)
       ' If Not office2007 And fileType = "DOCX" Then
       '     fileType = "DOC"
       ' End If
        Select Case fileType
            Case Is = "TXT"
                strDocName = strDocName & ".txt"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
                'ActiveDocument.SaveAs FileName:=strDocName, FileFormat:=wdFormatText
            Case Is = "RTF"
                strDocName = strDocName & ".rtf"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatRTF
            Case Is = "HTML"
                strDocName = strDocName & ".html"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatFilteredHTML
            Case Is = "DOC"
                strDocName = strDocName & ".doc"
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocument
            Case Is = "DOCX"
                strDocName = strDocName & ".docx"
                ' (wdFormatDocumentDefault requires Word 2007 or later, which is fine for Word 2013)
                d.SaveAs FileName:=strDocName, FileFormat:=wdFormatDocumentDefault
            Case Is = "PDF"
                strDocName = strDocName & ".pdf"
                ' (ExportAsFixedFormat with wdExportFormatPDF requires Word 2007 SP2 or later)
                d.ExportAsFixedFormat OutputFileName:=strDocName, ExportFormat:=wdExportFormatPDF
        End Select
        d.Close
        ChangeFileOpenDirectory (oFolder)
    Next oFile
    Application.ScreenUpdating = True
End Sub


This works ok. The Docx files are about half the size of the mht files!
But:
- it crashed on mht files with video links inside them!
- And alas, Word tends to freeze my computer while converting some files!
- And sometimes wide webpages are cut off (so not all the images are displayed properly)!
- Be careful: the folder must contain only mht files (no other kind of file) before you convert them to docx, otherwise the Word macro won't work! (See the sketch below for one way around this.)
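
For reference, the same conversion can also be driven from outside Word, which makes it easy to skip non-mht files instead of requiring a "clean" folder. A rough, untested sketch in Python, assuming pywin32 is installed and Word 2013 is present (the folder path is a placeholder):

# Convert every *.mht in SRC to .docx in SRC\Converted by driving Word via COM.
# Only .mht files are touched, so the folder may contain other file types too.
import glob, os
import win32com.client as win32  # pip install pywin32

WD_FORMAT_DOCX = 16  # wdFormatDocumentDefault

SRC = r"C:\Users\your_username\Documents\mht_to_docx_results"  # placeholder path
DEST = os.path.join(SRC, "Converted")
os.makedirs(DEST, exist_ok=True)

word = win32.Dispatch("Word.Application")
word.Visible = False
word.DisplayAlerts = 0  # wdAlertsNone: don't pop up dialogs on odd files
try:
    for mht in glob.glob(os.path.join(SRC, "*.mht")):
        doc = word.Documents.Open(mht)
        out = os.path.join(DEST, os.path.splitext(os.path.basename(mht))[0] + ".docx")
        doc.SaveAs(out, WD_FORMAT_DOCX)
        doc.Close(False)  # close without saving changes to the source
finally:
    word.Quit()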

A long time ago I also tested this batch software (https://www.coolutil...m/TotalHTMLConverter https://www.coolutil...om/online/MHT-to-DOC), but it was too slow and converted mht into doc, not docx. And not all images were carried over from the html files to the doc files...


(I also tested this option but it didn't work well:
From png to pdf:
- It seems to work with a trial version of Nuance Power PDF Advanced, but the OCR result is not very good (to check, open the OCRed pdf file and save it as a txt file)!
- Acrobat Pro 8: in two passes: one to create the pdf file, and a second to do the OCR. Alas, OCR is not always possible due to the 45"x45" size limit being reached (even the 200"x200" allowed by this trick https://acrobatusers...at-topics/ocr-error/ is not enough for some png files). ;( )

In GD:
Word files are indexed. Limit: 1 million characters.
PNG files are indexed only if smaller than 2 MB, and only the title gets OCRed if the saved page is too wide.
(Big pdf files are also not indexed in full. From memory it is something like: at most the first 100 pages of OCRed pdf files and at most the first 10 pages of non-OCRed files. There may also be a 50MB indexing limit. For bigger files you can still preview the file and do a keyword search with CTRL+F, and it will find the correct pages in the full pdf; the full content of the pdf just won't be indexed automatically by GD.)

Note: Mht files can be neither previewed nor indexed in OneDrive!

The last remaining option I see would be for ScreenGrab to save the webpage as an OCRed pdf file (like some print-screen drivers do) in addition to the mht. That way it would work fine for most files natively in GD.
And I am sure this is technically doable!

Please let me know if you have other ideas ! ;)

2
Dear all,

Thanks for the comments.

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab.

Luckily for me, the author of ScreenGrab (https://chrome.googl...jmaomipdeegbpk?hl=en) kindly helped me and released a new version that can take a screenshot, save the webpage as an .mht file and close the tab, but only if I use a shortcut (like CTRL+Q for instance). I encourage you to make a donation to him, like I did. ;)

Notes:
- This works only in Google Chrome and not in Firefox.
- YEAR_DAY_TIME_.mht (only one file instead of several. In the past, after a few million files, I hit some kind of limit - Windows or NTFS?? - which prevented me from adding more files to the partition where I unzipped my files. This happened faster when my html filenames were longer than just YEAR_DAY_TIME.html plus its associated folder).
- The mht results can sometimes contain less, the same, or more (yes, I didn't think it was possible!) information than the standard CTRL+S.
- The mht files display better in Google Chrome than in Internet Explorer.
- I can index the mht files fine locally with DtSearch.
- Alas, online I can't search inside mht files with Google Drive (it doesn't index mht files; an external viewer exists but I am reluctant to use it).
I am in the process of deciding what to do: convert all my old html files and new mht files to .docx or pdf (with Word or something else). I also noticed that png (or jpg) files were not correctly OCRed in Google Drive (often only the title of a test article webpage is actually OCRed). The OCR is ok only if I manually convert the png file into a Google Document (which doesn't count against your Google Drive storage, but be careful if you have more than 1 million files in it, as the GUI starts to slow down - G Suite is far more powerful). Surprisingly, previewing a docx in Google Drive and doing a keyword search is also faster than doing the same with a Google Document.
- The mht and png files from ScreenGrab are stored in "C:\Users\{username}\Downloads\ScreenGrab". If, like me, you need the data elsewhere, you can use Syncback.




3
BTW, one good website that may be useful to you for Picasa and Google Photo is : https://sites.google...ite/picasaresources/

4
Maybe try with Bulk Rename Utility / actions/ import rename pairs (see their help file) + renaming options/prevent duplicates ?

5
Dear all,

Main idea:
With one click on a button, the Web Extension (in Google Chrome or Firefox) silently saves the complete webpage as html and as a complete screenshot. Then it silently closes the tab.

Process detailed :
1) Save the complete webpage as if I had used the shortcut [CTRL+S], and name the page MONTH_DAY_MIN_SEC.html + its related folder ("htmlfilename_files", containing small images, etc.),
all without prompting for a filename and/or directory.

2) Save a complete screenshot of the page.
This would be the equivalent of this extension: ScreenGrab "Save Complete page":
https://chrome.googl...kihagkjmaomipdeegbpk
(or something like this trick: [CTRL+SHIFT+I], then [CTRL+SHIFT+P], then type "Capture full size screenshot" (see https://zapier.com/b...reenshots-in-chrome/). In my tests this doesn't always capture the whole page, but it may be better than nothing!)

3) Close the tab,
like when I use [CTRL+W]
(or a webextension like https://chrome.googl...midmoifolgpjkmhdmalf )


Optional :
- the button could be a standard icon or a bookmark button. It could change color while the saving process is running.
- it would be great if the saved content were saved in a different folder each day (like: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC.html + folder: C:\NEWS\MONTH_DAY\MONTH_DAY_MIN_SEC_files ...)
- the saved html webpage could be replaced by "Save Page WE" https://chrome.googl...geafimnjhojgjamoafof
but it creates only one html file containing the images, a little like an .mht file. It works ok, but probably not 100% on all pages.
- save a WARC file of the webpage ("WARC is now recognized by most national library systems as the standard to follow for web archival" https://en.wikipedia...org/wiki/Web_ARChive). Example: WARCreate https://chrome.googl...lcbmckhiljgaabnpcaaa
- force closing the tab if not everything is saved after 10 seconds?
- maybe saving the screenshot first is faster than saving the html first?


Why am I asking for this ?
With the new Firefox Quantum released a few months ago, I am forced to change my habits. I save a lot of pages daily with one click on a bookmark, and I am very happy with it. But I currently use an outdated Firefox + iMacros addon version in order to save webpages very fast. See https://www.reddit.c...d_web_pages/dvbnycf/ and http://www.donationc...ex.php?topic=23782.0
Previously I used a FireGestures shortcut to save webpages. This resulted in too many webpages saved for my taste! So now I prefer one click on a bookmark in the top middle of my screen!
But why save the same webpage in different formats?
Because I realized that sometimes the saved html page is not saved properly. Then, when I try to open it many years later with the current browser of the day, it does not display well because of some small inserted urls or something! Too bad for me, as there is a small missing image that I either have to dig into the related html folder to see, or that was not saved for some reason!
So being able to see a screenshot might not be a bad idea (and in Windows Explorer I can also display screenshots as 'extra large icons'). ;)
Furthermore, years later, I now use DtSearch locally for indexing all my content, and online I use Google Drive, which automatically OCRs screenshots (even if it is far from perfect!). ;)


Many thanks in advance ;)

6
 :-[ Thanks for the links NetRunner . ;)

7
Hi Netrunner,

After trying to understand your WSW requests (which I am not familiar with):
Maybe use a third-party service for Canary: https://ifttt.com/makers/canaryhome ? Or ask Microsoft Flow? ...

>download page with PhantomJS or similar if the page is not compatible with the built-in IE (as it seems IE won't be updated anymore after Edge launched, and the developer said there are no plans to use another engine with JavaScript support).
Can you give a page example so I can check with WSW ?

Thanks ;)


8
Dear all,
I have been a WSW customer for a little more than 10 years now! I don't know why Martin (WSW's developer) stopped his forum. Maybe lack of time?
For my part, I have always preferred to ask him my many questions directly. He has always provided fast and helpful answers.
Alas, I am not a coder, so I can't help much with WSW scripts. But this is a great piece of software. So thank you anyway, Martin. ;)

>Netrunner :
For your RSS Full text script, I advise you to test this http://fivefilters.org/content-only/ in conjunction with WSW. ;)

See ya ;)

9
Dear all,

First, some background for the idea:
I have some PDF files which are damaged. My goal is to OCR whatever can be repaired. (I recently tested "Nuance Power PDF Advanced 2". IMHO it can OCR many pdfs with problems that other OCR software can't even open, but alas it still has trouble with some pdf files.)

I have tried several tools and techniques. The best ones so far are:
- 3-Heights™ PDF Analysis & Repair (they also sell a shell version).
https://www.pdf-tool...pdf-analysis-repair/
(The free version can be used here : https://www.pdf-onli....com/osa/repair.aspx )
The problem is that it doesn't repair all defects properly. ;(

- and a batch script using SumatraPDF and the Bullzip printer (see http://www.donationc...opic=42713.msg399623 ).
The problem here is that it takes a lot of time, CPU and memory. For instance, a 100MB pdf uses 16GB of temporary SSD space in order to finally produce ("print"), after 10 minutes, a 300MB pdf!
Also, for several pdf files the process completes and yet no pdf file is created at the end! ;(

So I got this idea:
The nice thing is that I can open most of these pdfs (the ones with errors) with SumatraPDF. ;)
So it would be great if some software could, once the pdf is opened in SumatraPDF, take a screenshot of each page in burst mode (one screenshot, then turn to the next page, then repeat). Then I could probably make a pdf from the image files and OCR them very fast? (See the rough sketch below for the kind of thing I mean.)
I did test SCREENSHOT CAPTOR VERSION 4 http://www.donationc...hotcaptor/index.html but I wasn't able to do it (the automatic page "down/up" did not work - Win 8.1 64)!
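
In case it helps to see what I mean, here is a rough, untested sketch of the burst-screenshot idea in Python with pyautogui (the page count, delay and output folder are just placeholders; SumatraPDF has to be the active, maximised window):

# Burst mode: screenshot the current page, press Page Down, repeat.
import os, time
import pyautogui  # pip install pyautogui

NUM_PAGES = 100                   # placeholder: page count of the damaged pdf
OUT_DIR = r"C:\pdf_screenshots"   # placeholder output folder
os.makedirs(OUT_DIR, exist_ok=True)

time.sleep(5)  # a few seconds to click into the SumatraPDF window
for page in range(1, NUM_PAGES + 1):
    pyautogui.screenshot(os.path.join(OUT_DIR, "page_%04d.png" % page))
    pyautogui.press("pgdn")       # turn to the next page
    time.sleep(0.5)               # give the page time to render

The resulting images could then be combined into a pdf and OCRed as usual.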


Thanks in advance ;)

10
update: A few months later, I have stopped using Otixo (too expensive for me now).
For uploading my data I use SyncBackPro and Syncdocs (for converting some of my files into (free) Google Documents).
And now, once I have uploaded my data to one service (Google Drive, Amazon Drive...), I use rclone.org with a cheap VPS (I have used OVH; similar ones: DigitalOcean, Scaleway...), at about $3.5/month. Not only is this cheap, it is also far faster than using my DSL line to move data between online services. ;)

Hope this helps ;)

11
2016 edition : How Backblaze Personal Backup lost me 2TB on purpose !

tl;dr: Backblaze uses file checksums to their own advantage, so their customers can easily lose most of their previously uploaded data when changing hard drives (or moving a big folder into a new partition)!


Dear all,

Instead of using the 30-day data retention policy to the customer's advantage (by automatically transferring ALL the files already uploaded previously), Backblaze's software is designed to lose most of a customer's uploaded data when the customer removes an old drive and adds a new, bigger one (with the same data). The logical approach would be to first find all the already-uploaded files, whatever their size, and then start uploading the missing ones! After all, their software computes a file checksum for a reason. But in fact it ranks the files by size and starts uploading the smallest files first. So if a customer has uploaded a lot of big files before (max 2GB zip files in my case), they are lost IF all the smaller files are not uploaded within the 30-day window.(*)

For years, their client has also had some "on purpose" bugs:

    1) It does not tell you that when you add a new drive to your computer you have to manually add it in their client settings.
    2) And when you do so, it again silently removes all the exclusions that you have added before! And you have no choice but to manually re-add all the folders you need to exclude!
    3) Caution: their new client version (v4.20) added a stupid "feature": "exclusions now work across all attached drives." Too bad for those who have the same folder name on different drives: now you can only exclude it on all drives or on none! I just can't believe you did that! ;(

That is why:

    - Their support team always answers: "the Backblaze client needs time to find already uploaded data"! They acknowledge this, but it is hidden in their help file: "Backblaze prioritizes smaller files, and uploads larger files later." https://help.backbla...-handle-large-files-

    - They sell it as unlimited but recommend completing the initial backup in 30 days: "(...)3. Ideally, Backblaze should be able to complete your initial backup in 30 days. If your initial backup is estimated to take longer due to a lot of data or slow internet connection, then Backblaze is not the best solution for you.(...)" https://help.backbla...64608-Best-Practices

I like them a lot (see their blog) and I understand that they need to stay profitable (from memory, a 250GB threshold in 2012? a 1TB threshold in 2016?), but their customers must know that they should also upload their files elsewhere (Amazon Drive, etc.) where, contrary to Backblaze Personal Backup, they can move the uploaded files very easily thanks to APIs (with, for instance, cloudhq.net, or a cheap VPS and rclone).

(*) In my case: Win 8.1 64-bit running 24/7 - 16GB RAM - DSL (max upload speed 10GB/day). Customer since 2011. About 3TB uploaded as of July 2016. About 2TB lost as of today and only 1TB recognized (the 30-day period ended). ;( I tried to add many folder exclusions (note: you can't just remove C:\!) so Backblaze could easily find my previous big files. Alas for me, they used to be in 2 big folders, I had put them into one big folder on my new drive, and I could not remember or check this easily within the 30-day retention period. Note 2: I also lost about 2 weeks before realizing that the new drive was not added automatically by the Backblaze client, even though many parts of it had been uploaded previously. ;( Luckily for me, I use other storage services (Amazon Drive...). See here: http://www.donationc...ndex.php?topic=41873


update, mid-December 2016: Backblaze lost all my backup again! This time I did not add any new hard drive to my computer. I just set it to "run once every day at 10pm" for about one month. Then I changed it back to "continuous", and the 1.2TB of data that I had there disappeared! I have only recovered 300GB since that day. ;(

12
Dear "4wd",
Many thanks. ;) It works like a charm. ;)
Thank you again ;)
See ya

13
Thanks "4wd". ;)

I had some difficulties (viruses or hammering websites) finding the correct programs that you recommended in the thread (http://www.donationc....msg374877#msg374877), but I finally found a workaround with these links:
http://filehippo.com...rsal_extractor/4795/
http://web.archive.o...g/web/20140315000000*/http://www.adultpdf.com/products/txttopdf/txttopdf.exe
https://www.pdflabs....e-2.02-win-setup.exe
Universal Extractor did not work with pdftk, but I was able to find the correct file once I had installed PDFtk, in: "C:\Program Files (x86)\PDFtk\bin\".

So anyway I was able to test your solution. It works correctly for one folder. ;) I hope you can adapt it to subfolders. ;)
Note: the header is not really needed for me but please do as you prefer. ;)

Thanks in advance for your help ;)
Jity

PS: in my case I don't have password-protected pdf files, but if someone has some, you can remove the passwords using the shareware "PDF Password Remover v3.1" http://www.verypdf.c...ord-remover-com.html with these instructions:
In Windows, open the "command prompt", then copy/paste the following (just adapt the path C:\...\):
for /r "C:\test\" %F in (*.pdf) do "C:\Program Files (x86)\PDF Password Remover v3.1\pdfdecrypt.exe" -i "%F"

14
Dear all,

I need to merge many pdf files located in thousands of subfolders (several levels deep) full of pdf and other files. In the end, I need only one big pdf file per subfolder.



Example: “C:\Main_folder\” contains :

C:\Main_folder\subfolder_wgs\jshhd545.pdf

C:\Main_folder\subfolder_wgs\jshhd545.htm

C:\Main_folder\subfolder_wgs\ejkehe5485.pdf



C:\Main_folder\subfolder_ghdfdhd\jdjdhjd5545.pdf

C:\Main_folder\subfolder_ghdfdhd\jdsdjdh44.pdf



C:\Main_folder\subfolder_yuege255\uejgd56564\kdfhk5465.txt

C:\Main_folder\subfolder_yuege255\uejgd56564\kdfhk5465.pdf

…etc..

Desired results:

C:\Main_folder\subfolder_wgs\subfolder_wgs.pdf

C:\Main_folder\subfolder_ghdfdhd\subfolder_ghdfdhd.pdf

C:\Main_folder\subfolder_yuege255\uejgd56564\subfolder_yuege255.pdf  (or C:\Main_folder\subfolder_yuege255\uejgd56564\uejgd56564.pdf)

etc…



note: It would be great if the results could be added to a new big folder (like C:\Main_folder2\ for instance). (A rough sketch of what I mean is below.)

I am on Win8.1 64bits.
Free or open source solutions preferred.
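
To make the request clearer, here is a rough, untested sketch of the kind of thing I have in mind, in Python, assuming pdftk (free, mentioned in the other thread) is installed and on the PATH; the folder names are the ones from the example above:

# Merge all PDFs found in each subfolder of SRC into one PDF per subfolder,
# writing the results into a mirrored tree under DEST.
import os
import subprocess

SRC = r"C:\Main_folder"
DEST = r"C:\Main_folder2"

for root, dirs, files in os.walk(SRC):
    pdfs = sorted(os.path.join(root, f) for f in files if f.lower().endswith(".pdf"))
    if not pdfs:
        continue  # nothing to merge in this folder
    out_dir = os.path.join(DEST, os.path.relpath(root, SRC))
    os.makedirs(out_dir, exist_ok=True)
    merged = os.path.join(out_dir, os.path.basename(root) + ".pdf")
    # pdftk in1.pdf in2.pdf ... cat output merged.pdf
    subprocess.run(["pdftk", *pdfs, "cat", "output", merged], check=True)
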
Thanks in advance ;)

15
Hi "4wd",
Many thanks. I did a few tests and this seems to be working great. ;)
Much appreciated. ;)
Thank you again ;)

16
Hi "4wd",

I hadn't seen your edit before. Sorry. Thanks for your answer. ;)

Great find. My config is located at : C:\Program Files\Bullzip\PDF Printer\API\EXE\config.exe

So I guess I should add somewhere in your code the line :
C:\Program Files\Bullzip\PDF Printer\API\EXE\config.exe /S "Output_folder_or_subfolder" "I:\output_folder_or_subfolder\<basedocname>.pdf"


Also, would it be possible for the code to send only one file to be printed at a time? It gets printed by Bullzip, and then, once Bullzip has finished, the code sends the next file to be printed. I ask this because I have noticed that if I queue many pdfs to be printed, my computer becomes unresponsive. I have checked, and a 10 MB pdf file easily becomes a 2 GB or larger (no typo!) file to be printed!

Thanks in advance ;)



17
Hi 4wd,
Many thanks for your help. ;)

I tested your code and it is working. ;) Alas, I have made an error: I can select the output folder in Bullzip, but it does not recreate the subfolder structure of the source by itself. It puts all the processed pdf files into the same folder. ;(
This is very annoying for me, as I need that structure.
Any idea? Maybe use another pdf printer (apparently Bullzip runs partly on Ghostscript; I also found this link but I don't understand everything! http://www.techrepub...-to-specific-folder/)?

Many thanks in advance ;)

18
Dear all,

I have noticed that trying to OCR some of my PDF files causes a crash. But if I open such a file with SumatraPDF (and only SumatraPDF!), I can print the pdf to a pdf file (with Bullzip) and get a new pdf file that I can then OCR successfully. ;)
So I need to print pdf-to-pdf many pdf files gathered inside one big folder and its many sub-folders.

I thought an .ahk script would help me: I choose the folder containing the original pdf files; it opens each pdf silently with SumatraPDF, prints it silently to the default free printer (Bullzip), logs any files in error, and continues.
I have already set the default output folder and the silent mode in Bullzip's settings.

I have tried to adapt an old ahk file to my purpose, but it doesn't work. ;(

; Select the folder that contains the PDFs (and its sub-folders)
FileSelectFolder, SourcePath, , 0, Select Source Folder
If (SourcePath = "")
    ExitApp

; Recurse through every .pdf and print it silently to the default printer (Bullzip)
Loop, %SourcePath%\*.pdf, 0, 1
{
    RunWait, "C:\Program Files (x86)\SumatraPDF\SumatraPDF.exe" -print-to-default -silent "%A_LoopFileFullPath%", , UseErrorLevel
    If (ErrorLevel)
        FileAppend, %A_LoopFileFullPath%`n, %SourcePath%\print_errors.log
}

Many thanks in advance ;)

ps: I also use these freeware tools a lot, which help reduce the list of pdf files that have problems with OCR: PDF Text Checker http://www.donationc...ex.php?topic=27311.0 and PDFInfoGUI http://skwire.dcmemb.../fp/?page=pdfinfogui, both from skwire, plus this command line script [WINDOWS+R]: http://www.donationc....msg330784#msg330784

19
Dear all,

A little more than one year later, here are the changes to the "Online" part. I continue to use the above services, except:

- For a few months now, I have not been using the paid version of cloudhq.net (I had transferred about everything I needed, and I realized that it was very difficult to search for files in Amazon Cloud Drive (ACD) because of the way Cloudhq archives files into ACD: Cloudhq basically created too many folders (correctly named with date-hour) containing only a few files. Example: I modify one cell in a Google spreadsheet => a new folder with a new xlsx file is created in ACD. It also created quite a mess in my ACD account: some folders seem "not accessible" in ACD when in fact they are there. My guess is that it is probably an ACD bug, as I could eventually find the correct folders/files by searching with keywords, testing whether I could access them (if not => try another keyword), and then moving them into a simpler main folder in the ACD root).

- I also use http://www.otixo.com/ when needed (example: copy/paste from Google Drive to ACD). This service is not as powerful as Cloudhq, but it eventually finishes the job (you may have to redo the copy, as there are sometimes "internal server" errors). ;)

- I also use another great piece of software for uploading to Google Drive or ACD, which is much more powerful for my needs (mainly uploads): SyncBackPro (about $55) http://www.2brightsparks.com/
This one is far better for me, as I found that some files had not been uploaded by Syncdocs (my guess: files from about 2 years ago, i.e. from before Syncdocs did a hash check).
It fulfills a basic need: I select a folder on my computer and it checks that everything (folders and files) is really online (Google Drive or ACD in my case).
I only had small problems (as I didn't want to upload everything again) with files containing double spaces. My guess is that one online service (Google Drive?) automatically recorded some of them with single-space names, but I am not sure! Anyway, in the end I sometimes had the double-space version in ACD, or locally, or in Google Drive! So I had to remove the double spaces locally (thanks to Bulk Rename Utility) before using SyncBackPro properly.
Note: I keep using Syncdocs because, contrary to SyncBack, it can automatically convert Excel (and more: .html ...) files into Google Drive documents!

- And if you want to save $5 with Backblaze, you have to buy a 2-year commitment (see your online account over there).
Recently I had a bad surprise with Backblaze, as their PC software did not recognize that I had moved about half a TB to a different partition of my computer (if I remember well, that feature used to work correctly!).


That's all for this time !
See you ;)

20

Yes, but how do you back those up, if the unthinkable happens, and due to some malfunction on Google's side they lose all your files?


Hi,

Good question. I have explained here how I do automatic backups of my Google data (Gmail + Drive + a folder in Drive for Google Photos + Calendar): http://www.donationc....msg391828#msg391828
The most difficult part of doing backups is the restore, so do some tests yourself!

I also recommend the manual backup offered by Google Takeout. They have options for Google Drive: you can zip one or all Google Drive folders and transfer the zip file to Dropbox (or...) without downloading the zip to your computer. ;)


For "gt13013":
Sorry for my bad formatting. I have tried to do my tests very seriously, but I guess you deserve more info:
The first Dtsearch test uses the exact keywords that you provided: a), b), etc.
I also added a column meaning "your help/readme files were found" (not tested on every row - only 6 - sorry!). I colored some cells, on only 6 rows, when it found the "Readme36.doc" and "Readme36.pdf" files.
You may remove it for clarity if you want to.

In the second test I modified the a) b) queries so that Dtsearch understands what I want to search for: the syntax of the keywords is slightly modified.

Probably column "AS" should be moved to column "BA", for instance.

The first GOOGLE DRIVE test uses the exact files that are in your zip folder once unzipped (no file is converted to a Google Document).
I didn't have time to add the correct numbers in all the cells in that part. Please add them. Thanks. ;)

Also, I left the cells with numbers blank instead of coloring them orange.

Hope this helps ;)

21
That being said I really *don't* see a lot of advantage in that setup except for one thing, which is a limitation of many RSS feeds, and maybe in some ways of the RSS format itself (or at least how it is commonly used): limited content length. Many RSS feeds only give you a snippet of the full content, or content differently formatted than the main website. In these cases having a Pocket version of the (presumably) full content is definitely ideal.
Hi,

I have always tried to get my RSS feeds as emails. You can get full RSS content with http://fivefilters.org/content-only/. They also have an RSS feed creator for websites that have no RSS feed.
I also use Website Watcher (their RSS support has improved recently - you can get an email with only the titles of the feed, etc. I read that the next update will support regex filtering of RSS feeds) and IFTTT (you can now get daily or weekly emails. Alas, they replace the original urls with their own, so if they disappear in the future you may not have access to the original urls. Fingers crossed!).

See ya ;)

22
I have done the test explained here:
https://www.dropbox....ation_test3.zip?dl=0

And it is a complete failure for Archivarius. Here is the updated spreadsheet with Archivarius results:
https://www.dropbox....ation_test3.xls?dl=0

I have just sent you my tests with your files on Dtsearch and Google Drive.

Dtsearch.jpg

Google Drive.jpg

I'll let someone else do the test on Office 365.

For my part, even if their indexing is not yet perfect (no indexing inside zip files (nor of zip files inside zip files!), or of .eml (that can be done in Gmail), or of files without an extension...), I am more and more storing my unzipped files (converted, so they don't count against storage) in Google Drive (1TB).
I am in the process of trying to convert my html archives into Word 2013 .docx, as I like its conversion results (it seems to nicely keep images and urls). I am not sure yet if I will convert all my pdfs into gdocs (Drive pdf limits + the result is not very good in my small tests), but who knows!
Now I have about 200,000 files in Google Drive. I wonder if their GUI will still respond quickly with >1 million files!
Basically, I have added all my old Outlook Express emails into Gmail, personal photos and videos into Google Photos (ok, not for professional photographers, but it is fine for my tastes), and mp3s into Google Music (it is not my case by far, but 50,000 files is the limit - my guess is that would represent around 200GB max).
One can also add videos (with private settings) to Youtube for free!
It is incredible what one can store quite cheaply at Google nowadays (I know, 'I am the product'!).

I also keep dtsearch for more powerful desktop searches. ;)

Hope this helps ;)
See ya ;)

ps: my previous test comparing different indexing softwares :
http://www.donationc....msg373414#msg373414
http://www.donationc....msg373068#msg373068

23
I realize this is an old post but if someone needs ideas for alternatives : http://alternativeto.../software/backupify/

24
Found Deals and Discounts / Re: Black Friday Deals 2015
« on: November 27, 2015, 10:38 AM »
Amazon Will Sell You a Year of Unlimited Cloud Storage For $5 Today !
http://deals.kinja.c...d-storage-1744816188

25
Dear all,

A few months ago I added a new layer to my backup strategy:
(short answer: I do a new automatic online backup, but I avoid re-uploading my data from Google Drive + Gmail to another cloud provider over my slow DSL line. I use Amazon Cloud Drive "Unlimited Everything", but it could have been another one, like Dropbox, OneDrive/Office 365, etc. This is done thanks to Cloudhq.net; they have many options available.)

1) Offline :
local copies from time to time into different hard drives and physical locations.

2) Online :
On my PC I use backblaze.com (about $5/month) for mirroring my PC (see my previous experience with them here http://www.donationc...opic=34797.msg373077 ).
I also use syncdocs.com (a one-time $20 fee) for uploading some of my data into Google Drive.
Then I recommend these services for saving Gmail and Google Drive, indirectly, into Amazon S3:
- Individuals: cloudally.com ($30/year) for saving Gmail + Google Drive. Note: I have not tested them lately, but it should do the job IMHO.
- Companies: spanning.com ($40/year; if you want, you can use this link so both of us get a $5 rebate: https://spanningbackup.com/s5m5/A7E6XK ). As I was told there, "the difficult part is not the backup, but the restore part". I tested them a lot on that and it works. Their point-in-time restore is quite good! ;)

The new part is that, in order to avoid uploading again all the data I have in Google Drive, I use Cloudhq.net:
I transfer and continuously sync my Google Drive + Gmail data automatically with Cloudhq (about $10/month) into Amazon Unlimited Everything (about $60/year). It fits my needs, as I don't have to upload again, over my slow DSL line, all the TBs of data that I have already uploaded (rough guess: I upload a little more than 1TB per year). I can also save my family's emails and their related Google Drive data on my Amazon drive. ;)

Another thing I do is save, in a Google spreadsheet, the date and the names of the main big folders that I delete from Google Drive (this happens when I hit the 1TB limit in Google Drive). This can help a lot when I do a restore later.

As the Backblaze guys say: "backup before you wish you had!"

I hope this helps ;)
