Author Topic: Strategy for capturing and retaining OCRed text ("Alternative Text") from images  (Read 1477 times)


Sorry for the length, but I need to explain this in context and in some kind of sequence for it to make sense.
The question I have at the end is:
How could one make best/optimum use of the potential for bulk copying (as opposed to the current piecemeal capability) to capture OCRed "Alternative Text" in a more readily accessible form than the present Clipstory html files seem to offer?

  • 1. How I came to be using ClipStory: (bear with me; this is relevant)
    On 2012-02-18 I posted a Feature request for CHS - "Detritus" database(s)
    It was not of pressing importance, and I didn't know if/how the request could be met, but it was at the back of my mind when I discovered that my can't-live-without clipboard information management tool - CHS - had somehow, without my knowing, deleted the bulk of my hard-earned Favorites that it held in its database. Fortunately my backup cycles meant that I was able to find older CHS database backups that were much larger in size than the more recent shrunken CHS database.
    Lesson learned: Monitor your critically important databases for any significant changes in size.

    However, because it was likely to be such a chore, I have procrastinated and not given myself the time to go back and see if/how I might be able to recover the Favorites from those older backups.
    So, with the "detritus" idea at the back of my mind, and having also by that stage spent some time experimenting with NirSoft's InsideClipboard, when BitsDuJour announced they were giving away Clipstory, I got a free, licensed copy. I installed it and am still using/trialling it. This was to be my de facto detritus collecting tool. I have it set up to save all stuff copied into folders (which are backed up), thus:
    • ClipStory audio files
    • ClipStory files      
    • ClipStory images    
    • ClipStory text      
    • ClipStory webpages  

    Essentially, anything you copy/cut gets saved into Clipstory (there's a max limit on file size though). The "webpages" folder saves anything copied from an html source. This means there's some doubling-up - e.g., a snippet of text copied from a web page apparently goes into the Clipstory text folder and the webpages folder - but that's OK by me as I periodically empty all the folders except the text one.
    There is also some duplication with CHS, which retains in its database the same text and images as Clipstory. However, again, that's OK by me as I periodically empty out all the images and all the text in the CHS databases - except for that text that I wish to retain, which gets flagged as "Favorite" and is kept for good and for easy access via CHS.
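The "Lesson learned" in item 1 (watch important databases for significant changes in size) could be automated with something like the following. This is only a minimal sketch: the function names, the idea of comparing the live file against the most recent backup, and the 25% threshold are all assumptions for illustration, not anything CHS or ClipStory provides.

```python
from pathlib import Path


def shrunk_significantly(previous_size, current_size, threshold=0.25):
    """Return True if current_size is more than `threshold` (a fraction)
    smaller than previous_size. The 0.25 default is an arbitrary assumption."""
    if previous_size <= 0:
        return False  # nothing meaningful to compare against
    return (previous_size - current_size) / previous_size > threshold


def database_has_shrunk(db_path, backup_path, threshold=0.25):
    """Compare the live database file with an older backup copy of it.
    Both paths are hypothetical; point them at your own files."""
    current = Path(db_path).stat().st_size
    previous = Path(backup_path).stat().st_size
    return shrunk_significantly(previous, current, threshold)
```

Run before each backup cycle, a check like this would flag a database that has suddenly lost a large part of its contents before the older, larger backups are rotated out.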

  • 2. How MS OneNote OCRs/captures text from images:
    If you paste/drag an image into OneNote, capture a screenclip image using its superb built-in screen-clipping tool, or paste/drag content containing images from a web browser into OneNote, each image is immediately OCR-scanned, and any text found is made available almost instantly as copyable (and search-indexed) text - what OneNote calls "Alternative Text" or "Alt Text". However, copying that text is tedious when you want it from multiple images, as it has to be done on a per-image basis.
    Here's a picture of the "Alt Text" that you get if you right-click such an image (this was a single image in a OneNote table that had several text-containing images pasted into it):

    oneNote-Clipstory OCR 01 - Alt Text.png

    I had built that table in OneNote so that I could use it in a DCForum post about buying MS Office for $9.95 as a corporate "Home Use" special deal. When I had built it, I copied the table and pasted it into irfanview (it pasted in as an image), saved the image, and that image went into the DCF post.

  • 3. What Clipstory did with the copied table:
    It saved it as an HTML file in the ClipStory webpages folder. But that's not all.
    In my housekeeping, I had deleted all the Clipstory image files, and was about to delete the HTML files, when I thought I should just take a quick look and see what I was about to delete. It was then that I discovered that Clipstory saves some stuff from OneNote that is potentially more interesting/useful than one might at first realise - viz: "Alt Text" as html code.
    I reconstructed what seems to have happened.
    When I looked through the files in the ClipStory webpages folder, they all contained html code.

    Reconstruction: This is the html content of the file (2013-9-28 8497.html) of the copied OneNote table:

Code:
Version:1.0
StartHTML:0000000105
EndHTML:0000006237
StartFragment:0000000539
EndFragment:0000006197
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=OneNote.File>
<meta name=Generator content="Microsoft OneNote 15">
</head>
<body lang=en-GB style='font-family:Calibri;font-size:11.0pt;color:black'>
<!--StartFragment-->
<div style='direction:ltr;border-width:100%'>
<div style='direction:ltr;margin-top:0in;margin-left:0in;width:11.625in'>
<div style='direction:ltr;margin-top:0in;margin-left:0in;width:11.625in'>
<div style='direction:ltr'>
<table border=0 cellpadding=0 cellspacing=0 valign=top style='direction:ltr;
 border-collapse:collapse;border-style:solid;border-color:#A3A3A3;border-width:
 0pt'>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in;font-family:Calibri;font-size:20.0pt;color:black;
  text-align:right' lang=en-NZ><span style='font-weight:bold'>What you get
  under the MS Office </span></p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in;font-family:Calibri;font-size:20.0pt;color:black'
  lang=en-NZ><span style='font-weight:bold'>2013 Home Use Program</span></p>
  </td>
 </tr>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image001.png"
  width=537 height=219
  alt="Machine generated alternative text:&#38;#10;Word 2013&#38;#10;Want to create and share great-looking&#38;#10;documents that get noticed? Use galleries&#38;#10;of easy-to-use formats for r&#38;#233;sum&#38;#233;s, letters,&#38;#10;greeting cards, flyers, and more."></p>
  <p style='margin:0in;font-family:Calibri;font-size:11.0pt;color:black'
  lang=en-NZ>&nbsp;</p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image002.png"
  width=539 height=220
  alt="Machine generated alternative text:&#38;#10;Excel 2013&#38;#10;Excels versatile tools help you analyze&#38;#10;information to make better decisions.&#38;#10;Improved charting tools and new visual&#38;#10;effects make it easier to present data and&#38;#10;highlight trends."></p>
  </td>
 </tr>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image003.png"
  width=540 height=215
  alt="Machine generated alternative text:&#38;#10;PowerPoint 2013&#38;#10;Get your ideas noticed. With PowerPoint&#38;#8217;s&#38;#10;new formatting and graphics features you&#38;#10;can more effectively create dynamic&#38;#10;presentations."></p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image004.png"
  width=529 height=218
  alt="Machine generated alternative text:&#38;#10;Outlook 2013&#38;#10;Better manage your time and information,&#38;#10;connect across boundaries, and improve&#38;#10;email control and protection."></p>
  </td>
 </tr>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image005.png"
  width=535 height=215
  alt="Machine generated alternative text:&#38;#10;OneNote 2013&#38;#10;OneNote 2013 makes it easy to take flotes,&#38;#10;sketch a diagram and record a presentation,&#38;#10;all in one place. Your flotes are&#38;#10;automatically saved and searchable, and&#38;#10;they travel seam lessly to your favorite&#38;#10;devices."></p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image006.png"
  width=538 height=220
  alt="Machine generated alternative text:&#38;#10;Access 2013&#38;#10;Track and report information with ease. Our&#38;#10;fluent user interface and interactive design&#38;#10;capabilities don&#38;#8217;t require deep database&#38;#10;knowledge."></p>
  </td>
 </tr>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image007.png"
  width=531 height=217
  alt="Machine generated alternative text:&#38;#10;Publisher 2013&#38;#10;Create and distribute persuasive marketing&#38;#10;materials that reflect your brand identity."></p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image008.png"
  width=538 height=217
  alt="Machine generated alternative text:&#38;#10;InfoPath 2013&#38;#10;Easily create electronic forms to gather&#38;#10;data for projects."></p>
  </td>
 </tr>
 <tr>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:4.7347in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in'><img src="file:///C:\Temp\msohtmlclip1\01\clip_image009.png"
  width=538 height=219
  alt="Machine generated alternative text:&#38;#10;Lync 2013&#38;#10;Lync 2013 is the client for Microsofts&#38;#10;enterprise-ready unified communications&#38;#10;platform. Lync connects people everywhere."></p>
  </td>
  <td style='border-width:0pt;background-color:#F2F2F2;vertical-align:top;
  width:6.8888in;padding:2.0pt 3.0pt 2.0pt 3.0pt'>
  <p style='margin:0in;font-family:Calibri;font-size:9.0pt;color:#595959'
  lang=en-NZ>&nbsp;</p>
  </td>
 </tr>
</table>
</div>
</div>
</div>
</div>
<!--EndFragment-->
</body>
</html>
    (Look at all that "Machine generated alternative text".)
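Since every OCR result in the listing sits in the alt attribute of an img tag, it can be pulled out with an ordinary HTML parser. The following is a minimal sketch using only Python's standard library; the names AltTextCollector, clean_alt and extract_alt_texts are made up for illustration. It assumes the line breaks are stored as character references: the listing shows them as `&#38;#10;`, which looks double-encoded (possibly just an artifact of how the forum displays code), so the sketch decodes a second time to cope with either form.

```python
from html import unescape
from html.parser import HTMLParser

# Prefix that OneNote puts in front of each OCR result (as seen in the listing).
PREFIX = "Machine generated alternative text:"


def clean_alt(alt):
    """Decode leftover character references and drop the OneNote prefix."""
    # HTMLParser has already decoded one level of entities; decoding again
    # handles the double-encoded "&#38;#10;" form. Caveat: text that
    # legitimately contains something like "&amp;lt;" would be over-decoded.
    text = unescape(alt)
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # Keep the OCR line breaks, but drop empty lines and stray whitespace.
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


class AltTextCollector(HTMLParser):
    """Collects the cleaned alt text of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(clean_alt(alt))


def extract_alt_texts(html_text):
    """Return a list with one cleaned alt text per image in html_text."""
    collector = AltTextCollector()
    collector.feed(html_text)
    collector.close()
    return collector.alts


if __name__ == "__main__":
    sample = ('<p><img src="clip_image008.png" width=538 height=217 '
              'alt="Machine generated alternative text:&#38;#10;InfoPath 2013'
              '&#38;#10;Easily create electronic forms to gather'
              '&#38;#10;data for projects."></p>')
    for block in extract_alt_texts(sample):
        print(block)
```

The OCR mistakes themselves ("flotes" for "notes", "seam lessly") would of course come through unchanged; this only gets the text out of the html.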

When I viewed the file with Universal Viewer, I got this:
OneNote-Clipstory OCR 02 - 2013-09-28 , 20_15_18.png

That looked very familiar. The bits marked as "Machine generated alternative text" were images, and I couldn't select/copy that text, so I copied some of the (copyable) heading text "What you get under the MS Office". Then I started up OneNote and searched for that string. Found it in the table in OneNote straight away, and went back to compare the table with the view of file 2013-9-28 8497.html, but now that file looked like this:
OneNote-Clipstory OCR 03 - 2013-09-28 , 20_10_42.png

After a bit of mucking about, I figured out (though I am not absolutely sure) that the html was probably linked to the original in OneNote, from which the images had been copied. The images were inaccessible until I started up OneNote, at which point they were fetched and (it seems) put into Temp, from where they were fetched again and inserted into the displayed web page, thus covering up the Alt Text image placeholders in the view of the html - all made possible because I had opened up OneNote.
This would seem to be consistent with OneNote's being organised something like a huge and complex wiki - it hyperlinks everything it holds in rather clever ways. Everything you do in the Notebooks is linked to date, time, and author, and material in OneNote is cross-linked internally within OneNote itself and externally to sources of material from across the internet and the client PC. Thus, if you copy anything from OneNote, the copied content will include all the relevant links related to where it was located at the time it was copied. If you move stuff around, the links are tracked and dynamically reassigned as necessary, so there is continuity and you don't easily get dead/broken links.

Given the above, the Clipstory html files afford the potential to do bulk copying of multiple images' text, and thus overcome the tedious per image copying referred to above. The question I have is: How could one make best/optimum use of the potential for bulk copying (as opposed to the current piecemeal capability) to capture OCRed "Alternative Text" in a more readily accessible form than the present Clipstory html files seem to offer?
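One possible approach to the bulk-copying question, offered as a sketch rather than a definitive answer: walk the ClipStory webpages folder, read the alt attribute of every img tag in every .html file, and write the lot to a single plain-text file that can be searched or pasted from. The function names and the output layout below are assumptions for illustration; it also assumes the files are UTF-8, as the meta tag in the example file declares.

```python
from html import unescape
from html.parser import HTMLParser
from pathlib import Path


class ImgAltParser(HTMLParser):
    """Gathers the alt attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                # The parser decodes entities once; decode again in case the
                # file stores them double-encoded (an assumption, see lead-in).
                self.alts.append(unescape(alt).strip())


def alt_texts_in_file(path):
    """Return the alt texts found in one saved html file."""
    parser = ImgAltParser()
    # The clipboard header lines before <html> are plain text and are ignored.
    parser.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return parser.alts


def export_folder(folder, out_file):
    """Write all alt texts from the .html files in `folder` to `out_file`.

    Each source file gets a heading line so the text can be traced back.
    Returns the number of alt texts written.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.html")):
            alts = alt_texts_in_file(path)
            if not alts:
                continue  # no images with alt text in this file
            out.write(f"=== {path.name} ===\n")
            for alt in alts:
                out.write(alt + "\n\n")
                count += 1
    return count
```

For example, `export_folder(r"C:\path\to\ClipStory webpages", "alt_text.txt")` (a made-up path) would produce one text file covering every saved clip, which could be regenerated before each housekeeping purge of the html files.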
« Last Edit: September 28, 2013, 09:56:31 AM by IainB, Reason: Minor corrections. »