Join Sue Jenkins for an in-depth discussion in this video, Pulling the text from images with OCR, part of Productivity Tips for Web Designers.
- Hi there, this is Sue Jenkins with Productivity Tips for Web Designers. In this week's lesson, you'll see how you can work with OCR to pull text from flattened, optimized web files. Earlier this year I got an email from a viewer asking for help with a particular problem he was having in Photoshop. He wanted to know how he could take the text from his .jpg image and paste it into Microsoft Word. Well, I knew from experience that that wasn't possible, but the question intrigued me enough to seek out other possible answers.
I found two. For the first solution, the answer to whether you can pull text from a file in Photoshop can be yes or no, depending on your original file format. Yes, you can copy text if the original document is an uncompressed file from an office program like Microsoft Word, Excel, or PowerPoint, or even Acrobat, or from a design program like Photoshop, Illustrator, or InDesign. With an uncompressed document, a user should be able to open that file in its native program and use that program's tools to select the text and copy it.
On the other hand, no, you can't copy the text if your original document is a compressed graphic, such as a .jpg, .png, .gif, .tif, .bmp, or .swf. In a compressed file, there is no way to select or extract the text in Photoshop because the file has already been flattened, which removes any text editing options. Fortunately, there is a semi-decent non-Photoshop solution, which I'll get to in a minute.
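As a rough sketch of that yes/no rule, a small Python helper could sort files by extension. The function name and the extension lists are assumptions made for illustration, loosely mapped from the programs and formats named in this lesson. They are not a complete list, and a .pdf in particular can itself be a scanned image with no selectable text.

```python
from pathlib import Path

# Illustrative groupings only (an assumption, not an exhaustive list).
# "Native" files usually keep live, selectable text in their own program.
NATIVE_EXTENSIONS = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",  # office programs
    ".pdf",                                             # Acrobat
    ".psd", ".ai", ".indd",                             # design programs
}
# Flattened graphics named in the lesson; the text is only pixels.
FLATTENED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp", ".swf"}


def text_extraction_route(filename: str) -> str:
    """Suggest how to get text out of a file, judging by its extension alone."""
    ext = Path(filename).suffix.lower()
    if ext in NATIVE_EXTENSIONS:
        return "copy"      # open in the native program, select the text, copy it
    if ext in FLATTENED_EXTENSIONS:
        return "ocr"       # retype it or run it through OCR
    return "unknown"


print(text_extraction_route("pink-square.JPG"))  # ocr
print(text_extraction_route("newsletter.docx"))  # copy
```

The extension is only a hint about how the file was saved, so treat the result as a starting point and not a guarantee.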
For argument's sake, let's say your starting file is a .jpg containing text, like this pink square. To pull the text from the image, you have to go outside of Photoshop. You really have two options. If your image contains very little text, your best bet might be to just retype it yourself, or if you're a terrible typist, you might enjoy using voice recognition software so you can dictate your content. On the other hand, if your flattened file contains a lot of text, try using OCR software.
OCR stands for Optical Character Recognition. Adobe Acrobat Pro has some OCR features, and there are several websites that offer free OCR services through file uploads. For instance, onlineocr.net will extract text from .pdf files and from images like .jpg, .bmp, .tiff, and .gif, and convert it into editable Word, Excel, and text output formats using a simple file upload interface. So if I want to test out that image, I'll select my file, in this case a .png graphic, and then I can choose my language and my output format. I could save to Microsoft Word, but I'll choose plain text to keep it simple.
Then I need to enter the CAPTCHA code and click the convert button. There's my output. It did a pretty good job, with maybe only one typo that I can see. In a simple test using white serif text on a pink background, the .pdf, .tiff, and .bmp files yielded the best results, followed closely by the .gif and .jpg formats, which each had a single text replacement error where the word "of" was swapped with a giant A. Bear in mind that if your file contains a lot of symbols, or if the image is of poor quality, the OCR results might include some symbol replacements, some odd characters, and maybe even some missing letters.
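If you want to repeat that kind of format comparison yourself, here is a minimal Python sketch that lists the word-level differences between the text you expected and the text the OCR service returned. It uses the standard library's difflib module. The function name and the sample sentence are hypothetical and are not the text from the test image.

```python
from difflib import SequenceMatcher


def ocr_word_errors(expected: str, ocr_output: str) -> list[tuple[str, str, str]]:
    """List each difference as (kind, expected words, OCR words)."""
    expected_words = expected.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(a=expected_words, b=ocr_words, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete", or "insert"
            errors.append(
                (tag, " ".join(expected_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return errors


# Hypothetical sample: one word replaced, as with the "of" error described above.
expected = "the art of web design"
ocr_output = "the art A web design"
print(ocr_word_errors(expected, ocr_output))  # [('replace', 'of', 'A')]
```

Running it once per output format and counting the entries gives a quick way to rank the formats, though it only helps when you already know what the text should say.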
Overall, OCR isn't perfect, but it works pretty darn well, and you can easily correct your typos manually. So there you have it: if you ever find yourself with a flattened graphic file containing text that you'd rather pull out than retype, try OCR.
Q: In "Organic and ethical SEO coding," the author mentions Google+ Authorship. I heard Authorship results are no longer shown in Google search results. Why? Are there benefits to keeping the Google+ Authorship markup on my site?
A: As of September 2014, Google discontinued Google+ Authorship for SEO. The only reason to keep the code on your site would be for Author Rank purposes. See http://searchengineland.com/google-authorship-dead-author-rank-202254 for more information.