Scanning hard copy
Take a tour of Acrobat XI, compare its three editions, and get a fresh look at what you can do with Acrobat. This course demonstrates the basics of working with PDFs: how to create, combine, edit, export, and review documents. Author Claudia McCue also shows how PDFs integrate with Microsoft Office applications and introduces the basics of working with forms.
- Understanding the Portable Document Format (PDF)
- Inserting, replacing, and extracting pages
- Combining PDFs
- Creating PDFs from Word, PowerPoint, and Excel
- Converting web pages to PDF
- Scanning hard copies of documents
- Printing to PDF
- Exporting to other formats from Acrobat (such as the Excel .xls)
- Adding hyperlinks and bookmarks
- Marking up a PDF with annotations and drawings
- Using shared reviews
Scanning hard copy
Here I have a scan of a paper document, and you can see the text looks a little bit rough, but that's because it's a picture of text. It's not real text. But my colleague needs this text, because he needs to be able to search it in a repository of PDFs, and he can't search it as long as it's pixels. Acrobat needs to perform OCR (Optical Character Recognition) to convert these pixels into genuine searchable text. That feature is over here under Tools > Text Recognition. I can choose to recognize text in this file or, if I have multiple files open, I could OCR all of them.
I just have the one file, so I'm going to choose In This File. Now, there are three options here. When you choose Edit, you have the option for Searchable Image, Searchable Image (Exact), or ClearScan. Searchable Image tries to clean up the document, and the text that's created is not visible; it sits behind the image, if you will. Searchable Image (Exact) doesn't clean up the image, and that's used, for example, in legal environments or insurance offices where they need to keep the original look of the document for legal reasons.
So that's left intact. You still get the invisible searchable text. ClearScan tries to create a font to mimic the original; if it can't, then you still end up with that sort of veneer of an image. So let's try Searchable Image first. This is actually a 1200 dpi scan. It offers to downsample it to 600 dpi to make a smaller PDF, and that's going to be all right. It will still keep detail. So when I click OK, and OK, Acrobat begins processing. And when you watch, did you see it sort of shift to the left? It deskews it, it straightens it out, and it seems to be actually for Acrobat's own purposes, to make it easier for Acrobat to recognize the content.
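To put a rough number on that downsampling step, here is a minimal Python sketch of the arithmetic. It is not anything Acrobat exposes; the function names and the US Letter page size (8.5 x 11 inches) are assumptions made for illustration. It only counts pixels, and the actual PDF file size also depends on compression.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_count_ratio(src_dpi, dst_dpi):
    """How many times more pixels the source scan has than the downsampled one."""
    # Resolution applies in both directions, so the ratio is squared.
    return (src_dpi / dst_dpi) ** 2


# Assumed page size: US Letter, 8.5 x 11 inches (not stated in the video).
original = pixel_dimensions(8.5, 11, 1200)
downsampled = pixel_dimensions(8.5, 11, 600)

print("1200 dpi:", original)
print(" 600 dpi:", downsampled)
print("pixel ratio:", pixel_count_ratio(1200, 600))
```

Halving the resolution cuts the pixel count to a quarter, which is why a 600 dpi version can be a much smaller PDF while still holding plenty of detail for text.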
Well, now we need to find out how good a job it did. And again, remember that that searchable text is going to be invisible. So you can see that the text is still made out of pixels--at least that's all we can see--but let's check our work. On the right I'm going to click on Find First Suspect. Now, I'm not going to go all the way through the document, but I want to show you how this mechanism works. Here in this window you're going to see that little clump of pixels greatly magnified, and then back in the document you see the figure that it proposes to replace it with. So for a moment there, you can see that invisible text.
It says it's an ampersand and I agree, so I click Accept and Find. And then I'm going to go down and accept and find a few of the other replacements. It seems to do a pretty good job. BASIC, that's good. What I'm worried about is that italic text, because that has some sort of flouncy characters in it, and I'm wondering if Acrobat will recognize them. So down here, oh, it has not done a good job. You might have to look closely, but it's replaced Roux with !wix. Well, it got the x right, but that's it.
I know that this is disappointing, but keep in mind, you're asking it to do something pretty heavy-duty, convert pixels to text, so this is why you want to pay attention to the results. So I'm going to go in here and fix this and type the correct word. And Accept means now I'm accepting what I just typed. And again, I'm not going to go through the whole document, but you sort of get the idea. One of the ways I like to check this is to select all the text (I can just get my Selection tool and select it as I would in Microsoft Word), copy it, and then go into an empty Word file and just paste.
And let's take a look at that. If I go to View and zoom in some (a nice 200% ought to do it), you can see it's confused about some things. In fact, it didn't seem to take the word that I typed. But a couple of things to consider: this could also be a way for you to extract text from a scan. Maybe you don't care about it being searchable. Maybe you just don't want to type this all over again. Well, at least you've got something that you can work with now. You have editable text. It's not perfect. If you had to have this for legal purposes, you would need to go back into Acrobat, to all those little suspects, and fix every little instance that isn't correct. And yes, that's tedious, but there are times when that's going to be required.
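If you do happen to have a known-good copy of some of the words, the paste-and-compare check can be roughed out in a few lines of Python. This is only a sketch under that assumption, using the standard library's `difflib`; it is not how Acrobat finds its suspects, and the function names and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(ocr_word, expected_word):
    """Similarity between 0.0 and 1.0; 1.0 means the two words are identical."""
    return SequenceMatcher(None, ocr_word, expected_word).ratio()


def flag_mismatches(ocr_words, expected_words, threshold=0.8):
    """Return (ocr, expected) pairs whose similarity falls below the threshold."""
    return [
        (ocr, expected)
        for ocr, expected in zip(ocr_words, expected_words)
        if similarity(ocr, expected) < threshold
    ]


# Words taken from the demonstration: BASIC was recognized, Roux was not.
ocr_words = ["BASIC", "!wix"]
expected_words = ["BASIC", "Roux"]

for ocr, expected in flag_mismatches(ocr_words, expected_words):
    print(f"check this one: OCR read {ocr!r}, expected {expected!r}")
```

Here only the `!wix` and `Roux` pair gets flagged, since the two words share nothing but the final x. Comparing word by word like this assumes both lists line up one to one, which real OCR output will not always do.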
So this was a bitmap scan. I'm going to try it quickly with a grayscale scan. I will say that for the most part I get better results with the bitmap, but just so you know, converting a scan to a PDF is pretty easy. You can also drive a scanner directly from within Acrobat. I don't have a scanner hooked up, and I had already scanned these files, so I'm just going to choose Create PDF from File, and there's my grayscale scan. And I'm just going to quickly start this, just so you can see some difference with Searchable Image (Exact).
So I'm going to choose In This File, and then for my Option I'm going to choose Searchable Image (Exact), and leave the rest of the options at their defaults. Notice that it didn't shift at the end. So it didn't do the deskewing, and that's what I meant when I said it keeps that image intact. So now when I say Find First Suspect, the first thing it finds is this little clump of trash on the scan, so maybe there was something on the scanner platen. So I can say no, that's not text, don't worry about it, and then I could continue on with the Accept and Find.
So for legal purposes, this is a faithful representation of the original. If I finish cleaning out all my little suspects, then I have a searchable file, so I kind of have the best of both worlds. So in a document like this, yes, it could be kind of tedious, but it's something that Acrobat does, in general, very well.