Are PDFs Searchable?
I’ve been messing around with PDF files lately, trying to make sure they can be spidered by (at least) Google. Here’s a crash course in what I found. Using a full version of Acrobat, open the PDF you want to optimize and press CTRL + D to open the Document Properties. There, enter the PDF’s Title, Author, Subject and Keywords. Be accurate, be succinct and don’t spam.
The Title you set in your PDF’s Document Properties will show up in Google as the text of the PDF’s link. (Without it, your PDFs may all be indexed as Untitled.)
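If you have a lot of PDFs and want a rough way to spot the ones with no Title, something like the sketch below might help. It is not how Acrobat or Google reads the file; it just looks in the raw bytes for a /Title entry written as a plain literal string. It will miss titles stored in compressed objects, as hex strings, or only in XMP metadata, and it does not handle nested unescaped parentheses. The function name find_pdf_title is made up for this example.

```python
import re
from typing import Optional

# Matches "/Title (some text)" and allows backslash-escaped characters inside.
# Assumption: the title is an uncompressed literal string in the Info dictionary.
_TITLE_RE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_pdf_title(pdf_bytes: bytes) -> Optional[str]:
    """Return the first plain-text /Title found in the raw PDF bytes, or None."""
    match = _TITLE_RE.search(pdf_bytes)
    if match is None:
        return None
    raw = match.group(1)
    # Undo the escapes for parentheses and backslashes.
    raw = re.sub(rb"\\([()\\])", rb"\1", raw)
    title = raw.decode("latin-1").strip()
    # An empty title is as unhelpful as a missing one.
    return title or None
```

To use it on a file, read the bytes first, for example find_pdf_title(open("report.pdf", "rb").read()), where report.pdf is a placeholder name. A result of None means "no Title found by this crude check", so it is worth opening those files in Acrobat to confirm.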
What about the rest of the document? Can the search engines read my PDF? Well, the answer is “it depends”. In general, if you open the PDF and can use the text tool to highlight individual lines of copy, it’s going to be indexable by Google. Another way to tell: open the PDF and press CTRL + A to select everything in the document. Then press CTRL + C to copy it. Go to Notepad and press CTRL + V to paste what you just copied. If real text appears in Notepad, it’s searchable.
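If you save what you pasted into Notepad, the “did real text appear?” check can be roughed out in a few lines. This is only a sketch: the function name looks_like_real_text and the threshold of five words are my own assumptions, not anything Google or Acrobat uses.

```python
import re


def looks_like_real_text(pasted: str, min_words: int = 5) -> bool:
    """Return True if the pasted text contains at least min_words word-like runs.

    A "word" here is two or more letters in a row. The threshold is an
    arbitrary assumption; adjust it for very short documents.
    """
    words = re.findall(r"[^\W\d_]{2,}", pasted)
    return len(words) >= min_words
```

An empty paste, or one that contains only stray digits and whitespace, returns False, which is what you would expect from a PDF that is really just an image.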
What about scanned PDFs? If you’re unable to select any text using the Text Tool, it’s likely your PDF is just an image of text, which is not searchable. What can you do about that? My best advice is to try Acrobat’s native OCR feature to convert that image to real, searchable text. Once the OCR has run, it won’t be apparent that anything has happened. That’s because Acrobat keeps the original image “in front of” the converted text. The converted text is now there, but it’s behind the scenes and only readable by “users” like search engine spiders. NOTE: in my experience the quality of the OCR is poor. I’ve never had much luck with it. To see what the converted text is, use the CTRL + A trick. This time, it will copy the converted text. When you paste it into Notepad, you’ll be able to see the quality of the results.
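Eyeballing the pasted OCR output works for one file, but for a batch you might want a number. The sketch below scores pasted text by the share of tokens that look like ordinary words. It is a crude proxy I am assuming for illustration, not a real OCR accuracy measure: a misread like “ls” for “is” still counts as a word, and legitimate tokens such as dates or part numbers count against the score. The name wordlike_ratio is hypothetical.

```python
import re

# A token counts as word-like if it is letters only, optionally joined by
# apostrophes or hyphens (for example "don't" or "behind-the-scenes").
_WORD = re.compile(r"^[^\W\d_]+(?:['’-][^\W\d_]+)*$")


def wordlike_ratio(text: str) -> float:
    """Return the fraction of tokens (0.0 to 1.0) that look like plain words."""
    # Split on whitespace, then trim common punctuation from both ends.
    tokens = [t.strip(".,;:!?\"'()[]“”‘’") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if _WORD.match(t))
    return good / len(tokens)
```

A clean sentence scores close to 1.0, while output full of digits and symbols wedged into words scores much lower. Where to draw the line between “usable” and “garbage” is a judgment call, so compare the score against a few files you have checked by hand.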
So, are PDFs searchable? The answer has to be… sometimes. Use the tips above to find out if your PDFs can be read as real text. If not, don’t worry: setting the Document Properties will at least let you convey the PDF’s Title to the engines.










Been working with a ton of law firms lately, providing OCR and Document Management/Capture systems. Wrote an article on some great apps for law firms:
http://www.scanguru.com/download.php?list.7
Also, this site has some great reference material and news articles on the “paperless office”:
http://www.scanguru.com