Planet PDF Forum


Hi, welcome to the Foxit Planet PDF Forum. If you have PDF or Adobe Acrobat questions then the right place to ask them is here, in this forum.

Small capitals in a PDF

Adrian
New Member
Joined: 27 Feb 2018
Location: UK
Points: 4
Posted: 06 Mar 2018 at 6:03am
Many thanks for your clear explanation Rob,

In this case the embedded font subset must be a rewrite of the 'real' Regular font, in which smaller capital characters are used for both capitals and lower case.  This must be the compressed binary blob I'm seeing in the PostScript.

If a char code draws whatever character the font has been redesigned to contain, then only OCRing will extract the characters as displayed.  So I see that, as you say, for a specific embedded font I'd need to identify and remap any chars that don't 'draw' what the char code says they should.  Hopefully this case is a rarity!

Do you know, is a separate character-mapping operation available in PDF?  If not, the character glyphs must be in the subset font twice, for A-Z and for a-z, wasting space for no reason.  But I guess that file compression removes most of the wasted space.

This is my understanding based on your clues and some reading, please correct me if I'm wrong.  Thanks again.

Rob Lyman
New Member
Joined: 12 May 2015
Location: Chicago
Points: 26
Posted: 05 Mar 2018 at 9:43pm
ZVHXXD+ChaparralPro-Regular is a subset font: an embedded font that carries only those characters that are actually used in the document (the clue is the 'ZVHXXD+' at the front--that's how PDFs indicate a font is subset). In this case, it's using a subset to carry the small caps characters in the string.
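
To illustrate the naming clue, here is a minimal Python sketch that splits a font name into its subset tag and base name. It assumes the usual convention of six uppercase letters followed by '+'; the function name split_subset_name is hypothetical and not part of any PDF tool.

```python
import re

# Assumed convention: a subset font name starts with a tag of six
# uppercase letters, then '+', then the original font name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(font_name):
    """Return (tag, base_name); tag is None if the name has no subset tag."""
    match = _SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


tag, base = split_subset_name("ZVHXXD+ChaparralPro-Regular")
print(tag, base)
```

A name without the tag (a fully embedded or non-embedded font) comes back unchanged with a tag of None, so the same call works on every font name the extractor reports.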

Unfortunately, subset fonts often cause problems when extracting text: because they aren't a complete font, they have to have been set up correctly by the application that subset them in the PDF. A lot of apps don't do this correctly, leading to problems like yours--and if you don't have access to whatever process made the PDF, there's not a lot you can do to fix it.

If you're interested in the ASCII/Unicode strings you're extracting and want to store them as metadata, the fastest solution might be to add a post-extraction sanitizing step: a Python script or similar that detects mixed-case strings and converts them to title case.
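
A minimal sketch of such a sanitizing step might look like the following. The function name fix_mixed_case is hypothetical, and the rule is an assumption: a word counts as mixed case when a lowercase letter is directly followed by an uppercase one. Note that this variant keeps each word's first letter as extracted and lowercases the rest, rather than applying strict title case, so that 'aNd' becomes 'and' and not 'And'.

```python
import re

# Assumption: only ASCII letters matter for this example.
_WORD = re.compile(r"[A-Za-z]+")
# A lowercase letter directly followed by an uppercase one, as in "GaimaN".
_LOWER_THEN_UPPER = re.compile(r"[a-z][A-Z]")


def fix_mixed_case(text):
    """Lowercase everything after the first letter of each mixed-case word."""

    def repair(match):
        word = match.group(0)
        if _LOWER_THEN_UPPER.search(word):
            return word[0] + word[1:].lower()
        return word  # leave ordinary words untouched

    return _WORD.sub(repair, text)


print(fix_mixed_case("Neil GaimaN, author of Sandman aNd American Gods"))
```

The rule is deliberately crude: legitimate interior capitals such as 'McDonald' would also be lowercased, so a real script would probably need an exception list or a check on the font name of each extracted run.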
Rob Lyman
Software Engineer
http://www.datalogics.com
Adrian
New Member
Joined: 27 Feb 2018
Location: UK
Points: 4
Posted: 27 Feb 2018 at 12:57pm
Hi all, I'm a newbie to PDF and to this forum.

I have a problem with extraction of text from PDFs.  The utility I need to use doesn't recognize font styles or encodings, or so it seems.

Here's a concrete example, from the first page of http://craphound.com/content/download/
Text which is displayed by PDF readers as
NEIL GAIMAN, AUTHOR OF Sandman AND American Gods (NEIL GAIMAN in small caps so the same height as lower-case 'x').
is shown by the pdf2txt utility as
Neil GaimaN, author of Sandman aNd American Gods
The different utility pdftotext gives
Neil Gaiman, author of Sandman and American Gods
i.e. it corrects the capital Ns but omits the small caps in the author name.

In fact, I'm using pdf2txt with options to preserve layout and output metadata as XML, which I need.  This gives the same mixed case (GaimaN) in the small capitals.  But it gives the font of every non-italic character as 'ZVHXXD+ChaparralPro-Regular'.  This font doesn't seem to come in a SmallCaps variant, nor does it seem to have thousands of Unicode characters that might include small caps.  And in any case pdf2txt understands Unicode and would extract the correct code.

So something else in the PDF is making Neil GaimaN into NEIL GAIMAN.

I've extracted the PostScript from the PDF and can't find anything that looks like it would do this.  There seems to be a data blob compressed with LZW, but unless it's in there I can't see any encoding.  I'm unfamiliar with PDF and PS but will appreciate and follow up on suggestions.
