Planet PDF Forum



How can I extract text if the encoding is Identity-H?

spongebob (New Member; joined 31 Jul 2013; Location: Germany)
Posted: 16 Nov 2016 at 12:25pm

Dear community!

In my application (C#) I am trying to extract text from a PDF file.

In one PDF file (uploads/1345/1.pdf) I encounter a Tj operator whose operand has this form: "<001400170011001C001C0003005000F0>". The current font has the encoding Identity-H. I tried converting the hexadecimal elements (0014, 0017, ... 00F0) to integers and the integers to chars, but with Identity-H this does not give the correct result. The text should be "14.99 m²".
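As an illustration of the approach described above, here is a minimal Python sketch (the function name `split_identity_h` is made up for this example) that splits the hex string into 2-byte codes and then applies the naive code-to-char conversion. It only shows why that conversion fails; it is not a working extractor.

```python
def split_identity_h(hex_string):
    """Split a PDF hex string into 16-bit big-endian codes."""
    # Drop the angle brackets and any whitespace inside the string.
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # PDF treats a missing final hex digit as 0
    raw = bytes.fromhex(digits)
    if len(raw) % 2:
        raise ValueError("Identity-H strings must hold whole 2-byte codes")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


codes = split_identity_h("<001400170011001C001C0003005000F0>")
print(codes)  # [20, 23, 17, 28, 28, 3, 80, 240]

# Naive conversion: treat each code as a Unicode code point.
naive = "".join(chr(c) for c in codes)
print(repr(naive))  # control characters plus 'P' and 'ð', not "14.99 m²"
```

The codes come out cleanly, so the splitting step is fine. The problem is the last step: the codes are not character codes.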

How can I correctly extract text if the font has the encoding Identity-H?

Thanks in advance

spongebob



Edited by spongebob - 16 Nov 2016 at 12:28pm

Rob Lyman (New Member; joined 12 May 2015; Location: Chicago)
Posted: 16 Nov 2016 at 8:48pm
The 'Identity-H' encoding in a PDF means that each two-byte code in the string is used directly as a character ID, which for an embedded TrueType font normally means a Glyph ID: a value that identifies a specific glyph in a TrueType or OpenType font file. These Glyph IDs are not the same from font to font (e.g. 'A' may map to glyph ID 12 in a Helvetica font file, but to glyph ID 37 in Comic Sans). Because the string directly depends on a specific font, an 'Identity-H' encoding requires that the font file be embedded (fully or as a subset) within the PDF.

In order to convert the Glyph IDs in the string back to Unicode, the font in the PDF should carry a 'ToUnicode' table: a reverse lookup table, referenced from the font dictionary, that maps the glyph IDs back to Unicode code points. If the font does not carry a 'ToUnicode' table (meaning the PDF generator didn't include one), it will be very difficult to extract the text that uses that font.

Note that the 'ToUnicode' table is in a specific format: rather than being a simple table, it's basically a set of PostScript-style instructions for constructing the table. For this reason, it's usually easier and more reliable to use a third-party library for PDF text extraction than to do it directly in your own code.
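To give a feel for that format, here is a rough Python sketch that reads the `beginbfchar` and `beginbfrange` sections of a ToUnicode CMap with regular expressions. The `SAMPLE_CMAP` text is invented for illustration (it does not come from any file in this thread), and the parser ignores the array form of `bfrange`, whitespace inside hex strings, and ranges whose destination is more than one UTF-16 code unit. A real library handles far more cases.

```python
import re

# Invented ToUnicode CMap fragment, for illustration only.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0003> <0020>
<0011> <002E>
<0050> <006D>
<00F0> <00B2>
endbfchar
1 beginbfrange
<0013> <001C> <0030>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Return {code: unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            # The destination is UTF-16BE and may hold several characters.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)  # simplification: single BMP code point
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


table = parse_to_unicode(SAMPLE_CMAP)
codes = [0x0014, 0x0017, 0x0011, 0x001C, 0x001C, 0x0003, 0x0050, 0x00F0]
text = "".join(table.get(code, "\ufffd") for code in codes)
print(text)  # 14.99 m²
```

Codes with no entry are replaced by U+FFFD here so that gaps in the table stay visible instead of silently disappearing.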

See Adobe's "ToUnicode Mapping File Tutorial" for more details about ToUnicode tables.
Rob Lyman
Software Engineer
http://www.datalogics.com

spongebob
Posted: 16 Nov 2016 at 9:41pm
Dear Mr. Lyman

Thanks for your answer. I will only be able to try this on Friday.
(After trying it I will probably have more questions.)

Best regards
spongebob

spongebob
Posted: 18 Nov 2016 at 12:53pm
Hello, Mr. Lyman
I tried this, but without success.

This is from my file.

9 0 obj
<< /Type /FontDescriptor /FontName /ArialMT /FontFamily (Arial) /FontWeight 400 /FontBBox [-665 -325 2000 1040] /Ascent 728 /Descent -210 /CapHeight -34 /Leading 33 /Flags 42 /ItalicAngle 0 /StemV 80 /FontFile2 6 0 R
>>
endobj

10 0 obj
<< /Type /Font /Subtype /Type0 /BaseFont /ArialMT /Encoding /Identity-H /DescendantFonts [
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /ArialMT /FontDescriptor 9 0 R /CIDSystemInfo
<<  /Registry (PDFAUTOCAD) /Ordering (Indentity0) /Supplement 0
>> /W 7 0 R
>>]
>>
endobj

There is no reference here to any object that holds a ToUnicode map. I do not understand how Acrobat Reader knows which map must be used here (Acrobat displays this text correctly). Where does Acrobat take this map from?

Thanks in advance

BAlheit (Senior Member; joined 15 Jul 2011)
Posted: 18 Nov 2016 at 4:19pm
For display of the text Acrobat doesn't need the map.

spongebob
Posted: 18 Nov 2016 at 6:05pm
"For display of the text Acrobat doesn't need the map"

How, in this case, does Acrobat know that Tj <001400170011001C001C0003005000F0> is "14.99 m²"?



Edited by spongebob - 18 Nov 2016 at 6:05pm

BAlheit
Posted: 19 Nov 2016 at 10:46am
In this case the entry 0014 will display the glyph of the embedded font at this position (1).
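Put differently: for display, the viewer only needs the glyph index. For text extraction, the lookup has to run the other way, from glyph index back to character. The Python sketch below assumes a character-to-glyph table is already available, for example one read from the embedded font, and simply inverts it. The `char_to_glyph` values are illustrative, chosen to match the example string in this thread, and were not read from the actual file. The sketch also assumes the code in the string equals the glyph index, which is the PDF default when the font dictionary has no /CIDToGIDMap entry.

```python
# Illustrative character-to-glyph table (hypothetical values).
char_to_glyph = {" ": 3, ".": 17, "1": 20, "4": 23, "9": 28, "m": 80, "²": 240}


def invert(mapping):
    """Build a glyph-index-to-character table from a character-to-glyph one."""
    glyph_to_char = {}
    for ch, gid in mapping.items():
        # Several characters can share one glyph; keep the first one seen.
        glyph_to_char.setdefault(gid, ch)
    return glyph_to_char


glyph_to_char = invert(char_to_glyph)
codes = [0x0014, 0x0017, 0x0011, 0x001C, 0x001C, 0x0003, 0x0050, 0x00F0]
text = "".join(glyph_to_char.get(code, "\ufffd") for code in codes)
print(text)  # 14.99 m²
```

Because several characters can point at the same glyph, the inverse is not always unique, so this reverse lookup is a best-effort fallback rather than a replacement for a ToUnicode table.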

spongebob
Posted: 22 Nov 2016 at 7:08am
"In this case the entry 0014 will display the glyph of the embedded font at this position (1)."

What do I have to do? Where is the list of these "positions"? I have not found anything like it in the font descriptor. However, I have not yet parsed the FontFile2 stream. Is the list in that stream? I think so, is that right? If it is, where can I find a description of FontFile2?

Thanks in advance


Edited by spongebob - 22 Nov 2016 at 7:10am

BAlheit
Posted: 22 Nov 2016 at 10:42am
You can look at the font with the Preflight Tool of Acrobat Pro.

spongebob
Posted: 23 Nov 2016 at 6:53am
Thanks.
But that will not help me in the long run. I must parse the FontFile2 stream in my program, and I do not know the structure of this file.
Where can I find a description of it?
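For orientation: a /FontFile2 stream holds a TrueType font program, and its layout is described in the TrueType and OpenType specifications. The Python sketch below reads only the table directory at the start of such a file. It assumes the stream has already been decoded (for example, inflated if it uses /FlateDecode), and the function name `read_table_directory` and the synthetic test bytes are made up for this example. Note that a subset font embedded in a PDF may lack some tables, possibly including 'cmap'.

```python
import struct


def read_table_directory(font):
    """Return {tag: (offset, length)} for the tables of a TrueType font."""
    # Header: uint32 sfnt version, uint16 table count, then 6 bytes
    # of binary-search helper fields that are skipped here.
    _sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: 4-byte tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


# Synthetic two-table "font", built only to exercise the parser.
header = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
records = (struct.pack(">4sIII", b"cmap", 0, 44, 4)
           + struct.pack(">4sIII", b"glyf", 0, 48, 4))
fake_font = header + records + bytes(8)

tables = read_table_directory(fake_font)
print(tables)  # {'cmap': (44, 4), 'glyf': (48, 4)}
```

If a 'cmap' entry is present, its offset and length locate the character-to-glyph table inside the font data. Parsing that table's subtables is a separate step not shown here.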

BAlheit
Posted: 23 Nov 2016 at 9:06am
I don't know.