Planet PDF Forum


Unusual Unicode Map?

chris_pdf (New Member, joined 26 May 2012)
Posted: 26 May 2012 at 12:23am
Hi,

I've got a CMAP that begins like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
1 beginbfchar
<24><0009 000d 0020 00a0>
endbfchar
1 beginbfchar
<4d><002d 00ad 2010>


I can't make any sense of the bfchar entries, though:
<24><0009 000d 0020 00a0>
<4d><002d 00ad 2010>

The bytes on the right don't seem to map to any valid Unicode characters when read as UTF-8. How should they be interpreted?

aandi (Senior Member, joined 07 Jul 2011)
Posted: 26 May 2012 at 8:56am
When the PDF standard talks about Unicode it NEVER means UTF-8 (except where it appears in other standards like XML). The data is to be interpreted directly as a list of Unicode code points.
So the 2010 is simply Unicode code point 0x2010, hyphen (http://www.fileformat.info/info/unicode/char/2010/index.htm)
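
To illustrate the point above, here is a minimal Python sketch (not part of the original reply) that treats each four-digit group from the two bfchar entries as a code point and looks up its name. It assumes Python 3 and its standard `unicodedata` module; the `describe` helper is a hypothetical name chosen for this example.

```python
import unicodedata

# Code points taken from the two bfchar entries in the original post.
CODE_POINTS = [0x0009, 0x000D, 0x0020, 0x00A0, 0x002D, 0x00AD, 0x2010]


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME', with a placeholder for unnamed control characters."""
    # unicodedata.name() raises ValueError for control characters such as
    # U+0009 unless a default value is supplied as the second argument.
    name = unicodedata.name(chr(cp), "<control, no name>")
    return f"U+{cp:04X} {name}"


for cp in CODE_POINTS:
    print(describe(cp))
```

Tab and carriage return are control characters without a formal name, which is why the sketch supplies a placeholder for them; the remaining values are ordinary named characters.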

chris_pdf (New Member, joined 26 May 2012)
Posted: 29 May 2012 at 8:09pm
aandi,

I see the hyphen, but what about the other bytes? How can I make sense of them? I see that there is '0020' (space) for byte 3 of 4 in the first sequence as well, but am not sure why the other bytes are there.

I am trying to make sense of what all the bytes are doing. Any links/pointers are appreciated.


Thanks for the help!

aandi (Senior Member, joined 07 Jul 2011)
Posted: 29 May 2012 at 9:11pm
They are all valid Unicode code points; try modifying the 2010 URL I gave for each one in turn. They are unusual to find in a PDF, since some of them are invisible, but they are valid. This may be a font that was used solely to get access to those Unicode code points, while another font supplies the others.

chris_pdf (New Member, joined 26 May 2012)
Posted: 29 May 2012 at 10:46pm
I see.

I was under the impression that bfchar is used to map one value to one other value, not to a series or a range. It seems like 95% of the time, you get something that looks like this:

6 beginbfchar
<017D> <006F>
<0189> <0070>
<018C> <0072>
<0190> <0073>
<019A> <0074>
<01B5> <0075>
endbfchar (this is perfectly clear to me)

But every once in a while, you get:

<4d><002d 00ad 2010>

Here it seems that only one of the values maps to valid Unicode, or perhaps I'm just misunderstanding what is happening.

Can you recommend any good reading sources other than the PDF reference (I am using that now)?

Thanks for your help,
-Chris

aandi (Senior Member, joined 07 Jul 2011)
Posted: 29 May 2012 at 11:03pm
1. 002d 00ad 2010 - these are three valid Unicode code points. Did you check them, or were you looking for something else?
 
2. "It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding." The crucial word here is "sequences".
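
As a rough illustration of the "sequences" reading, the sketch below (not part of the original reply) parses a single bfchar entry and decodes its right-hand side as UTF-16BE, so that one character code maps to a whole string. It assumes Python 3; `BFCHAR_ENTRY` and `parse_bfchar_entry` are hypothetical names for this example, and the regular expression only handles the simple one-entry-per-line form shown in this thread, not a full CMap.

```python
import re
from typing import Tuple

# Matches one bfchar entry: <source code> <destination hex>.
# Whitespace between the two parts and inside the destination is allowed.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f\s]+)>")


def parse_bfchar_entry(line: str) -> Tuple[int, str]:
    """Return (character code, Unicode string) for one bfchar entry."""
    match = BFCHAR_ENTRY.search(line)
    if match is None:
        raise ValueError(f"not a bfchar entry: {line!r}")
    src_hex, dst_hex = match.groups()
    # Drop the spaces inside the destination, then decode the bytes as
    # UTF-16BE, which is how the quoted sentence says they are expressed.
    dst_bytes = bytes.fromhex("".join(dst_hex.split()))
    return int(src_hex, 16), dst_bytes.decode("utf-16-be")


code, text = parse_bfchar_entry("<4d><002d 00ad 2010>")
print(hex(code), [f"U+{ord(ch):04X}" for ch in text])
```

Under this reading, the entry `<4d><002d 00ad 2010>` maps character code 0x4D to a three-character string, while an entry such as `<017D> <006F>` is simply the special case where the sequence has length one.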