Wednesday 12 March 2014

Multi language PDF generation using iText

In one of my recent projects, a need arose to support multi-language PDF generation using the iText library.

Previously, a simple approach was used: detect the browser locale and use the appropriate font when creating the PDF.
But now there was also a user comments section, which had to be written to the PDF document. This posed a problem, as the user can enter a comment in any language, and ideally the application should use the appropriate fonts for it.

After some googling I found a few solutions, each with its own pros and cons.
  1. Solution: This approach involved using 'arialuni.ttf', available in the Windows fonts collection. As it covers most of the Asian languages, it was a kind of 'One Size Fits All' solution. All I had to do was embed this font while creating the base font. But on doing some further googling, I found that this can create serious copyright issues, as the font has been licensed by Microsoft from a third-party vendor and is not supposed to be redistributed, or rather embedded in your own documents for redistribution. I actually liked the approach very much, but had to drop it because of the copyright issues.
  2. Solution: This approach was similar to what was previously done in the project: identify the browser locale and use the appropriate font for writing the PDF content. But as mentioned earlier, this was not going to work anymore, since the content could be in multiple languages.
  3. Solution: After a lot of googling (it took this long most probably because of my incorrect searches), I found a solution which uses the FontSelector class to write the PDF. The FontSelector in the iText library can be used in the following way (at least as far as I know :-P):
    • Create the fonts required for the target languages
      1. Don't specify the font as embedded while creating the Font.
      2. The main pain point might be identifying the appropriate font names and encodings to use for the various languages. But I guess a bit of googling should do the trick.
    • Provide the FontSelector with the fonts created above, as appropriate to the desired target languages (in my case, only a few languages had to be covered, such as English, variants of Chinese, Japanese, etc.)
    • The order of the fonts plays an important role. Let's say we provide the font applicable to Chinese first, and then a font applicable to English such as TimesRoman or Helvetica. The exported PDF would then require the Asian Font Package even if the entire PDF is in English. This is because the FontSelector selects the font appropriate to the data currently being written: if the first font in the list is unable to process the data, the FontSelector tries the next font in the list, and so on. As the font applicable to Chinese can also write English text, the exported PDF contained data only in that font, hence the requirement for the Asian Font Package. So the English font should be added first, followed by the Asian fonts.
    • While writing the data, use the FontSelector to process the content being written to the PDF.
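
To make the ordering point concrete, here is a small sketch in plain Java. It is only a simplified model of "first font that can handle the character wins", not iText's actual code, and it makes no iText calls. The ModelFont class, the assign method and the character ranges are all made up for illustration, and the font names are just labels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Simplified model of first-match font fallback; NOT iText's implementation. */
class FontFallbackDemo {

    /** Hypothetical stand-in for a font: a label plus the code points it can render. */
    static class ModelFont {
        final String name;
        final IntPredicate canRender;

        ModelFont(String name, IntPredicate canRender) {
            this.name = name;
            this.canRender = canRender;
        }
    }

    /** Splits text into runs of the form "fontName:text", trying the fonts in list order. */
    static List<String> assign(List<ModelFont> fontsInOrder, String text) {
        List<String> runs = new ArrayList<>();
        String currentFont = null;
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String chosen = "none"; // no font in the list can render this character
            for (ModelFont f : fontsInOrder) {
                if (f.canRender.test(cp)) {
                    chosen = f.name; // first match wins, later fonts are never asked
                    break;
                }
            }
            // Close the current run whenever the chosen font changes.
            if (!chosen.equals(currentFont) && run.length() > 0) {
                runs.add(currentFont + ":" + run);
                run.setLength(0);
            }
            currentFont = chosen;
            run.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (run.length() > 0) {
            runs.add(currentFont + ":" + run);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Assumption: the Latin font only covers code points below 0x100.
        ModelFont latin = new ModelFont("Helvetica", cp -> cp < 0x100);
        // Assumption: the Chinese font covers Han characters and Latin text as well.
        ModelFont cjk = new ModelFont("STSong-Light",
                cp -> cp < 0x100
                        || Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);

        String text = "Hello \u4e16\u754c"; // "Hello " followed by two Han characters
        System.out.println(assign(Arrays.asList(cjk, latin), text));
        System.out.println(assign(Arrays.asList(latin, cjk), text));
    }
}
```

With the Chinese font first, the whole string ends up in a single run in that font. With the English font first, only the two Han characters fall through to the Chinese font. In the real library, the equivalent steps would be adding the fonts to the FontSelector in that order and then calling its process method on the text.
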
Even though the 3rd solution worked in my case, and is also deployed in the project, I am still not entirely satisfied. Following are a few of the drawbacks that I encountered with the FontSelector solution:
  • Creation of local hyperlinks (local goto) becomes tedious if the goto text can be multi-language. One has to (at least I did) iterate over the chunks returned by the FontSelector's process method to create the appropriate hyperlinks. In my case, I was creating the hyperlinks for the table of contents, and it was a tedious job to iterate over each item that was going to be a part of the Table of Contents.
  • Creation of bookmarks is pretty straightforward, but where the bookmark item was in an Asian language, the bookmark didn't work (i.e. it didn't link to the target content). Maybe I implemented it wrong, but most of the other things worked.
  • Let's say one has not installed an Asian Font Package for Adobe on the system. If a page in the created PDF is entirely in English except for just one character in an Asian language, the user cannot view that entire page.

I guess if I had read up properly on the iText library, I might not have faced so many issues. But alas, the time constraints.
Anyhow, that is it about the Asian language support in the iText library (the part which I was able to explore).

Now a slight deviation from the topic ... (Feel free to skip ... :-) )
The most intriguing problem while doing this was setting the page numbers in the Table of Contents. As the content is written to the document sequentially, it is very difficult to know beforehand how many pages the content is going to span. Also, if there are many Table of Contents entries, the Table of Contents itself may span multiple pages. Due to time constraints we were unable to devise a solution, even though many wild ideas were discussed at the time :-P. Hopefully someday we will be able to tackle this problem.
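
A possible two-pass approach, sketched here purely as an illustration and not something from the project: lay out the content once just to learn how many pages each section takes, then work out the page numbers before writing the real document. The sketch below is plain arithmetic with no iText calls. It assumes the Table of Contents sits at the front, has one entry per section, fits a fixed number of entries per page, and that every section starts on a new page. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Two-pass page numbering sketch; plain arithmetic, no iText calls. */
class TocPageNumbers {

    /**
     * @param sectionPageCounts pages each section took in a first, throwaway layout pass
     * @param entriesPerTocPage how many Table of Contents lines fit on one page (assumed fixed)
     * @return 1-based page number on which each section starts in the final document
     */
    static List<Integer> startPages(List<Integer> sectionPageCounts, int entriesPerTocPage) {
        int entries = sectionPageCounts.size();
        // Ceiling division: the Table of Contents itself may span several pages.
        int tocPages = (entries + entriesPerTocPage - 1) / entriesPerTocPage;

        List<Integer> starts = new ArrayList<>();
        int next = tocPages + 1; // the first section begins right after the Table of Contents
        for (int pages : sectionPageCounts) {
            starts.add(next);
            next += pages;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three sections of 3, 1 and 5 pages, with only 2 entries fitting per TOC page.
        System.out.println(startPages(Arrays.asList(3, 1, 5), 2)); // prints [3, 6, 7]
    }
}
```

The price of this idea is that the content has to be laid out twice, and the first pass must use exactly the same fonts and page size as the final one, otherwise the page counts will not match.
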

If you have any suggestions or comments, feel free to let me know ...

That's all for now ...
:-)