Adobe Devanagari Font

August 18, 2012

For those who wonder how the statistics based on the dictionaries can be obtained, here is the method I used (using Python regular expressions).

As John said, a compound is a sequence of virama-separated consonants. A consonant can be defined by the Python regular expression

   [\u0915-\u0939\u0958-\u095F]

(I just rewrote in Python what John wrote in words above). The Virama is \u094D. A compound is thus a sequence of 1 or more consonants, each followed by \u094D (whence the + operator in the code below), with a consonant appended to it. For grouping, I used the non-capturing form (?:expression) to avoid the back-referencing mechanism; with a capturing group, re.findall would return only the group rather than the whole match.

The dictionary contains one word per line. The following program reads it line by line, finds the compounds in each line, and outputs them directly.

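# Python 2 (uses the ur'' literal and the print statement)
import re, sys

# One or more (consonant + virama) pairs, then a final consonant
compound = ur'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
f=open(sys.argv[1])
word = f.readline().decode('utf-8')
while word:
  # Print every candidate compound found in the current word
  for comb in re.findall(compound, word):
    print comb.encode('utf-8')
  word = f.readline().decode('utf-8')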

If we call this stub compounds.py and if the dictionary is aspell.txt, then

python compounds.py aspell.txt

outputs the compounds (I should rather say the candidates for producing compounds), one per line, as many times as they occur. That should work on any platform.
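For readers on Python 3, where the ur'' literal and the print statement no longer exist, the same idea can be sketched roughly as follows. This is not Michel's program, only an approximation of it; the names find_compounds and iter_compounds are made up for illustration.

```python
import re

# Devanagari consonants (U+0915-U+0939, U+0958-U+095F) and the virama (U+094D).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
VIRAMA = "\u094D"
# One or more (consonant + virama) pairs followed by a final consonant.
COMPOUND = re.compile("(?:" + CONSONANT + VIRAMA + ")+" + CONSONANT)


def find_compounds(word):
    """Return every candidate compound in word, in order of appearance."""
    return COMPOUND.findall(word)


def iter_compounds(lines):
    """Yield the candidate compounds of each line of an iterable of words."""
    for line in lines:
        for comb in find_compounds(line):
            yield comb
```

To process a dictionary file, one would open it with open(path, encoding="utf-8") and print each item that iter_compounds yields for the file object.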

To get a more sophisticated output, you can make a more involved Python program, or just use standard unix commands if you are on Linux or OS X. The first thing to do is to sort the compounds so that they are grouped together in the output and then use the unix command uniq with the option -c to count them.

python compounds.py aspell.txt | sort | uniq -c

Here are the first lines of the output:

223 क्क
1 क्क्क
1 क्क्ड़
32 क्ख

So there are 223 occurrences of क्क. It would be interesting now to have those numbers in descending order. Again, all that is needed is to sort those last lines according to the numerical value (option -n) and I'll choose the reverse order (option -r). Here is the full command.

python compounds.py aspell.txt | sort | uniq -c | sort -n -r

Here are the first lines of the output

3200 प्र
1614 त्र
1599 क्ष
961 स्त
924 र्ण

Once you know the compounds, you can search for the words containing them in the dictionary using the unix command grep. Very little programming is thus required.
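A rough Python equivalent of that grep step, for those who prefer to stay in one language, might look like this. The function name words_containing and the three sample words are assumptions for illustration only.

```python
def words_containing(compound, words):
    """Return the words that contain compound as a substring, like a plain grep."""
    return [w for w in words if compound in w]


# Toy word list; a real run would read the dictionary file instead.
words = ["क्षमा", "लक्ष्मी", "नाम"]
for w in words_containing("क्ष", words):
    print(w)
```

As with grep, this is a plain substring test, so a search for क्ष also reports words in which it is only part of a longer compound such as क्ष्म.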

Of course, to produce a table for LaTeX, the compounds found were put in a Python dictionary and the processing was all done in Python. The full source is 39 lines of Python after removing comments and blank lines (but including the 8 lines above).

Michel
Rem: I removed duplicates in aspell after I produced my last tex files; the number of occurrences for स्त has decreased from 962 to 961.

August 19, 2012

Tremel's list makes more sense to me, based on his sources, although this too contains some loan words and, importantly in terms of glyph set design, includes large numbers of conjuncts that almost all Hindi writers would instead write with anusvara, e.g. संख्या instead of सङ्ख्या. [John Hudson]

Here is a note concerning writing standards for Nepali Wikipedia:

Don't replace ङ् with an ं in the middle of word as in Hindi, use like this शङ्कर not शंकर

August 19, 2012

a font not including the characters for typesetting, say, Serbian ought to be called a Russian font and not a Cyrillic font.

John, with all due respect, I don’t think that the absence of ‘say, Serbian’ glyphs in a font makes it Russian (‘not Cyrillic’)—or Ukrainian, or Kazakh, or Bashkir, for that matter. Just like a font that has no Baltic or Central and East European glyphs does not become… French or English.

August 19, 2012

Indeed, Maxim. And limiting the name of a typeface to a particular language then limits one's ability to extend the language support of that type in future. Adobe Devanagari is a type that in its current version, as intended, supports modern Hindi usage. I think it highly likely that future versions will support Marathi and Nepali. Whether it will ever support a full range of Sanskrit typography, even Vedic, isn't obvious to me.

August 22, 2012

I had forgotten that Devanagari for LaTeX (the devnag package; see its PDF manual) not only handles ligatures differently in Sanskrit and Hindi but also provides three choices through the directives @sanskrit, @hindi and @modernhindi. Here is what the manual says about the directive @hindi:

With @modernhindi there are fewer Sanskrit-style ligatures. Here is a grab from a relevant table in which the conjunct .dg appears.

Michel

August 22, 2012

While the concept that there are different conjunct representations appropriate to different languages is sound, I think some of the LaTeX choices in this regard are questionable. In the illustration you show, I'd say all the 'Sanskrit' ligature forms would be reasonable to use for Hindi and, indeed, are found in plenty of Hindi publishing; there are other Sanskrit ligatures that are not. What LaTeX categorises as 'Modern Hindi' seems to me to represent particular technological limitations of some obsolete typesetting technology that needed to rely on half forms, post forms and explicit halant. I don't think there's anything particularly 'modern' about it; indeed, it now looks to me old-fashioned, like the non-kerning f of old Linotype faces.

August 27, 2012

I've taken some time to look a bit more deeply into this issue.

One thing I've found is that many Devanagari ligatures are formed by composing a version of a letter without a vertical stroke on the right, and many others are formed by stacking letters vertically.

This suggests that the total number of glyphs required for supporting a large selection of ligatures could be significantly reduced by composing these types of ligatures from their component parts.

(EDIT: I see that this has been thought of, and still many ligatures need to be individually drawn - and that you and a colleague from Russia apparently have produced the only two complete - for classical Sanskrit - Devanagari fonts in existence!)

Another thing I noticed is the sad fate of the project to develop a font for the Sama Veda with what are apparently cantillation marks used for the Sama Gana. My advice would have been to release the preliminary font, and wait for the complaints to roll in - because, psychologically, while teaching you about how the less common marks are used may seem like work when you request it in the abstract, once people have a font in their hands that they would like to use, telling you how to fix it so they can use it better seems to them as though they're getting you to do the work.

On USENET, if you ask a question, you may not get an answer. But if someone does give you the wrong answer, lots of people will chime in to correct him!

August 27, 2012

Yes, some conjuncts can be created from sequences of so-called half forms, and the OpenType Layout model for Devanagari enables this. [Unicode also provides a control character mechanism to force half forms if supported by the font; this can be used to override vertical conjuncts with horizontal layout.] If the sequences of half form(s) and full form are carefully kerned in the font, the result can be really quite acceptable. Quite a lot of rare conjuncts are displayed in this way with the Adobe Devanagari fonts, and one can't easily tell from the resulting typeform that one is looking at a half form sequence instead of a ligature (until, that is, one enters an ikar after them, because the variant width ikar selection currently only works with ligatures).
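The control character referred to here is, as far as I know, ZERO WIDTH JOINER (U+200D) placed after the virama. A small Python sketch of the two encodings, with helper names that are purely illustrative:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def conjunct(first, second):
    """Plain sequence: the font chooses a ligature, half form, or visible halant."""
    return first + VIRAMA + second


def half_form(first, second):
    """Sequence with ZWJ after the virama, requesting the half form of first."""
    return first + VIRAMA + ZWJ + second


def codepoints(text):
    """List the code points of text in U+XXXX notation."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


print(codepoints(conjunct("\u0915", "\u0937")))
print(codepoints(half_form("\u0915", "\u0937")))
```

Whether the second string actually renders with a half form depends on the font and the shaping engine, as the post says.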

Now, all that said, in some conjuncts, especially wide ones, a subtle reduction in the width of the component letters is desirable, and then in some cases modification of stroke weights to maintain an even colour on the page. Or some form of optical correction of overlapping segments might be desirable. So half form sequences are not always the way to go.

[As noted in this other thread, the Linotype hot metal composing method formed many individual letters from combinations of half form and long a vowel sign.]

September 4, 2012

quadibloc:

"This suggests that the total number of glyphs required for supporting a large selection of ligatures could be significantly reduced by composing these types of ligatures from their component parts."

It is possible to make a highly professional Devanagari font consisting mainly of half-forms. The Rigveda edition by Prof. R. L. Kashyap and Prof. S. Sadagopan published in printed book form in 1998 and also downloadable at my website as PDF files was typeset using such a two-halves font. See here

http://www.sanskritweb.net/rigveda

Such a two-halves font requires extremely meticulous planning, though.

September 4, 2012

Uli: Such a two-halves font requires extremely meticulous planning, though.

Yes, and it usually requires careful kerning too. We've made use of contextual variant half and post (conjunct-final) forms in some fonts.

I'm in the process of rationalising the glyph set for a new Hindi font, and am removing a good number of the transliteration ligatures that we included in Adobe Devanagari (most of which can be shaped with half forms anyway) and adding some more vertical ligatures to cover Sanskrit quoted words without explicit halant.

September 4, 2012

On the Internet Archive, I found a book, "The Bible of Every Land", which was the source of an image of the Multani abugida I saw on a web site somewhere.

It contained some interesting items in the part that illustrated several alphabets. Arabic was shown with more than four forms of each letter - and a set of the most necessary or common ligatures for Sanskrit was shown on another page.

Hmm. Maybe the copy I found with a misspelled title was on Google Books - there were several in the Internet Archive with the correct title.

September 4, 2012

[Wandering off-topic.]

Arabic was shown with more than four forms of each letter

As is entirely normal. To my knowledge, there is not a single Arabic or Persian work on Arabic writing that uses the 'four-forms-per-letter' analysis. It is a European misconception, most likely introduced by Biblical scholars whose prior knowledge was of Syriac (for which the analysis makes sense). Arabic and Persian scribes would never have made that error, because they would be aware that there was no attested style of Arabic writing for which the four-form analysis is true, not even the most geometric 'kufi'. Unfortunately, it is an error that persists in almost all introductory grammars of Arabic in English, and forms the basis of the OpenType Arabic shaping model, which requires letters to be mapped to initial, medial and final forms before one can apply the actual joining rules of the script or of particular styles.

September 4, 2012

Sigh.

Just when I think I am getting a handle on the data, I come across another conjunct list that completely contradicts the others. Uli, I am hoping that you are familiar with a 'Saṃyoga Table' document that lists Sanskrit conjuncts as found in four sources:

Coulson, Michael. Teach yourself Sanskrit.
Monier-Williams. A practical grammar of the Sanskrit Language.
Vasu, S.C. Aṣṭādhyāyī of Pāṇini.
Agenbroad, J.E. 'Difficult characters: a collection of Devanagari conjunct consonants' (in Bulletin 38 of the International Association of Orientalist Librarians).

I have this document as a PDF, but have been unable to find it online again this evening (the contents are images, not live text, so it is impossible to search for effectively). If you do not have it, I would be happy to send it to you, but I am guessing that you know it.

Over the past few days, I have been collating my draft Hindi glyph set specification with Ernst Tremel's list, with your higher frequency Sanskrit list, and with the Hunspell and Aspell frequency analysis that Michel did (the latter mostly useful in that it includes sample words that make it easy to determine whether a conjunct is found only in loan words). I have been performing a kind of triage, identifying conjuncts that could be removed from my glyph set spec. I thought I had identified 63 conjuncts that could be safely removed from a font that didn't aim to support modern loan words, e.g. a font for pre-modern literary Hindi, which is what I happen to be working on at the moment. I was happy with this number, but Fiona spotted one conjunct in this group that she recalled seeing in Sanskrit, so I decided to double check against your Sanskrit ligature list in the Itrans manual, and also against the Saṃyoga Table document. I was dismayed to find that 52 of these conjuncts appear in the Saṃyoga Table but not in your Itrans list. These are (with my glyph names, not standard romanisation; sources as indicated in the Saṃyoga Table):

I understand and appreciate the method you used to arrive at your ligature list, which is why I'm inclined to consider it reliable. But I am worried by so many conjuncts occurring in the Saṃyoga Table that are not in your list (and I am only showing here the intersection of the Table with those glyphs that I am considering removing from my spec; I believe there are additional conjuncts in the Table that are not in your list or in my spec).

I wonder if you might be able to share any insight on these discrepancies? In the meantime, I will write to James Agenbroad, whom I know from Unicode circles, and will try to get a copy of the journal with his original collection.

September 4, 2012

Mr. Hudson:

"ब्न | dBNa | Coulson"

The Agenbroad collection is "old hat" to me.

In my comprehensive book "Conjuncts Consonants in Sanskrit", which is an unpublished work according to German Copyright Law and according to the Revised Berne Convention and which I cannot yet make available to others prior to publication, I wrote this on the invented ligature "bn":

"Charles Wilkins (1808), Alix Desgranges (1845), M.R. Kale (1894), Richard Fick (1922), A.A. MacDonell (1926), H.M. Lambert (1953), Michael Coulson (1973) and Madhav M. Deshpande (1997) invented the conjunct consonant "bn". If "bn" were no invention, then a Sanskrit word would exist containing "bn". But there does not exist any such Sanskrit word. Therefore "bn" has been invented. Yet "bn" could occur in a foreign-language word, e.g. in "abnormal", inserted into Neo-Sanskrit texts. But serious scientific research on Sanskrit conjunct consonants must dismiss such "abnormalities"."

If you search for "bn" in the huge Sanskrit dictionary file

http://www.sanskritweb.net/sansdocs/reverse1.pdf

you will not find any Sanskrit words containing "bn"

Since Sanskrit words with "bn" are non-existent, it does not make much sense to design the ligature "bn" for a Sanskrit font, because this ligature will never be used on account of the fact that Sanskrit words with "bn" do not exist.

More than 300 of the approx. 1000 conjunct consonants, i.e. roughly 30% of the collection by Agenbroad, are entirely fictitious, as far as Sanskrit is concerned.

Some of these 300 ligatures compiled by Mr. Agenbroad may occur in foreign-language loan words in Hindi and Marathi texts, but they definitely do not occur in ancient Sanskrit texts. That is for sure.

In the funny book "Spoken Sanskrit" edited by S. S. Janaki and published by the Kuppuswami Sastri Research Institute in 1990, you will find a fictitious Sanskrit report on a tennis match at Wimbledon between Björn Borg and John McEnroe.

For typesetting the name "Wimbledon" in Devanagari, you could invent the ligature "mbl", and in fact, this invented ligature is contained in the Agenbroad collection and also contained in the book by Mariano Rampolla del Tindaro, Lingua Sanscrita, Romae 1936. But this does not mean that this entirely fictitious ligature "mbl" will ever be required for typesetting an ancient Sanskrit text, since there never existed a Sanskrit word containing "mbl". This is a fact which you can check yourself here:

www.sanskritweb.net/sansdocs/reverse1.pdf

or here:

www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

Do a substring search for "bn" and "mbl" in the Cologne Digital Sanskrit Lexicon (166,434 entries). You will get no hits at all for these entirely fictitious ligatures.

So, why should you care to design fictitious ligatures for fictitious Sanskrit words?

Note:

The "Samyoga table" mentioned by Mr. Hudson is downloadable here

http://www.ctan.org/tex-archive/language/sanskrit as file sktdoc.ps

If you convert this ps file to a pdf file, it will be searchable.

September 5, 2012

For a pdf version of the Samyoga table, cf https://typography.guru/forums/topic/106109-forwarding.

September 5, 2012

Thank you for the detailed response, Uli. I had assumed that at least some of the Samyoga Table entries were 'fictitious', to use your term, either the result of loan words in Devanagari contexts that were not limited to Sanskrit, or of misreadings of manuscripts or poorly proofread texts. But you can perhaps appreciate my concern, as a non-Sanskritist, to find so many questionable entries. Thankfully, on my current project I have a number of Sanskritists to whom I will eventually be able to submit my Sanskrit glyph set for review. And I do plan to double-check against the Cologne Lexicon; thank you for this suggestion.

September 17, 2012

Searching in the Cologne lexicon seems odd in that the dictionary uses the Harvard-Kyoto romanisation system, which relies on case sensitivity and digraphs, but the search results are case-insensitive and can't distinguish digraph occurrences from individual letters. Hence, if I do a substring search for 'ksT' I get the same results as for 'kst' and for 'kSt' and 'kST', and these include instances of 'th' as well as 't'.
_____

1 digvidikstha mfn. situated towards the cardinal and intermediate points , encompassing MW.
2 pratyaksthalI f. N. of a Vedi1 R.
3 pRthaksthita mfn. existing separately , separate MW.
4 pRthaksthiti f. separate existence , separation Vikr.
5 RksthA mfn. consisting of R2ic verses Ta1n2d2yaBr. xvi , 8 , 4.
6 samyaksthiti f. remaining together BhP. Sch.
7 uttaradikstha mfn. situated in the north , northern.
8 vAkstambha m. paralysis of speech Va1gbh.
_____

I presume from context that 'th' = थ and not त्ह, although the H-K system seems to have no way to distinguish them! Does the system rely on features of the language to avoid ambiguity, or is ambiguity simply accepted (in which case it seems a really bad system to use for a digital lexicon)?
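The kind of search that would avoid the false hits above can be sketched in Python: split the Harvard-Kyoto string into letters first, treating the aspirate digraphs as single units, and then compare letter sequences case-sensitively. This is only an illustration under the assumption that a stop followed by 'h' is always an aspirate; it cannot resolve the त्ह ambiguity raised here, and the function names are made up.

```python
import re

# An aspirated stop (kh, gh, ch, jh, Th, Dh, th, dh, ph, bh) or any single character.
HK_TOKEN = re.compile(r"[kgcjTDtdpb]h|.")


def tokenize(word):
    """Split a Harvard-Kyoto string into letters, keeping aspirate digraphs whole."""
    return HK_TOKEN.findall(word)


def contains_cluster(word, cluster):
    """Case-sensitive search for cluster as a run of whole letters in word."""
    w, c = tokenize(word), tokenize(cluster)
    return any(w[i:i + len(c)] == c for i in range(len(w) - len(c) + 1))


for entry in ["vAkstambha", "pRthaksthiti"]:
    print(entry, contains_cluster(entry, "kst"))
```

With this approach only vAkstambha is reported for 'kst', while pRthaksthiti is found by searching for 'ksth' instead.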

September 18, 2012

Mr. Hudson:

My own PDF file

http://www.sanskritweb.net/sansdocs/reverse1.pdf

is case-sensitive, if you activate case-sensitivity in Adobe Acrobat.

Furthermore the old program by Louis Bontes

http://members.chello.nl/l.bontes/mwsdd.gif
http://members.chello.nl/l.bontes/sans_n.htm

is also case-sensitive, as far as I remember.

September 18, 2012

Thanks, Uli. I did searches in the Cologne lexicon and visually reviewed the results, and also used the Sanskrita converter to check results in Devanagari.

Specifically, I did substring searches for conjuncts that are in my current draft glyph set and also in the Samyoga list but not in your frequency list. Most of these, as you would predict, do not occur in the Cologne lexicon. A few do, though, although most are very rare:

1 च्न cn in हस्तिकच्नि hastikacni (a kind of bulbous plant L.)
1 ज्द jd in भुज्दृश् bhujdRz (or mfn. accompanied by distortion of the eyes (as a fever) Bhpr.)
4 ज्न jn as in अरसज्न arasajna (mfn. having no taste for , not taking interest in MBh. xii , 6719)
2 झ्य jhy as in झ्यु jhyu (cl. 1. A1. v.l. for %{jyu}.)
9 द्ब्र dbr as in सद्ब्रह्मन् sadbrahman (n. the true Brahman ib.)
1 न्त्स nts in अनश्नन्त्साङ्गमन anaznantsAGgamana (m. the sacrificial fire in the Sabha1 ... S3Br.)
4 ल्ज lj as in पश्चाल्जन pazcAljana (m. Pa1n2. the people in the west Var)
1 ल्स ls in अतिपेशल्स् atipezals (mfn. very dexterous.)
12 ष्ट्व STv as in हविष्ट्व haviSTva (n. the being an oblation Nya1yam. Sch.)
35 स्त्व stv as in अनागास्त्व anAgAstva (n. sinlessness RV.)

September 18, 2012

Mr. Hudson:

My answers to your questions are at the end of each line after ---

1 च्न cn in हस्तिकच्नि hastikacni (a kind of bulbous plant L.) --- scanning error of हस्तिकन्द
1 ज्द jd in भुज्दृश् bhujdRz (or mfn. accompanied by distortion of the eyes (as a fever) Bhpr.) --- scanning error of भुग्न-दृश्
4 ज्न jn as in अरसज्न arasajna (mfn. having no taste for , not taking interest in MBh. xii , 6719) --- scanning error of अरस-ज्ञ
2 झ्य jhy as in झ्यु jhyu (cl. 1. A1. v.l. for %{jyu}.) --- jhyu and jyu are listed in the Dhatupada 22, 60, but are not attested in any real texts.
9 द्ब्र dbr as in सद्ब्रह्मन् sadbrahman (n. the true Brahman ib.) --- attested conjunct, frequency 0.277%, i.e. very frequent and included in my frequency list
1 न्त्स nts in अनश्नन्त्साङ्गमन anaznantsAGgamana (m. the sacrificial fire in the Sabha1 ... S3Br.) --- an-aznan-t-sAGgamana is formed by n+"t"+s and is contained in my frequency list as a peculiar Vedic sandhi in the Shatapatha-Brahmana. In other Sanskrit texts, n+s is used instead of n+"t"+s, i.e. "t" is not inserted.
4 ल्ज lj as in पश्चाल्जन pazcAljana (m. Pa1n2. the people in the west Var) --- scanning error of पञ्चाल-जन
1 ल्स ls in अतिपेशल्स् atipezals (mfn. very dexterous.) --- scanning error of अति-पेशल
12 ष्ट्व STv as in हविष्ट्व haviSTva (n. the being an oblation Nya1yam. Sch.) --- "STv" is extremely frequent (1.405%) and of course included in my frequency list
35 स्त्व stv as in अनागास्त्व anAgAstva (n. sinlessness RV.) --- "stv" is also extremely frequent (1.339%) and of course included in my frequency list

I am under the impression that you changed your mind and that you now intend to include in the Adobe Devanagari font even the most exotic conjuncts. Nobody would search for "ls", because this combination is impossible in Sanskrit. And as regards "nts", it is an odd peculiarity in some Brahmana manuscripts and hence should only be included in the most specialized Sanskrit fonts.

September 19, 2012

Thanks, Uli. This analysis and documentation isn't related to Adobe Devanagari, but to other projects, some of which will specifically target Sanskrit, but also Hindi and Marathi. I'm trying to make sense of what should be included for each: weeding out modern loan word transliteration forms that won't be needed in the texts in question, taking into account your frequency tests, etc.

The number of scanning errors in the Cologne lexicon is alarming.

September 19, 2012

@John Hudson:
The number of scanning errors in the Cologne lexicon is alarming.

But hardly unexpected. Most OCR software is oriented around the Latin alphabet, as it lends itself better to that process, and the humans available to check the results would presumably have been native speakers of German (or possibly French, Köln not being too far from the border, as its English name attests) rather than Hindi, let alone Sanskrit.

We have to count ourselves lucky that a thing like the Cologne lexicon even exists, even though its imperfections certainly do need to be remedied.

September 19, 2012

quadibloc:

OCR scanning started in 1994. See this lengthy report

http://www.sanskrit-lexicon.uni-koeln.de/CDSL.pdf

September 20, 2012

The report doesn't seem to discuss ligatures, and it seems to come from a stage in the project when that issue did not arise, as a dictionary where the Sanskrit words were all in Latin transliteration was the one used.

Of course, I see your point that my comments were misguided, then, as the excuses I advanced did not apply, and the dictionary ought to be a more accurate source of information on what consonant clusters exist in Sanskrit (if not on how they're written in Devanagari).

September 23, 2012

Uli, your frequency list doesn't include the conjunct 'bn' ब्न, which occurs in Coulson's list but does not show up in substring search on the Cologne lexicon. However, the latter absence surprises me because the lexicon is based on the Monier-Williams dictionary, and Lambert (Introduction to the Devanagari Script, p.42) cites it in the context of this word:

which Fiona reports as given in Monier Williams's dictionary as : 'one whose navel is a Lotus,' N. of Vishnu.
