quadibloc Posted August 12, 2012

I, for one, certainly don't dispute the appropriateness of Adobe using real-world marketing considerations as a guide to the design of its fonts. That is not to say, however, that it would not also be nice if, by adding a limited number of glyphs, the font could be made usable for the other languages written in the Devanagari script besides Hindi. After all, as things stand now, it should really be called a Hindi font and not a Devanagari font - just as a font lacking the characters needed to typeset, say, Serbian ought to be called a Russian font and not a Cyrillic font. Thus it would be good to be able to set Sanskrit after a fashion, and Hindi well - the latter by including ligatures for foreign words commonly transliterated in Hindi.

As for the other point raised, that a limited number of glyphs shown in dictionaries as used by Hindi are missing: if the number really is limited, then perhaps they should be added as well. This is less important, though, since here we are talking about ligatures, for which virama is always an alternative.

Regarding the example above: yes, dga is different from nga, since the former doesn't have a dot. But omitting a dot is much easier than drawing a new character from scratch, so if nga is included, dga comes cheaply from a design point of view and ought to be included if the language could call for it, unless the size of the font comes at a premium.

So I think that while the criticism that Adobe added useless ligatures just for fun can be thrown out, some of the other objections seem to have enough merit that, if on examination they really call for nothing more than a very limited number of additional glyphs, they could be highly constructive.
John Hudson Posted August 12, 2012

Michel, I see some instances in your analysis of letter + virama + independent vowel, e.g. ल्उ. I'm somewhat surprised by this spelling, but in any case this would not constitute a conjunct, which is a sequence of consonant letters separated by virama.
Michel Boyer Posted August 12, 2012

John, I have no doubt you are right, but I prefer having too much and cleaning up afterwards to missing something. If I define the consonants to be range(0x0915, 0x0940) + range(0x0958, 0x0960), am I missing something?
John Hudson Posted August 12, 2012

The consonant letter ranges for Hindi are 0x0915–0x0939 and 0x0958–0x095F inclusive.
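In Python, where range() excludes its end point, John's inclusive ranges would be written with upper bounds one past the last code point. A minimal sketch (the variable name is my own); note that Michel's earlier range(0x0915, 0x0940) would also sweep in U+093A–U+093F, which includes the nukta, avagraha and dependent vowel signs:

```python
# Consonant letters for Hindi, per the inclusive ranges above:
# U+0915..U+0939 and U+0958..U+095F. Python's range() excludes its
# end point, so the bounds are written one past the last code point.
consonants = set(range(0x0915, 0x093A)) | set(range(0x0958, 0x0960))

assert 0x0939 in consonants      # U+0939 HA, last of the first range
assert 0x095F in consonants      # U+095F YYA, last of the second range
assert 0x093E not in consonants  # U+093E AA vowel sign is not a consonant
```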
Michel Boyer Posted August 12, 2012

Thanks, that now gives a clean output. New versions of all files are on my blog for all cases (hunspell, aspell and the dictionaries put together):

https://typography.guru/xmodules/typophile/files/hunspell20120812c.txt
https://typography.guru/xmodules/typophile/files/hunspell20120812c.pdf
https://typography.guru/xmodules/typophile/files/aspell20120812c.pdf
https://typography.guru/xmodules/typophile/files/aspell20120812c.txt
https://typography.guru/xmodules/typophile/files/together20120812b.pdf
https://typography.guru/xmodules/typophile/files/together20120812b.txt

The lines causing trouble disappeared and the others were kept.

Michel
Michel Boyer Posted August 12, 2012

I don't have the Adobe Devanagari font and can't comment on it. Uli's analysis contains 821 compound glyphs. For the aspell dictionary, I made my analysis by first replacing every letter followed by a nukta with the corresponding precomposed character. Out of 475108 bigrams, there were only two occurrences of "u092F nukta"; the counts for the other letters were (number of occurrences in parentheses):

u0915 (316), u0916 (411), u0917 (157), u091C (673), u0921 (1789), u0922 (395), u092B (391)

I then searched for the longest sequences of the form letter virama letter virama ... virama letter and found 627 glyphs that would have to be precomposed to eliminate all occurrences of virama that are not at the end of a word. From the analysis, one sees that 211 compounds are required by only one entry in the dictionary, and only 356 compounds occur in more than two entries. You can find my results in the files on my blog: /files/compounds_20120812b.pdf

I think I see bugs with vowels (bottom of page 10). I would need the right "regular expression" that characterizes possible conjuncts (or is it a problem with the input dictionary file (txt, 1.67M)?).

Michel

New version on my blog.
Michel Boyer Posted August 12, 2012

For the Hunspell dictionary, the file contains only 15990 entries, and at most 343 compounds are found.

/files/hunspell20120812.pdf
/files/hunspell20120812.txt

PS. Here are the weird bigrams that gave nuktas that could not be precomposed:

u0924;u093C; 1 occurrence
u092C;u093C; 2 occurrences
u0939;u093C; 1 occurrence

Added: new version on my blog https://typography.guru/forums/topic/101095-forwarding
Michel Boyer Posted August 12, 2012

I added the files with both dictionaries put together. I applied sort -u to the result to remove duplicates. Some of the resulting entries look weird, but I could confirm with Google that they exist. I am nevertheless surprised that so many entries (7692) were added to the aspell dictionary, which already had 83514 entries, when hunspell has only 15990 entries. I don't understand how such a thing can happen.

/files/together20120812.pdf
/files/together20120812.txt

Added: new versions on my blog https://typography.guru/forums/topic/101095-forwarding
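The merge Michel describes (concatenate the two word lists, then sort -u) amounts to a set union, and counting what one dictionary adds to the other is a set difference. A minimal sketch with placeholder words, not the actual dictionaries:

```python
# Merging two word lists and counting what the second adds to the first,
# as with `sort -u` on the concatenated files. The words are made up.
aspell = {"\u0915", "\u0916", "\u0917"}   # ka, kha, ga
hunspell = {"\u0916", "\u0918"}           # kha, gha

merged = sorted(aspell | hunspell)  # union; duplicates removed
added = hunspell - aspell           # hunspell entries missing from aspell

assert len(merged) == 4
assert added == {"\u0918"}
```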
Uli (Author) Posted August 13, 2012

Mr. Boyer: Thank you very much for compiling the conjunct files. This will help Mr. Hudson to improve his Adobe font. I am pleased to see that your PDF files (together20120812b.pdf etc.) were typeset using the great Siddhanta font made by my correspondent Mihail Bayaryn; see http://www.sanskritweb.net/cakram/index.html and http://svayambhava.blogspot.de/p/siddhanta-devanagariunicode-open-type.html

For lovers of foreign-language Bibles, I should mention that the Sanskrit version of St. John's Gospel was also typeset using Mihail Bayaryn's great Siddhanta font: see http://www.sanskritweb.net/sansdocs/john.pdf
John Hudson Posted August 13, 2012

Uli, if I provide you with a list of conjuncts that do not occur in either the Aspell or Hunspell Hindi dictionaries, would you be able to confirm which occur in your Sanskrit corpus? Of the various attested conjuncts, I am now trying to sort out which are standard Hindi, which would be used for Sanskrit, and which were introduced for transliteration of foreign words.
John Hudson Posted August 13, 2012

Michel: "I am nevertheless surprised that so many entries (7692) were added to the aspell dictionary that already had 83514 entries while hunspell has only 15990 entries. I don't understand how such a thing can happen."

This troubles me too, and makes me wonder how the dictionaries were compiled. I suspect the Aspell collection might be based on a simple corpus analysis, in which case there is a strong likelihood that it contains transliterated foreign words and, as we can see from your analysis, misspellings and incorrect encodings.
Uli (Author) Posted August 13, 2012

Mr. Hudson: "Uli, if I provide you with a list of conjuncts that do not occur in either the Aspell or Hunspell Hindi dictionaries, would you be able to confirm which occur in your Sanskrit corpus?"

Since everything was documented by me, you can quite easily do this yourself. At my subsite http://www.sanskritweb.net/itrans/index.html#SANS2003 please download http://www.sanskritweb.net/itrans/itmanual2003.pdf

On pages 28 through 42 you will find the complete list of attested Sanskrit compounds sorted by Indic alphabet. Another document of interest for font developers is the list of attested orthographic syllables sorted by Indic alphabet, contained in itmanual2003.pdf, pages 76 through 103. The same list sorted by frequency is contained in a separate file: http://www.sanskritweb.net/itrans/ortho2003.pdf

For additional Hindi ligatures, see itmanual2003.pdf, pages 110 through 130.
John Hudson Posted August 13, 2012

Thanks, Uli. I'll take a look at your documents and let you know if I have any questions. It would be great to have versions of these tables that could be used as test documents, e.g. plain text files.
Michel Boyer Posted August 13, 2012

"I am nevertheless surprised that so many entries (7692) were added"

In fact, there were duplicate entries in the aspell dictionary itself, so it is not 7692 but 7819 entries that were added to it from the hunspell dictionary. I saved them to a file that I posted on my blog:

https://typography.guru/xmodules/typophile/files/added_20120813.txt

Among those 7819 entries I can see many proper names, but I have no idea of the proportion. That there are errors in the aspell dictionary does not explain how it could have missed 7819 entries out of 15990, that is 48.9%. It means that only 51.1% of the hunspell entries, i.e. 8171 entries, are also in aspell (which contains over 83000 entries).
Uli (Author) Posted August 14, 2012

Mr. Hudson: "Thanks, Uli. I'll take a look at your documents and let you know if I have any questions. It would be great to have versions of these tables that could be used as test documents, e.g. plain text files."

As a companion to adobe-ligatures-analysis.pdf (see above), I uploaded the file http://www.sanskritweb.net/itrans/adobe-ligatures-analysis.txt containing nothing but the Devanagari ligatures in plain 16-bit Unicode encoding. This 16-bit Unicode file has the following internal structure:

FF FE - (16-bit Unicode file identification signature)
2A 09 - प (Devanagari p)
4D 09 - ् (Devanagari virama)
30 09 - र (Devanagari r)
0D 00 - CR (carriage return)
0A 00 - LF (line feed)
etc. etc.

So the first line of this file is the Devanagari ligature प्र. MS Word recognises adobe-ligatures-analysis.txt as a 16-bit Unicode file.
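The byte layout Uli describes (FF FE signature, then little-endian 16-bit code units) is UTF-16LE with a byte order mark. A small verification sketch of the first line:

```python
# The byte sequence described above: BOM, then pa + virama + ra, then CR LF.
raw = bytes([0xFF, 0xFE,   # UTF-16 byte order mark (little-endian)
             0x2A, 0x09,   # U+092A Devanagari pa
             0x4D, 0x09,   # U+094D virama
             0x30, 0x09,   # U+0930 Devanagari ra
             0x0D, 0x00,   # CR
             0x0A, 0x00])  # LF

text = raw.decode('utf-16')  # the BOM selects little-endian decoding
assert text.splitlines()[0] == '\u092A\u094D\u0930'  # the ligature pra
```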
Uli (Author) Posted August 14, 2012

Mr. Hudson: And here are the 460-odd attested Hindi ligatures drawn from Hindi dictionaries, compiled and exemplified by Ernst Tremel, and contained on pages 110 through 130 of the Itranslator manual: http://www.sanskritweb.net/itrans/itmanual2003.pdf

This ligature list compiled by Ernst Tremel is downloadable as a plain Unicode file: http://www.sanskritweb.net/itrans/hindi-ligatures.txt

This file contains nothing but the Devanagari ligatures in plain 16-bit Unicode encoding, in the same manner as the companion Unicode file adobe-ligatures-analysis.txt.
John Hudson Posted August 14, 2012

Many thanks, Uli. I have correlated my most recent draft glyph set to Michel's analysis of the Hunspell and Aspell dictionaries, and am now in the process of correlating it to your Sanskrit ligature list and Ernst Tremel's Hindi list. The number of ligatures unattested in these sources is gradually being whittled down, and most that remain are either what I classify as systematic inclusions - i.e. representative of core aspects of the writing system qua system, such as merged -R forms of each letter - or are fairly obvious transliteration sequences. With regard to the latter, I've come to the conclusion that their inclusion needs to be a matter of the individual font and its intended purpose. Coincidentally, Fiona just communicated to me a request from a major Bengali language newspaper publisher to include an SPL conjunct ligature for the transliteration of the English loan word 'splinter', which apparently occurs often enough in the context of politics to be afforded such treatment. Recently, we worked on user interface fonts for Hindi and analysed the localised strings for the operating system and other software, finding such unexpected transliterated conjuncts as TZV (barmitzva). Not all of these end up with ligature solutions - if I can get the half-form shaping for rare sequences to look good, that's what I'll use - but they illustrate the reality of transliterated loan words in modern Hindi. At present, I am working on a font specifically for pre-modern Hindi texts, so being able to identify the source and attestation for different groups of conjunct ligatures is very helpful.
Uli (Author) Posted August 14, 2012

Mr. Hudson: I wish you good luck in finishing your Adobe Devanagari font. Ligatures are a never-ending story. Look at your own name "Hudson". Sanskrit and Hindi do not have "ds" as a sound combination, because the soft dental "d" would be assimilated to the hard dental "t" before "s", resulting in "ts", which is available as a Hindi ligature. But you would not want your name transliterated in Devanagari as "Hutson", would you? Therefore, transliterating your name in Devanagari would require a new ligature, namely for "ds". Namaste
Michel Boyer Posted August 14, 2012

"finding such unexpected transliterated conjuncts as TZV (barmitzva)"

Knowing that barmitzva is itself a transliteration from the Hebrew* בַּר מִצְוָה, where I expect the letter צ (tsadi) would normally give rise to the sound ts and not tz, I find that surprising too (even if some might voice it before a voiced consonant, and I wonder who would).

*(מִצְוָה is certainly Hebrew; בַּר is Aramaic)
John Hudson Posted August 14, 2012

One of our Indian associate designers has taken a look at the Hunspell and Aspell Hindi lists, and confirms my suspicion that they are heavily loaded with multiple transliterations or transcriptions of foreign loan words, e.g. इंग्लिश इंग्लैंड इग्लैंड ग्लव्स - inglish, ingland, iglaind, and glov(e)s. This, of course, makes them useful for their purpose of spellchecking modern Hindi usage, but I'm finding Ernst Tremel's list more reliable in terms of being sourced from *mostly* Hindi words only. I find the presence of conjuncts beginning with ङ that are not attested in either Hunspell or Aspell unnerving, and note that even in his list there are transliterations or transcriptions of naturalised terminology, e.g. 'anglo-indiyan' or 'kongres'.
Michel Boyer Posted August 15, 2012

The 8 lines of Python in the previous post can be replaced by the following (which first reads the full dictionary as a list of words; that may cause trouble on very large dictionaries but works fine for me on aspell.txt). The with-statement takes care of closing the input file.

---
import re, sys

compound = r'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
with open(sys.argv[1], encoding="utf-8") as infile:
    listwords = infile.read().split()
for word in listwords:
    for comb in re.findall(compound, word):
        print(comb)
---

The advantage is that the processing loop is extremely simple.
John Hudson Posted August 17, 2012

Here's a summary of my thinking on this, after taking time to correlate my Devanagari glyph set spreadsheet with the data Michel derived from the Hunspell and Aspell dictionaries and with Ernst Tremel's list of Hindi conjunct ligatures.

I am suspicious of both the Hunspell and Aspell dictionaries as attestation sources for many Hindi conjuncts, simply because they seem to include high numbers of transliterated or transcribed foreign loan words. Tremel's list makes more sense to me, based on his sources, although it too contains some loan words and, importantly in terms of glyph set design, includes large numbers of conjuncts that almost all Hindi writers would instead write with anusvara, e.g. संख्या instead of सङ्ख्या. Also, Tremel's list does not include frequency data.

In terms of the Adobe Devanagari set, there are about 100 conjunct ligatures that are not attested in any of the lists, of which a significant number are of Sanskrit origin, their presence in the Linotype list presumably reflecting the Sanskrit conjuncts most frequently encountered in words quoted by Linotype's Indian customers of the time. A few are what I consider 'systematic inclusions', i.e. those whose existence is implied by the writing system rather than by its application to a particular language, e.g. a merged rakar form of the nukta letters. The remainder are presumed to be transliteration and transcription forms for loan words, as requested by Linotype's customers. As noted by Uli, once you start including such forms, it becomes an open-ended set, and the only way to constrain it is by frequency. But, of course, the frequency of foreign loan words depends entirely on the nature of the texts examined, which accounts for the differences between the Hunspell, Aspell and Linotype/Adobe sets.
What I'm left with is a set of data that includes a fairly large number of Hunspell and Aspell transliteration conjuncts that are not covered in the Adobe Devanagari set, and vice versa. The most useful aspect of this, at the moment, is that it suggests ways in which I might reduce the draft glyph set of a font I am working on for pre-modern Hindi texts, in which few loan words are expected to be encountered. For fonts targeting modern Hindi usage, the usefulness is less clear, but perhaps there might be candidates among the more common transliteration conjuncts from Hunspell and Aspell that could be added. On the other hand, I'm pleased to see just how many of the conjuncts not supported by ligatures in the Adobe Devanagari font shape very well with half forms (which we carefully kerned to that purpose). Uli, you suggested early in this thread that the Adobe Devanagari set would need only about a dozen more ligatures in order to adequately support classical Sanskrit. This surprised me. The first document to which you linked in this discussion seems to be a subset of Sanskrit conjuncts, certainly much shorter than the complete list in your manual. Can you explain the basis of this subset? Is it based on frequency, or on a particular set of texts?
Uli (Author) Posted August 17, 2012

Mr. Hudson: "Uli, you suggested early in this thread that the Adobe Devanagari set would need only about a dozen more ligatures in order to adequately support classical Sanskrit. This surprised me. The first document to which you linked in this discussion seems to be a subset of Sanskrit conjuncts, certainly much shorter than the complete list in your manual. Can you explain the basis of this subset? Is it based on frequency, or on a particular set of texts?"

The first document, namely http://www.sanskritweb.net/itrans/adobe-ligatures.pdf, lists attested Sanskrit conjuncts in descending frequency order. I cut off this partial list at the frequency of 0.010%, namely here:

ङ्घ्र्य ṅghry (!!!) 0.010%

If the Adobe Devanagari font contained all attested Sanskrit ligatures down to a frequency of 0.010%, it would be a very good Sanskrit font, although it would not include the rarest ligatures with a frequency below 0.010%. Those rarest ligatures are only covered by our own highly specialized Itranslator fonts Sanskrit2003.ttf, Chandas.ttf and Siddhanta.ttf, downloadable here: http://www.sanskritweb.net/itrans/

However, the Adobe Devanagari font, primarily designed for Hindi, could also be a satisfactory Sanskrit font if it contained a subset of Sanskrit ligatures, provided that at least the most frequent ones were covered. It is up to you where you cut off the frequency list in adobe-ligatures.pdf. For example, the present version of the Adobe Devanagari font contains all Sanskrit ligatures down to a frequency of 0.215%, because the first ligature not covered by this font is

द्ध्व ddhv (!!!) 0.215%

If you were to include ड्ग as well, the font would cover all Sanskrit ligatures down to a frequency of 0.167%, namely

ड्ग ḍg (!!!) 0.167%

and so on. You could proceed and cut off at

द्द्व ddv (!!!) 0.119%, or at
ङ्घ्र ṅghr (!!!) 0.109%, or at
द्द्र ddr (!!!) 0.109%, or at
ड्य ḍy (!!!) 0.097%, or at
...
ङ्घ्र्य ṅghry (!!!) 0.010%

If the font included all ligatures down to ङ्घ्र्य, id est down to a frequency of 0.010%, it would be a very good Sanskrit font. But, I repeat, you may cut off the list much earlier.

As regards the attestations, I made frequency analyses of original electronic Sanskrit files. The German University of Göttingen hosts a huge collection of electronic Sanskrit files: http://gretil.sub.uni-goettingen.de/gretil.htm

Ten years ago, I started with the proofread entire Mahabharata: http://bombay.indology.info/mahabharata/statement.html

However, it only makes sense to analyze proofread electronic texts (http://bombay.indology.info/mahabharata/history.html), because text files crammed with typos result in erroneous frequency counts.

For making a good Hindi font, someone would have to analyze proofread electronic Hindi files and make frequency counts. It would then become clear which ligatures should be included in a good Hindi font and which should be omitted. I should also mention that the German University of Cologne hosts a huge electronic Sanskrit dictionary with more than 150,000 entries (http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html), which I likewise analysed completely for my own ligature frequency counts ten years ago. A similar electronic dictionary should also be available for Hindi. Mr. Boyer's short Hindi word files mentioned above in this thread are not sufficient for large-scale frequency counts.
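Uli's cutoff procedure - keep every conjunct whose corpus frequency is at or above a chosen threshold - can be sketched with the figures he quotes. The list below contains only the handful of examples from his post, not his full frequency data:

```python
# Conjuncts with their frequencies (in percent) as quoted above.
freqs = [
    ("द्ध्व", 0.215), ("ड्ग", 0.167), ("द्द्व", 0.119),
    ("ङ्घ्र", 0.109), ("द्द्र", 0.109), ("ड्य", 0.097),
    ("ङ्घ्र्य", 0.010),
]

def cutoff(pairs, threshold):
    """Keep conjuncts whose frequency is at or above the threshold."""
    return [conjunct for conjunct, freq in pairs if freq >= threshold]

assert len(cutoff(freqs, 0.119)) == 3   # ddhv, dg and ddv make the cut
assert len(cutoff(freqs, 0.010)) == 7   # everything quoted here
```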
John Hudson Posted August 18, 2012

Thanks, Uli. This is very useful. Later this year I'll be working on a font specifically for Sanskrit, and your analysis is a huge boon for anyone working in this area.

"For making a good Hindi font, someone would have to analyze proofread electronic Hindi files and would have to make frequency counts. Thereafter it would become clear, which ligatures should be included into a good Hindi font and which should be omitted."

Yes. It seems this resource doesn't exist yet.