X - International Journal of Information Science and Computer Mathematics (Closed Ed TRF)
Volume 5, Issue 1, Pages 19 - 32
(February 2012)
STATISTICAL SUBSTRING PROPERTIES OF ARABIC PROPER NAMES
Salah Al-Sharhan, Fahhad Alharbi and Fawaz S. Al-Anzi
Abstract: Non-diacritization is a deep-seated property of Arabic orthography. Attempts to produce tools that generate diacritics for general text are underway. Most tools developed in this area concentrate on analyzing and understanding the text (sentence) and produce the proper diacritics of a word according to its position in the text. Such tools are still at an early stage, and it will be quite some time before an efficient general-purpose tool for diacritics generation is produced. The purpose of this paper is to study substring pattern behaviour in diacritized Arabic names. Pattern behaviour is very important in word prediction, and the results can be applied to automatic diacritization and transliteration of Arabic names.
Statistical properties of a natural language are among the most important aspects of language analysis. The Number of Different Words (NODW) and the Different Word Usage Ratio (DWUR) are general characteristics of a corpus. We adapt these concepts to calculate the Number of Different Segments (NODS) and the Different Segment Usage Ratio (DSUR) in order to statistically analyze our corpus of Arabic names. Renewal and clump counts are further statistical metrics that shed light on substring pattern behaviour. The names corpus was analyzed for these measures and the results are tabulated.
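The measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formal definitions, so everything here is an assumption: a "segment" is taken to be a contiguous substring of fixed length n, DSUR is taken to be NODS divided by the total number of segments, a renewal count is taken to be the number of non-overlapping occurrences of a pattern found scanning left to right, and a clump count is the number of maximal groups of mutually overlapping occurrences. The function names and the Latin-transliterated sample names are hypothetical and are not taken from the paper's corpus.

```python
from collections import Counter


def segments(name, n):
    """Return all contiguous substrings of length n in name (assumed segment definition)."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def nods_and_dsur(names, n):
    """Return (NODS, DSUR) for length-n segments over a list of names.

    Assumption: DSUR = NODS / total number of segments.
    """
    counts = Counter(seg for name in names for seg in segments(name, n))
    total = sum(counts.values())
    nods = len(counts)
    dsur = nods / total if total else 0.0
    return nods, dsur


def occurrences(text, pattern):
    """Start positions of all occurrences of pattern in text, overlaps included."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def renewal_count(text, pattern):
    """Count non-overlapping occurrences, scanning left to right."""
    count, next_free = 0, 0
    for pos in occurrences(text, pattern):
        if pos >= next_free:
            count += 1
            next_free = pos + len(pattern)
    return count


def clump_count(text, pattern):
    """Count maximal groups of overlapping occurrences."""
    count, clump_end = 0, -1
    for pos in occurrences(text, pattern):
        if pos > clump_end:  # starts past the current clump, so a new clump begins
            count += 1
        clump_end = pos + len(pattern) - 1
    return count


# Hypothetical transliterated names, used only for illustration.
sample_names = ["ahmad", "hamad"]
nods, dsur = nods_and_dsur(sample_names, 2)
```

For diacritized Arabic input, note that Python strings index by Unicode code point, so each diacritic mark counts as its own character unless letters and their marks are grouped beforehand; whether the paper treats them that way is not stated in the abstract.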
Keywords and phrases: Arabic language, mathematical modeling, statistical properties, natural Arabic language processing, human computer interface.
Communicated by Kewen Zhao