Frequently Asked Questions (FAQs)

What is Language Technology?
What is Computational Linguistics?
Where do I find out about linguistic annotation tools?
What is speech synthesis?
How do I perform speech synthesis?
What is ISCII?
What is ACII Script Code?
Are there any new entities required for ensuring proper representation of complex scripts?
How is text represented through ACII?
What are ISFOC character sets?
What are the basic ISFOC concepts?
What is UNICODE?
What is unicode policy for character encoding?
What is the basic difference between Unicode and ISCII code?
What are three different Keyboard Layouts for typing in Indian Languages?


What is Language Technology?

Language technology is the study of computer systems that understand and/or synthesize spoken and written human languages. The area includes speech processing (recognition, understanding, and synthesis), information extraction, handwriting recognition, machine translation, text summarization, and language generation.

What is Computational Linguistics?

Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science that aims at computational models of human cognition. CL has two components: applied and theoretical. The applied component is more interested in the practical outcome of modelling human language use; its goal is to create software products that have some knowledge of human language.
Natural language interfaces enable the user to communicate with the computer in German, English or another human language. Some applications of such interfaces are database queries, information retrieval from texts and so-called expert systems. Current advances in recognition of spoken language improve the usability of many types of natural language systems.
Much older than communication problems between human beings and machines are those between people with different mother tongues. One of the original goals of applied computational linguistics was fully automatic translation between human languages. Computational linguists have created software systems which can simplify the work of human translators and clearly improve their productivity. Even though the successful simulation of human language competence is not to be expected in the near future, computational linguists have numerous immediate research goals involving the design, realization and maintenance of systems which facilitate everyday work, such as grammar checkers for word processing programs.
Theoretical CL deals with formal theories about the linguistic knowledge that a human needs for generating and understanding language. Computational linguists develop formal models simulating aspects of the human language faculty and implement them as computer programmes. These programmes constitute the basis for the evaluation and further development of the theories. In addition to linguistic theories, findings from cognitive psychology play a major role in simulating linguistic competence. Within psychology, it is mainly the area of psycholinguistics that examines the cognitive processes constituting human language use. The special attraction of computational linguistics lies in the combination of methods and strategies from the humanities, natural and behavioural sciences, and engineering.

Where do I find out about linguistic annotation tools?

The Linguistic Data Consortium provides a very comprehensive Linguistic Annotation Tools web page at http://www.ldc.upenn.edu/annotation .
It concentrates on speech but also covers resources for working with text.

What is speech synthesis?

Speech synthesis programs convert written input to spoken output by automatically generating synthetic speech. Speech synthesis is often referred to as "Text-to-Speech" (TTS) conversion.

How do I perform speech synthesis?

There are several algorithms, and the choice depends on the task. The easiest way is simply to record the voice of a person speaking the desired phrases. This is useful if only a restricted set of phrases and sentences is needed, e.g. announcements in a train station or schedule information by phone. The quality depends on how the recording is done. More sophisticated, but lower in quality, are algorithms that split the speech into smaller pieces. The smaller the units, the fewer of them are needed, but the quality also decreases. An often used unit is the phoneme, the smallest linguistic unit. Western European languages have about 35-50 phonemes, depending on the language, so only 35-50 single recordings are needed. The problem lies in combining them, since fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.
A solution to this dilemma is to use diphones. Instead of splitting at the transitions, the cut is made at the center of each phoneme, leaving the transitions themselves intact. This gives about 400 elements (20*20), and the quality increases. The longer the units become, the more elements there are, but the quality increases along with the memory required. Other widely used units are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings. The Museum of Speech Analysis and Synthesis has pictures of artificial speech systems going back over 150 years and is worth a visit (http://mambo.ucsc.edu/psl/smus/smus.html).
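The diphone idea can be made concrete with a small sketch. The Python below is an illustration only, not the method of any particular synthesizer: the function names, the "_" silence marker and the phoneme symbols are assumptions made for this example. It shows how a phoneme sequence is covered by units that each run from the middle of one phoneme to the middle of the next, and why the inventory grows roughly with the square of the phoneme count (20 phonemes give the 400 quoted above; 35-50 phonemes would give considerably more).

```python
def diphones(phonemes):
    """Return the diphone units needed to speak a phoneme sequence.

    Each unit spans the transition between two neighbouring phonemes.
    "_" is a hypothetical marker for silence at the utterance edges.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def diphone_inventory_size(phoneme_count):
    """Upper bound on the number of diphones: every phoneme paired with every phoneme."""
    return phoneme_count * phoneme_count


if __name__ == "__main__":
    print(diphones(["h", "e", "l", "o"]))
    for n in (20, 35, 50):
        print(n, "phonemes ->", diphone_inventory_size(n), "diphones at most")
```

A phoneme-based system would store only one recording per phoneme, while the diphone system stores one recording per pair, which is the memory-versus-quality trade-off described above.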

What is ISCII?

The Bureau of Indian Standards has defined a standard known as ISCII (Indian Script Code for Information Interchange) for use in all computer and communication media. It allows the use of 7-bit or 8-bit characters. In an 8-bit environment, the lower 128 characters are the same as those defined in IS10315:1982 (ISO 646 IRV), the 7-bit coded character set for information interchange, also known as the ASCII character set. The top 128 characters cater to all the Indian scripts based on the ancient Brahmi script. In a 7-bit environment, the control code SO can be used for invocation of the ISCII code set and the control code SI can be used for reselection of the ASCII code set.
There are 15 officially recognized languages in India. Apart from the Perso-Arabic scripts, all the other 10 scripts used for Indian languages have evolved from the ancient Brahmi script and have a common phonetic structure, making a common character set possible. An attribute mechanism has been provided for selecting different Indian script fonts and display attributes. An extension mechanism allows the use of more characters along with the ISCII code. The ISCII code table is a superset of all the characters required in the Brahmi-based Indian scripts. For convenience, the alphabet of the official script, Devanagari, has been used in the standard. The standard IS 13194:1991, issued by the Bureau of Indian Standards, is the latest Indian Standard for Information Interchange and is widely used for the development of IT products in Indian languages.
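The 8-bit arrangement described above (ASCII in the lower 128 positions, Indian script characters in the upper 128) means a program can tell the two apart by looking at the high bit of each byte. The following Python sketch illustrates only that split. The function name is an assumption made for this example, and the sketch does not decode the upper-half bytes into letters, since the code table itself is not reproduced here.

```python
def split_halves(data: bytes):
    """Group an 8-bit byte stream into runs from the lower and upper halves.

    Bytes below 0x80 belong to the ASCII half; bytes from 0x80 upward
    belong to the half reserved for Indian script characters.
    """
    runs = []
    for value in data:
        half = "ascii" if value < 0x80 else "iscii"
        if runs and runs[-1][0] == half:
            runs[-1][1].append(value)       # extend the current run
        else:
            runs.append((half, bytearray([value])))  # start a new run
    return [(half, bytes(chunk)) for half, chunk in runs]


if __name__ == "__main__":
    sample = b"ab\xb3\xda c"  # arbitrary sample bytes, chosen only to show the split
    for half, chunk in split_halves(sample):
        print(half, chunk)
```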

What is ACII Script Code?

ACII stands for Alphabetic Code for Information Interchange (pronounced "Ae-Kee"). It is an 8-bit code containing the ASCII character set in the bottom half; the top half contains the ACII characters. PC-ACII Script Code is the version of the ACII script code in which the characters in the upper half are split for compatibility with the IBM PC. This splitting is necessary to keep intact the line-drawing characters, which are located in the middle of the upper half of the character set.

Are there any new entities required for ensuring proper representation of complex scripts?

The following entities are required for ensuring proper representation of complex scripts:

ACII - Alphabetic Code for Information Interchange
This is a computer code by which the basic alphabet of a script is represented. The basic letters and signs needed in most scripts (leaving aside ideographic scripts like Chinese) number fewer than 96. All the possible shapes in a script can be expressed through combinations of these basic letters. The ACII code can be typed through an ACII keyboard overlay, which fits on a standard English keyboard. Each ASCII character has a unique position on the keyboard overlay.

ISFOC - Intelligence Based Script Font Code
ISFOC is a coded character set containing all the basic shapes required for rendering a script. These shapes can be overlapped linearly to compose any word in the script. Each ISFOC character is like a piece of a jigsaw puzzle; it may not be a complete letter by itself. Each ISFOC set can contain a maximum of 188 characters. This is adequate for most scripts; however, some require more.

ISFA - Intelligence Based Script to Font Algorithm
A word is always typed in terms of its basic ACII characters. It has, however, to be displayed using the basic ISFOC shapes. An algorithm is required for converting the ACII codes to the appropriate ISFOC codes. This is the ISFA algorithm.
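The relationship between the three entities can be sketched as a small conversion step from spelling order to display order. The Python below is a toy illustration under stated assumptions, not the actual ISFA algorithm: Unicode Devanagari letters stand in for ACII codes, the shape names stand in for ISFOC codes, and the table and function names are invented for this example. It shows one reason a conversion algorithm is needed at all: the Devanagari vowel sign for short "i" is typed after its consonant but drawn to the left of it.

```python
# Hypothetical table: typed letter -> name of the shape used to draw it.
# Real ISFOC sets contain many more shapes, including partial letters.
SHAPES = {
    "\u0915": "ka",       # DEVANAGARI LETTER KA
    "\u092E": "ma",       # DEVANAGARI LETTER MA
    "\u093E": "aa-sign",  # vowel sign AA, drawn to the right of the consonant
    "\u093F": "i-sign",   # vowel sign I, drawn to the left of the consonant
}

# Vowel signs that are displayed before the consonant they follow in spelling.
DRAWN_BEFORE_CONSONANT = {"\u093F"}


def spelling_to_shapes(spelling):
    """Convert letters in spelling (pronunciation) order to shapes in display order."""
    shapes = []
    for letter in spelling:
        shape = SHAPES[letter]
        if letter in DRAWN_BEFORE_CONSONANT and shapes:
            # Place the sign in front of the shape it belongs to.
            shapes.insert(len(shapes) - 1, shape)
        else:
            shapes.append(shape)
    return shapes


if __name__ == "__main__":
    print(spelling_to_shapes("\u0915\u093F"))        # typed as ka + i
    print(spelling_to_shapes("\u092E\u093E\u0915"))  # typed as ma + aa + ka
```

The stored text stays in spelling order, so sorting and editing work on the alphabet, while only the display step deals with shapes.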

How is text represented through ACII?

The ACII (Alphabetic Code for Information Interchange) code contains all the basic characters available on the ACII keyboard. For example, the ACII Indian code and keyboard accommodate the requirements of the 10 Indian scripts: Assamese, Bengali, Devanagari, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu. The basic characters are ordered such that direct sorting gives results that are almost the same as the correct order for any of the scripts. The ACII codes have to be converted to ISFOC for display purposes. This is done through an ISFA algorithm for the selected script. An ACII text can be displayed in any of the scripts, and transliteration to another script can be achieved by merely selecting that script. ACII code is used in communication media, like telex, for optimal transfer of text. The ALP word processor uses the ACII code internally to allow proper editing at the alphabetic level and a unique representation of spellings.
Existing Windows applications are unable to handle ACII directly, as it requires an intelligent algorithm for handling the display. They can, however, handle the ISFOC codes, which were made for this purpose. Thus, conversion between ACII and ISFOC is necessary whenever text has to be transferred from ALP to a Windows application. It is possible to type ISFOC text directly within a Windows application using the ACII keyboard. This is done through a custom keyboard driver that performs the ACII to ISFOC conversion internally.

What are ISFOC character sets?

Script character set
This is the primary character set containing most of the language characters and a set of symbols and numerals, which are frequently used. This set of symbols will be common across all the ISFOC character sets, with a few exceptions.

The matching English Character set
This is a companion character set for matching English fonts containing ASCII characters in the bottom half, and accent characters for Roman Transliteration in the upper half.

The supplemental character set
The supplemental character set is an extension of the basic script character set, containing conjuncts and symbols that are not required for normal usage.

What are the basic ISFOC concepts?

This section lists the basic philosophy behind rendering complex scripts.

Script Rendition Philosophy

· It is intuitive and logical to type in a word in terms of its spelling.
· The spelling of a word consists of the basic alphabet in the order of their pronunciation.
· The basic alphabet of a script, along with the necessary special symbols and punctuation, constitutes the ACII (Alphabetic Code for Information Interchange). The letters in the ACII are arranged according to their alphabetical sorting order. ACII also contains the ASCII character set.
· A word can be composed by linearly combining the basic shapes available in a script.
· The ISFOC for a script contains these basic shapes. These can be too unwieldy for direct typing.
· An Intelligence Based Script to Font Algorithm (ISFA) can interpret the ACII spelling and generate the ISFOC code sequence required for displaying the word.
· For simple scripts like that of English, the ASCII code by itself suffices for both ACII and ISFOC.
· However, most complex non-linear scripts, like the Indian scripts, require separate codes for ACII and ISFOC, and an ISFA algorithm.

ISFOC Standards
· Script standards for basic shapes and their composition facilitate the designing of fonts.
· ISFOC represents the modern rendition style of a script by defining the necessary basic shapes.
· The basic shapes are chosen such that they can represent a wide variety of font styles in the script.
· ISFOC for a script is associated with an ISFA, which defines the standard way for composing a word using the basic shapes.
· All the fonts developed for a script are mutually compatible. A user can view a text in a font of his choice.
· Since ISFOC fonts are linearly composed, they can be used along with the existing English applications and printed on existing Laser printers and Typesetters.
· ISFOC provides the code set for inclusion of complex scripts in graphics-oriented environments like MS-Windows and Macintosh.
· ISFOC provides the neatest script rendition, while allowing an intuitive human interface through an ACII keyboard.


What is UNICODE?

Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have declared their support for it. Unicode for Indian languages is based on ISCII-88 and not on ISCII-91, which is the latest official standard. It was felt necessary that the Indian Government make representations to the Unicode Consortium for the necessary modifications to the code pertaining to Indian language scripts, and hence the Department of Information Technology became a full member of the Unicode Consortium with voting rights.

16 Bit (2 Byte) UNICODE
The Unicode Standard is the universal character encoding standard used for the representation of text for computer processing. It provides the capacity to encode all of the characters used for the written languages of the world, and it provides information about the characters and their use. The Unicode Standard is very useful for computer users who deal with multilingual text, business people, linguists, researchers, scientists, mathematicians and technicians. Unicode uses a 16-bit encoding that provides code points for more than 65,000 characters (65,536). The Unicode Standard assigns each character a unique numeric value and name. The Unicode Standard and the ISO 10646 standard provide an extension mechanism called UTF-16 that allows for encoding as many as a million more characters. Presently the Unicode Standard provides codes for 49,194 characters.
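The arithmetic behind these figures can be checked with a short sketch. A 16-bit unit has 2^16 = 65,536 values, and UTF-16 reaches beyond that by combining two reserved 16-bit units (a surrogate pair), which addresses a further 1,024 * 1,024 = 1,048,576 code points. The Python below is a minimal illustration; the function name is an assumption made for this example, and the result is compared against Python's built-in "utf-16-be" codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


if __name__ == "__main__":
    print(2 ** 16)                 # values in a single 16-bit unit
    print(0x10FFFF - 0x10000 + 1)  # code points reachable through surrogate pairs
    high, low = to_surrogate_pair(0x10000)
    print(hex(high), hex(low))
    # The built-in codec produces the same two units.
    print(chr(0x10000).encode("utf-16-be").hex())
```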

What is unicode policy for character encoding?

The Unicode Consortium has laid down a policy on character encoding stability, under which no character can be deleted and no character name can be changed; only annotations can be updated:

1. Once a character is encoded, it will not be moved or removed.
2. Once a character is encoded, its character name will not be changed.
3. Once a character is encoded, its canonical combining class and decomposition (either canonical or compatibility) will not be changed in a way that would affect normalization.
4. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
5. The structure of certain property values in the Unicode character database will not be changed.
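The properties these rules refer to (name, canonical combining class, decomposition) can be inspected with Python's standard unicodedata module. The sketch below is only an illustration of what the policy protects; the helper name and the choice of sample characters are assumptions made for this example.

```python
import unicodedata


def describe(character):
    """Collect the properties that the stability policy refers to."""
    return {
        "name": unicodedata.name(character),
        "combining_class": unicodedata.combining(character),
        "decomposition": unicodedata.decomposition(character),
    }


if __name__ == "__main__":
    # U+0929 is a Devanagari letter with a canonical decomposition;
    # U+093C is the combining nukta sign it decomposes into.
    for sample in ("\u0915", "\u0929", "\u093C"):
        print(f"U+{ord(sample):04X}", describe(sample))

    # Normalization relies on the decomposition and combining class staying fixed.
    decomposed = unicodedata.normalize("NFD", "\u0929")
    print([f"U+{ord(c):04X}" for c in decomposed])
```

Because normalized text is compared and stored by many systems, a later change to any of these values would silently make old and new data disagree, which is what the policy rules out.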

What is the basic difference between Unicode and ISCII code?

Unicode
Unicode uses a 16-bit encoding that provides code points for more than 65,000 characters (65,536). The Unicode Standard assigns each character a unique numeric value and name, and provides the capacity to encode all of the characters used for the written languages of the world.

ISCII
ISCII uses an 8-bit code, which is an extension of the 7-bit ASCII code, containing the basic alphabet required for the 10 Indian scripts that have originated from the Brahmi script. There are 15 officially recognized languages in India. Apart from the Perso-Arabic scripts, all the other 10 scripts used for Indian languages have evolved from the ancient Brahmi script and have a common phonetic structure, making a common character set possible. The ISCII code table is a superset of all the characters required in the Brahmi-based Indian scripts. For convenience, the alphabet of the official script, Devanagari, has been used in the standard.
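The practical effect of a common character set can be seen in Unicode itself, whose Indian script blocks follow the ISCII-88 ordering: each script occupies its own 128-position block, and corresponding letters sit at the same offset within each block. The Python sketch below uses that layout for a crude script-to-script conversion. It is an illustration only: the function and table names are assumptions made for this example, and a plain offset shift can land on unassigned positions, because not every letter exists in every script.

```python
# Start of each 128-position Unicode block for the Brahmi-derived scripts.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move every character of the source script to the same offset in the target block."""
    source_base = SCRIPT_BASE[source]
    target_base = SCRIPT_BASE[target]
    converted = []
    for character in text:
        code_point = ord(character)
        if source_base <= code_point < source_base + BLOCK_SIZE:
            converted.append(chr(code_point - source_base + target_base))
        else:
            converted.append(character)  # leave ASCII and other characters alone
    return "".join(converted)


if __name__ == "__main__":
    word = "\u0930\u093E\u092E"  # ra + aa-sign + ma in Devanagari
    for script in ("bengali", "gujarati", "telugu"):
        print(script, shift_script(word, "devanagari", script))
```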

What are three different Keyboard Layouts for typing in Indian Languages?

There are 3 different keyboard layouts.

1. Romanised Layout: In the Romanised layout, phonetic English mappings are used to compose the Hindi text. For example, the key sequence raamaa (or rAmA) can be used to type 'Rama'.

2. Typewriter Layout: This layout is similar to the Hindi typewriter layout and is useful for Hindi typists and other people familiar with that layout. The Typewriter Layout and Key Sequence Charts show the key combinations.

3. DOE Phonetic: This layout is standardized by the Department of Electronics (DOE), Govt. of India. The advantage of this layout is that it remains identical for all Indian languages. For example, the key 'k' is used to represent the letter 'ka' in all Indian languages. The Keyboard Layout and the Key Sequence Charts can be used to find the correct key combinations.
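As an illustration of the Romanised approach, the sketch below turns a phonetic key sequence into Devanagari text. It is a toy under stated assumptions, not the mapping used by any actual keyboard driver: the tables cover only a few keys, and the key assignments and function name are invented for this example. A consonant key emits the consonant; a following vowel key emits the matching vowel sign (a plain "a" is the inherent vowel and emits nothing); two consonants in a row are joined with the virama sign.

```python
# Hypothetical key tables covering only a few letters.
CONSONANT_KEYS = {"k": "\u0915", "m": "\u092E", "r": "\u0930"}
VOWEL_SIGN_KEYS = {"aa": "\u093E", "A": "\u093E", "i": "\u093F", "a": ""}
VIRAMA = "\u094D"  # suppresses the inherent vowel between joined consonants


def romanised_to_devanagari(keys):
    """Convert a phonetic key sequence to Devanagari, longest key match first."""
    output = []
    position = 0
    awaiting_vowel = False  # True right after a consonant has been emitted
    while position < len(keys):
        for length in (2, 1):
            chunk = keys[position:position + length]
            if len(chunk) < length:
                continue
            if chunk in CONSONANT_KEYS:
                if awaiting_vowel:
                    output.append(VIRAMA)
                output.append(CONSONANT_KEYS[chunk])
                awaiting_vowel = True
                break
            if chunk in VOWEL_SIGN_KEYS and awaiting_vowel:
                output.append(VOWEL_SIGN_KEYS[chunk])
                awaiting_vowel = False
                break
        else:
            raise ValueError(f"no mapping for key sequence at position {position}")
        position += length
    return "".join(output)


if __name__ == "__main__":
    print(romanised_to_devanagari("raamaa"))
    print(romanised_to_devanagari("rAmA"))
```

Both spellings produce the same text, which is why such layouts usually accept more than one key sequence for a long vowel.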
 
 
© Dept. of Information Technology, MoCIT, Govt. of India || Site and Data Centre maintained by C-DAC Noida