Language technology is the study of computer systems that understand
and/or synthesize spoken and written human languages. Included
in this area are speech processing (recognition, understanding,
and synthesis), information extraction, handwriting recognition,
machine translation, text summarization, and language generation.
Computational linguistics (CL) is a discipline between linguistics
and computer science which is concerned with the computational
aspects of the human language faculty. It belongs to the cognitive
sciences and overlaps with the field of artificial intelligence
(AI), a branch of computer science that aims at computational
models of human cognition. There are two components of CL: applied
and theoretical. The applied component of CL is more interested
in the practical outcome of modelling human language use. The
goal is to create software products that have some knowledge of
human language.
Natural language interfaces enable the user to communicate with
the computer in German, English or another human language. Some
applications of such interfaces are database queries, information
retrieval from texts and so-called expert systems. Current advances
in recognition of spoken language improve the usability of many
types of natural language systems.
Much older than communication problems between human beings and
machines are those between people with different mother tongues.
One of the original goals of applied computational linguistics
was fully automatic translation between human languages. Computational
linguists have created software systems which can simplify the
work of human translators and clearly improve their productivity.
Even though the successful simulation of human language competence
is not to be expected in the near future, computational linguists
have numerous immediate research goals involving the design, realization
and maintenance of systems which facilitate everyday work, such
as grammar checkers for word processing programs.
Theoretical CL takes up issues in formal theories. It deals with
formal theories about the linguistic knowledge that a human needs
for generating and understanding language. Computational linguists
develop formal models simulating aspects of the human language
faculty and implement them as computer programmes. These programmes
constitute the basis for the evaluation and further development
of the theories. In addition to linguistic theories, findings
from cognitive psychology play a major role in simulating linguistic
competence. Within psychology, it is mainly the area of psycholinguistics
that examines the cognitive processes constituting human language
use. The special attraction of computational linguistics lies
in the combination of methods and strategies from the humanities,
natural and behavioural sciences, and engineering.
There is a very comprehensive Linguistic Annotation Tools web page
provided by the Linguistic Data Consortium at
http://www.ldc.upenn.edu/annotation.
It concentrates on speech but also covers resources for working
with text.
Speech synthesis programs convert written input to spoken output
by automatically generating synthetic speech. Speech synthesis is
often referred to as "Text-to-Speech" (TTS) conversion.
There are several approaches, and the choice depends on the task
they are used for. The easiest way is simply to record the voice
of a person speaking the desired phrases. This is useful if only a
restricted set of phrases and sentences is needed, e.g. announcements
in a train station or schedule information by phone. The quality
depends on how the recording is done. More sophisticated, but lower
in quality, are algorithms that split the speech into smaller units.
The smaller those units are, the fewer of them are needed, but
the quality also decreases. A commonly used unit is the phoneme,
the smallest distinctive unit of sound in a language. Depending on
the language, there are about 35-50 phonemes in western European
languages, i.e. only 35-50 single recordings are needed. The problem
lies in combining them, since fluent speech requires fluent
transitions between the elements. The intelligibility is therefore
lower, but the memory required is small.
A solution to this dilemma is to use diphones. Instead of splitting
at the transitions, the cut is made at the center of the phonemes,
leaving the transitions themselves intact. This gives about 400
elements (20*20) and the quality increases. The longer the
units become, the more elements there are, and both the quality
and the memory required increase. Other widely used units
are half-syllables, syllables, words, or combinations of them,
e.g. word stems and inflectional endings. The Museum of Speech
Analysis and Synthesis has pictures of artificial speech systems
going back over 150 years and is worth a visit
(http://mambo.ucsc.edu/psl/smus/smus.html).
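As a rough illustration of the diphone idea, the sketch below turns
a phoneme sequence into the diphone units needed to speak it, and
computes an upper bound on the size of the unit inventory. The
function names and the "_" silence marker are assumptions made for
this example. The n*n figure is only an upper bound, since not every
phoneme pair occurs in a given language.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone units covering a phoneme sequence.

    Each unit runs from the middle of one phoneme to the middle of the
    next, so every transition stays inside a single unit. Silence is
    added at both ends so the first and last phonemes are covered too.
    """
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def inventory_upper_bound(n_phonemes):
    """Upper bound on inventory size for phoneme and diphone units."""
    return {"phonemes": n_phonemes, "diphones": n_phonemes * n_phonemes}


if __name__ == "__main__":
    print(to_diphones(["h", "e", "l", "o"]))
    print(inventory_upper_bound(20))
```

With 20 phonemes the bound is the 400 units quoted above; with the
35-50 phonemes of a western European language it rises to between
1225 and 2500 units.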
The Bureau of Indian Standards has defined a standard known as ISCII
(Indian Script Code for Information Interchange) for use in all
computer and communication media, which allows the use of 7-bit or
8-bit characters. In an 8-bit environment, the lower 128 characters
are the same as those defined in IS10315:1982 (ISO 646 IRV), the
7-bit coded character set for information interchange, also known
as the ASCII character set. The top 128 characters cater to all the
Indian scripts based
on the ancient Brahmi script. In a 7-bit environment the control
code SO can be used to invoke the ISCII code set, and the control
code SI can be used to reselect the ASCII code set.
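The switch between the 8-bit and 7-bit forms can be sketched as
follows. This is a minimal illustration, not code from the standard:
it follows the ISO 2022 convention in which SO (0x0E) shifts to the
additional set and SI (0x0F) shifts back to ASCII, and it assumes
that all ISCII characters lie in the range 0xA0-0xFF. The function
names are invented for this example.

```python
SO = 0x0E  # Shift Out: following bytes belong to the ISCII set
SI = 0x0F  # Shift In: following bytes belong to the ASCII set


def iscii_8bit_to_7bit(data: bytes) -> bytes:
    """Re-encode 8-bit ISCII text for a 7-bit channel using SO/SI."""
    out = bytearray()
    in_iscii = False
    for b in data:
        # Bytes that would collide with the shift codes are rejected.
        if b in (SO, SI) or 0x80 <= b <= 0x9F:
            raise ValueError(f"byte 0x{b:02X} is not handled by this sketch")
        wants_iscii = b >= 0xA0
        if wants_iscii != in_iscii:
            out.append(SO if wants_iscii else SI)
            in_iscii = wants_iscii
        out.append(b & 0x7F)  # drop the eighth bit
    if in_iscii:
        out.append(SI)  # leave the stream in the ASCII state
    return bytes(out)


def iscii_7bit_to_8bit(data: bytes) -> bytes:
    """Inverse of iscii_8bit_to_7bit."""
    out = bytearray()
    in_iscii = False
    for b in data:
        if b == SO:
            in_iscii = True
        elif b == SI:
            in_iscii = False
        else:
            out.append((b | 0x80) if in_iscii else b)
    return bytes(out)


if __name__ == "__main__":
    sample = b"A\xa4\xb3B"  # ASCII 'A', two upper-half bytes, ASCII 'B'
    shifted = iscii_8bit_to_7bit(sample)
    print(shifted.hex(" "))
    print(iscii_7bit_to_8bit(shifted) == sample)
```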
There are 15 officially recognized languages in India. Apart from
Perso-Arabic scripts, all the other 10 scripts used for Indian
languages have evolved from the ancient Brahmi script and have
a common phonetic structure, making a common character set possible.
An attribute mechanism is provided for selecting different
Indian script fonts and display attributes. An extension mechanism
allows more characters to be used along with the ISCII code. The
ISCII code table is a superset of all the characters required in the
Brahmi-based Indian scripts. For convenience, the alphabet of
the official script, Devanagari, has been used in the standard.
Standard IS 13194:1991, issued by the Bureau of Indian Standards,
is the latest Indian standard for information interchange and
is widely used for the development of IT products in Indian
languages.
ACII stands for Alphabetic Code for Information Interchange
(pronounced "Ae-Kee"). This is an 8-bit code containing the ASCII
character set in the bottom half. The top half contains the ACII
characters. The PC-ACII script code is the version of the ACII
script code in which the characters are split within the upper half
for compatibility with the IBM PC. This splitting is necessary in
order to keep intact the line-drawing characters, which are located
in the middle of the upper half of the character set.
The following entities are required to ensure proper representation
of complex scripts:
ACII (Alphabetic Code for Information Interchange): This is
a computer code by which the basic alphabet of a script is represented.
The basic letters and signs needed in most scripts (leaving
aside ideographic scripts like Chinese) number fewer than 96. All
the possible shapes in a script can be expressed through combinations
of these basic letters. The ACII code can be typed through an
ACII keyboard overlay. The ACII keyboard overlay fits on a standard
English keyboard. Each ACII character has a unique position on
the keyboard overlay.
ISFOC (Intelligence Based Script Font Code): ISFOC is a coded
character set containing all the basic shapes required for rendering
a script. These shapes can be overlapped linearly to compose any
word in the script. Each of the ISFOC characters is like a piece
of a jigsaw puzzle; it may not be a complete letter by itself.
Each ISFOC set can contain a maximum of 188 characters. This is
adequate for most scripts. However, some require more.
ISFA (Intelligence Based Script to Font Algorithm): A word
is always typed in terms of its basic ACII characters. It has,
however, to be displayed using the basic ISFOC shapes. An algorithm
is required for converting the ACII codes to the appropriate ISFOC
codes. This is the ISFA algorithm.
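The division of labour between the three entities can be pictured
with a toy conversion. The sketch below is not the real ISFA: the
symbolic names stand in for ACII codes and ISFOC shapes, whose actual
values are not given here, and only one rule is modelled. That rule
is a well-known property of Devanagari: the sign for short "i" is
spelled after its consonant but drawn to its left.

```python
# Hypothetical symbolic codes; real ACII and ISFOC values differ.
PRE_BASE_SIGNS = {"SIGN_I"}  # vowel signs drawn before the consonant


def isfa_sketch(spelling):
    """Map letters in spelling order to shapes in left-to-right order."""
    shapes = []
    for code in spelling:
        if code in PRE_BASE_SIGNS and shapes:
            # Drawn to the left of the letter that was typed just before.
            shapes.insert(len(shapes) - 1, code)
        else:
            shapes.append(code)
    return [f"shape:{code}" for code in shapes]


if __name__ == "__main__":
    # Spelled "ka" + sign "i" + "ma"; displayed as sign "i", "ka", "ma".
    print(isfa_sketch(["KA", "SIGN_I", "MA"]))
```

A full algorithm would also have to choose conjunct shapes,
half-forms and the positions of marks above and below the line,
which is why the text calls for an "intelligent" conversion.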
The ACII (Alphabetic Code for Information Interchange) code contains
all the basic characters available on the ACII keyboard. For example,
the ACII Indian code and keyboard accommodate the requirements
of the 10 Indian scripts: Assamese, Bengali, Devanagari, Gujarati,
Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu. The basic
characters are ordered such that direct sorting gives results
that are almost the same as the proper sort order of any of the
scripts. The ACII codes have to be converted to ISFOC for display.
This is done through the ISFA algorithm for the selected script.
An ACII text can be displayed in any of the scripts. Transliteration
to another script can be achieved merely by selecting that script.
ACII code is used in communication media, such as telex, for optimal
transfer of text. The ALP word processor uses the ACII code internally
to allow proper editing at the alphabetic level and a unique
representation of spellings.
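Transliteration by script selection works because the scripts share
one letter layout. ACII code values are not listed here, so the
sketch below uses Unicode as a stand-in: its Indic blocks follow the
ISCII layout, so corresponding letters sit at the same offset within
each 128-code-point block. The function name is an assumption, and
the result is only an approximation, because not every offset is
assigned in every script.

```python
# First code point of some Unicode Indic blocks.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def switch_script(text, source, target):
    """Move each letter of the source block to the same offset in target."""
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # ASCII, punctuation, other scripts: unchanged
    return "".join(out)


if __name__ == "__main__":
    rama = "\u0930\u093e\u092e"  # "ra" + sign "aa" + "ma" in Devanagari
    print(switch_script(rama, "devanagari", "bengali"))
```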
Existing Windows applications are unable to handle ACII directly,
as it requires an intelligent algorithm for handling the display.
They can, however, handle the ISFOC codes, which were made for
this purpose. Thus, conversion between ACII and ISFOC is necessary
whenever text has to be transferred from ALP to a Windows application.
It is possible to type ISFOC text directly within a Windows application
using the ACII keyboard. This is done through a custom keyboard
driver that performs the ACII-to-ISFOC conversion internally.
Script character set: This is the primary character set, containing
most of the language characters and a set of frequently used symbols
and numerals. This set of symbols is common
across all the ISFOC character sets, with a few exceptions.
Matching English character set: This is a companion character
set for matching English fonts, containing ASCII characters in
the bottom half and accent characters for Roman transliteration
in the upper half.
Supplemental character set: This is an extension of the basic
script character set, containing conjuncts and symbols that are
not required for normal usage.
This chapter lists the basic philosophy required for rendering
complex scripts.
Script Rendition Philosophy
· It is intuitive and logical to type in a word in terms of its
spelling.
· The spelling of a word consists of the basic letters of the
alphabet in the order of their pronunciation.
· The basic alphabet of a script, along with the necessary special
symbols and punctuation marks, constitutes the ACII (Alphabetic Code
for Information Interchange). The letters in the ACII are arranged
according to their alphabetical sorting order. ACII also contains
the ASCII character set.
· A word can be composed by linearly combining the basic shapes
available in a script.
· The ISFOC for a script contains these basic shapes. These can be
too unwieldy for direct typing.
· An Intelligent Script to Font Algorithm (ISFA) can interpret
the ACII spelling and generate the ISFOC code sequence required
for displaying the word.
· For simple scripts like that of English, the ASCII code by itself
suffices for both ACII and ISFOC.
· However, most complex non-linear scripts, like the Indian scripts,
require separate codes for ACII and ISFOC, and an ISFA algorithm.
ISFOC Standards
· Script standards for basic shapes and their composition facilitate
the design of fonts.
· ISFOC represents the modern rendition style of a script by defining
the necessary basic shapes.
· The basic shapes are chosen such that they can represent a wide
variety of font styles in the script.
· ISFOC for a script is associated with an ISFA, which defines
the standard way for composing a word using the basic shapes.
· All the fonts developed for a script are mutually compatible.
A user can view a text in a font of their choice.
· Since ISFOC fonts are linearly composed, they can be used along
with the existing English applications and printed on existing
Laser printers and Typesetters.
· ISFOC provides the code set for inclusion of complex scripts
in graphics-oriented environments like MS-Windows and Macintosh.
· ISFOC provides the neatest script rendition, while allowing
an intuitive human interface through an ACII keyboard.
Unicode is increasingly being accepted as a standard for
information interchange worldwide, as most of the major IT companies
have declared their support for it. Unicode for Indian languages
is based on ISCII-88 and not on ISCII-91, which is the latest
official standard. It was felt necessary that the Indian Government
should make representations to the Unicode Consortium for the
necessary modifications to the code pertaining to Indian language
scripts, and hence the Department of Information Technology became
a full member of the Unicode Consortium with voting rights.
16 Bit (2 Byte) UNICODE
The Unicode Standard is the universal character encoding standard
used for the representation of text for computer processing. It
provides the capacity to encode all of the characters
used for the written languages of the world. The Unicode Standard
also provides information about the characters and their use. It is
very useful for computer users who deal with multilingual
text, business people, linguists, researchers, scientists,
mathematicians and technicians. Unicode uses a 16-bit encoding that
provides 65,536 code points. The Unicode Standard
assigns each character a unique numeric value and name. The Unicode
Standard and the ISO 10646 standard provide an extension mechanism
called UTF-16 that allows as many as a million more characters to
be encoded. Presently the Unicode Standard provides codes for
49,194 characters.
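The UTF-16 extension mechanism can be made concrete with a short
calculation. A character beyond the first 65,536 code points is
written as a pair of 16-bit units taken from two reserved ranges of
1,024 values each, which gives 1,024 * 1,024 = 1,048,576 additional
code points, the "million" mentioned above. The function name is an
assumption for this sketch; the arithmetic is that of UTF-16.

```python
def utf16_code_units(code_point):
    """Return the 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000  # 20-bit value
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x0915, 0x1F600):
        units = utf16_code_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
        # Cross-check against Python's built-in UTF-16 encoder.
        expected = chr(cp).encode("utf-16-be")
        assert b"".join(u.to_bytes(2, "big") for u in units) == expected
```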
The Unicode Consortium has laid down policies on character
encoding stability, under which no character can be deleted and
no character name can be changed; only annotations can be
updated:
1. Once a character is encoded, it will not be moved or removed.
2. Once a character is encoded, its character name will not be
changed.
3. Once a character is encoded, its canonical combining class
and decomposition (either canonical or compatibility) will not
be changed in a way that would affect normalization.
4. Once a character is encoded, its properties may still be changed,
but not in such a way as to change the fundamental identity of
the character.
5. The structure of certain property values in the Unicode character
database will not be changed.
Unicode uses a 16-bit encoding that provides 65,536 code points,
with the capacity to encode all of the characters used for the
written languages of the world. ISCII, by contrast, uses an 8-bit
code that extends the 7-bit ASCII code with the basic alphabet
required for the 10 Indian scripts that originated from the Brahmi
script.
What are the three different keyboard layouts for typing in
Indian languages?
There are 3 different keyboard layouts.
1. Romanised Layout: In the Romanised layout, phonetic English
mappings are used to compose Hindi text. For example, the
key sequence raamaa (or rAmA) can be used to type 'Rama'.
2. Typewriter Layout: This layout is similar to the Hindi
typewriter layout and is useful for Hindi typists and other people
familiar with the Hindi typewriter layout (see the Typewriter
Layout and Key Sequence Charts).
3. DOE Phonetic: This layout was standardized by the Department
of Electronics (DOE), Govt. of India. The advantage of this layout
is that it remains identical for all Indian languages.
For example, the key 'k' is used to represent the letter 'ka'
in all Indian languages. The Keyboard Layout and the Key Sequence
Charts can be used to find the correct key combinations.
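The Romanised layout can be pictured as a small lookup that turns
Latin key sequences into Devanagari. The mapping table below is
hypothetical and covers only a few letters; it is not the key
sequence chart of any actual product. It assumes that "aa" and "A"
both stand for the long vowel, as in the raamaa / rAmA example, that
a single "a" is the vowel inherent in the consonant, and that a
consonant with no vowel after it takes the vowel-suppressing sign
(virama).

```python
# Hypothetical, deliberately tiny mapping tables.
CONSONANTS = {"k": "\u0915", "m": "\u092e", "r": "\u0930"}
# Longer keys come first so that "aa" wins over "a".
VOWEL_SIGNS = [("aa", "\u093e"), ("A", "\u093e"), ("a", ""), ("i", "\u093f")]
VIRAMA = "\u094d"


def romanised_to_devanagari(keys):
    """Convert a Romanised key sequence to Devanagari text."""
    out = []
    i = 0
    while i < len(keys):
        if keys[i] not in CONSONANTS:
            raise ValueError(f"key {keys[i]!r} is not covered by this sketch")
        out.append(CONSONANTS[keys[i]])
        i += 1
        for roman, sign in VOWEL_SIGNS:
            if keys.startswith(roman, i):
                out.append(sign)
                i += len(roman)
                break
        else:
            out.append(VIRAMA)  # no vowel follows: suppress the inherent "a"
    return "".join(out)


if __name__ == "__main__":
    print(romanised_to_devanagari("raamaa"))
    print(romanised_to_devanagari("rAmA") == romanised_to_devanagari("raamaa"))
```

Independent vowels, digits and punctuation are left out to keep the
sketch short.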