The Chomsky hierarchy, sometimes known as the Chomsky-Schützenberger hierarchy, is a hierarchy of formal grammatical systems for describing various classes of languages; the hierarchy can apply to both human and computer languages. The hierarchy was first published in 1956 by the linguist Noam Chomsky. [1] These grammatical systems were referred to by Chomsky as transformational grammars. Other linguists have since developed a variety of similar grammars; together these are known as generative grammars.
The basic idea of a generative grammar is to provide a set of rules which are capable of generating all possible grammatical sentences in some language; the rules must constrain the generation so that only grammatical sentences are generated. The rules can also be used in reverse to parse a sentence and check if it is grammatical.
In the late 1950s, John Backus and Peter Naur developed a notation for describing programming languages. This notation is known as Backus-Naur Form (or BNF) and turns out to be equivalent to Chomsky's grammar number two (context-free grammar), although it was developed independently of Chomsky's work. Backus-Naur Form was first used in the Algol 60 Report to specify the syntax of Algol 60, and has since been used for specifying the syntax of other programming languages.
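For illustration, here is a small fragment in the style of the Algol 60 Report (a sketch, not quoted verbatim from the report): BNF writes non-terminal items in angle brackets, uses ::= for "is defined as", and separates alternatives with a vertical bar.

```
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<unsigned integer> ::= <digit> | <unsigned integer> <digit>
```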
This very simple example can handle only a tiny subset of English and should not be taken seriously as a guide to English grammar or vocabulary. It is based on an example by David Crystal [2] and is a slight extension of an example on slide 9 of this lecture. [3]
```
Sentence   -> NounPhrase VerbPhrase    # A Sentence consists of a NounPhrase followed by a VerbPhrase
NounPhrase -> Determiner Noun          # A NounPhrase consists of a Determiner followed by a Noun
VerbPhrase -> Verb NounPhrase          # A VerbPhrase consists of a Verb followed by a NounPhrase
Determiner -> a, the                   # A Determiner is one of: a or the
Noun       -> boy, girl, dog           # A Noun is one of: boy or girl or dog
Verb       -> chased, heard, saw       # A Verb is one of: chased or heard or saw
```
Notation used above:

- the arrow -> means 'can be transformed into' (or 'consists of');
- a comma on the right-hand side separates alternative transformations;
- anything following # is an explanatory comment, not part of the rule.
By making appropriate substitutions we can generate such sentences as: the boy chased a dog, a girl heard the dog, and the dog saw a girl.
The stages in the generation of these examples are as follows:

```
Sentence -> NounPhrase VerbPhrase
         -> Determiner Noun VerbPhrase
         -> Determiner Noun Verb NounPhrase
         -> Determiner Noun Verb Determiner Noun
```
By appropriate selection of words to substitute for Determiner, Noun, and Verb we can produce the 3 example sentences above.
The simple grammar defines a few words which can be substituted for Determiner, Noun, and Verb. There are two possibilities for Determiner, three for Noun, and three for Verb, with both Determiner and Noun appearing twice in the expansion. Hence the total number of grammatical sentences which can be generated by this simple grammar is 108 (= 2 * 3 * 3 * 2 * 3). In this particular case the possible words have been carefully chosen so that all of these 108 sentences also actually mean something.
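As a quick check of this count, the following short Python sketch (purely illustrative, not from any cited source) enumerates every sentence the grammar can generate:

```python
from itertools import product

determiners = ["a", "the"]
nouns = ["boy", "girl", "dog"]
verbs = ["chased", "heard", "saw"]

# A Sentence ultimately expands to: Determiner Noun Verb Determiner Noun
sentences = [" ".join(words)
             for words in product(determiners, nouns, verbs, determiners, nouns)]

print(len(sentences))   # 108
print(sentences[0])     # a boy chased a boy
```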
Note that the use of italics and boldface in this example is merely a convenient way of indicating the difference between items which can be expanded further (e.g. concepts such as NounPhrase) and items which cannot be expanded further (e.g. words such as girl). In the linguistic literature, items which can be expanded further are referred to as 'non-terminal' items, while items which cannot be expanded further are known as 'terminal' items.
The 'comma' notation for alternatives is a form of shorthand. In some formulations this would be shown as several separate rules, so that:
Determiner -> a, the
could be shown as two rules:
Determiner -> a
Determiner -> the
While a generative grammar may produce sentences which are grammatically correct, there is no certainty that such sentences will actually mean anything. The classic example of such a meaningless sentence comes from Chomsky's book [4], with the grammatically correct sentence: colorless green ideas sleep furiously.
Another problem when analysing a sentence is that many English words have multiple meanings, so that there may be more than one way to interpret a sentence. A simple example is the classic sentence: time flies like an arrow, where flies may be read as a verb or a noun, and like as a preposition or a verb.
Note: this section and subsequent sections are, by necessity, slightly more technical/formal than the earlier parts of this article, though specialist notation and abbreviations have been largely avoided. Make sure that you understand the 'Simple example' above before proceeding further, since references will be made to that example to help explain the technicalities, and variations on that example will be used to illustrate new ideas.
A formal generative grammar has four components (as described in slide 19 of this lecture [6]):

- a finite set of non-terminal items (such as Sentence and NounPhrase in the simple example);
- a finite set of terminal items (such as the and boy);
- a finite set of transformation rules (also known as production rules);
- a starting item, which must be one of the non-terminal items (Sentence in the simple example).
Notes:
A relatively small change to even a single rule can make a large change in the language which can be generated. For example, change the first rule in the simple example to read:
Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and NounPhrase VerbPhrase
This means that a Sentence is either a NounPhrase followed by a VerbPhrase or a NounPhrase followed by a VerbPhrase followed by and followed by a NounPhrase followed by a VerbPhrase. In addition to the shorter sentences already discussed, this could also generate a sentence such as: "A dog saw a girl and the dog chased the girl". The total number of grammatically-correct sentences which can be generated by this modified grammar is now 11,772 (108 + 108*108), though some of these may not be very meaningful.
A significantly different rule change can have an astounding effect. Consider changing the first rule to read:
Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and Sentence
This can generate a potentially infinite number of sentences (any number of these five-word sentences linked by and). This is an example of what is known as a recursive definition, whereby a Sentence is partially defined in terms of itself. In such a recursive rule there must be at least one transformation option which does not involve Sentence, otherwise the transformation could never stop.
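To see the recursion in action, here is a minimal Python sketch (a hypothetical illustration) of the modified grammar, choosing the recursive option at random; the non-recursive first option is what guarantees that generation eventually stops:

```python
import random

def noun_phrase():
    return f"{random.choice(['a', 'the'])} {random.choice(['boy', 'girl', 'dog'])}"

def verb_phrase():
    return f"{random.choice(['chased', 'heard', 'saw'])} {noun_phrase()}"

def sentence():
    simple = f"{noun_phrase()} {verb_phrase()}"
    # The recursive rule: Sentence -> NounPhrase VerbPhrase and Sentence.
    # Without the non-recursive option above, this would never terminate.
    if random.random() < 0.5:
        return f"{simple} and {sentence()}"
    return simple

print(sentence())  # e.g. "a dog saw a girl and the dog chased the girl"
```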
The above change is known as right-recursive. A left-recursive formulation is also possible, and in this example generates exactly the same sentences:

Sentence -> NounPhrase VerbPhrase, Sentence and NounPhrase VerbPhrase

As shown below under 'Context-free grammar', there are situations where the choice of left- or right-recursive formulations can change the meaning.
The distinction between left- and right-recursive grammars becomes important when trying to parse/analyse a sentence; some types of parser work best with left-recursive grammars, while other types work best with right-recursive grammars. This distinction is particularly relevant in connection with compiling or translating a program written in some high-level programming language so that it can be used on a computer.
By imposing different constraints [7] as to what sort of transformation rules could be used, Chomsky was able to define four strictly nested transformational grammars, which he numbered from 0 to 3. As the grammar number increases, the generated language becomes simpler. As the grammar number decreases, parsing a sentence in that language becomes more difficult. Other linguists have described formal grammars which fit in between the various Chomsky levels.
The following table provides a summary, which is then explained further in the subsequent subsections.
| Chomsky number | Linguistic name | Rule constraints | Recogniser+ | Notes |
|---|---|---|---|---|
| 0 | recursively enumerable | [anything] -> [anything] | Turing machine | most human languages |
| 1 | context-sensitive | X A Y -> X [anything] Y | linear bounded automaton | X and Y provide the context in which A can be transformed |
| 2 | context-free | A -> [anything] | non-deterministic push-down automaton | syntax analysis for most programming languages |
| 3 | regular | A -> [nothing]; A -> xyz; A -> xyz B | finite automaton or finite state machine | lexical analysis (combining characters into words/tokens) |
+ The recogniser column indicates the type of mathematical or computer science theoretical model needed to recognise that a sentence belongs to a language, essentially by applying the transformation rules in reverse to work backwards from the actual words in order to finish up with a sentence (in programming languages the starting concept is usually a program). A finite automaton uses only a small fixed amount of memory, operates in time proportional to the length of the input, and always reaches a decision, while a Turing machine may use an infinite amount of memory and time and may fail to reach a decision (may never stop).
In the linguistic literature, [nothing] is often represented by the symbol ε.
The descriptions of the various Chomsky levels which follow are presented in reverse order, so as to work up from a simple case to a rather complicated case.
The transformation rules for regular grammars are required to be very simple: the left-hand side is a single non-terminal item, and the right-hand side is either [nothing], or one or more terminal items, or one or more terminal items followed by a single non-terminal item (the three forms shown in the table above).
Note that instead of a terminal item we could have a non-terminal item whose transformation rule immediately resolves to a single terminal item, or to one of several terminal items.
Informally we can define a 'word' as a sequence of one or more Letters. The following regular grammar provides a more formal definition; for simplicity it is restricted to lower-case letters.
```
Word     -> Letter PartWord
PartWord -> [nothing]
PartWord -> Letter PartWord
Letter   -> a, b, ... z
```
Note that PartWord is recursively defined, with the recursion stopping if it is transformed into [nothing].
For one-letter words we expand as:

```
Word -> Letter PartWord -> Letter [nothing] -> Letter
```

and the Letter can then be transformed into any one of a to z. For two-letter words we expand as:

```
Word -> Letter PartWord -> Letter Letter PartWord -> Letter Letter [nothing] -> Letter Letter
```

and each Letter is again transformed into any single letter.
Obviously, words of any length can be constructed, though they may well not actually be words from any language.
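Regular grammars describe exactly the same class of languages as regular expressions, so the Word grammar above is equivalent to the regular expression [a-z]+. A small Python sketch:

```python
import re

word = re.compile(r"[a-z]+")  # one Letter followed by zero or more Letters

print(bool(word.fullmatch("strengths")))  # True
print(bool(word.fullmatch("")))           # False: Word requires at least one Letter
print(bool(word.fullmatch("Dog")))        # False: the grammar is restricted to lower case
```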
Trying to devise transformation rules to produce words which are more English-like is much more difficult: how do you insist on at least one vowel, or on no more than 5 consecutive consonants (e.g. as in 'strengths')?
Further examples in Regular grammar show how numbers can be defined via a regular grammar, and also show how a hyphenated word could be defined.
One of the limitations of a regular grammar is that it can't cope with nested matching brackets, so that it is impossible to write a regular grammar to describe an expression such as 2 * (3+4) where that grammar is expected to allow brackets nested to any depth. A context-free grammar is needed for this, as described in the next section.
Regular grammars (or equivalent regular expressions) are widely used by compilers in the lexical analysis of programming languages.
The transformation rules for context-free grammars don't appear that much different from those for regular grammars.
The key difference is that in a regular grammar any non-empty rule must effectively start with a terminal item, so that every rule transformation makes some immediate progress, while in a context-free grammar a rule may start with a non-terminal item which itself may be subject to arbitrary further transformations.
The following example shows how context-free grammars can cope with arithmetic expressions involving possibly nested brackets. An expression such as "2 + 3 * 4" could be interpreted either as "(2 + 3) * 4" = 20 or as "2 + (3 * 4)" = 14. The latter interpretation is the one normally used (multiplication "*" is said to have a higher priority than addition "+"), so the grammar must incorporate this priority concept. Furthermore, consecutive operators of the same priority are normally applied from left to right, so that "8 - 2 + 3" is interpreted as "(8 - 2) + 3" = 9, rather than as "8 - (2 + 3)" = 3. This ordering must also be built into the grammar.
To avoid distracting side-issues, this example only deals with numbers which consist solely of a single digit. The example is a simplified version of the definition of an arithmetic expression given in the Revised Algol 60 Report (Section 3.3.1). [8]
```
Expression -> Term                      # an Expression is either a Term
Expression -> Expression AddSubOp Term  # or an Expression followed by + or - followed by a Term

Term -> Primary                         # a Term is either a Primary
Term -> Term MulDivOp Primary           # or a Term followed by * or / followed by a Primary

Primary -> Number                       # a Primary is either a Number
Primary -> ( Expression )               # or an Expression enclosed in brackets

MulDivOp -> *, /                        # multiplication and division have the same priority
AddSubOp -> +, -                        # addition and subtraction have the same priority
Number   -> 0, 1, ... 9                 # this simple-minded definition of a Number is enough for this example
```
Notes:
Example 1: generate 2 + 3 * 4
```
Expression -> Expression AddSubOp Term
           -> Term AddSubOp Term
           -> Primary AddSubOp Term
           -> 2 + Term                       # + cannot be applied until Term has been evaluated
           -> 2 + Term MulDivOp Primary
           -> 2 + Primary MulDivOp Primary
           -> 2 + 3 * 4
```
Example 2: generate 2 * 3 + 4
```
Expression -> Expression AddSubOp Term
           -> Term AddSubOp Term
           -> Term MulDivOp Primary AddSubOp Term
           -> Primary MulDivOp Primary AddSubOp Term
           -> 2 * 3 AddSubOp Term            # the Term to the left of AddSubOp is evaluated first
           -> 2 * 3 AddSubOp Primary
           -> 2 * 3 + 4
```
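The left-recursive rules for Expression and Term cannot be used directly by a simple top-down parser, but the standard workaround is to replace the left recursion with iteration, which preserves the left-to-right evaluation order. The following Python sketch (an illustration, not from the Algol report; single-digit numbers only, as in the example) parses and evaluates expressions according to the grammar above:

```python
def evaluate(text):
    tokens = list(text.replace(" ", ""))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():               # Expression -> Term (AddSubOp Term)*
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            value = value + term() if op == "+" else value - term()
        return value

    def term():                     # Term -> Primary (MulDivOp Primary)*
        nonlocal pos
        value = primary()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            value = value * primary() if op == "*" else value / primary()
        return value

    def primary():                  # Primary -> Number | ( Expression )
        nonlocal pos
        if peek() == "(":
            pos += 1                # consume "("
            value = expression()
            pos += 1                # consume ")"
            return value
        value = int(tokens[pos]); pos += 1
        return value

    return expression()

print(evaluate("2 + 3 * 4"))    # 14: * applied before +
print(evaluate("8 - 2 + 3"))    # 9: same-priority operators applied left to right
print(evaluate("2 * (3 + 4)"))  # 14: brackets override priority
```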
Context-free grammars (or equivalent definitions using Backus-Naur Form) are widely used both to define the syntax of programming languages and to (semi-automatically) generate parsers for these languages.
A limitation/feature of both regular grammars and context-free grammars is that the left-hand side of any rule is always a single non-terminal item, so any transformation is applied regardless of the context in which that non-terminal item appears. A simple example which depends on context is the use of singular and plural nouns and verbs in English. We say: the man runs or the men run, but neither of the man run nor the men runs. Which form of verb to use depends on the noun, i.e. there is a context dependency.
For context-sensitive grammars the left-hand side of a rule may indicate the context in which that rule applies, so that the same non-terminal item may be transformed in different ways depending on the context in which it appears.
There is only one form of transformation rule for context-sensitive grammars:

X A Y -> X [anything] Y

where X and Y determine the context in which the non-terminal item A is transformed. Note that X and Y may each be terminals or non-terminals or [nothing]. When both X and Y are [nothing] there is no context and this reduces to a Level 2 or Level 3 grammar.
Taylor has provided several examples [9] and the following, inspired by his third example, is an extension of the original simple example above to handle singulars and plurals:
```
Sentence -> NounPhrase VerbPhrase                            # simple sentence structure
NounPhrase -> Determiner Noun                                # simple phrase structure
VerbPhrase -> Verb NounPhrase                                # simple phrase structure

Noun -> SingularNoun, PluralNoun                             # nouns can be singular or plural

Determiner SingularNoun -> SingularDeterminer SingularNoun   # context-sensitive:
Determiner PluralNoun -> PluralDeterminer PluralNoun         # determiner must agree with noun

SingularNoun Verb -> SingularNoun SingularVerb               # context-sensitive:
PluralNoun Verb -> PluralNoun PluralVerb                     # verb must agree with noun

SingularDeterminer -> a, the                                 # singular determiners
PluralDeterminer -> the                                      # plural determiners

SingularNoun -> boy, sheep, dog                              # singular nouns
PluralNoun -> boys, sheep, dogs                              # plural nouns

SingularVerb -> chases, hears, sees                          # singular verbs
PluralVerb -> chase, hear, see                               # plural verbs
```
The above grammar can generate sentences such as: a boy chases a dog, the boys chase a dog, etc. Because some words have the same form in both singular and plural, it can also generate meaningful (and grammatically-correct) sentences such as: the sheep sees the dog, the sheep see the dog.
But the above grammar can NOT generate sentences such as: a boys chases the sheep, since this would violate the context-sensitive rules.
It is in fact possible to handle singular/plural issues by using only a Level 2 context-free grammar, but the above rules expressed as a Level 1 context-sensitive grammar are a closer match to ordinary concepts of English grammar (e.g. nouns and verbs must agree in number i.e. singular/plural).
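For comparison, here is a sketch (with rule names invented for this illustration) of how the same number agreement could be captured in a Level 2 context-free grammar, by building the context into separate non-terminal items instead of into the left-hand sides of the rules:

```
Sentence -> SingularSentence, PluralSentence
SingularSentence -> SingularNounPhrase SingularVerb NounPhrase
PluralSentence -> PluralNounPhrase PluralVerb NounPhrase
NounPhrase -> SingularNounPhrase, PluralNounPhrase
SingularNounPhrase -> SingularDeterminer SingularNoun
PluralNounPhrase -> PluralDeterminer PluralNoun
```

The word-level rules (SingularDeterminer, SingularNoun, and so on) are unchanged from the context-sensitive version; the price paid is a near-doubling of the phrase-level non-terminals.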
A recursively enumerable grammar has absolutely no restriction on what sort of transformation rules can be applied; hence it is also known as an unrestricted grammar. Rules take the general form:

[anything] -> [anything]

where the left-hand side may be any non-empty sequence of terminal and non-terminal items.
The following rules can generate all strings of the form: abc, aabbcc, aaabbbccc, etc., where there are equal numbers of a's, b's and c's. This example is taken from a lecture. [10]
```
S -> a B S c
S -> a B c
B a -> a B
B c -> b c
B b -> b b
```
The derivations of the first three items in this infinite sequence are:

```
S -> a B c -> a b c

S -> a B S c -> a B a B c c -> a a B B c c -> a a B b c c -> a a b b c c

S -> a B S c -> a B a B S c c -> a B a B a B c c c -> a a B B a B c c c
  -> a a B a B B c c c -> a a a B B B c c c -> a a a B B b c c c
  -> a a a B b b c c c -> a a a b b b c c c
```
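As a sanity check, the following Python sketch (illustrative, not from the cited lecture) applies the rules exhaustively starting from S, keeping only working strings within a length limit, and confirms that the only fully terminal strings produced are those of the form abc, aabbcc, aaabbbccc:

```python
# The rules above, written without spaces; S and B are the non-terminals.
RULES = [("S", "aBSc"), ("S", "aBc"), ("Ba", "aB"), ("Bc", "bc"), ("Bb", "bb")]

def derivable(limit):
    """Return all fully terminal strings derivable from S, up to a length limit."""
    seen, frontier, results = {"S"}, ["S"], set()
    while frontier:
        current = frontier.pop()
        for lhs, rhs in RULES:
            # Try applying this rule at every position where its left side occurs.
            start = current.find(lhs)
            while start != -1:
                nxt = current[:start] + rhs + current[start + len(lhs):]
                if len(nxt) <= limit and nxt not in seen:
                    seen.add(nxt)
                    if nxt.islower():          # no S or B left: a terminal string
                        results.add(nxt)
                    else:
                        frontier.append(nxt)
                start = current.find(lhs, start + 1)
    return sorted(results, key=len)

print(derivable(9))  # ['abc', 'aabbcc', 'aaabbbccc']
```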
These lecture notes [11] show that languages exist which cannot be described by any set of rules. As is common with logical proofs of this nature, the proof doesn't actually produce or demonstrate such a language.
Categories: [Linguistics] [Computer Science] [Language]