A context-free grammar is a formal grammatical system for describing a particular class of language. The basic idea of such a grammar is to provide a set of rules which can be used to generate all possible grammatical 'sentences'; these rules must also avoid generating any ungrammatical sentences. A context-free grammar is not powerful enough to describe a natural language such as English, but it is powerful enough to describe nearly all programming languages.
A context-free grammar is the level 2 grammar in the Chomsky hierarchy and was first described in 1956 by the linguist Chomsky. [1]
This very simple example can handle only a tiny subset of English and should not be taken seriously as a guide to English grammar or vocabulary. It is based on an example by David Crystal [2] and is a slight extension of an example on slide 9 of this lecture. [3]
Sentence -> NounPhrase VerbPhrase # A Sentence consists of a NounPhrase followed by a VerbPhrase NounPhrase -> Determiner Noun # A NounPhrase consists of a Determiner followed by a Noun VerbPhrase -> Verb NounPhrase # A VerbPhrase consists of a Verb followed by a NounPhrase Determiner -> a, the # A Determiner is one of: a or the Noun -> boy, girl, dog # A Noun is one of: boy or girl or dog Verb -> chased, heard, saw # A Verb is one of: chased or heard or saw
Notation used above:
By making appropriate substitutions we can generate such sentences as:
The stages in the generation of these examples are as follows:
By appropriate selection of words to substitute for Determiner, Noun, and Verb we can produce the 3 example sentences above.
The simple grammar defines a few words which can be substituted for Determiner, Noun, and Verb. There are two possibilities for Determiner, three for Noun, and three for Verb, with both Determiner and Noun appearing twice in the expansion. Hence the total number of grammatical sentences which can be generated by this simple grammar is 108 (= 2 * 3 * 3 * 2 * 3). In this particular case the possible words have been carefully chosen so that all of these 108 sentences also actually mean something.
Note that the use of italics and boldface in this example is merely a convenient way of indicating the difference between items which can be expanded further (e.g. concepts such as NounPhrase) and items which can not be expanded further (e.g. words such as girl). In the linguistic literature, items which can be expanded further are referred to as 'non-terminal' items, while items which can't be expanded further are known as 'terminal' items.
The 'comma' notation for alternatives is a form of shorthand. In some formulations this would be shown as several separate rules, so that:
Determiner -> a, the
could be shown as two rules:
Determiner -> a Determiner -> the
While a generative grammar may produce sentences which are grammatically correct, there is no certainty that such sentences will actually mean anything. The classic example of such a meaningless sentence is from Chomsky's book [4] with the grammatically correct sentence:
Another problem when analysing a sentence is that many English words have multiple meanings, so that there may be more than one way to interpret a sentence. A simple example is the sentence:
Note: this section and subsequent sections are, by necessity, slightly more technical/formal than the earlier parts of this article, though specialist notation and abbreviations have been largely avoided. Make sure that you understand the 'Simple example' above before proceeding further, since references will be made to that example to help explain the technicalities, and variations on that example will be used to illustrate new ideas.
A formal generative grammar has four components (as described in slide 19 of this lecture [6] ):
Notes:
An relatively small change to even a single rule can make a large change in the language which can be generated. For example, change the first rule in the simple example to read:
Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and NounPhrase VerbPhrase
This means that a Sentence is either a NounPhrase followed by a VerbPhrase or a NounPhrase followed by a VerbPhrase followed by and followed by a NounPhrase followed by a VerbPhrase. In addition to the shorter sentences already discussed, this could also generate a sentence such as: "A dog saw a girl and the dog chased the girl". The total number of grammatically-correct sentences which can be generated by this modified grammar is now 11,772 (108 + 108*108), though some of these may not be very meaingful.
A significantly different rule change can have an astounding effect. Consider changing the first rule to read:
Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and Sentence
This can generate a potentially infinite number of sentences (any number of these five-word sentences linked by and). This is an example of what is known as a recursive definition, whereby a Sentence is partially defined in terms of itself. In such a recursive rule there must be at least one transformation option which does not involve Sentence, otherwise the transformation could never stop.
The above change is known as right-recursive; a left-recursive formulation is also possible, and can in this example generate exactly the same sentences. As shown below under 'Context-free grammar', there are situations where the choice of left- or right-recursive formulations can change the meaning.
Sentence -> NounPhrase VerbPhrase, Sentence and NounPhrase VerbPhrase
The distinction between left- and right-recursive grammars becomes important when trying to parse/analyse a sentence; some types of parser work best with left-recursive grammars, while other types work best with right-recursive grammars. This distinction is particularly relevant in connection with compiling or translating a program written in some high-level programming language so that it can be used on a computer.
The use of left-recursive or right-recursive rules may also affect the meaning, as shown in the 'arithmetic expression' example below.
The transformation rules for context-free grammars are fairly simple:
In the linguistic literature, [nothing] is often represented by the symbol ε.
The following example shows how context-free grammars can cope with arithmetic expressions involving possibly nested brackets. Also an expression such as "2 + 3 * 4" could be interpreted as either "(2 + 3) * 4" = 20 or as "2 + (3 * 4)" = 14. The latter interpretation is that which is normally used - multiplication "*" is said to have a higher priority than addition "+" - so the grammar must incorporate this priority concept. Furthermore, consecutive operators of the same priority are normally applied from left to right, so that "8 - 2 + 3" is interpreted as "(8 - 2) + 3" = 9, rather than as "8 - (2 + 3)" = 3. This ordering must also be built in to the grammar.
To avoid distracting side-issues, this example only deals with numbers which consist solely of a single digit. The example is a simplified version of the definition of an arithmetic expression given in the Revised Algol 60 Report (Section 3.3.1). [7]
Expression -> Term # an Expression is either a Term Expression -> Expression AddSubOp Term # or an Expression followed by + or - followed by a Term # Term -> Primary # a Term is either a Primary Term -> Term MulDivOp Primary # or a Term followed by * or / followed by a Primary # Primary -> Number # a Primary is either a Number Primary -> ( Expression ) # or an Expression enclosed in brackets # MulDivOp -> *. / # multiplication and division have the same priority AddSubOp -> +. - # addition and subtraction have the same priority Number -> 0, 1, ... 9 # this simple-minded definition of a Number is enough for this example
Notes:
Application 1: generate 2 + 3 * 4
Expression -> Expression AddSubOp Term -> Term AddSubOp Term -> Primary AddSubOp Term -> 2 + Term # + can not be applied until Term has been evaluated -> 2 + Term MulDivOp Primary -> 2 + Primary MulDipOp Primary -> 2 + 3 * 4
Application 2: generate 2 * 3 + 4
Expression -> Expression AddSubOp Term -> Term AddSubOp Term -> Term MulDivOp Primary AddSubOp Term -> Primary MulDivOp Primary AddSubOp Term -> 2 * 3 AddSubOp Term # Term to left of AddSubOp is evaluated first -> 2 * 3 AddSubOp Primary -> 2 * 3 + 4
Categories: [Linguistics] [Computer Science]