Jump to content

Talk:Context-free grammar

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

Derivations and syntax trees

[edit]
The article presents this example:

For example, if we take the following grammar:

(1) S → S + S
(2) S → 1

and the string "1 + 1 + 1" then the left derivation of this string is the list [ (1), (1), (2), (2), (2) ]. Analogously the rightmost derivation is defined as the list that we get if we always replace the rightmost nonterminal first. In this case this would be the list [ (1), (2), (1), (2), (2)].

However, we could just as easily replace the rightmost S with S+S for the rightmost case, or we could replace the leftmost S with 1 for the leftmost case. In other words, this section should be rewritten with a better (less ambiguous) example RaulMiller 02:32, 21 Sep 2005 (UTC)
The "1 + 1 + 1"-example is really hard to understand. There is (if I understood it correct) two ways to do rightmost derivation, and two ways to do leftmost derivation. In adittion the method that gives [ (1), (2), (1), (2), (2) ] actually applies to both leftmost- and rightmost derivation? Things would probably be better explained to the reader if another example is used. [Guest user], 3 Dec 2005

It is necessary to add, that arrows outgoing from node are ordered. AlexShkotin 05:14, 6 September 2006 (UTC)[reply]

The current example is:
(1) S → S + S
(2) S → 1
(3) S → a

and the string '1 + 1 + a'. Still the parsing process is not determined (neither righmost nor leftmost derivation) because there always is a choice between first expanding (via rule (1)) or terminating (via rules (2) or (3)). Although leftmost and rightmost derivation is clearly distinguished by the order of application of rules (2) and (3), this is not clear from the text. I am not sure whether an altogether different example is useful or if it is sufficient to mention point out that there are different derivation path for each type of derivation. --Larsipulami (talk) 15:20, 10 December 2009 (UTC)[reply]

I have significantly expanded Example 3. Does that text provide adequate explanation, or is there still something amiss? The idea is to introduce leftmost derivations in the next example. Rp (talk) 18:52, 10 December 2009 (UTC)[reply]
This section is still ambiguous. As far as I can tell, there is no explanation given as to why the first rule is used first for the left most derivation. From a readers perspective, you could just as easily apply the second or third rule and terminate the parsing. - 80.176.156.118 (talk) 14:03, 29 March 2011 (UTC)[reply]
There's nothing wrong with that, except that you'd be deriving a different string! I have expanded the explanation even further to clarify, but I feel the amount of detail on parse trees and derivations has gone far beyond what an encyclopedia article on cfgs should provide, and there is quite a lot of redundancy now. Rp (talk) 15:22, 30 March 2011 (UTC)[reply]
The section claims to be validating a specific string, not creating a different one, so creating an incorrect one seems more like a mistake due to the ambiguity than a choice. This problem is gone now, more from a removal of any such explanation than clarity. - 80.176.156.118 (talk) 13:06, 31 March 2011 (UTC)[reply]
Yes, I rewrote some things; if this still isn't clear enough, of course your remark can be added, or the context can be removed entirely by not starting with a specific string. What do you prefer? Rp (talk) 18:31, 31 March 2011 (UTC)[reply]
Hi, I modified one of the leftmost derivation examples (it was using production rule 1 before the first use of production rule 2). I also altered the corresponding syntax tree, but it is now the same as the rightmost derivation tree, not sure if I have done it correctly, I am new to this. Check my work please — Preceding unsigned comment added by 2.120.159.191 (talk) 16:48, 7 July 2011 (UTC).[reply]
I'm not sure what you did, as I'm also not a frequent editor and can't be arsed finding your edit, but it seems to me you might have undone whatever these guys were fixing. It's still not good, I came here precisely because of that, and I've seen the same bad explanation in other websites. Saying something like "in a leftmost derivation, we expand the leftmost variable first" is not a good explanation, it is meaningless unless "first" has an effect on which production rule you pick.
Its meaninglessness is clear when you think of parse trees. If you have a bag of Christmas tree balls, it makes no sense to talk about a "rightmost Christmas tree" because you placed the balls on the rightmost branches first, unless you are forced to first place all the red balls, and then the blue ones when you run out, for example. Nuno pereira69 (talk) 21:45, 12 May 2022 (UTC)[reply]
It doesn't make much sense to reply to an 11 years old post, or to try to find out what edit it referred to at that time, and what (if anything) of it survived in the current version. If you can express your problem with the current text, I'd like to fix it in cooperation with you. Imo, your christmas-tree example misses the point. "Parsing" means searching for a derivation tree such that its projection to strings matches the input string. Different tree structures may result in the same string, e.g.
 + +
 / \ / \
 + a 1 +
 / \ / \
 1 1 1 a
 and both project to
 1 + 1+a 1+1 + a
The left tree is obtained by a leftmost derivation, the right tree is not. In contrast, your christmas tree example already starts from a given tree structure (rather than from a given string). Jochen Burghardt (talk) 08:35, 13 May 2022 (UTC)[reply]

Decidability

[edit]
It is not possible to construct a general algorithm which takes as input two context-free grammars and decides whether the two grammars generate the same language; nor is it decidable whether their languages have a single string in common. It is however possible to decide whether a given context-free grammar generates a non-empty language or not, and it is also possible to decide algorithmically whether a given context-free grammar generates an infinite language or not.

Is it possible to decide whether given context-free grammar is equivalent to given regular grammar ? In particular is it possible to decide whethere there is single string that doesn't belong to context-free language (that is, whether or not is it equivalent to .*) ? Taw 14:41 Apr 6, 2003 (UTC)

The answer for your first question is Pumping lemma. -- Sundar 05:08, Feb 15, 2005 (UTC)
And the answer to your second question is yes. See Rice's Theorem. -- JBolla
It seems both answers are wrong. It is undecidable whether a given context-free grammar generates all words over its alphabet. Jochgem 23:45, 11 March 2007 (UTC)[reply]
The first answer is indeed wrong as well: it is undecidable whether any given context-free language is regular (I was surprised to learn this, too. Google for references). The pumping lemma isn't good enough: some non-regular languages can be pumped. Rp (talk) 23:29, 11 April 2008 (UTC)[reply]
I have tried to search for results about "whether any given CFL is regular" without success; the terms I have come up with appear together too often in other contexts. Can you reproduce the search that led to your earlier discovery? 128.105.181.52 (talk) 19:41, 2 February 2012 (UTC)[reply]
I added a source, found by searching for the terms "context-free", "is regular", and "undecidable" in Google books. —David Eppstein (talk) 20:10, 2 February 2012 (UTC)[reply]

On the page Context-free_language#Decidability_properties is is stated:

  • Intersection Emptiness: Given two context-free grammars A and B, is  ? However, the intersection of a context-free language and a regular language is context-free,[1] and the variant of the problem where B is a regular grammar is decidable.
  • Containment: Given a context-free grammar A, is  ? Again, the variant of the problem where B is a regular grammar is decidable.

And judging from Regular_grammar and Regular_language it seems that regular languages can easily be described by context-free grammars (which is natural, as any regular language is also context-free), so a right-linear CFG or a left-linear CFG generates a regular language. So there are *some* CFGs for which regularity is decidable, right? I am not questioning the theory, I just think it maybe could be explained more clearly? --Lasse Hillerøe Petersen (talk) 20:21, 7 December 2014 (UTC)[reply]

What's unclear? The general problem is hard. Some special cases of it are not hard. That seems to be what we say. —David Eppstein (talk) 21:21, 7 December 2014 (UTC)[reply]
My guess: the sloppy way such problems are traditionally formulated. The theorem as usually stated: it is undecidable whether a context-free language is regular. This is a very sloppy formulation: it doesn't state how we know that the language in question is context-free. Properly stated, the theorem says: it is undecidable whether, given an arbitrary context-free grammar (or, equivalently, an arbitrary pushdown automaton), the language it generates (or recognizes) is regular. In other words: whether some left-linear or right-linear grammar describes the exact same language. Strictly speaking, this is a stronger theorem. Now, it is immediately clear that if the given CFG happens to be left-linear or right-regular already, the affirmative answer follows immediately. All the more surprising that it there are so many other ways to describe regular languages with CFGs that it is impossible to devise an algorithm that decides for any given CFG whether it describes a regular language. Rp (talk) 23:20, 7 December 2014 (UTC)[reply]

References

  1. ^ Salomaa (1973), p. 59, Theorem 6.7

Syntax or semantics

[edit]

Context-free grammars are important because they are powerful enough to describe the syntax of programming languages

(Well...context-free grammars can describe most of the syntax of programming languages. For example, any programming language that requires that variables be declared before use is not fully describable via a context-free grammar, essentially because there can be arbitrary amounts of source code between declaration and use--see the reference to the "pumping lemma" below. Such constraints are typically swept over into the corner, labelled "semantics," and dealt with outside of the parser mechanism proper. In the example, "semantic action" code is attached to the grammar rule that recognizes identifiers to check against a symbol table.) --24.36.158.101

Right, the requirement to declare before use is called "semantics" rather than "syntax", so the sentence in the article is correct: the syntax of programming languages is describable by context-free grammars. --LC
Not exactly - it's debatable whether "declare-before-use" is really "syntax" or "semantics". In C and C++, for example, it's impossible to check whether a source file is syntactically correct without building a symbol table while parsing that can at least distinguish between typedef'd and "regular" identifiers. The syntax of C and C++ critically depends on the distinction betwen those two forms of identifiers, which depends on what you're calling semantic information. In addition, there's plenty of other such stuff in various languages that's usually "swept into the corner" but is unquestionably syntax. For example: the ad hoc precedence rules built into parser generators to resolve dangling else and similar conflicts; the whole "longest match" convention that is built into lexical analyzers; the layout mechanisms of many more recent languages such as ML, Haskell, and Python. Then there's the fact that the C++ grammar has ambiguities that cannot be resolved by any finite amount of lookahead, and so the C++ standard simply states ad hoc disambiguation rules which implementors have to follow by making the parser capable of looking ahead an arbitrary distance in certain situations. As far as I can tell, practical languages that really follow a "pure" context-free syntactic paradigm are rather few and far between. Brynosaurus 15:00, 13 Aug 2004 (UTC)
It's my understanding that because of these facts C, C++, and the other languages do not have context-free grammars. A context-free grammar is an ideal for a computer programming language which is seldom achieved - basically because many things which are very useful for programming languages are not possible in a context-free grammar.
This is the strict sense of "context-free" grammar. When programmers think about "grammar", "syntax", and "semantics" they usually have much looser usage of these terms. — Hippietrail 02:13, 14 Aug 2004 (UTC)
Quite so. It is certainly almost universal practice to use BNF notation informally as a method of concisely describing or suggesting the general, overall structure of a language's syntax. But it is almost never the case that these BNF notations actually formally specify the syntax of the language by themselves. Brynosaurus 06:18, 14 Aug 2004 (UTC)
I have added some words in the article in the hope of clearing up the confusion. It is a fact that the equivocation of 'syntax' and 'context-free grammar' is common in computer science.

Phrase structure grammars

[edit]

Phrase structure grammar and context-free grammars


There is a redirect from Phrase structure grammar to this page.

Does this mean that Phrase structure grammars are just context-free grammars in any case? Isn't there more to say about Phrase structure grammars?

PSG isn't a precise term. CFGs are definitely PSGs, but there are also CSGs (context-sensitive grammars), for example, which could also be described as PSGs. The redirect seems sensible enough to me for the moment, since it's hard to imagine that an article on PSGs could be written which didn't just duplicate material in Context-free grammar and Chomsky hierarchy. I agree though, that the redirect is misleading. Perhaps it would be better to have a stub with links to Context-free grammar and Chomsky hierarchy? Cadr 23:11, 14 Feb 2005 (UTC)

It ought to be a stub if they are not equivalent.Otherwise it is misleading. -- Sundar 05:08, Feb 15, 2005 (UTC)

I agree with Sundar. –jonsafari 22:38, 18 June 2007 (UTC)[reply]

I did some more looking around on this. Here's the score from the most reputable sources I could find:

  • Chomsky Three models for the description of language (1956) introduces the term "phrase structure grammar". Here he is clearly referring to a CFG.
Nope, what he defines is in Wikipedia under formal grammar (I always hated the name of that article). Rp (talk) 15:10, 28 September 2009 (UTC)[reply]
  • Jurafsky and Martin's Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (2000) says PSG is a synonym for CFG.
They're wrong.
  • JE Hopcroft and JD Ullman, Introduction to Automata Theory, Languages, and Computation (1979) might in some sense be more reputable than Jurafsky and Martin, and says PSG refers to anything in the Chomsky hierarchy. This might reflect a change in usage between the introduction of the term and this work's publication.
It doesn't - they use Chomsky's definition. Rp (talk) 15:11, 28 September 2009 (UTC)[reply]
  • Lewis and Papadimitriou Elements of the Theory of Computation (1998) doesn't address the issue.

So this seems like a disputed usage. At any rate, no one uses the term in any current work. The stub idea seems reasonable. kraemer 00:39, 20 June 2007 (UTC)[reply]

I changed it to a stub. Cadr 18:21, 21 June 2007 (UTC)[reply]
As far as I can tell: computer science doesn't use the term phrase structure grammar at all, while it is the standard term in (generative) linguistics. Hence the confusion. Rp (talk) 23:32, 11 April 2008 (UTC)[reply]

Natural languages

[edit]

It would be nice to include some statement about whether natural languages are context-free in this article. On this topic, the article Bambara language has the following note:

In mathematical linguistics Bambara is regarded with interest, since for only very few languages it was possible to show that they were not context-free. For Zurich German and Dutch the proof is based on sentence construction, whereas the proof for Bambara is based on word construction.

I wonder what is statement is really supposed to say, because I find it a bit difficult to believe. Can it really be so hard to find an example showing that English is not context-free? --Saforrest 12:37, Apr 19, 2005 (UTC)

BNF and Production rules

[edit]

The article states that "BNF (Backus-Naur Form) is the most common notation used to express context-free grammars.", but still the article make use of production rules like V→w instead of V::=w, is that because production rules and BNF are interchangeable ? --Robsy 11:32, 2 May 2006 (UTC)[reply]

No, it is because theoretical articles about context-free grammars use the arrow, while the BNF notation is pretty much restricted to manuals or textbooks that describe the syntax of a concrete language. The reason for the difference, by the way, is that BNF was designed for ASCII-only text. Rp 09:28, 25 July 2007 (UTC)[reply]

Lojban

[edit]

I doubt that Lojban is actually context-free. It probably were context-free if none of the terminator cmavo were elidable. Icek 22:53, 28 June 2006 (UTC)[reply]

It is parsable at the word-level with an LALR parser (there is at least one implementation of a parser done using Yacc). However, should it really be listed on the context free grammar page (I'd argue not)? Saying it has "an immense expressive power" is definitely POV and there is no citation, so I'm removing that part. JordanDeLong 03:34, 24 August 2006 (UTC)[reply]
AFAIK it is not possible without some preprocessing. From the Lojban Reference Grammar[1]: Lojban is not in itself LALR1. There are words whose grammatical function is determined by following tokens. As a result, parsing of the YACC grammar must take place in two steps. [...].
The problem that I see with the alleged context-freeness of Lojban is as follows: Consider a Lojban sumti (an argument to a predicate) which contains subordinate clauses which contain the words be (marks the beginning of the relative clause) and be'o (which ends a relative clause). The subordinate clause may itself contain subordinate clauses and so on. However deeply nested the subordinate clauses are, all the be'o terminators may be elided if there is a single ku terminator terminating the sumti (and if there are no ambiguities stemming from multiple subordinate clauses on the same level, for further details see [2], section 7). Icek 20:15, 25 August 2006 (UTC)[reply]
Got ya: I think you're right. I went ahead and removed it from the page. It may need some other items in "Other Examples" now, though. JordanDeLong 02:55, 26 August 2006 (UTC)[reply]

How useful is this page?

[edit]

You either understand all this jargon, in which case you already know what a "context free grammar" is...or you don't, and this is Greek to you.

Useless.

The comment above is useless. I found the explanation on this page clear and helpful. --Whisk3rs 00:07, 17 May 2007 (UTC)[reply]

I strongly disagree. An encyclopedic article is supposed to be understandable for the interested layman. This article in its present state is not; it presupposes familiarity with the basic notions of formal language theory. The article would greatly benefit from clear examples, and external references to more introductory material. Rp 09:32, 25 July 2007 (UTC)[reply]
As someone who has studied this topic in depth, it's hard for me to get perspective on just what is hard for the layman to understand. Do we have a layman that is willing to make specific criticisms that I can address? --Doradus 06:15, 27 July 2007 (UTC)[reply]
I too have just visited the page and after a couple of readings found it quite confusing, clearly written by and for people who already understand. To Address Doradus's request...
"In formal language theory, a context-free grammar (CFG) is a grammar in which every production rule is of the form
V → w where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (possibly empty)." In this sentence every key word is linked, so readers coming here must immediately leave and try to understand other pages before they can even understand the introduction. I fully understand that prior information is required to understand some topics, but I think the intro should make some attempt to explain itself in simpler terms that are understandable in themselves.
"The term "context-free" expresses the fact that nonterminals can be rewritten without regard to the context in which they occur." This is a most confusing sentence! What does 'rewritten' mean? What 'context'? Are we talking about letters? Words? Clauses?
"A formal language is context-free if some context-free grammar generates it." Again, I don't know what this is trying to say. I instinctively want to put this sentence in the passave (eg. '...if it is generated according to a context-free grammar'). It just seems like a tautology.
"These languages are exactly all languages that can be recognized by a non-deterministic pushdown automaton." Is 'exactly' needed here? 'Recognized'? How can an automata recognize anything? This seems to be a misuse of wording. The correlate on the pushdown automata page is better... "Pushdown automata are equivalent to context-free grammars: for every context-free grammar, there exists a pushdown automaton such that the language generated by the grammar is identical with the language generated by the automaton, which is easy to prove.".
This was just the introduction, but I feel this is the weakest part of the page. It would be great if someone with a clear understanding of the topic could make this more accessible (I don't mean this as a criticism of the original writers though - there are many scientific pages on wiki that start out like scientific papers and need to be clarified for the average reader.) --218.143.30.7 (talk) 01:20, 10 December 2008 (UTC)[reply]
The basic problem is that a context-free grammar is just a particular type of grammar so we don't want to repeat all the things that really belong in an explanation of what a grammar is, yet people clearly expect to be able to read this article without reading the article on (formal) grammar first. How to resolve this, I don't know. Rp (talk) 22:11, 21 February 2009 (UTC)[reply]
I don't think we can solve this, you really do need to know the basics of formal language theory or automata theory in order to understand this content properly, especially because it is so abstract. We should not remove important information simply because it confuses some people. It would be better just to introduce the material in a more intuitive manner, and then move onto formal definitions. Ben1220 (talk) 07:31, 30 November 2009 (UTC)[reply]
I kinda agree. The only real point of this concept is that a context-free grammar is one that can be put on a linear tape, symbol by symbol without losing any meaning. The purpose of such is really for loading the tape for a Turing Machine. Average64 (talk) 20:41, 26 October 2013 (UTC)[reply]

I tried to add a simpler explanation that most people could understand to the introduction just now. — Preceding unsigned comment added by Programmers are evil (talkcontribs) 20:24, 27 October 2015 (UTC)[reply]

I am a computer scientist - the first part of this page is very off-putting because it jumps straight into the formal definition. It looks like it has been written by someone who does not have a clue what this subject is all about (not that i am necessarily suggesting that is the case). If they did then they would write something that can be understood by a human being. This page needs to be rewritten in such a way that a newcomer to the subject can see immediately what it means without seeing some formal (and potentially cryptic looking) mathematical statement. The formal definition should at the very least come a sentence or 2 after the basic/ friendly description. I am looking into a way of rewriting it somewhat but people may consider it overly complex - which may make it even worse or confusing.81.103.229.191 (talk) 23:37, 16 March 2016 (UTC)[reply]
I have just rewritten the first bit to try to make this page actually usefull. I presume the so called experts will tear me to shreds for it ;). However, I'll note that the origional version clearly didn't make sense to anyone because it contained some obvious BS... Tim.thelion (talk) 20:14, 3 June 2016 (UTC)[reply]
Sentence Noun_Phrase Verb_Phrase
Noun_Phrase Article Noun
Verb_Phrase Verb Prepositional_Phrase
Prepositional_Phrase Preposition Noun_Phrase
Article the
Noun cat
Noun mat
Verb sat
Preposition on
Context-free toy English grammar
Derivation of example sentence The cat sat on the mat

I agree that an informal motivation would be a good idea; I'm thinking about a natural language example along the image shown here. We could give a small excerpt of an English grammar that just generates the sentence. - Jochen Burghardt (talk) 22:20, 19 March 2016 (UTC)[reply]

A first attempt to realize my above suggestion is shown to the right. I hope the tree image might be widely familar from English school education. The image caption could explain some issues of CFGs. - Presentation problems are: the grammar takes too much horizontal space; however, when the syntactical categories are abbreviated (like "VP" for "Verb_Phrase") to save space, they'd need to be explained. commons:Category:Syntax trees in English contains 159 syntax trees, but none of them fits perfectly here; maybe we need to design a 160th one. - Jochen Burghardt (talk) 20:04, 25 March 2016 (UTC)[reply]

Clarify clarify and clarify more

[edit]

A starting point would be to add an explanation of the used symbol S to the first example. It represents a grammar? In that case, would G be more appropriate? No? Why? Also, the use of the arrow symbol (product?) could be linked to an article describing that symbol etc. The use of symbols instead of human language warrants for an easy reference material to symbols used. Otherwise the use of symbols serve only an obfuscation purpose, being informative only to those already adept in the topic, which doubtfully is the purpose of Wikipedia.

The link is right there; click on grammar. Rp (talk) 15:15, 28 September 2009 (UTC)[reply]

AnBnCn Example and PEGs

[edit]

This particular language can be generated by a parsing expression grammar, which is a relatively new formalism that is particularly well-suited to programming languages.

I've cut this from the head of the article. PEGs are relatively new, and I see no reason why they would be differentiated from any of the number of well-established formalisms that can also accept the language AnBnCn, especially since this is the canonical type 1 language (context-sensitive).

context free grammar page suggestion - request for clarity

[edit]

68.122.42.35 (talk) 03:33, 25 October 2008 (UTC)[reply]

In the article "context-free grammar" the first paragraph could use a little clarity.

The following sentence: "The term "context-free" expresses the fact that nonterminals can be rewritten without regard to the context in which they occur" could really use an example or another sentence or two to describe it. I was unable to figure out what this meant, possibly not because I'm stupid. (i.e.: I'm not stupid, which may or may not be why I couldn't understand the above sentence :P )

I am not formally trained, but then, that could be seen as permissable for wikipedia.

Thanks to all you geniuses who are teaching me math and computer science.

I originally wrote it so I have attempted to clarify it; however, I'm meeting an obstacle. A context-free grammar, as the article states, is a particular type of grammar. The remark is indeed not understandable if you don't know how a grammar is used (namely, by using its production rules to rewrite strings of symbols until a string is produced that contains no nonterminal symbols, and considering the total set of terminal strings that can be produced in this way from some initial symbol). But reiterating this in this article would be redundant to anyone who has bothered to look at the grammar article. You are assumed to follow the link to grammar if you don't know what is meant by that. Apparently, you didn't, which suggests that many other readers won't do it, either. If you can suggest an elegant way out of this dilemma, please let me know. Rp (talk) 22:50, 9 January 2009 (UTC)[reply]
Yes, Rp, it's an ugly conundrum often encountered in all kinds of technical writing. There are a number of ways I've seen used to resolve it:
  1. (Possibly the least effective:) Pretend there's no problem, and hope the readers are so keen to understand that they'll follow every link you give them.
  2. (Almost as bad:) Tell the reader that the following explanation will only make sense if one first familiarises oneself with certain prerequisites, which you then list.
  3. (Better than nothing:) Prominently rank the article as "Master-class" or "Genius-class" on the difficulty scale. Who knows, it may motivate the readers to try harder!
  4. (A workable compromise:) Write a brief introductory sentence (or paragraph) that summarises the prerequisite knowledge, perhaps informally, perhaps not. Your paragraph above comes close to doing the necessary. (Optionally also preface the introduction by saying that more advanced readers can skip the following paragraph, page or section - and try not to make the rest feel inferior or patronised! How many of my uni maths texts said: "Recall that ... {Uglich's Barking Lemma here restated in five lines of incomprehensible formalism}"!?)
  5. (Probably the most effective:) Provide pop-up or hover- "hint text" balloons; see for example almost any Microsoft online help documentation, which uses a dashed underline to signal that hint text is available. AFAIK, MediaWiki doesn't support such a mechanism; but maybe it should? Personally, I appreciate this approach, as it lets me brush up on forgotten detail without losing my place on the page; it's a minimal distraction to the main task at hand.
HTH. yoyo (talk) 19:59, 30 April 2009 (UTC)[reply]
I'd go for 4, obviously. Your last suggestion is great, but I'd implement it as a general Wikipedia user interface feature, i.e. for arbitrary links instead of specific links. Like Google Maps does with its Wikipedia links. Rp (talk) 16:10, 3 June 2009 (UTC)[reply]
[edit]

I stumbled upon a tool that produces images from context free grammars. Here is the link http://www.contextfreeart.org/index.html 140.203.154.11 (talk) 18:37, 11 November 2008 (UTC) mihai andrei[reply]

Nice; this kind of stuff is typically associated with L-systems, which reminds me that L-systems should definitely be mentioned in the article. Please do ... Rp (talk) 22:52, 9 January 2009 (UTC)[reply]

Cycles?

[edit]

Under the formal definition section, the last in Additional Section 5 says "no cycles". In this context, what exactly is a cycle? 66.69.234.193 (talk) 22:05, 24 January 2009 (UTC)[reply]

The only reasonable cycle that comes to my mind is something like this S->BA,B->S. If you have no that kind of cycles, you can generate just finite amount of products from the CFG and thus the generated language is star free. One might want to use that kind of CFG to compress text. —Preceding unsigned comment added by Hiihammuk (talkcontribs) 12:44, 25 January 2009 (UTC)[reply]

I have added formalizations of the criteria for properness, please verify. Rp (talk) 22:13, 21 February 2009 (UTC)[reply]
Think you may have a typo:
no cycles: \neg\exists N \in V: N \stackrel{+}{\Rightarrow} N
should perhaps instead read:
no cycles: \neg\exists N \in V: N \stackrel{*}{\Rightarrow} N
yoyo (talk) 19:31, 30 April 2009 (UTC)[reply]
It's not a typo: a zero-step derivation from N to N is not a cycle. Rp (talk) 16:06, 3 June 2009 (UTC)[reply]

Decidability again

[edit]

I have read this article a few times over the months, and never noticed that it mentioned (in the Normal Form section) that the membership problem is decidable (i.e. determining whether a string is generated by a given grammar). I think this should get more prominent mention in the article. Maybe in the summary? It at least shouldn't only be buried at the end of the normal form section. --pdf23ds 24.174.103.112 (talk) 18:17, 17 May 2009 (UTC)[reply]

This is implied by the remarks about parsing algorithms. Incidentally, I believe the statement while the widely used LR and LL parsers are more efficient algorithms that deal only with more restrictive subsets of context-free grammars is incorrect: they aren't more efficient, only simpler. Rp (talk) 15:19, 28 September 2009 (UTC)[reply]
The text now reads: Deterministic grammars (...) These classes are important in parsing: they allow string recognition to proceed deterministically, e.g. without backtracking.' This is still wrong: arbitrary grammars can be parsed without backtracking, using essentially matrix multiplication, see Cocke-Younger-Kasami algorithm and Earley parser. Rp (talk) 11:42, 9 December 2009 (UTC)[reply]
I have rewritten these paragraphs - please review. Rp (talk) 20:09, 31 March 2011 (UTC)[reply]

Chomsky's beliefs

[edit]

I like Likebox's edits a lot in general, but I'm not so happy about the references to Noam Chomsky's beliefs. The article should provide facts about cfgs, not beliefs. And while they are at the core of Chomsky's frameworks for describing syntax, they were always complemented with transformations in some form, so the reader shouldn't be left with the impression that syntax == context free grammar in Chomsky's eyes. Rp (talk) 19:32, 11 January 2010 (UTC)[reply]

Yes, I suppose that might belong elsewhere. The only reason I placed it there is because I think that the only observation Chomsky made that is not controversial is that context free grammars are very special types of grammars, and that context free grammars are suited to describe all of the complex grammatical clause-nesting/adjective-listing in natural language. Everything else he says is controversial, but I thought that this part is univerally accepted.Likebox (talk) 03:42, 12 January 2010 (UTC)[reply]

I've modified the section slightly, because Chomsky has never believed that language is strictly context free, but rather that it has a context free component who's output is manipulated by further decidedly non-CF mechanisms (at the time, transformations, and currently, movement). This is clear in his work, when he addresses the inadequacy of CFG for various phenomena. --Augur (talk) 08:18, 1 February 2010 (UTC)[reply]

Thank you-- I am sorry for not being completely accurate about this, I always mean to, but I never read enough of Chomsky's work.Likebox (talk) 17:53, 1 February 2010 (UTC)[reply]

True context free grammar vs. log-context free grammar

[edit]

There is a distinction between the grammar of parentheses matching (()()) and the grammar of matching two types of parens (([][])[()]). The first language can be approximatly parsed by a finite automaton exponentially well, meaning that while it is technically context free and requires a stack, a finite automaton with a counter of n-bits can match strings of matching parens up to a depth of 2^n. Effectively, this means that a regular grammar will work to match the strings, and I wouldn't call that a real context free grammar. I would call it a "log context free grammar". I don't know what the official term is, if there is one.

On the other hand, with two types of parens, n bits can only work up to a depth of n. This is a true context free grammar, it can't be approximated well by a regular grammar, since you need to push the type onto the stack. The languages of interest, natural and computer languages, are true context free grammars.Likebox (talk) 13:55, 19 January 2010 (UTC)[reply]

Yes, with a finite nesting level, the stack is finite, so the language is regular. Rp (talk) 23:08, 26 February 2010 (UTC)[reply]
Of course that's true, but that's not what I was saying. I wanted to point out that there is a distinction between matching one type of parentheses and matching two different types of parentheses. Using a finite memory counter with n bits of memory, you can match 2^n levels of parentheses by using this stupid trick: scan the string left to right, increment the counter wheneve you see "(", decrement the counter wheneve you see ")" and make sure the counter is always positive.
This stupid trick doesn't work when you have two different types of parentheses. For two types of parentheses, no combination of counters will work--- you need a real stack. What this means is that parentheses matching is exponentially well approximated by a regular grammar, while two-types of parentheses are only at best linearly approximated by any regular grammar. This distinction is probably not original to me. I was hoping for a source.Likebox (talk) 23:23, 26 February 2010 (UTC)[reply]
A more language theoretical way of saying the same thing: if I want to make a finite automaton match parentheses up to n-levels deep, I need exactly n-states, which represent the nesting depth. If I want a finite automaton to match two types of parentheses to n-levels of depth, I need 2^n states. Every context free language has a complexity constant C, the "exponential growth number" which tells you that to match expressions of depth n you need C^n states. For one-types of parentheses, and for the most commonly cited examples of context free grammars, the constant C is 1, which is a ridiculously special case--- I think it should be classed as an intermediate stage between regular and true context free.Likebox (talk) 23:32, 26 February 2010 (UTC)[reply]

Natural human languages never allow for overlapping constructions

[edit]

Never? How then does one account for such a construction (Catullus):

Sed mulier cupido quod dicit amanti in vento et rapida scribere oportet aqua

Transliterate it and any linguist will tell you.Likebox (talk) 23:52, 24 January 2010 (UTC)[reply]

Thank you. I just so happen to be a linguist. By "transliterate" I suppose that you meant "translate". So here:

Word for word :

But [a] woman eager what [she] tells [a] lover in [the] wind and swift [to] write one-should water

In English now:

But what a woman tells an eager lover ought to be written in the wind and in swift water. —Preceding unsigned comment added by JacquesGuy (talkcontribs) 00:12, 25 January 2010 (UTC)[reply]

By transliterate, I meant say what each word is, and any dangly modifiers that it has, like one that makes it act as a subject or something.
First off, is this poetry? Poetry allows some crazy constructions, like
Along the beach he said we should walk.
Which would not seem to normally be allowed, because "Along the beach" is attached to a clause buried a level down. I don't speak many languages, and all the languages I speak lack those dangly parts of words that tell you what part of speech they are. So perhaps there are overlapping clauses in latin.
Secondly, I think that you are purposefully obscuring the latin meaning. I think a better translation would be:
But a woman is eager that what she tells a lover in the swift wind should be written, and in the water.

"Cupido" is masculine dative, "eager" in "a woman is eager" is feminine nominative, i.e. "cupida"

Go learn Latin. End of discussion. —Preceding unsigned comment added by 61.68.162.228 (talk) 06:15, 26 January 2010 (UTC)[reply]

This preserves the word order and makes the non-overlapping clauses manifest. On the other hand, for all I know, Latin just might admit overlapping clauses. Please educate me if it does.Likebox (talk) 01:02, 25 January 2010 (UTC)[reply]
The point of the example is to explain the central role of Chomsky's "merge" operation. What this does is merge together parts of a sentence into larger objects. Merge only acts one way, so that in the sentence:
The man walked to the store is green
(The man (walked to the store) gets merged, and then "is green" doesn't have anything to merge with. There is no analogous "split" operation which would allow two-sided merge like this:
The man walked to the store---the store is green.
Where a split of the store allows it to merge with the predicate on the right and simultaneously with the rest of the stuff on the left. It is this property which makes language special, and leads it to be described by a parse tree, and not by a parse graph. But many linguists dispute Chomsky's analysis, and having seen the crazy example of Piraha, I am not certain of anything anymore.Likebox (talk) 01:13, 25 January 2010 (UTC)[reply]
Perhaps what we're trying to tell you is that perhaps you should be cautious about editing this article until you learn that linguistics didn't exactly start with Chomsky, nor will it end with him; and formal language theory, as used in computer science, on which he has had a great impact, is a related, but distinct discipline. (I do appreciate your attempts to clarify certain issues.) Rp (talk)
Linguistics seems to be a primitive field to me, an dominated by nonmathematical people.Likebox (talk) 12:55, 27 February 2010 (UTC)[reply]

Non context-freeness of Swiss German

[edit]

The article in question points out that Swiss German has a non context free construction, which involves the following:

Jan says that (we-1) (Mary-1) (the house-2) helped-1 paint-2

The numbers say which nouns go with which verb. is is not stack based, otherwise it would be:

Jan says that (we-1) (Mary-1) (the house-2) (painting-2) (helped-1)

So agreed, it's not context free. But you need to go several levels deep to make this work. The claim that they make is that you can go on in this pattern in Swiss German with more nouns and more verbs

Jan says that we-1 Mary-1 Martha-2 Sally-3 the house-4 told-1 tell-2 help-3 paint-4

If you find something like this, that would be the clincher. But I have not been able to verify this schema as grammatical in Dutch German, and the gloss provided by the authors for their analogous sentence was insufficient to determine whether it was like the above.

But the construction in question is peculiar to Swiss German, and Dutch, while Chomsky was working with English. It is manifestly obvious that English grammar is context free. I don't know how this article can be used to support the statement that "Chomsky's observations about the non-context freeness of natural language has held up", especially that I don't see any examples analogous to the above in Chomsky, and I don't know of any in English.Likebox (talk) 13:00, 9 February 2010 (UTC)[reply]

Chomsky has always explicitly worked on a universal theory of language, i.e. one that would not be tied to any particular natural language such as English. Rp (talk) 17:02, 22 April 2010 (UTC)[reply]

Non-context free examples

[edit]

The example of an overlapping construction:

[John saw (a car in the ad yesterday] with green headlights)

Is not as overlapping as the Martian

[John walked to (the store] is green)

Which is truly inhuman. The second is a better example, I think, because the first is too natural--- it could be produced. The second shows what a truly non-context-free language would look like.

Wait ...
  • The point of the example was to illustrate your statement that human language is context-free, but it isn't - not completely. Examples should illustrate both the sense in which it is context-free and the fact that it isn't completely.
  • Your example doesn't illustrate that language is context-free. What makes it "inhuman" to you is the appearance of two main verbs without a conjunctive. This is easily fixed, e.g., in Green John walked to the store or John walked to the store all green or John walked to the store and is green. I don't see how your example is any less context-free than the other examples.
  • My example can be criticized, but for a different reason: if we regard yesterday as a postmodifier of the ad, we do obtain a context-free constituent structure.
I'm happy to trade my example for a better one. But I believe the way in which your passage argues for the contextfreeness of natural language is rash and introduces fuzziness into the article (see below). Rp (talk) 15:16, 21 February 2010 (UTC)[reply]
(It is not a good idea to break up comments like this, because it makes the discussion hard for others to follow, but I will respond in the same style).
You misunderstood the Martian sentence: [John walked to (the store] is green) is not saying that John is green, its saying that the store is green. The sentences "overlap" in a way that is not present in any human language. I give this example to explain that human language (and human thinking) almost automatically makes structures that do not overlap, and there is no a-priori reason for this. It's just always true that we chunk up sentences into structures in which each substructure plays a unique role. In the "martian" example, one occurence of "the store" splits to play two roles, as object and subject of two separate verbs. That never happens in human languages. It was difficult to even conceive of the example.Likebox (talk) 22:22, 21 February 2010 (UTC)[reply]
I wouldn't be too sure. In English, we use John walked to the green store to say that, but I can see no reason why your coinstruct would be inhuman - compare John painted the store green. Rp (talk) 23:41, 26 February 2010 (UTC)[reply]
The issue is that it is difficult to even conceive of language with parallel overlapping constructions, where a word plays two different roles at once, as subject of one sentence and as object of another. We want our sentences to have chunks that make grammatical objects of the same types as single words. "John painted the store green" is perfectly normal. "John green painted the store" is ungrammatical, but conceivable. "John painted the store is open green" is inhuman.Likebox (talk) 12:47, 27 February 2010 (UTC)[reply]
What is accusative and infinitive? 2A0A:A541:59A5:0:21BF:4D76:E67E:55E1 (talk) 04:11, 19 August 2023 (UTC)[reply]

This example of an overlapping construction is very good--- I came up with much worse examples. But it never occurs in the New York Times (I don't think), and it has a colloquial flavor (at least to my ears), precisely because the dangling "with green headlights" is not in the proper place. I feel like it's a "John saw a car with green headlights in the ad yesterday", mangled up as someone said it. But I might have been programming a computer too long.

Programming languages are just about as context-free as natural languages. I'm no specialist in the syntax of English and it's risky to reason from a single example. We need examples from the linguistic literature with references. Rp (talk) 15:16, 21 February 2010 (UTC)[reply]
(again--- interspersing makes talk-pages unreadable--- but I will do the same)
The problem is that I have not found the linguistic literature to have any good examples in English. The examples everyone cites are in Swiss German. If you sort out what's going on first by discussion, it will make the literature search much easier when the time comes.Likebox (talk) 22:27, 21 February 2010 (UTC)[reply]

It would be good to have two examples--- a truly Martian language, with tons of overlapping constructions everywhere (it's easy to make one up), and an example in English like the one above. Unfortunately the one above sounds very colloquial and marginally grammatical, Do you know of a similar example in formal writing?Likebox (talk) 05:59, 21 February 2010 (UTC)[reply]

No, but I'll try to find something.
What bothers me about your introduction, and this discussion, is that it blurs the notion of context-free grammars as a technical term. In formal language theory, a language is context-free if it can be described by a context-free grammar, not if it can be described by a context-free grammar plus postprocessing on the parse tree, as is the practice in Chomskian linguistics and computer science. By saying "language is context-free", you mean the latter, inexact sense, not the former, exact one. This introduces a discrepancy between the introduction and the formal definitions that follow. Rp (talk) 15:16, 21 February 2010 (UTC)[reply]
The post-processing on the parse tree is not required for most english sentence syntax, and Chomsky does not introduces it for the purpose of describing phrase strctures. He was describing transformations, which are taking a declarative sentence and turning it into a question, stuff like that.Likebox (talk) 22:27, 21 February 2010 (UTC)[reply]
I know, but my point is that allowing arbitrary transformations on a parse tree makes the resulting descriptive formalism non-context-free, actually Turing-complete, from a formal language point of view. So you can't just say: well, language is context-free, but for these transformations, because you're going to have to come up with a novel justification of why context-free grammars and transformations are "the right thing to do", expressive power isn't going to cut it anymore. Besides, while Chomsky happens to be fluent only in English, perhaps Hebrew and maybe a couple of other languages, he does claim universality for his theories, so dismissing examples in my native tongue (Dutch) or on Latin as "not English" is not the appropriate way to deal with this. In any case, I believe such discussion doesn't really belong in this article, which should explain what a context-free grammar is, and doesn't necessarily have to answer the question whether natural language is context-free. Rp (talk) 23:41, 26 February 2010 (UTC)[reply]
You don't understand what I am saying: the transformations don't describe the grammar at all. They are only there for relating one sentence to another. I am saying English grammar is a straight BNF (other people said this in the 70s too).Likebox (talk) 04:11, 27 February 2010 (UTC)[reply]
That is not what Chomsky says, who does use transformations to describe the syntactic structure of sentences (claiming that the real sentence is a "surface structure" that arises out of a "deep structure" by transformation). Rp (talk) 17:15, 22 April 2010 (UTC)[reply]
Oh, and by the way, I believe the standard example in Dutch is related to chains of verbal complements, as in
  • You shouldn't want to help me wash the dishes
which can be expressed in Dutch as
  • Je zou me niet moeten willen helpen afwassen
but I must be wrong, as I don't see how the Dutch sentence structure is fundamentally different from the English one. Rp (talk) 23:55, 26 February 2010 (UTC)[reply]
The difference is in the order of the verbs. The english:
You (shouldn't want to (help (me (wash (the dishes)))))
has a context free form--- everything is properly nested. Or numbering stuff like a previous section (to see what goes with what verb):
You(1s) shouldn't want(1) to help(2) me(2s) wash(3) the dishes(3o).
Numbering the verbs
You(1s) me(1o) not-want dishes(2o) help(1) wash(2)
Note that unlike the english, there is a crossing of objects and verbs, so that you can't parenthesize it properly. This kind of triviality also happens in Swiss German, and it seems ridiculous to hang such a nontrivial statement as "language isn't context free" on such stupid examples both of the same type, in Swiss German and Dutch (see previous section). I think that the linguistics field must still be in Aristotle mode.Likebox (talk) 12:42, 27 February 2010 (UTC)[reply]
In particular, everything hinges on having the "want" there, which is really the main verb. The "help wash" should be thought of as a compound verb perhaps. You need nontrivial examples with many layers of nesting to figure out what the true pattern is, and human language doesn't naturally generate that.Likebox (talk) 12:49, 27 February 2010 (UTC)[reply]

Ambiguity, Passive Voice: needs improvement

[edit]

Article currently states: "In later work (e.g. Chomsky 1981), the idea of formulating a grammar consisting of explicit rewrite rules was abandoned."

Could someone (who is more familiar than myself with the subject matter) please rewrite this sentence to eliminate the ambiguity and the passive voice. "Was abandoned" ... by who? Chomsky?

Karl gregory jones (talk) 19:02, 9 February 2011 (UTC)[reply]

Examples of possible overlap, and alternative explanations?

[edit]

Article states: "It is hard to find cases of possible overlap and many such cases can be explained in an alternative way that doesn't assume overlap."

1. "Hard to find" sounds un-encyclopedic to my ear. Can someone rewrite this to sound more professional?

2. "Hard to find" means uncommon but extant. Can someone provide an example, or examples plural?

3. Can someone provide examples that have been explained in alternative ways?

Karl gregory jones (talk) 19:11, 9 February 2011 (UTC)[reply]

I wrote it, and I completely agree. If I knew the sources by heart I'd have added them but I'll see if I can find them. Rp (talk) 09:00, 10 February 2011 (UTC)[reply]
I wouldn't say it's hard to find cases of possible overlap, if you still allow alternative explanations. Just look at any sentence with an extra-posed relative clause, for example by replacing some words of your example: “[John met (a woman at a club yesterday] who had brown eyes)”.
Also, I doubt that "Natural human languages rarely allow overlapping". What about Non-English languages? What about spoken language? (“I saw that new car in an ad yesterday, with three wheels only!”) --Zahnradzacken (talk) 00:05, 15 February 2011 (UTC)[reply]

Minor question about an example

[edit]

In the last paragraph of subsection "Proper CFGs" we find the following:

This makes it clear that . The language is context-free, but as shown in Example 4.8, it is not regular.

Should we remove this hard-coded reference to "Example 4.8"? Perhaps we might add an actual reference for this result.

(Sorry for the minor and possibly irrelevant points, I'm a newbie contributor.) — Preceding unsigned comment added by Marcelkcs (talkcontribs) 16:39, 29 May 2011 (UTC)[reply]

Formal Definition of a Context Free Language and the Empty Language

[edit]

I don't think it makes sense to require that there is at least one production rule for the non-terminal S. All that counts is that the transitive closure of the production rules, filtered to terminals is taken, in defining the accepted language L(G). If there is no production rule for S, then we have obviously L(G) = {}, i.e. the accepted language is empty. This is not to be confused with L(G) = {eps}.

But there are also other context free grammars that also lead to empty languages. And there are such that have a rule for the start symbol S. For example the following grammar, has also an empty accepted language:

S --> S

So the requirement that thre is at least one rule for S, does not imply that the accepted language L(G) is non empty, so it doesn't make any sense.

Janburse (talk) 20:09, 20 June 2011 (UTC)[reply]

I agree. Where do you see this requirement? It may be too early in the morning, but I can't find it in the article. Rp (talk) 07:29, 22 June 2011 (UTC)[reply]
I found it and removed the requirement. --Zahnradzacken (talk) 13:53, 23 June 2011 (UTC)[reply]

Notation

[edit]

I'm currently studying CFGs in school; I got rather confused reading this article, when presented with this notation:

S → SS
S → (S)
S → () 

I hadn't seen a CFG presented that way before and I didn't know what it meant. Then later in the article I arrived at

S → bSbb | A A → aA | ε

which is what I'm used to seeing.

After hunting through the article numerous times I finally searched the page for pipes and found "It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Hence the grammar above can be described more tersely as follows:" arbitrarily hidden under the "A regular grammar" section.

There absolutely needs to be a "Notation" section before the "Examples" section that clearly explains both styles of notation to the reader. Not explaining your first notation and switching to another notation midway through with an easily-missed warning is a recipe for confusion for those not fully familiar with the subject. I would write a notation section myself, but I don't feel I grasp the concept and terminology enough to write accurately about it. Some guy (talk) 05:52, 3 March 2013 (UTC)[reply]

Is there anything else you would like to read in a separate notation section? I moved the sentence you quoted up to the section Production rule notation, maybe that's enough for now? --Zahnradzacken (talk) 10:03, 23 March 2013 (UTC)[reply]

lead paragraph

[edit]

The lead (or lede in Wiki-speak?) is terrible, IMO. The Background section does a better job of explaining what this is all about. I am a SW engineer, and remember a little about this from college many years ago, but that lead section did nothing in helping me remember the overall purpose and meaning of a CFG. I can't imagine how it could be expected to help a lay-person. Nerfer (talk) 17:18, 23 March 2020 (UTC)[reply]

I made an attempt to improve the lead, by replacing the former example by a C example. The problem is that there are two different audiences: a user knowing formal grammars in general should be told why these ones here are called context-free, while a user who is new to formal grammars should be told how a grammar works in general. - Jochen Burghardt (talk) 21:32, 23 March 2020 (UTC)[reply]

Misleading commas in examples

[edit]

Until the section section "Well-formed parentheses", commas are used to separate production rules. But from there on, no commas are used!
I think this should be done in an uniform way. — Preceding unsigned comment added by Doubletimer (talkcontribs) 17:55, 27 November 2020 (UTC)[reply]

Examples of languages that are not context free: proof is incorrect

[edit]

I don't have editing experience, so I thought I'd bring this up here and let someone else address it. I buy that the language in this section (balanced but not properly nested strings of parentheses and brackets) is not context-free, but this cannot be shown through the Pumping Lemma. Indeed, any nonempty string in this language will contain either the pair '(' and ')' or the pair '[' and ']' - either of these pairs of substrings will satisfy the conclusion of the Pumping Lemma. The observation that this language contains the sublanguage (^n [^n )^n ]^n is irrelevant. The complexity of a language puts no bounds on the complexity of its subsets (e.g. the regular language (a)* of all strings of a's contains the subset {a^n | n is in your favorite non-recursively-enumerable set of natural numbers}). Mersenne44 (talk) 21:07, 26 September 2022 (UTC)[reply]

I think the standard proof is that L intersect (^* [^* )^* ]^* equals (^n [^m )^n ]^m. By closure under intersection with regular languages, if L were context free (^n [^m )^n ]^m would be context free. But it is not, by pumping. The bigger problem here, though, is that this section needs a published source. If we find one, we can follow whatever proof is used there. —David Eppstein (talk) 22:24, 26 September 2022 (UTC)[reply]
Oh I see - if you use (^n [^m )^n ]^m instead then it does make sense to just say L 'contains' this subfamily, since it witnesses unbounded distance between pumpable infixes. Thanks for clearing that up for me! I don't think there's a good proof that passes through (^n [^n )^n ]^n though, so that bit really should be changed. As far as citations go, might it be better to forego this example and just use L = a^n b^n c^n in this section? It's a pretty canonical example and appears in most published introductions to context-free languages that I've seen, plus the application of the pumping lemma is maybe a little more immediately obvious. Mersenne44 (talk) 00:01, 27 September 2022 (UTC)[reply]
Oh nevermind - I'm just tired. Same thing works with (^n [^n )^n ]^n - it witnesses the same thing as (^n [^m )^n ]^m. Ignore me - sorry! Mersenne44 (talk) 00:14, 27 September 2022 (UTC)[reply]

Avoid calling | the "pipe symbol"

[edit]

Minor nitpick here.

The article says:

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules and can hence be written as . In this case, and are called the first and second alternative, respectively.

My suggestion is to avoid calling | the "pipe symbol", which might cause confusion to people interpreting it as the Bash pipe (https://en.wikipedia.org/wiki/Vertical_bar#Pipe) and think that is somehow fed into .

Since the vertical bar (https://en.wikipedia.org/wiki/Vertical_bar) goes by a lot of other names, I think it would me more appropriate to refer to it as either:

130.192.232.228 (talk) 12:21, 30 September 2022 (UTC)[reply]

 Done: I agree that "pipe symbol" can be understood by Linux users only, and is misleading even to them. I changed both occurrences to "vertical bar". - Jochen Burghardt (talk) 14:27, 30 September 2022 (UTC)[reply]

Should "it is not proper since it includes an ε-production." be _explained_?

[edit]

In the "Examples" section, the aforementioned sentence is confusing to me (I understand a lot of the rest of the page). The ε-production itself is defined in the context of the article earlier (further up) and no mention is made alluding to why there'd be [later] some improper something with it. It is only described what an ε-production is. I can't find any information that should somehow hint at or explain why "it is not proper". Transientdifficulties (talk) 20:10, 6 April 2023 (UTC)[reply]

I agree. Some definitions of CFGs (including Chomsky's) do not allow empty productions, and this should be explained, but it doesn't appear to be standard practice to call $\epsilon$-less grammars "proper". So the right approach is to add a paragraph explaining this and refer to it here. Rp (talk) 16:06, 11 April 2023 (UTC)[reply]

Misleading simplifications of the syntax of the C programming Language

[edit]

Some changes I made were reverted. Can we discuss? I rewrote a bit that I thought gave an oversimplified impression of how C syntax (and other real programming language syntaxes) could be parse-able by a CFG.

1) Toyness The grammar of a toy language is presented as an excerpt from the appendix of K&R 2nd Edition 1988. While the grammar being shown on the diagram could be derived from the grammar given in K&R2 through a number of manipulations and simplifications, reducing, renaming, and rewriting the grammar to be a long way from being an excerpt from the cited work. "Excerpt" from the Latin means picking out, not performing such a extreme transformation. For example an expression can be a statement in C (something for which C is often criticized), whilst some other languages make a clear distinction between statements and expressions (for simplicity and clarity) as does the toy language presented here. I think it should be made clear that this is a toy language, far from the real C programming language.

2) Over-simplification The article gives a too-positive spin on CFGs as a standard formal tool for describing the grammars of programming languages when CFGs do not quite the describe the syntax of real languages, the CFGs that are employed are usually a more permissive "outline" of the language syntax that is augmented with (often informally stated) rules to give a more restricted description of the syntax. Looking at the cited grammar presented in K&R 2nd edition 1988 (Appendix A13) and we can see it is a CFG that accepts more than the syntax of C because K&R define typedef-name as being the same as identifier, making the grammar ambiguous as to which production rule to take to distinguish between a type coercion in a cast-expression and some forms of function call in a postfix-expression. K&Rs note on converting to YACC recommends deleting the typedef-name production, and making typedef-name a terminal symbol, which requires typedef-names to distinguished from other identifiers - this could be done by the parser looking them up in a symbol table because the C89 language requires typedef-names to be declared before their use. Jochen Burghardt (talk · contribs) Given the textual order, compilers could easily (and usually did) do that on the fly in a single pass like K&R's YACC implementation suggestion, rather than constructing one of the possible parse-trees and fixing it later, though I do see that's just an efficient implementation detail, and a compiler could, as you say, be implement by constructing one tree, then apply additional syntactic rules outside of the CFG. [Historical note C89 did not require the definition of identifiers before use; that was required by C99]. CorsacFoxWiki (talk) 13:46, 15 October 2024 (UTC)[reply]

1) The caption starts with "Simplified" to indicate some manipulation have been made (for the sake of brevity). While K&R used alternatives on the RHS of a production rule, the excerpt transformed them to Chomsky's original notation (to avoid the use of ellipses to indicate omitted alternatives). Only alternatives that are actually used in the derivation are shown (imo, this is covered by "excerpt", too). If a majority of editors suggests that "simplified" should be replaced by some stronger adjective, I'd be fine with it. However, I'd strongly prefer to mention the relation to the C syntax. The grammar does not describe just an arbitrary toy language; the terminal symbols are intended to have the meaning as in C.
2) Classical compiler-construction textbooks (such as William M. Waite and Gerhard Goos (1984). Compiler Construction. Texts and Monographs in Computer Science. Heidelberg: Springer. ISBN 3-540-90821-8.) distinguish syntax analysis and semantic analysis. Syntax analysis handles only those aspects that can be expressed in the framework of a context-free grammar; everything else is postponed to semantic analysis. For example the program "main(){return a+a;}" is considered syntactially correct, but has a referencing error. Informally stated rules (like "declare variables before use") have to be checked during semantic analysis, after the parser has built up a syntax tree. In that sense, both C and Algol do have a context-free grammar (given by K&R and Backus, respectively). Admittedly, deriving a parser from the former one is difficult (or maybe even impossible, depending on the parser generator framework), due to the typedef-name:identifier issue described in lexer hack (thanks for your hint below!).
The purpose of the image is to illustrate the concepts introduced in the lead by an easy-to-understand example in a well-known imperative programming language. It seems to me that C is the best choice for that since it occupies the least horizontal space (compare e.g. "{}" in C to "begin end" in Algol or Pascal). Therefore, I suggest to keep the image. Jochen Burghardt (talk) 17:01, 16 October 2024 (UTC)[reply]
I agree emphatically that the image presents an excellent brief, beautifully simple, easy-to-understand example.
Making an admirable distinction between Expr and Stmt, the grammar reminds me most closely of the Wirth grammars for Pascal and Modula(2), with assignment ':=' replaced by '=', and, as you say, 'begin' replaced by '{' and 'end' replaced by '}'.
There is no need justify that beautiful simplicity by fudging a citing to the ugliness of K&R2, and it is misleading to do so.
The grammar is describes as a "simplified excerpt" from the grammar of the C programming language given in K&R2; that is very misleading.
I'm a bit of word nerd, and a stickler for correct usage. Forgive my pedantry, but consulting the micrographical text of my Compact version of the 20 volume OED, the word "excerpt" is defined as:
1. A passage taken out of a printed book or manuscript; an extract, quotation, selection.
2. An article from the 'Transactions' of a learned society or from a periodical, printed off separately for private circulation.
3. In etymological sense: A thing picked out. rare.
It's from the Latin ex = out, carpere = to pluck.
In none of those senses can the grammar of the example be honestly described as an "excerpt", or even a "simplified excerpt" from K&R2. The grammar presented in the example has been transformed too far from the original to be described as such, even if the transformation had done a correctly... That is the trouble with transformation, if one is not very careful then some things might get changed in the translation...
In the example grammar's definition of a compound-statement <Stmt> -> '{' <StmtList> '}' the statement list is not optional, so the example grammar differs from C by requiring one or more statements between the '{' and '}', whilst C allows zero statements between the '{' and the '}'. So if this really were a transformation of the C grammar of K&R2 then a mistake has been made in the transformation! Though to me it looks like a grammar that has been constructed for the example, and then a citation made to K&R2 as some sort of post hoc justification or appeal to real-world authenticity, which I see as unnecessary for the purpose of a simple example, and misleading with regard to the complexity of the C syntax, which might give students a false idea that C syntax is simple or representable by a CFG alone.
In the C grammar <Stmt> -> <Expr>? ';' | ... i.e. an <Expr> may be a <Stmt> which is another example of the complexity of the C syntax.
And because <Expr> is option in <Stmt> -> <Expr>? ';' | ... then ';' on its own can be a statement! Ugliness upon ugliness.
The example actually does a better job (as do Pascal, Modula, Algol grammars) at separating the concepts of statement and expression; better than the C grammar does!
As you correctly say, the terminal symbols do have the same meaning as in C. I'd point out that they are readily understood without reference to C.
The notation for identifiers ('x','y'), keywords ('if'), brackets ('(',')'), terminator/delimiter (';'), relational operator ('>') are common to almost all programming languages. The use of assignment operator ('=') is common, (I'd prefer ':=' but that is a battle that has been lost by those of us who prefer clarity over convenience in programming language design) as are '{' and '}' for the beginning and end of a block containing a list of statements, their purpose being made quite clear by the production <Stmt> -> '{' <StmtList> '}'
That's all I have time for now. I would like to discuss further, and share my thoughts on the point you raised about what has traditionally been called 'semantic' analysis in compilers which is often a high-falutin' name for context sensitive analysis of aspects of syntax that cannot be expressed in CFGs, because hey we cannot bear to face the limitations of CFGs directly and have to make the syntactic problems they cannot solve sound much more profound and difficult when we talk about 'semantics' rather than 'syntax' :-). I'm reminded of a line in Bertand Meyer's Eiffel book where something has been described to him as a "deep semantic problem" and he cuts through the waffle by pointing out it is neither deep nor semantic. I'd also like to try to draw out the distinction between syntactic analysis and what I see as the proper usage of the word 'semantic' in the semantics of programming languages in formalizing the meanings of programming languages. CorsacFoxWiki (talk) 16:07, 18 October 2024 (UTC)[reply]
To make it clear what the points of difference are, I've done a transformation that is faithful to the K&R2 C grammar restricted to the terminals ';' 'if' '(' ')' '{' '}' '=' '<' '+' 'x' 'y' '0' '1' '9'
Which is as follows:
stmt -> expr? ';'
stmt -> '{' stmt_list? '}'
stmt -> 'if' '(' expr ')' stmt
stmt_list -> stmt
stmt_list -> stmt_list stmt
expr -> expr_assign
expr -> expr_rel
expr_assign -> expr_primary '=' expr
expr_rel -> expr_rel '<' expr_add
expr_add -> expr_add '+' expr_primary
expr_primary -> id
expr_primary -> num
id -> x
id -> y
num -> '0'
num -> '1'
num -> '9'
This (if I've done the translation correctly ;-)) exactly generates and parses what the K&R2 grammar does when restricted to those terminal symbols.
I would not go as far as to call this a "simplified excerpt"; I'd use a phrase like "faithful restriction". If I were to go more off-piste in deviating form K&R2 I might use a phrase like "inspired by" but would have qualms about the grammar becoming an amalgam of K&R2 an ideas from elsewhere (ideas that K&R and their followers might not agree with).
Note the differences between this faithful grammar and that in the current example:
1) optional expression list in compound statement;
2) a hierarchy of expressions that syntactically defines the precedence and associativity of operators, which means expressions are unambiguously parsed, unlike the ambiguous definition of expr in current example in the article, which does not take account of operator precedence and associativity.
3) expr? ';' as a form of stmt;
4) K&R2's (overly) permissive syntax for l-values in assignment; that's because the K&R2 grammar has unary-expression on LHS of assignment-expression; primary-expression can be a unary-expression; and constant can be a primary-expression;
Problems caused 3) and 4) have to be resolved outside the grammar, and I note the current example grammar has made "improvements" upon the K&R2 grammar so that separation of statement and expression are made admirably clear, and the syntax of l-values is more precisely defined. Of course the idea (we seem to share) of "improvement" is subjective, as can been seen in the pervasive influence of the C grammar on more modern languages (just checked the Go grammar which has ExpressionStmt -> Expression, and allows Expression as an l-value; checking the C# grammar I see a more tightly defined statement_expression, but still unary_expression for l-value in assignment).
In point 2) I regard the K&R2 grammar to be superior to the ambiguity of the current example grammar. Personally when hand-coding a parser I use a table of operator precedence & associativity similar to that K&R2 has on page 53, combined with a variant of Pratt Parsing the idea of which I first encountered in O'Keefe's parser for DEC10 Prolog.
I had not anticipated the discussion of the closeness of the example to the C syntax becoming so involved, but I hope what I have said has been useful in clarifying. CorsacFoxWiki (talk) 11:05, 21 October 2024 (UTC)[reply]
Just two remarks for now:
The notation "N?" with N a nonterminal is a shorthand for "N | eps", if eps denotes the empty string. Therefore, omitting the "?" corresponds to restricting the set of alternatives. For example "Stmt -> { StmtList? }" is shorthand for "Stmt -> { } | { StmtList }"; I omitted the first alternative, again for sake of brevity.
As for the expression grammar, you have a point. However, K+R's version employs about 10 different nonterminals to encode the operator priorities, and I didn't want to show all of them, or at least sufficiently many of them to handle >, +, and =. I didn't check that in detail, but I'm afraid this would blow up the grammar excerpt such that it doesn't fit in the image any longer. - Jochen Burghardt (talk) 19:57, 21 October 2024 (UTC)[reply]
I made a few mistakes in my grammar (see I wan't joking when I said "If I've done the translation correctly ;-)")
I missed two productions
expr_rel -> expr_add
expr_add -> expr_primary
which allow the LHSs of the left-recursive exprs to descend through the heirarchy.
And I wrote '<' for the rel_op rather than '>' of your example. And I did not quote x and y.
I am not suggesting representing all of the ten or so levels of precedence in the K&R2 grammar. My idea of restriction with fidelity to K&R2 was to take their definition of stmt and then omit grammar rules that refer (directly or indirectly) to terminals other than those we are interested in ';' 'if' '(' ')' '{' '}' '=' '>' '+' 'x' 'y' '0' '1' '9'
Many of the resulting productions are simple aliasing A -> B (with no other alternatives for A), so every occurance of A can replaced with B and remove the redundant B -> B ( which is what A -> B becomes after replacing A with B) i.e. like substitution and elimination in term re-writing. These are equivalence preserving term re-writes. That's how I arrived at the grammar (see the corrected grammar following the example below), which is 21 productions compared to the current 15.
What are the useability contraints on image and font size of the example?
Or perhaps do re-writes in parallel? It can be confusing to do different re-writes in one line, but if we confine ourselves to similar re-writes for each "row" then it could be something like this (original example with an example of disambiuating unary '+' as well as precedence of binary ops):
'if''(''x''>''9'')''{''x''=''0'';''y''=''+' 'y''+''1'';''}'
'if''('id '>'num')''{'id '='num';'id '=' '+' id '+'num';''}'
'if''('exp_pri'>'exp_pri')''{'exp_pri'='exp_pri';'exp_pri'=' '+' exp_pri'+'exp_pri';''}'
'if''('exp_unary'>'exp_unary')''{'exp_unary'='exp_unary';'exp_unary'=' exp_unary'+'exp_unary';''}'
'if''('exp_add'>'exp_add')''{'exp_unary'='exp_add';'exp_unary'='exp_add'+'exp_unary';''}'
'if''('exp_rel'>'exp_add')''{'exp_unary'='exp_add';'exp_unary'='exp_add';''}'
'if''('exp_rel')''{'exp_unary'='exp_rel';'exp_unary'='exp_rel';''}'
'if''('exp')''{'exp_unary'='exp';'exp_unary'='exp';''}'
'if''('exp')''{'exp_assign';'exp_assign';''}'
'if''('exp')''{'exp';'exp';''}'
'if''('exp')''{'stmtstmt'}'
'if''('exp')''{'stmt_liststmt'}'
'if''('exp')''{'stmt_list'}'
'if''('exp')''{'stmt'}'
stmt


For the corrected grammar
stmt -> exp? ';'
stmt -> '{' stmt_list? '}'
stmt -> 'if' '(' exp ')' stmt
stmt_list -> stmt
stmt_list -> stmt_list stmt
exp -> exp_assign
exp -> exp_rel
exp_assign -> exp_unary '=' exp
exp_rel -> exp_rel '<' exp_add
exp_rel -> exp_add
exp_add -> exp_add '+' exp_unary
exp_add -> exp_unary
exp_unary -> '+' exp_unary
exp_unary -> exp_pri
exp_pri -> id
exp_pri -> num
id -> 'x'
id -> 'y'
num -> '0'
num -> '1'
num -> '9' CorsacFoxWiki (talk) 11:12, 22 October 2024 (UTC)[reply]
I have uploaded a modified version of the image, following your above ideas, see File:C grammar stmt2 svg.svg. To keep the width reasonable, I used one-letter prefixes to "Expr" ("R" for relational, "A" for additive, "P" for primary). To save vertical space, I followed (in the expression part of the grammar) your idea of parallelizing derivations. I added some more rules to the grammar (to match its height with that of the derivation), viz. for the remaining relational and additive operators. As for "<" vs. ">", I prefer the latter, since it eases imagining the purpose of the statement (viz. maintaining the least significant decimal digit of a counter in x, and the higher digits in y. - Jochen Burghardt (talk) 12:51, 28 October 2024 (UTC)[reply]
Ooooh fancy! One small point: it should be RExpr -> RExpr '>' AExpr to make '>' left associative in the grammar, (and the same goes for the other relational ops) and hence the parsing X > Y > Z is unambiguously (X > Y) > Z, though in C that has the confusing interpretation of comparing the "boolean" result of (X > Y) with Z, whatever the type of Z might be. [Aside: Prolog disallows the expression A > B > C because rel-ops have fixity xfx (x is a strictly lower term) i.e. RExpr -> AExpr '>' AExpr in Prolog; I've seen (some) such expressions treated nicely in ACSL annotations, following the mathematical convention but that does not seem to be dealt with in the ACSL grammar I saw] . CorsacFoxWiki (talk) 17:53, 31 October 2024 (UTC)[reply]
 Done Oops - you are right. (I may indeed have been confused by the Prolog (and math) xfx convention.) I fixed it meanwhile. Now OK to replace in the article? - Jochen Burghardt (talk) 20:48, 3 November 2024 (UTC)[reply]
Yes. Very nice. I'd prefer it if the language of the caption, around the semblance to the C grammar, were softened, but I'm not going to fight over it.
My preference for the Prolog style disallowing X < Y < Z is that that simplification (which is also in the Wirth grammars for Pascal etc) prevents a whole class of mistakes (all be it with a tiny loss of expressiveness). In a similar way the definition of l-values in Pascal is preferable IMO (e.g. the C syntax is wrong to allow literal constant l-values, and it is of no use for the C syntax to allow me to say assign to the value returned by a function). I think all this goes back to BCPL and CPL (as they explored what was necessary/desirable Strachey was even asking whether conditional expressions could be l-values i.e. e1 < e2 ? x : y := e3 might assign e3 to either x or y! Fortunately they decided against).
I think this discussion has been very productive. Thank you. It has certainly clarified my thinking. Looking at published CFGs for various languages reveals where they are ambiguous because of CFG's inability to capture context-sensitive syntactic information e.g. in declaration/usage, or where in documentation a simple concise but ambiguous grammar is sometimes preferred, disambiguated by additional information e.g. an operator precedence/associativity table, rather than elaborating a hierarchy of rules, as we & K&R have done. Or where indentation (whitespace) is significant (e.g. Python, Haskell), but for simplicity and brevity the grammar is presented without tedious concern for whitespace, being clarified by explanations of how indentation indicates structure. I'd like to elaborate on those strengths and weaknesses in the section about usage in the representation of programming languages (and the boundary between the syntactic and semantic). But that is quite a different can of worms. CorsacFoxWiki (talk) 14:14, 10 November 2024 (UTC)[reply]

I found a simpler example of ambiguity caused by typedef-name/identifier ambiguity in parsing C on a page about the Lexer hack. — Preceding unsigned comment added by CorsacFoxWiki (talkcontribs) 15:53, 16 October 2024 (UTC)[reply]