I am currently working my way through Parsing Techniques: A Practical Guide by Ceriel J.H. Jacobs and Dick Grune which is available as a PDF on Grune’s site. As I continued in the book I found myself having a hard time following classifications and grammar levels when they were used interchangeably. I am using this blog post to hash it all out for myself so I can practice keeping the order and meaning straight by explaining these concepts to my good friend, the internet.
Let’s start at the very beginning (a very good place to start): what is the Chomsky Hierarchy? Well it’s a hierarchy of formal grammars described by Noam Chomsky. Hm, that wasn’t exactly the very beginning, was it?
What most people think of as a “normal language” (something you use to communicate with other humans, like English) is called a natural language, whereas a programming language (something you use to communicate with computers) is a formal language.
A formal grammar describes a formal language. Just like in natural language, a grammar isn’t going to tell you what a sentence means but it will tell you how to form that sentence. In our formal world, a sentence is more like a formula or a statement. In a language where we can have any amount of a and any amount of b , but every a has to always be before every b, aaaabbb would be a valid sentence.
Now that we’re back at the very beginning: what is the Chomsky Hierarchy? This is a classification of how strict a grammar is, with 0 being the least strict (but hardest to parse) and 3 being the most strict (but easiest to parse!). Before diving in I want to get some definitions out of the way for the terms that we’ll be using.
- Rule: A rule takes the form of
A -> awhere A and a are both something (each step of the hierarchy has its own definition). But this is basically a production step — given this input, provide this output.
- Symbol: This can be anything! They are values used in some sort of sequence of values. We’ll get into two types of symbols soon but for now think of them as a placeholder value and a content value. In our rule, both
- Sentence: A sentence contains one or more symbol. A “valid” sentence is a sentence that can be produced by the given grammar and a fully produced/final sentence has used the rules provided to replace all placeholder values with real content.
- Left-hand side: The left-hand side (you’ll see LHS sometimes but I want to try to avoid acronyms) is the
Ain my above example — it is the left-hand side of the rule.
- Right-hand side: Bet you can guess what this one is. Yep, the right-hand side of the rule! You did great. That would be
ain our example.
- Non-Terminal Symbol: A non-terminal, or non-terminal symbol, is a symbol that will not be allowed in a final sentence —this is our placeholder value from before! You can also think of it as a variable. Hopefully you will be able to turn it into a terminal eventually using the rules in the grammar, otherwise it is a sentence that doesn’t exist in the language.
Ais our non-terminal in our rule.
- Terminal Symbol: A terminal, or terminal symbol, is a legal symbol in the language — this is our content symbol. Like the non-terminal is our variable, the terminal symbol is the value. This symbol will no longer be processed further since it is in its final form. A terminal is usually lower case, so in our example it is
Type 0: Unrestricted Grammars
A grammar rule consists of a left hand side and a right hand side. Something like
A -> a. An unrestricted grammar means anything can be on the left hand side and anything can be on the right hand side. Go nuts (well, as nuts as a Turing machine will let you go).
Type 1 grammars are where things get more restrictive (and more fun!). These grammars may have rules that transform a non-zero number of symbols to possibly zero number of symbols. These come in two flavors: original and flavortown.
Monotonic grammars contain no rules where the left-hand side has more symbols than the right-hand side. For an example, I just try to think of nice things:
FavoriteThing and FavoriteThing -> raindrops on roses and whiskers on kittens. Our left-hand side has 3 symbols and our right-hand side has 7 symbols. This grammar is Monotonic! A rule that has 16 symbols on the left hand side going on 17 symbols on the right hand side would also be monotonic.
Context-sensitive seems like the prominent definition of Type 1 grammars. A context-sensitive grammar only consists of rules where only one symbol on the left-hand side is replaced by another symbol on the right-hand side. The note here is that the left-hand side can still be any number of symbols but the grammar must remain monotonic (i.e. number of symbols on the left-hand side is less than the number of symbols on the right-hand side.
Below we’ve added two possible outcomes for the rule
FavoriteThing as well as how to parse after you have selected a
|used in the upcoming grammar examples, it just means that for the given left hand side, you have options!
FavoriteThing -> Adjective Adjective Noun | Noun on Noun Noun on -> raindrops on | whiskers on Adjective -> bright | copper | warm | woolen bright Adjective -> bright copper warm Adjective -> warm woolen raindrops on Noun -> raindrops on roses whiskers on Noun -> whiskers on kittens bright copper Noun -> bright copper kettles warm woolen Noun -> warm woolen mittens
In this grammar we have defined, only one non-terminal (e.g.
Noun) is replaced by a terminal (e.g.
kittens) at a time and the left hand side is always equal to or less than the number of symbols on the right. Our example for monotonic grammars above isn’t context-sensitive because it is replacing multiple words at once. I like this example because the term “context sensitive” makes a lot of sense — in the context of
raindrops on you can only finish that with
roses and not
Keep your eye out for Sound of Grammars (Reprise) where we’ll explore Type 2 and Type 3 grammars!