Phrase and Grammar structure in Natural Language


"Artificial intelligence" (AI) is a branch of computer science that tries to give computers the ability to comprehend spoken and written words similar to human beings which is a field of "natural language processing" (NLP).

Computational linguistics combines a variety of technologies, including deep learning, machine learning, and statistics. Together, these technologies enable computers to process human language as text or audio data and to "understand" its full meaning, including the speaker's or writer's intent and sentiment.

Why is it Important to use Grammar Structure in NLP?

Communication is the act of sharing information through signals that are drawn from a common system of signs. Communication helps individuals succeed in a world where not everything is visible because it allows them to gain knowledge from others' observations and inferences. Since humans communicate more than any other species, computer agents will need to learn language in order to be effective. To fully understand a conversation, more advanced models than basic spam classification techniques are required. These models involve analyzing the grammatical structure of sentences, adding in the meaning behind the words, and then applying this to tasks such as translating text or recognizing speech.

One such model used in many natural language processing applications, including speech recognition, machine translation, and predictive text input, is the N-gram model. An N-gram model forecasts the word that is most likely to come after a sequence of N-1 words. It is a probabilistic model that has been trained on a text corpus.

To build an N-gram model, we count how often word sequences appear in text and estimate probabilities from those counts. These models have limitations, however, so we improve them with techniques like smoothing, interpolation, and backoff. By observing that "red car" is more common than "car red," and other similar examples, we can deduce that in English, adjectives usually come before nouns. Although there are exceptions to this rule, such as the postpositive adjective in "attorney general," knowing the part of speech is a helpful generalization. This understanding becomes even more useful when we combine different parts of speech into phrases such as noun phrases or verb phrases, and then combine those phrases into sentence structures represented by trees with nested phrases marked by their respective categories.
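Counting word sequences and estimating probabilities can be sketched as follows. This is a minimal bigram (N = 2) model with add-one (Laplace) smoothing; the tiny corpus and the function name `bigram_prob` are illustrative, not from any particular library.

```python
from collections import Counter

# Tiny illustrative corpus; a real model would be trained on a large text collection.
corpus = "the red car stops the red light the car stops".split()

unigrams = Counter(corpus)                      # counts of single words
bigrams = Counter(zip(corpus, corpus[1:]))      # counts of adjacent word pairs
vocab = len(unigrams)                           # vocabulary size for smoothing

def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

# "red car" appears in the corpus, "red stops" does not,
# so the model prefers "car" after "red".
assert bigram_prob("red", "car") > bigram_prob("red", "stops")
```

Smoothing ensures that unseen pairs such as "red stops" still receive a small non-zero probability instead of zero.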

Types of Grammatical Formalisms

Based purely on their rewrite rules, Chomsky (1957) distinguishes four classes of grammatical formalisms. The classes form a hierarchy: each class can describe every language that a less powerful class can describe, plus some additional languages. Grammatical formalisms are thus classified by their generative power, that is, by the set of languages they can represent. The following are the four classes identified by Chomsky −

  • Recursively enumerable grammar

  • Context-free grammar

  • Regular grammar

  • Context-sensitive grammar

Recursively Enumerable Grammar

Recursively enumerable grammars place no restrictions on their rewrite rules. Both the left and right sides of a rule, as in AB → CD, may contain any number of terminal and nonterminal symbols. These grammars have expressive power on par with Turing machines.

Example

Here is an example of a recursively enumerable grammar −

S → aSb | ε

S serves as the start symbol in this grammar, "a" and "b" are terminal symbols, and ε denotes the empty string. The grammar produces all strings consisting of some number of "a" symbols followed by the same number of "b" symbols, for instance "ab", "aabb", "aaabbb", and so forth.

It should be noted that this particular grammar is in fact context-free (every context-free grammar is also recursively enumerable), so an algorithm can decide whether a given string belongs to its language. For recursively enumerable grammars in general, however, no algorithm is guaranteed to decide membership in a finite amount of time. The recursion in the rule, which allows "S" to be replaced by "aSb" with the non-terminal "S" reappearing on the right side, is what lets this finite grammar generate an infinite number of strings.
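Since the language generated by S → aSb | ε is exactly the set of strings aⁿbⁿ, membership can be checked directly. A minimal checker (the function name is ours):

```python
def in_anbn(s: str) -> bool:
    """Decide membership in { a^n b^n : n >= 0 }, the language of S -> aSb | epsilon."""
    n = len(s) // 2
    # The string must have even length and be n a's followed by n b's.
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

assert in_anbn("")        # n = 0: the empty string, from S -> epsilon
assert in_anbn("aabb")    # n = 2
assert not in_anbn("aab")
assert not in_anbn("abab")
```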

Context Free Grammar

A context-free grammar (CFG) is a type of formal grammar that specifies a language's structure in terms of its components, known as symbols or non-terminals, together with rules that specify how those symbols can be combined into acceptable sentences or phrases. In a CFG, the left side of each rule is a single nonterminal symbol, which may be rewritten as the right-hand side in any context. CFGs are ubiquitous for natural-language and programming-language grammars, even though it is now widely acknowledged that at least some natural languages contain constructs that are not context-free (Pullum, 1991). Context-free grammars can represent the language aⁿbⁿ, but not aⁿbⁿcⁿ.

Example

Here is an example of a context-free grammar that generates a simple language of arithmetic expressions −

S → A
A → A + M | M
M → M * K | K
K → ( A ) | num

The non-terminals A, M, and K in this grammar stand for expressions, terms, and factors, respectively. The terminal symbols +, *, (, and ) stand for arithmetic operators and brackets, while num denotes a numeric value. These symbols can be combined into legitimate expressions in accordance with the production rules. For instance, the rule M → M * K allows a term to be generated by combining a smaller term with a factor using the * operator, and the rule A → A + M allows an expression to be formed by combining a smaller expression and a term using the + operator.

Using this grammar, we can generate expressions like "2 + 3 * 4", "(5 + 6) * 7", and "8".
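A recognizer for this grammar can be sketched as a recursive-descent parser. Note that the rules A → A + M and M → M * K are left-recursive, which a naive recursive-descent parser cannot handle directly, so the sketch below uses the equivalent iterative forms A → M ('+' M)* and M → K ('*' K)*; all function names are ours.

```python
import re

def tokenize(s):
    """Split an input string into numbers, operators, and brackets."""
    return re.findall(r"\d+|[+*()]", s)

def parse_expr(tokens, i=0):
    # A -> M ('+' M)*   (iterative form of A -> A + M | M)
    i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        i = parse_term(tokens, i + 1)
    return i

def parse_term(tokens, i):
    # M -> K ('*' K)*   (iterative form of M -> M * K | K)
    i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] == "*":
        i = parse_factor(tokens, i + 1)
    return i

def parse_factor(tokens, i):
    # K -> '(' A ')' | num
    if i < len(tokens) and tokens[i] == "(":
        i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("expected ')'")
        return i + 1
    if i < len(tokens) and tokens[i].isdigit():
        return i + 1
    raise SyntaxError("expected number or '('")

def accepts(s):
    """Return True if the whole string is a valid expression of the grammar."""
    tokens = tokenize(s)
    try:
        return parse_expr(tokens) == len(tokens)
    except SyntaxError:
        return False

assert accepts("2 + 3 * 4")
assert accepts("(5 + 6) * 7")
assert not accepts("2 + + 3")
```

Each non-terminal of the grammar corresponds to one parsing function, which is the defining pattern of recursive descent.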

Regular Grammar

A regular grammar is a formal grammar that produces a regular language, meaning any language that can be expressed with a regular expression or, equivalently, recognized by a deterministic or non-deterministic finite automaton. Every rule has a single non-terminal on the left and, on the right, a terminal symbol optionally followed by a non-terminal. Regular grammars are exactly as powerful as finite-state machines. They are not suited for programming languages, because they cannot express features like balanced opening and closing parentheses (a subset of the aⁿbⁿ language). The closest they come is expressing a*b*, a sequence of any number of a's followed by any number of b's.

Example

Here is an example −

S -> 0K | 1M | ε
K -> 0K | 1K | ε
M -> 0M | 1M | ε

The start symbol in this grammar is S, and the empty string is denoted by ε. The grammar produces strings of 0s and 1s of any length, including the empty string; for instance, it can produce "0110", "101", and "0010". The production rules permit repetition of 0s and 1s, as well as the choice to stop at any point. This grammar can be described by a finite automaton with three states, one for each non-terminal symbol.
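Because every regular grammar has an equivalent regular expression, the grammar above, which generates exactly the binary strings, can be matched with the pattern [01]*. A minimal sketch:

```python
import re

# [01]* is an equivalent regular expression for the grammar above:
# zero or more symbols, each of them 0 or 1.
binary = re.compile(r"[01]*")

def accepts(s):
    """Return True if s is in the language of the regular grammar."""
    return binary.fullmatch(s) is not None

assert accepts("")        # the empty string, via S -> epsilon
assert accepts("0110")
assert accepts("101")
assert not accepts("012") # '2' is not a terminal of the grammar
```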

Context Sensitive Grammar

A context-sensitive grammar is a type of formal grammar whose production rules have the form αAβ → αγβ, where A is a non-terminal symbol, α and β are arbitrary (possibly empty) strings of terminal and non-terminal symbols, and γ is a non-empty string of symbols. The condition is that the right-hand side of each rule must be at least as long as the left-hand side. This means that the non-terminal A can be replaced by γ only in the precise context given by α and β.

Example

Here's an example of a context-sensitive grammar rule −

SAB → SBB

According to this rule, the non-terminal A can be replaced by B only when it appears between "S" and "B". The string "SAB" can thus be rewritten as "SBB", and "SABB" as "SBBB"; a string like "SA" cannot be rewritten, because A is not followed by the required context "B".

Up until the 1980s, linguists focused on context-free and context-sensitive languages. Since then, driven by the need to ingest and learn from gigabytes or terabytes of internet text quickly, even at the cost of a less in-depth analysis, there has been a resurgence of interest in regular grammars.

Conclusion

An elementary English grammar, known as E0, has been established for communicating with Wumpus world agents; it can be enhanced to look more like real English. There is no flawless grammar of English, though, as different people have different ideas about what constitutes correct English.

Updated on: 07-Aug-2023
