updated to 22.12

git-svn-id: https://svn.eiffel.com/eiffel-org/trunk@2364 abb3cda0-5349-4a8f-a601-0c33ac3a8c38
This commit is contained in:
eifops
2022-12-22 13:24:07 +00:00
parent 3d4c795faf
commit ee2831d52f
2956 changed files with 63295 additions and 4 deletions

[[Property:title|EiffelLex Class Reference]]
[[Property:weight|1]]
[[Property:uuid|51198de0-9cad-6b47-e351-6e0de86942ce]]
==View the [[ref:libraries/lex/reference/index|EiffelLex Class Reference]]==

[[Property:title|eiffel_scan console input]]
[[Property:weight|2]]
[[Property:uuid|ce4da828-6772-e1c3-0917-82f6669cccf9]]
<code>
-- Example of a lexical analyzer based on the Eiffel syntax.
-- The analyzer itself is found in the file ``eiffel_lex'', which
-- is created according to the file ``eiffel_token'' if not
-- previously built and stored.
class EIFFEL_SCAN
inherit
SCANNING
rename
make as scanning_make
end;
ARGUMENTS
undefine
copy, consistent, is_equal, setup
end
create
make
feature
make
-- Create a lexical analyser for Eiffel if none,
-- then use it to analyze the file of name
-- `file_name'.
local
file_name: STRING;
do
if argument_count < 1 then
io.error.putstring ("Usage: eiffel_scan eiffel_class_file.e%N")
else
file_name := argument (1);
scanning_make;
build ("eiffel_lex", "eiffel_regular");
io.putstring ("Scanning file `");
io.putstring (file_name);
io.putstring ("'.%N");
analyze (file_name)
end
end -- make
end -- class EIFFEL_SCAN
</code>

[[Property:title|eiffel_scan console output]]
[[Property:weight|3]]
[[Property:uuid|a233d1b0-b964-ba44-7b24-7204d9fa6932]]
<code>
Scanning file `eiffel_scan.e'.
--------------- LEXICAL ANALYSIS: ----
Token type 11: -- Example of a lexical analyzer based on the Eiffel syntax.
-- The analyzer itself is found in the file ``eiffel_lex'', which
-- is created according to the file ``eiffel_token'' if not
-- previously built and stored.
Keyword: class Code: 1819086195
Token type 32: EIFFEL_SCAN
Keyword: inherit Code: 1080299636
Token type 32: SCANNING
Keyword: rename Code: 2076787557
Token type 32: make
Keyword: as Code: 24947
Token type 32: scanning_make
Keyword: end Code: 6647396
Token type 15: ;
Token type 32: ARGUMENTS
Keyword: undefine Code: 1472863845
Token type 32: copy
Token type 17: ,
Token type 32: consistent
Token type 17: ,
Token type 32: is_equal
Token type 17: ,
Token type 32: setup
Keyword: end Code: 6647396
Token type 32: create
Token type 32: make
Keyword: feature Code: 1951938661
Token type 32: make
Keyword: is Code: 26995
Token type 11: -- Create a lexical analyser for Eiffel if none,
-- then use it to analyze the file of name
Token type 11: -- `file_name'.
Keyword: local Code: 1869613420
Token type 32: file_name
Token type 16: :
Token type 32: STRING
Token type 15: ;
Keyword: do Code: 25711
Keyword: if Code: 26982
Token type 32: argument_count
Token type 10: <
Token type 2: 1
Keyword: then Code: 1952998766
Token type 32: io
Token type 13: .
Token type 32: error
Token type 13: .
Token type 32: putstring
Token type 20: (
Token type 3: "Usage: eiffel_scan eiffel_class_file.e%N"
Token type 21: )
Keyword: else Code: 1701606245
Token type 32: file_name
Token type 18: :=
Token type 32: argument
Token type 20: (
Token type 2: 1
Token type 21: )
Token type 15: ;
Token type 32: scanning_make
Token type 15: ;
Token type 32: build
Token type 20: (
Token type 3: "eiffel_lex"
Token type 17: ,
Token type 3: "eiffel_regular"
Token type 21: )
Token type 15: ;
Token type 32: io
Token type 13: .
Token type 32: putstring
Token type 20: (
Token type 3: "Scanning file `"
Token type 21: )
Token type 15: ;
Token type 32: io
Token type 13: .
Token type 32: putstring
Token type 20: (
Token type 32: file_name
Token type 21: )
Token type 15: ;
Token type 32: io
Token type 13: .
Token type 32: putstring
Token type 20: (
Token type 3: "'.%N"
Token type 21: )
Token type 15: ;
Token type 32: analyze
Token type 20: (
Token type 32: file_name
Token type 21: )
Keyword: end Code: 6647396
Keyword: end Code: 6647396
Token type 11: -- make
Keyword: end Code: 6647396
Token type 11: -- class EIFFEL_SCAN
Token type -1:
</code>

[[Property:title|EIFFEL_SCAN Text]]
[[Property:weight|1]]
[[Property:uuid|092bd183-2fc4-ae65-02b9-d66933492a50]]
<code>
class
EIFFEL_SCAN
inherit
SCANNING
rename
make as scanning_make
end
ARGUMENTS
undefine
copy,
consistent,
is_equal,
setup
end
create
make
feature
make
-- Create a lexical analyser for Eiffel if none,
-- then use it to analyze the file of name
-- file_name.
local
file_name: STRING
do
if argument_count < 1 then
io.error.putstring ("Usage: eiffel_scan eiffel_class_file.e%N")
else
file_name := argument (1)
scanning_make
build ("eiffel_lex", "eiffel_regular")
io.putstring ("Scanning file `")
io.putstring (file_name)
io.putstring ("'.%N")
analyze (file_name)
end
end
end -- class EIFFEL_SCAN
</code>

[[Property:title|EiffelLex Samples]]
[[Property:weight|0]]
[[Property:uuid|2e4911de-4838-00fc-1742-a8ebd1ae05ff]]
<code>
Real $R
Integer $Z
String ("\"" -> "\"")
Div "//"
Mod "\\"
Quotient '/'
Product '*'
Plus '+'
Minus '-'
Relational ('=' | '<' | '>' | ('<' '=') | ('>' '=') | ('/' '='))
Comment ("--" -> "\n") *(' '| '\t') *("--" -> "\n")
FeatureAddress '$'
Dot '.'
Dotdot ".."
Semicolon ';'
Colon ':'
Comma ','
Assign ":="
ReverseAssign "?="
Lparan '('
Rparan ')'
Lcurly '{'
Rcurly '}'
Lsquare '['
Rsquare ']'
Bang '!'
LeftArray "<<"
RightArray ">>"
Power '^'
Constraint "->"
Character (('\''$P'\'') | ('\'''\\'['t'|'n'|'r'|'f']'\'') | ('\''+('0'..'7')'\''))
Identifier ~('a'..'z') *(~('a'..'z') | '_' | ('0'..'9'))
-- Keywords
as
and
check
class
current
debug
deferred
do
else
elseif
end
ensure
expanded
export
external
false
feature
from
if
implies
indexing
infix
inherit
inspect
integer
invariant
is
language
like
local
loop
not
obsolete
old
once
or
prefix
real
redefine
require
rename
rescue
result
retry
select
strip
then
true
undefine
unique
until
variant
void
when
xor
</code>

[[Property:title|Eiffel scanner]]
[[Property:weight|0]]
[[Property:uuid|c0d6ad9d-2bac-c5ad-69b6-873db2e47aa9]]
In the directory '''$ISE_EIFFEL/examples/lex''' you will find a system that scans Eiffel classes. It consists of the class [[EIFFEL_SCAN Text|EIFFEL_SCAN]], which uses the file [[EiffelLex Samples|eiffel_regular]] as the lexical grammar to analyze an Eiffel class passed on the command line.
When compiling the example, the executable '''eiffel_scan(.exe)''' is created. Use the program as follows:
<code>eiffel_scan <Eiffel class file></code>
As an example, when the [[eiffel_scan console input|source code]] of the root class is run through the scanner, it outputs a [[eiffel_scan console output|list]] of all consecutive tokens and keywords in that class to the console.

[[Property:title|EiffelLex Sample]]
[[Property:weight|2]]
[[Property:uuid|79ad35f3-75a9-429c-ad47-f304fec23306]]
* [[Eiffel scanner|Eiffel scanner]]

[[Property:title|EiffelLex Tutorial]]
[[Property:weight|0]]
[[Property:uuid|9ea43bef-1483-fbf2-4791-2be6a31d394d]]
==OVERVIEW==
When analyzing a text by computer, it is usually necessary to split it into individual components or '''tokens'''. In human languages, the tokens are the words; in programming languages, tokens are the basic constituents of software texts, such as identifiers, constants and special symbols.
The process of recognizing the successive tokens of a text is called lexical analysis. This chapter describes the EiffelLex library, a set of classes which make it possible to build and apply lexical analyzers to many different languages.
Besides recognizing the tokens, it is usually necessary to recognize the deeper syntactic structure of the text. This process is called '''parsing''' or '''syntax analysis''' and is studied in the next chapter.
Figure 1 shows the inheritance structure of the classes discussed in this chapter. Class [[ref:libraries/parse/reference/l_interface_chart|L_INTERFACE]] has also been included although we will only study it in the [[EiffelParse Tutorial]]; it belongs to the Parse library, where it takes care of the interface between parsing and lexical analysis.
[[Image:figure1]]
Figure 1: Lexical classes
==AIMS AND SCOPE OF THE EIFFELLEX LIBRARY==
To use the EiffelLex library it is necessary to understand the basic concepts and terminology of lexical analysis.
===Basic terminology===
The set of tokens accepted by a lexical analyzer is called a '''lexical grammar'''. For example, the basic constructs of Eiffel (identifiers, keywords, constants, special symbols) constitute a lexical grammar. For reasons that will be clear below, a lexical grammar is also known as a '''regular grammar'''.
A lexical grammar defines a number of '''token types''', such as Identifier and Integer for Eiffel. A token that conforms to the structure defined for a certain token type is called a '''specimen''' of that token type. For example, the token my_identifier, which satisfies the rules for Eiffel tokens, is a specimen of the token type Identifier; 201 is a specimen of the token type Integer.
To define a lexical grammar is to specify a number of token types by describing precisely, for each token type, the form of the corresponding specimens. For example a lexical grammar for Eiffel will specify that Identifier is the token type whose specimens are sequences of one or more characters, of which the first must be a letter (lower-case or upper-case) and any subsequent one is a letter, a digit (0 to 9) or an underscore. Actual grammar descriptions use a less verbose and more formal approach, studied below: regular expressions.
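For illustration only, the Identifier rule just described can be rendered as a mainstream regular expression. The sketch below uses Python and names of our own choosing; the actual EiffelLex regular-expression notation is introduced later in this chapter.

```python
import re

# The Identifier token type described above: first character a letter
# (lower- or upper-case), subsequent characters letters, digits or '_'.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_identifier(token: str) -> bool:
    """Return True if `token` is a specimen of the Identifier token type."""
    return IDENTIFIER.fullmatch(token) is not None
```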
A lexical analyzer is an object equipped with operations that enable it to read a text according to a known lexical grammar and to identify the text's successive tokens.
The classes of the EiffelLex library make it possible to define lexical grammars for many different applications, and to produce lexical analyzers for these grammars.
===Overview of the classes===
For the user of the EiffelLex library, the classes of most direct interest are [[ref:libraries/lex/reference/token_chart|TOKEN]] , [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] , [[ref:libraries/lex/reference/metalex_chart|METALEX]] and [[ref:libraries/lex/reference/scanning_chart|SCANNING]] .
An instance of [[ref:libraries/lex/reference/token_chart|TOKEN]] describes a token read from an input file being analyzed, with such properties as the token type, the corresponding string and the position in the text (line, column) where it was found.
An instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] is a lexical analyzer for a certain lexical grammar. Given a reference to such an instance, say analyzer, you may analyze an input text through calls to the features of class [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] , for example:
<code>
analyzer.get_token
</code>
Class [[ref:libraries/lex/reference/metalex_chart|METALEX]] defines facilities for building such lexical analyzers. In particular, it provides features for reading the grammar from a file and building the corresponding analyzer. Classes that need to build and use lexical analyzers may be written as descendants of [[ref:libraries/lex/reference/metalex_chart|METALEX]] to benefit from its general-purpose facilities.
Class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] is one such descendant of [[ref:libraries/lex/reference/metalex_chart|METALEX]] . It contains all the facilities needed to build an ordinary lexical analyzer and apply it to the analysis of input texts. Because these facilities are simpler to use and are in most cases sufficient, SCANNING will be discussed first; the finer-grain facilities of [[ref:libraries/lex/reference/metalex_chart|METALEX]] are described towards the end of this chapter.
These classes internally rely on others, some of which may be useful for more advanced applications. [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] , one of the supporting classes, will be introduced after [[ref:libraries/lex/reference/metalex_chart|METALEX]] .
===Library example===
The EiffelStudio delivery includes (in the examples/library/lex subdirectory) a simple example using the EiffelLex library classes. The example applies EiffelLex library facilities to the analysis of a language which is none other than Eiffel itself.
The root class of that example, <eiffel>EIFFEL_SCAN</eiffel>, is only a few lines long; it relies on the general mechanism provided by [[ref:libraries/lex/reference/scanning_chart|SCANNING]] (see below). The actual lexical grammar is given by a lexical grammar file (a concept explained below): the file of name eiffel_regular in the same directory.
===Dealing with finite automata===
Lexical analysis relies on the theory of finite automata. The most advanced of the classes discussed in this chapter, [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] , relies on classes describing various forms of automata:
* [[ref:libraries/lex/reference/dfa_chart|DFA]] : deterministic finite automata.
* [[ref:libraries/lex/reference/pdfa_chart|PDFA]] : partially deterministic finite automata.
* [[ref:libraries/lex/reference/ndfa_chart|NDFA]] : non-deterministic finite automata.
* [[ref:libraries/lex/reference/automaton_chart|AUTOMATON]] , the most general: finite automata.
* [[ref:libraries/lex/reference/fixed_automaton_chart|FIXED_AUTOMATON]] , [[ref:libraries/lex/reference/linked_automaton_chart|LINKED_AUTOMATON]] .
These classes may also be useful for systems that need to manipulate finite automata for applications other than lexical analysis. The interface of [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] , which includes the features from [[ref:libraries/lex/reference/automaton_chart|AUTOMATON]] , [[ref:libraries/lex/reference/ndfa_chart|NDFA]] and [[ref:libraries/lex/reference/pdfa_chart|PDFA]] , will provide the essential information.
==TOKENS==
A lexical analyzer built through any of the techniques described in the rest of this chapter will return tokens - instances of class [[ref:libraries/lex/reference/token_chart|TOKEN]] . Here are the most important features of this class:
* <eiffel>string_value</eiffel>: a string giving the token's contents.
* <eiffel>type</eiffel>: an integer giving the code of the token's type. The possible token types and associated integer codes are specified during the process of building the lexical analyzer in one of the ways described below.
* <eiffel>is_keyword</eiffel>: a boolean indicating whether the token is a keyword.
* <eiffel>keyword_code</eiffel>: an integer, meaningful only if <eiffel>is_keyword</eiffel> is <eiffel>True</eiffel>, and identifying the keyword by the code that was given to it during the process of building the analyzer.
* <eiffel>line_number</eiffel>, <eiffel>column_number</eiffel>: two integers indicating where the token appeared in the input text.
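The information listed above can be pictured as a simple record. Here is a minimal Python sketch (ours, not the EiffelLex implementation) mirroring the features of class TOKEN:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """Sketch of the information a TOKEN instance carries."""
    string_value: str    # the token's contents
    type: int            # integer code of the token's type
    is_keyword: bool     # whether the token is a keyword
    keyword_code: int    # meaningful only if is_keyword is True
    line_number: int     # where the token appeared in the input text
    column_number: int
```

For example, the `class` keyword at the very start of an input text might be represented as `Token("class", 1, True, 1819086195, 1, 1)` (the keyword code here is the one shown in the sample console output).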
==BUILDING AND USING LEXICAL ANALYZERS==
The general method for performing lexical analysis is the following.
# Create an instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] , giving a lexical analyzer for the desired grammar.
# Store the analyzer into a file.
# Retrieve the analyzer from the file.
# Use the analyzer to analyze the tokens of one or more input texts by calling the various features of class [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] on this object.
Steps 2 and 3 are obviously unnecessary if this process is applied as a single sequence. But in almost all practical cases you will want to use the same grammar to analyze many different input texts. Then steps 1 and 2 will be performed once and for all as soon as the lexical grammar is known, yielding an instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] that step 2 stores into a file; then in every case that requires analyzing a text you will simply retrieve the analyzer and apply it, performing steps 3 and 4 only.
The simplest way to store and retrieve the instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] and all related objects is to use the facilities of class [[ref:libraries/base/reference/storable_chart|STORABLE]] : procedure store and one of the retrieval procedures. To facilitate this process, [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] inherits from [[ref:libraries/base/reference/storable_chart|STORABLE]] .
The next sections explain how to perform these various steps. In the most common case, the best technique is to inherit from class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , which provides a framework for retrieving an analyzer file if it exists, creating it from a grammar description otherwise, and then proceeding with the lexical analysis of one or more input texts.
==LEXICAL GRAMMAR FILES AND CLASS SCANNING==
Class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] may be used as an ancestor by classes that need to perform lexical analysis. When using [[ref:libraries/lex/reference/scanning_chart|SCANNING]] you will need a '''lexical grammar file''' that contains the description of the lexical grammar. Since it is easy to edit and adapt a file without modifying the software proper, this technique provides flexibility and supports the incremental development and testing of lexical analyzers.
===The build procedure===
To obtain a lexical analyzer in a descendant of class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , use the procedure
<code>
build (store_file_name, grammar_file_name: STRING)</code>
If no file of name <code>store_file_name</code> exists, then <eiffel>build</eiffel> reads the lexical grammar from the file of name <code>grammar_file_name</code>, builds the corresponding lexical analyzer, and stores it into <code>store_file_name</code>.
If a file of name <code>store_file_name</code> already exists, <eiffel>build</eiffel> uses it to re-create the analyzer directly, without reading <code>grammar_file_name</code>.
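The build-or-retrieve behaviour can be sketched as follows. This is an illustrative Python analogue (the function name mirrors <eiffel>build</eiffel>; <code>pickle</code> stands in for Eiffel's STORABLE, and a plain dictionary stands in for the real analyzer):

```python
import os
import pickle

def build(store_file_name, grammar_file_name):
    """Retrieve a previously stored analyzer if one exists;
    otherwise build it from the grammar file and store it."""
    if os.path.exists(store_file_name):
        # An analyzer was stored in a previous session: retrieve it
        # directly, without reading the grammar file.
        with open(store_file_name, "rb") as f:
            return pickle.load(f)
    # Otherwise read the lexical grammar and build the analyzer
    # (here just a placeholder holding the grammar text)...
    with open(grammar_file_name) as f:
        analyzer = {"grammar": f.read()}
    # ... and store it for later sessions.
    with open(store_file_name, "wb") as f:
        pickle.dump(analyzer, f)
    return analyzer
```

After the first call, the grammar file is no longer needed: subsequent calls retrieve the stored analyzer.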
===Lexical grammar files===
A lexical grammar file (to be used as second argument to <eiffel>build</eiffel>, corresponding to <code>grammar_file_name</code>) should conform to a simple structure, of which the file ''eiffel_regular'' in the examples directory provides a good illustration.
Here is the general form:
<code>
Token_type_1 Regular_expression_1
Token_type_2 Regular_expression_2
...
Token_type_m Regular_expression_m
-- Keywords
Keyword_1
Keyword_2
...
Keyword_n
</code>
In other words: one or more lines, each containing the name of a token type and a '''regular expression'''; a line beginning with two dashes -- (the word '''Keywords''' may follow them to signal that this is the beginning of keywords); and one or more lines containing one keyword each.
Each ''Token_type_i'' is the name of a token type, such as ''Identifier'' or ''Decimal_constant''. Each ''Regular_expression_i'' is a regular expression, built according to a precisely specified format. That format is defined later in this chapter, but even without having seen that definition it is not hard to understand the following small and typical example of lexical grammar file without keywords:
<code>
Decimal '0'..'9'
Natural +('0'..'9')
Integer ['+'|'-'] '1'..'9' *('0'..'9')
</code>
The first expression describes a token type whose specimens are tokens made of a single-letter decimal digit (any character between 0 and 9). In the second, the + sign denotes repetition (one or more); the specimens of the corresponding token type are all non-empty sequences of decimal digits - in other words, natural numbers, with leading zeroes permitted. In the third, the | symbol denotes alternation, and the asterisk denotes repetition (zero or more); the corresponding tokens are possibly signed integer constants, with no leading zeroes.
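For readers used to mainstream regular-expression syntax, the three token types above can be transliterated as follows. The Python renderings and names are ours; the EiffelLex notation itself is defined later in this chapter.

```python
import re

DECIMAL = re.compile(r"[0-9]")             # Decimal:  '0'..'9'
NATURAL = re.compile(r"[0-9]+")            # Natural:  +('0'..'9')
INTEGER = re.compile(r"[+-]?[1-9][0-9]*")  # Integer:  ['+'|'-'] '1'..'9' *('0'..'9')
```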
As explained below, keywords are regular expressions which are treated separately for convenience and efficiency. If you are using lexical grammar files of the above form, all keywords must be specimens of the last regular expression given ( ''Regular_expression_m'' above). More details below.
===Using a lexical analyzer===
Once <eiffel>build</eiffel> has given you an analyzer, you may use it to analyze input texts through calls to the procedure
<code>
analyze (input_file_name: STRING)</code>
This will read in and process successive input tokens. Procedure <eiffel>analyze</eiffel> will apply to each of these tokens the action of procedure <eiffel>do_a_token</eiffel>. As defined in SCANNING, this procedure prints out information on the token: its string value, its type, whether it is a keyword and if so its code. You may redefine it in any descendant class so as to perform specific actions on each token.
The initial action <eiffel>begin_analysis</eiffel>, which by default prints a header, and the terminal action <eiffel>end_analysis</eiffel>, which by default does nothing, may also be redefined.
To build lexical analyzers which provide a higher degree of flexibility, use [[ref:libraries/lex/reference/metalex_chart|METALEX]] or [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] , as described in the last part of this chapter.
==ANALYZING INPUT==
Let us look more precisely at how we can use a lexical analyzer to analyze an input text.
===Class LEXICAL===
Procedure <eiffel>analyze</eiffel> takes care of the most common needs of lexical analysis. But if you need more advanced lexical analysis facilities you will need an instance of class [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] (a direct instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] itself or of one of its proper descendants). If you are using class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] as described above, you will have access to such an instance through the attribute <eiffel>analyzer</eiffel>.
This discussion will indeed assume that you have an entity attached to an instance of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] . The name of that entity is assumed to be <eiffel>analyzer</eiffel>, although it does not need to be the attribute from [[ref:libraries/lex/reference/scanning_chart|SCANNING]] . You can apply to that <eiffel>analyzer</eiffel> the various exported features of class [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] , explained below. All the calls described below should use <eiffel>analyzer</eiffel> as their target, as in
<code>
analyzer.set_file ("my_file_name")
</code>
===Creating, retrieving and storing an analyzer===
To create a new analyzer, use
<code>
create analyzer.make_new
</code>
You may also retrieve an analyzer from a previous session. [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] is a descendant of [[ref:libraries/base/reference/storable_chart|STORABLE]] , so you can use feature <eiffel>retrieved</eiffel> for that purpose. In a descendant of [[ref:libraries/base/reference/storable_chart|STORABLE]] , simply write
<code>
analyzer ?= retrieved
</code>
If you do not want to make the class a descendant of [[ref:libraries/base/reference/storable_chart|STORABLE]] , use the creation procedure <eiffel>make</eiffel> of [[ref:libraries/lex/reference/lexical_chart|LEXICAL]] , not to be confused with <eiffel>make_new</eiffel> above:
<code>
create analyzer.make
analyzer ?= analyzer.retrieved
</code>
===Choosing a document===
To analyze a text, call <eiffel>set_file</eiffel> or <eiffel>set_string</eiffel> to specify the document to be parsed. With the first call, the analysis will be applied to a file; with the second, to a string.
{{note|if you use procedure <eiffel>analyze</eiffel> of [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , you do not need any such call, since <eiffel>analyze</eiffel> calls <eiffel>set_file</eiffel> on the file name passed as argument. }}
===Obtaining the tokens===
The basic procedure for analyzing successive tokens in the text is <eiffel>get_token</eiffel>, which reads in one token and sets up various attributes of the analyzer to record properties of that token:
* <eiffel>last_token</eiffel>, a function of type [[ref:libraries/lex/reference/token_chart|TOKEN]] , which provides all necessary information on the last token read.
* <eiffel>token_line_number</eiffel> and <eiffel>token_column_number</eiffel>, to know where the token is in the text. These queries return results of type <eiffel>INTEGER</eiffel>.
* <eiffel>token_type</eiffel>, giving the regular expression type, identified by its integer number (which is the value <eiffel>No_token</eiffel> if no correct token was recognized).
* <eiffel>other_possible_tokens</eiffel>, an array giving all the other possible token types of the last token. (If <eiffel>token_type</eiffel> is <eiffel>No_token</eiffel> the array is empty.)
* <eiffel>end_of_text</eiffel>, a boolean attribute used to record whether the end of text has been reached. If so, subsequent calls to <eiffel>get_token</eiffel> will have no effect.
Procedure <eiffel>get_token</eiffel> recognizes the longest possible token. So if <code><</code>, <code>=</code> and <code><=</code> are all regular expressions in the grammar, the analyzer recognizes <code><=</code> as one token, rather than <code><</code> followed by <code>=</code>. You can use <eiffel>other_possible_tokens</eiffel> to know what shorter tokens were recognized but not retained.
If it fails to recognize a regular expression, <eiffel>get_token</eiffel> sets <eiffel>token_type</eiffel> to <eiffel>No_token</eiffel> and advances the input cursor by one character.
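The longest-match rule can be pictured with a toy Python stand-in for <eiffel>get_token</eiffel> (names ours; for simplicity the "expressions" here are fixed strings rather than full regular expressions):

```python
def get_token(text, pos, expressions):
    """Return the longest expression in `expressions` that matches
    `text` starting at `pos`, or None (the analogue of No_token)
    if none matches."""
    candidates = [e for e in expressions if text.startswith(e, pos)]
    return max(candidates, key=len) if candidates else None
```

With expressions `<`, `=` and `<=`, the input `<=` yields the single token `<=` rather than `<` followed by `=`.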
===The basic scheme===
Here is the most common way of using the preceding facilities:
<code>
from
analyzer.set_file ("text_directory/text_to_be_parsed")
-- Or: analyzer.set_string ("string to parse")
begin_analysis
until
analyzer.end_of_text
loop
analyzer.get_token
if analyzer.token_type = No_token then
go_on
end
do_a_token (analyzer.last_token)
end
end_analysis
</code>
This scheme is used by procedure <eiffel>analyze</eiffel> of class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , so that in standard cases you may simply inherit from that class and redefine procedures <eiffel>begin_analysis</eiffel>, <eiffel>do_a_token</eiffel>, and <eiffel>end_analysis</eiffel>. If you are not inheriting from [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , these names simply denote procedures that you must provide.
==REGULAR EXPRESSIONS==
The EiffelLex library supports a powerful set of construction mechanisms for describing the various types of tokens present in common languages such as programming languages, specification languages or just text formats. These mechanisms are called '''regular expressions'''; any regular expression describes a set of possible tokens, called the '''specimens''' of the regular expression.
Let us now study the format of regular expressions. This format is used in particular for the lexical grammar files needed by class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] and (as seen below) by procedure <eiffel>read_grammar</eiffel> of class [[ref:libraries/lex/reference/metalex_chart|METALEX]] . The ''eiffel_regular'' grammar file in the examples directory provides an extensive example.
Each regular expression denotes a set of tokens. For example, the first regular expression seen above, <br/>
<code>
'0'..'9'
</code>
<br/>
denotes a set of ten tokens, each consisting of a single digit.
===Basic expressions===
A character expression, written '' 'character' '' where ''character'' is a single character, describes a set of tokens with just one element: the one-character token made of ''character''. For example, '' '0' '' describes the set whose only element is the one-character token ''0''.
Cases in which character is not a printable character use the following conventions:
{| border="1"
|-
| '\ooo'
| Character given by its three-digit octal code ''ooo''.
|-
| '\xx'
| Character given by its two-digit hexadecimal code ''xx''. <br/>
(Both lower- and upper-case may be used for letters in ''xx''.)
|-
| '\r'
| Carriage return
|-
| '\''
| Single quote
|-
| '\\'
| Backslash
|-
| '\t'
| Tabulation
|-
| '\n'
| New line
|-
| '\b'
| Backspace
|-
| '\f'
| Form feed
|}
===Intervals===
An interval, written ''lower..upper'' where ''lower'' and ''upper'' are character expressions, describes a set of one-character tokens: all the characters whose ASCII code is between the codes for the characters in ''lower'' and ''upper''. For example, '' '0'..'9' '' contains all tokens made of a single decimal digit.
===Basic operator expressions===
A parenthesized expression, written (''exp'') where ''exp'' is a regular expression, describes the same set of tokens as ''exp''. This serves to remove ambiguities in complex regular expressions. For example, the parenthesized expression ('' '0'..'9' '') also describes all single-decimal-digit tokens.
A difference, written ''interval - char'', where ''interval'' is an interval expression and ''char'' is a character expression, describes the set of tokens which are in ''interval'' but not in ''char''. For example, the difference '' '0'..'9' - '4' '' describes all single-decimal-digit tokens except those made of the digit 4.
{{caution|A difference may only apply to an interval and a single character. }}
===Iterations===
An unbounded iteration, written ''*exp'' or ''+exp'' where ''exp'' is a regular expression, describes the set of tokens made of sequences of zero or more specimens of ''exp'' (in the first form, using the asterisk), or of one or more specimens of ''exp'' (in the second form, using the plus sign). For example, the iteration ''+('0'..'9')'' describes the set of tokens made of one or more consecutive decimal digits.
A fixed iteration, written ''n exp'' where ''n'' is a natural integer constant and ''exp'' is a regular expression, describes the set of tokens made of sequences of exactly ''n'' specimens of ''exp''. For example, ''3 ('A'..'Z')'' describes the set of all three-letter upper-case tokens.
===Other operator expressions===
A concatenation, written exp<sub>1</sub> exp<sub>2</sub> ... exp<sub>n</sub>, describes the set of tokens made of a specimen of exp<sub>1</sub> followed by a specimen of exp<sub>2</sub> etc. For example, the concatenation '' '1'..'9' * ('0'..'9')'' describes the set of tokens made of one or more decimal digits, not beginning with a zero - in other words, integer constants in the usual notation.
An optional component, written ''[exp]'' where ''exp'' is a regular expression, describes the set of tokens that includes the empty token and all specimens of ''exp''. Optional components usually appear in concatenations.
Concatenations may be inconvenient when the concatenated elements are simply characters, as in '' 'A' ' ' 'T' 'e' 'x' 't' ''. In this case you may use a '''string''' in double quotes, as in <br/>
<code>
"A Text"</code>
More generally, a string is written "a<sub>1</sub> a<sub>2</sub> ... a<sub>n</sub>" for ''n >= 0'', where the "a<sub>i</sub>" are characters, and is an abbreviation for the concatenation 'a<sub>1</sub>' 'a<sub>2</sub>' ... 'a<sub>n</sub>', representing a set containing a single token. In a string, the double quote character " is written \" and the backslash character \ is written \\. No other special characters are permitted; if you need special characters, use explicit concatenation. As a special case, "" represents the set containing a single empty token.
A union, written exp<sub>1</sub> | exp<sub>2</sub> | ... | exp<sub>n</sub>, describes the set of tokens which are specimens of exp<sub>1</sub>, or of exp<sub>2</sub>, etc. For example, the union ''('a'..'z') | ('A'..'Z')'' describes the set of single-letter tokens (lower-case or upper-case).
===Predefined expressions===
A joker, written '''$?''', describes the set of all tokens made of exactly one character. A joker is considered to be an interval expression, so that it may be the first operand of a difference operation.
A printable, written '''$P''', describes the set of all tokens made of exactly one printable character.
A blank, written '''$B''', describes the set of all tokens made of one or more specimens of the characters blank, new-line, carriage-return and tabulation.
The following non-elementary forms are abbreviations for commonly needed regular expressions:
{| border="1"
|-
| Code
| Equivalent expression
| Role
|-
| '''$L'''
| '' '\n' ''
| New-line character
|-
| '''$N'''
| ''+('0'..'9')''
| Natural integer constants
|-
| '''$R'''
| '' <nowiki>['+'|'-'] +('0'..'9') '.' *('0'..'9')['e'|'E' ['+'|'-'] +('0'..'9')]</nowiki> ''
| Floating point constants
|-
| '''$W'''
| '' +( '''$P''' - ' ' - '\t' - '\n' - '\r') ''
| Words
|-
| '''$Z'''
| '' <nowiki>['+'|'-'] +('0'..'9')</nowiki> ''
| Possibly signed integer constants
|}
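For readers used to conventional regular-expression notation, the abbreviations above can be mapped onto rough Python <code>re</code> equivalents. The mapping below is illustrative only and is not part of EiffelLex; in particular '''$W''' is approximated by <code>\S</code> (any non-blank character):

```python
import re

# Rough conventional-regex counterparts of the EiffelLex abbreviations.
# Illustrative only; EiffelLex compiles its own automata.
PATTERNS = {
    "$N": r"[0-9]+",                                   # natural integer constants
    "$Z": r"[+-]?[0-9]+",                              # possibly signed integers
    "$R": r"[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?",  # floating point constants
    "$W": r"\S+",                                      # words ($P minus blanks, approximated)
}

def is_specimen(code: str, token: str) -> bool:
    """True if `token`, taken as a whole, is a specimen of abbreviation `code`."""
    return re.fullmatch(PATTERNS[code], token) is not None
```

Note that '''$R''' allows an empty fraction part after the point, so <code>3.</code> is a specimen but <code>3</code> is not.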
A delimited string, written ''->string'', where ''string'' is of the form "a<sub>1</sub> a<sub>2</sub> ... a<sub>n</sub>", represents the set of tokens made of any number of printable characters and terminated by ''string''.
One more form of regular expression, case-sensitive expressions, using the ~ symbol, will be introduced below.
===Combining expression-building mechanisms===
You may freely combine the various construction mechanisms to describe complex regular expressions. Below are a few examples.
{| border="1"
|-
| '' 'a'..'z' - 'c' - 'e' ''
| Single-lower-case-letter tokens, except ''c'' and ''e''.
|-
| ''$? - '\007'''
| Any single-character token except ASCII 007.
|-
| ''+('a'..'z')''
| One or more lower-case letters.
|-
| '' <nowiki>['+'|'-'] '1'..'9' *('0'..'9')</nowiki> ''
| Integer constants, optional sign, no leading zero.
|-
| ''->"*/"''
| Any string up to and including an occurrence of */ <br/>
(the closing symbol of a PL/I or C comment).
|-
| ''"\"" ->"\""''
| Eiffel strings.
|}
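In the same spirit, the combined expressions in this table have approximate conventional-regex translations. The pattern names below are invented for the illustration, and escape sequences inside Eiffel strings are ignored for brevity:

```python
import re

# Approximate conventional-regex translations of the table above (illustrative).
lower_except_c_e = r"[abdf-z]"                  # 'a'..'z' - 'c' - 'e'
integer_no_leading_zero = r"[+-]?[1-9][0-9]*"   # ['+'|'-'] '1'..'9' *('0'..'9')
up_to_comment_close = r"(?s).*?\*/"             # ->"*/" : shortest match ending in */
simple_string = r'"[^"]*"'                      # "\"" ->"\"" (escape sequences ignored)
```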
===Dealing with keywords===
Many languages to be analyzed have keywords - or, more generally, "reserved words". Eiffel, for example, has reserved words such as <code>class</code> and <code>Result</code>.
{{note|in Eiffel terminology reserved words include keywords; a keyword is a marker playing a purely syntactical role, such as <code>class</code>. Predefined entities and expressions such as <code>Result</code> and <code>Current</code>, which have an associated value, are considered reserved words but not keywords. The present discussion uses the term "keyword" although it can be applied to all reserved words. }}
In principle, keywords could be handled just as other token types. In Eiffel, for example, one might treat each reserved word as a token type with only one specimen; these token types would have names such as Class or Then and would be defined in the lexical grammar file:
<code>
Class   'c' 'l' 'a' 's' 's'
Then    't' 'h' 'e' 'n'
...
</code>
This would be inconvenient. To simplify the task of language description, and also to improve the efficiency of the lexical analysis process, it is usually preferable to treat keywords as a separate category.
If you are using class SCANNING and hence a lexical grammar file, the list of keywords, if any, must appear at the end of the file, one per line, preceded by a line that simply reads
'''-- Keywords'''
For example the final part of the example Eiffel lexical grammar file appears as:
<code>
... Other token type definitions ...
Identifier ~('a'..'z') *(~('a'..'z') | '_' | ('0'..'9'))
-- Keywords
alias
all
and
as
BIT
BOOLEAN
... Other reserved words ...
</code>
{{caution|Every keyword in the keyword section must be a specimen of one of the token types defined for the grammar, and that token type must be the last one defined in the lexical grammar file, just before the '''-- Keywords''' line. So in Eiffel, where the keywords have the same lexical structure as identifiers, the last line before the keyword section must be the definition of the token type ''Identifier'', as shown above. }}
{{note|The rule that all keywords must be specimens of one token type is a matter of convenience and simplicity, and only applies if you are using SCANNING and lexical grammar files. There is no such restriction if you rely directly on the more general facilities provided by [[ref:libraries/lex/reference/metalex_chart|METALEX]] or [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] . Then different keywords may be specimens of different regular expressions; you will have to specify the token type of every keyword, as explained later in this chapter. }}
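The two-stage scheme implied by this rule, first recognizing a token of the general type and then looking it up in the keyword table, can be sketched as follows. This is an illustration in Python, not the library's implementation, and the keyword list is truncated from the example above:

```python
import re

# Two-stage recognition: match the general token type (Identifier), then
# check the keyword table. Illustrative sketch, not the library's code.
IDENTIFIER = re.compile(r"[a-z][a-z0-9_]*", re.IGNORECASE)
KEYWORDS = {"alias", "all", "and", "as", "bit", "boolean"}  # truncated list

def classify(token: str) -> str:
    """Return "Keyword" or "Identifier" for a token of the identifier form."""
    if IDENTIFIER.fullmatch(token) is None:
        raise ValueError("not an identifier: " + token)
    # Keyword lookup is case-insensitive, matching the EiffelLex default.
    return "Keyword" if token.lower() in KEYWORDS else "Identifier"
```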
===Case sensitivity===
By default, letter case is not significant for regular expressions and keywords. So if ''yes'' matches a token type defined by a regular expression, or is a keyword, the input values ''Yes'', ''yEs'' and ''yES'' will all yield the same token or keyword. This also means that '' 'a'..'z' '' and '' 'a'..'z' | 'A'..'Z' '' describe the same set of tokens.
The regular expression syntax introduced above offers a special notation to specify that a particular expression is case-sensitive: ''~exp'', where ''exp'' is a regular expression. For example, ''~('A'..'Z')'' only covers single-upper-case-letter tokens. But for all other kinds of expression letter case is not taken into account.
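In conventional-regex terms, the default behavior amounts to compiling every expression with a case-folding flag, and ''~exp'' to omitting that flag for the marked expression. A sketch, assuming Python's <code>re</code> (illustrative only):

```python
import re

def compile_expression(pattern: str, case_sensitive: bool = False):
    """Compile a pattern the way EiffelLex treats expressions: case is
    ignored by default, significant only when requested (illustrative)."""
    return re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)

lower = compile_expression(r"[a-z]")                            # 'a'..'z'
upper_only = compile_expression(r"[A-Z]", case_sensitive=True)  # ~('A'..'Z')
```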
You may change this default behavior through a set of procedures introduced in class [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] and hence available in its descendants [[ref:libraries/lex/reference/metalex_chart|METALEX ]] and [[ref:libraries/lex/reference/scanning_chart|SCANNING]] .
To make subsequent regular expressions case-sensitive, call the procedure
<code>
distinguish_case
</code>
To revert to the default mode where case is not significant, call the procedure
<code>
ignore_case
</code>
Each of these procedures remains in effect until the other one is called, so that you only need one call to define the desired behavior.
For keywords, the policy is less tolerant. A single rule is applied to the entire grammar: keywords are either all case-sensitive or all case-insensitive. To make all keywords case-sensitive, call
<code>
keywords_distinguish_case
</code>
The inverse call, corresponding to the default rule, is
<code>
keywords_ignore_case
</code>
Either of these calls must be executed before you define any keywords; if you are using [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , this means before calling procedure build. Once set, the keyword case-sensitivity policy cannot be changed.
==USING METALEX TO BUILD A LEXICAL ANALYZER==
(You may skip the rest of this chapter if you only need simple lexical facilities.)
Class [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , as studied above, relies on a class [[ref:libraries/lex/reference/metalex_chart|METALEX]] . In some cases, you may prefer to use the features of [[ref:libraries/lex/reference/metalex_chart|METALEX]] directly. Since [[ref:libraries/lex/reference/scanning_chart|SCANNING]] inherits from [[ref:libraries/lex/reference/metalex_chart|METALEX]] , anything you do with [[ref:libraries/lex/reference/metalex_chart|METALEX]] can in fact be done with [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , but you may wish to stay with just [[ref:libraries/lex/reference/metalex_chart|METALEX]] if you do not need the additional features of [[ref:libraries/lex/reference/scanning_chart|SCANNING]] .
===Steps in using METALEX===
[[ref:libraries/lex/reference/metalex_chart|METALEX]] has an attribute <eiffel>analyzer</eiffel>, which will be attached to a lexical analyzer. This class provides tools for building a lexical analyzer incrementally through explicit feature calls; you can still use a lexical grammar file, but do not have to.
The following extract from a typical descendant of [[ref:libraries/lex/reference/metalex_chart|METALEX]] illustrates the process of building a lexical analyzer in this way:
<code>
Upper_identifier, Lower_identifier, Decimal_constant, Octal_constant, Word: INTEGER is unique
...
distinguish_case
keywords_distinguish_case
put_expression ("+('0'..'7')", Octal_constant, "Octal")
put_expression ("'a'..'z' *('a'..'z'|'0'..'9'|'_')", Lower_identifier, "Lower")
put_expression ("'A'..'Z' *('A'..'Z'|'0'..'9'|'_' )", Upper_identifier, "Upper")
dollar_w (Word)
...
put_keyword ("begin", Lower_identifier)
put_keyword ("end", Lower_identifier)
put_keyword ("THROUGH", Upper_identifier)
...
make_analyzer
</code>
This example follows the general scheme of building a lexical analyzer with the features of [[ref:libraries/lex/reference/metalex_chart|METALEX]] , in a class that will normally be a descendant of [[ref:libraries/lex/reference/metalex_chart|METALEX]] :
# Set options, such as case sensitivity.
# Record regular expressions.
# Record keywords (this may be interleaved with step 2.)
# "Freeze" the analyzer by a call to <eiffel>make_analyzer</eiffel>.
To perform steps 2 to 4 in a single shot and generate a lexical analyzer from a lexical grammar file, as with [[ref:libraries/lex/reference/scanning_chart|SCANNING]] , you may use the procedure
<code>
read_grammar (grammar_file_name: STRING)
</code>
In this case all the expressions and keywords are taken from the file of name <code>grammar_file_name</code> rather than passed explicitly as arguments to the procedures of the class. You do not need to call <eiffel>make_analyzer</eiffel>, since <eiffel>read_grammar</eiffel> includes such a call.
The rest of this discussion assumes that the four steps are executed individually as shown above, rather than as a whole using <eiffel>read_grammar</eiffel>.
===Recording token types and regular expressions===
As shown by the example, each token type, defined by a regular expression, must be assigned an integer code. Here the developer has chosen to use Unique constant values so as not to worry about selecting values for these codes manually, but you may select any values that are convenient or mnemonic. The values have no effect other than enabling you to keep track of the various lexical categories. Rather than using literal values directly, it is preferable to rely on symbolic constants, Unique or not, which will be more mnemonic.
Procedure <eiffel>put_expression</eiffel> records a regular expression. The first argument is the expression itself, given as a string built according to the rules seen earlier in this chapter. The second argument is the integer code for the expression. The third argument is a string which gives a name identifying the expression. This is useful mostly for debugging purposes; there is also a procedure <eiffel>put_nameless_expression</eiffel> which does not have this argument and is otherwise identical to <eiffel>put_expression</eiffel>.
Procedure <eiffel>dollar_w</eiffel> corresponds to the '''$W''' syntax for regular expressions. Here an equivalent call would have been
<code>
put_nameless_expression ("$W", Word)
</code>
Procedure <eiffel>put_keyword</eiffel> records a keyword. The first argument is a string containing the keyword; the second argument is the regular expression of which the keyword must be a specimen. The example shows that here - in contrast with the rule enforced by [[ref:libraries/lex/reference/scanning_chart|SCANNING]] - not all keywords need be specimens of the same regular expression.
The calls seen so far record a number of regular expressions and keywords, but do not give us a lexical analyzer yet. To obtain a usable lexical analyzer, you must call
<code>
make_analyzer
</code>
After that call, you may not record any new regular expression or keyword. The analyzer is usable through attribute <eiffel>analyzer</eiffel>.
{{note|for readers knowledgeable in the theory of lexical analysis: one of the most important effects of the call to <eiffel>make_analyzer</eiffel> is to transform the non-deterministic finite automaton resulting from calls such as the ones above into a deterministic finite automaton. }}
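The transformation mentioned in the note is the classical subset construction. A minimal sketch of the idea, for a hand-built automaton without epsilon transitions (illustrative only, not the library's internal representation):

```python
# Subset construction: the deterministic automaton's states are sets of
# states of the non-deterministic one. Hand-built NFA, no epsilon moves.

def nfa_to_dfa(nfa, start, accepting):
    """`nfa` maps (state, symbol) to a set of successor states."""
    alphabet = {symbol for (_, symbol) in nfa}
    dfa = {}
    start_set = frozenset([start])
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if target:
                dfa[(current, symbol)] = target
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return dfa, start_set, {s for s in seen if s & accepting}

def run_dfa(dfa, start, accepting, text):
    """Deterministic simulation: at most one successor per (state, symbol)."""
    state = start
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in accepting
```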
Remember that if you use procedure <eiffel>read_grammar</eiffel>, you need not worry about <eiffel>make_analyzer</eiffel>, as the former procedure calls the latter.
Another important feature of class [[ref:libraries/lex/reference/metalex_chart|METALEX]] is procedure <eiffel>store_analyzer</eiffel>, which stores the analyzer into a file whose name is passed as argument, for use by later lexical analysis sessions. To retrieve the analyzer, simply use procedure <eiffel>retrieve_analyzer</eiffel>, again with a file name as argument.
==BUILDING A LEXICAL ANALYZER WITH LEX_BUILDER==
To have access to the most general set of lexical analysis mechanisms, you may use class [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] , which gives you an even finer grain of control than [[ref:libraries/lex/reference/metalex_chart|METALEX]] . This is not necessary in simple applications.
===Building a lexical analyzer===
[[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] enables you to build a lexical analyzer by describing successive token types and keywords. This is normally done in a descendant of [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] . For each token type, you call a procedure that builds an object, or '''tool''', representing the associated regular expression.
For the complete list of available procedures, refer to the flat-short form of the class; there is one procedure for every category of regular expression studied earlier in this chapter. Two typical examples of calls are:
<code>
interval ('a', 'z')
-- Create an interval tool
union (Letter, Underlined)
-- Create a union tool
</code>
Every such procedure call also assigns an integer index to the tool it creates; this number is available through the attribute <eiffel>last_created_tool</eiffel>. You will need to record it into an integer entity, for example <eiffel>Identifier</eiffel> or <eiffel>Letter</eiffel>.
===An example===
The following extract from a typical descendant of [[ref:libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] illustrates how to create a tool representing the identifiers of an Eiffel-like language.
<code>
Identifier, Letter, Digit, Underlined, Suffix, Suffix_list: INTEGER
build_identifier
do
interval ('a', 'z')
Letter := last_created_tool
interval ('0', '9')
Digit := last_created_tool
interval ('_', '_')
Underlined := last_created_tool
union (Digit, Underlined)
Suffix := last_created_tool
iteration (Suffix)
Suffix_list := last_created_tool
append (Letter, Suffix_list)
Identifier := last_created_tool
end
</code>
Each token type is characterized by a number in <eiffel>tool_list</eiffel>. Each tool has a name, recorded in <eiffel>tool_names</eiffel>, which gives a readable form of the corresponding regular expression. You can use it to check that you are building the right tool.
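The compositional style of tool building can be mimicked with plain regex fragments, one helper per operation. The helpers below are illustrative stand-ins, not the library's API (which builds automata, not pattern strings):

```python
import re

# Illustrative stand-ins for the tool-building operations: each "tool" is
# just a regex fragment here, whereas LEX_BUILDER builds finite automata.
def interval(low, high): return "[%s-%s]" % (low, high)
def union(a, b):         return "(?:%s|%s)" % (a, b)
def iteration(a):        return "(?:%s)*" % a    # zero or more occurrences
def append(a, b):        return a + b            # concatenation

letter = interval("a", "z")
digit = interval("0", "9")
underlined = "_"
suffix = union(digit, underlined)
suffix_list = iteration(suffix)
identifier = append(letter, suffix_list)   # a letter, then digits/underscores
```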
===Selecting visible tools===
In the preceding example, only some of the tools, such as <eiffel>Identifier</eiffel>, are of interest to the clients. Others, such as <eiffel>Suffix</eiffel> and <eiffel>Suffix_list</eiffel>, only play an auxiliary role.
When you create a tool, it is by default invisible to clients. To make it visible, use procedure <eiffel>select_tool</eiffel>. Clients will need a number identifying it; to set this number, use procedure <eiffel>associate</eiffel>. For example the above extract may be followed by:
<code>
select_tool (Identifier)
associate (Identifier, 34)
put_keyword ("class", Identifier)
put_keyword ("end", Identifier)
put_keyword ("feature", Identifier)
</code>
If the analysis encounters a token that belongs to two or more different selected regular expressions, the one entered last takes over. Others are recorded in the array <eiffel>other_possible_tokens</eiffel>.
If you do not explicitly give an integer value to a regular expression, its default value is its rank in <eiffel>tool_list</eiffel>.


@@ -0,0 +1,11 @@
[[Property:title|EiffelLex]]
[[Property:weight|1]]
[[Property:uuid|52e88d58-1a02-d4e2-5503-e405253e7656]]
==EiffelLex Library==
Type: Library <br/>
Platform: Any <br/>
Eiffel classes to facilitate lexical analysis.


@@ -0,0 +1,5 @@
[[Property:title|EiffelParse Class Reference]]
[[Property:weight|1]]
[[Property:uuid|6b37fcc9-198c-a846-2ff2-32fc30c0d029]]
==View the [[ref:libraries/parse/reference/index|EiffelParse Class Reference]]==


@@ -0,0 +1,784 @@
[[Property:title|EiffelParse Tutorial]]
[[Property:weight|0]]
[[Property:uuid|b5861080-e5fd-dbb8-bf89-452915b3483e]]
==OVERVIEW==
Parsing is the task of analyzing the structure of documents such as programs, specifications or other structured texts.
Many systems need to parse documents. The best-known examples are compilers, interpreters and other software development tools; but as soon as a system provides its users with a command language, or processes input data with a non-trivial structure, it will need parsing facilities.
This chapter describes the EiffelParse library, which you can use to process documents of many different types. It provides a simple and flexible parsing scheme, resulting from the full application of object-oriented principles.
Because it concentrates on the higher-level structure, the EiffelParse library requires auxiliary mechanisms for identifying a document's lexical components: words, numbers and other such elementary units. To address this need it is recommended, although not required, to complement EiffelParse with the EiffelLex library studied in the previous chapter.
Figure 1 shows the inheritance structure of the classes discussed in this chapter.
[[Image:figure1]]
Figure 1: Parse class structure
==WHY USE THE EIFFELPARSE LIBRARY==
Let us first look at the circumstances under which you may want - or not want - to use the EiffelParse library.
===The EiffelParse library vs. parser generators===
Parsing is a heavily researched area of computing science and many tools are available to generate parsers. In particular, the popular Yacc tool, originally developed for Unix, is widely used to produce parsers.
In some cases Yacc or similar tools are perfectly adequate. It is also sometimes desirable to write a special-purpose parser for a language, not relying on any parser generator. Several circumstances may, however, make the Parse library attractive:
* The need to interface the parsing tasks with the rest of an object-oriented system (such as a compiler or more generally a "document processor" as defined below) in the simplest and most convenient way.
* The desire to apply object-oriented principles as fully as possible to all aspects of a system, including parsing, so as to gain the method's many benefits, in particular reliability, reusability and extendibility.
* The need to tackle languages whose structure is not easily reconciled with the demands of common parser generators, which usually require the grammar to be LALR(1). (The EiffelParse library uses a more tolerant LL scheme, whose only significant constraint is the absence of left recursion; the library provides mechanisms to detect this condition, which is easy to correct.)
* The need to define several possible semantic treatments on the same syntactic structure.
The last reason may be the most significant practical argument in favor of using EiffelParse. Particularly relevant is the frequent case of a software development environment in which a variety of tools all work on the same basic syntactic structure. For example an environment supporting a programming language such as Pascal or Eiffel may include a compiler, an interpreter, a pretty-printer, software documentation tools (such as Eiffel's short and flat-short facilities), browsing tools and several other mechanisms that all need to perform semantic actions on software texts that have the same syntactic structure. With common parser generators such as Yacc, the descriptions of syntactic structure and semantic processing are inextricably mixed, so that you normally need one new specification for each tool. This makes design, evolution and reuse of specifications difficult and error-prone.
In contrast, the EiffelParse library promotes a specification style whereby syntax and semantics are kept separate, and uses inheritance to allow many different semantic descriptions to rely on the same syntactic stem. This will make EiffelParse particularly appropriate in such cases.
===A word of caution===
At the time of publication the EiffelParse library has not reached the same degree of maturity as the other libraries presented in this book. It should thus be used with some care. You will find at the end of this chapter a few comments about the work needed to bring the library to its full realization.
==AIMS AND SCOPE OF THE EIFFELPARSE LIBRARY==
To understand the EiffelParse library it is necessary to appreciate the role of parsing and its place in the more general task of processing documents of various kinds.
===Basic terminology===
First, some elementary conventions. The word '''document''' will denote the texts to be parsed. The software systems which perform parsing as part of their processing will be called '''document processors'''.
Typical document processors are compilers, interpreters, program checkers, specification analyzers and documentation tools. For example the EiffelStudio environment contains a number of document processors, used for compiling, documentation and browsing.
===Parsing, grammars and semantics===
Parsing is seldom an end in itself; rather, it serves as an intermediate step for document processors which perform various other actions.
Parsing takes care of one of the basic tasks of a document processor: reconstructing the logical organization of a document, which must conform to a certain '''syntax''' (or structure), defined by a '''grammar'''.
{{note|The more complete name '''syntactic grammar''' avoids any confusion with the ''lexical'' grammars discussed in the [[EiffelLex Tutorial]]. By default, "grammar" with no further qualification will always denote a syntactic grammar. A syntactic grammar normally relies on a lexical grammar, which gives the form of the most elementary components - the tokens - appearing in the syntactic structure. }}
Once parsing has reconstructed the structure of a document, the document processor will perform various operations on the basis of that structure. For example a compiler will generate target code corresponding to the original text; a command language interpreter will execute the operations requested in the commands; and a documentation tool such as the short and flat-short commands for Eiffel will produce some information on the parsed document. Such operations are called '''semantic actions'''. One of the principal requirements on a good parsing mechanism is that it should make it easy to graft semantics onto syntax, by adding semantic actions of many possible kinds to the grammar.
The EiffelParse library provides predefined classes which handle the parsing aspect automatically and provide the hooks for adding semantic actions in a straightforward way. This enables developers to write full document processors - handling both syntax and semantics - simply and efficiently.
As noted at the beginning of this chapter, it is possible to build a single syntactic base and use it for several processors (such as a compiler and a documentation tool) with semantically different goals, such as compilation and documentation. In the EiffelParse library the semantic hooks take the form of deferred routines, or of routines with default implementations which you may redefine in descendants.
==LIBRARY CLASSES==
The EiffelParse library contains a small number of classes which cover common document processing applications. The classes, whose inheritance structure was shown at the beginning of this chapter, are:
* [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] , describing the general notion of syntactical construct.
* [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] , describing constructs of the "aggregate" form.
* [[ref:libraries/parse/reference/choice_chart|CHOICE]] , describing constructs of the "choice" form.
* [[ref:libraries/parse/reference/repetition_chart|REPETITION]] , describing constructs of the "repetition" form.
* [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] , describing "terminal" constructs with no further structure.
* [[ref:libraries/parse/reference/keyword_chart|KEYWORD]] , describing how to handle keywords.
* [[ref:libraries/parse/reference/l_interface_chart|L_INTERFACE]] , providing a simple interface with the lexical analysis process and the Lex library.
* [[ref:libraries/parse/reference/input_chart|INPUT]] , describing how to handle the input document.
==EXAMPLES==
The EiffelStudio delivery includes (in the examples/library/parse subdirectory) a simple example using the EiffelParse Library classes. The example is a processor for "documents" which describe computations involving polynomials with variables. The corresponding processor is a system which obtains polynomial specifications and variable values from a user, and computes the corresponding polynomials.
This example illustrates the most important mechanisms of the EiffelParse Library and provides a guide for using the facilities described in this chapter. The components of its grammar appear as illustrations in the next sections.
==CONSTRUCTS AND PRODUCTIONS==
A set of documents possessing common properties, such as the set of all valid Eiffel classes or the set of all valid polynomial descriptions, is called a '''language'''.
In addition to its lexical aspects, the description of a language includes both syntactic and semantic properties. The grammar - the syntactic specification - describes the structure of the language (for example how an Eiffel class is organized into a number of clauses); the semantic specification defines the meaning of documents written in the language (for example the run-time properties of instances of the class, and the effect of feature calls).
To discuss the EiffelParse library, it is simpler to consider "language" as a purely syntactic notion; in other words, a language is simply the set of documents conforming to a certain syntactic grammar (taken here to include the supporting lexical grammar). Any semantic aspect will be considered to belong to the province of a specific document processor for the language, although the technique used for specifying the grammar will make it easy to add the specification of the semantics, or several alternative semantic specifications if desired.
This section explains how you may define the syntactic base - the grammar.
===Constructs===
A grammar consists of a number of '''constructs''', each representing the structure of documents, or document components, called the '''specimens''' of the construct. For example, a grammar for Eiffel will contain constructs such as Class, Feature_clause and Instruction. A particular class text is a specimen of construct Class.
Each construct will be defined by a '''production''', which gives the structure of the construct's specimens. For example, a production for Class in an Eiffel grammar should express that a class (a specimen of the Class construct) is made of an optional Indexing part, a Class_header, an optional Formal_generics part and so on. The production for Indexing will indicate that any specimen of this construct - any Indexing part - consists of the keyword '''indexing''' followed by zero or more specimens of Index_clause.
Although some notations for syntax descriptions such as BNF allow more than one production per construct, the EiffelParse library relies on the convention that every construct is defined by '''at most one''' production. Depending on whether there is indeed such a production, the construct is either '''non-terminal''' or '''terminal''':
* A non-terminal construct (so called because it is defined in terms of others) is specified by a production, which may be of one of three types: aggregate, choice and repetition. The construct will accordingly be called an aggregate, choice or repetition construct.
* A terminal construct has no defining production. This means that it must be defined outside of the syntactical grammar. Terminals indeed come from the '''lexical grammar'''. Every terminal construct corresponds to a token type (regular expression or keyword) of the lexical grammar, for which the parsing duty will be delegated to lexical mechanisms, assumed in the rest of this chapter to be provided by the Lex library although others may be substituted if appropriate.
All specimens of terminal constructs are instances of class [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] . A special case is that of keyword constructs, which have a single specimen corresponding to a keyword of the language. For example, <code>if</code> is a keyword of Eiffel. Keywords are described by class [[ref:libraries/parse/reference/keyword_chart|KEYWORD]] , an heir of [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] .
The rest of this section concentrates on the parsing-specific part: non-terminal constructs and productions. Terminals will be studied in the discussion of how to interface parsing with lexical analysis.
===Varieties of non-terminal constructs and productions===
An aggregate production defines a construct whose specimens are obtained by concatenating ("aggregating") specimens of a list of specified constructs, some of which may be optional. For example, the production for construct Conditional in an Eiffel grammar may read:
<code>
Conditional [=] if Then_part_list [Else_part] end
</code>
This means that a specimen of Conditional (a conditional instruction) is made of the keyword <code>if</code>, followed by a specimen of Then_part_list, followed by zero or one specimen of Else_part (the square brackets represent an optional component), followed by the keyword <code>end</code>.
{{note|This notation for productions uses conventions similar to those of the book ''Eiffel: The Language''. Keywords are written in '''boldface italics''' and stand for themselves. Special symbols, such as the semicolon, are written in double quotes, as in ";". The <nowiki>[=]</nowiki> symbol means "is defined as" and is more accurate mathematically than plain <nowiki>=</nowiki>, which, however, is often used for this purpose (see "Introduction to the Theory of Programming Languages", Prentice Hall, 1991, for a more complete discussion of this issue). }}
A choice production defines a construct whose specimens are specimens of one among a number of specified constructs. For example, the production for construct Type in an Eiffel grammar may read:
<code>
Type [=] Class_type | Class_type_expanded | Formal_generic_name | Anchored | Bit_type
</code>
This means that a specimen of Type is either a specimen of Class_type, or a specimen of Class_type_expanded etc.
Finally, a repetition production defines a construct whose specimens are sequences of zero or more specimens of a given construct (called the '''base''' of the repetition construct), separated from each other by a '''separator'''. For example, the production for construct Compound in an Eiffel grammar may read
<code>
Compound [=] {Instruction ";" ...}
</code>
This means that a specimen of Compound is made of zero or more specimens of Instruction, each separated from the next (if any) by a semicolon.
These three mechanisms - aggregate, choice and repetition - suffice to describe the structure of a wide array of practical languages. Properties which cannot be handled by them should be dealt with through '''semantic actions''', as explained below.
===An example grammar===
The example directory included in the delivery implements a processor for a grammar describing a simple language for expressing polynomials. A typical "document" in this language is the line
<code>
x; y: x * (y + 8 - (2 * x))
</code>
The beginning of the line, separated from the rest by a colon, is the list of variables used in the polynomial, separated by semicolons. The rest of the line is the expression defining the polynomial.
Using the conventions defined above, the grammar may be written as:
<code>
Line [=] Variables ":" Sum
Variables [=] {Identifier ";" ...}
Sum [=] {Diff "+" ...}
Diff [=] {Product "-" ...}
Product [=] {Term " * " ...}
Term [=] Simple_var | Int_constant | Nested
Nested [=] "(" Sum ")"
</code>
This grammar assumes a terminal Identifier, which must be defined as a token type in the lexical grammar. The other terminals are keywords, shown as strings appearing in double quotes, for example "+".
==PARSING CONCEPTS==
The EiffelParse library supports a parsing mechanism based on the concepts of object-oriented software construction.
===Class CONSTRUCT===
The deferred class [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] describes the general notion of construct; instances of this class and its descendants are specimens of the constructs of a grammar.
Deferred though it may be, [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] defines some useful general patterns; for example, its procedure <eiffel>process</eiffel> appears as: <br/>
<code>
parse
if parsed then
semantics
end
</code>
<br/>
where procedures <eiffel>parse</eiffel> and <eiffel>semantics</eiffel> are expressed in terms of some more specific procedures, which are deferred. This defines a general scheme while leaving the details to descendants of the class.
Such descendants, given in the library, are classes [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] , [[ref:libraries/parse/reference/choice_chart|CHOICE]] , [[ref:libraries/parse/reference/repetition_chart|REPETITION]] and [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] . They describe the corresponding types of construct, with features providing the operations for parsing their specimens and applying the associated semantic actions.
===Building a processor===
To build a processor for a given grammar, you write a class, called a '''construct class''', for every construct of the grammar, terminal or non-terminal. The class should inherit from [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] , [[ref:libraries/parse/reference/choice_chart|CHOICE]] , [[ref:libraries/parse/reference/repetition_chart|REPETITION]] or [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] depending on the nature of the construct. It describes the production for the construct and any associated semantic actions.
To complete the processor, you must choose a "top construct" for that particular processor, and write a root class. In accordance with the object-oriented method, which implies that "roots" and "tops" should be chosen last, these steps are explained at the end of this chapter.
The next sections explain how to write construct classes, how to handle semantics, and how to interface parsing with the lexical analysis process. All these tasks rely on a fundamental data abstraction, the notion of '''abstract syntax tree'''.
===Abstract syntax trees===
The effect of processing a document with a processor built from a combination of construct classes is to build an abstract syntax tree for that document, and to apply any requested semantic actions to that tree.
The syntax tree is said to be abstract because it only includes important structural information and does not retain the concrete information such as keywords and separators. Such concrete information, sometimes called "syntactic sugar", serves only external purposes but is of no use for semantic processing.
The combination of Eiffel techniques and libraries yields a very simple approach to building and processing abstract syntax trees. Class [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] is a descendant of the Data Structure Library class [[ref:libraries/base/reference/two_way_tree_chart|TWO_WAY_TREE]] , describing a versatile implementation of trees; so, as a consequence, are [[ref:libraries/parse/reference/construct_chart|CONSTRUCT's]] own descendants. The effect of parsing any specimen of a construct is therefore to create an instance of the corresponding construct class. This instance is (among other things) a tree node, and is automatically inserted at its right place in the abstract syntax tree.
As noted in the discussion of trees, class [[ref:libraries/base/reference/two_way_tree_chart|TWO_WAY_TREE]] makes no formal distinction between the notions of tree and tree node. So you may identify the abstract syntax tree with the object (instance of [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] ) representing the topmost construct specimen in the structure of the document being analyzed.
===The production function===
A construct class describes the syntax of a given construct through a function called <eiffel>production</eiffel>, which is a direct representation of the corresponding production. This function is declared in [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]] as
<code>
production: LINKED_LIST [CONSTRUCT]
-- Right-hand side of the production for the construct
deferred
end
</code>
Function production remains deferred in classes [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] , [[ref:libraries/parse/reference/choice_chart|CHOICE]] and [[ref:libraries/parse/reference/repetition_chart|REPETITION]] . Every effective construct class that you write must provide an effecting of that function. It is important for the efficiency of the parsing process that every effective version of production be a <eiffel>once</eiffel> function. Several examples of such effectings are given below.
Classes [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] , [[ref:libraries/parse/reference/choice_chart|CHOICE]] , [[ref:libraries/parse/reference/repetition_chart|REPETITION]] and [[ref:libraries/parse/reference/terminal_chart|TERMINAL]] also have a deferred function <eiffel>construct_name</eiffel> of type STRING, useful for tracing and debugging. This function should be effected in every construct class to return the string name of the construct, such as "INSTRUCTION" or "CLASS" for construct classes in a grammar of Eiffel. For efficiency reasons, the <eiffel>construct_name</eiffel> function should also be a <eiffel>once</eiffel> function. The form of such a function will always be the same, as illustrated by the following example which may appear in the construct class <eiffel>INSTRUCTION</eiffel> in a processor for Eiffel:
<code>
construct_name: STRING
-- Symbolic name of the construct
once
Result := "INSTRUCTION"
end
</code>
The examples of the next few sections, which explain how to write construct classes, are borrowed from the small "Polynomial" language mentioned above, which may be found in the examples directory in the ISE Eiffel delivery.
==PREPARING GRAMMARS==
Having studied the EiffelParse library principles, let us see how to write grammar productions for various kinds of construct. The main task is to write the production function for each construct class.
The production function for a descendant of [[ref:libraries/parse/reference/aggregate_chart|AGGREGATE]] will describe how to build a specimen of the corresponding construct from a sequence of specimens of each of the constituent constructs. Writing this function from the corresponding production is straightforward.
As an example, consider the production function of class LINE for the Polynomial example language. The corresponding production is <br/>
<code>
Line [=] Variables ":" Sum
</code>
<br/>
where Variables and Sum are other constructs, and the colon ":" is a terminal. This means that every specimen of Line consists of a specimen of Variables, followed by a colon, followed by a specimen of Sum.
Here is the corresponding production function as it appears in class LINE:
<code>
production: LINKED_LIST [CONSTRUCT]
local
var: VARIABLES
sum: SUM
once
create Result.make
Result.forth
create var.make
put (var)
keyword (":")
create sum.make
put (sum)
end
</code>
As shown by this example, the production function for an aggregate construct class should declare a local entity (here <code>var</code> and <code>sum</code>) for each non-keyword component of the right-hand side, the type of each entity being the corresponding construct class (here VARIABLES and SUM).
The body of the function should begin with
<code>
create Result.make
Result.forth
</code>
to create the object containing the result. Then for each non-keyword component, represented by the local entity <code>component</code> (this applies to <code>var</code> and <code>sum</code> in the example), there should be a sequence of two instructions, of the form
<code>
create component.make
put (component)
</code>
For any keyword of associated string ''symbol'', such as the colon ":" in the example, there should be a call to
<code>
keyword (symbol)
</code>
The order of the various calls to <eiffel>put</eiffel> (for non-keywords) and <eiffel>keyword</eiffel> (for keywords) must be the order of the components in the production. Also, every <eiffel>create component.make</eiffel> instruction must occur before the corresponding call to <eiffel>put (component)</eiffel>.
All components in the above example are required. In the general case an aggregate production may have optional components. To signal that a component of the right-hand side, represented by the local entity <code>component</code>, is optional, include a call of the form
<code>
component.set_optional
</code>
This call may appear anywhere after the corresponding <eiffel>create component.make</eiffel> instruction. The recommended place is just after the call to <eiffel>put</eiffel>, as in
<code>
create component.make
put (component)
component.set_optional
</code>
===Choices===
The <eiffel>production</eiffel> function for a descendant of <eiffel>CHOICE</eiffel> will describe how to build a specimen of the corresponding construct as a specimen of one of the alternative constructs.
As an example, consider the <eiffel>production</eiffel> function of class <eiffel>TERM</eiffel> for the Polynomial example language. The corresponding production is
<code>
Term [=] Simple_var | Poly_integer | Nested
</code>
<br/>
where Simple_var, Poly_integer and Nested are other constructs. This means that every specimen of Term consists of one specimen of any one of these three constructs. Here is the corresponding <eiffel>production</eiffel> function as it appears in class <eiffel>TERM</eiffel>:
<code>
production: LINKED_LIST [CONSTRUCT]
local
id: SIMPLE_VAR
val: POLY_INTEGER
nest: NESTED
once
create Result.make
Result.forth
create id.make
put (id)
create val.make
put (val)
create nest.make
put (nest)
end
</code>
As shown by this example, the <eiffel>production</eiffel> function for a choice construct class must declare a local entity - here <code>id</code>, <code>val</code> and <code>nest</code> - for each alternative component of the right-hand side. The type of each entity is the corresponding construct class - here <eiffel>SIMPLE_VAR</eiffel>, <eiffel>POLY_INTEGER</eiffel> and <eiffel>NESTED</eiffel>.
The body of the function must begin with
<code>
create Result.make
Result.forth
</code>
Then for each alternative component represented by a local entity component (in the example this applies to <code>id</code>, <code>val</code> and <code>nest</code>) there should be two instructions of the form
<code>
create component.make
put (component)
</code>
{{caution|The order of the various calls to <eiffel>put</eiffel> is irrelevant in principle. When a document is parsed, however, the choices will be tried in the order given; so if you know that certain choices occur more frequently than others it is preferable to list them first to speed up the parsing process. }}
===Repetitions===
The <eiffel>production</eiffel> function for a descendant of [[ref:libraries/parse/reference/repetition_chart|REPETITION]] will describe how to build a specimen of the corresponding construct as a sequence of zero or more (or, depending on the grammar, one or more) specimens of the base construct. The class must also effect a feature <eiffel>separator</eiffel> of type <eiffel>STRING</eiffel>, usually as a constant attribute. (This feature is introduced as deferred in class [[ref:libraries/parse/reference/repetition_chart|REPETITION]] .)
As an example, consider the construct Variables in the Polynomial example language. The right-hand side of the corresponding production is <br/>
<code>
Variables [=] {Identifier ";" ...}
</code>
<br/>
where Identifier is another construct, and the semicolon ";" is a terminal. This means that every specimen of Variables consists of zero or more specimens of Identifier, separated from each other (if more than one) by semicolons.
Here are the corresponding <eiffel>production</eiffel> function and <eiffel>separator</eiffel> attribute as they appear in class <eiffel>VARIABLES</eiffel>:
<code>
production: LINKED_LIST [CONSTRUCT]
local
base: VAR
once
create Result.make
Result.forth
create base.make
put (base)
end
separator: STRING = ";"
</code>
As shown by this example, function <eiffel>production</eiffel> is built along the same lines as for aggregates and choices, except that here only one component, <code>base</code>, is required; its type must be the class corresponding to the construct serving as the base of the repetition, VAR in the example.
==INTERFACE TO LEXICAL ANALYSIS==
One more type of construct class remains to be discussed: terminal construct classes. Since terminal constructs serve to elevate lexical tokens (regular expressions and keywords) to the dignity of syntactical construct, we must first take a look at how the EiffelParse library classes collaborate with their counterparts in the EiffelLex library.
===The notion of lexical interface class===
To parse a document, you need to get tokens from a lexical analyzer. This is achieved by making some construct classes, in particular those describing terminals, descendants of one of the lexical classes.
The best technique is usually to write a class covering the lexical needs of the language at hand, from which all construct classes that have some lexical business will inherit. Such a class is called a lexical interface class.
Lexical interface classes usually follow a common pattern. To take advantage of this uniformity, the EiffelParse library includes a deferred class [[ref:libraries/parse/reference/l_interface_chart|L_INTERFACE]] which describes that pattern. Specific lexical interface classes may be written as descendants of <eiffel>L_INTERFACE</eiffel>.
<eiffel>L_INTERFACE</eiffel> is a simple deferred class, with a deferred procedure <eiffel>obtain_analyzer</eiffel>. It is an heir of <eiffel>METALEX</eiffel>.
===Obtaining a lexical analyzer===
An effective descendant of [[ref:libraries/parse/reference/l_interface_chart|L_INTERFACE]] must define procedure <eiffel>obtain_analyzer</eiffel> so that it records into the lexical analyzer the regular expressions and keywords of the language at hand. In writing <eiffel>obtain_analyzer</eiffel> you may use any one of three different techniques, each of which may be the most convenient depending on the precise context, to obtain the required lexical analyzer:
* You may build the lexical analyzer by defining its regular expressions one by one, using the procedures described in the presentation of METALEX, in particular <eiffel>put_expression</eiffel> and <eiffel>put_keyword</eiffel>.
* You may use procedure <eiffel>retrieve_analyzer</eiffel> from METALEX to retrieve an analyzer which a previous session saved into a file.
* Finally, you may write a lexical grammar file (or reuse an existing one) and process it on the spot by using procedure <eiffel>read_grammar</eiffel> from METALEX.
===A lexical interface class===
An example of a lexical interface class is POLY_LEX for the Polynomial example language. Here is the complete text of that class:
<code>
indexing
description: "Lexical interface class for the Polynomial language"
class
POLY_LEX
inherit
L_INTERFACE
CONSTANTS
undefine
consistent,
copy,
is_equal,
setup
end
feature {NONE}
obtain_analyzer
-- Create lexical analyzer for the Polynomial language
do
ignore_case
keywords_ignore_case
build_expressions
build_keywords
set_separator_type (blanks)
end
build_expressions
-- Define regular expressions
-- for the Polynomial language
do
put_expression (special_expression, special, "Special")
put_expression ("*('a'..'z')", simple_identifier, "Simple_identifier")
put_expression ("+('0'..'9')", integer_constant, "Integer_constant")
put_expression ("+('\t'|'\n'|' ')", blanks, "Blanks")
end
special_expression: STRING
-- Regular expression describing Special
once
create Result.make (80)
Result.append ("('\050'..'\057')|")
Result.append ("('\072'..'\076')|")
Result.append ("'['|']'|'|'|'{'|'}'|%"->%"|%":=%"")
end
build_keywords
-- Define keywords (special symbols)
-- for the Polynomial language
do
put_keyword ("+", special)
put_keyword ("-", special)
put_keyword (";", special)
put_keyword (":", special)
put_keyword ("(", special)
put_keyword (")", special)
put_keyword ("*", special)
end
end
</code>
This class illustrates the straightforward scheme for writing lexical interface classes. It introduces constants such as <code>special</code> to represent the regular expressions supported, and effects procedure <eiffel>obtain_analyzer</eiffel>. The role of this procedure is to define lexical conventions (here through calls to <eiffel>ignore_case</eiffel> and <eiffel>keywords_ignore_case</eiffel>), to record the regular expressions (through calls to <eiffel>put_expression</eiffel>, packaged in a procedure <eiffel>build_expressions</eiffel> for clarity), and to record the keywords (through calls to <eiffel>put_keyword</eiffel>, packaged in <eiffel>build_keywords</eiffel>).
All the classes of a document processor that need to interact with the lexical analysis should inherit from a lexical interface class such as <eiffel>POLY_LEX</eiffel>. This is true in particular of the root class of a processor, as discussed below.
===More on terminal constructs===
Terminal construct classes are examples of classes that need to interact with the lexical analysis, and should thus inherit from the lexical interface class.
Class <eiffel>TERMINAL</eiffel> includes a deferred function <eiffel>token_type</eiffel> of type <eiffel>INTEGER</eiffel>. Every effective descendant of <eiffel>TERMINAL</eiffel> should effect this feature as a constant attribute, whose value is the code for the associated regular expression, obtained from the lexical interface class. As every other construct class, such a descendant should also effect <eiffel>construct_name</eiffel> as a <eiffel>once</eiffel> function. For example, in the Polynomial language, class <eiffel>INT_CONSTANT</eiffel> has the following text:
<code>
class
INT_CONSTANT
inherit
TERMINAL
CONSTANTS
feature
token_type: INTEGER
do
Result := integer_constant
end
feature {NONE}
construct_name: STRING
once
Result := "INT_CONSTANT"
end
end
</code>
==SPECIFYING THE SEMANTICS==
As mentioned at the beginning of this chapter, parsing is usually done not for itself but as a way to perform some semantic processing. The EiffelParse Library classes define the general framework for grafting such semantics onto a syntactical stem.
===Semantic procedures===
The principal procedures for defining semantic actions are <eiffel>pre_action</eiffel> and <eiffel>post_action</eiffel>. These are features of class CONSTRUCT. Procedure <eiffel>pre_action</eiffel> describes the actions to be performed before a construct has been recognized; <eiffel>post_action</eiffel>, the actions to be performed after a construct has been recognized.
As defined in <eiffel>CONSTRUCT</eiffel>, both <eiffel>pre_action</eiffel> and <eiffel>post_action</eiffel> do nothing by default. Any construct class which is a descendant of <eiffel>CONSTRUCT</eiffel> may redefine one or both so that they will perform the semantic actions that the document processor must apply to specimens of the corresponding construct. These procedures are called automatically during processing, before and after the corresponding structures have been parsed.
For <eiffel>TERMINAL</eiffel>, only one semantic action makes sense. To avoid any confusion, <eiffel>post_action</eiffel> is renamed <eiffel>action</eiffel> in that class and <eiffel>pre_action</eiffel> is renamed <eiffel>unused_pre_action</eiffel> to indicate that it is irrelevant.
Often, the semantic procedures need to compute various elements of information. These may be recorded using appropriate attributes of the corresponding construct classes.
{{note|Readers familiar with the theory of parsing and compiling will see that this scheme, using the attributes of Eiffel classes, provides a direct implementation of the "attribute grammar" mechanism. }}
===Polynomial semantics===
As an example let us examine the semantics of the Product construct for the polynomial language. It is a repetition construct, with Term as the base construct; in other words a specimen of Product is a sequence of one or more terms, representing the product term<sub>1</sub> * term<sub>2</sub> ... * term<sub>n</sub>. Here is the <eiffel>post_action</eiffel> procedure in the corresponding class <eiffel>PRODUCT</eiffel>:
<code>
post_action
local
int_value: INTEGER
do
if not no_components then
from
child_start
if not child_after then
int_value := 1
end
until
child_after
loop
child.post_action
int_value := int_value * info.child_value
child_forth
end
info.set_child_value (int_value)
end
end
</code>
Here each relevant construct class has an attribute <eiffel>info</eiffel> used to record the semantic information associated with polynomials and their components, such as <eiffel>child_value</eiffel>, an <eiffel>INTEGER</eiffel>. The <eiffel>post_action</eiffel> takes care of computing the product of all <eiffel>child_value</eiffel>s for the children. First, of course, <eiffel>post_action</eiffel> must recursively be applied to each child, to compute its own <eiffel>child_value</eiffel>.
{{note|Recall that an instance of <eiffel>CONSTRUCT</eiffel> is also a node of the abstract syntax tree, so that all the <eiffel>TWO_WAY_TREE</eiffel> features such as <eiffel>child</eiffel>, <eiffel>child_start</eiffel>, <eiffel>child_forth</eiffel>, <eiffel>child_after</eiffel> and many others are automatically available to access the syntactical structure. }}
===Keeping syntax and semantics separate===
For simple examples such as the Polynomial language, it is convenient to use a single class to describe both the syntax of a construct (through the production function and associated features) and its semantics (action routines and associated features).
For more ambitious languages and processors, however, it is often preferable to keep the two aspects separate. Such separation of syntax and semantics, and in particular the sharing of the same syntax for different processors with different semantic actions, is hard or impossible to obtain with traditional document processing tools such as Yacc on Unix. Here is how to achieve it with the EiffelParse library:
* First write purely '''syntactic classes''', that is to say construct classes which only effect the syntactical part (in particular function production). As a consequence, these classes usually remain deferred. The recommended convention for such syntactic classes is to use names beginning with <eiffel>S_</eiffel>, for example <eiffel>S_INSTRUCTION</eiffel> or <eiffel>S_LOOP</eiffel>.
* Then for each construct for which a processor defines a certain semantics, define another class, called a '''semantic class''', which inherits from the corresponding syntactic class. The recommended convention for semantic classes is to give them names which directly reflect the corresponding construct name, as in <eiffel>INSTRUCTION</eiffel> or <eiffel>LOOP</eiffel>.
To build a semantic class in step 2 it is often convenient to use multiple inheritance from a syntactic class and a "semantics-only" class. For example, in a processor for Eiffel, class <eiffel>INSTRUCTION</eiffel> may inherit from both <eiffel>S_INSTRUCTION</eiffel> and from a semantics-only class <eiffel>INSTRUCTION_PROPERTIES</eiffel> which introduces the required semantic features.
One of the advantages of this scheme is that it makes it easy to associate two or more types of processing with a single construct, by keeping the same syntactic class (such as <eiffel>S_INSTRUCTION</eiffel>) but choosing a different pure-semantics class each time.
As noted earlier in this chapter, this is particularly useful in an environment where different processors need to perform different actions on specimens of the same construct. In an Eiffel environment, for example, processors that manipulate classes and other Eiffel construct specimens may include a compiler, an interpreter, a flattener (producing the flat form), a class abstracter (producing the short or flat-short form), and various browsing tools such as those provided by Eiffel Software.
For obvious reasons of convenience and ease of maintenance, it is desirable to let these processors share the same syntactic descriptions. The method just described, relying on multiple inheritance, achieves this goal.
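The separation described above can be sketched as follows. These class texts are illustrative skeletons rather than classes from the library: routine bodies are omitted, and <eiffel>INSTRUCTION_PROPERTIES</eiffel> is the hypothetical semantics-only class mentioned earlier.
<code>
deferred class S_INSTRUCTION
	-- Purely syntactic class: effects `production' and
	-- `construct_name' (bodies omitted here), leaving
	-- the semantic procedures deferred.
inherit
	CHOICE
end

class INSTRUCTION
	-- Semantic class for one particular processor, grafting
	-- processor-specific actions onto the shared syntax.
inherit
	S_INSTRUCTION
		-- Shared syntax.
	INSTRUCTION_PROPERTIES
		-- Semantics-only class introducing the required
		-- semantic features, such as a `post_action'.
end
</code>
A second processor can pair the same <eiffel>S_INSTRUCTION</eiffel> with a different pure-semantics class, reusing the syntactic description unchanged.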
==HOW PARSING WORKS==
Classes AGGREGATE, CHOICE, TERMINAL and REPETITION are written in such a way that you do not need to take care of the parsing process. They make it possible to parse any language built according to the rules given - with one limitation, left recursion, discussed below. You can then concentrate on writing the interesting part - semantic processing.
To derive the maximum benefit from the EiffelParse library, however, it is useful to gain a little more insight into the way parsing works. Let us raise the veil just enough to see any remaining property that is relevant to the building of parsers and document processors.
===The parsing technique===
The EiffelParse library relies on a general approach known as '''recursive descent''', meaning that various choices will be tried in sequence and recursively to recognize a certain specimen.
If a choice is attempted and fails (because it encounters input that does not conform to what is expected), the algorithm will try remaining choices, after having moved the input cursor back to where it was before the choice that failed. This process is called '''backtracking'''. It is handled by the parsing algorithms in an entirely automatic fashion, without programmer intervention.
===Left recursion===
Recursive descent implies the danger of infinite looping when parsing is attempted for left-recursive productions of the form <br/>
<code>
A [=] A ...
</code>
<br/>
or, more generally, cases in which the left recursion is indirect, as in <br/>
<code>
A [=] B ...
B [=] C ...
...
L [=] A ...
</code>
Direct left recursion is easy to avoid, but indirect left recursion can sneak in through less obvious chains of productions.
To determine whether the production for a construct is directly or indirectly left-recursive, use the query <eiffel>left_recursion</eiffel> from class <eiffel>CONSTRUCT</eiffel>.
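As an illustration, a direct left-recursive production can usually be recast as a repetition. A hypothetical production such as
<code>
Sum [=] Sum "+" Diff
</code>
would make a recursive-descent parser loop forever, whereas the repetition form actually used in the Polynomial example grammar,
<code>
Sum [=] {Diff "+" ...}
</code>
describes essentially the same sequences of Diff specimens separated by "+" and is parsed without trouble.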
===Backtracking and the commit procedure===
Another potential problem may arise from too much backtracking. In contrast with left recursion, this is a performance issue, not a threat to the correctness of the parsing algorithm. Automatic backtracking is in fact essential to the generality and flexibility of the recursive descent parsing algorithm; but too much of it may degrade the efficiency of the parsing mechanism.
Two techniques are available to minimize backtracking. One, mentioned above, is to organize the production functions for choice construct classes so that they list the most frequent cases first. The other is to use the commit procedure in the production functions for aggregate constructs.
A call to commit in an aggregate A is a hint to the parser, which means:
:''If you get to this point in trying to recognize a specimen of A as one among several possible choices for a choice construct C, and you later fail to obtain an A, then forget about other choices for C: you won't be able to find a C here. You may go back to the next higher-level choice before C - or admit failure if there is no such choice left.''
Such a hint is useful when you want to let the parser benefit from some higher-level knowledge about the grammar, which is not directly deducible from the way the productions have been written.
Here is an example. The production function for <eiffel>NESTED</eiffel> in the Polynomial language, which attempts to parse specimens of the form <br/>
<code>(s)</code>
where ''s'' is a specimen of <eiffel>SUM</eiffel>, is written as
<code>
production: LINKED_LIST [CONSTRUCT]
local
expression: SUM
once
create Result.make
Result.forth
keyword ("(")
commit
create expression.make
put (expression)
keyword (")")
end
</code>
The commit after the recognition of the keyword "(" is there to use the following piece of higher-level knowledge:
:''No choice production of the grammar that has NESTED as one of its alternatives has another alternative construct whose specimens could begin with an opening parenthesis "(".''
Because of this property, if the parser goes so far as to recognize an opening parenthesis as part of parsing any construct <eiffel>C</eiffel> for which NESTED is an alternative, but further tokens do not match the structure of <eiffel>NESTED</eiffel> specimens, then we will have failed to recognize not only a <eiffel>NESTED</eiffel> but also a <eiffel>C</eiffel>.
{{note|Some readers will have recognized commit as being close to the Prolog "cut" mechanism. }}
In this example, <eiffel>NESTED</eiffel> is used in only one right-hand side production: the choice production for TERM, for which the other alternatives are <eiffel>SIMPLE_VAR</eiffel> and <eiffel>POLY_INTEGER</eiffel>, none of whose specimens can include an opening parenthesis.
The use of commit assumes global knowledge about the grammar and its future extensions, which is somewhat at odds with the evolutionary approach suggested by the Eiffel method. Applied improperly, this mechanism could lead to the rejection of valid texts as invalid. Used with care, however, it helps in obtaining high-performance parsing without impairing too much the simplicity of preparing parsers and other document processors.
==BUILDING A DOCUMENT PROCESSOR==
We are ready now to put together the various elements required to build a document processor based on the EiffelParse library.
===The overall picture===
The documents to be processed will be specimens of a certain construct. This construct is called the '''top construct''' for that particular processing.
{{caution|Be sure to note that with the EiffelParse library there is no room for a concept of top construct of a '''grammar''': the top construct is only defined with respect to a particular processor for that grammar. <br/>
Attempting to define the top of a grammar would be contrary to the object-oriented approach, which de-emphasizes any notion of top component of a system. <br/>
Different processors for the same grammar may use different top constructs. }}
A document processor will be a particular system made of construct classes, complemented by semantic classes, and usually by other auxiliary classes. One of the construct classes corresponds to the top construct and is called the '''top construct class'''.
{{note|This notion of top construct class has a natural connection to the notion of root class of a system, as needed to get executable software. The top construct class could indeed be used as root of the processor system. In line with the previous discussion, however, it appears preferable to keep the top construct class (which only depends on the syntax and remains independent of any particular processor) separate from the system's root class. With this approach the root class will often be a descendant of the top construct class. <br/>
This policy was adopted for the Polynomial language example as it appears in the delivery: the processor defined for this example uses <eiffel>LINE</eiffel> as the top construct class; the root of the processor system is a class <eiffel>PROCESS</eiffel>, which inherits from <eiffel>LINE</eiffel>. }}
===Steps in the execution of a document processor===
As any root class of a system, the root of a document processor must have a creation procedure which starts the execution. Here the task of this class is the following:
# Define an object representing the input document to be processed; this will be an instance of class <eiffel>INPUT</eiffel>.
# Obtain a lexical analyzer applicable to the language, and connect it with the document.
# Select an input file, containing the actual document to be processed.
# Process the document: in other words, parse it and, if parsing is successful, apply the semantics.
To execute these steps in a simple and convenient fashion, it is useful to declare the root class as a descendant of the lexical interface class. The root class, being an heir to the top construct class, will also be a descendant of <eiffel>CONSTRUCT</eiffel>.
===Connecting with lexical analysis===
To achieve the effect of steps [[#step_e1|1]] and [[#step_e2|2]], a simple call instruction suffices: just call the procedure build, inherited from <eiffel>L_INTERFACE</eiffel>, with document as its argument; document is a feature of type <eiffel>INPUT</eiffel>, obtained from <eiffel>METALEX</eiffel> (the lexical analyzer generator class) through <eiffel>L_INTERFACE</eiffel>. The call, then, is just:
<code>
build (document)
</code>
Although you may use this line as a recipe with no need for further justification, it is interesting to see what build does. Feature document describes the input document to be processed; it is introduced as a Once function in class <eiffel>CONSTRUCT</eiffel> to ensure that all instances of <eiffel>CONSTRUCT</eiffel> share a single document - in other words, that all processing actions apply to the same document. The text of build is:
<code>
build (doc: INPUT)
-- Create lexical analyzer and set doc
-- to be the input document.
require
		document_exists: doc /= Void
do
metalex_make
obtain_analyzer
make_analyzer
doc.set_lexical (analyzer)
end
</code>
The call to obtain_analyzer defines the regular grammar for the language at hand. Recall that obtain_analyzer is deferred in <eiffel>L_INTERFACE</eiffel>; its definition for the <eiffel>POLY_LEX</eiffel> example was given above. The call to make_analyzer freezes the regular grammar and produces a usable lexical analyzer, available through the attribute analyzer obtained from <eiffel>METALEX</eiffel>. Finally, the call to set_lexical, a procedure of class <eiffel>INPUT</eiffel>, ensures that all lexical analysis operations will use analyzer as the lexical analyzer.
===Starting the actual processing===
The call <code>build (document)</code> takes care of steps [[#step_e1|1]] and [[#step_e2|2]] of the root's creation procedure. Step [[#step_e3|3]] selects the file containing the input document; this is achieved through the call <br/>
<code>
document.set_input_file (some_file_name)
</code>
<br/>
where set_input_file, from class <eiffel>INPUT</eiffel>, has a self-explanatory effect.
Finally, step [[#step_e4|4]] (processing the document) is simply a call to procedure process, obtained from [[ref:libraries/parse/reference/construct_chart|CONSTRUCT]]. Recall that this procedure simply executes <br/>
<code>
parse
if parsed then
semantics
end
</code>
<br/>
===The structure of a full example===
The polynomial example is a complete, if small, document processor, which you may use as a guide for your own processors. The root class of that example is <eiffel>PROCESS</eiffel>. Its creation procedure, make, follows the above scheme precisely; here is its general form:
<code>
root_line: LINE
make
local
text_name: STRING
do
create root_line.make
build (root_line.document)
... Instructions prompting the user for the name of the
file to be parsed, and assigning it to text_name ...
root_line.document.set_input_file (text_name)
root_line.process
end
</code>
Although it covers a small language, this example may serve as a blueprint for most applications of the EiffelParse library.
==FUTURE WORK==
It was mentioned at the beginning of this chapter that further work is desirable to make the EiffelParse library reach its full potential. Here is a glimpse of future improvements.
===Expressions===
Many languages include an expression construct having the properties of traditional arithmetic expressions:
* An expression is a succession of basic operands and operators.
* The basic operands are lexical elements, such as identifiers and constants.
* Operators are used in prefix mode (as in ''- a'') or infix mode (as in ''b - a'').
* Each operator has a precedence level; precedence levels determine the abstract syntactic structure of expressions and consequently their semantics. For example, the abstract structure of ''a + b * c'' shows this expression to be the application of the operator ''+'' to ''a'' and to the application of the operator ''*'' to ''b'' and ''c''. That this is the correct interpretation of the expression follows from the property that ''*'' has a higher precedence ("binds more tightly") than ''+''.
* Parentheses pairs, such as ( ) or [ ], can be used to enforce a structure different from what the precedence rules would imply, as in ''(a + b) * c''.
* Some infix operators may be applied to more than two arguments; in this case it must be clear whether they are right-associative (in other words, ''a ^ b ^ c'' means ''a ^ (b ^ c)'', the conventional interpretation if ^ denotes the power operator) or left-associative.
It is of course possible to apply the EiffelParse library in its current state to support expressions, as illustrated by this extract from the Polynomial grammar given in full above:
<code>
Variables [=] {Identifier ";" ...}
Sum [=] {Diff "+" ...}
Diff [=] {Product "-" ...}
Product [=] {Term "*" ...}
</code>
The problem then is not expressiveness but efficiency. For such expressions the recursive descent technique, however well adapted to the higher-level structures of a language, takes too much time and generates too many tree nodes. Efficient bottom-up parsing techniques are available for this case.
The solution is straightforward: write a new heir <eiffel>EXPRESSION</eiffel> to class <eiffel>CONSTRUCT</eiffel>. The preceding discussion of expressions and their properties suggests what kinds of features this class will offer: define a certain terminal as operator, define a terminal as operand type, set the precedence of an operator, set an operator as left-associative or right-associative and so on. Writing this class based on this discussion is indeed a relatively straightforward task, which can serve as a programming exercise.
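The precedence and associativity machinery that such an <eiffel>EXPRESSION</eiffel> class would encapsulate can be sketched in a few lines. The following is an illustrative sketch only, in Python rather than Eiffel, and is not part of EiffelParse; all names (set_operator, parse_expression) are invented for the illustration, and only infix operators are handled, for brevity. It registers operators with a precedence level and an associativity, then parses a token list by precedence climbing:

```python
# Hypothetical sketch of the services an EXPRESSION-like class could
# offer: register each infix operator with a precedence level and an
# associativity, then parse bottom-up by precedence climbing.

OPERATORS = {}  # operator name -> (precedence, associativity)

def set_operator(name, precedence, assoc="left"):
    """Register an infix operator ('set the precedence of an operator')."""
    OPERATORS[name] = (precedence, assoc)

def parse_expression(tokens, pos=0, min_prec=0):
    """Parse tokens[pos:]; return (abstract syntax tree, next position)."""
    if tokens[pos] == "(":                       # parenthesized subexpression
        left, pos = parse_expression(tokens, pos + 1, 0)
        pos += 1                                 # skip the closing ")"
    else:
        left, pos = tokens[pos], pos + 1         # basic operand
    while pos < len(tokens) and tokens[pos] in OPERATORS:
        op = tokens[pos]
        prec, assoc = OPERATORS[op]
        if prec < min_prec:                      # binds less tightly: stop
            break
        # left-associative: the right operand may only use tighter operators;
        # right-associative: it may reuse the same precedence level.
        next_min = prec + 1 if assoc == "left" else prec
        right, pos = parse_expression(tokens, pos + 1, next_min)
        left = (op, left, right)
    return left, pos

set_operator("+", 1)
set_operator("-", 1)
set_operator("*", 2)
set_operator("^", 3, assoc="right")

tree, _ = parse_expression(["a", "+", "b", "*", "c"])
# '*' binds more tightly than '+': tree is ('+', 'a', ('*', 'b', 'c'))
tree2, _ = parse_expression(["a", "^", "b", "^", "c"])
# '^' is right-associative: tree2 is ('^', 'a', ('^', 'b', 'c'))
```

The same skeleton handles the parenthesized case of the text, ''(a + b) * c'', by restarting the climb at precedence 0 inside the parentheses.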
Beyond the addition of an <eiffel>EXPRESSION</eiffel> class, some changes in the data structures used by EiffelParse may also help improve the efficiency of the parsing process.
===Yooc===
To describe the syntax of a language, it is convenient to use a textual format such as the one that has served in this chapter to illustrate the various forms of production. The correspondence between such a format and the construct classes is straightforward; for example, as explained above, the production <br/>
<code>
Line [=] Variables ":" Sum
</code>
<br/>
will yield the class
<code>
class
LINE
inherit
AGGREGATE
feature
production: LINKED_LIST [CONSTRUCT]
local
var: VARIABLES
sum: SUM
once
create Result.make
Result.forth
create var.make
put (var)
keyword (":")
create sum.make
put (sum)
end
...
end
</code>
This transformation of the textual description of the grammar into its equivalent Eiffel form is simple and unambiguous; but it is somewhat annoying to have to perform it manually.
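How mechanical the transformation is can be seen by sketching it in code. The following is an illustrative sketch in Python (it is not Yooc, and the helper name translate_aggregate is invented); it turns an aggregate production in the textual format into the skeleton of the corresponding Eiffel class, following the pattern shown above. Local names are derived mechanically from the construct names, so they differ slightly from the hand-written var of the example:

```python
# Illustrative translator for one aggregate production, e.g.
#     Line [=] Variables ":" Sum
# Quoted items become keyword calls; other items become a local
# of the corresponding construct class, created and put in sequence.

def translate_aggregate(production):
    """Return the text of an Eiffel class skeleton for one production."""
    lhs, rhs = production.split("[=]")
    class_name = lhs.strip().upper()
    locals_, body = [], []
    for item in rhs.split():
        if item.startswith('"'):                  # a keyword terminal
            body.append("keyword (%s)" % item)
        else:                                     # a sub-construct
            name = item.lower()
            locals_.append("%s: %s" % (name, item.upper()))
            body.append("create %s.make" % name)
            body.append("put (%s)" % name)
    lines = ["class", "\t" + class_name, "inherit", "\tAGGREGATE", "feature",
             "\tproduction: LINKED_LIST [CONSTRUCT]", "\t\tlocal"]
    lines += ["\t\t\t" + decl for decl in locals_]
    lines += ["\t\tonce", "\t\t\tcreate Result.make", "\t\t\tResult.forth"]
    lines += ["\t\t\t" + instruction for instruction in body]
    lines += ["\t\tend", "end"]
    return "\n".join(lines)

print(translate_aggregate('Line [=] Variables ":" Sum'))
```

A real translator must of course also handle choice and repetition productions, error reporting and name clashes; the point of the sketch is only that the aggregate case is a line-by-line rewriting.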
A tool complementing the EiffelParse library and known as Yooc ("Yes! an Object-Oriented Compiler", a name meant as an homage to the venerable Yacc) has been planned for future releases of EiffelParse. Yooc, a translator, will take a grammar specification as input and transform it into a set of parsing classes, all descendants of <eiffel>CONSTRUCT</eiffel> and built according to the rules defined above. The input format for syntax specification, similar to the conventions used throughout this chapter, is a variant of LDL (Language Description Language), a component of the ArchiText structural document processing system.
===Further reading===
The following article describes some advanced uses of the EiffelParse library as well as a Yooc-like translator called PG: Per Grape and Kim Walden: Automating the Development of Syntax Tree Generators for an Evolving Language, in Proceedings of TOOLS 8 (Technology of Object-Oriented Languages and Systems), Prentice Hall, 1992, pages 185-195.
@@ -0,0 +1,11 @@
[[Property:title|EiffelParse]]
[[Property:weight|5]]
[[Property:uuid|0984d15a-6ee9-3bd4-71d6-31df2987af3a]]
==EiffelParse Library==
Type: Library <br/>
Platform: Any <br/>
Eiffel classes for building parsers.
@@ -0,0 +1,23 @@
[[Property:title|Eiffel polynomial parser]]
[[Property:weight|0]]
[[Property:uuid|63f0e737-4ad7-c574-3bbc-05e005815785]]
In the directory '''$ISE_EIFFEL/examples/parse''' you will find a system that implements a processor for a grammar describing a simple language for expressing polynomials. A typical document in this language is the line
<code>
x;y: x * (y + 8 - (2 * x))
</code>
The beginning of the line, separated from the rest by a colon, is the list of variables used in the polynomial, separated by semicolons. The rest of the line is the expression defining the polynomial. The language can be described by the following grammar:
<code>
LINE = VARIABLES ":" SUM
VARIABLES = VAR .. ";"
SUM = DIFF .. "+"
DIFF = PRODUCT .. "-"
PRODUCT = TERM .. "*"
TERM = SIMPLE_VAR | INT_CONSTANT | NESTED
NESTED = "(" SUM ")"
</code>
This grammar assumes a terminal '''VAR''', which must be defined as a token type in the lexical grammar. The other terminals are keywords, shown as strings in double quotes, for example "+".
Compiling the example creates the executable '''process(.exe)'''. When executed, the program prompts for the name of a file containing a polynomial description, reads the polynomial from that file, prompts for integer values of the variables, and evaluates the polynomial.
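The recursive-descent structure that this grammar induces can be sketched directly, one function per construct. The following is an illustrative Python transcription (the actual example is written in Eiffel and uses the EiffelParse classes; the function name evaluate is invented); it parses a polynomial line and evaluates it for given variable values:

```python
# One parsing function per construct of the polynomial grammar:
# SUM = DIFF .. "+", DIFF = PRODUCT .. "-", PRODUCT = TERM .. "*",
# TERM = SIMPLE_VAR | INT_CONSTANT | NESTED, NESTED = "(" SUM ")".
import re

def evaluate(line, values):
    """Evaluate a line such as 'x;y: x * (y + 8 - (2 * x))'."""
    header, expr = line.split(":", 1)
    variables = [v.strip() for v in header.split(";")]
    tokens = re.findall(r"\d+|\w+|[()+*-]", expr)
    pos = 0

    def sum_():                       # SUM = DIFF .. "+"
        nonlocal pos
        value = diff()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            value += diff()
        return value

    def diff():                       # DIFF = PRODUCT .. "-"
        nonlocal pos
        value = product()
        while pos < len(tokens) and tokens[pos] == "-":
            pos += 1
            value -= product()
        return value

    def product():                    # PRODUCT = TERM .. "*"
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            value *= term()
        return value

    def term():                       # TERM = SIMPLE_VAR | INT_CONSTANT | NESTED
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":              # NESTED = "(" SUM ")"
            value = sum_()
            pos += 1                  # skip the closing ")"
            return value
        if token.isdigit():
            return int(token)
        assert token in variables, "undeclared variable: " + token
        return values[token]

    return sum_()

# With x = 2 and y = 3: x * (y + 8 - (2 * x)) = 2 * (3 + 8 - 4) = 14
print(evaluate("x;y: x * (y + 8 - (2 * x))", {"x": 2, "y": 3}))
```

Note how the left-to-right loops in sum_, diff and product give the left-associative reading of ''+'', ''-'' and ''*'' that the repetition productions imply.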
@@ -0,0 +1,8 @@
[[Property:title|Parse Sample]]
[[Property:weight|2]]
[[Property:uuid|ad48d4f5-a113-65f7-15fc-8c8fd3f5c284]]
* [[Eiffel polynomial parser|Eiffel polynomial parser]]
@@ -0,0 +1,6 @@
[[Property:title|Text processing]]
[[Property:weight|-6]]
[[Property:uuid|e74b5b47-d87d-2eb0-49ba-981dae52d338]]
== Text processing (lexical analysis and parsing) solutions==