From 4d6df90e6cbd3f0327b30bc51b4b1e7c04177a04 Mon Sep 17 00:00:00 2001
From: jfiat
Date: Tue, 9 Dec 2008 17:47:17 +0000
Subject: [PATCH] Author:admin Date:2008-12-09T17:47:17.000000Z

git-svn-id: https://svn.eiffel.com/eiffel-org/trunk@132 abb3cda0-5349-4a8f-a601-0c33ac3a8c38
---
 .../eiffellex/eiffellex-tutorial.wiki | 128 +++++++++---------
 1 file changed, 65 insertions(+), 63 deletions(-)

diff --git a/documentation/current/solutions/text-processing/eiffellex/eiffellex-tutorial.wiki b/documentation/current/solutions/text-processing/eiffellex/eiffellex-tutorial.wiki
index 8d9adff0..fd307e87 100644
--- a/documentation/current/solutions/text-processing/eiffellex/eiffellex-tutorial.wiki
+++ b/documentation/current/solutions/text-processing/eiffellex/eiffellex-tutorial.wiki
@@ -139,7 +139,7 @@ As explained below, keywords are regular expressions which are treated separatel

Once build has given you an analyzer, you may use it to analyze input texts through calls to the procedure

- analyze (input_file_name: STRING)
+ analyze (input_file_name: STRING)

This will read in and process successive input tokens. Procedure analyze will apply to each of these tokens the action of procedure do_a_token. As defined in SCANNING, this procedure prints out information on the token: its string value, its type, whether it is a keyword and if so its code. You may redefine it in any descendant class so as to perform specific actions on each token.
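For instance, a descendant of [[ref:/libraries/lex/reference/scanning_chart|SCANNING]] might redefine do_a_token to count tokens rather than print them. The following is only a sketch: it assumes that the argument of do_a_token is of type TOKEN, and the class name TOKEN_COUNTER is purely illustrative.

 class TOKEN_COUNTER
 inherit
 	SCANNING
 		redefine
 			do_a_token
 		end
 feature

 	count: INTEGER
 			-- Number of tokens processed so far.

 	do_a_token (t: TOKEN)
 			-- Count `t' instead of printing information about it.
 		do
 			count := count + 1
 		end
 end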
@@ -157,25 +157,25 @@ Procedure analyze takes care of the most common needs of lexical analysis. But i

This discussion will indeed assume that you have an entity attached to an instance of [[ref:/libraries/lex/reference/lexical_chart|LEXICAL]]. The name of that entity is assumed to be analyzer, although it does not need to be the attribute from [[ref:/libraries/lex/reference/scanning_chart|SCANNING]]. You can apply to that analyzer the various exported features of class [[ref:/libraries/lex/reference/lexical_chart|LEXICAL]], explained below. All the calls described below should use analyzer as their target, as in

- analyzer.set_file ("my_file_name")
+ analyzer.set_file ("my_file_name")

===Creating, retrieving and storing an analyzer===

To create a new analyzer, use

- create analyzer.make_new
+ create analyzer.make_new

You may also retrieve an analyzer from a previous session. [[ref:/libraries/lex/reference/lexical_chart|LEXICAL]] is a descendant of [[ref:/libraries/base/reference/storable_chart|STORABLE]], so you can use feature retrieved for that purpose.

In a descendant of [[ref:/libraries/base/reference/storable_chart|STORABLE]], simply write

- analyzer ?= retrieved
+ analyzer ?= retrieved

If you do not want to make the class a descendant of [[ref:/libraries/base/reference/storable_chart|STORABLE]], use the creation procedure make of [[ref:/libraries/lex/reference/lexical_chart|LEXICAL]], not to be confused with make_new above:

- create analyzer.make
- analyzer ?= analyzer.retrieved
+ create analyzer.make
+ analyzer ?= analyzer.retrieved

===Choosing a document===

@@ -201,20 +201,20 @@ If it fails to recognize a regular expression, get_token sets token_type to No_t

Here is the most common way of using the preceding facilities:

- from
- 	set_file ("text_directory/text_to_be_parsed")
- 	-- Or: set_string ("string to parse")
- 	begin_analysis
- until
- 	end_of_text
- loop
- 	analyzer.get_token
- 	if analyzer.token_type = No_token then
- 		go_on
- 	end
- 	do_a_token (analyzer.last_token)
- end
- end_analysis
+ from
+ 	set_file ("text_directory/text_to_be_parsed")
+ 	-- Or: set_string ("string to parse")
+ 	begin_analysis
+ until
+ 	end_of_text
+ loop
+ 	analyzer.get_token
+ 	if analyzer.token_type = No_token then
+ 		go_on
+ 	end
+ 	do_a_token (analyzer.last_token)
+ end
+ end_analysis

This scheme is used by procedure analyze of class [[ref:/libraries/lex/reference/scanning_chart|SCANNING]], so that in standard cases you may simply inherit from that class and redefine procedures begin_analysis, do_a_token and end_analysis. If you are not inheriting from [[ref:/libraries/lex/reference/scanning_chart|SCANNING]], these names simply denote procedures that you must provide.
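As a concrete illustration, here is a sketch of a routine applying this scheme to a string, printing the string value of every recognized token. It assumes the analyzer entity introduced above, the begin_analysis and end_analysis hooks just mentioned, and a string_value query on tokens (suggested by the description of do_a_token); for simplicity it skips unrecognized input instead of calling go_on.

 print_tokens (s: STRING)
 		-- Analyze `s' and print each recognized token's string value.
 	do
 		analyzer.set_string (s)
 		begin_analysis
 		from
 		until
 			analyzer.end_of_text
 		loop
 			analyzer.get_token
 			if analyzer.token_type /= No_token then
 				print (analyzer.last_token.string_value)
 				print ("%N")
 			end
 		end
 		end_analysis
 	end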
@@ -227,7 +227,7 @@ Let us now study the format of regular expressions. This format is used in parti

Each regular expression denotes a set of tokens. For example, the first regular expression seen above,

- '0'..'9'
+ '0'..'9'

denotes a set of ten tokens, each consisting of a single digit.
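The interval notation combines with the operators used in the grammar examples later in this chapter: '|' for union, '*' for zero or more occurrences, '+' for one or more. Here are a few expressions of this form, with informal readings; the comment layout is illustrative only:

 '0'..'9'                        -- A single decimal digit
 +('0'..'9')                     -- One or more digits: an unsigned integer
 'a'..'z' *('a'..'z'|'0'..'9')   -- A lower-case letter followed by any
                                 -- number of letters and digits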
@@ -292,7 +292,7 @@ A concatenation, written exp1 exp2 ... expn

- "A Text"
+ "A Text"

More generally, a string is written "a<sub>1</sub>a<sub>2</sub> ... a<sub>n</sub>" for ''n'' >= 0, where the a<sub>i</sub> are characters, and is an abbreviation for the concatenation 'a<sub>1</sub>' 'a<sub>2</sub>' ... 'a<sub>n</sub>'

@@ -410,24 +410,24 @@ You may change this default behavior through a set of procedures introduced in c

To make subsequent regular expressions case-sensitive, call the procedure

- distinguish_case
+ distinguish_case

To revert to the default mode where case is not significant, call the procedure

- ignore_case
+ ignore_case

Each of these procedures remains in effect until the other one is called, so that you only need one call to define the desired behavior.

For keywords, the policy is less tolerant. A single rule is applied to the entire grammar: keywords are either all case-sensitive or all case-insensitive. To make all keywords case-sensitive, call

- keywords_distinguish_case
+ keywords_distinguish_case

The inverse call, corresponding to the default rule, is

- keywords_ignore_case
+ keywords_ignore_case

Either of these calls must be executed before you define any keywords; if you are using [[ref:/libraries/lex/reference/scanning_chart|SCANNING]], this means before calling procedure build. Once set, the keyword case-sensitivity policy cannot be changed.
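For example, to make expressions case-sensitive while leaving keywords case-insensitive (the default), the calls could be ordered as follows. This is a sketch: Lower_identifier stands for an expression code recorded as in the example of the next section.

 distinguish_case
 	-- Expressions recorded from now on are case-sensitive.
 keywords_ignore_case
 	-- Keyword policy; must precede the first keyword recording.
 put_keyword ("begin", Lower_identifier)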
@@ -444,22 +444,24 @@ Class [[ref:/libraries/lex/reference/scanning_chart|SCANNING]], as studied abov

The following extract from a typical descendant of [[ref:/libraries/lex/reference/metalex_chart|METALEX]] illustrates the process of building a lexical analyzer in this way:

- Upper_identifier, Lower_identifier, Decimal_constant, Octal_constant, Word: INTEGER is unique
- ...
- distinguish_case
- keywords_distinguish_case
- put_expression ("+('0'..'7')", Octal_constant, "Octal")
- put_expression ("'a'..'z' *('a'..'z'|'0'..'9'|'_')", Lower_identifier, "Lower")
- put_expression ("'A'..'Z' *('A'..'Z'|'0'..'9'|'_')", Upper_identifier, "Upper")
+ Upper_identifier, Lower_identifier, Decimal_constant, Octal_constant, Word: INTEGER is unique
+
+ ...
+
+ distinguish_case
+ keywords_distinguish_case
+ put_expression ("+('0'..'7')", Octal_constant, "Octal")
+ put_expression ("'a'..'z' *('a'..'z'|'0'..'9'|'_')", Lower_identifier, "Lower")
+ put_expression ("'A'..'Z' *('A'..'Z'|'0'..'9'|'_')", Upper_identifier, "Upper")

- dollar_w (Word)
- ...
- put_keyword ("begin", Lower_identifier)
- put_keyword ("end", Lower_identifier)
- put_keyword ("THROUGH", Upper_identifier)
- ...
- make_analyzer
+ dollar_w (Word)
+ ...
+ put_keyword ("begin", Lower_identifier)
+ put_keyword ("end", Lower_identifier)
+ put_keyword ("THROUGH", Upper_identifier)
+ ...
+ make_analyzer

This example follows the general scheme of building a lexical analyzer with the features of [[ref:/libraries/lex/reference/metalex_chart|METALEX]], in a class that will normally be a descendant of [[ref:/libraries/lex/reference/metalex_chart|METALEX]]:

@@ -470,7 +472,7 @@ This example follows the general scheme of building a lexical analyzer with the

To perform steps 2 to 4 in a single shot and generate a lexical analyzer from a lexical grammar file, as with [[ref:/libraries/lex/reference/scanning_chart|SCANNING]], you may use the procedure

- read_grammar (grammar_file_name: STRING)
+ read_grammar (grammar_file_name: STRING)

In this case all the expressions and keywords are taken from the file of name grammar_file_name rather than passed explicitly as arguments to the procedures of the class. You do not need to call make_analyzer, since read_grammar includes such a call.

@@ -484,14 +486,14 @@ Procedure put_expression records a regular expression. The first argument is the

Procedure dollar_w corresponds to the '''$W''' syntax for regular expressions. Here an equivalent call would have been

- put_nameless_expression ("$W", Word)
+ put_nameless_expression ("$W", Word)

Procedure put_keyword records a keyword. The first argument is a string containing the keyword; the second argument is the regular expression of which the keyword must be a specimen. The example shows that here - in contrast with the rule enforced by [[ref:/libraries/lex/reference/scanning_chart|SCANNING]] - not all keywords need be specimens of the same regular expression.

The calls seen so far record a number of regular expressions and keywords, but do not give us a lexical analyzer yet. To obtain a usable lexical analyzer, you must call

- make_analyzer
+ make_analyzer

After that call, you may not record any new regular expression or keyword. The analyzer is usable through attribute analyzer.
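Putting these pieces together, a descendant of [[ref:/libraries/lex/reference/metalex_chart|METALEX]] might build and immediately use an analyzer along the following lines. This is a sketch: the feature name scan_octal_file is illustrative, and Octal_constant is assumed to be declared as in the extract above.

 scan_octal_file (file_name: STRING)
 		-- Build an analyzer recognizing octal constants,
 		-- then prepare it to scan the file of name `file_name'.
 	do
 		put_expression ("+('0'..'7')", Octal_constant, "Octal")
 		make_analyzer
 			-- From here on, no new expressions or keywords.
 		analyzer.set_file (file_name)
 			-- Tokens may now be obtained through the loop
 			-- scheme shown earlier, using analyzer.get_token.
 	end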
@@ -512,11 +514,11 @@ To have access to the most general set of lexical analysis mechanisms, you may u

For the complete list of available procedures, refer to the flat-short form of the class; there is one procedure for every category of regular expression studied earlier in this chapter. Two typical examples of calls are:

- interval ('a', 'z')
- 	-- Create an interval tool
+ interval ('a', 'z')
+ 	-- Create an interval tool

- union (Letter, Underlined)
- 	-- Create a union tool
+ union (Letter, Underlined)
+ 	-- Create a union tool

Every such procedure call also assigns an integer index to the tool it creates; this number is available through the attribute last_created_tool. You will need to record it into an integer entity, for example Identifier or Letter.

@@ -524,18 +526,18 @@

The following extract from a typical descendant of [[ref:/libraries/lex/reference/lex_builder_chart|LEX_BUILDER]] illustrates how to create a tool representing the identifiers of an Eiffel-like language.

- Identifier, Letter, Digit, Underlined, Suffix, Suffix_list: INTEGER
+ Identifier, Letter, Digit, Underlined, Suffix, Suffix_list: INTEGER

- build_identifier is
- 	do
- 		interval ('a', 'z'); Letter := last_created_tool
- 		interval ('0', '9'); Digit := last_created_tool
- 		interval ('_', '_'); Underlined := last_created_tool
- 		union (Digit, Underlined); Suffix := last_created_tool
- 		iteration (Suffix); Suffix_list := last_created_tool
- 		append (Letter, Suffix_list); Identifier := last_created_tool
- 	end
+ build_identifier
+ 	do
+ 		interval ('a', 'z'); Letter := last_created_tool
+ 		interval ('0', '9'); Digit := last_created_tool
+ 		interval ('_', '_'); Underlined := last_created_tool
+ 		union (Digit, Underlined); Suffix := last_created_tool
+ 		iteration (Suffix); Suffix_list := last_created_tool
+ 		append (Letter, Suffix_list); Identifier := last_created_tool
+ 	end

Each token type is characterized by a number in the tool_list. Each tool has a name, recorded in tool_names, which gives a readable form of the corresponding regular expression. You can use it to check that you are building the right tool.

@@ -545,11 +547,11 @@ In the preceding example, only some of the tools, such as Identifier

In the preceding example, only some of the tools, such as Identifier, are of direct interest to clients; each tool to be made available must be selected, through procedure select_tool. Clients will need a number identifying it; to set this number, use procedure associate. For example the above extract may be followed by:

- select_tool (Identifier)
- associate (Identifier, 34)
- put_keyword ("class", Identifier)
- put_keyword ("end", Identifier)
- put_keyword ("feature", Identifier)
+ select_tool (Identifier)
+ associate (Identifier, 34)
+ put_keyword ("class", Identifier)
+ put_keyword ("end", Identifier)
+ put_keyword ("feature", Identifier)

If the analysis encounters a token that belongs to two or more different selected regular expressions, the one entered last takes over. Others are recorded in the array other_possible_tokens.
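To see this rule in action, suppose a second, more general tool Word had also been built (an illustrative assumption; for instance with dollar_w as in the METALEX example). A sketch of the corresponding calls:

 select_tool (Identifier)
 associate (Identifier, 34)
 select_tool (Word)
 associate (Word, 35)
 	-- A token matching both expressions is reported with the
 	-- code of the tool entered last, here Word (35); the other
 	-- candidate, Identifier (34), appears in other_possible_tokens.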