IES20070144A2

IES20070144A2 - System and method for lexical analysis

Info

Publication number: IES20070144A2
Authority: IE
Inventors: Francis Flanagan
Original assignee: Dajeil Ltd
Priority date: 2007-03-06
Filing date: 2007-03-06
Publication date: 2007-11-14

Abstract

A lexical analyser that comprises a state/action table is disclosed. The table comprises entries that define a state and an associated action, the action being encoded in an action code. The lexical analyser may be embodied within a co-processor formed in an FPGA. The co-processor advantageously includes a character folding stage that is operative to reduce the width of multi-byte characters in an input to the co-processor prior to processing of the reduced characters by the lexical analyser. A lexical analyser that implements character folding in software is also disclosed.

Description

System and method for lexical analysis : (rObF (; This invention provides a system and method for lexical analysis.

A language to be interpreted automatically can be considered to be a sequence of logical tokens. A symbol can, for example, be a keyword, an identifier, or a punctuation mark. Considered logically, a symbol is an atomic item - it is a single, indivisible thing. However, when a language is represented in a document, a symbol is represented by a token that may include one or more characters to allow the document to be understood by a human reader.

A lexical analyser is the first stage in almost any automatic processing of any language. A lexical analyser recognise tokens within a document and generates an output that contains a sequence of symbols represented by the tokens. Symbols may be defined by a grammar recorded in Backus-Naur form (BNF). The lexical analyser identifies lexemes within the document with reference to the grammar. Tokens take two basic forms: predefined tokens (each of which may only be represented as a single lexeme) and identifier tokens such as names. Taking the simple example of a statement in the C programming language: for (iterate != 0) The lexeme “for” in this statement is predefined and would normally be equated to a token FOR. The lexeme “iterate” is not represented by a pre-defined token; it is an identifier, and it along with all other similar lexemes would be represented by some token, perhaps NAME.

The structure of identifiers is normally defined (in the BNF) along the following lines: Name ::= NameStartChar (NameChar)* IE Ο / Ο 1 44 -2NameStartChar ::= [A-Z] NameChar ::= NameStartChar | [0-9] The preceding definition defines names as consisting of all combinations of a capital letter optionally followed by any number of capital letters or digits. Thus A Al A1B AZ23 are all valid names but 1A 12 are invalid names.

As the basic function of a lexical analyser is to recognise lexemes and thereby generate an appropriate stream of tokens to represent the document, if the above example were an entire grammar for a language the input stream could be pre-processed for instance by representing all upper-case letters as 1 and all digits as 2.

The grammar would then be re-written as: Name ::= 1 (1|2)* i.e. a ‘ 1 ’ optionally followed by any combination of ‘ 1 ’ or ‘2’ As most computer languages originate in the US, the lexemes for the predefined tokens of the language are typically be drawn from the original ASCII character set. ASCII code is a set of 128 characters, suitable only for representing lexemes in English or very English-like languages. However, ordinary lexemes such as names in a document are typically represented by characters drawn from the language of the author of the document. Therefore, they may be drawn from characters that cannot be represented in ASCII, since ASCII is a subset of almost all modem international character sets.

Recent developments towards internationalization have resulted in the development of highly expressive character sets, such as Unicode, and representations of Unicode including UTF8 (a transformation format of ISO10646 defined in http://www.ietf.org/rfc/rfc3629.txt), UTF16 (an encoding of ISO 10646 defined in http://www.ietf.org/rfc/rfc2781.txt) and UCS4. These IE070144 -3character sets permit the representation of almost all characters in use in the world and as currently defined, contain of the order of two million characters.

Lexical analysers to run on general-purpose computers have for many years been constructed using automatic lexical generator tools such as lex or its open-source equivalent, flex. The input to flex is a series of patterns and actions. Patterns are regular expressions which represent fragments of language as defined by its grammar, and actions are freely coded in a programming language. The output from flex is program code that, when executed, operates as a lexical analyser. The pattern matching element of the lexical analyser thereby automated, while when a pattern is matched the action code (hand crafted by a programmer) is executed.

Processing XML on a general-purpose processor is a CPU-intensive application. Offloading a CPU-intensive task to an application-specific co-processor is well-known, having taken place in relation to, inter alia, floating point arithmetic, video compression and coding, and cryptographic processing.

US-A-7 013 424, James et. al., discloses how a dedicated processor may be used for processing XML to reduce load on a host processor. Use of co-processors to off-load host processors is well known, an example being disclosed in US-A-6 963 979.

US-A-5 317 509 describes a system and method for building a lexical analyser that can scan multibyte character sets. The present invention factors regular expressions that contain multibyte characters, so that a single byte finite state automata can be constructed. The system recodes a document encoded in a character set such as UTF8 into a much reduced character set for lexical analysis without altering the generated token stream. However, this disclosure requires additional execution cycles to accommodate multi-byte characters.

Other references include US-A-2004/083466, US-A-2005/0138381, and US-A-6 611 524.

Lexical analysis by use of regular-expression-based rules is known from Aho, et. al., Compilers - Principles, Techniques and Tools, Addison Wesley, 1986. A number of systems exist that implement such analysers including the “j7ex” tool. >£ 070 1 44 -4Current systems either handle relatively small character sets or after Caldwell utilise multiple cycles to accommodate large character sets.

An aim of this invention is to implement a lexical analyser in a processor that is dedicated to that task in a maimer that can handle multi-byte character sets efficiently.

From a first aspect, this invention provides a lexical analyser that comprises a state/action table, the table comprising entries that define a state and an associated action, the action being encoded in an action code.

From a second aspect, the invention provides a co-processor for performing lexical analysis of a formal language that includes a lexical analyser according to the first aspect of the invention.

Embodiments of the invention can be thought of as comprising three basic elements: • formalised actions for a modified version of flex permitting the construction of tables for a hardware lexical analyser; • a hardware lexical analyser suitable to be driven using the tables generated above; and · a character folding mechanism to provide speed of execution and control of memory size.

The augmentation of the actions and thereby the lexical analysis to provide those elements of a parser which are reasonably amenable to implementation therein including the matching of starting and ending tags.

From a third aspect, this invention provides a lexical analysis system comprising a lexical analyser and a character folding stage, the character folding stage being operative to reducing the size of multi-byte characters in an input document and forward the reduced characters to the lexical analyser.

IE O/Q 144 -5Embodiments of this aspect of the invention may be embodied within a software program executable on a general-purpose computer (in much the same manner as the flex utility) or partially or entirely in hardware.

An embodiment of the invention will now be described in detail, by way of example, and 5 with reference to the accompanying drawings, in which: Figure 1 shows a co-processor according to the invention in use in a computer connected using a PCI bus; Figure 2 shows a co-processor according to the invention in use in a computer connected using a PCI Express bus; Figure 3 shows a co-processor according to the invention being a network-connected acceleration appliance; and Figure 4 is a block diagram of a co-processor being an embodiment of the invention; and Figure 5 is a block diagram of a hardware lexical analyser being a component of the embodiment of Figure 4.

Typically, lexical analysers are generated using a software tool such as flex. The flex utility generates programs (typically, C programs) to be used in lexical processing of character input The normal flex lexical analyser uses hand-coded actions in a conventional programming language such as C. An example standard flex statement is shown below: {NameStartChar} ( string_start(yy_text); tag_push(yy_text); BEGIN(st_tag_a); fprintf(token_stream,STAG\n); } In the above, is a state. The following pattern will be matched only in state st_tag. {NameStartChar} matches any name start character. If the next character in the input stream matches this pattern in state st_tag, the action will be executed. «0/0144 -6The action is the entire segment of code enclosed by chain brackets “ { } ”. In standard flex, this is free-format C code which must be executed on a processor. Such actions are not suitable for incorporation into a purely table-driven lexical analyser. Therefore, in this embodiment formalized actions are used to provide a restricted range of actions on the matching of a given pattern. These actions are suitable to be represented by a single table entry of a fixed width-per-action and are therefore suitable for a hardware implementation.

The formalised equivalent to the above statement as used in this invention is: {NameStartChar} @STRING_ACTION=START; TAG_ACTION=PUSH; STATE_ACTION=BEGIN(ST_TAG_A); TOKEN_ACTION=OUTPUT(STAG);G The actions required for the processing of XML have been constructed to form a limited set. This allows the representation of the actions in a single table, which permits the execution of the actions by simple state machines rather than a complex processor. The set of actions defined by one embodiment of this invention is sufficient for the augmented lexical analysis of the XML language. In another embodiment additional actions are added to permit additional verification including inter alia counting of bracket levels.

This arrangement permits the generation of a hardware lexical analyser consisting of simple blocks of hardware, which are entirely table driven.

The Action Set The hardware lexical analyser generates actions that are table driven and therefore must remove much of the flexibility inherent in free-form C action statements. Therefore, the states and actions are encoded in a modified form of the flex syntax.

Each action is encoded as follows: TableActions::= 'Θ'(StateAction)? (StringAction)?(TagAction)? (TokenAction)? '0' StateAction:== 'STATE_ACTION=' ( 'PUSH(' State ');' | 'TAG_PUSH('State');' I 'BEGIN(' State ');' I ’TOP_BEGIN('State');'I 'RESET_BEGIN('State'); I 'END_ERROR_CHECK; | 070 1 44 -7ΙΕ 'POP(' Count ' I 'NOOP;' ) State is any valid flex state Count::= '1' I '2' NOOP denotes that the next character will be matched in the same lexical analyser state as this character.

PUSH denotes that the current lexical analyser state should be pushed onto the state stack and that the next character should be matched in the state “State”. This is used for instance on encountering a start tag to enter a state intended for processing tag characters. The depth of the state stack permits multiple levels of nesting appropriate for handling the complex constructs of XML.

TAG_PUSH is a special form of push that denotes that the tag_depth as recorded in the TLO (type-length-offset) output (see below) should be incremented.

BEGIN causes a jump to a new state without pushing anything onto the stack.

TOP_BEGIN jumps to a new state and dumps all of the contents of the stack.

RESET_BEGIN is used at the start of a document to jump to a suitable start state for a document and initialise various other hardware blocks.

END_ERROR_CHECK is used at the end of a document to verify various things 20 such as the tag_depth being 0 (i.e. all tags have been closed).

POP denotes that the next character should be matched in a previous state popped from the state stack. This may either be the last state pushed onto the stack (Count = 1) or the last but one state (Count = 2).

IE 070 Η* -8String_Action:: ='STRING_ACTION=' ('START;' | 'CONT;' | 'END;' I 'NOOP;’) START denotes that the character matched is the first character of a string.

CONT denotes that the character is a continuation character of a string.

END denotes that the character is the first character beyond the end of a string.

NOOP denotes that the current character is not part of a string for output purposes.

TagAction::= 'TAG_ACTION=' ('PUSH;' I 'COMPARE;' | 'POP;' | 'NOOP;').

PUSH denotes that the current character should be pushed onto the tag stack.

COMPARE denotes that the current character should be compared with the top of the tag stack POP denotes that the tag currently on the top of the stack should be dumped with no comparison. This accommodates tags of the form < sample /> where the closing / is in lieu of a closing tag and no comparison is required.

All tags are strings and the start and end of string provides framing for tags.

TokenAction::= 'TOKEN=' ('OUTPUT' I 'NOOUTPUT') '(' Token ');' Token is any valid token to be output from flex. This includes tokens such as STAG (start tag).

ETAG (end tag). -9ΙΕ 07ο 144 Where actions are omitted on a rule the following defaults apply STATE_ACTION=NOOP STRING_ACTION=NOOP TAG_ACTION=NOOP TOKEN_ACTION=NOOUTPUT(0) Hardware Lexical Analyser A co-processor for performing lexical analysis embodying the invention is shown in Figure 1. The co-processor includes a number of blocks as shown in Figure 4.

The lexical analyser itself is a relatively small hardware block 180, further detailed in figure 5, which reads characters from an input FIFO 170, matches patterns based on the tables, and generates outputs that are sent to an output FIFO 190 and/or a TLO generator.

The input FIFO 170 is a complex first-in-first-out memory which allows the lexical analyser to read characters from it and to accept and reject characters already read in a number of ways to provide for the needs of the lexical analyser.

Character Folding The first stage of operation of a conventional lexical analyser is to use each character in the input as an index into a table. While tables of two million entries (that is, one entry for each character in a modem extended character set) are quite feasible in software implementations on modem general-purpose computers, they are rarely desirable, and they are wholly unsuited to implementation in hardware, and are inconveniently large for use in embedded systems. The embodiment of the invention provides a mechanism to overcome this difficulty.

This embodiment provides for character folding by identifying a suitable base character set for use by the lexical analyser; in this embodiment, the 7-bit ASCII character set is selected but the invention is in no way limited to this character set.

Within the base character set, those characters that are not used to represent the lexemes of predefined tokens that are identified. Each of these characters may be used to represent a IE 070 1 44 -10class of characters. One or more of these classes are used to represent the characters to be replaced.

It will be understood that the use of character folding can also be applied to a lexical analyser that is implemented entirely is software to reduce the size of the tables of the analyser in memory. This can have particular application to an analyser that is to operate on a system with restricted memory, such as in an embedded processor. The flex utility can be amended to generate such an analyser in a straightforward manner.

Lexical Analyser In the preferred embodiment, the lexical analyser is encoded in the hardware description 10 language VHDL and synthesised to hardware implemented in a field-programmable gate array (FPGA). In this embodiment, the tables have been constructed to provide for the lexical analysis XML.

In another embodiment the lexical analyser is synthesised to hardware implemented in an application specific integrated circuit (ASIC).

Character Folding The specific embodiment is constructed to pre-process XML text prior to lexical analysis. Examination of the XML specification reveals that characters 0x1 to 0x8 are not used to represent predefined tokens within XML and are in fact restricted characters.

Each of these characters may therefore be represented by the character 0x1 within the lexical 20 analyser without affecting its operation. This is possible because each of these characters results in the same behaviour by the lexical analyser. Making this substitution opens a window of unused codes in the seven bit ASCII code that is used to provide code points for characters outside the ASCII code set based on which XML character class they fall into.

Looking at three definitions from the XML specification: NameStartChar ::= I [A-Z] I | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] IE 07 0 1 44 I [#x37F-#xlFFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDFO-#xFFFD] | [#xlOOOO-#xEFFFF] = NameStartChar | | | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] = NameStartChar (NameChar)* -11[4a] NameChar [5] Name It is apparent that the characters constituting NameStartChar are a subset of those constituting NameChar. All characters less than Oxf f, other those in the range 0xl-0x8, are passed directly through the lexical analyser. Characters which are members of the set NameStartChar are represented by the character 0x2. Characters which are members of the set NameChar but not members of NameStartChar are represented by 0x3. Valid characters (which are not invalid name start characters or name characters) are replaced with 0x7.

The modified rules are re-written to reflect these changes as: NameStartChar [4a] NameChar [5] Name = \02Γ: 1 [A-Z] | | [a-z] = \03| | ί [0-9] | = NameStartChar (NameStartChar|NameChar)* The specific advantage of this folding is that the input character to the lexical analyser is reduced to seven bits. This permits a slightly modified version of the flex lexical analyser generator to generate tables for the analysis of XML. It also reduces the size of some of the tables required for lexical analysis by a factor of (4 * 1024 * 1024) 1 128 while enabling the lexical analyser to process a complete 21-bit character-per-cycle, instead of operating on some representation of that character by multiple 8-bit characters, each of which takes an iteration of the state machines.

Detailed Description An XML acceleration subsystem embodying the invention is equally suitable for deployment as a server add-in card on PCI/PCIX, PCI Express (as illustrated in Figures 1 and 2, respectively) or on an Ethernet or other network as a connected appliance (as illustrated in Figure 3).

IE 0 70 1 44 -12As shown in Figure 4, the document to be processed is delivered to the system by an input buffer 100.

To illustrate the operation of the system the following fragment of XML will be used: © P Koberl Table 1 shows the encoding of the input document for three sample character encodings: UTF8 (NO BOM), UTF16 LE (NO BOM) and UTF16BE (BOM).

The character encoding and byte order of the document are determined 110 while the document is delayed by a few cycles 120.

The document is then forwarded to a swapping/assembly block 130 that reconstructs the document as UCS-4 characters (these are 21 bits wide). Along with the character, the swapping assembly block 130 also calculates the position in the original file of each character 140. It is of significance to note that each UCS-4 character may be represented by a variable number of characters in the original input document where it is encoded in variable length codes such as UTF-8.

The swapping/assembly block 130 also inserts start-of-file and end-of-file characters, thereby permitting the later stages of the system to operate in a streaming mode.

The output from the swapping/assembly block 130 is next passed through a character equivalence class mapping block 150 which generates tables for use in the lexical analyser from a rule-based description of the formal language (which is, in this case, XML).

The output from the character equivalence class mapper is next passed to the input FIFO 170 which stores both the original UCS-4 character and the related equivalence character along with the size of the character in the original document.

The input to the FIFO is shown in Table 2. It is based on a UTF-8 input file with no byteorder marker (BOM). Only the equivalence character is fed to the lexical analyser 180.

IE 070 1 44 -13The full 21-bit character is fed to a tag stack 175 which needs to compare the characters in start and end tags for the document to ensure matching. The full 21-bit character is also fed to an output FIFO 190 as it is required for XPATH operations which may be implemented in further embodiments of this invention.

The lexical analyser 180 generates control signals for the tag stack 175, the input FIFO 170 and the output FIFO 190 along with tokens which are passed to the output FIFO 190.

The tag stack is used to verify the valid nesting of tags. XML documents are structured by nested opening and closing tags e.g.: 1 EXAMPLE, NUMBER and TEXT are TAGS Each full 21-bit character of an opening tag is pushed onto the tag stack as it is scanned and each character of a closing tag is compared as it is scanned. The stack also supports a tag pop operation for tags of the form: , which are closed in themselves.

The output FIFO 190 emits data streams to two different entities: to the PCI interface to transfer type, length and offset data back to the host, and to the back-end XPath processor.

Taking the sample document shown in Table 2, the generated TLO will be of a form shown in Table 3. Note that in Table 3, the column headed “REPRESENTS” is not part of the TLO; it is shown for ease of comprehension only.

The generated TLO data is sufficient to allow a host-based application programming interface (API) to navigate the corresponding XML document and to provide substantial acceleration over software only implementations.

The output to the XPATH subsystem includes the token type and all of the 21-bit UCS-4 25 characters. This subsystem performs pattern matching on the data stream to return the results of a number of XPath queries on the document directly from the hardware.

Operation of the lexical analyser 180 will now be described with reference to Figure 5.

IE 07 0 1 44 -14The pseudo code of Listing 1 illustrates the operation of the subsystem. With minor modifications, all of the pseudo code of Listing 1, except for the last line and the FIFO handling, can be implemented in hardware in much the same way as a normal lexical analyser as generated by flex, except that it is implemented in hardware.

The modified flex program transforms the pattem/action source code to non-deterministic finite automata and from this to deterministic finite automata which are converted into the contents of the arrays yy_def 270, yy_nxt 260, yy_check 250, and yy_base_accept 240. The techniques used in this transformation are well known to those in the technical field and the names chosen for the arrays correspond to those in the code generated by the standard flex utility. The standard tables generated by this embodiment are generated exactly as they would have been by an unmodified version offlex. Since these are well documented and known to those familiar with the technical field (see for example Chapter 3 of Aho et al. [op cit]), they will not be described here further. However, there are restrictions placed on flex in this embodiment relating to the acceptance of part of the character stream that has been read. These restrictions ensure that the p_accept operation, as implemented in hardware, is sufficient to implement the compiled rules. In the modified version rules which would exceed the limitations of the hardware will generate an error message and any such problems have been successfully avoided by restructuring rules. It is to be understood that the aforementioned limitation is specific to an embodiment and can be set to any arbitrary value in another embodiment.

Apart from the hardware limitations, the remaining modifications have been introduced into the structure provided by flex. An additional symbol table has been added for output tokens (STAG, ETAG. . .). The lexical analysis of flex rules has been modified to accommodate the action syntax described above. Conventionally, flex has functions that store the free25 format C code associated with an action. These have effectively been replaced by storage of the correct table-based actions.

The final line of the pseudocode drives the tag stack 175, the state stack 185 and the output FIFO 190. Errors detected by any of these blocks will cause global error to be set which will force an error into the multiplexer 200.

IE 07 0 1 44 The current state of the lexical analyser is stored in a register YY_CURENT_STATE 295.

(This state should not be confused with the lexical analyser source states of the form ; it is purely a number generated during pattern matching.) -15As the tables are compressed to a degree; the embodiment does not have entries in all of the tables for all possible states and all possible input characters. Instead, an input character is checked for validity in the current state by adding the input character and the current state to form an index into yy_check 250. If the output of yy_check 250 is equal to the current state then the input is valid and the output of yy_next 260 is used as the next value for the current state. Otherwise, the next state is given by yy_def 270.

When the value generated for the new value for the current state is greater than the allowable value, this indicates that a further stage of compression on the input character is required. The input character is re-read via yy_meta 210. This provides a further reduction in required memory size.

The entire analyser may loop a number of times on a single input character in some cases as it accepts complex patterns where the start is ambiguous. For example, taking production 56 from the XML Version 1.1 specification:

[56]TokenizedType : := 'ID' [VC: ID] [VC: One ID per Element Type] [VC: ID Attribute Default] 20 1 'IDREF' [VC: IDREF] 1 'IDREFS' [VC: IDREF] 1 'ENTITY' [VC: Entity Name] 1 'ENTITIES' [VC: Entity Name] 1 'NMTOKEN' [VC: Name Token] 25 1 'NMTOKENS' [VC: Name Token] On processing ID (in the correct context) it may be followed by , REF or REFS. On encountering ID, the lexical analyser will probably be in a state based on an assumption as to what will follow. If its assumption turns out to be false, one or more of the characters will be rejected, the analyser will change state (and implicitly its assumption) and reread the character.

IE 070 1 44 -16When the output of yy_nxt 270 is equal to a generated number j am state 280 an action is triggered and dj_action 220 drives the rest of the system to generate output, accepts/rejects etc.

IE 07 0 1 44 -17Table 1 Sample Document in Various Encodings Character UTF8 (NO BOM ) UTF16 LE (NO BOM) UTF16BE (BOM) CHARI POS2 CHAR POS CHAR POS BOM - 0 - FFFE 0 < 3C 0 003C 0 3C00 2 c 63 1 0063 2 6300 4 n 6E 2 006E 4 6E00 6 0 6F 3 006F 6 6F00 8 t 74 4 0074 8 7400 10 e 65 5 0065 10 6500 12 > 3E 6 003E 12 3E00 14 © C2 A9 7 00A9 14 A900 16 space 20 9 0020 16 2000 18 P 50 10 0050 18 5000 20 space 20 11 0020 20 2000 22 K 4B 12 004B 22 4B00 24 δ C3B6 13 00F6 24 F600 26 b 62 15 0062 26 6200 28 e 65 16 0065 28 6500 30 r 72 17 0072 30 7200 32 1 6C 18 006C 32 6C00 34 < 3C 19 003C 34 3C00 36 / 2F 20 002F 36 2F00 38 c 6C 21 006C 38 6C00 40 n 6E 22 006E 40 6E00 42 0 6F 23 006F 42 6F00 44 t 74 24 0074 44 7400 46 e 65 25 0065 46 6500 48 > 3E 26 003E 48 3E00 50 line feed 0A 27 OOOA 50 OA00 52 Character values in hexadecimal Positions in decimal IE 07Q144 -18Table 2 UCS4 Characters and Equivalence Character for Sample Document Character UTF8 (INPUT) UCS4 CHAR EQUIVALENCE CLASS CHAR POS CHAR POS SOF - - 000006 0 06 (SOF) < 3C 0 00003C 0 3C c 63 1 000063 1 63 n 6E 2 00006E 2 6E 0 6F 3 00006F 3 6F t 74 4 000074 4 74 e 65 5 000065 5 65 > 3E 6 00003E 5 3E © C2 A9 7 0000A9 7 07 (Char) space 20 9 000020 9 20 P 50 10 000050 10 50 space 20 11 000020 11 20 K 4B 12 00004B 12 4B δ C3 B6 13 0000F6 13 03 (Name Start Char) b 62 15 000062 15 62 e 65 16 000065 16 65 r 72 17 000072 17 72 1 6C 18 00006C 18 6C < 3C 19 00003C 19 3C / 2F 20 00002F 20 2F c 6C 21 00006C 21 6C n 6E 22 00006E 22 6E 0 6F 23 00006F 23 6F t 74 24 000074 24 74 e 65 25 000065 25 65 > 3E 26 00003E 26 3E line feed 0A 27 00000A 27 0A EOF - - 000005 28 05 (EOF) ί£ 0 70 1 44 -19Table 3 TLO for Sample Document TYPE LENGTH OFFSET REPRESENTS SOF 0 0 STAG 5 1 CNOTE CONTENT 12 7 © P Koberl ETAG 5 21 CNOTE CONTENT 1 27 EOF 0 28 1Εθ7θ 144 -20Listing 1: pseudo code to illustrate the operation of the Lexical Analyser subsystem do forever { if yy_base_accept[yy_current_state](0:0) = 1 then 5 { // Accepting state last__accepting_state = yy_current_state; IFlFO_Accept(); } if global_error = 0 then yy_c = IFIFO_Read() else yy_c = Invalid_Char; end if; do { while (yy_chk[yy_base_accept [yy_current__state] + yy_c) ! = yy_current_state // If this input character is not expected in the I/ current state { // Take default state yy_current_state = yy_def[yy_current_state]; // If this is greater than allowed if (yy_current_state >= NDFAS) then // Use meta character for this ip i.e. compress more yy_c = yy_meta[yy_c] end if } // Expected so take next yy_current_state = yy_nxt[yy_base[yy_current_state]+yy_c] // Loop until we need to op } while yy_base[yy_current_state] != JAMSTATE if yy_base_accept[yy_current_state](0:0) == 0 then // backup IFIFO_Reject(); yy_current_state = last_accepting_state; end if; // Drive all external hardware do_action(dj_action[yy_current_state]) IE 07 Ο 1**

Claims

1. A lexical analyser that comprises a state/action table, the table comprising entries that 5 define a state and an associated action, the action being encoded in an action code.

2. A lexical analyser according to claim 1 that includes one or more state machines that can act upon action codes in the state/action tables.

3. A lexical analyser according to claim 2 in which the state machines process an action code in one machine cycle. 10

4. A lexical analyser according to any preceding claim in which a table includes fewer entries than all possible states and all possible input characters.

5. A lexical analyser according to claim 4 in which an input character is checked for validity in the current state by adding the input character and the current state to form an index into a check table. 15

6. A lexical analyser according to claim 5 in which if the output of the check table is equal to the current state then the input is valid and the output of a next-state value table is used as the next value for the current state, otherwise the next state is given by a default next value for the present state.

7. A co-processor for performing lexical analysis of a formal language that includes a 20 lexical analyser according to any preceding claim.

8. A co-processor according to claim 7 in which the lexical analyser includes a character folding stage for reducing the size of multi-byte characters in an input document. IE 0 70 1 44 -229. A co-processor according to claim 8 in which the output characters are of a width such that each character can form an index into a table of the lexical analyser.

9. 10. A co-processor according to claim 8 or claim 9 in which in which the output of the character folding stage comprises a stream of 7-bit characters. 5

10. 11. A co-processor according to claim 10 in which the input to the character folding stage is a stream of encoded Unicode characters.

11. 12. A co-processor according to any one of claims 9 to 11 in which the character folding stage is operative to receive a stream of input characters, to identify a class to which each character belongs, and to generate output characters. 10

12. 13. A co-processor according to any one of claims 7 to 12 further including a FIFO to which the character folding stage sends its output characters and from which the lexical analyser receives input characters.

13. 14. A co-processor according to any one of claims 7 to 13 implemented in one of a FPGA, ASIC or other form.

14. 15 15. A lexical analysis system comprising a lexical analyser and a character folding stage, the character folding stage being operative to reducing the size of multi-byte characters in an input document and forward the reduced characters to the lexical analyser.

15. 16. A lexical analysis system according to claim 15 in which the output characters are of a 20 width such that each character can form an index into a table of the lexical analyser.

16. 17. A lexical analysis system according to claim 15 or claim 16 in which in which the output of the character folding stage comprises a stream of 7-bit characters.

17. 18. A lexical analysis system according to claim 17 in which the input to the character folding stage is a stream of encoded Unicode characters. -23®07(j ί 44

18. 19. A lexical analysis system according to any one of claims 15 to 18 which is embodied within a software program executable on a general-purpose computer.

19.

20. A lexical analysis system according to any one of claims 15 to 18 which is at least partially embodied within hardware.