EP3117307A1 - Procede et dispositif pour gerer les ambigüites dans l'analyse d'un code source - Google Patents
Procede et dispositif pour gerer les ambigüites dans l'analyse d'un code sourceInfo
- Publication number
- EP3117307A1 EP3117307A1 EP15709632.2A EP15709632A EP3117307A1 EP 3117307 A1 EP3117307 A1 EP 3117307A1 EP 15709632 A EP15709632 A EP 15709632A EP 3117307 A1 EP3117307 A1 EP 3117307A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- name
- grammar
- lexical
- symbol
- typedef
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/425—Lexical analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
Definitions
- the invention relates to the field of source code compilation and in particular relates to a method and a device for managing ambiguities in the analysis of the source code.
- the compilers perform first-stage analysis operations through three main components.
- the analysis steps generally include a first step of lexical analysis performed by a lexical analyzer, also known as 'scanner', which breaks the source code into lexical entities called tokens ("tokens" according to anglicism). spent).
- tokens tokens
- the tokens are then used by a parser to identify the syntactic structure of the
- a syntactic tree structure is usually constructed from the tokens according to a grammar that defines the syntax of the language.
- a semantic analyzer completes the syntax tree and produces a symbol table that contains the definitions of source code symbols. The tree from the parse string is then used by other compiler components to generate the target code.
- the grammar of a programming language such as C is a set of symbols and rules of production.
- the symbols are of two kinds: "terminal symbol” corresponding to a lexical entity of the language, and "non-terminal symbol".
- One of the non-terminal symbols is designated as the starting point of the grammar.
- a production rule defines how a sequence of symbols belonging to the grammar can correspond to a new symbol.
- the parsing then consists in constructing a tree from a source code decomposed into lexical entities, a tree commonly called a parse tree whose vertex is the starting symbol of the grammar, the leaves are the terminal symbols, and each of which node is created by the application of the production rules.
- the grammar of the C language makes it possible to define the syntactic elements of the language as instructions, via the symbols of the grammar, and to define the role of the various lexical elements
- a "Declaration” consists of defining a "symbol” in the program and associating it with a "type", that is, information describing the data associated with or designated by this symbol such as numerical values (integers, real numbers, etc.) ), text, or composite elements
- Type / symbol For declarations whose description is “typedef”, and whose symbol name is “typedef-name”. Such a symbol name may appear in the rest of the source code only in another type description, or on the right side of another declaration. Apart from this exception, such a symbol can only be found in a type description.
- the lexical analyzer queries the symbol table to determine if the name has already been defined and what kind. If the name encountered is already defined as "typedef-name", the lexical element is associated with a "typedef-name” token, otherwise the lexical element is associated with an "identify” token.
- a first type of implementation is to extend the grammar of the C standard and allow the appearance of "typedef-name” tokens from contextual lexical analysis. This is to make the statement "a * a" syntactically correct if 'a' is a "typedef-name” token.
- This so-called extended grammar implementation is used in existing compilers such as 'GCC (GNU Compiler Collection) in versions prior to V4.0 or' PCC (Portable C Compiler) and is also used in C such tools 'CIL' (C Intermediate Language), 'Frama-C or' FrontC.
- 'GCC GPU Compiler Collection
- PCC Portable C Compiler
- C C Intermediate Language
- the grammar of the C standard only allows the generation of "identify" tokens in a statement, and performing an extended grammar is a complex operation with a high risk of introducing new
- an extended grammar is significantly longer than the grammar of the C standard, thus increasing the complexity of parsing.
- the development of an analyzer consists of generating the implementation of the parser by a compiler generator such compilers for example 'Berkeley or AT & T YACC,' GNU bison 'or' USF ANTLR '. It is the generator that supports the development of the rest of the parser and ensures that the parser behavior as well
- Another type of implementation is to abandon the grammar of the C standard and to rewrite the parser by hand, for example in C or C ++.
- This technique called the downward recursive parser because it assumes a certain type of parser, has the drawbacks of operating in a simple way, only for a smaller class of grammars, whose grammar of C is not part of it.
- the parser thus developed is generally very complex, both in code size and behavior, and has no guarantee of conformity, except as a manual transcription element by element, of the norm of the language analyzed. In addition, it requires an expensive test procedure, which does not guarantee complete compliance. Tools using this technique are for example 'GNU GCC from version 4.0 or' Apple CLANG '.
- Another type of known implementation consists in considering that a partial analysis of the source code is sufficient for the purposes of the implementation, and that a complete resolution of the problem of
- An object of the present invention is to provide a method that exploits a norm grammar without modifying it or to use an extended grammar to handle ambiguities in symbol declarations.
- the device of the present invention allows the lexical analyzer to generate selective tokens differentiating ambiguous or multivocal lexical entities.
- the technical advantages of the present invention are to significantly reduce the complexity of a parser, both in terms of software development cost thanks to a smaller code size, than in terms of its validation because it ensures a respect at most. close to the norm and not a revalidation of extensions that would not be defined in the standard.
- Another object of the present invention is to provide a method for activating within the lexical analyzer an indicator of differentiation of the issued tokens.
- the invention will find application for source code compilers and analyzers of the programming language C, as well as its extensions.
- the invention applies to the field of development and analysis tools, compilers and code verification tools for software security and engineering.
- a device coupled to a lexical analyzer comprises components adapted to: identifying in a source code received by the lexical analyzer, a lexical entity having ambiguity of interpretation relative to a given grammar of language; identifying in a symbol table associated with the lexical analyzer the existence of a symbol for said lexical entity; determining for the identified symbol the definition stored in the symbol table; and generating a token representative of said definition.
- the components for identifying a lexical entity having ambiguity of interpretation comprise means for determining whether the lexical entity is a name corresponding to a lexeme of type "typedef-name".
- the device comprises means for defining in the grammar a first zone where the lexemes "typedef-name” are differentiated from lexemes "to identify”, and a second zone where said lexemes are not differentiated.
- the device comprises means for generating an identifier token if the name is in the second zone of the grammar.
- the device includes means for enabling a search in the symbol table if the name is in the first area of the grammar.
- the components for determining the recorded symbol definition further include means for determining whether the symbol is defined as "typedef-name".
- the components for generating a token comprise means for generating a 'typedef-name' token.
- the components for generating a token include means for generating an identifier token if the symbol is not defined as "typedef-name”.
- the source code is in the C language and the grammar is the standard grammar of the C language.
- the device is implemented in a code compiler.
- the invention further relates to a method for handling interpretation ambiguities with respect to a given language grammar, the method comprising the steps of: identifying in a source code received by a lexical analyzer a lexical entity having ambiguity of interpretation ; identifying in a symbol table associated with the lexical analyzer the existence of a symbol for said lexical entity; - determine for the identified symbol, the definition recorded in the symbol table; and generating a token representative of said definition.
- the invention may operate in the form of a computer program product that includes code instructions for performing the claimed process steps when the program is run on a computer.
- Figure 1 shows schematically the components of the device of the invention in a preferred implementation
- Figures 2a and 2b show a sequence of steps of the method of the invention in one embodiment
- Figure 1 shows schematically a device 100 comprising the components of a lexical, syntactic and semantic analysis chain for a preferred implementation of the invention in a compiler C.
- a lexical analysis module 120 receives source code 102 in the C language.
- the source code C can come from a file stored on the computer implementing the device 100 or from a remote computer or any other computer-readable medium.
- the lexical analysis module 120 comprises a set of analysis components 122, common to any lexical analyzer, for receiving the source code and breaking it down into lexical entities.
- the lexical analyzer further comprises a state change component 124 for switching a status indicator from a true state to a false state and vice versa.
- the component is implemented in the form of an "Application Programming Interface" (API) according to known anglicism - including the functions for changing the status of the status indicator.
- API Application Programming Interface
- the lexical analyzer 120 further includes a test component 126 for testing the value of the status indicator.
- the test checks the value of the status indicator to determine the nature of the token to be issued, either an "identify” token or a "typedef-name” token.
- the lexical analysis module generates tokens 104 to a parser module 140.
- the parsing module 140 analyzes the tokens received from the lexical analyzer, generates a syntax tree based on the grammar used, and generates semantic actions (AST) 106 which are processed by a semantic analysis module 1 60. a preferred implementation, the parsing module makes it possible to use the grammar of ISO / ANSI C without extension, in which the production listed in chapter "6.7.8 ISO / IEC 9899: 201 1" is removed.
- the grammar of the C language is analyzed in order to determine a first zone where it is necessary to distinguish between the entities "typedef-name" and "identifier", and a second zone where it is not necessary to make this distinction. .
- the area of distinction is called “active zone” and the zone without distinction is called “passive zone”.
- the grammar is considered to represent a space where the parsing is to move in this space in agreement with a selected parsing method.
- Each production of the grammar (and the non-terminal in the left part of the production) represents a point of this space and the paths to reach it.
- Each point of this space can be assigned one or more semantic actions, which are performed when the parser has finished its analysis at this point. When a point is at the boundary between the two active and passive zones, a semantic action is associated with this point to activate the state indicator component of the lexical analyzer, according to two cases:
- the state change component 124 is activated to switch the state indicator from the true state to the false state;
- the state indicator is toggled from the false state to the true state.
- the semantic analysis module 1 60 receives the actions
- semantics from the parser to process them and complete the syntax tree It includes a component 1 62 for the definition of zoning of the grammar that defines the boundary between the active zone and the passive zone.
- the semantic analysis module is also coupled to a memory 180.
- the memory is organized as a table of symbols which makes it possible to record the definitions of the symbols.
- the memory is also coupled to the lexical analysis module.
- the elements (108) generated by the semantic analysis module are addressed to compiler components (not shown in the figure) to finalize the code generation and optimization operations.
- Figures 2a and 2b illustrate the steps 200 operated by the component chain of Figure 1 in a preferred implementation of the invention.
- the method starts on reading a sequence of characters of a source code 202 submitted to the lexical analyzer.
- the next step 204 consists in extracting lexical entities from the source code.
- the present invention is implemented in a TC compiler, in its analysis portion of C, with the Smalltalk® language in which the concepts of structures are classes, objects (class instances), methods ( object behaviors) and attributes (object state variables).
- the lexical analyzer is a lexer class called CCScanner in charge of lexical analysis.
- the device comprises a class "parser” called CCParser corresponding to the parser in charge of the parsing and part of the semantic analysis.
- the device further comprises a class of management of the symbol table, called CScope.
- the compiler generator used is Smalltalk Compiler-Compiler (SmaCC) which defines the abstract classes containing the lexical analyzer and parser behavioral base, as well as the parser compiler from the grammar and lexical description.
- the method allows the step 206 to test whether an entity is a name corresponding to a lexeme type "typedef-name" or not. If the entity is not of type "typedef- name" (branch No), the lexical analyzer produces a token appropriate to the type of the entity (207). If the lexical entity is a name (Yes branch), the method sends a request (208) to the symbol table to search for the symbol associated with the entity. The symbol is returned to the lexical analyzer.
- the method verifies (step 210) in which area of the grammar the entity is encountered to determine whether the entity is in the active or passive grammar area. If the entity encountered is in the passive grammar area (No branch), that is, the symbol a has already been met and that there is no ambiguity in its interpretation, the lexical analyzer will produce a token type Identifier '(step 21 6).
- step 210 If the entity encountered is in the active grammar area (Yes branch, step 210), meaning there is ambiguity in its interpretation, the method proceeds to the next step (212) to check whether the symbol is set to this entity in the symbol table, and how it is defined (type or variable). If the symbol is defined and specified by its type as "typedef-name", the method proceeds to step 214 where the lexical analyzer produces a 'typedef-name' token. If the symbol has not been previously defined, or if the symbol is defined but not specified as "typedef-name" (branch No), the method proceeds to step 216 where the lexical analyzer produces a token identifier. (step 21 6).
- the CCScanner class defines an attribute "fTypename" initialized to the value 'true' ('true' in English) when an instance of the scanner is created by the method (CCScanner "initialize).
- the Device Control API consists of two methods, (CCScanner "setFTypename) and (CCScanner" unsetFTypename), the first setting the attribute to 'true', the second setting the attribute to 'false'.
- the CCScanner lexical analyzer has a routine 'CCScanner' iDENTiFiER 'which is started when a lexeme of the type' identifier 'is detected as input. Contextualization is implemented by querying the symbol table (an instance of CScope). The implementation is performed by a test on the fTypename attribute following the following logical expression:
- step 218 by parsing and executing semantic actions (106).
- the method then makes it possible to verify (step 220) whether a point of the analysis is located at the boundary of the active zone and the passive zone. If the point is not on the boundary (branch No), the method allows generation of the syntax tree (step 228). If the point is located on the boundary of the two zones (Yes branch), the method makes it possible to determine the direction of movement in the space (222) and to verify from which zone of the grammar the entity originates. If the zone of origin is the active zone (branch Yes), meaning that the direction of movement consists of a transition from the active zone to the passive zone, the method makes it possible to actuate the zone passing device (step 224). Then, the method generates the syntax tree (228) taking into account the modifications.
- the method makes it possible to actuate the active zone passing device (step 226). Then, the method makes it possible to generate nodes of the syntax tree (228) taking into account the modifications, and the corresponding semantic actions.
- the boundary between the two zones is implemented in the semantic analyzer and the zones are materialized in the semantic actions defined in the grammar, according to the following realization:
- the semantic action contains the code 'self unsetFTypename' which activates the components of the lexical analyzer.
- the semantic action contains the code 'self setFTypename' which deactivates the components of the lexical analyzer.
- the boundaries between the two zones are identified at the following points:
- declaration_specifiers declaration_specifier
- I struct_or_union_specifier ⁇ self unsetFTypename. ... ⁇ enum_specifier ⁇ self unsetFTypename. ... ⁇
- declaration_specifiers ";" ⁇ self setFTypename. ... ⁇
- declaration_specifiers abstract_declarator ⁇ self setFTypename. I declaration_specifiers ⁇ self setFTypename. ... ⁇
- an optional device can be added to the lexical analyzer to check on a token of advance (known as "look-ahead token” in English).
- a token of advance known as "look-ahead token” in English.
- the optional device is implemented in the lexical analyzer in the form of an API corresponding to the device (CCParser »setFTypename) and (CCParser» unsetFTypename) adapted to operate the optional device as follows :
- the token being processed is contained in the "currentToken” attribute of the parser. If the device (setFTypename) is activated, the token is a "IDENTIFIERId” and its symbol exists and is a type, then the token is changed to a "TypeNameld”.
- the token is a "TypeNameld” then it is changed to a "IDENTIFIERId”. In other cases, the current token is not changed.
- the method makes it possible to manage the ambiguities that may appear on the symbol declarations.
- the present invention can be implemented from hardware and / or software elements. It may be available as a computer program product on a computer readable medium.
- the support can be electronic, magnetic, optical, electromagnetic or be an infrared type of diffusion medium.
- Such supports are, for example, Random Access Memory RAMs (ROMs), magnetic or optical tapes, disks or disks (Compact Disk - Read Only Memory (CD-ROM), Compact Disk - Read / Write (CD-R / W) and DVD).
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| FR1451922A FR3018368B1 (fr) | 2014-03-10 | 2014-03-10 | Procede et dispositif pour gerer les ambiguites dans l'analyse d'un code source |
| PCT/EP2015/054187 WO2015135769A1 (fr) | 2014-03-10 | 2015-02-27 | Procede et dispositif pour gerer les ambigüites dans l'analyse d'un code source |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP3117307A1 true EP3117307A1 (fr) | 2017-01-18 |
Family
ID=50639763
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP15709632.2A Withdrawn EP3117307A1 (fr) | 2014-03-10 | 2015-02-27 | Procede et dispositif pour gerer les ambigüites dans l'analyse d'un code source |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20170024193A1 (fr) |
| EP (1) | EP3117307A1 (fr) |
| FR (1) | FR3018368B1 (fr) |
| WO (1) | WO2015135769A1 (fr) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11537791B1 (en) * | 2016-04-05 | 2022-12-27 | Intellective Ai, Inc. | Unusual score generators for a neuro-linguistic behavorial recognition system |
| RU2016137177A (ru) | 2016-09-16 | 2018-03-19 | Оракл Интернэйшнл Корпорейшн | Усовершенствованное преобразование исходного кода языка программирования |
| RU2016137176A (ru) * | 2016-09-16 | 2018-03-19 | Оракл Интернэйшнл Корпорейшн | Связывание преобразованного исходного кода с первоначальным исходным кодом с помощью метаданных |
| CN116414395A (zh) * | 2021-12-30 | 2023-07-11 | 广东优特云科技有限公司 | 一种基于递归下降算法的语法树构建方法及装置 |
| CN114090017B (zh) | 2022-01-20 | 2022-06-24 | 北京大学 | 编程语言的解析方法及装置、非易失性存储介质 |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5560010A (en) * | 1993-10-28 | 1996-09-24 | Symantec Corporation | Method for automatically generating object declarations |
-
2014
- 2014-03-10 FR FR1451922A patent/FR3018368B1/fr not_active Expired - Fee Related
-
2015
- 2015-02-27 EP EP15709632.2A patent/EP3117307A1/fr not_active Withdrawn
- 2015-02-27 US US15/123,937 patent/US20170024193A1/en not_active Abandoned
- 2015-02-27 WO PCT/EP2015/054187 patent/WO2015135769A1/fr not_active Ceased
Non-Patent Citations (2)
| Title |
|---|
| None * |
| See also references of WO2015135769A1 * |
Also Published As
| Publication number | Publication date |
|---|---|
| FR3018368B1 (fr) | 2017-12-15 |
| US20170024193A1 (en) | 2017-01-26 |
| WO2015135769A1 (fr) | 2015-09-17 |
| FR3018368A1 (fr) | 2015-09-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Shah et al. | Resolving ambiguities in natural language software requirements: a comprehensive survey | |
| US20220197611A1 (en) | Intent-based machine programming | |
| RU2610241C2 (ru) | Способ и система синтеза текста на основе извлеченной информации в виде rdf-графа с использованием шаблонов | |
| EP3117307A1 (fr) | Procede et dispositif pour gerer les ambigüites dans l'analyse d'un code source | |
| US10803254B2 (en) | Systematic tuning of text analytic annotators | |
| FR2906049A1 (fr) | Procede, mis en oeuvre par ordinateur, de developpement d'une ontologie a partir d'un texte en langage naturel | |
| US20070112718A1 (en) | Method and apparatus to enable integrated computation of model-level and domain-level business semantics | |
| WO2006122886A1 (fr) | Methode dynamique de generation de documents xml a partir d'une base de donnees | |
| US20150261507A1 (en) | Validating sql queries in a report | |
| Bader et al. | AI in software engineering at Facebook | |
| WO2006136565A1 (fr) | Procede de traitement de donnees compatible avec un formalisme de modelisation d'objets | |
| US20060129418A1 (en) | Method and apparatus for analyzing functionality and test paths of product line using a priority graph | |
| WO2007001637A2 (fr) | Utilisation de types de donnees forts pour exprimer des grammaires de reconnaissance vocale dans des programmes logiciels | |
| EP1788497A1 (fr) | Motif de conception et procédé de transformation d'un modèle objet | |
| Muller et al. | HUTN as a bridge between modelware and grammarware-an experience report | |
| WO2011098677A1 (fr) | Systeme et un procede de gestion et de compilation d'un cadre d'applications de developpement logiciel. | |
| Decker | srcdiff: Syntactic differencing to support software maintenance and evolution | |
| Subahi et al. | A New Framework for Classifying Information Systems Modelling Languages. | |
| EP2182435A1 (fr) | Procédé d'implémentation d'une machine à états finis via l'utilisation d'annotations Java | |
| Ratiu et al. | Towards a repository of common programming technologies knowledge | |
| EP2738693A1 (fr) | Procédé d'enregistrement de données hiérarchisées | |
| FR2887348A1 (fr) | Structure de donnees compatible avec un formalisme de modelisation d'objets et procede de traitement de donnees | |
| FR2821972A1 (fr) | Procede pour permettre un dialogue homme-machine | |
| Kroha et al. | Using semantic web technology in requirements specifications | |
| Lorenzo et al. | Supporting safe metamodel evolution with edelta |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
| 17P | Request for examination filed |
Effective date: 20160906 |
|
| AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
| AX | Request for extension of the european patent |
Extension state: BA ME |
|
| DAX | Request for extension of the european patent (deleted) | ||
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
| 17Q | First examination report despatched |
Effective date: 20180509 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
| 18D | Application deemed to be withdrawn |
Effective date: 20210518 |