
JFlex recognition conflict


I'm developing a lexical analyzer and parser using JFlex and CUP. I’m running into a conflict in my lexer, and I’m having trouble understanding why it’s happening.

Here’s my lexer:

import java_cup.runtime.Symbol;
%%
%unicode
%cup
%line
%column

%eof{
    System.out.println("End of file");
%eof}

companyName = [A-Z][a-zA-Z0-9]*[0-9][a-zA-Z0-9]*
weekdays = Lundi|Mardi|Mercredi|Jeudi|Vendredi|Samedi|Dimanche|lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche
hours = [0-9]{1,2}h([0-9]{1,2})?
city = [A-Z][a-zA-Z -]+
number = [0-9]+

%%

Compagnie {
    System.out.println("Compagnie");
    return new Symbol(sym.COMPAGNIE);
}

{companyName} {
    System.out.println("Company name: " + yytext());
    return new Symbol(sym.COMPANY_NAME, yytext());
}

{weekdays} {
    System.out.println("Weekday: " + yytext());
    return new Symbol(sym.WEEKDAY, yytext());
}

"au depart de" {
    System.out.println("Departure");
    return new Symbol(sym.DEPART);
}

pour {
    System.out.println("For");
    return new Symbol(sym.FOR);
}

par {
    System.out.println("By");
    return new Symbol(sym.BY);
}

Fin {
    System.out.println("End");
    return new Symbol(sym.FIN);
}

==== {
    System.out.println("Separator");
    return new Symbol(sym.SEPARATOR);
}

: {
    System.out.println("Colon");
    return new Symbol(sym.COLON);
}

car {
    System.out.println("Car");
    return new Symbol(sym.CAR);
}

= {
    System.out.println("Equal");
    return new Symbol(sym.EQUAL);
}

, {
    System.out.println("Comma");
    return new Symbol(sym.COMMA);
}

"(" {
    System.out.println("Open parenthesis");
    return new Symbol(sym.OPEN_PARENTHESIS);
}

")" {
    System.out.println("Close parenthesis");
    return new Symbol(sym.CLOSE_PARENTHESIS);
}

{hours} {
    System.out.println("Hours: " + yytext());
    return new Symbol(sym.HOURS, yytext());
}

{number} {
    System.out.println("Number: " + yytext());
    return new Symbol(sym.NUMBER, Integer.parseInt(yytext()));
}

{city} {
    System.out.println("City: " + yytext());
    return new Symbol(sym.CITY, yytext());
}

[\n\t\r\s]+ {}  // Skip whitespace

. {
    System.out.println("Error: " + yytext() + " at line " + yyline + ", column " + yycolumn);
}

Here’s the input I’m testing:

Compagnie Bloblo007 au depart de Brest ====
Mardi :
    8h = car 2733 pour Nantes (par Quimper),
    15h10 = car 902 pour Rennes (par Morlaix, Saint-Brieuc, Montauban-de-Bretagne),
    09h00 = car 1203 pour Saint-Malo

lundi :
    12h = car 80862 pour Landerneau,
    8h5 = car 70 pour Bordeaux (par Vannes, La Roche-Bernard, Nantes,
    La Roche-sur-Yon, Niort),
    15h15 = car 82019 pour Paris (par Quimper, Nantes)

Fin

The issue arises on the first line. Specifically, Compagnie Bloblo007 is not being correctly matched as COMPAGNIE and COMPANY_NAME. Instead, Compagnie Bloblo is being recognized as a CITY and 007 as a NUMBER. However, if I remove the city and number rules, Compagnie and companyName match correctly.

Question:

How can I adjust my lexer to correctly match the Compagnie keyword and the entire company name (Bloblo007) without mistakenly treating the text as a city or number?

Thanks in advance for your help!


Solution

  • By no means an expert here, but judging by your description it seems the regex for city could be the cause.

    The city pattern [A-Z][a-zA-Z -]+ includes a space in its character class, and the JFlex documentation, under Rules and Actions, says (note the last sentence):

    The lexical rules section of a JFlex specification contains regular expressions and actions (Java code) that are executed when the scanner matches the associated regular expression. As the scanner reads its input, it keeps track of all regular expressions and activates the action of the expression that has the longest match.

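    That's why, in Compagnie Bloblo007 au depart de Brest ====, the prefix Compagnie Bloblo is matched as City: at 16 characters it is a longer match than the 9 characters the Compagnie keyword rule can claim, and companyName cannot match at that position because there is no digit before the space.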

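    I initially expected it to fail at 007, since that no longer matches "city". It does not fail, though: a rule only has to match a prefix of the remaining input, so the scanner returns CITY for Compagnie Bloblo and then resumes scanning at 007, which the number rule matches.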
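
To make the longest-match behaviour concrete, here is a rough Java sketch that imitates it with java.util.regex. It is not JFlex itself, it only covers a subset of the rules from the question, and the names LongestMatchDemo, rules and tokenize are made up for illustration. The space-free city pattern [A-Z][a-zA-Z-]* in the second call is an assumption about one possible fix, not something tested against a real JFlex build.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: simulates "longest match wins, earlier rule wins ties".
class LongestMatchDemo {

    // A subset of the question's rules, kept in specification order.
    static Map<String, Pattern> rules(String cityRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("COMPAGNIE", Pattern.compile("Compagnie"));
        rules.put("COMPANY_NAME", Pattern.compile("[A-Z][a-zA-Z0-9]*[0-9][a-zA-Z0-9]*"));
        rules.put("NUMBER", Pattern.compile("[0-9]+"));
        rules.put("CITY", Pattern.compile(cityRegex));
        rules.put("WS", Pattern.compile("\\s+"));
        return rules;
    }

    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best,
                // so on equal length the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                // No rule matched: report one character and move on.
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
                continue;
            }
            if (!bestName.equals("WS")) {
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Compagnie Bloblo007";
        // City pattern from the question (space allowed inside a city).
        System.out.println(tokenize(input, rules("[A-Z][a-zA-Z -]+")));
        // prints [CITY(Compagnie Bloblo), NUMBER(007)]

        // Assumed alternative: no space inside a city token.
        System.out.println(tokenize(input, rules("[A-Z][a-zA-Z-]*")));
        // prints [COMPAGNIE(Compagnie), COMPANY_NAME(Bloblo007)]
    }
}
```

With the space removed, Compagnie is matched by both the keyword rule and the city rule at the same length, and the rule listed first wins, which is also how JFlex breaks ties. The trade-off is that a multi-word city such as La Roche-Bernard would then arrive as two CITY tokens, so the CUP grammar would probably need to accept a sequence of CITY tokens as one city name.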