Parsing Roman Numerals


First of all: Roman numerals suck.

That’s actually pretty much all I have to say about them, but I guess you won’t believe me if I don’t prove it. For that we first need to take a closer look on how they work. Let’s get into it.

Basics

The idea of roman numerals is to have a set of symbols that each have a specific value. The most common symbols used are I, V, X, L, C, D, M representing 1, 5, 10, 50, 100, 500, 1000 respectively. These symbols can be combined – from biggest to smallest – to build up other numbers.
So for example LXXXXVIIIIII would be 50 + 4*10 + 5 + 6*1 = 101.

This has two problems:
First, these numbers are very unwieldy if a symbol needs to be repeated often.
Second, for most numbers there is no unique representation. Using the example above: CI is the same as LXXXXVIIIIII

Subtractive Notation

To fix the first problem, a subtractive notation is used. If a symbol is placed out of order (i.e. not descending) it’s value is subtracted instead of added. As a rule, no symbol should be repeated more than 3 times, because then it is more convenient to just subtract from the next symbol.
To give an exmaple: IV = 4 = IIII, or something a bit more complicated: IC = 99, which would be LXXXXVIIII in the best case using the previous idea.

But this doesn’t fix the second problem and adds two more:
Third, the evaluation is potentially ambiguous – is VIX equal to 6 or 4?
Fourth, even if a representation is not ambiguous the calculation might get out of hand. For example, it’s not immediately obvious that VCVLVXIVI is 150.

As a side note: This approach of limiting the number of reoccurring symbols effectively also limits the maximal number that can be represented using a given set of symbols.
Using the standard set from I to M the maximum is MMMDDDCCCLLLXXXVVVIII = 4998.

Standard Notation

The standard notation aims to fix all of these problems.

Starting from the substractive notation, the standard rules are as follows:

  1. An additive symbol may not be repeated more than 3 times (IIII can be written as IV).
  2. An additive symbol may be repeated 4 times if the string is broken up by a subtractive symbol (eg. XXXIX = 39).
  3. 5s symbols (ie. V, L, D) may not be repeated (VV is written as X).
  4. 5s symbols may be not used for substration (VX is written as V).
  5. Subtractive symbols are only allowed left of their next higher 10s or 5s neighbor – which ever is closer.
    1. I is allowed left of V and X.
    2. X is allowed left of L and C.
    3. C is allowed left of D and M.
    4. If there is more than one next higher symbol, the right-most one is used. (XXIX instead of IXXX).
  6. Subtractive symbols may not be repeated (IIV is written as III).
  7. If a symbol is used subtractively, it may not be used additively (IVI is written as V).

This fixes the first problem since we are using a subtractive notation.
The second problem is fixed, because there is only one representation of each number.
The third one is no longer a concern, since there may at most be one subtractive symbol for each position.
And because the position of subtractive symbols is fixed, calculating the value of a numeral is “easy”. So the fourth problem is also solved.

Great, all issues resolved. Except this solution sucks in every possible way.

First of all, because of all the restrictions some nice representations are lost. Using the subtractive notation, the number 49 for example could be written as IL, but this is not allowed in standard notation because the symbol I can only appear subtractively next to V or X. The standard way of writing 49 is XLIX – not very nice.

Since 5s are not allowed to be repeated, the maximum presentable number is also reduced. MMMCMXCIX = 3999

But the biggest problem of all – let me just pre-empt it here, we are going to talk about this in detail in the next chapters – is that, if you generalize this system for arbitrary big numbers, it’s language suddenly becomes context-sensitive.

Parsing Standard Notation

When parsing roman numerals with the standard tools for computer language recognition, there are a few difficulties. Most of them resulting from the fact that one symbol can affect the syntax at a different part of the input – this is called context-sensitivity.

A short example: XCLI is not valid. The L symbol should cause an error during the parsing process, because the X at position 1 “restricts” the valid tokens at position 3 to only V and I.

Does that mean that it’s not possible to create a complete context-free grammar for roman numerals? Actually not quite. Because the set of valid inputs is finite, and therefore regular.

A Grammar For Finite Standard Notation

So here it is. A complete YACC grammar for finite roman numerals in standard notation using the symbol set {I, V, X, L, C, D, M}. I also added actions to calculate the corresponding decimal value for testing. So I guess it’s a fully featured converter now. ^^

As you can see, multiple rules are needed per value position to catch all possible errors. It should also be noted that for each possible repetition count per position, an additional production is necessary. In general we can say that the grammar is very much not simple, given how mundane the language domain is.

I omitted the lex-setup, includes and so on here, since only the grammar is really of interest. If you are curious about the full program, just reach out via email.

Normal: Mstruct { printf("%d\n", $1); } ; Mstruct: Mlast { $$ = $1; } | M Mlast { $$ = 1000 + $2; } | M M Mlast { $$ = 1000 + 1000 + $3; } | M M M Mlast { $$ = 1000 + 1000 + 1000 + $4; } ; Mlast: Dstruct { $$ = $1; } | C M Dlast { $$ = 1000 - 100 + $3; } ; Dstruct: D Cstruct { $$ = 500 + $2; } | Dlast { $$ = $1; } ; Dlast: Cstruct { $$ = $1; } | C D Clast { $$ = 500 - 100 + $3; } ; Cstruct: Clast { $$ = $1; } | C Clast { $$ = 100 + $2; } | C C Clast { $$ = 100 + 100 + $3; } | C C C Clast { $$ = 100 + 100 + 100 + $4; } ; Clast: Lstruct { $$ = $1; } | X C Llast { $$ = 100 - 10 + $3; } ; Lstruct: Llast | L Xstruct { $$ = 50 + $2; } ; Llast: Xstruct { $$ = $1; } | X L Xlast { $$ = 50 - 10 + $3; } ; Xstruct: Xlast { $$ = $1; } | X Xlast { $$ = 10 + $2; } | X X Xlast { $$ = 10 + 10 + $3; } | X X X Xlast { $$ = 10 + 10 + 10 + $4; } ; Xlast: Vstruct { $$ = $1; } | I X { $$ = 10 - 1; } ; Vstruct: Istruct { $$ = $1; } | V Istruct { $$ = 5 + $2; } | I V { $$ = 5 - 1; } ; Istruct: /* empty */ { $$ = 0; } | I { $$ = 1; } | I I { $$ = 1 + 1; } | I I I { $$ = 1 + 1 + 1; } ;
Code language: C++ (cpp)

A Grammar For Infinite Standard Form

Is it possible to create a grammar for a roman numeral system with infinite domain?

Well, it depends. The problem is that an infinite roman numeral system requires infinitely many symbols. And we can’t have that when constructing a grammar. This also begs a practical question: How would infinitely many different symbols look like?

However, I have a workaround for this: We can encode our input symbols as decimal numbers. So the symbol I would correspond to 0, V to 1, X to 2, L to 3 and so on. Now we can consider these decimal numbers ([0-9]+) as tokens and can use them in our grammar.

But now the grammar needs to be able to access semantic information (namely number semantics for the tokens) during the parsing process.

This is actually not a new concept in computer science. Semantic grammars and predicated attributed grammars can be used to model something like this. As far as I can tell the basic idea of using semantic information at parse time dates back to 1976 (“Semantic Grammar: An Engineering Technique for Constructing Natural Language Understanding Systems” by Richard Burton).

So I tried to implement a prove of concept parser in ANTLR (which supports semantic predicates for controlling parsing decisions based semantic criteria).

My first idea was to use attributes to propagate the values from one rule call to another and then use predicates to gate the paths that are semantically not valid. This of course doesn’t work, since, because ANTLR constructs top-down parsers, synthesized attributes are not available during parse time.

I realized that I don’t need the virtual ranks of rule calls, but only the ranks of the next few symbols, which I can access without attributes using the input token stream. This approach could be able to do it in my opinion, but I wasn’t able to get it to work. Maybe someone smarter can figure it out. ^^

Here is the basic idea:

grammar Roman; @header { import java.util.*; } @parser::members { Map<String, Integer> lookup = new HashMap<>(){{ put("I", 0); put("V", 1); put("X", 2); put("L", 3); put("C", 4); put("D", 5); put("M", 6); }}; int getRank(int offset) { String text = _input.LT(offset).getText(); if (text.equals("")) return -1; int i = lookup.get(text); return i; } } start: number EOF; number: digit | { getRank(1) > getRank(2) }? digit number | { getRank(1) == getRank(2) && getRank(1) > getRank(3) }? digit digit number | { getRank(1) == getRank(2) && getRank(1) == getRank(3) && getRank(1) > getRank(4) }? digit digit digit number | { getRank(1) == getRank(2) && getRank(1) == getRank(4) && (getRank(1) % 2 == 0 || getRank(3) == getRank(1) - 1) && (getRank(1) % 2 == 1 || getRank(3) == getRank(1) - 2) && getRank(1) > getRank(5) }? digit digit digit digit number ; digit: 'I' | 'V' | 'X' | 'L' | 'C' | 'D' | 'M' ; NewLine: [ \t\r\n]+ -> skip;
Code language: Java (java)

If you happen to find a working solution, please message me. I’m curious. ^^

But okay, so now what? This is a bit anticlimactic, isn’t it?
Well, I did try one more thing. Because I thought, maybe writing a parser from scratch without relying on parser generators and grammars could actually work too. ^^”

A Hand-Written Parser For Infinite Standard Form

Just to manage expectations: I didn’t implement a parser for arbitrary big roman numerals. I did however write one for finite standard form that can be modified quite easily to accept infinite standard form as described in the previous chapter.

The idea is as follows: All tokens are internally represented as numbers. The parsing code itself does not care what the numbers are but only what their relationship is (eg. the next symbol is one less than the current one, and so on). By using this approach we can use a (theoretically) unlimited number of tokens (Practically we are of course constrained by what ever the size of the used datatype is. Although now that I think about it, maybe it’s possible to use a BigInt type or something.).

For transforming this into a parser for infinite standard form, only the token() and lookahead() functions need to be adjusted accordingly (getnwc() and tokenize() are not used directely by the parser but only the token() and lookahead() functions.).

#include <stdio.h> #include <stdlib.h> #include <math.h> #define I (0) #define V (1) #define X (2) #define L (3) #define C (4) #define D (5) #define M (6) #define is5s(t) (t % 2 != 0) #define is10s(t) (t % 2 == 0) #define value(t) (is5s(t) ? powl(10, (t - 1) / 2) * 5 : powl(10, t / 2)) int tokenize(int c) { switch (c) { case 'I': return I; case 'V': return V; case 'X': return X; case 'L': return L; case 'C': return C; case 'D': return D; case 'M': return M; case EOF: return c; default: fprintf(stderr, "lexical error\n"); exit(2); return -1; } } int getnwc() { for(;;) { int c = getchar(); switch(c) { case ' ': continue; case '\t': continue; case '\n': continue; case '\r': continue; default: return c; } } } int token() { return tokenize(getnwc()); } int lookahead() { int c = getnwc(); if (c != EOF) { ungetc(c, stdin); } return tokenize(c); } int main() { int result = 0; int lastSubtractive = EOF; int last = EOF; int streak = 1; int c, n; while((c = token()) != EOF) { if (last == c) { streak++; if (streak > 3) { fprintf(stderr, "syntax error: streak length exceeded\n"); return 1; } if (lastSubtractive != EOF && c > lastSubtractive) { fprintf(stderr, "syntax error: streak after subtractive\n"); return 1; } } else { streak = 1; } n = lookahead(); if (n != EOF && value(n) > value(c)) { if (is5s(c)) { fprintf(stderr, "syntax error: subtractive 5s\n"); return 1; } else if ( (is5s(n) && n != c + 1) || (is10s(n) && n != c + 2) ) { fprintf(stderr, "syntax error: sutractive range wrong\n"); return 1; } else if (streak > 1) { fprintf(stderr, "syntax error: subtractive streak\n"); return 1; } else { if (c == lastSubtractive) { fprintf(stderr, "syntax error: multiple use of subtractive symbol\n"); return 1; } lastSubtractive = c; result -= value(c); } } else { if (c == lastSubtractive) { fprintf(stderr, "syntax error: subtractive symbol used additively\n"); return 1; } result += value(c); } last = c; } printf("%d\n", result); return 0; }
Code language: C++ (cpp)

As a side note: We could theoretically remove the need for a token lookahead altogether by building a state machine and only using the consumed tokens to determine the next state. The thing we have to be cautious about with this approach is to clean up the state if EOF is reached. Otherwise we might miss a syntax error.

Conclusion

Given how exceedingly hard it is to parse arbitrary big roman numerals – especially considering the dull domain – I think it’s fair to conclude that they are the worst thing since the invention of math. I’m also pretty sure this “number system” – if you can even call it that – is the true reason for the demise of the roman empire.

Still, constructing grammars and parsers for them was (at least for me) an interesting challenge. I also learned some stuff about parsers I didn’t know before – so that’s a win I guess. ^^

I hope this article was interesting for you as well, or at least somewhat entertaining.

Have a nice day,
Sigma

PS: If you know of any other weird historical number systems, please get in touch. ^^


Leave a Reply

Your email address will not be published. Required fields are marked *