Tags: algorithm, language-agnostic, typography

Ideas for converting straight quotes to curly quotes


I have a file that contains "straight" (normal, ASCII) quotes, and I'm trying to convert them to real quotation mark glyphs (“curly” quotes: U+2018, U+2019, U+201C, and U+201D). Since collapsing two different quote characters into a single one was lossy in the first place, there is obviously no way to perform this conversion fully automatically; nevertheless, I suspect a few heuristics will cover most cases. So the plan is a script (in Emacs) that does something like the following: for each straight quote character,

  1. guess which curly quote character to use, if possible
  2. ask the user (me) to confirm, or make a choice

This question is about the first step: what would be a good algorithm (a set of heuristics, more like) to use, for normal English text (a novel, for example)? Here are some preliminary ideas, which I believe work for double-quotes (counterexamples are welcome!):

  1. If a double-quote is at the beginning of a line, guess that it is an opening quote.
  2. If a double-quote is at the end of a line, guess a closing quote.
  3. If a double-quote is preceded by a space, guess an opening quote.
  4. If a double-quote is followed by a space, guess a closing quote.
  5. If a double-quote doesn't fit into one of the above categories, guess that it is the “opposite” of the most recently used kind of double-quote.
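
The five rules above can be sketched in a few lines. This is only an illustration in Python (not Emacs Lisp), applied to one line at a time; the function name `curl_double_quotes` and the choice to start as if the previous quote had been a closing one are my assumptions, not part of the rules.

```python
OPEN_DOUBLE = "\u201c"   # left double quotation mark
CLOSE_DOUBLE = "\u201d"  # right double quotation mark

def curl_double_quotes(line: str) -> str:
    """Guess a curly replacement for every straight double quote in one line."""
    out = []
    # Assumption: before any quote has been seen, act as if the last one was closing.
    last = CLOSE_DOUBLE
    for i, ch in enumerate(line):
        if ch != '"':
            out.append(ch)
            continue
        prev = line[i - 1] if i > 0 else ""
        nxt = line[i + 1] if i + 1 < len(line) else ""
        if prev == "":          # rule 1: beginning of line
            guess = OPEN_DOUBLE
        elif nxt == "":         # rule 2: end of line
            guess = CLOSE_DOUBLE
        elif prev.isspace():    # rule 3: preceded by a space
            guess = OPEN_DOUBLE
        elif nxt.isspace():     # rule 4: followed by a space
            guess = CLOSE_DOUBLE
        else:                   # rule 5: opposite of the most recent guess
            guess = CLOSE_DOUBLE if last == OPEN_DOUBLE else OPEN_DOUBLE
        out.append(guess)
        last = guess
    return "".join(out)

print(curl_double_quotes('"Hello," she said. "Goodbye."'))
```

In the interactive workflow described earlier, each `guess` would be shown to the user for confirmation rather than written straight to the output.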

Single quotes are trickier, because a ' might be an opening quote, a closing quote, or an apostrophe, and we want to leave apostrophes alone (mustn't write “mustn’t”). Some of the same rules as above apply, but 'tis possible for apostrophes to be at the beginning of words (or lines), although it's less common than 'twas in the past. I can't offhand think of rules that would properly handle fragments like ["I like 'That '70s show'", she said]. It might require looking at more than just neighbouring characters and computing distances between quotes, for example…
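
To make the single-quote problem concrete, here is a rough Python sketch of a classifier that only looks at neighbouring characters. The rules in it (letters on both sides, a two-digit decade such as '70s, a tiny hypothetical list of words like 'tis) are my own guesses for illustration; anything it cannot decide is reported as "ambiguous" so the user can be asked.

```python
import re

# Hypothetical, deliberately tiny list of words that can start with an apostrophe.
LEADING_CONTRACTIONS = {"tis", "twas", "twere", "em", "til"}

def classify_single_quote(text: str, i: int) -> str:
    """Return 'apostrophe', 'opening', 'closing', or 'ambiguous' for the ' at text[i]."""
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    # A letter on both sides: a contraction or possessive such as "mustn't".
    if prev.isalpha() and nxt.isalpha():
        return "apostrophe"
    if prev == "" or not prev.isalnum():
        rest = text[i + 1:]
        # An abbreviated decade such as '70s.
        if re.match(r"\d\ds\b", rest):
            return "apostrophe"
        word = re.match(r"[A-Za-z]+", rest)
        if word and word.group().lower() in LEADING_CONTRACTIONS:
            return "ambiguous"  # 'tis might still open a quotation
        if nxt.isalnum():
            return "opening"
    if prev and (prev.isalnum() or prev in ".,;:!?"):
        if nxt == "" or not nxt.isalnum():
            return "closing"  # only a guess: could be a plural possessive (dogs')
    return "ambiguous"

def classify_all(text: str) -> list:
    """Classify every straight single quote in the text, in order."""
    return [classify_single_quote(text, i) for i, c in enumerate(text) if c == "'"]

print(classify_all("\"I like 'That '70s show'\", she said"))
# ['opening', 'apostrophe', 'closing']
```

The fragment from the question happens to work here, but a trailing possessive (the dogs' bowl) would be guessed as a closing quote, which is the kind of case that needs more context than the neighbouring characters.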

Any more ideas? It is okay if not all possible cases are covered; the goal is to be as intelligent as possible but no further. :-)

Edit: Some more things that might be worth thinking about (or might be irrelevant, not sure):

  • Quotes might not always be in matching pairs. For single quotes, the reason is obvious from the apostrophe cases above. But even for double quotes, when a quotation extends over more than one paragraph, the usual typographic convention (don't ask me why) is to start each paragraph with a quotation mark, even though the quotation was not closed in the previous one. So simply keeping a state machine that alternates between two states will not work!
  • Nested quotation (alluded to in the "I like 'That '70s show'" example above): this might make either kind of quote not be preceded or followed by a space.
  • British/American punctuation style: are commas inside the quotes or outside?
  • Many word processors (e.g., Microsoft Word) already do some sort of conversion like this. Although they are not perfect and can often be annoying, it might be instructive to learn how they work...

Solution

  • You can't parse English quotation marks with regex: regular expressions aren't sufficiently expressive to handle them. You can get by in a few situations, but a general solution can't be created using regex. See the test cases for my solution.

    Given:

    • A lexer to create lexemes from a character stream.
    • An emitter that publishes various types of quotation marks.
    • An ambiguity resolver that creates nested trees.
    • A set of known ambiguous and unambiguous contractions.
    • A circular buffer of lexemes, length 4.
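
    As a rough idea of what the first and last items could look like, here is a minimal Python sketch (not the KeenQuotes code). The `Lexeme` type, the six token kinds, and the `windows` helper are all hypothetical names chosen for this example; `collections.deque` with `maxlen=4` stands in for the circular buffer.

```python
import re
from collections import deque
from typing import Iterator, NamedTuple

class Lexeme(NamedTuple):
    kind: str  # "word", "number", "space", "quote_single", "quote_double", "punct"
    text: str

# Hypothetical token classes; a real lexer would likely distinguish more.
_PATTERN = re.compile(
    r"(?P<word>[^\W\d_]+)|(?P<number>\d+)|(?P<space>\s+)"
    r"|(?P<quote_single>')|(?P<quote_double>\")|(?P<punct>.)",
    re.DOTALL,
)

def lex(text: str) -> Iterator[Lexeme]:
    """Split a character stream into lexemes."""
    for m in _PATTERN.finditer(text):
        yield Lexeme(m.lastgroup, m.group())

def windows(text: str, size: int = 4) -> Iterator[tuple]:
    """Yield the buffer's contents each time a lexeme is pushed into a full buffer."""
    buf = deque(maxlen=size)  # the oldest lexeme drops out automatically
    for lexeme in lex(text):
        buf.append(lexeme)
        if len(buf) == size:
            yield tuple(buf)

for window in windows("'70s show"):
    print([lx.text for lx in window])
```

    Each window gives a quote lexeme a little context on either side, which is what a categorizer needs to tell, say, a quote followed by a number from a quote followed by a space.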

    Then, super-broadly, one possible algorithm follows:

    1. Iterate over the document using the lexer.
    2. Pass lexemes from the lexer to the emitter.
    3. Push the lexeme into the emitter's circular buffer.
    4. Parse 4 lexemes at a time in the emitter to categorize the curl:
      • opening/closing double/single quote
      • apostrophe
      • straight quote
      • ambiguous opening single quote
      • ambiguous closing single quote
      • ambiguous single quote
      • ambiguous double quote
    5. Emit the categorized quotation mark as a token to the ambiguity resolver.
    6. Have the resolver create trees (for tracking nested quotes):
      1. open a tree for opening quote tokens (single/double)
      2. close the tree for closing quote tokens (single/double)
      3. otherwise, track any ambiguous tokens in the current tree
    7. After all tokens are in nested trees:
      1. start at the root
      2. disambiguate the tokens
      3. sort the list of tokens
      4. resolve the remaining tokens
      5. disambiguate the tokens (yes, again)
      6. relay the tokens to the document parser
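
    Step 6 can be illustrated with a small Python sketch. The token names (`open_double`, `close_single`, `ambiguous_single`, and so on) and the `Node` class are assumptions made for this example, as is the decision to keep an unmatched closing token as ambiguous rather than discard it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    opener: Optional[str] = None  # token that opened this level; None for the root
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    ambiguous: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    closed: bool = False

def build_tree(tokens: List[str]) -> Node:
    """Nest quotation tokens into a tree, one level per opening quote."""
    root = Node()
    current = root
    for tok in tokens:
        if tok.startswith("open_"):
            # 6.1: an opening quote starts a new nested level.
            child = Node(opener=tok, parent=current)
            current.children.append(child)
            current = child
        elif tok.startswith("close_") and current.opener == tok.replace("close_", "open_", 1):
            # 6.2: a matching closing quote ends the current level.
            current.closed = True
            current = current.parent
        else:
            # 6.3: everything else (including unmatched closers) is tracked here.
            current.ambiguous.append(tok)
    return root

root = build_tree(
    ["open_double", "open_single", "ambiguous_single", "close_single", "close_double"]
)
inner = root.children[0].children[0]
print(inner.opener, inner.ambiguous, inner.closed)
```

    A level that is never closed (for example, a multi-paragraph quotation) simply stays at `closed=False`, which a later pass can inspect.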

    Disambiguating entails replacing ambiguous quotation marks with resolvable equivalents. Basically, you need to count the number of ambiguous leading, lagging, and indeterminate single quotes. Based on whether the current level of the tree already contains some combination of leading/lagging quotes, you can ascertain whether the ambiguous quote is a closing single quote, an opening quote, or an apostrophe.
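
    A simplified Python sketch of that counting idea follows. The specific rules are assumptions chosen to keep the example short, not the exact rules KeenQuotes applies: within one level, an ambiguous lagging quote becomes an apostrophe if no opening single quote is waiting to be closed, and becomes the closing quote if it is the only candidate; leading quotes are handled symmetrically. The token names are hypothetical.

```python
from collections import Counter
from typing import List

def disambiguate(level: List[str]) -> List[str]:
    """Resolve ambiguous single-quote tokens within one level of the tree."""
    counts = Counter(level)
    # Positive: an opening single quote still needs a partner on this level.
    unmatched_open = counts["open_single"] - counts["close_single"]
    unmatched_close = -unmatched_open
    resolved = []
    for tok in level:
        if tok == "ambiguous_lagging":
            if unmatched_open <= 0:
                tok = "apostrophe"      # nothing to close, e.g. dogs'
            elif counts["ambiguous_lagging"] == 1:
                tok = "close_single"    # the only candidate closer
        elif tok == "ambiguous_leading":
            if unmatched_close <= 0:
                tok = "apostrophe"      # nothing to open, e.g. '70s or 'tis
            elif counts["ambiguous_leading"] == 1:
                tok = "open_single"     # the only candidate opener
        resolved.append(tok)            # otherwise left ambiguous
    return resolved

print(disambiguate(["open_single", "ambiguous_leading", "ambiguous_lagging"]))
# ['open_single', 'apostrophe', 'close_single']
```

    Tokens that stay ambiguous after a pass (say, two lagging candidates for one opening quote) are the ones that would need a further pass with more context, or a human decision.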

    It's not a trivial algorithm, as it can require:

    • A circular buffer
    • A lexer (tokenizer)
    • A parser (emitter)
    • A resolver (ambiguities)
    • A tree
    • A set of contractions (ambiguous and unambiguous)

    Here are some screenshots of KeenQuotes, which is integrated into my text editor, KeenWrite:

    [Screenshot: keenquotes 01]

    Nit: It's '70s, not '70's because decades cannot possess anything.

    [Screenshot: keenquotes 02]