An affix can be a prefix (before the word), infix (in the middle of a word), or suffix (after the word). I have a list of 200k+ Latin/Greek names used in biological taxonomy. It turns out there is, unfortunately, no centralized list of all the affixes used in the taxonomy, other than this very basic list.
The question is: how can I take that 200k+ list of Latin/Greek names and divide it into a list of affixes (ideally using just plain JavaScript)?
I don't really know where to begin on this one. If I construct a trie, I somehow need to test for specific chunks of words instead. Or, if a chunk can be extended, not include the chunk until we reach a final extension of some sort...
const fs = require('fs')

// One name per line in the source file.
const words = fs.readFileSync(`/Users/lancepollard/Downloads/all.csv`, 'utf-8').trim().split(/\n+/)

const trie = { children: {} }
words.forEach(word => addToTrie(trie, word))

// Insert a word letter by letter, creating child nodes as needed.
function addToTrie(trie, word) {
  let letters = word.trim().split('')
  let node = trie
  let i = 0
  while (i < letters.length) {
    let letter = letters[i++]
    node = node.children[letter] = node.children[letter] || { children: {} }
  }
  node.isWord = true
}
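For illustration only, here is one direction the trie idea could go (a rough sketch; collectPrefixCandidates is a made-up name, and it only covers prefixes): walk the trie and emit the path to every node where the branch splits, since such a chunk is shared by at least two words and cannot be extended without dropping one of them.

// Sketch: collect prefix candidates from the trie built above.
// A node with more than one child marks the end of a longest common prefix.
function collectPrefixCandidates(node, path = '', out = []) {
  const letters = Object.keys(node.children)
  if (path && letters.length > 1) out.push(path)
  letters.forEach(letter => collectPrefixCandidates(node.children[letter], path + letter, out))
  return out
}

// e.g. a trie built from abrogati, abrowendi, apistal, ariavi yields ['a', 'abro']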
It doesn't need to be exact, in the sense that each affix actually means something; it can be dirty (some of the chunks will mean something, some won't). But it shouldn't just list every permutation of a word's letters. It should include "potential affix candidates": chunks which appear more than once in the list. That will at least get me partway there, and I can then manually go through and look up the definitions for each of these "chunks". Ideally, it should also tell whether each one is a prefix/infix/suffix. Maybe the output is CSV in the format affix,position.
You can get creative in how this is solved, as without knowing a list of possible affixes in advance, we don't know what the exact output should be. This is basically to try and find the affixes as best as possible. If it includes things like aa- as a prefix, for example, which is probably a common sequence of letters yet I don't think is an affix, that is fine with me; it can be filtered out manually. But if there are two words (I am making this up), say abrogati and abrowendi, then abro would be a "common prefix", and that should be included in the final list, not abr, ab, and a, even though those are common too. Basically, the longest common prefix. However, if we have the words apistal and ariavi, we could say that a is a common prefix, so our final list would include a and abro.
To go into slightly more detail, say we have these two words aprineyanilantli and aboneyanomantli: they have the common prefix a- and the common suffix -antli, as well as the infix -neyan-, so those should all be in the final list.
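To sketch what I mean at the level of a single pair of words (commonPrefix and commonSuffix are just illustrative helpers, not something I have built):

// Rough sketch: the shared edges of two words.
const commonPrefix = (a, b) => {
  let i = 0
  while (i < a.length && i < b.length && a[i] === b[i]) i++
  return a.slice(0, i)
}

const commonSuffix = (a, b) => {
  let i = 0
  while (i < a.length && i < b.length && a[a.length - 1 - i] === b[b.length - 1 - i]) i++
  return a.slice(a.length - i)
}

commonPrefix('aprineyanilantli', 'aboneyanomantli') // 'a'
commonSuffix('aprineyanilantli', 'aboneyanomantli') // 'antli'

A shared infix like -neyan- is harder to spot this way; it would need some kind of substring comparison, which is part of what I don't know how to approach.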
It doesn't necessarily need to be efficient, as this is in theory only going to run once, on the 200k+ list. But if it is efficient as well, that would be a bonus. Ideally, though, it shouldn't take hours to run, though I am not sure what's possible :)
Another example is this:
brevidentata
brevidentatum
brevidentatus
crassidentata
crassidentatum
crassidentatus
Here, the first 3 have a common prefix, brevidentat, and words 2 and 3 have the even longer common prefix brevidentatu. But later (with human knowledge), we find that identat is probably the infix we desire, and a/um/us are word-form suffixes. Also, we see that identat is an infix in the two words crass... and brev... So the end result should be:
brev-
crass-
-identat-
-a
-us
-um
That, in theory, would be the ideal outcome. But you could also have this:
brev-
crass-
-identat-
-identata
-identatus
-identatum
That would also work, and we could do some simple filtering to filter those out later.
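For example (just a sketch; the exact rule would no doubt need tweaking), any candidate that merely extends a shorter candidate already in the list could be dropped:

// Illustrative filtering: drop candidates that only extend another candidate.
const candidates = ['brev-', 'crass-', '-identat-', '-identata', '-identatus', '-identatum']

const core = s => s.replace(/^-|-$/g, '') // strip the marker hyphens

const filtered = candidates.filter(c =>
  !candidates.some(other =>
    other !== c && core(c) !== core(other) && core(c).startsWith(core(other))))

console.log(filtered) // ['brev-', 'crass-', '-identat-']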
Note, I don't care about infixes in the sense of word parts that surround something else, like stufffoo...barstuff, where foo...bar wraps something. I just care about the word parts which are repeated, such as prefixes, suffixes, and stuff in the middle of words.
This is an interesting problem, and I have a sketch of a solution, with runnable code and somewhat reasonable -- but far from perfect -- output. It's easy, if not quick, to play with variants.
The idea is to first run through all the words, splitting them in every possible way, then to count the appearances of each prefix, infix, and suffix across all the words, and finally to use that information, together with a scoring function, to choose the best representation of each word.
The scoring functions I've tested involve combinations of the length of the prefix, the count of that prefix across all words, and the same factors for the suffix and infix. Generally I weigh the lengths much more heavily than the counts, and for now I focus on the prefixes and only slightly weigh the suffixes.
Running this takes a handful of minutes, but requires more memory than Node gets by default. I run it as
node --max-old-space-size=8192 index
and that seems to be enough. I haven't tried it with 4GB.
My code looks like this, with the most recent (and so far my favorite) scoring function:
const {readFile, writeFile} = require ('fs') .promises
const range = (lo, hi) =>
Array .from ({length: hi - lo}, (_, i) => i + lo)
const chooseTwo = (n) =>
range (0, n) .flatMap (i => range (i + 1, n + 1) .map (j => [i, j]))
const maximumBy = (fn) => (xs) =>
xs .reduce ((max, x) => {
const xScore = fn (x);
return xScore > max .score ? {score: xScore, val: x} : max;
}, {score: -Infinity}) .val
const breakdown = (word) => {
const len = word.length;
const ranges = chooseTwo (len);
return [
... ranges .map (([i, j]) => ({p: word .slice (0, i), i: word .slice (i, j), s: word .slice (j)})),
... range (0, len - 1) .map (i => ({p: '', i: word .slice (0, i), s: word .slice (i)})),
];
}
const score = (counts) => ({p, i, s}) =>
Math .max (1, Math .sqrt (1 + counts .prefixes [p]) * p .length ** 2) *
// Math .max (1, counts .infixes [i] * i .length ) *
Math .max (1, counts .suffixes [s] * s .length)
const process = (words) => {
const breakdowns = words .map (breakdown)
const counts = breakdowns .reduce (
(all, breakdown) => breakdown .reduce (
(all, {p, i, s}) => {
all .prefixes [p] = (all .prefixes [p] || 0) + 1;
all .infixes [i] = (all .infixes [i] || 0) + 1;
all .suffixes [s] = (all .suffixes [s] || 0) + 1;
return all;
},
all
),
{prefixes: {}, infixes: {}, suffixes: {}}
)
return breakdowns .map (maximumBy (score (counts)))
}
readFile ('./all.csv', 'utf8')
.then (s => s.split ('\n'))
.then (process)
.then (breakdowns => breakdowns .map (({p, i, s}) => `${p ? `(${p}-)` : ''}(${i})${s ? `(-${s})` : ''}`))
  .then (words => writeFile ('./res.csv', words .join ('\n'), 'utf8'))
.then (() => console .log ('Result written'))
The first important function is breakdown, which, for instance, turns 'horse' into:
(h)(-orse)
(ho)(-rse)
(hor)(-se)
(hors)(-e)
(horse)
(h-)(o)(-rse)
(h-)(or)(-se)
(h-)(ors)(-e)
(h-)(orse)
(ho-)(r)(-se)
(ho-)(rs)(-e)
(ho-)(rse)
(hor-)(s)(-e)
(hor-)(se)
(hors-)(e)
()(-horse)
(h)(-orse)
(ho)(-rse)
(hor)(-se)
(h-)(orse)
(ho-)(rse)
(hor-)(se)
(hors-)(e)
which is stored internally with p, i, and s properties, standing for prefix, infix, and suffix, so it actually looks like this:
[
{p: '', i: 'h', s: 'orse'},
{p: '', i: 'ho', s: 'rse'},
{p: '', i: 'hor', s: 'se'},
{p: '', i: 'hors', s: 'e'},
{p: '', i: 'horse', s: ''},
{p: 'h', i: 'o', s: 'rse'},
{p: 'h', i: 'or', s: 'se'},
{p: 'h', i: 'ors', s: 'e'},
{p: 'h', i: 'orse', s: ''},
{p: 'ho', i: 'r', s: 'se'},
{p: 'ho', i: 'rs', s: 'e'},
{p: 'ho', i: 'rse', s: ''},
{p: 'hor', i: 's', s: 'e'},
{p: 'hor', i: 'se', s: ''},
{p: 'hors', i: 'e', s: ''},
{p: '', i: '', s: 'horse'},
{p: '', i: 'h', s: 'orse'},
{p: '', i: 'ho', s: 'rse'},
{p: '', i: 'hor', s: 'se'},
{p: 'h', i: 'orse', s: ''},
{p: 'ho', i: 'rse', s: ''},
{p: 'hor', i: 'se', s: ''},
{p: 'hors', i: 'e', s: ''},
]
breakdown is built on two trivial helper functions: range creates an integer range, inclusive at the start and exclusive at the end, so that range (3, 12) yields [3, 4, 5, 6, 7, 8, 9, 10, 11]; and chooseTwo finds all pairs of distinct integers between 0 and n.
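To make those concrete, with the definitions above:

range (0, 4)   //=> [0, 1, 2, 3]
chooseTwo (3)  //=> [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]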
Our second main function is process, which carries out the algorithm described above using breakdown and maximumBy; the latter chooses the highest-scoring breakdown according to the score function. In between, we simply count the parts used.
This is all infrastructure. The important work is in score. You can alter this in so many ways. If it weren't holiday time, I would love to play around with variants of this. But when you do, you should note that although it's easy to play with a small subset of the data and get reasonable-looking results, that doesn't always scale so reasonably to the complete data. So you will need to run the full code with various functions.
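As one example of the sort of variant I mean (a sketch I haven't tuned or tested at full scale, and score2 is simply my label for it), the commented-out infix term could be re-enabled, with all the counts dampened logarithmically so that the lengths dominate even more:

const score2 = (counts) => ({p, i, s}) =>
  Math .max (1, Math .log2 (2 + counts .prefixes [p]) * p .length ** 2) *
  Math .max (1, Math .log2 (2 + counts .infixes [i]) * i .length) *
  Math .max (1, Math .log2 (2 + counts .suffixes [s]) * s .length)

Trying it out is just a matter of calling maximumBy (score2 (counts)) instead of maximumBy (score (counts)) inside process.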
One thing I would suggest investigating is whether there is a reasonably accurate predictive hyphenating tool for English -- not dictionary-based, but either the result of reasonable first principles or of some machine learning runs. A good hyphenation decision might help you write a better score function.
If you want to see this in action on a small subset of your data, you can run the following snippet:
const range = (lo, hi) =>
Array .from ({length: hi - lo}, (_, i) => i + lo)
const chooseTwo = (n) =>
range (0, n) .flatMap (i => range (i + 1, n + 1) .map (j => [i, j]))
const maximumBy = (fn) => (xs) =>
xs .reduce ((max, x) => {
const xScore = fn (x);
return xScore > max .score ? {score: xScore, val: x} : max;
}, {score: -Infinity}) .val
const breakdown = (word) => {
const len = word.length;
const ranges = chooseTwo (len);
return [
... ranges .map (([i, j]) => ({p: word .slice (0, i), i: word .slice (i, j), s: word .slice (j)})),
... range (0, len - 1) .map (i => ({p: '', i: word .slice (0, i), s: word .slice (i)})),
];
}
const score = (counts) => ({p, i, s}) =>
Math .max (1, Math .sqrt (1 + counts .prefixes [p]) * p .length ** 2) *
// Math .max (1, counts .infixes [i] * i .length ) *
Math .max (1, counts .suffixes [s] * s .length)
const process = (words) => {
const breakdowns = words .map (breakdown)
const counts = breakdowns .reduce (
(all, breakdown) => breakdown .reduce (
(all, {p, i, s}) => {
all .prefixes [p] = (all .prefixes [p] || 0) + 1;
all .infixes [i] = (all .infixes [i] || 0) + 1;
all .suffixes [s] = (all .suffixes [s] || 0) + 1;
return all;
},
all
),
{prefixes: {}, infixes: {}, suffixes: {}}
)
return breakdowns .map (maximumBy (score (counts)))
}
const words = ["cristata", "cristatella", "cristatellidae", "cristatellus", "cristaticeps", "cristaticollis", "cristatiforme", "cristatifrons", "cristatigena", "cristatipes", "cristatispinosa", "cristatissimus", "cristatogobius", "cristatoides", "cristatolabra", "cristatopalpus", "cristatula", "cristatum", "cristatus", "cristavarius", "cristellaria", "cristeremaeus", "cristi", "cristianalemani", "cristiani", "cristibrachium", "cristicauda", "cristiceps", "cristicola", "cristicollis", "cristidigitus", "cristifer", "cristifera", "cristiferus", "cristiformis", "cristifrons", "cristigera", "cristiglans", "cristiloba", "cristimanus", "cristina", "cristinae", "cristipalpis", "cristipes", "cristirhizophorum", "cristis", "cristispira", "cristiverpa", "cristobal", "cristobala", "cristobalensis", "cristobalia", "cristoides", "cristonothrus", "cristophylla", "cristovalensis", "cristovaoi", "cristula", "cristulata", "cristulatum", "cristulatus", "cristuliflora", "cristulifrons", "cristulipes", "cristulum", "cristus", "crisulipora", "critchleyi", "critesion", "crithagra", "crithionina", "crithmifolia", "crithmoides", "critho", "crithodium", "crithopyrum", "critica", "criticum", "criticus", "critola", "critolaus", "critomolgus", "criton", "critonia", "crittersius", "crius", "crivellarii", "crnobog", "crnri", "croasdaleae", "croatanensis", "croatania", "croatanica", "croatica", "croaticum", "croaticus", "croatii", "crobylophorus", "crobylura", "crocaceae", "crocale", "crocallata", "crocallis", "crocana", "crocanthemum", "crocata", "crocatum", "crocatus", "crocea", "croceareolata", "crocearia", "croceata", "croceater", "croceator", "croceatus", "croceguttatus", "croceibacter", "croceicauda", "croceicincta", "croceicoccus", "croceicollis", "croceicornis", "croceiflorus", "croceipennis", "croceipes", "croceitalea", "croceitarsis", "croceithorax", "croceiventre", "croceiventris", "croceoida", "croceoides", "croceoinguinis", "croceola", "croceolanata", "croceomaculatus", "croceopodes", "croceosignatus", "croceovittata", "croceovittatus", "croces", "croceum", "croceus", "croci", "crociaeus", "crocias", "crocidema", "crocidium", "crocidolomiae", "crocidopoma", "crocidura", "crocidurae", "crocidurai", "crocidurinae", "crociduroides", "crocidurus", "crocifera", "crocigrapha", "crocina", "crocinae", "crocineus", "crocinitomix", "crocinopterus", "crocinosoma", "crocinubia", "crocinum", "crocinus", "crocisa", "crocisaeformis", "crockerella", "crockeri", "crockeria", "crockeriana", "crockerinus", "crockettorum", "crococephala", "crocodila", "crocodilensis", "crocodili", "crocodilia", "crocodilichthys", "crocodilinus", "crocodill", "crocodillicola", "crocodilorum", "crocodilosa", "crocodilurus", "crocodilus", "crocodyli", "crocodylia", "crocodylidae", "crocodylus", "crocogaster", "crocolita", "croconota", "croconotus", "crocopeplus", "crocopygia", "crocopygius", "crocorrhoa", "crocosema", "crocosmia", "crocosmiiflora", "crocostethus", "crocota", "crocothemis", "crocotia", "crocotila", "crocoturum", "crocotus", "crocro", "crocus", "crocusella", "crocuta", "crocutasis", "crocutella", "crocynia", "crocyniaceae", "croeciclava", "croeseri", "croesia", "croesioides", "croesus", "croftia", "croftiae", "croftii", "croftoni", "croftus", "crogmaniana", "croicensis", "croilia", "croisseti", "croix", "croizati", "croizatii", "crokeri", "cromagnonensis", "crombiei", "crombota", "cromeria", "cromerus", "cromileptes", "cromion", "cromis", "cromwellii", "cromyorhizon", "cronadun", "cronartiaceae", "cronartium", "cronebergi", 
"cronebergii", "croni"]
Promise .resolve (words)
.then (process)
.then (breakdowns => breakdowns .map (({p, i, s}) => `${p ? `(${p}-)` : ''}(${i})${s ? `(-${s})` : ''}`))
.then (words => console .log (words .join ('\n')))
The format I use to display these is slightly different from the one suggested, as I wanted to allow for versions without prefixes or without suffixes but still be quite readable and unambiguous. Thus (crist-)(atellid)(-ae) should be quite clear. Each of the three sections is surrounded by parentheses; the prefix ends with a hyphen and the suffix begins with one. This is the format in the output file, but it would be trivial to change that -- just adjust the function supplied to breakdowns .map () in the last block.
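For instance, if you wanted output closer to the affix,position CSV suggested in the question, the mapping could be replaced by something along these lines (a sketch only; toCsvRows is my own name, and deduplication and filtering of rare parts are left out):

// Sketch of an alternative formatter: one affix,position row per non-empty part.
// Note that an i with an empty p is really at the start of the word, and one
// with an empty s is really at the end.
const toCsvRows = (breakdowns) =>
  breakdowns .flatMap (({p, i, s}) => [
    ... (p ? [`${p},prefix`] : []),
    ... (i ? [`${i},infix`]  : []),
    ... (s ? [`${s},suffix`] : []),
  ])

// and then, in the pipeline:
//   .then (toCsvRows)
//   .then (rows => writeFile ('./res.csv', rows .join ('\n'), 'utf8'))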
A fascinating problem, and I hope I get some time next week to look at it more carefully.