The news rattling the bars of the Voynich research cage loudest right now is surely the publication of a paper by Marcelo Montemurro and Damián H. Zanette called “Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis”, deftly summarized in New Scientist as “New signs of language surface in mystery Voynich text”.
M&Z’s abstract brings out a lot of what they were trying to do – and also points exactly to their mistakes.
“Here we analyse the long-range structure of the manuscript using methods from information theory. We show that the Voynich manuscript presents a complex organization in the distribution of words that is compatible with those found in real language sequences. We are also able to extract some of the most significant semantic word-networks in the text. These results together with some previously known statistical features of the Voynich manuscript, give support to the presence of a genuine message inside the book.”
Central Assumption: the authors implicitly hypothesize that they can get meaningful results for long-range comparisons because Voynichese is homogeneous across all its sections.
…The Problem: this assumption is false (or very nearly so), because there are significant macro-level differences in the way the language in different sections works (Currier A, Currier B, labels) as well as many mid-level differences (Herbal-A, Q13-ese, etc).
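As a rough illustration of how the homogeneity assumption could be checked, the sketch below compares the word-frequency distributions of two sections using the Jensen-Shannon divergence. This is not M&Z's method, merely one standard way to quantify how differently two samples use their vocabulary; the function name and the toy token lists are invented for the example and are not drawn from any real transcription.

```python
import math
from collections import Counter


def js_divergence(tokens_a, tokens_b):
    """Jensen-Shannon divergence (in bits) between two word-frequency
    distributions. 0.0 means identical usage, 1.0 means no shared words."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one token")

    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a      # relative frequency in sample A
        q = counts_b[word] / total_b      # relative frequency in sample B
        m = 0.5 * (p + q)                 # mixture of the two distributions
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy stand-ins for two sections (invented tokens, NOT a real transcription).
section_a = ["aaa", "aab", "aaa", "abb"]
section_b = ["bba", "bbb", "aaa", "bbb"]
print(js_divergence(section_a, section_b))
```

A high value between, say, Currier A and Currier B pages would only show that the sections differ, not why: ordinary texts also diverge when their subject matter changes, so the number needs a baseline before it means much.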
Central Conclusion: the authors believe that their language-centric statistical machinery has identified “The thirty most informative words in the Voynich manuscript”.
…The Problem: I’m pretty sure that the authors have in fact identified what are arguably the thirty least informative words in the Voynich Manuscript. (That may be an independently useful result, but it’s probably not what they were hoping for.)
Voynichese is extremely predictable at a letter-level: it has many rigid letter-level adjacency rules (‘4’ is almost always followed by ‘o’, etc) and position rules (4o- is consistently word-initial, -89 is consistently word-final, etc) and a high level of letter-context predictability.
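One simple way to put a number on this kind of letter-context predictability is the conditional entropy of the next letter given the current one, counted over adjacent letter pairs inside words. The sketch below is a minimal version of that idea; the function name is made up for illustration, and the sample words are invented strings rather than a real transcription.

```python
import math
from collections import Counter


def conditional_letter_entropy(words):
    """Return H(next letter | current letter) in bits, estimated from
    adjacent letter pairs within each word. Lower means more predictable."""
    pair_counts = Counter()     # how often letter a is followed by letter b
    first_counts = Counter()    # how often letter a is followed by anything
    for word in words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1

    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        return 0.0

    entropy = 0.0
    for (a, _b), n in pair_counts.items():
        p_pair = n / total_pairs            # joint probability of the pair
        p_next_given_a = n / first_counts[a]  # conditional probability
        entropy -= p_pair * math.log2(p_next_given_a)
    return entropy


# Invented examples: a rigid "4 is always followed by o" pattern versus
# words whose second letter varies freely.
rigid = ["4oa", "4oa", "4oa"]
loose = ["4oa", "4ao", "4ea"]
print(conditional_letter_entropy(rigid), conditional_letter_entropy(loose))
```

A rule such as "‘4’ is almost always followed by ‘o’" shows up here as a near-zero contribution from every pair that starts with ‘4’. Position rules (word-initial, word-final) are not captured by this measure and would need explicit start and end markers.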
Yet at the same time, it also has a very large dictionary relative to its text size. I often criticize Gordon Rugg for suggesting historically incorrect Cardan grille-like tables (i.e. they’re a century too late for the Voynich’s construction dating) and for inappropriately back-projecting his modern CompSci mindset onto the early Renaissance (i.e. it’s 500+ years too early for the kind of table-driven hackery he proposes). However, he is absolutely right that a reconstructed Voynichese “dictionary” would, to a modern computer scientist’s eyes, look very much as if it had been generated or permuted by some means.
The paradox is therefore that these two apparently opposite aspects of Voynichese are able to coexist: how on earth can we reconcile its letter rigidity & predictability with its wild word variability?
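The "large dictionary relative to text size" half of the paradox can be made concrete by tracking how many distinct words have been seen as the text is read from start to finish. The sketch below records that vocabulary growth curve; it is an illustrative helper with an invented name and toy input, not a measurement taken from the manuscript.

```python
def vocabulary_growth(tokens, step=1000):
    """Return a list of (tokens_read, distinct_words_seen) pairs, sampled
    every `step` tokens and once more at the end of the text."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Toy input (invented tokens, NOT a real transcription).
sample = ["aaa", "aab", "aaa", "abb", "bbb"]
print(vocabulary_growth(sample, step=2))
```

The ratio of distinct words to total words shrinks as any text gets longer, so curves are only comparable at the same number of tokens read. A text with "wild word variability" would keep adding new words at a point where a typical natural-language text of the same length has started to flatten out.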
I think the key to resolving this is to grasp that there is some kind of generative or confounding principle at work within a rigidly predictable framework. That is, that even though there are lots of rules, these rules act as a kind of “container” for semantic or cryptographic variability to exist within.
Hence I believe that Montemurro and Zanette’s statistical machinery is identifying “words” that fall within the container layer rather than in the confounded content layer, which is why these are arguably the thirty least informative words in the Voynich Manuscript.
It’s a hard point to understand, let alone accept: the confounding trick (some kind of transposition cipher? some kind of paper cipher machinery? some kind of cipher wheel?) driving Voynichese’s inherent variability remains as profoundly unreachable now as it has been for over 500 years.
My apologies to Montemurro and Zanette, but the central challenge we face isn’t to find new language-based statistical tests to apply to the Voynichese corpus, however clever they may be. Rather, it is to find ways of resolving the Voynich Manuscript’s central paradox: how is it that Voynichese is both letter-rigid and word-variable at the same time?
Incidentally, M&Z conclude in their paper that their results point to a semantic link between the Recipe and Astro sections, and between the Herbal and Pharma sections. Actually, had they been more aware of the codicological analyses that have been done, they would have seen that their results are consistent with the writing phase order.
In fact, there are many indications that what I call Voynichese’s ‘container’ layer above evolved during the writing, with the most obvious evolution being between Currier A and Currier B. I suspect that what their statistical machinery has imperfectly captured is therefore simply a snapshot into the evolution of the container layer, and not anything ‘semantic’ as such.
In short, the aspect of Voynichese that is most nearly homogeneous across all its sections is its “container” layer: so what Montemurro and Zanette have done is make long-range comparisons between evolutions of the container layer. Currently, my best guess is that the words they have flagged are likely to be almost entirely cipher system meta-tokens (shorthand tokens, transposition cipher placeholders, etc) rather than the semantic contents, which appear instead to have been confounded by some means.
So, rather than finding a “genuine message” (as New Scientist put it), perhaps they have instead found a “genuine container” for the message? This may prove to be a very useful result in its own right, but it’s probably not the smoking gun linguistic proof they were hoping to use to discredit Rugg’s tables.