Post by Rob Hicks on Apr 24, 2014 4:32:03 GMT
The final version of the transcription alphabet and protocols can now be found below: CONTRAPTION.pdf (836.17 KB). Any further changes to this document will be made simply to clean up explanations or fix typos. Rob
job
New Member
Posts: 2
Post by job on Apr 28, 2014 9:52:32 GMT
Rob, your proposal for a consensus-driven transcription makes sense to me. I've recently generated a data set describing the coordinates and dimensions of each word in the manuscript, and was thinking along the same lines of providing some form of crowdsourced transcription. You can see the project site here: code.google.com/p/voynichese/

It includes an EVA-based query tool that operates on the spatial data and renders results on top of the folio images. The following page describes the data set, which is also available for download: code.google.com/p/voynichese/wiki/DataSets

We can use this data to assemble a transcription tool that presents the user with the image of a single word and captures the provided transcription - almost like a captcha. The word images would look more or less like this:

The intent is to show some of the surrounding context while still maintaining the user's focus on the target word. Note that this would use the word boundaries provided by existing EVA transcriptions. We would provide a separate mode for specifying word boundaries - which could operate via a voting system.

The user would enter the transcription by clicking a predefined set of symbols, or pressing the corresponding key - the available EVA fonts make this pretty easy. Given that we have an EVA transcription for each word area, we would also be able to generate and pre-populate a transcription value that matches the new transcription alphabet - which the user can accept or modify.

Any thoughts?
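The cropping step for those word images could be as simple as the sketch below, assuming the data set provides a pixel bounding box (x, y, width, height) per word on a folio scan. The function name, the margin value, and the data shapes are my own illustration, not anything from the actual project:

```python
from PIL import Image

def crop_word(folio_path, word_box, margin=40):
    """Crop a word plus some surrounding context from a folio scan.

    word_box is (x, y, width, height) in pixels, as a per-word
    coordinate data set might provide. The margin shows context
    around the target word without losing focus on it.
    """
    x, y, w, h = word_box
    img = Image.open(folio_path)
    left = max(0, x - margin)
    top = max(0, y - margin)
    right = min(img.width, x + w + margin)
    bottom = min(img.height, y + h + margin)
    return img.crop((left, top, right, bottom))
```

Clamping to the folio edges means words near a margin still produce a valid crop, just with less context on that side.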
Post by Rob Hicks on Apr 28, 2014 15:42:19 GMT
Wow, that looks to be a phenomenal resource! I'll have a good look at it tomorrow, but I can immediately see some fantastic ways that this could be used. Fine work!
Rob
Post by Rob Hicks on Apr 29, 2014 6:37:25 GMT
Job, I've now taken some time to explore your wonderful resource, and I have to say it is extremely impressive. Of course, its value depends somewhat on the quality of the underlying transcription, which is exactly the issue this project is trying to resolve. Can I ask which transcription you have used to produce the data sets for your resource?
A few ideas:
1. I wonder if the resource could be adapted to include queries on a sentence level, with spaces (and uncertain spaces) used as wildcards? One of the most contentious issues in transcriptions is the position of word breaks.
2. Does the data set cover the rosette foldout?
3. If this system is used to crowd-source transcriptions, it needs to be able to generate a machine-readable text including all transcription contributions and a set of 'best guess' transcriptions.
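The 'best guess' part of point 3 could be as simple as a per-word majority vote over the crowd's contributions. A minimal sketch - the data shapes and function name are my own assumption, not anything from the proposal:

```python
from collections import Counter

def best_guess(contributions):
    """Pick the most frequent transcription for each word id.

    contributions: dict mapping a word id to the list of transcription
    strings submitted by different volunteers for that word.
    Returns a dict mapping word id -> (best reading, agreement ratio),
    so low-agreement words can be flagged for review.
    """
    result = {}
    for word_id, readings in contributions.items():
        counts = Counter(readings)
        best, n = counts.most_common(1)[0]
        result[word_id] = (best, n / len(readings))
    return result
```

Keeping the agreement ratio alongside the winning reading means the machine-readable output can carry both the best guess and a measure of how contested it is.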
This is truly amazing work. Is it widely known? I keep my ear to the Voynich ground and must have missed it! Rob
Post by job on Apr 29, 2014 14:07:24 GMT
Rob, I've used the EVA transcription by Takeshi Takahashi, obtained from the VIB. Here are some answers to your other questions:

1. The query tool currently operates at the word level for a few reasons, one of which is that we don't have coordinates and dimensions for individual characters, so result overlays for partial word matches would be off. We could estimate character positions from the current data, but I'd like to try to get definite values before taking that route. I'm exploring some options to extract individual characters. Here's a recent attempt using simple blob detection:

If we can extract a sufficient number of character blobs, we can use them as training data for a proper OCR library, like Tesseract, to extract the rest.

2. The EVA transcription I started with did not include the rosette page, so I don't currently have data sets for it; it's on my to-do list. I'll also need to think about how to properly display that folio. For example, it might need to be broken into six folios or have its own section - this is also an issue with a few other foldouts listed on this page.

3. That should be fine, using a specification like the one you are currently producing. We'd just need to ensure that the input page can properly capture information such as uncertainty and alternative character values.

BTW, I completely agree that word boundaries are particularly error-prone; I've seen quite a few examples of that. A process as simple as up-voting/down-voting word breaks could improve that data considerably.
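For anyone unfamiliar with the blob-detection idea in point 1: a real pipeline would use an image-processing library, but the core is just connected-component labelling on a binarised image. Here is a tiny pure-Python sketch of that idea (function name, grid representation and min_size threshold are my own illustration):

```python
def find_blobs(binary, min_size=3):
    """Minimal connected-component (blob) finder on a binary grid.

    binary: list of rows of 0/1 values (1 = ink pixel). Returns a
    bounding box (min_row, min_col, max_row, max_col) for each blob
    with at least min_size pixels, using 4-connectivity.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not seen[r][c]:
                # Flood-fill this blob with an explicit stack.
                stack, pixels = [(r, c)], []
                seen[r][c] = True
                while stack:
                    pr, pc = stack.pop()
                    pixels.append((pr, pc))
                    for nr, nc in ((pr - 1, pc), (pr + 1, pc),
                                   (pr, pc - 1), (pr, pc + 1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and binary[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            stack.append((nr, nc))
                if len(pixels) >= min_size:
                    rs = [p[0] for p in pixels]
                    cs = [p[1] for p in pixels]
                    boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes
```

The min_size filter is what discards specks of noise; each surviving bounding box would become a candidate character image for OCR training.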
bc
New Member
Posts: 1
Post by bc on Jul 8, 2014 10:29:26 GMT
Hello Rob, I've read your PDF file. It's an interesting idea. Here are my thoughts.

Is there an abbreviation for the CONTRAPTION name?

Are you aware that there are more modern sources than just the high-res color scans? For example, UV scans of f1r and f17r.

CONTRAPTION is very comprehensive, covering e.g. certainty and ligatures. Theoretically that's good, but practically the number of notations to learn might scare off volunteers.

Looks like great minds think alike - I was directed here as a result of my thoughts at briancham1994.wordpress.com/2014/07/06/review-of-the-voynich-manuscript-transcription-file/ . I think your proposal on romanization and Job's idea on crowdsourcing are quite developed. But as for structure and data, I hold that XML is the way to go. It is easily readable by both humans and computers, the tags describe themselves, it is already standardized, it is already common, and it already deals with the structures directly.

As mentioned near the end of my review, it becomes trivial to specify and add data without coming up with new notations. For example, I hadn't considered weirdos, color data and circle bearings until I read your PDF. If not using EVA, weirdos could be implemented intuitively in a self-closing tag like <weirdo shape_id="004" />. If someone wanted to include things not considered in the original XML (color data and circle bearings), they could make their own XML document including only the relevant units, like this:

<additions>
  <line id="f1r-p1-l2" color="red" />
  <line id="f1r-p3-l3" color="red" />
  <line id="f54v-p1-l2" start="NE" />
</additions>

(assume these are the only lines to write about)

Then, because the ids match between the original and new files, and because they both use XML, it would be trivial to automatically merge the new attributes into the old file. This makes it incredibly forward-compatible without specifying or reserving anything beforehand.
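The merge-by-id step really is trivial with a stock XML library. A sketch with Python's xml.etree.ElementTree, using the same illustrative tag and attribute names as the example above (they are not a fixed schema):

```python
import xml.etree.ElementTree as ET

def merge_additions(base_xml, additions_xml):
    """Merge attributes from an <additions> document into a base
    document, matching elements by their id attribute.

    Attributes on a matching <additions> child (other than id itself)
    are copied onto the element with the same id in the base document.
    """
    base = ET.fromstring(base_xml)
    additions = ET.fromstring(additions_xml)
    # Index every id-bearing element in the base document.
    by_id = {el.get("id"): el for el in base.iter() if el.get("id")}
    for extra in additions:
        target = by_id.get(extra.get("id"))
        if target is not None:
            for key, value in extra.attrib.items():
                if key != "id":
                    target.set(key, value)
    return ET.tostring(base, encoding="unicode")
```

Elements in the base file that have no matching addition pass through untouched, which is exactly the forward-compatibility property described above.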
I'm rambling about this synchronicity because I think our proposals can be merged with good discussion. Between your transcription, my data structure and Job's crowdsourcing site, we might have a lovely trifecta. Let me know what you think.

On a slightly related note, a few months ago I think I finally figured out the elusive rule system for making Voynichese words. I haven't had time to properly verify this with a computer, but looking through images, I have never found anything contrary. I'm mentioning this because if my system is correct, then it completely determines the word boundaries. Useful, huh? But there is a catch-22: I need a good transcription of each word to check my theory and then determine word boundaries, but that transcription would require word boundaries that are already correct. Hmmm... :S