Digitize any Book in the Public Domain
This article describes a mass-scale digitization project for Kannada WikiSource carried out by two Wikipedians.
It was originally published on Opensource.com on March 27, 2014.
Vachana sahitya is a form of poetry written in Kannada, a widely spoken Indian language. It evolved in the 11th century and flourished in the 12th as part of the religious Lingayatha movement. Since that time, more than 259 Vachana writers, called Vachanakaru, have compiled over 11,000 Vachanas (verses).
21,000 of these verses were digitally published in 15 volumes, called Samagra Vachana Samputa, by the government of Karnataka. These volumes were then turned into a standalone project called Vachana Sanchaya. The project was taken on by two Kannada Wikimedians, a Kannada linguist, and the author O. L. Nagabhushana Swamy to enrich the Kannada WikiSource. The team used Unicode, a character encoding standard that represents text consistently across platforms and software.
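To make the role of Unicode concrete, here is a minimal Python sketch (an illustration only, not the project's own code) showing how a Kannada word is stored as a sequence of standard code points. The word used is ಕರ್ಮ (Karma), and the helper name `describe` is hypothetical.

```python
import unicodedata

# The Kannada word for "karma", written with explicit code point escapes
# so the example does not depend on how this page was encoded.
KARMA = "\u0c95\u0cb0\u0ccd\u0cae"


def describe(word):
    """Return (code point, official Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


for code_point, name in describe(KARMA):
    print(code_point, name)

# Each Kannada character takes three bytes in UTF-8.
print(len(KARMA.encode("utf-8")), "bytes in UTF-8")
```

Because every character has a fixed, documented code point, any Unicode-aware tool can search, sort, and display the text without knowing which font or legacy encoding produced it.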
Swamy was trying to access these poems and was having trouble because they were encoded in ISCII, an Indian character encoding standard. We began writing scripts to make the Vachanas (poems) searchable through an index. To do that well, however, we had to build a platform for everyone to use: linguistic researchers, students, and members of the public interested in this literature.
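The idea of a searchable index can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the scripts used for Vachana Sanchaya: the function names (`tokenize`, `build_index`, `search`) and the sample verses are hypothetical, and words are assumed to be runs of characters from the Kannada Unicode block.

```python
import re
import unicodedata
from collections import defaultdict

# Assumption: a "word" is a run of characters in the Kannada block (U+0C80 to U+0CFF).
WORD_RE = re.compile(r"[\u0C80-\u0CFF]+")


def tokenize(text):
    """Normalize the text to NFC and split it into Kannada words."""
    return WORD_RE.findall(unicodedata.normalize("NFC", text))


def build_index(verses):
    """Map each word to the set of verse ids in which it appears."""
    index = defaultdict(set)
    for verse_id, text in verses.items():
        for word in tokenize(text):
            index[word].add(verse_id)
    return index


def search(index, word):
    """Return the sorted ids of verses containing the given word."""
    return sorted(index.get(unicodedata.normalize("NFC", word), set()))


# Hypothetical sample data: two very short "verses" keyed by id.
sample_verses = {
    1: "ಕರ್ಮ ಸತ್ಯ",
    2: "ಸತ್ಯ, ನದಿ",
}
sample_index = build_index(sample_verses)
print(search(sample_index, "ಸತ್ಯ"))
```

Normalizing both the indexed text and the query to the same Unicode form keeps visually identical words from being treated as different entries. A production index would also need to handle spelling variants and joiner characters, which this sketch ignores.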
Omshivaprakash, a Kannada Wikimedian, worked on the architecture of the platform, decided the infrastructure requirements, and chose the open source software tools to use. I was involved in providing critical hacks for digitization and valuable inputs through suggestions, feedback, and quality assurance.
At present, our repository, Vachana Sanchaya, has around 200,000 unique words that were derived from these poems. The public has been using our repository and accessing Vachanas (poems) from our Facebook, Twitter, and Google+ profiles. Thousands of people now read a Vachana as part of their daily routine. Vachana Sanchaya is not meant just for reading the poems; it is also meant for research. So, we have added a way for researchers to help us review the content, and we will be adding references from various research papers.
Above: A screenshot of the Vachana sahitya page
The most commonly searched words are:
- ಕರ್ಮ (Karma; English: work/deed)
- ಸತ್ಯ (Sathya; English: truthfulness)
- ನದಿ (Nadī; English: river)
ಆಂಗೀರಸ, ಪುಲಸ್ತ್ಯ, ಪುಲಹ, ಶಾಂತ,
ದಕ್ಷ, ವಸಿಷ್ಠ, ವಾಮದೇವ, ನವಬ್ರಹ್ಮ, ಕೌಶಿಕ, ಶೌನಕ, ಸ್ವಯಂಭು, ಸ್ವಾರೋಚಿಷ, ಉತ್ತಮ, ತಾಮಸ, ರೈವತ,
ಚಾಕ್ಷಷ, ವೈವಸ್ವತ, ಸೂರ್ಯಸಾವರ್ಣಿ,ಚಂದ್ರಸಾವರ್ಣಿ, ಬ್ರಹ್ಮಸಾವರ್ಣಿ,
ಇಂದ್ರ ಸಾವರ್ಣಿ ಇವರು ಇಪ್ಪತ್ತು ಮಂದಿ ಪ್ರಪಂಚ ನಿರ್ಮಾಣ ಸಹಾಯ[ದ]ವರು. ಹತ್ತೊಂಬತ್ತು ಎಂದರೆ ಪುಣ್ಯನದಿಗಳು.
ಅದು ಎಂತೆಂದಡೆ: ಗ್ರಂಥ
All of the content is currently available to the public through the OpenData API, and once the review work is complete, it will be distributed in the public domain through WikiSource. This will open up the system for students, developers, researchers, and anyone interested in building linguistic tools for Kannada and other Indic languages. Users will be able to use our code to digitize any book available in the public domain. Early literature in any language is well respected, so making it available on an open platform allows the content to be reused for research, publication, and other documentation work.
We encourage other projects of this kind to follow our method and use any part of our process that is helpful.
Going forward, we would like to:
- Initiate Natural Language Processing (NLP) projects if more researchers help to tag words and grow the glossary
- Continue work on subsequent, similar projects for Sarvagnana Vachanagalu and Dāsa Sanchaya (work has begun) and Vyasa and Muddann (work not yet started)
- Extend this platform to other contemporary literary works available in the public domain