How open access content helps fuel growth in Indian-language Wikipedias

Mobile Internet connectivity is growing rapidly in rural India, and because most Internet users are more comfortable in their native languages, websites producing content in Indian languages are going to drive this growth. In a country like India, where only a handful of journals are available in Indian languages, open access to research and educational resources is hugely important for populating content for the various Indian-language Wikipedias.

Indian-language Wikipedias and open access

Most commonly spoken Indian languages have had Wikipedia projects for almost a decade. Languages like Konkani and Tulu are new entrants in the Wikipedia family, and there are currently 23 Indian-language Wikipedias. One example of high-quality open access content is the Open Textbook of Medicine, an offline encyclopedia consisting of Wikipedia articles related to medicine, which was created by a group of dedicated volunteer medical professionals who happen to be Wikipedia editors. There is enormous potential to grow Wikipedia in multiple languages with high-quality, open content like this.

To help fuel the growth of Wikipedia and its various projects, such as the Indian-language Wikipedias, the Wikipedia community has created an ecosystem with Wikimedia chapters and other affiliates, which are run by both volunteers and paid staff from the Wikimedia Foundation, an organization responsible for fundraising, technical, and community support. In India, Wikimedia India, the Centre for Internet and Society’s Access to Knowledge program (CIS-A2K), and Punjabi Wikimedians are three such official affiliates working on catalyzing the growth of the content and the communities.

Whereas Wikimedia India focuses on expanding content in all Indian languages, Punjabi Wikimedians focus on Punjabi-language content (in both Gurmukhi and Shahmukhi scripts), and CIS-A2K focuses on five languages: Kannada, Konkani, Marathi, Odia, and Telugu.

Indian-language Wikipedia projects can only grow with the help of volunteers editing their own language Wikipedias and adding missing information from reliable sources, which is where open access content can help.

Open in action

The 2016 International Open Access Week will be held October 24-30, 2016. The theme this year is Open in Action. The announcement explains, “International Open Access Week has always been about action, and this year’s theme encourages all stakeholders to take concrete steps to make their own work more openly available and encourage others to do the same. From posting preprints in a repository to supporting colleagues in making their work more accessible, this year’s Open Access Week will focus on moving from discussion to action in opening up our system for communicating research.”

Indian contributors show the spirit of Open in Action as they help add content to the various Indian-language Wikipedias. They depend on open access to research and other publications to help millions of people, including those living in rural areas, who are joining us online.

Vachana Sanchaya: Bringing access to 11th century Kannada literature

During the early 11th century, a form of spiritual Kannada-language poetry called Vachana sahitya became quite popular in the Indian state of Karnataka. It began to flourish in the 12th century through a religious movement called the Lingayatha movement. More than 259 Vachana writers, called Vachanakaru, compiled over 11,000 vachanas (verses). 21,000 of these verses in 15 volumes were published by the Government of Karnataka on an online portal called Samagra Vachana Samputa. Two Wikimedians along with two linguists brought these verses together in a standalone project called Vachana Sanchaya. Kannada Wikimedians Pavithra Hanchagaiah and Omshivaprakash HI, along with Kannada linguist O. L. Nagabhushana Swamy, converted the text to Unicode to make the verses searchable on this project. The entire collection is now ready to enrich the Kannada WikiSource.

The text in Samagra Vachana Samputa was typed using fonts based on ISCII, an Indian character encoding standard. In such fonts, Indic characters generally take the place of Latin ones, which makes the text unreadable when the particular font is not installed on the computer. This is a typical problem with non-Latin fonts, especially Indic typefaces. In the case of this particular publication, more than five ISCII standards were in use, which made searching and reusing the content impossible. Hanchagaiah and Omshivaprakash started writing scripts to make the vachanas searchable through an index. This in turn demanded a user-friendly platform for linguistic researchers, students, and members of the public interested in accessing this literature.
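
To make the conversion problem concrete, here is a minimal Python sketch of a table-driven converter from a legacy font encoding to Unicode. It is not the project's actual script: the mapping table TOY_FONT_MAP and the function legacy_to_unicode are hypothetical names invented for this illustration, and the Latin keys do not correspond to any real Kannada font. A real converter would need one much larger table per legacy font and usually extra rules for reordering and normalizing characters.

```python
# Toy mapping from "legacy font" codes to Unicode Kannada.
# ASSUMPTION: these keys are invented for illustration only; a real
# legacy font needs its own table with hundreds of entries.
TOY_FONT_MAP = {
    "v": "\u0CB5",               # KANNADA LETTER VA
    "c": "\u0C9A",               # KANNADA LETTER CA
    "n": "\u0CA8",               # KANNADA LETTER NA
    "k": "\u0C95",               # KANNADA LETTER KA
    "D": "\u0CA1",               # KANNADA LETTER DDA
    "~": "\u0CCD",               # KANNADA SIGN VIRAMA
    "kS": "\u0C95\u0CCD\u0CB7",  # conjunct KA + VIRAMA + SSA
}


def legacy_to_unicode(text, font_map):
    """Convert legacy-encoded text to Unicode by longest-match lookup."""
    # Try longer keys first so that "kS" wins over "k".
    keys = sorted(font_map, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(font_map[key])
                i += len(key)
                break
        else:
            # No mapping found: keep the character unchanged
            # (spaces, digits, punctuation).
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(legacy_to_unicode("vcn", TOY_FONT_MAP))    # the word "vachana"
    print(legacy_to_unicode("kn~nD", TOY_FONT_MAP))  # the word "Kannada"
```

Once the text is stored as Unicode, it can be searched, copied, and displayed with any Kannada-capable font, which is what makes an index over the verses possible.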

Omshivaprakash worked on designing the architecture for this platform using open source software tools. Hanchagaiah was involved in providing critical hacks for digitization and valuable input through suggestions, feedback, and quality assurance.

At present, the Vachana Sanchaya project has around 200,000 unique words that were derived from these verses. The public has been using the repository and accessing vachanas from Facebook, Twitter, and Google+ profiles. Thousands of people now read a vachana as part of their daily routine. Vachana Sanchaya is not only a gateway for reading the literature, but also a research platform for Kannada language and literature. It has options for researchers to help review content, which in turn will help to add references from research papers.
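
As a rough illustration of how a list of unique words and a searchable index can be derived from Unicode verses, the following Python sketch builds a simple inverted index. It is only a sketch under stated assumptions: the names SAMPLE_VERSES, build_index, and search are hypothetical, the sample strings are placeholder words rather than real vachanas, and the whitespace tokenization is far simpler than what a production platform would need.

```python
from collections import defaultdict

# Placeholder text, NOT real vachanas: a few ordinary Kannada words
# keyed by made-up verse ids.
SAMPLE_VERSES = {
    "v1": "ಕನ್ನಡ ವಚನ,",
    "v2": "ವಚನ ಸಾಹಿತ್ಯ.",
}

PUNCTUATION = ".,;:!?\"'()"


def build_index(verses):
    """Map each unique word to the set of verse ids that contain it."""
    index = defaultdict(set)
    for verse_id, text in verses.items():
        for token in text.split():
            word = token.strip(PUNCTUATION)
            if word:
                index[word].add(verse_id)
    return index


def search(index, word):
    """Return the sorted verse ids containing the exact word."""
    return sorted(index.get(word, set()))


if __name__ == "__main__":
    index = build_index(SAMPLE_VERSES)
    print(len(index))             # number of unique words
    print(search(index, "ವಚನ"))   # verses containing the word
```

The keys of the index are the unique words of the collection, so the same structure gives both a word count and a lookup from a word to the verses where it occurs.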

All of the content is currently available to the public through the OpenData API, and once the review work is complete, it will be distributed in the public domain through WikiSource. This will open up the system for students, developers, researchers, and anyone interested in building linguistic tools for Kannada and other Indic languages. Users will be able to use our code to digitize any book available in the public domain. Early literature in any language is well-respected, so making it available via an open platform allows for reuse of the content for research, publication, and other documentation work.

Similar projects can draw on this project and reuse any part of its processes.

Plans going forward:

  • To initiate Natural Language Processing (NLP) projects if more researchers help to tag words and grow the glossary.
  • To continue work on subsequent, similar projects for Sarvagnana Vachanagalu and Dāsa Sanchaya (work has begun) and Vyāsa and Muddann (work to be started).
  • To extend this platform to other contemporary literary works available in the public domain.

Authored by Pavithra Hanchagaiah, Omshivaprakash HI, and Subhashish Panigrahi. Draws inspiration from another article published on under CC-BY-SA 4.0

Image credit: Vachana Sanchaya website screenshot by Omshivaprakash HI.