Centre for Internet & Society

Modi's effort to promote the use of Hindi and e-governance has given hope to those who want to see more vernacular content online, but many challenges have to be overcome.

Rohan Venkataramakrishnan's blog post was published in Scroll.in on August 29, 2014. Sunil Abraham gave his inputs.


For most of its short history, the internet has been the English speaker’s playground. Though English is the world’s third-most spoken language (after Mandarin and Spanish), it is by far the most commonly used language on the internet. If you wanted to make sense of most of what’s on the World Wide Web, you had to be able to read and write English.

This is slowly changing. The launch of Devanagari script web addresses on Sunday, allowing people to use .भारत domain names, was another step in the slow effort to bring about a multilingual Web. As things stand, Indian languages like Hindi – one of the most commonly spoken languages on Earth – lag far behind. The move gels well with the new government’s effort to promote the use of Hindi, and its push to increase digital services available to all citizens. The next few years could well see a spurt in vernacular content online.

But first, many challenges have to be overcome. “At present, not a single Indian language figures in the top 10 languages prevalent on the Internet, though Chinese, Arabic and Russian feature in the list,” said a McKinsey report on the internet’s impact on India. “The next wave of internet adoption in India will be dominated by local language speakers, which underscores the need for much more content and applications to be offered in local languages.”

Vernacular internet
Early studies of the internet attempted to quantify how much of the web was in English. A 1997 estimate put the number at 80% of all websites, while the Online Computer Library Center’s study in 2003 concluded that 72% of all online content was in English. Today that number is much lower.

Language Usage

W3Techs, which conducts surveys of the internet, now estimates that about 55% of content on the Internet is in English, followed by German, Russian and Japanese. Indian languages don’t crack the top 35.

The analysis is by its nature imprecise. The internet is vast and mostly uncharted. Estimates suggest search engines have indexed only 40% of Web content, leaving much off the mainstream radar. Measuring language becomes even harder because, in the early years, when fonts were harder to render, most non-English content on the internet was spelt out in Roman letters.
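To see why romanised text confounds this kind of measurement, consider a minimal sketch in Python of a script-based counter. It is an illustration only, not the method used by W3Techs or any other survey: the function name `devanagari_share` and both sample sentences are hypothetical, and the approach simply checks whether each letter falls in the Unicode Devanagari block (U+0900 to U+097F).

```python
import unicodedata


def devanagari_share(text: str) -> float:
    """Return the fraction of letters and combining marks that are Devanagari.

    Hypothetical helper for illustration; real surveys use far richer signals.
    """
    # Keep only letters (category L*) and combining marks (category M*),
    # so spaces, digits and punctuation do not skew the ratio.
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    # The Devanagari Unicode block spans U+0900 to U+097F.
    in_block = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return in_block / len(letters)


# The same Hindi sentence, once in Devanagari and once spelt in Roman letters.
native = "भारत एक देश है"
romanized = "Bharat ek desh hai"

print(devanagari_share(native))
print(devanagari_share(romanized))
```

A counter of this kind would register the second sentence as containing no Devanagari at all, even though it is Hindi, which is one reason early estimates of non-English content are hard to trust.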

Indian Wiki
The rise of multilingual scripts has changed that, and made it easier to evaluate the diversity of the internet. Yet even the best approach relies more on sampling than measurement. There is one section of the Web, however, that does allow for comparisons of absolute numbers.

Wikipedia Articles

Relative to other tongues, Indian-language articles still comprise a minuscule portion of Wikipedia. English, Spanish and French are perhaps expected, but even languages like Vietnamese have nearly 10 times the number of pages that Hindi does. Waray-Waray, the fifth-most commonly spoken language in the Philippines, appears to be an outlier because of an automated translation method that creates pages in that language.

Hindi content has been growing on the internet encyclopedia, from no pages in 2003 to more than one lakh in 2011, but it still falls far behind languages that are as widely spoken, like Spanish and Arabic, let alone those with much smaller reach. Of course, in many countries English is not spoken at all, so Internet users need web pages in their own language. In India, because of the language-class association, the majority of Internet users are at least conversant in English.

Hindi Pages

Obstacle Course
The impediments to further growth are all too apparent. For one, internet infrastructure still leaves much to be desired. Though India has the third-largest internet user base in the world, only 10% of the country is actually online. Even by 2015, when internet access is expected to reach 28% of the population, the equivalent rural figure is likely to be just 9%, according to estimates.

“A lot of the core infrastructure that is necessary for language computing is missing,” said Sunil Abraham, executive director of the Centre for Internet & Society. “There’s no mandate by the government that these languages must be supported, no comprehensive dictionaries, no thesauri, no machine translation capabilities, no optical character recognition capabilities. Because our market is so insignificant for proprietary software makers, they haven’t done enough to develop these. Meanwhile, the free software community is too small and mostly English-speaking.”

The government has launched some initiatives in this regard, like a National Translation Mission aimed at machine translating text from English into Indic languages, as well as banks of fonts that are free to use. But Abraham said that while the government is clear this should be a priority area, it underestimates the scale of the problem.

“We need large-scale investment by the government into each language,” he said. “We’re looking at maybe even Rs 100 crore per language, to bring each of our traditional languages into the internet age.”