English has always dominated the internet, and it continues to do so now that the World Wide Web has been around for more than three decades. It was almost certainly the first language used on the web, and it remains disproportionately represented even though internet adoption is now much more widespread around the globe.
Of all pages whose content language can be identified, more than 61% are in English. This makes English by far the dominant language for content on the internet. Yet Statista data from early this year finds that only 25% of internet users were English language users. This means that if you can operate online in English you have a wider array of content available to you than speakers of any other language.
Data from W3Techs finds that in 2021 only 1.3% of internet content was in Chinese. Yet Chinese language speakers represent close to 20% of the internet population. This means the share of internet content available to Chinese speakers is much smaller than their online population might suggest.
Chinese has some of the fastest-growing internet content, so we might see this change, but there’s still a lot of catching up to do before the share of content online becomes proportionate to the huge size of this online language body.
The same picture is seen for other big language populations online. Close to 8% of internet users speak Spanish, yet less than 4% of content is in this language. Around 5% of internet users speak some form of Arabic, yet only 1% of content is written in Arabic. And some of the world’s most widely spoken languages (offline at least) have a vanishingly small slice of content available to them online. These include Hindi, Portuguese and Bengali.
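One way to make these gaps comparable is to divide each language’s share of content by its share of internet users. The following is a minimal sketch in Python using the approximate figures quoted above; the "close to" and "less than" values are rounded for illustration, and the names SHARES and representation_ratio are hypothetical.

```python
# Approximate shares quoted in the text, in percent. "Close to" and
# "less than" figures are rounded, so treat the results as rough.
SHARES = {
    # language: (share of web content, share of internet users)
    "English": (61.0, 25.0),
    "Chinese": (1.3, 20.0),
    "Spanish": (4.0, 8.0),
    "Arabic": (1.0, 5.0),
}


def representation_ratio(content_pct: float, users_pct: float) -> float:
    """Content share divided by user share; 1.0 would be proportionate."""
    return content_pct / users_pct


for language, (content_pct, users_pct) in SHARES.items():
    ratio = representation_ratio(content_pct, users_pct)
    label = "over-represented" if ratio > 1 else "under-represented"
    print(f"{language}: {ratio:.2f} ({label})")
```

A ratio above 1 means a language has more content than its share of users would suggest, and a ratio below 1 means it has less. On these figures only English comes out above 1.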
Unfair distribution
In 2008, UNESCO found that 98% of pages on the internet were written in just 12 world languages. Whilst that picture has probably changed somewhat since then, the fact remains that a small number of languages continue to dominate the internet and to punch well above their weight in terms of content availability.
Another way of looking at this skewed representation is in geographical terms. Although 76% of internet users live in Asia, Africa, Latin America, the Middle East and the Caribbean, most of the content comes from elsewhere. At the end of 2020 over half of the world’s online population lived in Asia, compared with only 14% in Europe and less than 7% in North America. In the world’s main hub of conversation, the talk is essentially being dominated by a vocal minority.
Wikipedia is a good illustration of this inequality of content. The huge free online encyclopaedia describes its ambition to ‘create a world in which everyone can freely share in the sum of all knowledge’, yet it’s a world that’s very skewed towards English-majority countries.
A full 80% of its content comes from North America and Europe. Although the internet is seen as a repository of human knowledge, it’s a resource that’s dominated by only a minority of those humans.
Big hitters
At this phase of internet development, we see a lot of big platforms dominating the landscape. The bigger platforms tend to dictate the direction of internet culture and can be highly influential socially – take Facebook’s influence on election outcomes as just one example.
If you take the top 10 million websites by share of internet traffic, you find content is even more focused on major languages. More than 60% of them are in English, despite the fact that only around 16% of the world’s population speaks this language. China has the greatest number of internet users and a thriving online landscape, yet still only 1.4% of the world’s biggest sites are in Chinese.
Russian is better represented online than its relatively small share of the world population would indicate. Over 8% of the biggest internet locations are in Russian despite the fact that only just over 3% of internet users speak it.
Dominant internet locations can really kick off the generation of huge volumes of content, particularly user-generated content in the languages those platforms support. The current maturity of successful web platforms in particular languages probably goes some way to explain why there’s such a lot of content in those languages.
Russian speakers have access to the Russian-language social networking site VK (previously known as VKontakte), while speakers of a similarly sized language such as Bengali don’t have equivalent platforms in their own language. That explains some of the gulf in online representation.
It’s true that internet adoption may have initially started in North America and Europe. This partly explains why some of the more popular sites in the world still reflect the languages of the world’s first internet users.
There’s been a huge rise in adoption across Asia since then, particularly in China which is now starting to catch up in content terms. We are now seeing tremendous growth in adoption in areas such as India and Africa. But the latter locations are linguistically much more complex compared to North America or China.
There’s probably also a relationship between start-up culture and language dominance online. Language groups that support innovation online are more likely to see dedicated language platforms emerge to meet their needs. Internet adoption isn’t necessarily a straightforward path to better content representation for speakers of smaller languages.
The unrepresented
There are many, many languages that just aren’t represented online at all. Of the nearly 7,000 languages thought to be spoken around the world, only 7% have any identified content online. It’s not just a case of lack of internet access.
For many people, the technology just doesn’t support their language coming online. Keyboards tend to be designed for dominant languages, and so is a host of other critical technology that lets a language live online.
As the internet’s inception language, English has the advantage of using Latin script. From the very start of the internet, this meant several dozen other languages that also use the Latin script could share the platform.
But some other scripts are only used by a single language. Smaller language groups tend to wield less economic and political power than languages that share the Latin script (the languages of Western Europe, for example). They are less likely to get their technology needs met than larger, more influential language groups.
Many major platforms don’t accommodate minority languages well. Facebook is considered to perform relatively well when it comes to accommodating language diversity – 111 languages are now supported by the platform.
But that still excludes thousands of world languages. Unicode, the dominant information technology standard for text representation online, encodes the characters of writing systems such as alphabets so that digital platforms can use them. At present Unicode supports 154 scripts, and while it adds more each year, this is still an extremely small slice of world languages.
There are also a lot of thriving languages that don’t have a dedicated writing system: an estimated 4,000 languages in the world remain unwritten. Other languages may have a writing system, but a large proportion of their speakers are illiterate.
Ethiopia has one of the highest illiteracy rates in the world at over 30%, and around 100 languages are spoken in the country. This is likely to mean that many languages suffer from both a disproportionately high level of illiteracy and a small population size, so there’s less chance of those languages getting the technology and support needed to transition into an online environment.
Where major tech companies are making moves to support language diversity online, they aren’t making them for obscure Ethiopian languages of poor herdsmen. They’re investing to bring the languages of relatively wealthy Indian middle classes online, as there’s a profit motive in catering to these audiences.
Impact on minority languages
When humanity’s key shared space for finding information, socialising and participating in society isn’t available to you in your own language, that has to reduce the value of your language to both yourself and others.
That’s why there’s concern that lack of language representation on the internet may threaten the survival of minority languages. Speakers of minority languages are already incentivised to learn dominant languages in order to participate more in wider society, and the internet is reflecting this.
Ethiopia is probably a good illustration of this. It’s unlikely that all of those hundred languages are going to be supported to thrive online anytime soon. What’s more likely to happen is that a small elite that can speak English or French will be able to access content online in those languages, increasing the value of those languages to that community.
Although Ethiopia operates under the principle of recognising all languages spoken in the state, the two more widely spoken languages of Amharic and Oromo have greater official use and support.
As Amharic is the language used by the government, online resources are likely to be available in it; things such as school exam information or train timetables might be available in this language.
As these are the languages that are likely to have a greater amount of content available to them online, their value to the average speaker is likely to increase. As a result, other languages diminish in relative terms. This is likely to enhance social inequalities between speakers of different language groups and probably contribute to minority language decline.
The future of language diversity online
Surging internet adoption, particularly in emerging markets such as India, is probably going to improve the representation of some languages for online content. We’re likely to see thriving, economically empowered languages such as Chinese represent a larger share of online content because of the pace of growth of Chinese language content.
However, this improvement isn’t likely to be seen universally across all languages and the rise in online representation could disadvantage some smaller languages at the same time as empowering other more muscular ones.
Whilst Chinese may be smaller than English in terms of online content, it’s still a huge world language with enormous economic, scientific and political clout. Significantly, the technology already supports the practicalities of rendering the Chinese language online.
Chinese speakers are highly active online, their participation is growing, and their content is therefore catching up with languages that had a head start on the internet. There’s also a thriving entrepreneurial culture and capital to introduce new Chinese-language innovation in online content and internet technology, which will support the existence of the Chinese language online.
By contrast, languages that are smaller and have a less influential speaker population are less likely to see improvement in their representation online. Many of these languages, even ones with tens of millions of speakers, may not gain the technology needed for them to make it online at all.
Unfortunately, the languages making the transition to online spaces tend not to be the most vulnerable. For commercial reasons, platforms such as Facebook tend to focus on adding languages that are likely to be most lucrative to them.
These will inevitably be the more economically powerful languages, spoken by groups that are mostly literate and possibly geographically concentrated. World languages that don’t share these characteristics are often the ones most at risk of fading away. When it comes to online representation, there really are winners and losers among language groups.