So you just wrote a beautiful essay on James Joyce's Ulysses - in Irish Gaelic. Will Yahoo, Google, Microsoft and Ask recognize it as Gaelic, hosted as it is on your co.uk domain? Maybe. But you can give them a hint!
The trick is to use every HTTP and HTML setting available to make sure your documents are not misidentified. This article considers the HTTP and HTML aspects of website internationalization for search engine optimization.
Why is language recognition a problem?
Search engines try to match the language of a web searcher (inferred from IP geolocation or user-specified preferences) to Web documents when determining the best matches for a search query. In some cases, a user can specify that results be limited to a specific language. Left to their own devices, search engines have a few clues for determining the human language of a document:
- The country-code domain of the site
- The country where the site is hosted
- The language of documents linking to the document
- A text pattern analysis of the document
Each approach is fraught with difficulties. Consider a few:
Country domain suffix of a website: Although a site with a .de extension is likely to be in German, the German company behind it may well have published content in other languages for an international audience. Some country domains, such as .ch for Switzerland, belong to countries with several official languages: in this case German, French, Italian and Romansh.
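To make the ambiguity concrete, here is a minimal Python sketch of a domain-suffix heuristic. The mapping `CCTLD_LANGUAGES` and the function `candidate_languages` are hypothetical names invented for this illustration, and the table is hand-picked; it is not data that any search engine publishes.

```python
from urllib.parse import urlparse

# Illustrative, hand-picked mapping from country-code TLD to plausible
# languages (ISO 639-1 codes). An assumption for this example only.
CCTLD_LANGUAGES = {
    "de": ["de"],
    "ch": ["de", "fr", "it", "rm"],
    "ie": ["en", "ga"],
    "uk": ["en"],
}

def candidate_languages(url):
    """Return the languages the URL's ccTLD suggests, or [] if unknown."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_LANGUAGES.get(tld, [])

if __name__ == "__main__":
    # A .ch site yields four candidates, so the suffix alone cannot decide.
    print(candidate_languages("http://www.example.ch/page"))
    # A Gaelic essay on a co.uk domain would be guessed as English.
    print(candidate_languages("http://example.co.uk/ulysses"))
```

A heuristic like this can only narrow the field, which is why it would need to be combined with the other clues.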
Where the site is hosted: Many sites are hosted far from their target audience to take advantage of cheap hosting options.
The language of linking documents: While the Internet is indeed a set of hubbed networks, it is quite common for web pages to cite an authority even when that authority is written in another language (English, for example).
Text pattern analysis: This is probably the most accurate method, especially for longer documents. While search engines do not reveal their approach(es), consider the Perl Lingua::Identify module, which currently recognizes 33 languages. Lingua::Identify uses a combination of four text pattern matching methods; here we quote the Lingua::Identify documentation:
Small Word Technique
The "Small Word Technique" searches the text for the most common words of each active language. These words are usually articles, pronouns, etc., that happen to be (usually) the shortest words in the language; hence the name of the method. This is usually a good method for large texts.
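As an aside from the quoted documentation, the idea can be sketched in a few lines of Python. This is not how Lingua::Identify is implemented; the word lists in `SMALL_WORDS` are tiny, hand-picked assumptions, and the function names are made up for the example.

```python
import re

# Tiny hand-picked lists of frequent short words per language.
# Real systems would use far larger, corpus-derived lists.
SMALL_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "it", "a"},
    "de": {"der", "die", "das", "und", "ist", "in", "zu", "ein"},
    "fr": {"le", "la", "les", "et", "est", "de", "un", "une"},
}

def small_word_scores(text):
    """Fraction of words in the text that are common small words, per language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        lang: sum(w in common for w in words) / total
        for lang, common in SMALL_WORDS.items()
    }

def guess_language(text):
    """Return the best-scoring language code, or None if nothing matched."""
    scores = small_word_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(guess_language("The cat is in the house and it is happy"))
    print(guess_language("Der Hund ist in dem Haus und die Katze ist draußen"))
```

Note that "in" appears in both the English and German lists, which hints at why short texts are harder to classify with this method than long ones.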
Prefix Analysis
This method analyzes text for common prefixes of each active language.
Suffix Analysis
Similar to Prefix Analysis, but analyzes common suffixes instead.
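Stepping outside the quoted documentation again, prefix and suffix analysis could be sketched as follows. The affix tables `PREFIXES` and `SUFFIXES` are small hand-picked assumptions, and `affix_scores` is a hypothetical helper rather than anything taken from Lingua::Identify.

```python
import re

# Hand-picked affixes for illustration; a real table would be corpus-derived.
PREFIXES = {
    "en": ("un", "re", "over", "th"),
    "de": ("ge", "ver", "ein", "sch"),
}
SUFFIXES = {
    "en": ("ing", "tion", "ly", "ed"),
    "de": ("ung", "lich", "keit", "en"),
}

def affix_scores(text, affixes, use_suffix):
    """Count the words that start (or end) with a known affix, per language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {}
    for lang, candidates in affixes.items():
        if use_suffix:
            scores[lang] = sum(w.endswith(candidates) for w in words)
        else:
            scores[lang] = sum(w.startswith(candidates) for w in words)
    return scores

if __name__ == "__main__":
    english = "She was quickly rebuilding the unfinished station"
    german = "Die Regierung hat die Entscheidung endlich getroffen"
    print(affix_scores(english, SUFFIXES, use_suffix=True))
    print(affix_scores(english, PREFIXES, use_suffix=False))
    print(affix_scores(german, SUFFIXES, use_suffix=True))
```

Affixes can collide across languages (the German "Regierung" begins with the English prefix "re"), so in practice these counts would be one signal among several.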
N-gram Categorization
N-grams are sequences of tokens. You can think of them as syllables, but they are also more than that, because they consist not only of characters but also of the spaces that delimit or separate words. N-gram data is available from Google.
N-grams are a very good way to identify languages, as the most common n-grams of each language are usually not very common in others.
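A rough sketch of n-gram categorization, again separate from the quoted documentation: build a set of character trigrams (spaces included) from a sample of each language, then count how many trigrams of an unknown text appear in each set. The training sentences in `SAMPLES` are made-up assumptions and far too small for real use; production systems train on large corpora and use more careful scoring.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams over lowercased text, with word-boundary spaces kept."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Made-up training sentences; an assumption for this example only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "läuft dann durch den wald mit den anderen tieren",
}

# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(char_ngrams(text)) for lang, text in SAMPLES.items()}

def guess_language(text, profiles=PROFILES):
    """Pick the language whose profile covers the most trigrams of the text."""
    grams = Counter(char_ngrams(text))
    scores = {
        lang: sum(count for gram, count in grams.items() if gram in prof)
        for lang, prof in profiles.items()
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    print(guess_language("the dog runs through the forest"))
    print(guess_language("der hund läuft durch den wald"))
```

Because trigrams such as " th" or "sch" are frequent in one language and rare in the other, even these tiny profiles separate the two test sentences.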