For notifications of new posts on language, translation, energy and culture, you can subscribe to our blog. Comments and discussion of articles are very welcome on our social media accounts.

Translating web pages – easy snap or tempting translation trap?

Could you just translate this web page please? Well…

It’s easy, you say, but a simple request that sounds like a snap can turn into a translation trap. Web pages are where we read these days, so why not start the job there and just translate what you see on the website? Well, web pages are actually made up of not just the words and images you see on the surface, but also technical code you don’t see, and styling you do, so you may regret your words when you find yourself swimming in a simmering sea of alphabet soup. And what if the result can’t be served up in a way that can be readily consumed? So before just jumping in and translating web pages, let’s look at what really is on a web page and how the text there might, or might not, mesh with the professional translation process to deliver a successful result – in a final format translator and client can readily use.

Typically, to get something translated you supply a Word, Excel or PowerPoint document, though preferably not a PDF, and you may find it helpful to read our top tips to prepare for a successful translation. The translator will normally deliver the finished translation in the same format, though they may first have imported the document into a technical translation program and exported it from there when finished. So to translate a web page, can’t we just copy, paste, translate, then paste back onto the website? Well, even assuming the web page’s content hasn’t changed while the translator was working on their own version, it just is not that easy. Here we’ll look at why, working through the key questions below.

What is a web page?

An individual page on a website usually has unique content, and also elements that are repeated across most or all pages, such as a header, footer, sidebar, navigation, and possibly elements that hover or pop up. Content on a page may be dynamic, meaning it varies, or becomes visible or hidden, depending on user input or browser type, or on data being updated automatically in the background, such as prices or locations. So what do you want translated? Most of those repeating site-wide elements should really be translated separately and entered just once by the technical administrator in the website’s CMS, for reasons of efficiency and sanity.

To translate the unique content of a single web page, you need either the separate source document which it is based on, if there is one, or to grab or export the contents of that page to give to a translator.  Your CMS may allow export of single, multiple or all pages, but the export may also contain HTML code and other hidden code and formatting. WordPress, for example, has a stock feature which lets you export all, or selected, content to a text file in standard XML format – which also provides an easy backup of the site’s text content:

WordPress content export to XML file
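To give a feel for what such an export contains, here is a minimal Python sketch that pulls page titles and bodies out of an export-style XML file. It assumes the export follows the usual RSS-based layout, with each page in an `<item>` element and its body in `<content:encoded>`. The sample data and the function name `list_pages` are invented for illustration, and a real export will carry many more fields.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for an export file (hypothetical content).
SAMPLE_EXPORT = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Site</title>
    <item>
      <title>About us</title>
      <content:encoded><![CDATA[<p>We translate <strong>energy</strong> texts.</p>]]></content:encoded>
    </item>
    <item>
      <title>Contact</title>
      <content:encoded><![CDATA[<p>Get in touch.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""

# Namespace assumed for the <content:encoded> element.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def list_pages(xml_text):
    """Return (title, raw_html_body) pairs for every <item> in the export."""
    root = ET.fromstring(xml_text)
    pages = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("content:encoded", default="", namespaces=CONTENT_NS)
        pages.append((title, body))
    return pages


for title, body in list_pages(SAMPLE_EXPORT):
    print(title, "->", body)
```

Note that the bodies still contain HTML tags, which is exactly the hidden code a translator or their software then has to deal with.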

Selecting content on a web page and pasting into Microsoft Word will probably also bring along most of the hidden code, though not necessarily consistently. Simply copying and pasting direct from a web page to a simple .txt file, such as in Notepad, will probably eliminate all formatting other than line breaks, which may actually simplify things more than you want, particularly where images and captions are concerned.

What is a CMS?

A CMS is a Content Management System for a website, allowing the non-technical user to create, edit and manage the text and image content on pages. Widely used CMSs include WordPress, Drupal, and Joomla, and they typically let users paste in, and format, text content without having to wrestle with HTML code. A CMS may also have features for importing and exporting website content, and it may add its own unique hidden code to pages in addition to the stock HTML that makes up web pages.

How do you get a word count on a website?

To get a sense of scale or cost, you need a word count for your web page or website, but only of the unique, non-repeating contents of pages. If the website content was input from separate source documents, your word count is there. Simply doing a word count on text copied and pasted from the web page might miss text, or equally might inflate the count by including HTML and other unseen code. Your CMS should, however, easily give you a word count for each page’s unique content via the admin and editing interface.

Wordcount in WordPress Dashboard

In the WordPress CMS, for example, the word count is shown in the edit screen for that page. Site-wide counts can be obtained with a free WordPress plugin, which can also give counts per content type and author, and similar functionality is available in Drupal. In WordPress the paid multilingual plugin WPML can also provide more fine-grained, translation-specific word counts. In the absence of a CMS or source documents, a web developer might have to download and export content from a copy of your website made with a tool such as HTTrack, adding time and uncertainty to the process.
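To illustrate why a count taken on raw page code goes wrong, here is a rough Python sketch that counts only the visible words in a fragment of HTML. It is an assumption-laden illustration rather than how any CMS or plugin counts: the class `TextOnly`, the function `visible_word_count`, the short list of block-level tags and the sample page are all made up for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text, skipping tags and the contents of <script>/<style>."""

    SKIP = {"script", "style"}
    # Illustrative subset of tags that end one run of words and start another.
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_word_count(html):
    """Count whitespace-separated words in the visible text of an HTML fragment."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return len("".join(parser.chunks).split())


page = ('<h1 class="title">Wind power</h1>'
        '<p>Turbines are <em>quiet</em>.</p>'
        '<style>p {color: red;}</style>')

naive = len(page.split())         # counts fragments of code as if they were words
clean = visible_word_count(page)  # 5 visible words
print(naive, clean)
```

The two numbers differ even on this tiny fragment, and on a full page with navigation, scripts and styling the gap is likely to be much larger.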

What is HTML?

HTML is the simple code that places background structure around the text of a web page, analogous to the way that paragraphs, headings, line breaks and spaces give visible structure to printed texts. HTML is the basis of all web pages, and in most browsers you can easily see the HTML via a setting or option such as View:Source or View:Page Source.

HTML source code of a web page

When you look at the HTML source code of a web page, you see the text content wrapped in various tags contained in angle brackets, such as <p> to start a paragraph and </p> to end a paragraph. Without the HTML tags the text on a web page would just display as a continuous stream with no paragraphs, headings, or line breaks, and no CSS styling.

What is a translation program?

A professional translator typically uses a sophisticated translation memory program to manage and support the translation process, bringing consistency and efficiency gains while avoiding the imperfections and pitfalls of machine translation. Such programs are referred to as CAT tools, or Computer-Aided Translation tools, and will import and export texts in certain specified file formats such as Word, Excel and PowerPoint, but by no means in all existing formats and generally not raw web pages. Ideally the CAT tool can also identify and handle any HTML contained in imports, as the Déjà Vu CAT tool does, as well as managing word counts and re-export to the desired final format.
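One way to picture what handling HTML in an import can mean is tag protection: the tags are swapped for numbered placeholders, the translator works on the remaining text, and the tags are restored afterwards. The Python sketch below is a simplified illustration of that general idea only. It makes no claim about how Déjà Vu or any other CAT tool works internally, the function names `protect_tags` and `restore_tags` are hypothetical, and the simple pattern used here would not cope with every kind of real-world HTML.

```python
import re

# Simplistic pattern for opening and closing tags (illustrative, not a full HTML parser).
TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def protect_tags(segment):
    """Swap each HTML tag for a numbered placeholder; return the text and the tag list."""
    tags = []

    def stash(match):
        tags.append(match.group(0))
        return "{%d}" % len(tags)

    return TAG.sub(stash, segment), tags


def restore_tags(text, tags):
    """Put the original tags back wherever their placeholders ended up."""
    return re.sub(r"\{(\d+)\}", lambda m: tags[int(m.group(1)) - 1], text)


source = 'Read our <a href="/tips">top tips</a> first.'
protected, tags = protect_tags(source)
print(protected)   # Read our {1}top tips{2} first.

# The translator moves the placeholders along with the words they wrap.
translated = "Lesen Sie zuerst unsere {1}Top-Tipps{2}."
print(restore_tags(translated, tags))
# Lesen Sie zuerst unsere <a href="/tips">Top-Tipps</a>.
```

The point of the placeholders is that the link survives the translation intact even though the word order around it has changed.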

How does a website handle multiple languages?

A multilingual website needs to structure its content so that everything that needs to be translated gets translated, is displayed in the right place, and is findable. Some sites separate languages using subdomains, for example deutsch.example.com or fr.example.com, while others use subdirectory URLs like example.com/de or example.com/english/uk, and some companies even use separate domains such as example.co.uk and example.jp. Navigation that lets users switch language or country is important, though the common convention of using flag symbols can be confusing and inappropriate.
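As a small illustration of the subdomain and subdirectory schemes, the Python sketch below works out which language a URL points to. The set of language codes, the default language and the function name `language_from_url` are assumptions made for the example. It only recognises short codes such as "fr" or "de", so a label like "deutsch" or a separate country domain would need its own mapping.

```python
from urllib.parse import urlparse

# Hypothetical set of language codes used by the site.
KNOWN_LANGUAGES = {"de", "en", "fr", "ja"}


def language_from_url(url, default="en"):
    """Guess the language of a page from its subdomain or first path segment."""
    parts = urlparse(url)

    # Subdomain style, e.g. fr.example.com
    host_labels = (parts.hostname or "").split(".")
    if len(host_labels) > 2 and host_labels[0] in KNOWN_LANGUAGES:
        return host_labels[0]

    # Subdirectory style, e.g. example.com/de/...
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[0] in KNOWN_LANGUAGES:
        return segments[0]

    return default


print(language_from_url("https://fr.example.com/about"))    # fr
print(language_from_url("https://example.com/de/kontakt"))  # de
print(language_from_url("https://example.com/about"))       # en
```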

Editing of individual page content proceeds just as it would on a monolingual site, with site-wide text such as navigation being supplied separately by the CMS in the appropriate language. Planning, strategy and ongoing admin are needed for a multilingual site. In the WordPress CMS this can be done by adding plugins, some of them paid, whereas the Drupal CMS is built to be multilingual right from the core and handles this really well.

What is CSS?

Cascading Style Sheets, or CSS, is simple code in text format that assigns visual style to HTML tags in web pages, for example making a heading red in colour or aligning a paragraph’s text to the left. CSS adds code rules on top of the base HTML tags, and this CSS code may be in the web page or in a separate file. Without CSS you would see only the browser’s plain default styling, mostly black text on a white background with very little layout.

For example, here we have HTML code for a paragraph assigned to the class “socialist,” and a separate CSS rule which could be in that page or referenced from elsewhere to make all paragraphs of the class “socialist” appear in red text:

<p class="socialist">A Universal Basic Income creates a strong and stable society.</p>

p.socialist {color:red;}

Together these display the sentence in red text: A Universal Basic Income creates a strong and stable society.

What could go wrong?

If someone agrees to translate content right from a website, what could go wrong? A lot, really, as we’ve seen.  So let’s just summarise the risks:

  • The page content may have changed in the meantime, so when you paste in or import the translated version, it is out of date and inaccurate.
  • Hidden formatting, such as HTML, CSS, or CMS code, may be lost or altered so that the words are there but the styling or layout of the translated web page is now incorrect. Then a web administrator or technical expert has to repair the coding of each and every newly translated page, comparing against the original source language page.
  • Irrelevant content gets translated or distorts word counts when repeated or dynamic parts of web pages are not separated from the unique text to be targeted for translation on a given web page.
  • Time is wasted and frustration mounts as the translator wrestles with issues that are out of scope and the website administrator has to fix niggling problems that no one anticipated, and no one wants the blame.
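The first risk in the list, content changing while the translation is under way, is one that a website administrator could check for mechanically. The Python sketch below shows one possible approach, assuming the exact text sent to the translator was kept: compare a fingerprint of that text with a fingerprint of what is on the site now. The function names `fingerprint` and `source_changed` are hypothetical, and this is an illustration rather than a feature of any particular CMS.

```python
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so reflowed lines do not count as changes."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def source_changed(text_sent_to_translator, text_on_site_now):
    """Return True if the live source text no longer matches what was sent out."""
    return fingerprint(text_sent_to_translator) != fingerprint(text_on_site_now)


sent = "Our turbines are quiet\nand efficient."
live_same = "Our turbines are quiet and efficient."
live_edited = "Our turbines are quiet, efficient and affordable."

print(source_changed(sent, live_same))    # False
print(source_changed(sent, live_edited))  # True
```

If the check reports a change, the translated page should not simply be pasted in. The differences need to go back to the translator first.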

Instead, be sensible, talk to a translator and agree an appropriate format – and words will serve you well.

Still tempted to try translating from and to a web page to save time and money… don’t do it!