Actually, this section is not really part of this document. Instead, it is placed in each module's documentation. This helps ensure that the information stays up to date by keeping the documentation and the code together.
The open-source community's perception of this situation has improved dramatically in recent years. We, as translators, won the first battle and convinced everybody of the importance of translations. But unfortunately, that was the easy part. Now, we have to do the job and actually translate all this material.
Actually, open-source programs themselves enjoy a rather decent level of translation, thanks to the wonderful gettext tool suite. It can extract the strings to translate from a program, present them to translators in a uniform format, and then use the result of their work at run time to display translated messages to the user.
But the situation is rather different when it comes to documentation. Too often, the translated documentation is not visible enough (not distributed as part of the program), only partial, or not up to date. This last situation is by far the worst. An outdated translation can turn out to be worse for users than no translation at all, since it describes old program behaviors that are no longer in use.
Thanks to this, discovering which parts of the document have changed and need an update becomes very easy. Another advantage is that the tools do almost all the work when the structure of the original document gets fundamentally reorganized, with chapters moved around, merged, or split. By extracting the text to translate from the document structure, the tools also keep you away from the text formatting complexity and reduce your chances of producing a broken document (even if they do not completely prevent you from doing so).
Please also see the FAQ below in this document for a more complete list of the advantages and disadvantages of this approach.
man
The good old manual pages' format, used by so many programs out there. po4a support is very welcome here, since this format is somewhat difficult to use and not really newbie-friendly. The Locale::Po4a::Man(3pm) module also supports the mdoc format, used by the BSD man pages (they are also quite common on Linux).
pod
This is the Perl Online Documentation format. The language and its extensions are documented in this format, as are most existing Perl scripts. Embedding both documentation and code in the same file makes it easy to keep them close to one another. This makes the programmer's life easier, but unfortunately not the translator's.
sgml
Even if somewhat superseded by XML nowadays, this format is still used rather often for documents that are more than a few screens long. It even allows you to make complete books. Updating the translation of such long documents can turn out to be a real nightmare: diff often proves useless when the original text was re-indented after an update. Fortunately, po4a can help you in that process.
Currently, only the DebianDoc and DocBook DTDs are supported, but adding support for a new one is really easy. It is even possible to use po4a on an unknown SGML DTD without changing the code, by providing the needed information on the command line. See Locale::Po4a::Sgml(3pm) for details.
TeX / LaTeX
The LaTeX format is a major documentation format used in the Free Software world and for publications. The Locale::Po4a::LaTeX(3pm) module was tested with the Python documentation, a book and some presentations.
texinfo
All the GNU documentation is written in this format (it is even one of the requirements to become an official GNU project). The po4a support for this format, in Locale::Po4a::Texinfo(3pm), is still in its early stages. Please report bugs and feature requests.
xml
The XML format is a base format for many documentation formats.
Currently, the DocBook DTD is supported by po4a. See Locale::Po4a::Docbook(3pm) for details.
others
Po4a can also handle some rarer or specialized formats, such as the documentation of compilation options for the 2.4.x kernels or the diagrams produced by the dia tool. Adding a new format is often very easy, and the main task is to come up with a parser for your target format. See Locale::Po4a::TransTractor(3pm) for more information about this.
There is a whole bunch of other formats we would like to support in po4a, and not only documentation ones. Indeed, we aim at plugging all the ``market holes'' left by the classical gettext tools. This encompasses package descriptions (deb and rpm), package installation script questions, package changelogs, and all the specialized file formats used by programs, such as game scenarios or wine resource files.
Note that master.doc is taken as an example for the documentation to be translated and translation.doc is the corresponding translated text. The suffix could be .pod, .xml, or .sgml depending on its format. Each part of the picture will be detailed in the next sections.
                                    master.doc
                                        |
                                        V
      +<-----<----+<-----<-----<--------+------->-------->-------+
      :           |                     |                        :
 {translation}    |          { update of master.doc }            :
      :           |                     |                        :
    XX.doc        |                     V                        V
  (optional)      |                 master.doc ->-------->------>+
      :           |                   (new)                      |
      V           V                     |                        |
 [po4a-gettextize]  doc.XX.po --->+     |                        |
      |             (old)         |     |                        |
      |                  ^        V     V                        |
      |                  |   [po4a-updatepo]                     |
      V                  |        |                              V
 translation.pot         ^        V                              |
      |                  |    doc.XX.po                          |
      |                  |     (fuzzy)                           |
 { translation }         |        |                              |
      |                  ^        V                              V
      |                  |  {manual editing}                     |
      |                  |        |                              |
      V                  |        V                              V
  doc.XX.po --->---->+<---<--- doc.XX.po     addendum        master.doc
  (initial)          |        (up-to-date)  (optional)      (up-to-date)
      :              |            |             |                |
      :              V            |             |                |
      +----->----->--+            |             |                |
                     |            |             |                |
                     V            V             V                |
                     +----->------+------<------+-------<--------+
                                  |
                                  V
                           [po4a-translate]
                                  |
                                  V
                               XX.doc
                            (up-to-date)
On the left, the conversion of a translation not using po4a to this system is shown. At the top right, the action of the original author (updating the documentation) is depicted. The middle of the right part shows the automatic actions of po4a. The new material is extracted and compared against the existing translation. Parts which didn't change are found, and the previous translation is reused. Parts which were partially modified are also connected to the previous translation, but with a specific marker indicating that the translation must be updated. The bottom of the figure shows how a formatted document is built.
Actually, as a translator, the only manual operation you have to do is the part marked {manual editing}. Yeah, I'm sorry, but po4a helps you translate. It does not translate anything for you...
To begin a new translation using po4a, you have to do the following steps:
$ po4a-gettextize -f <format> -m <master.doc> -p <translation.pot>
<format> is naturally the format used in the master.doc document. As expected, the output goes into translation.pot. Please refer to po4a-gettextize(1) for more details about the existing options.
The actual translation can be done using the Emacs' or Vi's PO mode, Lokalize (KDE based), Gtranslator (GNOME based) or whichever program you prefer to use for them (e.g. Virtaal).
If you wish to learn more about this, you definitely need to refer to the gettext documentation, available in the gettext-doc package.
$ po4a-translate -f <format> -m <master.doc> -p <doc.XX.po> -l <XX.doc>
As before, <format> is the format used in the master.doc document. But this time, the PO file provided with the -p flag is part of the input. This is your translation. The output goes into XX.doc.
Please refer to po4a-translate(1) for more details.
$ po4a-updatepo -f <format> -m <new_master.doc> -p <old_doc.XX.po>
(Please refer to po4a-updatepo(1) for more details)
Naturally, new paragraphs in the document won't get magically translated in the PO file by this operation, and you'll need to update the PO file manually. Likewise, you may have to rework the translation of paragraphs which were modified a bit. To make sure you won't miss any of them, they are marked as ``fuzzy'' during the process, and you have to remove this marker before the translation can be used by po4a-translate. As with the initial translation, it is best to use your favorite PO editor here.
Once your PO file is up-to-date again, without any untranslated or fuzzy string left, you can generate a translated documentation file, as explained in the previous section.
The key here is to have the same structure in the translated document and in the original one so that the tools can match the content accordingly.
If you are lucky (i.e., if the structures of both documents perfectly match), the process will work seamlessly and you will be set in a few seconds. Otherwise, you may understand why this process has such an ugly name, and you'd better be prepared for some grunt work here. In any case, remember that it is the price to pay to get the comfort of po4a afterward. And the good news is that you only have to do it once.
I cannot emphasize this enough. In order to ease the process, it is thus important that you find the exact version which was used to do the translation. The best situation is when you noted down the VCS revision used for the translation and didn't modify it during the translation process, so that you can use it.
It won't work well when you use the updated original text with the old translation. This remains possible, but it is harder and really should be avoided if possible. In fact, I guess that if you fail to find the original text again, the best solution is to find someone else to do the gettextization for you (but, please, not me ;).
Maybe I'm being too dramatic here. Even when things go wrong, this remains way faster than translating everything again. I was able to gettextize the existing French translation of the Perl documentation in one day, even though things did go wrong. That was more than two megabytes of text, and a new translation would have taken months or more.
Let me explain the basics of the procedure first; I will come back to hints on how to achieve it when the process goes wrong. To ease comprehension, let's use the above example once again.
Once you have the old master.doc that matches the translation XX.doc, the gettextization can produce the PO file doc.XX.po directly, without manual translation of the translation.pot file:
$ po4a-gettextize -f <format> -m <old_master.doc> -l <XX.doc> -p <doc.XX.po>
When you're lucky, that's it. You have converted your old translation to po4a and can begin the updating task right away. Just follow the procedure explained a few sections ago to synchronize your PO file with the newest original document, and update the translation accordingly.
Please note that even when things seem to work properly, there is still room for errors in this process. The point is that po4a is unable to understand the text and make sure that the translation matches the original. That's why all strings are marked as ``fuzzy'' in the process. You should check each of them carefully before removing those markers.
Often the document structures don't match exactly, preventing po4a-gettextize from doing its job properly. At that point, the whole game is about editing the files to get their damn structures to match.
It may help to read the section Gettextization: how does it work? below. Understanding the internal process will help you make this work. The good point is that po4a-gettextize is rather verbose about what went wrong when it happens. First, it pinpoints where the structural discrepancies between the documents are. You will see which strings don't match, their positions in the text, and the type of each of them. Moreover, the PO file generated so far will be dumped to gettextization.failed.po.
Likewise, two paragraphs may get merged together in POD when the separating line contains some spaces, or when there is no empty line between the =item line and the content of the item.
This unfortunate situation happens when the same paragraph is repeated over the document. In that case, no new entry is created in the PO file, but a new reference is added to the existing one instead.
So, when the same paragraph appears twice in the original but the two occurrences are not translated in exactly the same way, you will get the feeling that a paragraph of the original disappeared. Just kill the new translation. If the second translation was actually better, kill the first one instead and put the second translation in its place.
On the contrary, if two similar but different paragraphs were translated in exactly the same way, you will get the feeling that a paragraph of the translation disappeared. A solution is to add a dummy string to the original paragraph (such as ``I'm different''). Don't be afraid: those things will disappear during the synchronization, and when the added text is short enough, gettext will match your translation to the existing text (marking it as fuzzy, but you don't really care, since all strings are fuzzy after gettextization).
Hopefully, these tips will help you make your gettextization work and obtain your precious PO file. You are now ready to synchronize your file and begin your translation. Please note that with a large text, the first synchronization may take a long time.
For example, the first po4a-updatepo of the Perl documentation's French translation (a 5.5 MB PO file) took about two full days on a 1 GHz G5 computer. Yes, 48 hours. But subsequent runs only take a dozen seconds on my old laptop. This is because the first time, most of the msgids in the PO file don't match any in the POT file. This forces gettext to search for the closest one using a costly string-proximity algorithm.
It may help comprehension to consider addenda as a sort of patch applied to the localized document after processing. They are rather different from usual patches (they have only one line of context, which can embed a Perl regular expression, and they can only add new text without removing any), but they serve the same purpose.
Their goal is to allow the translator to add extra content to the document which is not a translation of the original document. The most common usage is to add a section about the translation itself, listing contributors and explaining how to report bugs against the translation.
An addendum must be provided as a separate file. Its first line constitutes a header indicating where in the produced document it should be placed. The rest of the addendum file will be added verbatim at the determined position of the resulting document.
The header has a pretty rigid syntax: it must begin with the string PO4A-HEADER:, followed by a semicolon-separated (;) list of key=value fields. White spaces ARE important. Note that you cannot use the semicolon character (;) in a value, and that quoting it doesn't help.
Again, this sounds scary, but the examples given below should help you figure out how to write the header line you need. To illustrate the discussion, assume we want to add a section called ``About this translation'' after the ``About this document'' one.
Here are the possible header keys:
This line is called the position point in the following. The point where the addendum is added is called the insertion point. These two points are close to each other, but not identical. For example, if you want to insert a new section, it is easier to put the position point on the title of the preceding section and tell po4a where the section ends (remember that the position point is given by a regexp which should match a unique line).
The localization of the insertion point with regard to the position point is controlled by the mode, beginboundary and endboundary fields, as explained below.
In our case, we would have:
position=<title>About this document</title>
Since we want the new section to be placed below the one we are matching, we have:
mode=after
When mode=after, the insertion point is after the position point, but not directly after! It is placed at the end of the section beginning at the position point, i.e., just before the line matched by a beginboundary argument, or just after the line matched by an endboundary argument.
In our case, we can choose to indicate the end of the section we match by adding:
endboundary=</section>
or to indicate the beginning of the next section by indicating:
beginboundary=<section>
In both cases, our addendum will be placed after the </section> and before the <section>. The first one is better, since it will keep working even if the document gets reorganized.
Both forms exist because documentation formats differ. Some of them have a way to mark the end of a section (just like the </section> we just used), while others don't explicitly mark the end of a section (as in man). In the former case, you want to make a boundary match the end of a section, so that the insertion point comes after it. In the latter case, you want to make a boundary match the beginning of the next section, so that the insertion point comes just before it.
This can seem obscure, but hopefully, the next examples will enlighten you.
 PO4A-HEADER: mode=after; position=About this document; endboundary=</section>
 PO4A-HEADER: mode=after; position=About this document; beginboundary=<section>
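Putting this together, a complete addendum file for our example could look as follows. The section body shown here is a hypothetical placeholder; write whatever content your translated document needs:

 PO4A-HEADER: mode=after; position=<title>About this document</title>; endboundary=</section>

 <section>
 <title>About this translation</title>
 <para>
 This document was translated by ...; please report translation bugs to ...
 </para>
 </section>

Everything after the PO4A-HEADER line is copied verbatim to the computed insertion point of the produced document.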
.SH "AUTHORS"
you should put a position matching this line, and a beginboundary matching the beginning of the next section (i.e., ^\.SH). The addendum will then be added after the position point and immediately before the first line matching the beginboundary. That is to say:
PO4A-HEADER:mode=after;position=AUTHORS;beginboundary=\.SH
PO4A-HEADER:mode=after;position=Copyright Big Dude, 2004;beginboundary=^
PO4A-HEADER:mode=after;position=<title>About</title>;beginboundary=FakePo4aBoundary
In any case, remember that these are regular expressions. For example, if you want to match the end of an nroff section ending with the line
.fi
don't use .fi as endboundary, because it will match with ``the[ fi]le'', which is obviously not what you expect. The correct endboundary in that case is: ^\.fi$.
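To see the pitfall concretely, here is a quick check of the two patterns (Python is used here purely for illustration; the regexp semantics are the same as in po4a's Perl engine):

```python
import re

# An ordinary prose line from the man page body.
line = "the file is closed"

# Unanchored ".fi": the dot matches any character, so "the file"
# contains a match (" fi" preceded by 'e' also qualifies).
assert re.search(r".fi", line) is not None

# Anchored "^\.fi$": only a line consisting of the literal
# roff request ".fi" matches.
assert re.search(r"^\.fi$", line) is None
assert re.search(r"^\.fi$", ".fi") is not None
```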
If the addendum doesn't go where you expected, try passing the -vv argument to the tools, so that they explain what they are doing while placing the addendum.
More detailed example
Original document (POD formatted):
 |=head1 NAME
 |
 |dummy - a dummy program
 |
 |=head1 AUTHOR
 |
 |me
Then, the following addendum will ensure that a section (in French) about the translator is added at the end of the file. (In French, ``TRADUCTEUR'' means ``TRANSLATOR'', and ``moi'' means ``me''.)
 |PO4A-HEADER:mode=after;position=AUTEUR;beginboundary=^=head
 |
 |=head1 TRADUCTEUR
 |
 |moi
In order to put your addendum before the AUTHOR, use the following header:
PO4A-HEADER:mode=after;position=NOM;beginboundary=^=head1
This works because the next line matching the beginboundary /^=head1/ after the section ``NAME'' (translated to ``NOM'' in French), is the one declaring the authors. So, the addendum will be put between both sections.
The po4a(1) program was designed to solve those difficulties. Once your project is converted to the system, you write a simple configuration file explaining where your translation files are (PO and POT), where the original documents are, their formats and where their translations should be placed.
Then, running po4a(1) on this file ensures that the PO files are synchronized against the original documents, and that the translated documents are generated properly. Of course, you will want to call this program twice: once before editing the PO files, to update them, and once afterward, to get completely updated translated documents. But you only need to remember one command line.
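As an illustration, a minimal configuration file might look like this (the paths, language codes, and file names are made up for the example; see po4a(1) for the exact syntax):

 [po4a_langs] fr de
 [po4a_paths] doc/po/doc.pot $lang:doc/po/$lang.po
 [type: pod] script.pl $lang:doc/translated/$lang/script.pl

The first two lines declare the supported languages and where the POT and PO files live; each [type: ...] line names a master document, its format, and where the generated translations should be written, with $lang expanded for every declared language.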
It is also possible to customize a module, or to use new / derivative / modified modules, by putting a module in lib/Locale/Po4a/ and adding lib to the paths specified by the PERLLIB or PERL5LIB environment variable. For example:
PERLLIB=$PWD/lib po4a --previous po4a/po4a.cfg
Note: the actual name of the lib directory is not important.
More formally, it takes a document to translate plus a PO file containing the translations to use as input, and produces two separate outputs: another PO file (resulting from the extraction of the translatable strings of the input document), and a translated document (with the same structure as the input one, but with all translatable strings replaced by content from the input PO). Here is a graphical representation of this:
   Input document --\                             /---> Output document
                     \      TransTractor::       /       (translated)
                      +-->--   parse()  --------+
                     /                           \
   Input PO --------/                             \---> Output PO
                                                          (extracted)
This little bone is the core of the whole po4a architecture. If you omit the input PO and the output document, you get po4a-gettextize. If you provide both inputs and disregard the output PO, you get po4a-translate.
TransTractor::parse() is a virtual function implemented by each module. Here is a little example to show you how it works. It parses a list of paragraphs, each of them beginning with <p>.
  1 sub parse {
  2   PARAGRAPH: while (1) {
  3     my ($paragraph,$pararef,$line,$lref)=("","","","");
  4     my $first=1;
  5     while ((($line,$lref)=$document->shiftline()) && defined($line)) {
  6       if ($line =~ m/<p>/ && !$first--) {
  7         $document->unshiftline($line,$lref);
  8
  9         $paragraph =~ s/^<p>//s;
 10         $document->pushline("<p>".$document->translate($paragraph,$pararef));
 11
 12         next PARAGRAPH;
 13       } else {
 14         $paragraph .= $line;
 15         $pararef = $lref unless(length($pararef));
 16       }
 17     }
 18     return; # Did not get a defined line? End of input file.
 19   }
 20 }
On line 6, we encounter <p> for the second time. That's the signal of the next paragraph. We should thus put the just-obtained line back into the original document (line 7) and push the paragraph built so far into the output. After removing its leading <p> on line 9, we push the concatenation of this tag with the translation of the rest of the paragraph (line 10).
This translate() function is very cool. It pushes its argument into the output PO file (extraction) and returns its translation as found in the input PO file (translation). Since it's used as part of the argument of pushline(), this translation lands into the output document.
Isn't that cool? It is possible to build a complete po4a module in less than 20 lines when the format is simple enough...
You can learn more about this in Locale::Po4a::TransTractor(3pm).
     Original         Translation

   chapter            chapter
     paragraph          paragraph
     paragraph          paragraph
     paragraph
   chapter            chapter
     paragraph          paragraph
     paragraph
For that, po4a parsers are run on both the original and the translation files to extract PO files, and then a third PO file is built from them, taking strings from the second as translations of strings from the first. In order to check that the strings we put together are actually translations of each other, document parsers in po4a should record the syntactical type of each extracted string (all existing parsers do so; yours should too). This information is then used to make sure that both documents have the same syntax. In the example above, it allows us to detect that string 4 is a paragraph in one case and a chapter title in the other, and to report the problem.
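The pairing logic can be sketched as follows. This is a simplified illustration in Python, not po4a's actual Perl implementation, and the type names and function are invented for the example:

```python
def pair_translations(original, translation):
    """Pair (type, text) tuples from both documents, position by position.

    Like po4a-gettextize, this fails verbosely on the first structure
    mismatch instead of trying to resynchronize.
    """
    if len(original) != len(translation):
        raise ValueError("structure mismatch: documents have different lengths")
    po = {}
    for i, ((otype, otext), (ttype, ttext)) in enumerate(zip(original, translation), 1):
        if otype != ttype:
            raise ValueError(f"structure mismatch at string {i}: {otype} vs {ttype}")
        po[otext] = ttext  # msgid -> msgstr; real po4a marks these fuzzy for review
    return po

original    = [("chapter", "Intro"), ("paragraph", "Hello.")]
translation = [("chapter", "Introduction"), ("paragraph", "Bonjour.")]
assert pair_translations(original, translation) == {
    "Intro": "Introduction", "Hello.": "Bonjour.",
}
```

A mismatch such as a paragraph facing a chapter title raises an error naming the offending string, which mirrors the verbose failure described above.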
In theory, it would be possible to detect the problem and resynchronize the files afterward (just like diff does). But what to do with the few strings before the desynchronization point is not clear, and it would sometimes produce bad results. That's why the current implementation doesn't try to resynchronize anything; instead, it fails verbosely when something goes wrong, requiring manual modification of the files to fix the problem.
Even with these precautions, things can go wrong very easily here. That's why all translations guessed this way are marked fuzzy to make sure that the translator reviews and checks them.
Even with these advantages, some people don't like the idea of translating each paragraph separately. Here are some of the answers I can give to their fears:
Paragraphs are by definition longer than sentences. This will hopefully ensure that the same paragraph in two documents has the same meaning (and translation), regardless of the context in each case.
Splitting into parts smaller than sentences would be very bad. It would take a bit long to explain why here, but interested readers can refer to the Locale::Maketext::TPJ13(3pm) man page (which comes with the Perl documentation), for example. In short, each language has its own specific syntactic rules, and there is no way to build sentences by aggregating sentence fragments that works for all existing languages (or even for 5 of the 10 most spoken ones, or even fewer).
That's why the debconf developers decided to implement another solution, where the translations are placed in the same file as the original. This is rather appealing. One might even want to do this for XML, for example. It would look like this:
 <section>
  <title lang="en">My title</title>
  <title lang="fr">Mon titre</title>

  <para>
   <text lang="en">My text.</text>
   <text lang="fr">Mon texte.</text>
  </para>
 </section>
But it was so problematic that a PO-based approach is now used instead. Only the original can be edited in the file, and the translations must be made in PO files extracted from the master template (and placed back at package compilation time). The old system was deprecated because of several issues:
If several translators provide a patch at the same time, it gets hard to merge them together.
How will you detect changes to the original, which need to be applied to the translations? In order to use diff, you have to note which version of the original you translated. I.e., you need a PO file in your file ;)
This solution is viable when only European languages are involved, but the introduction of Korean, Russian and/or Arabic really complicates the picture. UTF could be a solution, but there are still some problems with it.
Moreover, such problems are hard to detect (i.e., only Korean readers will notice that the encoding of the Korean text is broken [because of the Russian translator]).
gettext solves all those problems together.
It can only handle XML, and only a particular DTD. I'm quite unhappy with the handling of lists, which end up in one big msgid. When the list becomes big, the chunk becomes harder to swallow.
The main advantages of po4a over them are the ease of adding extra content (which is even more of a problem there) and the ability to achieve gettextization.
Another important point is that each translated file begins with a short comment indicating what the file is and how to use it. This should help the poor developers, flooded with tons of files in languages they hardly speak, to deal with them correctly.
In the po4a project, translated documents are not source files anymore. Since SGML files are habitually source files, this is an easy mistake to make. That's why all files present this header:
 |*****************************************************
 |*           GENERATED FILE, DO NOT EDIT             *
 |* THIS IS NO SOURCE FILE, BUT RESULT OF COMPILATION *
 |*****************************************************
 |
 |This file was generated by po4a-translate(1). Do not store it (in VCS,
 |for example), but store the PO file used as source file by po4a-translate.
 |
 |In fact, consider this as a binary, and the PO file as a regular source file:
 |If the PO gets lost, keeping this translation up-to-date will be harder ;)
Likewise, gettext's regular PO files only need to be copied to the po/ directory. But this is not the case for the ones manipulated by po4a. The major risk here is that a developer erases the existing translation of his program with the translation of his documentation. (The two can't be stored in the same PO file, because the program needs to install its translation as an mo file while the documentation only uses its translation at compile time.) That's why the PO files produced by the po-debiandoc module contain the following header:
 #
 #  ADVISES TO DEVELOPERS:
 #    - you do not need to manually edit POT or PO files.
 #    - this file contains the translation of your debconf templates.
 #      Do not replace the translation of your program with this !!
 #        (or your translators will get very upset)
 #
 #  ADVISES TO TRANSLATORS:
 #    If you are not familiar with the PO format, gettext documentation
 #     is worth reading, especially sections dedicated to this format.
 #    For example, run:
 #         info -n '(gettext)PO Files'
 #         info -n '(gettext)Header Entry'
 #
 #    Some information specific to po-debconf are available at
 #            /usr/share/doc/po-debconf/README-trans
 #     or http://www.debian.org/intl/l10n/po-debconf/README-trans
 #
- http://kv-53.narod.ru/kaider1.png
- http://www.debian.org/intl/l10n/
But everything isn't green, and this approach also has some disadvantages we have to deal with.
One of my dreams would be to somehow integrate po4a into Gtranslator or Lokalize. When an SGML file is opened, the strings would be automatically extracted; when it is saved, a translated SGML file could be written to disk. If we manage to do an MS Word (TM) module (or at least an RTF one), professional translators may even use it.
 Denis Barbier <barbier,linuxfr.org>
 Martin Quinson (mquinson#debian.org)