English-stripping

From Dreamwidth Notes
Revision as of 19:31, 8 April 2012 by Pauamma

English-stripping a page is the process of taking hardcoded English text out of a BML page, giving each piece of text an ID you can use to refer to it, and then putting the original English text in a lookup file. In this way, you're stripping the BML files of any English text, hence the name. This makes it easy to support multiple languages: the text for each language is held in the database and can be looked up by that ID.

Although Dreamwidth Studios itself won't be supporting any language other than English, it's still important to learn how to English-strip pages. It means our Site Copy team can change text on the site as necessary without having to go through the code, and it lets other users of the code implement other languages with the minimum of hassle. (For both of these reasons, we're also going to be replacing the current translation system with something better - although to be perfectly honest, that's not going to be too hard.)

Glossary

First, a bit of explanation about some of the terms we're going to use:

  • String: this refers to a piece of text. For example, this sentence can be considered a string. We'll normally use this when referring to the text that a multi-language ID refers to (defined below).

The Anatomy of a Multi-Language ID

The IDs that replace English text in a BML page are referred to as Multi-Language IDs. There are two types of ID - global IDs (which can be used by any BML page) and page-specific IDs (which are only valid on one page). When you English-strip a file, you will almost always be using page-specific IDs, but it's helpful to know about global IDs anyway.

Global IDs

Global IDs are, for the most part, defined in bin/upgrading/en.dat. However, for features specific to Dreamwidth Studios (and unusable by any other site using our code), any corresponding global IDs will be defined in bin/upgrading/en_DW.dat instead. For example, the Tropospherical sitescheme strings are stored in en_DW.dat since Tropospherical is specific to DWS, and because the strings appear in every page, page-specific IDs can't be used.

A global ID looks something like this:

date.month.december.short

You'll notice this ID is split into several parts with dots. This helps show precisely how the string is being used; ideally, each part should be a subset of the part before it in some way. In this example, month is part of date, december is a month, and short means the short version of the month's name. (In this example, that's "Dec" in the English text; there's a corresponding long version too, which is simply "December".)

Each section name should be lower-case and use only letters, digits, and the underscore and hyphen characters. (There aren't actually any set rules for the characters you can use in IDs in the code, but this is how it's been done so far.) The number of sections in an ID is arbitrary, as are the section names themselves. However, you should always have at least two sections in an ID for ease of use.

Page-specific IDs

Page-specific IDs are defined in a file with the same name as the page they apply to, plus the additional extension .text. For example, for a page htdocs/login.bml, the corresponding page-specific ID file will be htdocs/login.bml.text.

Page-specific IDs begin with a dot, and thereafter follow the same rules as global IDs. For example, one of the strings in the htdocs/login.bml.text file in dw-free has this ID:

.createaccount.header

Generally, in page-specific IDs, the names you'll use for your sections will correspond to the sections of the page in question. So this ID, for example, refers to the header of the section that invites the user to create an account if they don't already have one.

Again, the actual names and number of sections are arbitrary, but your IDs should always follow the structural flow of the page's content for ease of use.
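As a rough illustration of the naming convention, here is a minimal sketch that checks whether an ID follows it. The function name looks_like_ml_id is made up for this example and is not part of the Dreamwidth code; as noted above, the code itself doesn't enforce these rules, so treat this as a convention check only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Dreamwidth codebase): checks an ID
# against the naming convention described above.
sub looks_like_ml_id {
    my ($id) = @_;

    # One section: lower-case letters, digits, underscores and hyphens.
    my $section = qr/[a-z0-9_-]+/;

    # Optional leading dot (page-specific ID), then at least two
    # dot-separated sections.
    return $id =~ /\A\.?$section(?:\.$section)+\z/ ? 1 : 0;
}

for my $id ( "date.month.december.short", ".createaccount.header", ".Header" ) {
    printf "%-28s %s\n", $id, looks_like_ml_id($id) ? "follows the convention" : "does not";
}
```

The last ID fails on two counts: it uses an upper-case letter, and it has only one section.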

The Anatomy of a .text File

There isn't too much to learn about how a .text file works - it's pretty straightforward. For each ID referenced in the page, you put the name of the ID, an equals sign (=), then the English text stripped from the file. (We'll talk about how precisely to do that in the next few sections.) Ideally, you should have one string correspond to one unbroken line of English. This doesn't mean just one sentence - it's perfectly valid to have whole paragraphs under one ID. Just make sure you don't have any HTML in a string, unless it's part of a sentence (i.e., don't include wrapping <p> tags, etc.).

For example, the .createaccount.header page-specific ID referred to in the last section is defined in the .text file like so:

.createaccount.header=Not a <?sitename?> member?

(The <?sitename?> part of this is a BML tag; for more information on these, see the linked page.)

It's possible to have a multi-line string in a .text file. You should never need to do this in a page-specific ID, but if you do, you simply replace the equals sign with two less-than signs (<<), and end the string with a dot on its own line. For example, here's the definition for the global ID email.invitecoderequest.accept.body:

email.invitecoderequest.accept.body<<
Your request for invites has been granted. You can view all your invite codes here:

  [[invitesurl]]
.

All IDs should be listed in alphabetical order, if possible.
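To make the two forms concrete, here is a minimal sketch of reading them into a Perl hash. This is not the loader Dreamwidth itself uses; parse_text_file is a name invented for this example, and it handles only the single-line (=) and multi-line (<<) forms described above.

```perl
use strict;
use warnings;

# Illustrative parser for the .text layout described above. This is a
# sketch, not the loader Dreamwidth itself uses.
sub parse_text_file {
    my ($content) = @_;
    my %strings;
    my @lines = split /\n/, $content;

    while (@lines) {
        my $line = shift @lines;

        if ( $line =~ /\A([^=\s]+)<<\z/ ) {
            # Multi-line string: collect lines until a dot on its own line.
            my $id = $1;
            my @body;
            while (@lines) {
                my $next = shift @lines;
                last if $next eq '.';
                push @body, $next;
            }
            $strings{$id} = join "\n", @body;
        }
        elsif ( $line =~ /\A([^=\s]+)=(.*)\z/ ) {
            # Single-line string: ID, equals sign, text.
            $strings{$1} = $2;
        }
    }
    return \%strings;
}

my $sample = <<'END';
.createaccount.header=Not a <?sitename?> member?
email.invitecoderequest.accept.body<<
Your request for invites has been granted. You can view all your invite codes here:

  [[invitesurl]]
.
END

my $strings = parse_text_file($sample);
print "$_ => $strings->{$_}\n" for sort keys %$strings;
```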

English-stripping

You might think, after learning the above, that English-stripping a page is fairly easy - and in theory, it is. In practice, however, you need to know at least something about how both Perl and HTML work.

How to English-strip: The Theory

In theory, English-stripping a page in BML is easy. BML has specific tags for English-stripping, which means that in an ordinary BML page you would be able to follow a simple set of steps:

  • Identify the text to be stripped. For example, you may have a line in your BML page that says:
        <p>Enter your invite code below:</p>
    Here, the English text to be stripped is "Enter your invite code below:".
  • Cut this text out and replace it with an <?_ml ... _ml?> tag, where "..." represents a multi-language ID, as described above. Remember, page-specific IDs always begin with a dot. For example, this one might be .createaccount.enter_invite_code, in which case your replaced line will be:
        <p><?_ml .createaccount.enter_invite_code _ml?></p>
  • Put the line into the corresponding .text file, as described above:
        .createaccount.enter_invite_code=Enter your invite code below:
  • You're done. Rinse and repeat.

How to English-strip: The Reality

Unfortunately, English-stripping is rarely as easy as the theory goes, for several reasons:

  • Most of the Dreamwidth BML files are actually glorified Perl scripts and have virtually nothing in them that isn't Perl code of some description, and the above may not work.
  • Even when this isn't the case, sometimes the text includes parts whose value you can't know when you're doing the stripping - for example, the username of the currently logged-in user. This isn't possible using the <?_ml ... _ml?> tag, and needs to be done with Perl.
  • Some things that look like English text that should be stripped are actually signals to the code to take a certain action. For example, you can safely assume that any value within a hidden INPUT tag is probably not one that should be stripped, regardless of how English it looks. (You can always ask one of our resident code gurus in IRC if you're not sure, though.)

So it's pretty much assured that in order to be able to English-strip, you need to know a little about how Perl works. (Not too much!) I'll go over these basics here.

A basic example

In Perl, literal text strings (that is, text which is mostly left unchanged) are represented by surrounding quote marks. For example:

"Enter your invite code below:\n"

The "\n" in this example is called a 'newline', and signals to Perl that it should start a new line when it encounters it. (This won't appear as a new line in a browser unless a tag like <p> or <br> is used, but can be helpful to keep the source tidy.)

The string itself may be surrounded on the same line by other Perl code, such as:

$ret .= "Enter your invite code below:\n";

In these examples, the string is the part between the quote marks. Your aim here is to get this string English-stripped.

Now, in cases like this, where the entire string is English text and is an unbroken line, you can normally actually do this according to the theory:

$ret .= "<?_ml .createaccount.enter_invite_code _ml?>\n";


Be sure to keep the quote marks and newline around it; these are important, and you shouldn't add any of these to your string in the .text file. Then, you can add the line to the .text file as described in the theory example, and you're done.

(Note that while this is generally okay, there are some cases where it may not work. In those cases, you should use the Perl function described in the next section instead.)

The BML::ml Perl function

If you try the above on a piece of code and it doesn't work, you may need to use the BML::ml Perl function instead. This performs much the same task as the <?_ml ... _ml?> tag, but in Perl.

The way you'd use the function for the above string is as follows:

BML::ml( ".createaccount.enter_invite_code" )

The function itself isn't used in a literal Perl string, so it doesn't need quotes around it. However, the newline "\n" character *does* need to be in a literal string with quotes around it, which means you need to combine the two. This is done using a dot - . - which is how you tell Perl to combine a literal string and something else. This is how it would end up:

$ret .= BML::ml( ".createaccount.enter_invite_code" ) . "\n";

Unknown data

Sometimes, there will be parts of a string which contain information that you can't specifically know when you're English-stripping, such as the username of the logged-in user. For example, the code might say something like:

$ret .= "$u->{user}, enter your $LJ::SITENAMESHORT invite code:\n";

In the actual HTML output, this might look something like:

sophie, enter your Dreamwidth invite code:

Even though the "$u->{user}" and "$LJ::SITENAMESHORT" parts in this example sit inside the quote marks, you can tell they're pieces of data by the dollar sign; anything in a string that starts with a dollar sign is data that needs to be kept somehow. You don't need to understand what the names mean in order to English-strip them, just the way they're used. (Of course, if you do understand them, it'll be easier to give them meaningful labels.)

(Yes, the above example is rather contrived, since if you're logged in you wouldn't be entering an invite code anyway. Just bear with me on this one; it's only an example.)

In order to use a piece of data in a multi-language string, you need to assign it a label. For example, for the first piece, let's call it "username". Then, you take the data part exactly as written (including the dollar sign), and combine the two with =>:

username => $u->{user}

The above means that the label 'username' should have the value of whatever $u->{user} comes out to be.

If you have multiple pieces of data, as above, you can use commas to separate them; simply copy the above format and separate them with a comma. For example, let's assign the site name a label of "sitename":

username => $u->{user}, sitename => $LJ::SITENAMESHORT

You then surround the whole thing with braces:

{ username => $u->{user}, sitename => $LJ::SITENAMESHORT }

You then need to use the BML::ml function, described above, and in addition to giving it the multi-language ID, you need to also give it the data itself:

$ret .= BML::ml( ".createaccount.enter_invite_code",
                   { username => $u->{user}, sitename => $LJ::SITENAMESHORT }
               ) . "\n";

Note that in this example, I've split the statement across three lines. Perl is perfectly happy with this as long as you break in the right place - for example, not in the middle of a literal string. You could write the above as a single line if you wanted:

$ret .= BML::ml( ".createaccount.enter_invite_code", { username => $u->{user}, sitename => $LJ::SITENAMESHORT } ) . "\n";

...but it's not really very easy to read, so for this example I've split it.

The key thing to note here is that after the multi-language ID, I've put a comma, then pasted the notation we constructed above after it, while still inside the parentheses of the BML::ml() function. After that, we continue on as normal - the closing parenthesis, and the newline character.

We're now done for the Perl part of it, and we now need to add the text to the .text file. Fortunately, this is a lot easier; when you need to put data in a string, simply refer to its label surrounded by two square brackets. The above string would be represented in the .text file as follows:

.createaccount.enter_invite_code=[[username]], enter your [[sitename]] invite code:

We're done. Yay!
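If it helps to see how the labels and the [[...]] placeholders meet, here is a self-contained sketch. The ml_sketch function is a hypothetical stand-in written for this example, not the real BML::ml, and the %strings hash stands in for whatever the .text file would provide; only the substitution of labels is modelled.

```perl
use strict;
use warnings;

# Strings as they might look once loaded from a .text file.
my %strings = (
    '.createaccount.enter_invite_code' =>
        '[[username]], enter your [[sitename]] invite code:',
);

# Hypothetical stand-in for BML::ml, for illustration only.
sub ml_sketch {
    my ( $id, $vars ) = @_;
    $vars ||= {};

    my $text = $strings{$id};
    return "[missing string $id]" unless defined $text;

    # Replace each [[label]] with the value passed in for that label;
    # unknown labels are left as they are.
    $text =~ s/\[\[(\w+)\]\]/exists $vars->{$1} ? $vars->{$1} : "[[$1]]"/ge;
    return $text;
}

my $u        = { user => 'sophie' };
my $sitename = 'Dreamwidth';    # stands in for $LJ::SITENAMESHORT

my $ret = ml_sketch( '.createaccount.enter_invite_code',
                     { username => $u->{user}, sitename => $sitename }
                   ) . "\n";
print $ret;    # prints "sophie, enter your Dreamwidth invite code:"
```

Because the labels live in the string rather than in the code, the text can move [[username]] or [[sitename]] anywhere in the sentence without the Perl changing.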

Plurals and numbers

These are described in excruciating detail in [Embedding plural forms into translations], but here's a quick example:

$ret .= "You have $num message" . ( ( $num != 1 ) ? 's' : '' ) . " in your inbox.";

You could use a variable for the plural, like this:

$ret .= BML::ml( ".inbox.num_msgs",
                   { num => $num, plural => ( $num != 1 ) ? 's' : '' }
               );
.inbox.num_msgs=You have [[num]] message[[plural]] in your inbox.

However, this would still be baking English into the source code - not actual English text in this case, but English grammar in the form of singular and plural inflections. Instead, you can use:

$ret .= BML::ml( ".inbox.num_msgs", { num => $num } );
.inbox.num_msgs=You have [[num]] [[?num|message|messages]] in your inbox.

This takes care of applying the rules for English plurals for you, and lets translators (with help from some magic in LiveJournal and Dreamwidth source code) handle it appropriately by just specifying a text string for their language, without having to muck around in the Dreamwidth source code - which is, after all, the goal of the translation system.
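To show roughly what that expansion does, here is a sketch for the English case only. It assumes the plural forms are written as [[?label|singular|plural]], separated by | characters; expand_plurals_en is a name invented for this example and is not how the real code is structured, since the real code chooses the form per language.

```perl
use strict;
use warnings;

# Illustrative expansion of [[?label|singular|plural]] using the English
# rule only (1 is singular, everything else is plural). A sketch, not the
# actual implementation.
sub expand_plurals_en {
    my ( $text, $vars ) = @_;

    $text =~ s{\[\[\?(\w+)\|([^\]]*)\]\]}{
        my ( $label, $forms ) = ( $1, $2 );
        my @forms = split /\|/, $forms;
        my $n = $vars->{$label};
        ( defined $n && $n == 1 ) ? $forms[0] : $forms[-1];
    }ge;

    # Plain [[label]] substitution for the remaining placeholders.
    $text =~ s/\[\[(\w+)\]\]/defined $vars->{$1} ? $vars->{$1} : ''/ge;
    return $text;
}

my $string = 'You have [[num]] [[?num|message|messages]] in your inbox.';
print expand_plurals_en( $string, { num => $_ } ), "\n" for 1, 3;
```

A language with more than two plural forms would list more alternatives in its own string, which is why the choice belongs in the string and not in the Perl.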

Split strings

If you've been paying attention, you might notice that the original string above:

$ret .= "$u->{user}, enter your $LJ::SITENAMESHORT invite code:\n";

...could just as easily have been written like this:

$ret .= $u->{user} . ", enter your " . $LJ::SITENAMESHORT . " invite code:\n";

...that is, the actual data is no longer part of the literal text string, but joined. Perl will accept either way of doing it (with a few exceptions that aren't important here), and as such it's important to note that sometimes you'll come across examples of the latter sort. These are still all parts of the same string, and should be handled in exactly the same way as above. In particular, you should try to avoid having parts of sentences in a multi-language string. For example, you should not do this in your .text file:

.createaccount.enter_invite_code.part1=, enter your

.createaccount.enter_invite_code.part2= invite code:

This is not only because it looks ugly, but also because some (human) languages don't work that way; for example, in some languages the name of the person you're addressing may need to come at the end, after the site name. By doing it this way, that becomes impossible. We should be as flexible as possible when it comes to where different terms may come in a sentence.

Heredocs

Sometimes, you'll come across Perl constructs that look something like this:

$ret .= <<HTML;

...followed by a block of text that isn't Perl code, followed by an "HTML" on its own line (or whatever was after the << in the original line). This format of text is known in Perl parlance as a "heredoc". If you can replace any text in there with <?_ml ... _ml?> tags, and it works, you should do so. However, if any of it needs the BML::ml function, it's probably best to have someone who codes Perl look at it, since these require care to get right. (Of course, if you know Perl and know how to fix it, go ahead; otherwise, just make a note of it and move on.)

HTML forms

Sometimes, you'll come across text that needs to be stripped in HTML forms. In the following example, the contents of the <p> tag and the value of the submit button need to be English-stripped:

<form method="post" action="create.bml">
    <input type="hidden" name="mode" value="codesubmit">
    <p>Enter your invite code:</p>
    <input type="text" name="invite" size="20" maxlength="20">
    <input type="submit" value="Create Account">
</form>

Notice that we do *not* want to English-strip:

  • name="invite": The word "invite", although English, is being used here as a field name, and is not shown to the user. The code will be looking out for a field called "invite" no matter what language the user is using, so this must not be changed.
  • value="codesubmit": Same as above - this value is used by the code.

However, the value of a submit button (in this case, "Create Account") *is* shown to the user (as this is what's shown on the button itself), and thus needs to be English-stripped, despite being in a value attribute.

There is one exception to this. If you find a submit button with a "name" attribute, check to see if there are any more. If any two submit buttons have the same 'name' attribute, such as:

<input type="submit" name="action" value="Rename">
<input type="submit" name="action" value="Delete">

...then do not English-strip it, and make a note for whoever reviews your patch that this will need to be fixed. (This is because the code will be checking for the value of whatever button is clicked on, and that prohibits English-stripping from taking place; if you were to English-strip the value here, it would no longer work for non-English users.)

If you do not find two submit buttons with the same 'name' attribute, but there is nonetheless still a 'name' attribute on at least one, make a note for the reviewer that this is the case, but go ahead and English-strip it as normal. (The reviewer will check to see whether the code is actually checking for this value.)

Fin

That's basically the guide for how to English-strip a page. Don't be too afraid of messing things up; a reviewer will tell you if you have anything wrong.