Knuth linebreaking elements for Formatting Objects

Revision History
Revision 1.02006-04-01
First publication
Revision 1.12006-04-29
Small updates, together with a new release of the software
Revision 1.32006-05-03
Added box-penalty; added section on the basic building block; changed ‘generalized’ to ‘FO’
Revision 1.42006-05-27
Bells and whistles, and text-align implemented
Revision 1.52006-06-24
Text-align-last, a complex index, documentation of the implementation

Abstract

In this article I develop a new abstract representation of a paragraph of text for the purpose of line breaking. This reprepresentation is a reformulation of the abstract representation of Knuth and Plass, adapted to the features of FO texts.


Table of Contents

Introduction
The basic building block
Building blocks for FO texts
Text alignment
A superset of the original Knuth elements
Examples
A normal paragraph
A paragraph with borders at the start and end of each line
A paragraph consisting of a single non-breaking space
Spaces before and after a border
A complex index
The FO linebreaking algorithm
Changes compared to Knuth and Plass
Implementation
Bells and Whistles
The test class
The print representation of the test result
The complex index in the test class
Acknowledgements
Bibliography

The line breaking algorithm of Donald E. Knuth and Michael F. Plass [KP] takes as its input an abstract representation of the text. The building blocks of this abstract representation are box, glue and penalty elements. A word or part of a word is represented by a box, and a white space character or a series of white space characters is represented by a glue element. A glue element is a legal linebreak if it immediately follows a box. Linebreak opportunities other than white space are represented by penalties. Boxes have a width, glue elements have an elastic width, which consists of a width, a stretch and a shrink, and penalties have a penalty value and also a width. In addition to this abstract representation, Knuth and Plass formulated rules to determine the amount of demerits of a line. This amount is based on the penalty value of the break point, on the amount of stretching or shrinking of the line, on the class of the line compared to that of the surrounding lines, and on the number of consecutive hyphenated lines. With this representation and these rules, they were able to formulate a total-fit algorithm which determines very good line breaks for a paragraph of text.

The linebreaking algorithm of Knuth and Plass, together with its abstract representation of the text using its three building blocks, has been applied in the Formatting Object Processor FOP, generally with good results. FOP not only uses this algorithm to determine its linebreaks but also to determine its page breaks. However, a more refined line breaking process must take into account the precise requirements of the XSL-FO specification for whitespace treatment and of the Unicode Standard Annex 14 for linebreaking opportunities. Efforts to achieve this have run into difficulties.

FOP's developers have identified the following problem cases, which they could fit into the Knuth linebreak algorithm only with great difficulty or not at all:

  1. …<fo:inline border="…">some text </fo:inline> more … (note the spaces before and after the end of the inline element)

  2. <fo:block>&nbsp;</fo:block>

  3. table header and footer repetition around line breaks

  4. white-space-treatment policy (preserve/ignore-if-before/surrounding/after-linefeed) and suppress-at-line-break property

  5. letter spacing

  6. The Unicode Annex 14 (UAX#14) specifies that linebreaks come between characters. According to the XSL-FO specification by default white space is suppressed around line breaks. In the Knuth algorithm, linebreaks come at the space, and the space is suppressed. Normally, the two are equivalent, but not always. For example, when white-space-collapse="preserve", the two are not equivalent.

Most of these problem cases can be treated with specially crafted sequences of elements. But the suppression of white space before a line break turns out to be inachievable. Knuth's algorithm suppresses whitespace and penalties after a linebreak but not before. Indeed, this non-suppression of whitespace before a linebreak has been turned into a feature, upon which the procedure to obtain centered and ragged-left texts relies. For the usual suppression of white space around a linebreak, the algorithm relies on the fact that a consecutive series of white space characters can be represented as a single glue element, which is dropped when it is the selected linebreak. This condition is not always satisfied in the texts of a FO processor, see item 1 above.

Every linebreak can be represented by the triple of ‘pre-break text’, ‘post-break text’, ‘nobreak text’. Note that this is exactly TeX's \discretionary control sequence. A paragraph of text can be represented by a series of such triple elements, provided their texts do not contain other linebreak opportunities than the one they represent.

There are a few extreme cases in which the triple elements reduce to simpler elements:

Knuth and Plass designed their set of building blocks for the types of text they were dealing with: Western text with a single white space character between words. The XSL-FO specification addresses a much wider range of texts, among others non-Western texts. These texts come with new features: non-collapsible sequences of white space characters, suppression of characters before, after or around line breaks. For such texts a modified set of building blocks is required.

Here I propose such a set:

Penalties and box-penalties are legal breakpoints. Boxes are not.

An elastic width is a set of (width, stretch, shrink). In FO terms this would be (optimum, maximum, minimum), where stretch = maximum − optimum, shrink = optimum − minimum.

The actual treatment of boxes with suppress-at-linebreak=true is subject to the value of the policy property white-space-treatment. Its possible values are:

Penalties are treated like boxes with suppress-at-linebreak=true.

In the following text I will denote these building blocks for FO texts as FO building blocks or FO elements.

FO defines the usual four values for text alignment: start, center, end and justify. start and end are also known as ragged right or flush right and ragged left or flush left. These alignments can easily be produced by inserting penalties at every linebreak opportunity. For start we insert a penalty whose width before the linebreak is zero with a non-zero stretch value. For center we insert a penalty whose widths before and after the linebreak are zero with a non-zero stretch value. For end we insert a penalty whose width after the linebreak is zero with a non-zero stretch value. For justify we do nothing special. Some linebreak opportunities may already have a penalty for other reasons, for example a hyphen. Then the stretch value for text alignment has to be added to the widths of that penalty.

The stretch values are in principle arbitrary. Knuth and Plass use a value of 18 machine units = 1 em. For the stretch values in the center case, Knuth and Plass use the same value before and after the linebreak. I prefer to use half that value, so that the total stretch added to a line due to text alignment is the same in the three non-justified cases. This has the consequence that the linebreak calculations in the three cases are identical. Only the stretch is distributed differently.

FO allows the user to specify a different value for the alignment of the last line. If the user does not specify a value, then it is equal to the text alignment value. The case of justified text alignment is an exception. In this case the default value for the alignment of the last line is start, which corresponds to the usual incomplete last line.

A different value for the alignment of the last line cannot be represented in the abstract representation of the paragraph, because the building blocks have not yet been assigned to a line. It is possible to take the alignment of the last line into account in the algorithm, as follows. All linebreak opportunities have a penalty whose width after the linebreak contains a stretch component corresponding to the text alignment value. For the linebreak opportunity of the last line but one, the value of this stretch component needs to be changed to that corresponding to the value for the alignment of the last line. I store the difference between these two values in the final penalty as the last-line-additional-before-stretch member.

When in the main loop the current linebreak opportunity is the last linebreak opportunity, I use this value. For each active node that is considered in the inner loop, I increase the line length with this additional stretch value. This provides the linebreak calculation with the desired stretch for the last line. After the main loop, when the best node for the paragraph has been determined, I modify the penalty of its previous node, and add the additional stretch value to its width after the linebreak. In this way, the following typesetting phase sees an abstract representation that has all the correct stretch values.

There is one feature that the above trick cannot get right. In the justified case it is a good idea to use stretchable boxes for white space characters. In the non-justified case it is better to use fixed width boxes for them, because all the stretch is placed at the start and/or end of the line. The stretch of the white space character boxes in the last line corresponds to the setting of the text alignment of the body of the paragraph and not to that of the last line. As a consequence, this method will usually not find a linebreaking solution for non-justified paragraphs with a justified last line. In this case the last line will often not have any stretch, and unless it happens to fit exactly, it will be infinitely bad. In my opinion such combinations are unrealistic anyway.

It is possible to extend the trick so that also the stretch in the last line agrees with the text alignment for that line. To achieve that one assigns an alignment dependent expression to the stretch of white space characters. The linebreaking calculation must know the values of the text alignment of the body of the paragraph and of that for the last line. During the linebreak calculations the alignment dependent expression is evaluated according to whether the linebreak opportunity is the last one of the paragraph or not. I will not explore this complicated possibility.

When the parameters of the building blocks are made line-dependent, one should take care that the dynamic programming “principle of optimality” remains valid. This is the case here because we modify the parameters only in the last line.

The FO building blocks presented above form a superset of the original building blocks of Knuth and Plass. That means that any set of Knuth elements can be mapped to a set of our FO elements, with the same widths, elastic or not, and the same legal linebreak points and penalty values.

The mapping is as follows. We map each Knuth box to an FO box with the same width. The box has no stretch and shrink. We map each glue to an FO box with the same elastic width. The box's property suppress-at-linebreak is true. If the glue is a legal linebreak, that is, if it immediately follows a box, we insert a penalty before the corresponding FO box. The penalty has zero value and widths. Each Knuth penalty is mapped onto an FO penalty with the same penalty value. The width before the break is equal the original width. The width after the break is zero. Because the Knuth approach only suppresses glues and penalties after a linebreak, we use the whitespace treatment policy ignore-if-after-linefeed.

It is obvious that the above mapping produces the same legal linebreaks: Each glue that is a legal linebreak and each penalty are mapped onto an FO penalty, and no other FO penalties are generated. Each glue that is a legal linebreak is also mapped onto a suppressible FO box following an FO penalty. Therefore, if that FO penalty is a chosen linebreak, the following FO box is suppressed, which is equivalent to the dropping of the glue when it is a chosen linebreak.

After a linebreak a consecutive series of glues and penalties are dropped, up to the first Knuth box. Such a series is mapped onto a series of suppressible FO boxes and penalties. The glues do not generate FO penalties in the mapping, because none of them follows a box, and therefore none of them is a legal linebreak. Because our whitespace treatment policy is ignore-if-after-linefeed, all such suppressible FO boxes and penalties are suppressed, up to the first non-suppressible FO box, corresponding to the first Knuth box. This shows that also the suppression of elements around linebreaks is equivalent in the two cases.

This proves that any sequence of building blocks of Knuth and Plass is equivalent to a sequence of FO building blocks, and that the FO building blocks form a superset of the original building blocks. Our FO approach can therefore deal with any situation which can be dealt with using the approach of Knuth and Plass.

Before closing this section, I want to point out a different mapping between the two sets of building blocks. Its validity is limited to the case of a paragraph consisting of words and single white space characters. The words may be hyphenated. The alignment policy for the paragraph is justified. This is the most common case of western text. In this case the abstract representation of Knuth and Plass consists of a series of boxes interrupted by a penalty for a hyphen or a glue for a word space.

We map the boxes to FO boxes and the penalties to FO penalties, as before. However, now we map the glues (which are all legal linebreaks) to a suppressible FO box followed by an FO penalty. And our whitespace treatment policy is ignore-if-surrounding-linefeed. When a hyphenation penalty is a chosen linebreak, both cases are identical. When a glue is a chosen linebreak in the Knuth case, the corresponding FO penalty is the chosen linebreak in the FO case, and the preceding suppressible box is suppressed. This makes the behaviour in the two cases identical.

This shows that for default text, the Knuth case can also be mapped to the default whitespace treatment policy of XSL-FO and to linebreak opportunities following spaces, as prescribed in the Unicode Annex 14. It also shows that a Knuth glue can be thought of as a combination of a suppressible box for the space with a penalty for the linebreak opportunity. Only in special cases, such as centered text alignment, does the approach by Knuth and Plass deviate markedly from the XSL-FO and Unicode defaults.

The list of elements around the border is:

There is no linebreak opportunity before the border because it may be considered as a close operator in the sense of UAX#14. There is no linebreak opportunity after the border, because it is forbidden by the following space. If the penalty shown is chosen as a linebreak and the whitespace treatment policy is ignore-if-surrounding-linefeed or ignore-if-before-linefeed, both suppressible boxes shown will be suppressed. The line will end in a word followed by the border.

This is the only example that can really not be solved with the approach of Knuth and Plass. That is so because there is no way to turn the space before the border into a glue. Therefore it must be suppressed before a linebreak, which the approach of Knuth and Plass does not provide.

Under this title Knuth and Plass present their showcase, a complex index with precise typesetting requirements. Can the complex index be modelled with the FO building blocks? Above I have proven than any situation which can be modelled with the building blocks of Knuth and Plass, can be modelled with the FO building blocks. Therefore, the answer is yes, the complex index can be modelled by mapping Knuth and Plass' solution according to the above derived mapping recipe. But in this section I will model it by constructing a building block sequence that produces the same results in the breaking and non-breaking cases as the sequence of Knuth and Plass, with the default whitespace treatment policy ignore-if-surrounding-linefeed.

Each index entry represents a separate paragraph for the linebreaking algorithm. It consists of two parts, a names and a references part.

The names part is modelled as follows.

When the white space character is the chosen linebreak, its box is suppressed, and the penalty's width-before is inserted. These results agree with those of Knuth and Plass:

The references part is modelled as follows.

When the white space character is the chosen linebreak, its box is suppressed, and the penalty's width-after is inserted at the start of the next line. These results agree with those of Knuth and Plass:

The transition part is modelled as follows.

When none of the linebreak opportunities is the chosen linebreak, the penalties are dropped, and the result is as follows:

This is also the result of Knuth and Plass.

When the first linebreak opportunity is the chosen linebreak, the white space character is suppressed, and the penalty's width-before is inserted, and the result is as follows:

This is also the result of Knuth and Plass.

When the second linebreak opportunity is the chosen linebreak, the result is as follows:

This is also the result of Knuth and Plass.

The conclusion is that the complex index can easily be modelled in terms of the FO building blocks with the default white space treatment policy. The resulting sequences are simpler than those of Knuth and Plass. This is the result of the property that a penalty may contain a width after the linebreak.

Together with this essay I publish a simple implementation of the FO linebreaking algorithm. It acts on an abstract representation of a paragraph consisting of FO building blocks. It suppresses elements both before and after a linebreak, according to the white-space-treatment policy set on the paragraph. Its nodes are modified to satisfy the above tracking needs.

The implementation follows the algorithm of Knuth and Plass closely. It is, however, a completely new implementation, programmed in an object oriented style. Knuth and Plass describe their algorithm in great detail. Here I will highlight the main points.

Before I start the description of the algorithm, I will define a few terms.

The core of the algorithm consists of two nested loops. The main loop is a loop over the linebreak opportunities of the paragraph. For each linebreak opportunity, a loop is made over the nodes in the active list. For each active node, the line between the linebreak opportunity to which the node refers, and the current linebreak opportunity is evaluated. If the main loop has so far progressed through the paragraph that the line from the active node to the current linebreak opportunity has become too long and must be shrunken beyond its minimum length, the active node is deactivated. Otherwise, if the line from an active node to the current linebreak opportunity falls within the tolerance, a node is created for it. At the end of the nested loop all nodes created for this linebreak opportunity are compared to each other. The best node represents the best layout of the paragraph up to that linebreak opportunity. It is added to the active list.

When the main loop has reached the end node of the paragraph, the best node represents the best layout of the whole paragraph. It is the result of the linebreaking calculation. Through its reference to the previous node, i.e. the node describing the linebreak of the preceding line, and of all nodes to their previous nodes, the program is able to trace the linebreaks for all lines of the paragraph.

During the main loop it is possible that the active list becomes empty, because nodes are removed and insufficient new nodes are added. That means that the paragraph has no possible layout within the requested tolerance. In this situation my implementation gives up. Typesetting systems must define a strategy for such cases, e.g. they accept the last considered line although it exceeds the tolerance.

A node contains the following information:

There are three items before the linebreak, and three corresponding items after the linebreak. They hold calculated values so that these need not be recalculated each time they are needed. The last two items are needed by the ‘Bells and Whistles’ which will be described below.

The line classes in item 3 are called fitness classes. When two consecutive lines differ by more than one fitness class, they are considered contrasting lines and get an extra amount of demerits. The adoption of fitness classes makes that the inner loop no longer results in a single best node. There may be a best node for each fitness class. The algorithm applies a small optimization: If the minimum amount of demerits of a certain fitness class differs from the minimum amount of demerits of all classes by more than the extra amount of demerits for contrasting lines, the best node of that class will never be part of the best layout. Therefore it need not be remembered.

Item 2 adds some complexity to the algorithm. The inner loop now may only compare nodes for the same line number with each other. Therefore it must only run over active nodes with the same line number. It may result in a best node for each line number for which the linebreak opportunity is a feasible linebreak.

Knuth and Plass implement this by keeping the nodes in the active list ordered on line number. They let the inner loop run over a part of the active list whose nodes have the same line number. I have chosen a more explicit mechanism. The program holds a list of active lists. The inner loop runs over one active list. The best node found by the inner loop is added to the next active list. This inner loop is repeated for all active lists.

There are always a limited number of line lengths; let us call this number L. Because the lines with line number L and higher have the same length, their nodes may be compared with each other. Therefore we may put the active nodes with line number L − 1 and higher together.

The program holds a list of L active lists. The first list contains only the start node, which pseudo-ends line number −1. The following property holds: At any time during the process each active list contains only nodes with the same line number. At the start this is trivially true because it is true for the first active list. The algorithm adds to each active list only nodes whose line number is 1 higher than the line number of the preceding active list. This keeps the above property true. The last active list is an exception. In the last active list we collect the active nodes with line number L − 1 and higher.

When the main loop has reached the final penalty, all best nodes are compared together, whether we have reached line L or not. As before, the algorithm results in a single best node.

For item 4 the algorithm needs to know the line number of the final nodes. When it has determined the best node, it applies the required looseness by selecting the best node whose line number is ‘looseness’ higher or lower than that of the overall best node. Such a node is not always available, because there may not be a tolerable paragraph layout with that number of lines. Then the best node is selected whose line number is 1 closer to that of the overall best node, etc.

The implementation reuses the mechanism of variable line lengths. But there is no highest line number L. We keep the active nodes with different line numbers separate for all line numbers, and the number of active lists is unbounded. This ensures that we always select the best node for a certain line number, and in the end know the best final node for each line number.

The implementation contains a test class which can be used from the command line: java nl.leverkruid.spepping.gkplinebreaking.XMLTestCase file.xml. Here file.xml is an XML file which contains p elements in the no-name namespace. These p elements may have one or more of the following attributes:

  1. tolerance, with an integer value indicating the desired value for the tolerance. The default value is 5.

  2. white-space-treatment, with the values listed above. The default value is ignore-if-surrounding-linefeed. The value ignore is not implemented, because the test class does not suppress white space during building of the abstract representation.

  3. linewidth, with an integer value indicating the desired value for the linewidth. Each character gets a width of 1000. Therefore a linewidth of 50000 is equal to the width of 50 characters. If there is no linewidth attribute, the linewidths attribute is used. If that is also absent, the default value of 50000 is used.

  4. linewidths, with a space-separated list of integer values. The first value is the desired linewidth of the first line, and so on. The last value is used for all remaining lines. The linewidths attribute is not used if there is also a linewidth attribute.

  5. looseness, with an integer value indicating the desired number of lines by which the paragraph should be loosened (positive value) or tightened (negative value). The default value is 0. Due to Java, positive values must be entered without a plus sign.

  6. text-align, with the same values as the FO attribute of the same name. In the centered case, Knuth and Plass use the same stretch value as in the ragged-left and -right cases, but apply it both to the start and to the end of the line. I apply half that value to both ends. As a result, the stretch value in a line is the same in all three non-justified cases.

  7. text-align-last, with the same values as the FO attribute of the same name.

  8. ragged-stretch, with an integer value. Knuth and Plass use a value of 1 em, which corresponds to 3 times their width of a hyphen, for the stretch of the linebreak penalty for the non-justified cases. I use a similar default value, viz. 3 times 1000. But this attribute allows one to experiment with other values. A high value (up to Java's Integer.MAX_VALUE) produces a markedly different result.

The p elements may of course contain text. The text may contain span elements in the no-name namespace. These span elements may have three attributes:

The span elements may of course contain text. It is a limitation of the implementation of the test class that a span element should not be followed immediately by another span element.

Any other elements and attributes will be ignored. The p elements mimick fo:block elements, and the span elements mimick fo:inline elements in the fo:block element.

Linefeeds are treated according to the setting linefeed-treatment = "space". White space is treated according to the setting white-space-collapse = "false". white-space-treatment = "ignore" is not applied.

Linebreaks are determined by a LineInstance of class BreakIterator of the ICU library. The ICU4J library must be in the classpath. It can be obtained from http://icu.sourceforge.net/. Hyphenation must be inserted into the test file explicitly, by use of hard (U+2D) or soft (U+AD) hyphens. Hard hyphens will of course always be printed.

It is hoped that this arrangement allows easy testing of a wide range of texts.

The test class has additional provisions to allow index test cases similar to the complex index example. Each index entry consists of a names and a references element. It is good XML practice to embed a pair of these elements in an entry element, and to use an index element as the container of the entry elements. But the test class does not enforce that. It acts on each names and each references element. It will only produce good results when one names element is followed by a references element. A references element finishes the entry.

The names element recognizes two attributes: linewidth or linewidths, and tolerance, with the same values as for the paragraph element. These values are also applied to the following references element. The index example has a large number of parameters, but the test class has no provisions to modify them in the test file by attributes. All parameters are coded as static values of the test class.

Due to a special treatment of infinite stretch values in the width calculations, the test class is able to use a leader box with an infinite stretch. In the test print-out the leaders are represented by an ellipsis …. Note that there is no space between the leaders and the following text, but there is a linebreak opportunity at that point.

There are two possible ways to implement hanging indentation: by means of a width-after in the penalties, of by means of different line lengths. The test class does not implement the penalties method. The user has to indicate the desired amount of indentation by indicating a second linewidth which is smaller than the first by that amount. The test print-out does not know that the shorter linewidth is an indentation. The linebreak calculations, however, do use the desired line lenghts.

This work would not have been possible if I had not been a member of the development team of Apache FOP. Through the work of the team I learnt much about programming of digital printing. My work directly builds on the work of Luca Furini, who brought the linebreaking algorithm of Knuth and Plass to FOP, and provided its implementation both in linebreaking and in page breaking. I was much inspired by the work of Jeremias Märki, who implemented much of FOP, partially based on Luca's work. My work was most directly prompted by the work of Manuel Mall, who tirelessly insisted on the remaining problems in FOP's white space handling, and by that of Andreas Delmelle, who together with Manuel worked on a solution for these problems.

Finally, I owe many thanks to Donald E. Knuth and Michael F. Plass for their linebreaking algorithm. Their clear description of it made much of my implementation easy; I simply followed their rules.