Jump to content

Line wrap and word wrap: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Alphanis (talk | contribs)
m 'kinsoku shori' only means Japanese rules.
Line 25: Line 25:
Most existing word processors and [[typesetting]] software cannot handle either of the above scenarios.
Most existing word processors and [[typesetting]] software cannot handle either of the above scenarios.


[[CJK]] punctuation may or may not follow rules similar to the above-mentioned special circumstances; such rules are usually referred to by the Japanese term [[kinsoku shori]] ([[:ja:禁則処理|禁則処理]], literally “prohibition rule handling”).
[[CJK]] punctuation may or may not follow rules similar to the above-mentioned special circumstances. It is up to [[Line breaking rules in East Asian language|line breaking rules in CJK]].


A special case of kinsoku shori, however, always applies: line wrap must never occur inside the CJK dash and ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing [[character encoding]]s, each of these are intrinsically a single punctuation mark that is two [[em (typography)|em]]s wide, not two one-em-wide punctuation marks.
A special case of line breaking rules in CJK, however, always applies: line wrap must never occur inside the CJK dash and ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing [[character encoding]]s, each of these are intrinsically a single punctuation mark that is two [[em (typography)|em]]s wide, not two one-em-wide punctuation marks.


==Algorithm==
==Algorithm==

Revision as of 10:45, 29 October 2010

In text display, line wrap is the feature of continuing on a new line when a line is full, such that each line fits in the viewable window, allowing text to be read from top to bottom without any horizontal scrolling.

Word wrap is the additional feature of most text editors, word processors, and web browsers, of breaking lines between and not within words, except when a single word is longer than a line.

It is usually done on the fly when viewing or printing a document, so no line break code is manually entered, or stored.[citation needed] If the user changes the margins, the editor will either automatically reposition the line breaks to ensure that all the text will "flow" within the margins and remain visible, or provide the typist some convenient way to reposition the line breaks.

A soft return is the break resulting from line wrap or word wrap, whereas a hard return is an intentional break, creating a new paragraph.

Word boundaries, hyphenation, and hard spaces

The soft returns are usually placed after the ends of complete words, or after the punctuation that follows complete words. However, word wrap may also occur following a hyphen.

Word wrap following hyphens is sometimes not desired, and can be avoided by using a so-called non-breaking hyphen instead of a regular hyphen. On the other hand, when using word processors, invisible hyphens, called soft hyphens, can also be inserted inside words so that word wrap can occur following the soft hyphens.

Sometimes, word wrap is not desirable between words. In such cases, word wrap can usually be avoided by using a hard space or non-breaking space between the words, instead of regular spaces.

Word wrapping in text containing Chinese, Japanese, and Korean

In Chinese, Japanese, and Korean, each Han character is normally considered a word, and therefore word wrapping can usually occur before and after any Han character.

Under certain circumstances, however, word wrapping is not desired. For instance,

  • word wrapping might not be desired within personal names, and
  • word wrapping might not be desired within any compound words (when the text is flush left but only in some styles).

Most existing word processors and typesetting software cannot handle either of the above scenarios.

CJK punctuation may or may not follow rules similar to the above-mentioned special circumstances. It is up to line breaking rules in CJK.

A special case of line breaking rules in CJK, however, always applies: line wrap must never occur inside the CJK dash and ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing character encodings, each of these are intrinsically a single punctuation mark that is two ems wide, not two one-em-wide punctuation marks.

Algorithm

Word wrapping is an optimization problem. Depending on what needs to be optimized for, different algorithms are used.

Minimum length

A simple way to do word wrapping is to use a greedy algorithm that puts as many words on a line as possible, then moving on to the next line to do the same until there are no more words left to place. This method is used by many modern word processors, such as Open Office and Microsoft Word. This algorithm is optimal in that it always puts the text on the minimum number of lines. The following pseudocode implements this algorithm:

SpaceLeft := LineWidth
for each Word in Text
    if Width(Word) > SpaceLeft
        insert line break before Word in Text
        SpaceLeft := LineWidth - Width(Word)
    else
        SpaceLeft := SpaceLeft - (Width(Word) + SpaceWidth)

Where LineWidth is the width of a line, SpaceLeft is the remaining width of space on the line to fill, SpaceWidth is the width of a single space character, Text is the input text to iterate over and Word is a word in this text.

Minimum raggedness

A different algorithm, used in TeX, minimizes the square of the space at the end of lines to produce a more aesthetically pleasing result. The algorithm above is not optimal with respect to this, as the following example demonstrates:

aaa bb cc ddddd 

If the cost function of a line is defined by the remaining space squared, the greedy algorithm would yield a sub-optimal solution for the problem (for simplicity, consider a fixed-width font):

------    Line width: 6
aaa bb    Remaining space: 0 (cost = 0 squared = 0)
cc        Remaining space: 4 (cost = 4 squared = 16)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

Summing to a total cost of 17, while an optimal solution would look like this:

------    Line width: 6
aaa       Remaining space: 3 (cost = 3 squared = 9)
bb cc     Remaining space: 1 (cost = 1 squared = 1)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

The difference here is that the first line is broken before bb instead of after it, yielding a better right margin and a lower cost 11.

To solve the problem we need to define a cost function that computes the cost of a line consisting of the words to from the text:

Where typically is oder . There are some special cases to consider: If the result is negative (that is, the sequence of words cannot fit on a line), the cost needs to reflect the cost of tracking or condensing the text to fit; if that is not possible, it needs to return .

The cost of the optimal solution can be defined as a recurrence:

This can be efficiently implemented using dynamic programming, for a time and space complexity of .

See also

Knuth's algorithm