Hyphens, minus, and dashes in Debian man pages

By Jonathan Corbet
October 23, 2023

It is probably fair to say that most Linux users spend little time thinking
about the troff typesetting program, despite that application’s
groundbreaking role in computing history. Troff (along with nroff) is
still with us, though, even if they are called groff these days, and every
now and then they make their presence known. A recent groff change created
a bit of a tempest within the Debian community, and has effectively been
reverted there. It all comes down to the question of what, exactly, is the
character used to mark command-line options on Unix systems?

Last July, Sven Joachim filed a
bug report regarding a change in groff, and in how it renders man pages
for terminals in particular. A change to the handling of the character
often referred to as “hyphen”, “minus”, or “dash” (“-”) made many
man pages rather harder to work with. To understand the problem, it’s
worth noting that Unicode provides a plethora of similar characters, some
of which are:

Name          Codepoint  Glyph
Hyphen-Minus  U+002D     -
Hyphen        U+2010     ‐
En Dash       U+2013     –
Em Dash       U+2014     —
Minus Sign    U+2212     −

There are many more — Unicode is nothing if not generous in this regard.
The term “dashes” will be used to refer to this class of glyphs here.

The specified behavior of groff is that an ASCII
“-” (Hyphen-Minus) in the input becomes a Hyphen in the output.
If the desire is to use Hyphen-Minus in the output, then the input should
use the sequence “\-” instead. If the author of a man page types
“--frobnicate” as an option name, the output will read
“‐‐frobnicate” (with Hyphen) rather than
“--frobnicate” (with Hyphen-Minus). The two look the same, but
there is a crucial difference. A user who searches for
“--frobnicate” in a man page will not find it if the wrong type of
dash is used and, if that user cuts-and-pastes an example with the wrong
dash, it will not work.

As an example, one can try pasting these two lines into a shell:

/usr/bin/echo --help
/usr/bin/echo ‐‐help

The results from one will be rather more helpful than from the other. Use
of the wrong type of dash can also break URLs and corrupt file names.
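
The difference is easy to confirm programmatically. The following Python sketch prints the code points behind two look-alike option strings and shows why a search for the typed form fails against the typeset form. The sample strings and the `describe` helper are illustrative assumptions, not taken from any real man page.

```python
import unicodedata

HYPHEN_MINUS = "\u002d"  # the character a keyboard produces and a shell expects
HYPHEN = "\u2010"        # the character the article says groff emits for a bare "-"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

typed = HYPHEN_MINUS * 2 + "help"   # what a user types or searches for
rendered = HYPHEN * 2 + "help"      # what the rendered man page contains

print(describe(typed[:1]))
print(describe(rendered[:1]))
print(typed == rendered)                                   # the strings differ
print(typed in f"Use {rendered} for usage information.")   # so the search misses
```

Both strings display almost identically in most terminal fonts, which is why the mismatch tends to go unnoticed until a search or a paste fails.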

Developers of free software are, of course, diligent about writing man
pages; they do the job promptly, take their time to get every detail right,
and can be expected to use the right kind of dash in every situation, even
though the output from using the wrong kind looks exactly the same. They
will surely not be bothered by the fact that a format designed to document
command-line options contains a trap whereby the failure to add backslashes
silently introduces problems for users who are distant in time and space.

Shockingly, this turns out not to be the case, and Linux man pages are
overflowing with unescaped dashes. Years ago, the Debian project tried to
address this problem by adding a check to its Lintian tool that would issue a
warning when unescaped dashes were used. That check was dropped in
2015, though, after Niels Thykier concluded that it was simply being
ignored: “The tag has existed since 2004 (commit fb2e7de). To date
there are still 2000 packages with the issue.” Since then, there has
been no warning shown to Debian developers when man pages contain unescaped
dashes.
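
A check of this kind can be approximated in a few lines. The sketch below is not Lintian's implementation; `find_unescaped_option_dashes` and its regular expressions are hypothetical, and it flags only the obvious case of an option-like word whose dash is not written as `\-`. It ignores most of the constructs a real man page can contain.

```python
import re

# Simple font changes such as \fB or \fR; stripped so "\fB-v" is seen as "-v".
FONT_ESCAPE = re.compile(r"\\f[BIRP]")
# One or two dashes, not preceded by a backslash or a word character,
# followed by a letter or digit: looks like a command-line option.
UNESCAPED_OPTION = re.compile(r"(?<![\\\w])-{1,2}[A-Za-z0-9]")

def find_unescaped_option_dashes(source):
    """Return (line number, line) pairs containing an option-like word
    whose dash is not escaped as \\- in the man-page source."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if line.startswith('.\\"'):  # skip roff comment lines
            continue
        if UNESCAPED_OPTION.search(FONT_ESCAPE.sub("", line)):
            hits.append((number, line))
    return hits

# Made-up input for illustration only.
sample = "\n".join([
    '.\\" example page, not from a real package',
    ".B \\-\\-frobnicate",
    ".B --frobnicate",
    "A well-known option is \\fB-v\\fR.",
])

for number, line in find_unescaped_option_dashes(sample):
    print(f"line {number}: {line}")
```

On this sample the properly escaped option on line 2 and the ordinary hyphenated word "well-known" pass, while the unescaped options on lines 3 and 4 are reported.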

Given the prevalence of this problem, it would arguably make sense to apply
a fix at the processing level. And, indeed, groff has, for many years,
duly remapped the Hyphen-Minus character (and a few others) in the man-page
macros, making dash characters simply work as many would expect. That
helpful behavior ended with the groff
1.23.0 release in July:

The an (man) and doc (mdoc) macro packages no longer remap the -,
', and ` input characters to Basic Latin code points on UTF-8
devices, but treat them as groff normally does (and AT&T troff
before it did) for typesetting devices, where they become the
hyphen, apostrophe or right single quotation mark, and left single
quotation mark, respectively. This change is expected to expose
glyph usage errors in man pages. See the “PROBLEMS”
file for a recipe that will conceal these errors. A better
long-term approach is for man pages to adopt correct input
practices
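
The effect of the old remapping can be pictured as a translation applied to the formatter's output. The Python sketch below is only a model of that idea, not how groff implements it; the `REMAP` table and `remap_to_basic_latin` are hypothetical names, and the three mappings are the ones named in the release note.

```python
# Typeset glyph -> Basic Latin character, per the three cases in the release note.
REMAP = str.maketrans({
    "\u2010": "-",   # HYPHEN -> HYPHEN-MINUS
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK -> APOSTROPHE
    "\u2018": "`",   # LEFT SINGLE QUOTATION MARK -> GRAVE ACCENT
})

def remap_to_basic_latin(rendered):
    """Replace the typographic glyphs with their ASCII look-alikes."""
    return rendered.translate(REMAP)

# Made-up rendered text using the typographic glyphs.
rendered = "Run \u2018foo \u2010\u2010frobnicate\u2019 to start."
print(remap_to_basic_latin(rendered))
```

The translation is one-way: once it has run, a hyphen meant as punctuation cannot be told apart from one meant as an option dash.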

Problems were indeed exposed, and users began to complain; bugs were filed
and the topic showed
up on the debian-devel mailing list as well. G. Branden Robinson, the
upstream maintainer of groff and author of this change, defended
the new behavior:

Mapping all hyphens and minus signs to a single character, as
people whose blood pressure spikes over this issue tend to promote
as a first resort, is an ineluctably information-discarding
operation. In my opinion, man page source documents are not the
correct place to discard that information.

Among other things, the information being discarded by this change includes
whether line-breaking is allowed; Hyphen-Minus does not allow it, while
Hyphen does.

Others disagreed with Robinson’s position; Russ Allbery said:

My opinion is that the world of documents that are handled by man
do not encode meaningful distinctions between - and \-, and man
should therefore unify those characters.

Colin Watson, who maintains Debian’s groff package, admitted
that he had overlooked this problem when he updated Debian to the 1.23.0
release:

I was aware of the change, but it somehow fell off my list of
things to make a positive decision about when packaging 1.23.0.
I’m rather inclined to revert this by adding the rest of the recipe
above to debian/mandoc.local (while I agree with the idealized
typographical point being made, I have approximately negative
appetite for the Sisyphean task of fixing an entire distribution’s
manual pages in practice).

A few weeks later, he said
that his plan was to leave the change in place during the current
Debian 13 (“Trixie”) development cycle, but then to revert it prior to
the pre-release freeze to avoid inflicting problems on Debian’s users.
That would, in theory, give developers time to fix as many of the problems
as possible. After the discussion went on for a while, though, he changed his
mind, stating that he was unwilling to have his inbox filled with this
discussion for the next year. So the remapping of “-” has been
reinstated into Debian’s version of groff.

This little episode may well be repeated in other distributions as they
catch up with the groff 1.23.0 release. It also is probably not finished
within Debian. This situation brings together the problems of
documentation writing, typographic correctness, and Unicode look-alike code
points, all of which are fertile ground for disagreement. The hopes that
removing the remapping in groff would lead to the fixing of all those man
pages may have been dashed, but that does not bar another attempt in the
future.

Copyright for syndicated content belongs to the linked Source : Hacker News – https://lwn.net/Articles/947941/
