~~~barnold's tilde.club page~~~

Counting characters

2024-03-19

Recently I tried to learn myself a little UTF-8. My guide was Markus Kuhn's FAQ. Its discussion of "combining characters" made sense to me. These are "code points", in Unicode speak, that identify a sort of decoration applied to the preceding character. The FAQ compared two examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This is a "precomposed character", i.e. you get the capital A and its little dotted hat together as a single unit.

The second example is the same character conceptually, but represented by two code points: "LATIN CAPITAL LETTER A" followed by "COMBINING DIAERESIS". The first one gives you a plain capital A and the second one means "go back to the last character and put two little dots on top, kthxbai". This combining form is apparently to be preferred because of its greater flexibility. You don't need to define every possible combination of plain letter plus decorator (or "diacritical mark" as the jargon has it).

Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.

Code point name, value                          Bytes (hex)   Rendering
LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4   xc3, x84      Ä
LATIN CAPITAL LETTER A, U+0041                  x41           Ä (this row and the next together)
COMBINING DIAERESIS, U+0308                     xcc, x88

Under that Rendering column, you should see the same characters as in the image below (an image, in case your browser renders the characters differently),

[image: terminal session showing two printf commands and the characters they print]

except I added single quotes around each character, just to show there was no peculiar white space appearing. I typed those printf commands in a urxvt terminal emulator running on my laptop, while connected to a bash shell in tilde.club. The "\xNN" in printf is a handy sequence to output a byte with the hex value of NN.
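
Here is a sketch of commands that print the quoted characters. The byte values are the ones from the table; the variable names are only for illustration, and the exact commands in the image may differ. One caveat: "\xNN" is understood by bash's builtin printf and by GNU printf, but POSIX only promises octal escapes, so this assumes bash.

```shell
#!/bin/bash
# The two spellings of "A with diaeresis", as printf escape sequences.
precomposed="\xc3\x84"        # U+00C4
combining="\x41\xcc\x88"      # U+0041 followed by U+0308

# Single quotes around each one show that no stray white space sneaks in.
printf "'$precomposed'\n"
printf "'$combining'\n"
```

Both lines should look identical on a terminal that handles combining characters.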

What I'm trying to get at with the table and the image is that you should end up with the selfsame visible character, or glyph in Unicode speak, whichever of the two methods you use. In theory you shouldn't be able to tell apart the "precomposed" (one code point) character from the "composed" (two code point) character, short of running od(1) or the like.
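
For the record, this is roughly what "running od(1)" looks like. It's a small sketch: hex_bytes is a made-up helper name, and the bytes are written as octal escapes (the same values as the hex ones in the table) so that it also works in shells whose printf lacks "\xNN".

```shell
# hex_bytes: show stdin as hex bytes on a single line.
# od -An drops the offset column, -tx1 prints one byte per field;
# xargs (running its default echo) squeezes od's padding to single spaces.
hex_bytes() {
    od -An -tx1 | xargs
}

printf '\303\204' | hex_bytes        # precomposed: c3 84
printf '\101\314\210' | hex_bytes    # combining:   41 cc 88
```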

Theory and practice are a little different.

barnold@tilde$ printf "\xc3\x84" | wc --chars
1
barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
2

Though the two forms of "A with a diaeresis" are in principle one and the same character, wc(1) thinks that the combining form has two characters, not just one. According to Markus Kuhn's FAQ, "A combining character is not a full character by itself" so we have a contradiction here. (You might wonder, if it isn't a character why did they call it a "combining character"? I have no answer to that.)

The maintainers of GNU coreutils don't regard wc's count of 2 as a bug (I asked on the mailing list), so it's unlikely to change. After decades of effort in computer science, the question "what's the character count of this string?" doesn't necessarily have a clear answer.
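
To make the ambiguity concrete, here is a sketch of two of the possible answers. It assumes the input is valid UTF-8, and the function names are invented for the example. As far as I can tell, wc --chars is counting code points, which is what the second function imitates without depending on the locale.

```shell
# count_bytes: the answer wc -c gives.
count_bytes() {
    wc -c | tr -d ' '
}

# count_code_points: in UTF-8, continuation bytes are 0x80-0xBF
# (octal 200-277) and every other byte starts a code point.
# Delete the continuation bytes and count what is left.
count_code_points() {
    LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# Octal escapes for the precomposed and the combining form.
for s in '\303\204' '\101\314\210'; do
    printf 'bytes=%s code_points=%s\n' \
        "$(printf "$s" | count_bytes)" \
        "$(printf "$s" | count_code_points)"
done
# prints:
# bytes=2 code_points=1
# bytes=3 code_points=2
```

Neither count gives 1 for the combining form. Getting that answer means counting what Unicode calls grapheme clusters, which needs more machinery than a shell one-liner.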

Best commit message of the year (so far)

2024-01-06

I am proud to have inspired this fine commit message.

slrn post signature

2023-05-17

My web searches showed me a way to generate a dynamic signature in my posts and followups. I put these in my ~/.slrnrc:

set signature ".slrn/signature.usenet"
set post_editor_command "echo barnold > ~/.slrn/signature.usenet; \
   fortune -n 140 -s >> ~/.slrn/signature.usenet; emacsclient -t +9:0"
(The second set was actually one line, I split it here for easier reading.)

It didn't seem to work on the first attempt, but then I did see fortunes in my signatures. However, I noticed an odd thing: while editing a post, the fortune in my signature.usenet was different from the one in my post. The cause was obvious once I realised: slrn puts the signature into the draft post before calling the post_editor_command, so each post carries the fortune generated during the previous edit. It also explained why it didn't work the first time: the signature file didn't exist yet.

That doesn't matter for a fortune but won't work if you want something more current, say a weather report. The only way to do that seemed to be to disregard the signature file and edit the draft directly to add the "something". This works for me:

set signature ""
set post_editor_command "~/.slrn/post-edit-cmd.sh"

And using a script provides a lot more breathing room than trying to squeeze everything into one line of configuration. This is my post-edit-cmd.sh:

#!/bin/bash
#
# Generate a signature before invoking the editor.
# slrn provides one argument, the name of the file to be edited.
draft="$1"

# To avoid printf interpreting "--" as introducing an option,
# we instead make it an 'argument' to printf's format string.
# The space after %s gives the usual "-- " signature delimiter.
#
printf "\n%s \nbarnold\n" "--" >> "$draft"

# Try to avoid fortunes that take up too many lines.
fortune -n 140 -s >> "$draft"

exec emacsclient -t +9:0 "$draft"
#
# Should be unreachable.
That's working nicely for me.
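
For the "something more current" case, the appending step can be pulled out into a function that takes the generating command as its arguments. This is only a sketch: append_signature is a made-up name, and date stands in for whatever produces the fresh text (a weather report, say). It runs against a scratch file, so it can be tried without slrn or an editor.

```shell
#!/bin/bash
# append_signature DRAFT CMD [ARGS...]
# Append the "-- " delimiter, the name, and the output of CMD to DRAFT.
append_signature() {
    local draft=$1
    shift
    {
        printf '\n%s \nbarnold\n' '--'
        "$@"
    } >> "$draft"
}

# Try it on a scratch file instead of a real draft.
scratch=$(mktemp)
printf 'Subject: test\n\nBody text.\n' > "$scratch"
append_signature "$scratch" date    # 'date' is only a stand-in
cat "$scratch"
rm -f "$scratch"
```

In a real post-edit-cmd.sh the last step would still be the exec of the editor on the draft.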

Project Gutenberg

2022-10-18

I'm not affiliated with that good project but I have put up a site to do with it. The site shows PG's catalog (or most of it) with forms for searching by title or author. The source code for it is on tildegit.

Since anything on the public internet comes under siege from scripts and bots, I don't know how well it'll survive. Give it a try if you feel like a free e-book.

Signed up with tildegit

2022-05-03

The captchas on the signup form were too difficult for me. I began to suspect that I might really be a bot that thinks it's human. But within minutes of my asking about those captchas on IRC, ben helped me by removing the captcha from the form!

So now I have my little space on tildegit. My thanks to ben and the tildeverse in general.

Make safer

2022-02-11

Here is a toy Makefile:

reboot-universe:
	@echo "Abolishing spacetime..."
	@echo "Rebooting..."

Not very safe. If you make that target by mistake, it says

$ make reboot-universe
Abolishing spacetime...
Rebooting...

and where are you now? There is one way to make the Makefile a little safer, like this:

dangerous:
ifndef DANGEROUS
	$(error Refusing to continue without DANGEROUS set.)
endif

reboot-universe: dangerous
	@echo "Abolishing spacetime..."
	@echo "Rebooting..."

Now you have to try harder to destroy the Universe.

$ make reboot-universe
Makefile:4: *** Refusing to continue without DANGEROUS set..  Stop.
$
$ make DANGEROUS=y reboot-universe
Abolishing spacetime...
Rebooting...

I've found it useful in stopping me running destructive targets by mistake, e.g. to drop a database or wipe out its data. You can make "dangerous" a dependency of as many make targets as you like. If there's an easier or better way, let me know!
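
The same idea carries over to shell scripts that do destructive things. A minimal sketch, with drop_database as a hypothetical stand-in for the dangerous step: the standard ${VAR:?message} expansion makes a non-interactive shell print the message and exit when VAR is unset or empty.

```shell
# drop_database: hypothetical destructive step, guarded like the make target.
drop_database() {
    # ':' does nothing by itself; the expansion of its argument is the check.
    : "${DANGEROUS:?Refusing to continue without DANGEROUS set.}"
    echo "Dropping database..."
}

unset DANGEROUS
# Run in subshells so the failed check doesn't end this script.
( drop_database ) 2>/dev/null || echo "refused"
( DANGEROUS=y; drop_database )
# prints:
# refused
# Dropping database...
```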

Advanced temporal sensing

2021-11-18

A more ambitious cgi page for your viewing pleasure. It uses the latest technology to see the future!

Fun with CGI

2021-06-11

Coding like it's 1995, I added a toy cgi page: your ip address.
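
For anyone who missed 1995, a page like that can be as small as the sketch below. It is not the actual script behind the link, just an illustration: it assumes the web server runs it as a CGI program and sets REMOTE_ADDR, which is part of the standard CGI environment. The function name ip_page is made up.

```shell
#!/bin/sh
# Minimal CGI sketch: a header block, a blank line, then the body.
ip_page() {
    printf 'Content-Type: text/plain\n\n'
    # Fall back to "unknown" when run outside a web server.
    printf 'Your IP address is %s\n' "${REMOTE_ADDR:-unknown}"
}

ip_page
```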

Hosting a git repo on a less than fully trusted host

2020-08-25

Have just discovered git-remote-gcrypt or "gcrypt" as I'll call it here. So far it's working well for me at solving this problem: you have something you want under source code control, you want to push it to a remote frequently for safety* but it contains secrets that shouldn't ever leave the host it's on.

If you have a PGP key pair, gcrypt resolves these conflicting objectives by encrypting the repository before pushing. The remote host only sees ciphertext, no use to an attacker unless maybe it's the NSA. If your working copy is lost you can get it back provided you still have your SSH and PGP keys.
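
A sketch of the setup, wrapped in a function so the pieces are named. The function name, remote name, URL and key ID are all hypothetical; the "gcrypt::" URL prefix and the gcrypt-participants setting are the parts that come from git-remote-gcrypt itself. Check its README for the transports it supports.

```shell
# add_gcrypt_remote REPO_DIR NAME URL KEY_ID
add_gcrypt_remote() {
    # The gcrypt:: prefix makes git hand this remote to git-remote-gcrypt.
    git -C "$1" remote add "$2" "gcrypt::$3"
    # Encrypt for this PGP key (a space-separated list of keys also works).
    git -C "$1" config "remote.$2.gcrypt-participants" "$4"
}

# Example with made-up values; the push is where the encryption happens:
#   add_gcrypt_remote ~/secrets backup rsync://user@example.org/secrets 0123456789ABCDEF
#   git -C ~/secrets push backup master
```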

* There is a wise saying which from memory goes "if you've only saved it to one hard disk you haven't saved it." One of git's major benefits is that saving to another hard disk is only a 'git push' away.


Thanks to the tilde contributors for providing tilde.club.
