UnixReview.com
June 2007
In part 2 of his healing process, Royce Williams continues his unrelenting assault on files containing "funny characters".
Littera Delenda Est: on the Removal of Files With Unusual Characters in Their Filenames, Part Two
by Royce Williams
In last month's column, we learned that unusual characters in filenames fall into some discrete categories, and started to tackle the hypothetical example of a directory full of them. The techniques already covered include escaping, workarounds for special shell characters, and dramatic overuse of battle metaphors. We continue in similar fashion, attacking the remaining character types in rough order of increasing difficulty.
After our first round of skirmishes, let's see what work remains to delete these unusual files.
We also take this opportunity to remove those files that have already served their purpose as obstacles (so that we can better see what it is that we're having trouble seeing). Drawing on what we've learned so far, here is how to perform that removal, followed by a list of the survivors:
admin@unixlike$ rm -- -keeper \!keeper 0 A Z a z \~keep-me-too
admin@unixlike$ /bin/ls -lA
-rw-r----- 1 admin admin 22 May 5 2006 ?
-rw-r----- 1 admin admin 27 May 5 2006 ?
-rw-r----- 1 admin admin 20 May 5 2006 ?
-rw-r----- 1 admin admin 26 May 5 2006 ?
-rw-r----- 1 admin admin 32 May 5 2006 ?
-rw-r----- 1 admin admin 23 May 5 2006
-rw-r----- 1 admin admin 29 May 5 2006 ?
-rw-r----- 1 admin admin 29 May 5 2006
-rw-r----- 1 admin admin 21 May 5 2006 ?
Up to this point, we've been going after what we might call the "low-hanging fruit": those characters that have been easy to identify. The survivors have been better at protecting themselves — they're not just waiting out in the open for us to pick them off, so to speak. We'll have to drop back and perform some additional reconnaissance to determine our next strategy.
Our remaining tasks are also complicated by the fact that differences in the various platforms are going to become more significant. It's almost as if the lesser-used features in a given operating system are more likely to be diverse. As we explore those approaches to revealing unprintable characters that are available on most Unix-likes, we'll try to take these differences into account. We'll also defer doing any more deleting until we've learned a few new identification methods.
cat and Mouse
The -v option to cat, available on many systems, can reveal more about which characters are in play. It replaces many characters that are usually non-printing with either C escape codes or other human-readable representations of their control functions, depending on platform.
To make some of our impending under-the-hood work easier, let's switch to single-column mode for ls with the -1 option (that's the number one). Piping the output of ls through cat -v yields the following:
admin@unixlike$ /bin/ls -1A | cat -v
^G
^H
	
^L
^M
 
 ^M 
   
M-%
admin@unixlike$
From the output above, you can see that our stragglers aren't all the same after all, that some of them are still not visible, and that there's a "^M" with some whitespace in front of it.
To gather more intel (pun intended), you can also use cat -e to mark the end of each line with a dollar sign. On most platforms, using -e implies -v. Most of the platforms tested produce similar output:
admin@unixlike$ /bin/ls -1A | cat -e
^G$
^H$
	$
^L$
^M$
 $
 ^M $
   $
M-%$
With the trailing dollar signs, we can now see the boundaries of any whitespace. Since the third entry in our reduced listing was represented by ls as a single character, but its width now appears to be quite a few columns wider than that, we might guess at this point that we're dealing with a tab. There's also some additional whitespace trailing our floating ^M, and we're not sure yet what M-% means.
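The lineup above can be reproduced safely in a throwaway directory. The following is only a sketch: it assumes a printf that understands octal escapes (as POSIX specifies), a cat that supports -e, and the widely available but non-POSIX mktemp -d; the variable names scratch and listing are made up for the illustration.

```shell
# Sketch: build a scratch directory of awkward names, then reveal them.
# Assumes a POSIX printf (octal escapes) and a cat that supports -e.
scratch=$(mktemp -d) || exit 1

touch "$scratch/$(printf '\007')"      # BEL (^G)
touch "$scratch/$(printf '\015')"      # CR  (^M)
touch "$scratch/$(printf ' \015 ')"    # space, CR, space

# LC_ALL=C keeps the sort order byte-wise. Because ls is writing to a
# pipe, it passes the raw bytes through for cat to translate.
listing=$(cd "$scratch" && LC_ALL=C ls -1A | cat -e)
printf '%s\n' "$listing"

rm -rf "$scratch"
```

Each name should come back with its control characters spelled out and a dollar sign marking where the name really ends, which is what makes the padded ^M distinguishable from the bare one.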
That's a much better mugshot lineup than we've had before, but before we start deleting again, let's explore what other discovery methods are available in other circumstances (and on other platforms).
cat's away
If your cat doesn't support -v, or if you find that its output isn't giving you enough information, then your ls may be able to help. Most ls implementations support some kind of -b or -B (binary output) option, which renders unprintable characters in some visible way. Some variants of this option output C escape codes, while others display the octal value of the underlying ASCII character. Some flavors support both. Even the output of the same rendering type varies across platforms.
Solaris and HP-UX only support octal dumps with -b:
admin@flare$ /bin/ls -1A -b
\007
\010
\011
\014
\015
 
 \015 
   
\245
admin@flare$
NetBSD's -b substitutes escape codes:
admin@ofcourse$ /bin/ls -1A -b
\a
\b
	
\f
\r
 
 \r 
   
\245
admin@ofcourse$
NetBSD's -B output gives the same octal output as Solaris and HP-UX -b:
admin@ofcourse$ /bin/ls -1A -B
\007
\010
	
\014
\015
 
 \015 
   
\245
admin@ofcourse$
FreeBSD can display one more of our test files than NetBSD can:
admin@beastie$ /bin/ls -1A -b
\a
\b
\t
\f
\r
 
 \r 
   
\245
admin@beastie$ /bin/ls -1A -B
\007
\010
\011
\014
\015
 
 \015 
   
\245
admin@beastie$
Mac OS X appears to be missing an entire file (whatever \245 is), not merely failing to display it:
admin@sonofnext$ /bin/ls -1A -B
\007
\010
\011
\014
\015
 
 \015 
   
admin@sonofnext$
There's a significant difference in the last output example. It turns out that the Mac filesystem (HFS+) doesn't appear to allow single-byte characters higher than ASCII 127, which is why our \245 is conspicuously absent. The HFS+ documentation that I could find said that all Unicode characters were supported in filenames, but my testing script showed that every single-byte name with the high bit set was refused. (One plausible explanation: a lone byte above 127 is not a valid UTF-8 sequence, so it cannot be mapped to a Unicode character.) Here is the script's output:
[ Output for 1 through 125 omitted ...]
Trying to create ASCII 126 (hex 7e, tilde)
Trying to create ASCII 127 (hex 7f, delete)
Trying to create ASCII 128 (hex 80)
Could not create ASCII 128 (hex 80): Invalid argument
Trying to create ASCII 129 (hex 81)
Could not create ASCII 129 (hex 81): Invalid argument
Trying to create ASCII 130 (hex 82)
Could not create ASCII 130 (hex 82): Invalid argument
[ ... output for 131 through 255 omitted ]
On some Linux flavors, -b has the most useful information so far. Unlike all preceding examples, it produces visible indicators for every line of output. Our previously unseen characters look like whitespace of some kind, and there are three whitespace "somethings" in the next-to-the-last filename:
admin@emperor$ /bin/ls -1A -b
\a
\b
\t
\f
\r
\ 
\ \r\ 
\ \ \ 
\245
admin@emperor$
With cat -e/v and ls -b/-B, we now know a lot more than we did before. Some pockets of resistance remain, but the arsenal isn't empty yet.
A powerful weapon at our disposal is the venerable od (which stands for "octal dump," though it dumps other formats as well). od is available on many Unix-likes. Piping the output of our ls command to od, and using the -c option (character output), you can see that the output of the ls -1A command here contains some familiar escapes:
admin@unixlike$ /bin/ls -1A | od -c
0000000  \a  \n  \b  \n  \t  \n  \f  \n  \r  \n      \n      \r      \n
0000020              \n 245  \n
0000026
If you expect a lot of characters that have no C escapes, using the -x option (hex dump) will render the output as columns of hex, with starting offsets listed in the first column:
admin@unixlike$ /bin/ls -1A | od -x
0000000 0a07 0a08 0a09 0a0c 0a0d 0a20 0d20 0a20
0000020 2020 0a20 0aa5
0000026
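Reading and using this output takes a little bit of parsing, since it is showing you the underlying codes instead of using them to alter the appearance of your terminal (newlines, whitespace, etc.). Note also that od -x prints 16-bit words, so on a little-endian machine each pair of bytes appears swapped: the word 0a07 is the byte 07 followed by the byte 0a. Here is the od -x output, put back into byte order and arranged one filename per row (each ending in the 0a newline that ls adds) so that you can compare it to the Linux ls -b output:
bytes (hex)     ls -b
07 0a           \a
08 0a           \b
09 0a           \t
0c 0a           \f
0d 0a           \r
20 0a           \ 
20 0d 20 0a     \ \r\ 
20 20 20 0a     \ \ \ 
a5 0a           \245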
Note that on some systems, od appears to have been deprecated in favor of hexdump, which usually accepts all of the flags demonstrated.
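If the two-byte grouping of od -x gets in the way, od can also be asked for one byte at a time, which keeps the dump in the same order as the bytes in the filename. The sketch below uses the POSIX -A n (no offsets) and -t x1 (one-byte hex) options; the helper name hexbytes is made up for this illustration, and the input is a stand-in for real ls output.

```shell
# Sketch: dump bytes one at a time so the output reads in file order.
# -An drops the offset column; -tx1 prints each byte as two hex digits.
hexbytes() {
    # Squeeze od's padding and line breaks down to single spaces,
    # then trim the leading and trailing space.
    od -An -tx1 | tr -s '\n ' '  ' | sed 's/^ //; s/ $//'
}

# Stand-in for ls output: a BEL name, then a space-CR-space name.
printf '\a\n \r \n' | hexbytes
```

Against a real directory the same helper would be used as /bin/ls -1A | hexbytes, with each 0a marking the end of one filename.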
We've produced some numbers and escape codes for our characters, but what do they mean, and how can we use them?
With your favorite search engine, we can figure out how the listings and dumps above correspond to the values and characters underneath. Combining the results from searching for things like "character codes", "ASCII table" and "C escape sequences", and using those references to look up the characters in the filenames that remain, yields the information represented in the following table:
ASCII name | Decimal | C escape | mnemonic | octal | hex | Control key |
BEL | 7 | \a | alarm | \007 | 0x07 | ^G |
BS | 8 | \b | backspace | \010 | 0x08 | ^H |
HT | 9 | \t | tab | \011 | 0x09 | ^I |
LF | 10 | \n | newline | \012 | 0x0a | ^J |
FF | 12 | \f | formfeed | \014 | 0x0c | ^L |
CR | 13 | \r | carriage return | \015 | 0x0d | ^M |
SP | 32 | n/a | space | \040 | 0x20 | spacebar |
n/a | 165 | n/a | yen | \245 | 0xa5 | n/a |
That last one isn't strictly part of the original 7-bit ASCII character set. Depending on the character set and your selected locale, this character could be represented any number of ways. In ISO 8859-1, it's the yen symbol.
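The columns of the table are all the same number written in different bases, and the Control-key column follows from simple arithmetic: a control character's code is the code of its letter minus 64. As a rough sketch (the function name ctrl_code is invented here, and a POSIX printf is assumed), the shell can do these conversions for you:

```shell
# Decimal to octal and hex: the yen row of the table.
printf '%d = octal %o = hex %x\n' 165 165 165

# Octal or hex back to decimal (a leading 0 or 0x marks the base).
printf '%d %d\n' 0245 0xa5

# Control key to character code: ^M is the code of "M" minus 64.
# A leading single quote makes printf report the character's code.
ctrl_code() {
    echo $(( $(printf '%d' "'$1") - 64 ))
}

ctrl_code M    # carriage return
ctrl_code G    # bell
```

Running the last two lines should give 13 and 7, matching the CR and BEL rows above.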
Looking back at our directory dump, what at first appeared to be a floating question mark is now clearly a carriage return (^M, 0x0d) with a space (0x20) on either side:
20 0d 20 0a
The file following it is three spaces in a row:
20 20 20 0a
Using commonly available tables and references, the other characters can also be easily looked up. Our bad characters are in serious trouble.
Now that we know exactly what these characters are, how can we delete them?
Knowing their corresponding Control keys will help. It turns out that most of the characters represented by ^ (Control) followed by a key on the keyboard can actually be typed just as they appear ... if you know the secret handshake.
Let's select one of our single-character-long filenames: ^M. If we try to remove it by hitting Control-M, the system will respond as if we had pressed the Enter key:
### Key sequence here is r,m,space,press-and-hold Control,M,release Control
admin@unixlike$ rm
rm: not enough arguments
usage: rm [-blah]
admin@unixlike$
How can we get our shell to interpret these keystrokes as characters, instead of carrying out their usual functions? It turns out that we can do so using the relatively unknown Control-V shell feature, which tells many shells to interpret the next input character as a literal character (instead of as a control character):
### The key sequence here is l,s,[space],^V,^M
admin@unixlike$ /bin/ls ^M
?
admin@unixlike$
That question mark is our listing of the file named "^M". To verify that the whitespace really corresponds to what we think that we're typing, we can throw in an od -x or one of the other tools that we've covered. Here, we first test our keystroke by echoing it, and then verify that it matches our filename:
### Each control sequence here is immediately preceded by typing Control-V
admin@unixlike$ echo ^M | od -x
0000000 0d0a
0000002
admin@unixlike$ /bin/ls ^M | od -x
0000000 0d0a
0000002
Now that we know how to reveal, type and verify some of our characters, we can make short work of them:
### Each control sequence here is immediately preceded by typing Control-V
admin@unixlike$ rm ^M
admin@unixlike$ rm ^G # That beeping should go away now.
admin@unixlike$ rm ^L # Form feed clears the screen; fixed.
admin@unixlike$ rm ^H
For our files that are just one or more spaces, we must escape them:
### The key sequence here is l,s,[space],backslash,space
admin@unixlike$ /bin/ls \ 
 
admin@unixlike$ rm \ 
### Three escaped spaces here.
admin@unixlike$ rm \ \ \ 
For our ^M surrounded by spaces, we use Control-V, and then simply escape the spaces:
### The key sequence here is l,s,[space],backslash,space,^V,^M,backslash,space
admin@unixlike$ /bin/ls \ ^M\ 
?
admin@unixlike$ rm \ ^M\ 
admin@unixlike$
Our file that is just a tab (0x09, ^I) is a little different. Because some shells interpret a tab as a Control key of sorts (for command and filename completion), we must both escape the tab and use the Control-V trick to type it. You can type the actual character by either using the Control key for tab (^I) or simply by pressing the Tab key itself:
### The key sequence here is l,s,[space],backslash,^V,^I
admin@unixlike$ /bin/ls \	
?
### The key sequence here is r,m,[space],backslash,^V,[tab]
admin@unixlike$ rm \	
admin@unixlike$
The same holds true for the backspace - it must be escaped, and either typing Control-H or the backspace key itself will do.
For the most part, we've taken care of all of the characters that we can type on the command line, directly or indirectly. Only one file remains:
admin@unixlike$ /bin/ls -1A | cat -ev
M-%$
But for any characters that have no known keystroke, or if our shell does not support the handy Control-V feature, are there other options?
It would be nice if we knew of some way to enter a character with the keyboard using its decimal, octal or hex value. Surprisingly, some systems support this by using the Alt key and the numeric keypad.
If you have access to a Microsoft machine, give it a try: Open a Command Prompt window and type the following, remembering to use the numeric keys, and making sure that you prepend zeroes to make the field four digits long:
### Key sequence here is Alt-down,numeric 0,1,6,5,Alt-up:
C:\>¥
Padding with zeroes appears to invoke ISO-8859-1 on some systems, while not padding invokes another (sometimes the so-called ASCII-II that contains a number of line-drawing characters).
In my testing, the Knoppix system actually supported 8-bit characters, including a proper rendering of the yen symbol, as long as I was connected via SSH or on the physical, non-X console. Most of the other Unix-likes properly accepted these keystrokes as long as they were 7-bit ASCII (under 128), and attempts to enter values above 127 were turned into their 7-bit equivalents by stripping their high bit (subtracting 128, and turning 129 into 1, 130 into 2, etc.). This sheds some light on why cat -v renders this character as M-% ("Meta-%").
Here's a small demonstration in which I type "HI" followed by Enter (decimal 72, 73, and 13) using nothing but the numeric keypad. I then attempt to type the yen (165), but produce a percent sign (decimal 37, or 165 minus 128) instead:
### The key sequence here is:
### Alt-down,7,2,Alt-up,Alt-down,7,3,Alt-up,Alt-down,1,3,Alt-up
admin@unixlike$ HI
-bash: HI: command not found
### The sequence here is:
### Alt-down,0,1,6,5,Alt-up
admin@unixlike$ %
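The arithmetic behind that percent sign is easy to check. This sketch clears the high bit of a character code and prints whatever 7-bit character is left; strip_high_bit is a made-up helper name, and it assumes a printf that supports the POSIX %b conversion with \0ddd octal escapes.

```shell
# Sketch: clear the high bit (bit 7) of a character code and print the
# 7-bit character that remains.
strip_high_bit() {
    # $1 & 127 keeps the low seven bits; %b expands the \0ddd escape.
    printf '%b\n' "\\0$(printf '%o' $(( $1 & 127 )))"
}

strip_high_bit 165    # the yen's code, 165 - 128 = 37
strip_high_bit 200    # 200 - 128 = 72
```

The first call should print a percent sign and the second a capital H, the same letter typed via the keypad as 72 in the demonstration above.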
If we cannot use any of these arcane keyboard tricks to get at the character we want, we'll have to try other angles. Unfortunately, figuring out some way to generate the right characters on the command line and only delete the desired files isn't as easy as it might be. printf(1), for example, is best known for making things human-readable. A POSIX-conforming printf does accept octal escapes such as \245 in its format string, but not every implementation can be counted on to do so, so test yours before relying on it.
There are also a number of utilities available on many systems that allow you to perform character substitution. If your version of tr supports replacing characters with their octal equivalents, then you're in luck. We can accomplish our mission in a roundabout way with tr by transforming an arbitrary single character (here, a 'T') into the one that we need (in this case, the yen):
admin@unixlike$ /bin/ls -1A `echo T | tr 'T' '\245'`
?
admin@unixlike$ rm `echo T | tr 'T' '\245'`
admin@unixlike$ /bin/ls -1A `echo T | tr 'T' '\245'`
: No such file or directory
A Perl one-liner also allows for a similar trick:
admin@unixlike$ /bin/ls -lA `perl -e 'print "\245";'`
-rw-r----- 1 admin admin 0 May 5 2007 ?
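To try the tr substitution without risking real files, the whole round trip can be rehearsed in a scratch directory. This is a sketch rather than a recipe: it assumes mktemp -d is available and that the filesystem accepts arbitrary bytes in names (the HFS+ system described earlier would refuse the touch), and the variable names are invented for the example.

```shell
# Sketch: create, count, and delete a file whose name is the single
# byte 0xa5 (octal 245), generated with the tr substitution.
scratch=$(mktemp -d) || exit 1

# LC_ALL=C asks tr to treat its input and output as plain bytes.
yen=$(echo T | LC_ALL=C tr 'T' '\245')

touch "$scratch/$yen"
before=$(( $(ls -1A "$scratch" | wc -l) ))

# Quoting the variable keeps the shell from splitting or globbing it.
rm -- "$scratch/$yen"
after=$(( $(ls -1A "$scratch" | wc -l) ))

echo "files before: $before, after: $after"
rmdir "$scratch"
```

Capturing the generated name in a quoted variable, instead of repeating the backtick expression, also means the exact same bytes are used for both the listing and the removal.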
If you're desperate (for example, if you're in a barren wasteland in which neither tr nor Perl is available), then there are a couple of techniques of last resort that can be used even if you cannot determine what your strange characters are.
If you can't refer to the files by name, you can get a handhold using the file's inode. Many Unix-like ls variants support the -i option, revealing the inodes of any files listed:
admin@unixlike$ ls -li
17878240 -rw-r--r-- 1 admin admin 0 May 5 2007 ?
... which can then be passed to find, as long as your find also has an -inum option:
admin@unixlike$ find . -inum 17878240 -exec ls -la '{}' \;
-rw-r----- 1 admin admin 0 May 5 2007 ?
admin@unixlike$ find . -inum 17878240 -exec rm '{}' \;
admin@unixlike$
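The inode route can be rehearsed the same way. The sketch below makes a file named with a lone carriage return, reads its inode number from ls -i, and hands that number to find. It assumes ls -i and find -inum are both present (neither is guaranteed everywhere), that printf handles octal escapes, and that the scratch directory holds only the one file.

```shell
# Sketch: delete a file by inode number instead of by name.
scratch=$(mktemp -d) || exit 1
touch "$scratch/$(printf '\015')"      # a name that is just CR

# With -i, the inode number is the first field of each line.
inode=$(ls -1Ai "$scratch" | awk '{ print $1; exit }')

# find supplies the awkward name itself, so we never have to type it.
find "$scratch" -inum "$inode" -exec rm -- '{}' \;

remaining=$(( $(ls -1A "$scratch" | wc -l) ))
echo "files remaining: $remaining"
rmdir "$scratch"
```

In a real directory it is worth running the find with -exec ls -la first, as in the transcript above, since inode numbers are only unique within a single filesystem.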
But the true approach of last resort that still allows you some precision (and a shred of dignity) involves dumping the contents of the directory itself to a temporary file, editing it as needed, and then deleting every file listed in the temporary file. This is about as elegant as a drunk rhinoceros, but it gets the job done. (In the following example, note that you must leave out the name of your tempfile itself, because it is created very early on, with zero length, as part of the command pipelining process):
admin@unixlike$ /bin/ls -1 |grep -v temp.list >temp.list
admin@unixlike$ cat -v temp.list
M-%
To verify that you have the right filenames captured, you can script a quick one-liner to list each file:
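### IFS= and -r stop read from trimming whitespace or eating backslashes;
### the quotes around $file keep the shell from splitting the name.
admin@unixlike$ while IFS= read -r file; do ls -lA -- "$file"; done < temp.list
-rw-r----- 1 admin admin 21 May 5 2007 ?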
.... and then modify that one-liner slightly to delete them:
admin@unixlike$ while IFS= read -r file; do rm -- "$file"; done < temp.list
admin@unixlike$ /bin/ls -1
temp.list
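For a dry run of the list-file approach, here is a sketch that does the whole thing inside a scratch directory. The three sample names are invented for the test, and mktemp -d is assumed to exist. The method still cannot cope with a filename that contains a newline, since the list is read one line at a time.

```shell
# Sketch: list a directory into a file, then delete everything the
# file names, leaving only the list itself behind.
scratch=$(mktemp -d) || exit 1

left=$(
    cd "$scratch" || exit 1
    touch ' ' 'two  spaces' 'back\slash'

    # The shell creates temp.list before ls runs, so filter it out.
    ls -1A | grep -v '^temp\.list$' > temp.list

    # IFS= and -r preserve leading spaces and backslashes in each name.
    while IFS= read -r file; do
        rm -- "$file"
    done < temp.list

    ls -1A
)

printf 'left behind: %s\n' "$left"
rm -rf "$scratch"
```

In real use, the step between building the list and running the loop is where you would edit temp.list down to only the names you intend to delete.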
At long last, our character assassination is complete. You should now have the power to delete many files that would have stumped you before.
You've probably also figured out that your new deletion powers could just as easily be applied to creating unusual filenames. To quote sudo, "With great power comes great responsibility." Don't be tempted to torture the new junior sys admin. Save it for your peers.
I hope that I've given you the ability to handle files with unusual filenames in a more "sysadminly" fashion. If you can figure out exactly what the unusual characters are, then you can delete them with precision. When you have a big directory full of bad filenames mixed in with ones that you need to keep, these methods can help you avoid having to resort to moving files out of the way in wildcarded groups or the slash-and-burn technique of deleting entire directories. Good luck!
I tried most of these techniques on the following platforms (listed here by the output of uname -srm).
While I also mentioned Cygwin in this column, it is so different from the other flavors that it will have to wait for another day.
Most sessions were conducted using PuTTY and bash v3, and most of the systems were using the ISO-8859-1 character set. While the specifics vary, these techniques should at least give users of different locales a starting point.
Royce Williams is a Unix-like systems administrator for an Alaskan telecommunications company. He was included in the package when they acquired the first Alaskan ISP. When not flushing bad characters to ground, Royce likes watching indie movies and trying to put FreeBSD on ancient hardware. He also has an Alaskan license plate problem. You can reach him at royce@tycho.org.
Copyright © 2007 Royce D. Williams. All rights reserved.