I needed to list all files in a directory, but ls, find, and os.listdir all hung. This is my story.
NOTE: there is no good reason that you should ever have 8 million files in the same directory, but if you do, this is your solution :-).
TLDR: Write a C program that calls the getdents syscall directly with a large buffer size, and ignore entries with inode == 0.
Why doesn’t ls work?
ls and practically every other method of listing a directory (including Python's os.listdir and find) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (i.e. 500M of directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(): you will need to use the getdents() syscall directly, rather than helper methods from libc.
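For reference, here is a minimal sketch (my own illustration, not the actual coreutils code) of the readdir()-based approach that ls, find, and os.listdir all boil down to. It is perfectly correct; it is just slow on huge directories because libc fetches the entries in small chunks:
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* opendir()/readdir() walk the directory through libc's internal
       buffer, which only holds a few tens of KB of entries per read. */
    DIR *dirp = opendir(argc > 1 ? argv[1] : ".");
    if (dirp == NULL) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dirp)) != NULL)
        printf("%s\n", entry->d_name);

    closedir(dirp);
    return 0;
}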
How to quickly list a directory with 8 million files
The trick is to understand getdents(), the low level system call that reads directory entries from disk and returns a directory entry (dirent) data structure.
getdents(file descriptor for the directory, buffer to fill with directory entries, number of bytes to read)
Luckily the man page has a lot of detail, and provides C source code to list all the files in a directory. Read it here and copy the source code to a file, like listdir.c. There are two modifications you will need to make in order to quickly list all the files in a directory. First, increase the buffer size from X to something like 5 megabytes.
#define BUF_SIZE 1024*1024*5
Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding
if (d->d_ino != 0) printf(...);
In my case I really only cared about the file names in the directory, so I also rewrote the printf() statement to only print the filename.
if (d->d_ino != 0) printf("%s\n", (char *) d->d_name);
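Putting the pieces together, listdir.c ends up looking something like this. This is a sketch reconstructed from the getdents(2) man page example plus the two changes above, not a verbatim copy, so your version may differ in details:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records getdents() writes into the buffer
   (see the getdents(2) man page). */
struct linux_dirent {
    unsigned long  d_ino;     /* inode number (0 means a deleted entry) */
    unsigned long  d_off;     /* offset to the next record */
    unsigned short d_reclen;  /* length of this record */
    char           d_name[];  /* null-terminated filename */
};

#define BUF_SIZE 1024*1024*5  /* modification 1: a 5 MB buffer */

int main(int argc, char *argv[])
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char *buf = malloc(BUF_SIZE);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        long nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) {
            perror("getdents");
            exit(EXIT_FAILURE);
        }
        if (nread == 0)  /* end of directory */
            break;

        for (long bpos = 0; bpos < nread; ) {
            struct linux_dirent *d = (struct linux_dirent *) (buf + bpos);
            if (d->d_ino != 0)  /* modification 2: skip deleted entries */
                printf("%s\n", d->d_name);
            bpos += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    exit(EXIT_SUCCESS);
}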
Compile it (it doesn’t need any external libraries, so it’s super simple to do)
gcc listdir.c -o listdir
Now just run
./listdir [directory with insane number of files]
Presto, you should see all the files in your insanely large directory :-). In my case I did
./listdir [directory with insane number of files] > output.txt
and then used the contents of output.txt to process the files that were previously “unlistable”.
Why did we run into this problem in the first place?
Without going into too many details, we needed a simple datastore where we could find an entry based on a key, append values to each key, and expire keys that were older than a certain threshold, while performing a cleanup operation on the stored values. The filesystem actually works really well for this case as long as there aren't that many keys active at the same time (i.e. on the filesystem, listing all keys is the slowest operation, so if there aren't many keys there's nothing to worry about).

Something got screwed up and caused the expiration action to fail. So instead of keeping a small number of files in the directory, we started generating a lot of entries that were not being cleaned up. The next time the cleanup operation ran, it could no longer list all the keys in a reasonable amount of time (reading 32K of directory entries at a time). This compounded the problem and led to a state where new files were being created, but old files were not cleaned up. Soon we had 8 million files in a single directory. We were able to fix the root cause fairly quickly by resolving a bug in our "sweeper" and creating a new cache directory, but we still had 8 million files that needed to be processed. Keep in mind the timescale here is a matter of hours. (Luckily this problem happened on a weekend, was caught fast, and had very little effect on our customers.)

One important thing to keep in mind is that at this point we had no idea how many files needed to be processed; all we knew was that ls and os.listdir() were both hanging when trying to list all the files in the directory. I asked in IRC, and tried searching Google without much luck. Other people had run into this problem before, but no one had a good solution. In fact I recall running into a similar problem listing a mailqueue in Qmail many years ago, but I have no idea how we solved it. (If you like solving this sort of problem, we are hiring: jobs@olark.com)
Diagnosing the unknown
Whenever I see something hang without debugging output I turn to strace. strace lets you watch the system calls made during program execution and lets you see what's going on when no output is being printed. I tried
strace find .
strace ls
strace python -c 'import os; os.listdir(".")'
All of these commands produced basically the same strace output: first they open the file containing information about a directory
open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
Then they make the getdents syscall which returns directory entries
getdents(5, /* 586 entries */, 32768) = 32752
It took me a while to figure this out, but basically the getdents call (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) looks like this:
int getdents(unsigned int fd, struct linux_dirent *dirp, unsigned int count);
where count is really the size of the buffer (in this case 32K). Of course I didn't notice this until someone in IRC mentioned that you could do
ls -dl
to see the size of the file storing directory entries. In my directory I got:
drwxr-xr-x 2 root root 537919488 Jul 29 04:55 . (513M)
Putting two and two together I could see that the reason it was taking forever to list the directory was because ls was reading the directory entries file 32K at a time, and the file was 513M. So it would take around 16416 getdents() system calls to list the directory. That is a lot of calls, especially on a slow virtualized disk. This led me to the solution above: increase the read buffer size for getdents() to decrease the number of system calls and speed up all the disk access :-).
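To put numbers on it (my own back-of-the-envelope arithmetic, using the directory size above):
537919488 bytes / 32768 bytes per call ≈ 16416 getdents() calls
537919488 bytes / (5 * 1024 * 1024) bytes per call ≈ 103 getdents() calls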
Takeaways:
If a command hangs with no output, strace it and see what system calls it is stuck making.
Tools built on readdir() (ls, find, os.listdir) read directory entries roughly 32K at a time, which is painfully slow on a huge directory.
Calling getdents() directly with a much larger buffer, and skipping entries with inode == 0, makes the directory listable again.
ls -dl on the directory shows the size of the file holding its entries, which tells you how bad things are before you start.
Appendix:
After all this I was still a little curious about where the 32K buffer size in ls came from, so I downloaded the source to coreutils (which contains ls) and poked around. (NOTE: this is mostly just some notes I took, and not a detailed analysis.) Search through the code, and you'll find that ls calls readdir() with a directory pointer.
while (1) {
  /* Set errno to zero so we can distinguish between a readdir failure
     and when readdir simply finds that there are no more entries. */
  errno = 0;
  next = readdir (dirp);
  if (next)
  ...
readdir() is defined in libc. So I continued my search and downloaded the source to libc to figure out how the buffer size for readdir() was determined. getdents() is called inside of readdir() (as expected):
bytes = __GETDENTS (dirp->fd, dirp->data, maxread);
if (bytes <= 0)
And the byte size comes either from the size of the directory entry struct or from dirp->allocation. Given that we were reading multiple entries, I am pretty sure maxread was being set from dirp->allocation.
/* Fixed-size struct; must read one at a time (see below). */
maxread = sizeof *dp;
#else
maxread = dirp->allocation;
#endif
If you poke around in
sysdeps/unix/dirstream.h
sysdeps/unix/opendir.c
you can see where this value gets set. However, it certainly doesn't appear to change based on the size of the directory entry file. (Perhaps the buffer should be set dynamically based on the size of the directory entry file.) Finally, as for the trick for skipping deleted files, I noticed a line:
if (dp->d_ino == 0) ..
inside of ls, which keeps deleted files from appearing when a directory is listed.