Efficient File Enumeration

Delphi offers two ways of enumerating files in a directory and its sub-directories: the first is the classic (and buggy) FindFirst/FindNext, the second is IOUtils' TDirectory.GetFiles, which is not very efficient.

Here is why and how I implemented DWScript’s dwsXPlatform.CollectFiles, and a tip about getting a small system-wide boost as a bonus.

8dot3 file naming

The old 8dot3 naming convention dating back to the DOS ancestry of Windows has been obsolete for a while, but it’s still likely to cost you time… or trouble.

It affects both Delphi methods negatively, because the underlying Windows API function they use (FindFirstFile) is obsolete as well, and obsolete in two ways:

  • it spends time returning both the regular (long) file name and the 8dot3 name (which can mean extra lookups in the file system), even though Delphi doesn’t use the 8dot3 name.
  • it doesn’t filter extensions appropriately (for compatibility with 8dot3 names)

In the case of FindFirst, it means that if you search for ‘*.dpr’, you’ll get .dproj files as well.
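
A minimal sketch of the usual workaround, re-checking the extension yourself after FindFirst returns an entry. ListFilesWithExactExt is a hypothetical helper name used for illustration, and the unit names assume Delphi XE2 or later:

```delphi
program ListDprFiles;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: lists the files of ADir whose extension is exactly
// AExt (e.g. '.dpr'), discarding extra matches such as '.dproj'.
procedure ListFilesWithExactExt(const ADir, AExt: string);
var
  SR: TSearchRec;
begin
  if FindFirst(IncludeTrailingPathDelimiter(ADir) + '*' + AExt, faAnyFile, SR) = 0 then
  try
    repeat
      // Skip directories, then compare the extension case-insensitively
      if (SR.Attr and faDirectory) = 0 then
        if SameText(ExtractFileExt(SR.Name), AExt) then
          Writeln(SR.Name);
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end;

begin
  ListFilesWithExactExt(GetCurrentDir, '.dpr');
end.
```

This only covers masks of the form '*.ext'; a mask with wildcards in the extension needs a real mask matcher.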

TDirectory.GetFiles solves the filtering by doing it Delphi-side with TMask from the Masks unit. TMask uses a quite efficiently implemented state machine, but IOUtils invokes it through the MatchesMask function, which creates and destroys a TMask every single time…
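
To illustrate the difference, here is a sketch that builds the TMask once and reuses it for every entry, instead of calling MatchesMask per file. It assumes TMask is the class from System.Masks with a Matches method, and it is not the code IOUtils or DWScript actually uses:

```delphi
program ReuseMask;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Masks;

var
  Mask: TMask;
  SR: TSearchRec;
begin
  // Build the mask state machine once...
  Mask := TMask.Create('*.dpr');
  try
    // ...enumerate everything in the current directory...
    if FindFirst('*', faAnyFile, SR) = 0 then
    try
      repeat
        // ...and reuse the same TMask for every entry
        if ((SR.Attr and faDirectory) = 0) and Mask.Matches(SR.Name) then
          Writeln(SR.Name);
      until FindNext(SR) <> 0;
    finally
      FindClose(SR);
    end;
  finally
    Mask.Free;
  end;
end.
```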

IOUtils' internal logic is also quite complex and heavy-weight (with anonymous procedures, implicit exception frames, implicit conversions and generally redundant code), and the GetFiles implementation doesn’t scale well as it relies on a dynamic array as return value (FastMM mitigates the issue, but not entirely).
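
As a rough sketch of a lighter-weight alternative, the procedure below recurses with plain FindFirst/FindNext and appends results to a caller-supplied TStrings rather than returning a dynamic array. CollectInto is a hypothetical name, and this is an illustration under those assumptions, not the dwsXPlatform.CollectFiles implementation:

```delphi
program CollectIntoList;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Masks;

// Hypothetical helper: appends the files of ADir and its sub-directories
// that match AMask to AList. Reparse points are not treated specially.
procedure CollectInto(const ADir: string; AMask: TMask; AList: TStrings);
var
  SR: TSearchRec;
  Base: string;
begin
  Base := IncludeTrailingPathDelimiter(ADir);
  if FindFirst(Base + '*', faAnyFile, SR) <> 0 then
    Exit;
  try
    repeat
      if (SR.Attr and faDirectory) <> 0 then
      begin
        // Recurse, skipping the '.' and '..' pseudo-entries
        if (SR.Name <> '.') and (SR.Name <> '..') then
          CollectInto(Base + SR.Name, AMask, AList);
      end
      else if AMask.Matches(SR.Name) then
        AList.Add(Base + SR.Name);
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end;

var
  Mask: TMask;
  Files: TStringList;
begin
  Mask := TMask.Create('*.pas');
  try
    Files := TStringList.Create;
    try
      CollectInto(GetCurrentDir, Mask, Files);
      Writeln(Files.Count, ' file(s) found');
    finally
      Files.Free;
    end;
  finally
    Mask.Free;
  end;
end.
```

The mask is created once by the caller and shared across the whole recursion, so no per-file allocation is needed for matching.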

So in practice, if you’ve got a fast SSD or if everything is in the Windows file system memory cache, IOUtils will be the bottleneck, not the file system.

Next: Getting around the 8dot3 names

7 thoughts on “Efficient File Enumeration”

  1. FIND_FIRST_EX_LARGE_FETCH should help if you access a network share and the directory has many entries. But it only works for Windows 7 / 2008R2 or newer.

  2. For what it may be worth, I just tried using FSUTIL to strip 8dot3names from partitions on an external hard drive on Win7 64. It was busy for a time, then I got a dialog reporting it had stopped working. Happened in all attempts, on four different partitions, on two drives. And yes, I ran CMD as administrator. I do not have time to pursue this at the moment, but just wanted to let you know that there may be issues….

  3. For me using a lot of threads (> number of available cores) to crawl directories was beneficial. Currently I use up to (number of available cores) * 4 threads.
    For e.g. finding all *.pas files on my disk the 8dot3 names didn’t really make a significant difference, but for sure any improvement is welcome…

  4. @Andreas Hausladen I suppose network performance comes into play, I tested on a 1 Gb LAN and large fetches weren’t beneficial there either.

    @Bill Meyer I had that happen on my main system drive, but data drives were okay once applications got closed.

    @Andreas Dorn How many .pas files and folders do you have? Using “dir *.pas /s”, for instance, should give you that at the end; here I’m testing against 6400 files in 2500 folders (for the main branch).

  5. The threading and caching wobble probably skewed my unscientific measurements… I tested with 10000/3500 and 1000/100 pas-files/folders. Now that I took some more time to test:

    For 10000 files removing 8dot3 takes me down from about 250 ms to 220 ms, so it’s definitely a measurable improvement. The 1000 files are very fast, so it’s difficult for me to measure something noticeable there.

    Searching still takes a lot of time, I hope there is some more room for improvements…

  6. @Eric These were data drives. So unless it’s a requirement that no apps at all be open, I am still at a loss.

  7. Wouldn’t it be (much) faster if instead of using a TMask instance, you used : “if (ExtractFileExt(filename) = MaskFileExt)”, whereby “MaskFileExt := ExtractFileExt(filemask)” is calculated once beforehand ?

    (It won’t work when there’s a wildcard in the extension – perhaps use your implementation for that scenario?)
