A corpus is often available as a large number of documents in a directory or directory tree, where the filename and/or path within the tree conveys important information for filtering or selecting documents, e.g. the filename may contain a year, a topic, a classification label, etc.
It would then be extremely useful to be able to specify a regexp that selects which path names to import. Path names would probably best be represented as URLs, so that subdirectory separators are always slashes, even on Windows, rather than backslashes, which are very clumsy to use in regexps.
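
To make the request concrete, here is a minimal sketch in plain Python of the kind of selection meant. It is not an existing API: the function name `select_documents` and its parameters are hypothetical, and it matches against slash-separated paths relative to the corpus root rather than full URLs.

```python
import re
from pathlib import Path


def select_documents(root, pattern):
    """Return the files under `root` whose relative path matches `pattern`.

    Hypothetical helper for illustration only. Paths are matched in their
    slash-separated form, relative to `root`, on every platform.
    """
    root = Path(root)
    regex = re.compile(pattern)
    selected = []
    # Walk the whole tree; sorting gives a reproducible order.
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # as_posix() uses forward slashes, also on Windows.
        relative = path.relative_to(root).as_posix()
        # search() matches anywhere in the path; anchor with ^ and $ if needed.
        if regex.search(relative):
            selected.append(path)
    return selected
```

With a layout such as `news/2019/a.txt`, a call like `select_documents("corpus", r"^news/20\d\d/.*\.txt$")` would pick only the `.txt` files in the year subdirectories of `news`. If full URLs are preferred over relative paths, `Path.as_uri()` on an absolute path would be one way to obtain a `file:` URL to match against instead.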