kwutil.util_path module¶

Helpers related to filesystem paths: enumeration, manipulation, and search.

kwutil.util_path.tree(path)[source]¶

Like os.walk, but yields a flat sequence of file and directory paths instead of (dirpath, dirnames, filenames) tuples.

Parameters: path (str | os.PathLike)
Yields: str – path

Example

>>> import itertools as it
>>> from kwutil.util_path import *  # NOQA
>>> import ubelt as ub
>>> path = ub.Path('.')
>>> gen = tree(path)
>>> results = list(it.islice(gen, 5))
>>> print('results = {}'.format(ub.urepr(results, nl=1)))
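The flattening idea can be sketched with the standard library alone. This is a minimal illustration, not the kwutil implementation: `tree_sketch` is a hypothetical name, and the order in which it emits directories and files is an assumption.

```python
import os
import tempfile


def tree_sketch(path):
    """Yield every directory and file path under ``path`` as a flat stream.

    Hypothetical stand-in for illustration; not the kwutil implementation.
    """
    for root, dnames, fnames in os.walk(path):
        # os.walk groups names per directory; join each one back onto its
        # root so the caller sees a single flat sequence of paths.
        for name in dnames + fnames:
            yield os.path.join(root, name)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'dir'))
    open(os.path.join(tmp, 'dir', 'a.txt'), 'w').close()
    # Express results relative to the temporary root for readability.
    found = sorted(os.path.relpath(p, tmp) for p in tree_sketch(tmp))
    print(found)
```

Because the function is a generator, callers can stop early (for example with itertools.islice) without walking the whole directory tree.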

kwutil.util_path.coerce_patterned_paths(data, expected_extension=None, globfallback=False)[source]¶

Coerce input to a list of paths.

Parameters:

data (str | List[str]) – a glob pattern or list of glob patterns or a yaml list of glob patterns
expected_extension (None | str | List[str]) – one or more expected extensions (including the leading dot)
globfallback (bool) – TODO: need a better name for this. If an input does not contain a wildcard and does not exist (i.e. glob will not match it), then that input is returned as-is.

Returns:

Multiple paths that match the query

Return type:

List[ub.Path]

Example

>>> # xdoctest: +REQUIRES(module:ruamel.yaml)
>>> from kwutil.util_path import *  # NOQA
>>> empty_fpaths = coerce_patterned_paths(None)
>>> assert len(empty_fpaths) == 0

Example

>>> # xdoctest: +REQUIRES(module:ruamel.yaml)
>>> from kwutil.util_path import *  # NOQA
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('kwutil/test/utils/path/').ensuredir()
>>> (dpath / 'file1.txt').touch()
>>> (dpath / 'dir').ensuredir()
>>> (dpath / 'dir' / 'subfile1.txt').touch()
>>> (dpath / 'dir' / 'subfile2.txt').touch()
>>> paths = coerce_patterned_paths(
...     f'''
...     - {dpath / 'file1.txt'}
...     - {dpath / 'file2.txt'}
...     - {dpath / 'dir'}
...     ''', expected_extension='.txt')
>>> paths = [p.shrinkuser() for p in paths]
>>> print('paths = {}'.format(ub.urepr(paths, nl=1)))
>>> with ChDir(dpath / 'dir'):
...     paths = coerce_patterned_paths('*.txt*')
>>> print('paths = {}'.format(ub.urepr(paths, nl=1)))
>>> assert len(paths) == 2

paths = [
    Path('~/.cache/kwutil/test/utils/path/file1.txt'),
    Path('~/.cache/kwutil/test/utils/path/dir/subfile1.txt'),
    Path('~/.cache/kwutil/test/utils/path/dir/subfile2.txt'),
]
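The globfallback behaviour described in the parameter list can be illustrated with a small standard-library sketch. The helper name `glob_with_fallback` and its wildcard check are assumptions made for illustration; they do not reflect how kwutil implements the option.

```python
import glob
import os
import tempfile


def glob_with_fallback(pattern, globfallback=False):
    """Glob ``pattern``; optionally return a wildcard-free miss unchanged.

    Hypothetical helper for illustration only.
    """
    matches = sorted(glob.glob(pattern))
    # Assumption: a pattern counts as "wildcard" if it has any glob magic.
    has_wildcard = any(ch in pattern for ch in '*?[')
    if not matches and globfallback and not has_wildcard:
        # Nothing matched and the input is a plain path: hand it back as-is.
        return [pattern]
    return matches


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'file1.txt'), 'w').close()
    missing = os.path.join(tmp, 'file2.txt')
    matched = glob_with_fallback(os.path.join(tmp, '*.txt'))
    strict = glob_with_fallback(missing)
    lenient = glob_with_fallback(missing, globfallback=True)
    print(len(matched), strict, lenient == [missing])
```

The fallback is useful when the caller wants to name an output path that does not exist yet, while still accepting glob patterns for existing inputs.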

kwutil.util_path.find(pattern=None, dpath=None, include=None, exclude=None, type=None, recursive=True, followlinks=False)[source]¶

Find all paths in a root subject to a search criterion

Parameters:

pattern (str) – The glob pattern the path name must match to be returned
dpath (str) – The root directory to search. Defaults to the current working directory.
include (str | List[str]) – Pattern or list of patterns. If specified, search only files whose base name matches this pattern. By default the pattern is GLOB.
exclude (str | List[str]) – Pattern or list of patterns. Skip any file with a name suffix that matches the pattern. By default the pattern is GLOB.
type (str | List[str]) – A list of one-character codes indicating what types of file can be returned. Currently we only allow either “f” for file or “d” for directory. Symbolic links are not currently distinguished. In the future we may support POSIX codes, see [1] for details.
recursive (bool) – if True, search all subdirectories recursively
followlinks (bool, default=False) – if True will follow directory symlinks

References

[1] https://linuxconfig.org/identifying-file-types-in-linux

Todo

mindepth

maxdepth

ignore_case

regex_match

Example

>>> from kwutil.util_path import *  # NOQA
>>> paths = list(find(pattern='*'))
>>> print('paths = {!r}'.format(paths))
>>> paths = list(find(pattern='*', type='f'))
>>> print('paths = {!r}'.format(paths))
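To make the interaction between the pattern, exclude, type, and recursive options concrete, here is a rough standard-library sketch. `find_sketch` and its `types` argument are hypothetical names; the real function may differ in matching rules, ordering, and symlink handling.

```python
import fnmatch
import os
import tempfile


def find_sketch(pattern='*', dpath='.', exclude=(), types='fd', recursive=True):
    """Yield paths under ``dpath`` whose base name matches ``pattern``.

    Hypothetical illustration; ``types`` holds 'f' and/or 'd' codes.
    """
    for root, dnames, fnames in os.walk(dpath):
        candidates = [(n, 'd') for n in dnames] + [(n, 'f') for n in fnames]
        for name, kind in candidates:
            if kind not in types:
                continue  # wrong kind of entry (file vs directory)
            if not fnmatch.fnmatch(name, pattern):
                continue  # base name does not match the glob pattern
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue  # explicitly excluded
            yield os.path.join(root, name)
        if not recursive:
            break  # only inspect the top-level directory


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ['a.txt', 'b.log', os.path.join('sub', 'c.txt')]:
        open(os.path.join(tmp, rel), 'w').close()
    txt_files = sorted(os.path.relpath(p, tmp)
                       for p in find_sketch('*.txt', tmp, types='f'))
    top_only = sorted(os.path.relpath(p, tmp)
                      for p in find_sketch('*', tmp, recursive=False))
    no_logs = sorted(os.path.relpath(p, tmp)
                     for p in find_sketch('*', tmp, exclude=['*.log'], types='f'))
    print(txt_files, top_only, no_logs)
```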

kwutil.util_path.resolve_relative_to(path, dpath, strict=False)[source]¶: Given a path, try to resolve its symlinks such that it is relative to the given dpath.

kwutil.util_path.resolve_directory_symlinks(path)[source]¶: Only resolve symlinks of directories
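One plausible reading of "only resolve symlinks of directories" is that the parent directories of a path are resolved while a symlink in the final component is left untouched. The sketch below implements that reading with the standard library; it is an assumption about the intent, `resolve_directory_symlinks_sketch` is a hypothetical name, and the demo needs a platform where symlinks can be created.

```python
import os
import tempfile


def resolve_directory_symlinks_sketch(path):
    """Resolve symlinks in the parent directories, but not the final name.

    Hypothetical illustration of one interpretation of the docstring.
    """
    parent, name = os.path.split(os.path.abspath(path))
    return os.path.join(os.path.realpath(parent), name)


with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)  # the temp dir itself may sit behind a link
    real_dir = os.path.join(tmp, 'real_dir')
    os.makedirs(real_dir)
    target = os.path.join(real_dir, 'target.txt')
    open(target, 'w').close()
    # A directory symlink and a file symlink inside the real directory.
    link_dir = os.path.join(tmp, 'link_dir')
    os.symlink(real_dir, link_dir)
    os.symlink(target, os.path.join(real_dir, 'link.txt'))

    resolved = resolve_directory_symlinks_sketch(
        os.path.join(link_dir, 'link.txt'))
    expected = os.path.join(real_dir, 'link.txt')
    still_link = os.path.islink(resolved)  # the file symlink is preserved
    print(resolved == expected, still_link)
```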

kwutil.util_path.sidecar_glob(main_pat, sidecar_ext, main_key='main', sidecar_key=None, recursive=0)[source]¶

Similar to a regular glob, but returns a dictionary with associated main-file / sidecar-file pairs.

Todo

add as a general option to Pattern.paths?

Parameters: main_pat (str | PathLike) – glob pattern for the main non-sidecar file
Yields: Dict[str, ub.Path | None]

Notes

A sidecar file is defined by the sidecar extension. We usually use this for .dvc sidecars.

When the pattern includes a .dvc suffix, the result will include those .dvc files and any matching main files they correspond to. Note: if you search for paths like foo_*.dvc, this might skip unstaged files. Therefore, include the .dvc suffix in the pattern only if you do not want any unstaged files.

If you want both staged and unstaged files, ensure the pattern does not exclude objects without a .dvc suffix (i.e. don’t end the pattern with .dvc).

When the pattern does not include a .dvc suffix, the result includes all files matching the pattern, plus any files that are found by appending a .dvc suffix to the pattern.

When the pattern matches both a .dvc and a non-.dvc file, they are grouped together.

Example

>>> from kwutil.util_path import *  # NOQA
>>> import ubelt as ub
>>> dpath = ub.Path.appdir('xdev/tests/sidecar_glob')
>>> dpath.delete().ensuredir()
>>> (dpath / 'file1').touch()
>>> (dpath / 'file1.ext').touch()
>>> (dpath / 'file1.ext.car').touch()
>>> (dpath / 'file2.ext').touch()
>>> (dpath / 'file3.ext.car').touch()
>>> (dpath / 'file4.car').touch()
>>> (dpath / 'file5').touch()
>>> (dpath / 'file6').touch()
>>> (dpath / 'file6.car').touch()
>>> (dpath / 'file7.bike').touch()
>>> def _handle_results(results):
...     results = list(results)
...     for row in results:
...         for k, v in row.items():
...             if v is not None:
...                 row[k] = v.relative_to(dpath)
...     print(ub.urepr(results, sv=1))
...     return results
>>> main_key = 'main'
>>> sidecar_key = '.car'
>>> sidecar_ext = '.car'
>>> main_pat = dpath / '*'
>>> _handle_results(sidecar_glob(main_pat, sidecar_ext))
>>> _handle_results(sidecar_glob(dpath / '*.ext', '.car'))
>>> _handle_results(sidecar_glob(dpath / '*.car', '.car'))
>>> _handle_results(sidecar_glob(dpath / 'file*.ext', '.car'))
>>> _handle_results(sidecar_glob(dpath / '*', '.ext'))
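The grouping rule in the notes above can be sketched with plain glob calls. This is an illustrative approximation only: `sidecar_pairs` and the `'main'` / `'sidecar'` dictionary keys are hypothetical, and the real function may order or key its results differently.

```python
import glob
import os
import tempfile


def sidecar_pairs(main_pat, sidecar_ext):
    """Group main files with their sidecar files (hypothetical sketch)."""
    paths = set(glob.glob(main_pat))
    if not main_pat.endswith(sidecar_ext):
        # Also look for sidecars of main files that may not exist on disk.
        paths.update(glob.glob(main_pat + sidecar_ext))
    rows = {}
    for path in paths:
        if path.endswith(sidecar_ext):
            main = path[:-len(sidecar_ext)]
            row = rows.setdefault(main, {'main': None, 'sidecar': None})
            row['sidecar'] = path
            if os.path.exists(main):
                row['main'] = main  # the sidecar's main file is present
        else:
            row = rows.setdefault(path, {'main': None, 'sidecar': None})
            row['main'] = path
    return [rows[key] for key in sorted(rows)]


def _name(path):
    # Reduce a path to its base name, keeping None as None.
    return None if path is None else os.path.basename(path)


with tempfile.TemporaryDirectory() as tmp:
    for fname in ['file1.ext', 'file1.ext.car', 'file2.ext', 'file3.ext.car']:
        open(os.path.join(tmp, fname), 'w').close()
    rows = sidecar_pairs(os.path.join(tmp, '*.ext'), '.car')
    summary = [(_name(r['main']), _name(r['sidecar'])) for r in rows]
    print(summary)
```

In this sketch a main file without a sidecar and a sidecar without a main file both produce a row, with the missing side set to None.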

kwutil.util_path.sanitize_path_name(path: str, maxlen=128, hash_suffix=None, preserve_prefix: bool = True, replacements=None, safe=False, allow_unicode: bool = True, **deprecated) → str[source]¶

Sanitize an input string so it can be safely used as a filename or path segment.

This function replaces characters that are illegal on common file systems, strips control characters, optionally normalizes Unicode (or converts to ASCII), trims the length if necessary (while preserving a prefix), and ensures the name does not conflict with reserved names (e.g. on Windows).
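Some of the steps listed above (stripping control characters, guarding reserved names, and handling a trailing dot or space) can be sketched as follows. This is not the library's code: `basic_clean` is a hypothetical helper, the reserved-name set is a deliberately incomplete sample, and the trailing-character rule is an assumption chosen to agree with the documented example outputs.

```python
import re

# Hypothetical, incomplete sample of Windows reserved device names.
RESERVED = {'CON', 'PRN', 'AUX', 'NUL', 'COM1', 'LPT1'}


def basic_clean(name):
    """Apply a few sanitization steps (illustrative sketch only)."""
    # Non-string input is converted to a string first.
    name = str(name)
    # Drop ASCII control characters (code points 0-31 and 127).
    name = re.sub(r'[\x00-\x1f\x7f]', '', name)
    # Wrap reserved device names so they no longer collide.
    if name.upper() in RESERVED:
        name = '_' + name + '_'
    # Assumption: strip trailing spaces, then pad a trailing dot.
    name = name.rstrip(' ')
    if name.endswith('.'):
        name += '_'
    return name


print(basic_clean('hello\x00world'))
print(basic_clean('CON'))
print(basic_clean('filename. '))
```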

Parameters:

path (str) – The input file name or path segment.
maxlen (int | None) – Maximum allowed length for the sanitized name. If exceeded, the name is truncated with a hash appended. Set to None for no length limit. (If specified, must be at least 8.)
hash_suffix (str | None | callable) – An optional extra suffix to append if the name is hashed. Can be a string or a callable returning a string.
preserve_prefix (bool) – If True, preserve as much of the original sanitized name as possible when truncating (with an underscore plus hash appended); if False, replace the name entirely with the hash (and optional hash_suffix).
replacements (dict | str | None) – A mapping of substrings to replace in addition to the defaults. The characters `|<>:?*"/` are always illegal and are always replaced, but the user can overwrite what they are replaced with here. If given as a string, all special characters are replaced with the given character.
safe (bool) – If True, also replaces characters that are unsafe but not strictly illegal. This includes characters problematic for shell commands, URLs, or scripts, i.e. ‘ #^&@{}[]$+;!,`~=%’. By default (False), only illegal characters are replaced.
allow_unicode (bool) – If True, preserves Unicode characters (using NFC normalization); if False, converts the name to ASCII (discarding unsupported characters).
**deprecated – handles deprecated arguments
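The interplay of maxlen and preserve_prefix can be pictured with a short sketch. It assumes an 8-character SHA-256 hex digest and an underscore separator; the hash function, alphabet, and exact layout used by kwutil are not stated here, so `truncate_with_hash` and `hashlen` are purely illustrative.

```python
import hashlib


def truncate_with_hash(name, maxlen=128, preserve_prefix=True, hashlen=8):
    """Shorten ``name`` to at most ``maxlen`` characters (hypothetical sketch)."""
    if maxlen is None or len(name) <= maxlen:
        return name  # short enough: nothing to do
    # Assumption: a truncated SHA-256 hex digest stands in for the real hash.
    digest = hashlib.sha256(name.encode('utf8')).hexdigest()[:hashlen]
    keep = maxlen - hashlen - 1  # room left for the prefix and separator
    if not preserve_prefix or keep <= 0:
        return digest[:maxlen]
    return name[:keep] + '_' + digest


name = 'longfilename_with_illegal_chars'
truncated = truncate_with_hash(name, maxlen=20)
hash_only = truncate_with_hash(name, maxlen=20, preserve_prefix=False)
tiny = truncate_with_hash(name, maxlen=8)
print(len(truncated), len(hash_only), len(tiny))
```

With maxlen equal to the hash length there is no room for a prefix, which is consistent with the stated minimum of 8 for maxlen.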

Returns:

A sanitized string that is safe for use as a filename.

Return type:

str

Notes

Illegal characters are disallowed by common filesystems: |, <, >, :, ", ?, *, /, \
- These are reserved or control characters on Windows and Linux.
- Always replaced, regardless of safe.

Unsafe characters are technically allowed in filenames but may cause issues: #, &, @, ^, {}, [], $, +, ;, !, ,, `
- Unsafe for use in:
  - Shell commands (e.g., &, ;, $)
  - URLs or cloud storage (e.g., #, %, +)
  - Code injection or parsing bugs (e.g., {}, [], `)
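The two tiers described in these notes (illegal characters always replaced, unsafe characters replaced only when safe=True) can be modelled with two lookup tables. The replacement tokens below are taken from the example outputs in this section, but the tables are deliberately partial, `replace_chars` is a hypothetical helper, and it handles single characters only rather than arbitrary substrings.

```python
# Partial tables for illustration; tokens copied from the documented outputs.
ILLEGAL = {'|': '_PIPE_', '<': '_LT_', '>': '_GT_', ':': '_COLON_', '?': '_QM_'}
UNSAFE = {'#': '_HASH_', '@': '_AT_', '[': '_LSB_', ']': '_RSB_'}


def replace_chars(name, safe=False, replacements=None):
    """Replace illegal (and optionally unsafe) characters (sketch only)."""
    table = dict(ILLEGAL)  # illegal characters are always replaced
    if safe:
        table.update(UNSAFE)  # unsafe characters only on request
    if isinstance(replacements, str):
        # A single string replaces every special character uniformly.
        table = {ch: replacements for ch in table}
    elif replacements:
        # A mapping extends the defaults and may override them.
        table.update(replacements)
    return ''.join(table.get(ch, ch) for ch in name)


print(replace_chars('report#final@v2[notes]'))
print(replace_chars('report#final@v2[notes]', safe=True))
print(replace_chars('a#b|#c', safe=True, replacements={'#': 'X', '|': '-'}))
```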

References

https://chatgpt.com/c/67aa3e3b-cf48-8013-9be6-f4ff88eecf72

https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names

Examples

>>> from kwutil.util_path import *  # NOQA
>>> sanitize_path_name('a chan with space_PIPE_bar_PIPE_baz')
'a chan with space_PIPE_bar_PIPE_baz'
>>> sanitize_path_name('dont|use<these>chars:in?a*path.')
'dont_PIPE_use_LT_these_GT_chars_COLON_in_QM_a_ASTRIX_path._'
>>> sanitize_path_name('dont|use<these>chars:in?a*path.', maxlen=8)
'nckzxtpn'
>>> sanitize_path_name('CON')
'_CON_'
>>> # Handling long names (forcing a hash):
>>> # "abcd|efgh" becomes "abcd_efgh" (9 characters) which exceeds maxlen=8,
>>> # so the output will be a hash (of length 8). We cannot predict the hash value,
>>> # but we can check that the length is 8.
>>> result = sanitize_path_name("abcd|efgh", maxlen=8)
>>> len(result) == 8
True
>>> # Preserving a prefix vs. not preserving it:
>>> # With preserve_prefix True (default) and a moderately short maxlen,
>>> # some of the original string is kept along with an appended hash.
>>> result = sanitize_path_name("longfilename_with_illegal|chars", maxlen=20)
>>> "_" in result  # contains an underscore separating prefix and hash
True
>>> # With preserve_prefix False, the entire output is just the hash.
>>> result2 = sanitize_path_name("longfilename_with_illegal|chars", maxlen=20, preserve_prefix=False)
>>> "_" not in result2 or result2.count('_') == 1  # only a possible separator with hash_suffix
True
>>> # Unicode handling:
>>> sanitize_path_name('café', allow_unicode=True)
'café'
>>> sanitize_path_name('café', allow_unicode=False)
'cafe'
>>> # Windows reserved names:
>>> sanitize_path_name('CON')
'_CON_'
>>> sanitize_path_name('NUL')
'_NUL_'
>>> # Removal of control characters:
>>> sanitize_path_name("hello\x00world")
'helloworld'
>>> sanitize_path_name("abc\x01def")
'abcdef'
>>> # Handling names ending with a dot or space:
>>> sanitize_path_name("filename. ")
'filename._'
>>> # Non-string input is converted to a string:
>>> sanitize_path_name(12345)
'12345'
>>> # Using a custom replacement map:
>>> sanitize_path_name("a#b#c", replacements={"#": "X"})
'aXbXc'
>>> # When you specify a map, it updates the defaults
>>> sanitize_path_name("a#b|#c", replacements={"#": "X"})
'aXb_PIPE_Xc'
>>> # But you can overwrite what the invalid characters map to
>>> sanitize_path_name("a#b|#c", replacements={"#": "X", '|': 'HELLO'})
'aXbHELLOXc'
>>> # Use a single character to replace everything.
>>> sanitize_path_name("a/b|<<c", replacements='_')
'a_b___c'
>>> # Unsafe characters are preserved by default
>>> sanitize_path_name('report#final@v2[notes]')
'report#final@v2[notes]'
>>> # When safe=True, unsafe characters are also replaced
>>> sanitize_path_name('report#final@v2[notes]', safe=True)
'report_HASH_final_AT_v2_LSB_notes_RSB_'
>>> # Unsafe and illegal characters can be replaced together
>>> sanitize_path_name('a|b#c@d[e]f', safe=True)
'a_PIPE_b_HASH_c_AT_d_LSB_e_RSB_f'
>>> # Custom replacement mappings still apply and override defaults
>>> sanitize_path_name('a#b|#c', safe=True, replacements={'#': 'X', '|': '-'})
'aXb-Xc'
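The Unicode handling shown in the examples above can be approximated with the standard unicodedata module. The NFC step follows the parameter description; the NFKD-then-encode approach for the ASCII case is an assumption about how such a conversion is commonly done, and `normalize_name` is a hypothetical helper.

```python
import unicodedata


def normalize_name(name, allow_unicode=True):
    """Normalize to NFC, or fall back to plain ASCII (illustrative sketch)."""
    if allow_unicode:
        # NFC composes base characters and combining marks where possible.
        return unicodedata.normalize('NFC', name)
    # Assumption: decompose first so accents split off from their base
    # letters, then drop everything that is not representable in ASCII.
    decomposed = unicodedata.normalize('NFKD', name)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


composed = normalize_name('cafe\u0301')  # 'e' followed by a combining accent
ascii_only = normalize_name('caf\u00e9', allow_unicode=False)
print(composed == 'caf\u00e9', ascii_only)
```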