String Handling in Bash

Handling strings with non-ASCII and special characters in Bash

Handling strings with non-ASCII and special characters in Bash can be a headache because of how Bash interprets and manipulates strings. The issues usually stem from locale settings, inconsistent quoting, and the way Bash treats certain characters (spaces, newlines, and globbing characters). Fortunately, there are practical strategies that mitigate these problems in day-to-day scripting, including escaping functions and conventions.

Common problems

  • Non-ASCII characters: Bash may mishandle UTF-8 or other encodings if the locale isn’t set properly, leading to garbled output or unexpected behavior.
  • Special characters: Characters like *, ?, $, quotes (' and "), and newlines can trigger expansions, substitutions, or simply break commands if not handled carefully.
  • Quoting issues: Improper quoting (or lack thereof) can cause strings to split unexpectedly or lose their intended meaning.

Strategies to avoid these issues

1. Set a proper locale

Ensure your environment uses a consistent locale that supports UTF-8, which is standard for handling non-ASCII characters. Add this to your script or .bashrc:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

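This ensures Bash and tools like grep, sed, or awk interpret non-ASCII characters correctly, provided the locale is actually installed (check with locale -a). Note that LC_ALL overrides LANG and every other LC_* variable, so it also changes sorting, number formatting, and messages for everything the script runs.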
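One way to see what the locale changes is to compare a string's length in characters with its length in bytes. The following is a minimal sketch, not a definitive recipe: char_length and byte_length are hypothetical helper names, and the character count is only meaningful when a UTF-8 locale is in effect, which is why only the byte count is annotated with a fixed result.

```bash
#!/usr/bin/env bash
# "café" written as explicit UTF-8 bytes: é is the two bytes 0xC3 0xA9.
word=$'caf\xc3\xa9'

# Length in characters under whatever locale is currently active.
char_length() {
    printf '%s\n' "${#1}"
}

# Length in bytes: in the C locale every byte counts as one character.
byte_length() {
    local LC_ALL=C
    printf '%s\n' "${#1}"
}

byte_length "$word"   # prints 5
char_length "$word"   # prints 4 only if a UTF-8 locale is in effect

# List the UTF-8 locales installed on this machine before exporting one.
locale -a 2>/dev/null | grep -iE 'utf-?8' || echo "no UTF-8 locale found"
```

If the last command prints no UTF-8 locale, exporting en_US.UTF-8 will not have the intended effect until one is installed.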

2. Use robust quoting

Always quote variables and strings to prevent word splitting and globbing. For example:

string="hello * world"
echo "$string"  # Outputs: hello * world
echo $string    # Outputs: hello, then every file name in the current directory (from *), then world

Use double quotes ("$var") unless you specifically need unquoted behavior. Single quotes ('$var') prevent all expansion, which can be useful for literal strings.
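The difference is easy to observe by counting the arguments a command actually receives. This is a small sketch assuming the default IFS; count_args is a hypothetical helper, and the sample string deliberately avoids glob characters so the result does not depend on the current directory.

```bash
#!/usr/bin/env bash
# Report how many arguments were received.
count_args() {
    printf '%s\n' "$#"
}

title="a  quick   test"       # runs of spaces, no glob characters

count_args $title             # prints 3: split on whitespace
count_args "$title"           # prints 1: passed through intact

name="world"
printf '%s\n' "hello $name"   # prints: hello world
printf '%s\n' 'hello $name'   # prints: hello $name
```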

3. Escape/unescape functions

Creating functions to handle escaping and unescaping can standardize how you deal with special characters. Here’s a simple approach:

  • Escape function: Use printf '%q' to safely escape a string for use in Bash.
escape_string() {
    printf '%q' "$1"
}

Example:

string="hello * world & \"quote\""
escaped=$(escape_string "$string")
echo "$escaped"  # Outputs: hello\ \*\ world\ \&\ \"quote\"
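  • Unescape function: Reversing this is trickier since Bash doesn’t have a built-in unescape, but you can use eval cautiously. The argument is left unquoted inside the eval string so that the shell itself interprets the backslash escapes:
unescape_string() {
    # $1 must come from printf '%q'; eval re-parses it as shell syntax.
    eval "printf '%s' $1"
}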

Example:

escaped='hello\ \*\ world\ \&\ \"quote\"'  # single quotes keep every backslash
unescaped=$(unescape_string "$escaped")
echo "$unescaped"  # Outputs: hello * world & "quote"

Caution: eval can be dangerous with untrusted input—only use it if you control the escaped string’s source.
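A round-trip check is a quick way to gain confidence in an escape/unescape pair. The sketch below uses hypothetical names (quote_for_shell, unquote_from_shell) and assumes the quoted text always comes from printf '%q' within the same script. It relies on printf -v, which assigns to a variable directly, so embedded newlines are not affected by command substitution.

```bash
#!/usr/bin/env bash
# Store the shell-quoted form of $2 in the variable named by $1.
quote_for_shell() {
    printf -v "$1" '%q' "$2"
}

# Store the unquoted form of $2 in the variable named by $1.
# Only safe when $2 was produced by printf '%q'.
unquote_from_shell() {
    eval "$1=$2"
}

original=$'tab\there, "quotes", $HOME, *\nsecond line'
quote_for_shell quoted "$original"
unquote_from_shell restored "$quoted"

if [[ $restored == "$original" ]]; then
    echo "round trip OK"
else
    echo "round trip FAILED"
fi
```

Assigning through eval rather than printing keeps the restored value out of a subshell, at the cost of the same trust requirement on the input.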

4. Use arrays instead of strings

For complex strings with spaces or special characters, Bash arrays are a safer alternative to plain strings:

array=("hello * world" "another & string")
echo "${array[0]}"  # Outputs: hello * world

Pass arrays to commands like this:

printf '%s\n' "${array[@]}"
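Arrays also make it easy to build a list incrementally and iterate over it without losing element boundaries. The following is an illustrative sketch with made-up file names; it disables globbing for one line only so that the unquoted case can be shown with a predictable result.

```bash
#!/usr/bin/env bash
# File names with spaces, a glob character and non-ASCII text.
files=("my report.txt" "notes *.md")
files+=("café menu.pdf")          # append one more element

echo "${#files[@]}"               # prints 3

# Quoted [@] expands to exactly one word per element.
for f in "${files[@]}"; do
    printf '<%s>\n' "$f"
done

# Unquoted expansion splits on spaces (and would expand globs): avoid it.
set -f                            # disable globbing just for this demo
split=( ${files[@]} )
set +f
echo "${#split[@]}"               # prints 6
```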

5. Adopt a convention

A consistent convention can prevent a lot of pain. Here’s a practical one:

  • Convention: Always double-quote assignments and expansions, or use arrays, and escape special characters only when the string will be re-parsed by a shell (eval, bash -c, ssh).
  • Rules:
    1. Use "$var" for all variable expansions.
    2. Use arrays ("${array[@]}") when dealing with lists or strings that might contain spaces/newlines.
    3. Escape dynamically generated strings with printf '%q' only when they’ll be embedded in a command string that a shell parses again; arguments passed through an array need no escaping.
    4. Avoid relying on word splitting or globbing—explicitly control it when needed.

Example:

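text="hello * world"
command=(echo "$text")              # array elements need no escaping
"${command[@]}"                     # Outputs: hello * world
safe_text=$(escape_string "$text")  # escape only for a re-parsed string
eval "echo $safe_text"              # Outputs: hello * world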
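The "escape only when executing" idea can be illustrated with bash -c, where a command string is parsed by a second shell. This is a sketch under the assumption that bash is on the PATH; it contrasts passing data as a positional parameter, which needs no escaping, with embedding it in the command string, which does.

```bash
#!/usr/bin/env bash
text='hello * world & "quote"'

# Option 1: pass the data as a positional parameter; no escaping needed.
# The underscore fills $0, so the text arrives as $1.
bash -c 'printf "%s\n" "$1"' _ "$text"

# Option 2: embed the data in the command string; escape it first.
printf -v safe_text '%q' "$text"
bash -c "printf '%s\n' $safe_text"
```

Both commands print the original text unchanged. Where the first form is available it is usually the simpler choice, since nothing has to be re-parsed.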

6. Leverage external tools

Bash isn’t great at string manipulation—tools like sed, awk, or tr are often more reliable:

  • Escape special characters: printf '%s\n' "$string" | sed 's/[*&]/\\&/g'
  • Strip non-ASCII and control characters: printf '%s\n' "$string" | LC_ALL=C tr -cd '[:print:]\n'
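Wrapped in functions, the two pipelines above might look like the sketch below. The names escape_meta and strip_non_ascii are hypothetical; printf is used instead of echo so that strings starting with a dash or containing backslashes are passed through as-is, and LC_ALL=C is an added assumption that makes tr operate on single bytes.

```bash
#!/usr/bin/env bash
# Prefix * and & with a backslash.
escape_meta() {
    printf '%s\n' "$1" | sed 's/[*&]/\\&/g'
}

# Keep only printable ASCII and newlines; LC_ALL=C makes tr work on bytes.
strip_non_ascii() {
    printf '%s\n' "$1" | LC_ALL=C tr -cd '[:print:]\n'
}

escape_meta 'a*b&c'             # prints: a\*b\&c
strip_non_ascii $'caf\xc3\xa9'  # prints: caf
```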

7. Test with edge cases

Always test your scripts with strings containing spaces, newlines, quotes, and non-ASCII characters (e.g., café, π, or emoji). This helps catch issues early.
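A small harness can make this kind of testing routine. The sketch below is only an illustration: add_suffix_quoted and add_suffix_naive are hypothetical functions under test, and run_edge_cases is a hypothetical helper that feeds each edge case to a function and compares the result with the expected string. It assumes the default IFS.

```bash
#!/usr/bin/env bash
# Two versions of a hypothetical function that appends ".bak" to its argument.
add_suffix_quoted() { printf '%s.bak\n' "$1"; }
add_suffix_naive()  { echo $1.bak; }      # unquoted on purpose

# Strings that tend to expose quoting bugs.
edge_cases=(
    "two  spaces"
    $'line1\nline2'
    'it'\''s "quoted"'
    $'caf\xc3\xa9 \xcf\x80'               # café π as UTF-8 bytes
)

# Run the function named by $1 over every case; return 1 on any mismatch.
run_edge_cases() {
    local fn=$1 input failures=0
    for input in "${edge_cases[@]}"; do
        if [[ "$("$fn" "$input")" != "$input.bak" ]]; then
            failures=$((failures + 1))
        fi
    done
    echo "$fn: $failures failure(s)"
    (( failures == 0 ))
}

run_edge_cases add_suffix_quoted          # reports 0 failure(s)
run_edge_cases add_suffix_naive || echo "the naive version needs quoting"
```

The naive version fails because the unquoted expansion collapses the double space and the newline into single spaces.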

What’s the best strategy?

  • Short term: Use quoting religiously ("$var") and printf '%q' for escaping when needed. This is quick and works in most cases.
  • Long term: Switch to arrays for anything complex, and rely on escape/unescape functions for dynamic strings. Pair this with a UTF-8 locale.
  • Convention: Stick to “quote everything, escape only when executing” as a default. It’s simple and minimizes surprises.

For day-to-day Bash programming, combining proper quoting, arrays, and occasional escaping with printf '%q' will handle most of the oddities without overcomplicating things. If your scripts grow too complex, consider a language like Python, which has better string handling built in. Bash is powerful, but it’s not a string maestro!

By Pablo Machón | Computer Science, IT, Libre Software | Written with a little help from my friends

Copyright 2025 Pablo Machón | published under the GFDL License | You may use, copy, modify and redistribute this page as long as you cite the author. Please read the license