Text processing is so fundamental to programming that most developers take it for granted — until they encounter a bug involving character encoding, Unicode normalization, emoji handling, or locale-sensitive string operations. These bugs are notoriously difficult to diagnose because they often produce correct-looking output in the developer's locale while silently corrupting data for users in other regions or languages.
Character Encoding: ASCII, Latin-1, and UTF-8
ASCII (1963) maps 128 characters (English letters, digits, punctuation, and control characters) to the values 0-127. It was designed for English-language teletype machines and has no support for accented characters, non-Latin scripts, or symbols beyond basic punctuation. Despite being over 60 years old, ASCII remains the foundation of the encodings discussed here: Latin-1, UTF-8, and Unicode itself all keep its 128 assignments unchanged.
Latin-1 (ISO 8859-1) extends ASCII to 8 bits, using the values 160-255 for Western European accented characters (é, ñ, ü, ß) and common symbols (128-159 are reserved for control codes). It served the Western web well but is completely inadequate for Chinese, Japanese, Korean, Arabic, Hindi, or any non-Western script.
UTF-8 (1993) solved the problem definitively. It is a variable-length encoding of Unicode that uses 1 byte for ASCII characters (making it backward-compatible), 2 bytes for most accented Latin letters and for Greek, Cyrillic, Hebrew, and Arabic, 3 bytes for the rest of the Basic Multilingual Plane (including most CJK characters), and 4 bytes for everything above U+FFFF, such as most emoji and rarer symbols. UTF-8 is now used by over 98% of websites and is the default encoding in HTML5, JSON, and most modern programming languages.
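The byte counts above are easy to check yourself. The sketch below assumes an environment that provides the standard TextEncoder (modern browsers and Node.js), which always encodes to UTF-8; utf8Length is a helper name made up for this example.

```javascript
// TextEncoder always produces UTF-8, so the length of the resulting
// Uint8Array is the number of bytes the string occupies in UTF-8.
const encoder = new TextEncoder();
const utf8Length = (s) => encoder.encode(s).length;

console.log(utf8Length('A'));         // 1 (ASCII)
console.log(utf8Length('\u00E9'));    // 2 (é, accented Latin)
console.log(utf8Length('\u044F'));    // 2 (я, Cyrillic)
console.log(utf8Length('\u4E2D'));    // 3 (中, CJK)
console.log(utf8Length('\u{1F602}')); // 4 (😂, emoji)
```

The characters are written as escapes so the result does not depend on how this page itself was saved.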
If you are starting a new project in 2026 and not using UTF-8 everywhere — in your source files, database, API responses, and HTML — you are creating future bugs. UTF-8 is the only encoding you should ever need.
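To see what a mismatch looks like in practice, here is a minimal sketch of the classic "mojibake" bug: text written as UTF-8 but read back as a Western European single-byte encoding. It assumes a runtime whose TextDecoder supports the 'windows-1252' label (browsers and standard Node.js builds do); the variable names are illustrative only.

```javascript
// Encode "café" as UTF-8: the é becomes the two bytes C3 A9.
const utf8Bytes = new TextEncoder().encode('caf\u00E9');

// Decoding those bytes with the wrong encoding turns each byte
// into its own character instead of reading C3 A9 as one é.
const misread = new TextDecoder('windows-1252').decode(utf8Bytes);
const correct = new TextDecoder('utf-8').decode(utf8Bytes);

console.log(utf8Bytes.length); // 5
console.log(misread);          // "cafÃ©"
console.log(correct);          // "café"
```

Pure ASCII text survives the mistake untouched, which is why this kind of bug tends to stay hidden until the first accented name arrives.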
Unicode: One Standard to Represent Every Character
Unicode is not an encoding. It is a catalog that assigns a unique number (a code point) to each character of nearly every writing system in use, plus mathematical symbols, musical notation, and emoji. Unicode 15 defines over 149,000 characters. A code point is written as U+XXXX, where XXXX is a hexadecimal number of at least four digits. The letter A is U+0041. The euro sign is U+20AC. The 'face with tears of joy' emoji is U+1F602.
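The U+XXXX notation maps directly onto two standard JavaScript methods, codePointAt() and String.fromCodePoint(). A small sketch, where toCodePointLabel is a helper name invented for this example:

```javascript
// Format the first code point of a string in U+XXXX notation.
// codePointAt(0) reads a full code point, even one stored as a surrogate pair.
const toCodePointLabel = (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');

console.log(toCodePointLabel('A'));         // "U+0041"
console.log(toCodePointLabel('\u20AC'));    // "U+20AC" (euro sign)
console.log(toCodePointLabel('\u{1F602}')); // "U+1F602" (face with tears of joy)

// The reverse direction: build a string from a code point number.
console.log(String.fromCodePoint(0x20AC));  // "€"
```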
UTF-8, UTF-16, and UTF-32 are different ways to encode Unicode code points as bytes. UTF-8 is dominant on the web. UTF-16 is used internally by JavaScript, Java, and Windows. UTF-32 uses a fixed 4 bytes per code point (simplest but most wasteful). The distinction matters when calculating string lengths: JavaScript's string.length returns the number of UTF-16 code units, not the number of visible characters. The family emoji '👨‍👩‍👧‍👦' is a single visible character, but it is built from four emoji joined by three invisible zero-width joiners (U+200D), so its .length in JavaScript is 11.
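The same string therefore has a different "size" depending on what you count. The sketch below builds the family emoji from explicit escapes so the invisible joiners cannot get lost in copy and paste; the helper names are made up for illustration.

```javascript
// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

const countCodeUnits = (s) => s.length;        // UTF-16 code units
const countCodePoints = (s) => [...s].length;  // string iteration walks code points
const countUtf8Bytes = (s) => new TextEncoder().encode(s).length;

console.log(countCodePoints(family));     // 7  (4 emoji + 3 joiners)
console.log(countCodeUnits(family));      // 11 (4 surrogate pairs + 3 joiners)
console.log(countUtf8Bytes(family));      // 25 (4 × 4 bytes + 3 × 3 bytes)
console.log(countCodeUnits(family) * 2);  // 22 bytes in UTF-16
console.log(countCodePoints(family) * 4); // 28 bytes in UTF-32
```

None of these numbers is 1, which is what the reader actually sees.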
Case Conversion: It Is Not Just toUpperCase()
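Case conversion seems trivial until you encounter locale-specific rules. In Turkish, the uppercase of 'i' is 'İ' (dotted capital I), not 'I', and the lowercase of 'I' is 'ı' (dotless lowercase i), not 'i'. German has 'ß' (sharp s), whose uppercase is 'SS', a one-character-to-two-character transformation. In JavaScript, toUpperCase() and toLowerCase() always apply the locale-independent Unicode default mappings, so they get Turkish wrong; toLocaleUpperCase() and toLocaleLowerCase() apply the rules of the locale you pass (or the host's default locale if you pass none). Other languages differ: Java's no-argument toUpperCase(), for example, uses the JVM's default locale, so the same code can produce different results on different machines.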
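Two consequences are worth seeing in code: case conversion can change a string's length, and it is not always reversible. This is a small sketch using only standard String methods; the Turkish line assumes a runtime with normal Intl support, and the variable names are illustrative.

```javascript
const street = 'stra\u00DFe';            // "straße"
const shouted = street.toUpperCase();    // ß expands to SS

console.log(shouted);                    // "STRASSE"
console.log(street.length);              // 6
console.log(shouted.length);             // 7
console.log(shouted.toLowerCase());      // "strasse" (the ß does not come back)

// Default mapping vs. Turkish mapping for the same letter.
console.log('I'.toLowerCase());            // "i"
console.log('I'.toLocaleLowerCase('tr'));  // "ı" (dotless i, U+0131)
```

If you preallocate buffers or validate lengths before uppercasing, the ß case is the one that tends to break the assumption.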
// Locale-sensitive case conversion in JavaScript
'istanbul'.toLocaleUpperCase('tr'); // "İSTANBUL" (Turkish)
'istanbul'.toLocaleUpperCase('en'); // "ISTANBUL" (English)
// Case-insensitive comparison — use localeCompare
'café'.localeCompare('CAFÉ', undefined, { sensitivity: 'accent' });
// 0 (equal, ignoring case)
// Common case conversions for developers
const camelToSnake = (s) => s.replace(/[A-Z]/g, (c) => '_' + c.toLowerCase());
const snakeToCamel = (s) => s.replace(/_([a-z])/g, (_, c) => c.toUpperCase());
camelToSnake('backgroundColor'); // "background_color"
snakeToCamel('background_color'); // "backgroundColor"
URL Slugs and Text Normalization
Creating a URL slug from arbitrary text requires several normalization steps: Unicode NFKD decomposition (separating base characters from combining marks), stripping diacritical marks, converting to lowercase, replacing spaces and underscores with hyphens, removing all non-alphanumeric characters except hyphens, and collapsing consecutive hyphens. The result is a URL-safe, SEO-friendly string that preserves the essential meaning of the original text.
function slugify(text) {
  return text
    .normalize('NFKD')                // Decompose Unicode
    .replace(/[\u0300-\u036f]/g, '')  // Remove combining diacritical marks
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s_-]/g, '')    // Remove everything except letters, digits, whitespace, _ and -
    .replace(/[\s_]+/g, '-')          // Spaces/underscores → hyphens
    .replace(/-+/g, '-')              // Collapse multiple hyphens
    .replace(/^-|-$/g, '');           // Trim leading/trailing hyphens
}
slugify('Héllo Wörld! — A Test'); // "hello-world-a-test"
slugify('Ünit Cönversion Tööl'); // "unit-conversion-tool"
slugify('  multiple   spaces  '); // "multiple-spaces"
Counting Characters Correctly
What counts as one character? The answer depends on context. JavaScript's .length counts UTF-16 code units. A grapheme cluster (what a human perceives as a single character) can span multiple code points, and each code point takes one or two code units. The flag emoji '🇺🇸' is two code points (U+1F1FA U+1F1F8). The family emoji '👨‍👩‍👧‍👦' is seven code points: four person emoji joined by three zero-width joiners (U+200D). For user-facing character counts (like a tweet counter), you need to count grapheme clusters, not code units.
// JavaScript .length counts UTF-16 code units — often wrong
'Hello'.length; // 5 ✓
'café'.length; // 4 ✓
'👍'.length; // 2 ✗ (surrogate pair)
'🇺🇸'.length; // 4 ✗ (two surrogate pairs)
'👨\u200D👩\u200D👧\u200D👦'.length; // 11 ✗ (4 surrogate pairs + 3 zero-width joiners, written as \u200D so they stay visible)
// Intl.Segmenter counts grapheme clusters — correct for display
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...segmenter.segment('👨\u200D👩\u200D👧\u200D👦')].length; // 1 ✓
[...segmenter.segment('🇺🇸')].length; // 1 ✓
[...segmenter.segment('café')].length; // 4 ✓
Line Endings, Whitespace, and Invisible Characters
Windows uses CRLF (\r\n) line endings. Unix and macOS use LF (\n). This difference causes subtle bugs: a file with Windows line endings may show '^M' characters in a Unix terminal, fail regex matches anchored with $ (the stray \r sits between the text and the end of the line), or produce incorrect line counts. Git can convert line endings on checkout and commit via core.autocrlf or a .gitattributes file, and the end_of_line setting in .editorconfig standardizes what editors write across a team.
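When text arrives from an upload or a paste rather than from your repository, you have to normalize it yourself. A minimal sketch, assuming you want LF everywhere; normalizeLineEndings is a hypothetical helper, and treating a lone \r as a line break (the classic Mac OS convention) is a choice made here, not a requirement.

```javascript
// Convert CRLF and lone CR to LF.
const normalizeLineEndings = (text) => text.replace(/\r\n?/g, '\n');

const windowsText = 'first\r\nsecond\r\nthird';

// Splitting on \n alone leaves a trailing \r on each line...
const naiveLines = windowsText.split('\n');
console.log(naiveLines[0] === 'first\r');   // true
console.log(/first$/.test(naiveLines[0]));  // false: \r sits before the end

// ...while normalizing first gives clean lines.
const cleanLines = normalizeLineEndings(windowsText).split('\n');
console.log(cleanLines);                    // ["first", "second", "third"]
console.log(/first$/.test(cleanLines[0]));  // true
```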
Beyond visible whitespace, Unicode defines numerous invisible characters that can cause hard-to-debug issues: the zero-width space (U+200B), zero-width non-joiner (U+200C), byte order mark (U+FEFF), and various directional markers for right-to-left text. When processing user input, consider stripping or normalizing these characters to prevent them from polluting database records, breaking string comparisons, or creating visually identical but byte-different strings. Strip with care, though: zero-width joiners hold emoji sequences together, and zero-width non-joiners are part of correct spelling in scripts such as Persian, so removing them blindly from free-form text corrupts it.
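For identifier-like fields such as usernames or tags, a conservative filter might look like the sketch below. The character list is an assumption chosen for this example, not a standard: it removes the zero-width space, the byte order mark, and the bidirectional control characters, and deliberately leaves the zero-width joiner and non-joiner alone. stripInvisible is a hypothetical helper name.

```javascript
// U+200B zero-width space, U+200E/U+200F direction marks,
// U+202A-U+202E embeddings/overrides, U+2066-U+2069 isolates, U+FEFF BOM.
// ZWJ (U+200D) and ZWNJ (U+200C) are intentionally NOT in this list.
const INVISIBLE_CHARS = /[\u200B\u200E\u200F\u202A-\u202E\u2066-\u2069\uFEFF]/g;
const stripInvisible = (s) => s.replace(INVISIBLE_CHARS, '');

const pasted = '\uFEFFadmin\u200B';         // looks exactly like "admin" on screen

console.log(pasted === 'admin');                 // false
console.log(pasted.length);                      // 7
console.log(stripInvisible(pasted) === 'admin'); // true
```

For free-form text in right-to-left languages, removing direction marks can change how mixed-direction text displays, so a filter like this is better suited to system boundaries for short identifiers than to message bodies.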
Practical Guidelines
- Always use UTF-8 — for source files, databases, HTML, APIs, and file I/O
- Use Intl.Segmenter or equivalent for user-facing character counts, not .length
- Use locale-aware comparison (localeCompare) for user-visible sorting and search
- Normalize Unicode (NFC or NFKC) before storing or comparing text to prevent equivalent-but-different byte sequences
- Standardize line endings with .editorconfig or git settings to prevent cross-platform bugs
- Sanitize invisible Unicode characters from user input at system boundaries
- When generating URL slugs, always decompose Unicode with NFKD before stripping diacritics
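Two of the guidelines above, normalizing before comparison and using locale-aware sorting, can be sketched with standard built-ins (String.prototype.normalize and Intl.Collator). The sample strings and variable names are made up for illustration.

```javascript
// The same visible "é" stored two different ways.
const composed = '\u00E9';     // one code point
const decomposed = 'e\u0301';  // e + combining acute accent

console.log(composed === decomposed);                                     // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC'));   // true

// NFKC additionally folds compatibility characters such as the "fi" ligature.
console.log('\uFB01'.normalize('NFKC'));  // "fi"

// Default sort compares code units; a collator follows language rules.
const words = ['zebra', '\u00E4rger', 'apfel'];
const byCodeUnit = [...words].sort();
const byLocale = [...words].sort(new Intl.Collator('de').compare);

console.log(byCodeUnit); // ["apfel", "zebra", "ärger"]
console.log(byLocale);   // ["apfel", "ärger", "zebra"]
```

NFC is usually the safer default for stored text because it preserves distinctions that NFKC erases; NFKC fits better for search keys and identifiers.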