Text Encoding: ASCII, Unicode, UTF-8
Written by Magnus Silverstream
November 10, 2025
8 min read
Every time you type a character, your computer transforms it into numbers. But which numbers? That depends on the character encoding. From the garbled text of encoding mismatches to the emoji that won't display correctly, encoding issues plague developers everywhere. This guide explains how text encoding works, from the early days of ASCII to modern Unicode, and helps you avoid the most common pitfalls.
What is character encoding?
Character encoding is a system that maps characters to numbers (and vice versa) so computers can store and transmit text.
The basic concept:
• Every character needs a unique number (code point)
• That number needs to be stored as bytes
• Different systems may use different mappings
Why encoding matters:
• Text files don't contain letters - they contain numbers
• Without knowing the encoding, numbers are meaningless
• Wrong encoding = garbled text (mojibake)
Common encoding scenarios:
• Web pages declaring charset
• Database column character sets
• File encodings in editors
• API response encodings
• Email content encoding
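These mappings are easy to see in practice. A minimal sketch in Python (chosen here for illustration; the same idea applies in any language): the encoding determines which bytes a character becomes, and decoding with the wrong one garbles the text.

```python
text = "café"

# Encoding: characters -> bytes. The encoding determines which bytes.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9' (5 bytes: é takes 2)
latin1_bytes = text.encode("latin-1")  # b'caf\xe9' (4 bytes: é takes 1)

# Decoding with the wrong encoding produces mojibake.
print(utf8_bytes.decode("latin-1"))    # 'cafÃ©' -- garbled
print(latin1_bytes.decode("latin-1"))  # 'café'  -- correct
```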
ASCII: where it all began
ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters.
ASCII includes:
• 0-31: Control characters (newline, tab, etc.)
• 32-126: Printable characters
• 127: Delete
Key ASCII values:
• 'A' = 65, 'Z' = 90
• 'a' = 97, 'z' = 122
• '0' = 48, '9' = 57
• Space = 32
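These values can be checked with Python's built-in ord() and chr() (a quick illustration, not specific to any one language's API):

```python
# ord() returns a character's code point; for ASCII it matches the table above.
assert ord("A") == 65 and ord("Z") == 90
assert ord("a") == 97 and ord("z") == 122
assert ord("0") == 48 and ord("9") == 57
assert ord(" ") == 32

# chr() is the inverse mapping: number -> character.
print(chr(65))  # 'A'

# The uppercase/lowercase offset is exactly 32 -- a single bit.
print(chr(ord("A") + 32))  # 'a'
```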
Limitations:
• Only 128 characters total
• English-centric (no accents, no other scripts)
• Led to many incompatible extensions (Latin-1, etc.)
ASCII's legacy:
• Still the foundation of modern encodings
• First 128 Unicode characters = ASCII
• URL encoding, email headers still rely on ASCII
Unicode: the universal solution
Unicode aims to assign a unique number to every character in every language.
Unicode basics:
• Code points written as U+XXXX (hex)
• Currently over 140,000 characters defined
• Includes all modern languages, historical scripts, symbols, emoji
Code point examples:
• 'A' = U+0041
• 'é' = U+00E9
• '中' = U+4E2D
• '😀' = U+1F600
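A small Python sketch (the code_point helper is defined here purely for illustration) shows how to print any character's code point in the U+XXXX notation used above:

```python
def code_point(ch: str) -> str:
    """Format a character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))   # U+0041
print(code_point("é"))   # U+00E9
print(code_point("中"))  # U+4E2D
print(code_point("😀"))  # U+1F600 (outside the BMP, so 5 hex digits)
```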
Unicode planes:
• Plane 0 (BMP): U+0000 to U+FFFF - most common characters
• Plane 1: U+10000 to U+1FFFF - emoji, historic scripts
• Planes 2-16: rare and specialized characters
Important distinction:
Unicode defines what number each character gets.
UTF-8, UTF-16, UTF-32 define how to store those numbers as bytes.
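The distinction is visible in code. In this Python sketch, one string is a single sequence of code points, but the three encodings store those same numbers as different byte sequences of different lengths:

```python
s = "héllo"

# One sequence of code points...
print([f"U+{ord(c):04X}" for c in s])

# ...three different byte representations of those same numbers.
print(len(s.encode("utf-8")))      # 6 bytes  (é takes 2)
print(len(s.encode("utf-16-le")))  # 10 bytes (2 per BMP character)
print(len(s.encode("utf-32-le")))  # 20 bytes (4 per character)
```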
UTF-8: the web's encoding
UTF-8 is the dominant encoding for the web, used by over 98% of websites.
How UTF-8 works:
• Variable-length: 1-4 bytes per character
• ASCII characters: 1 byte (backward compatible!)
• Most accented Latin, Greek, and Cyrillic characters: 2 bytes
• Most CJK (Chinese, Japanese, Korean) characters: 3 bytes
• Emoji: 4 bytes
UTF-8 byte patterns:
• 0xxxxxxx: 1-byte (ASCII)
• 110xxxxx 10xxxxxx: 2-byte
• 1110xxxx 10xxxxxx 10xxxxxx: 3-byte
• 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte
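These patterns can be verified by dumping the bits of each encoded character (a Python sketch for illustration; note the leading bits of the first byte announce the sequence length, and every continuation byte starts with 10):

```python
# Byte length grows with the code point, matching the patterns above.
for ch in ("A", "é", "中", "😀"):
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"{ch!r}: {len(encoded)} byte(s): {bits}")
# 'A': 1 byte(s): 01000001
# 'é': 2 byte(s): 11000011 10101001
# '中': 3 byte(s): 11100100 10111000 10101101
# '😀': 4 byte(s): 11110000 10011111 10011000 10000000
```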
Why UTF-8 won:
• Backward compatible with ASCII
• No byte-order issues (no BOM needed)
• Self-synchronizing (easy to find character boundaries)
• Efficient for English text
• Works with existing systems that expect ASCII
Best practice: Always use UTF-8 unless you have a specific reason not to.
Other encodings you'll encounter
UTF-16:
• 2 or 4 bytes per character
• Used internally by JavaScript, Java, Windows
• Has byte-order issues (BOM: U+FEFF)
• Efficient for Asian languages
UTF-32:
• Fixed 4 bytes per character
• Simple but wasteful
• Rarely used in practice
Latin-1 (ISO-8859-1):
• 8-bit encoding, 256 characters
• ASCII + Western European characters
• Common in legacy systems
Windows-1252:
• Microsoft's Latin-1 variant
• Adds smart quotes, euro sign
• Often mislabeled as Latin-1
Shift-JIS, EUC, GB2312:
• Legacy Asian encodings
• Still found in older systems
• Being replaced by UTF-8
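A rough comparison of storage costs across the UTF encodings (a Python sketch with sample strings chosen for illustration; little-endian variants avoid a BOM):

```python
samples = {"English": "hello", "CJK": "中文字", "Emoji": "😀😀"}

for name, s in samples.items():
    sizes = [len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")]
    print(name, sizes)
# English [5, 10, 20]
# CJK [9, 6, 12]  -- UTF-16 beats UTF-8 for BMP CJK text
# Emoji [8, 8, 8] -- every encoding needs 4 bytes per emoji
```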
Common encoding problems and solutions
Problem: Mojibake (garbled text)
Example: 'café' becomes 'cafÃ©'
Cause: UTF-8 interpreted as Latin-1
Solution: Ensure consistent encoding declaration
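When the damage is a single wrong decode, it can often be reversed. A Python sketch of this repair (it only works if the mis-decoding was lossless, which it is for Latin-1 since every byte maps to a character):

```python
# Mojibake: UTF-8 bytes mistakenly decoded as Latin-1.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # 'cafÃ©'

# Re-encode as Latin-1 to recover the original bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # 'café'
```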
Problem: Question marks or boxes
Example: '中文' becomes '??'
Cause: Characters not in the encoding
Solution: Use Unicode/UTF-8
Problem: BOM issues
Example: Invisible character at file start
Cause: UTF-8 with BOM in systems expecting ASCII
Solution: Use UTF-8 without BOM
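Python's 'utf-8-sig' codec handles the BOM transparently, which serves as a sketch of one fix for data that may or may not start with a BOM:

```python
# A UTF-8 BOM is the byte sequence EF BB BF at the start of the data.
data = b"\xef\xbb\xbfhello"

print(data.decode("utf-8"))      # '\ufeffhello' -- BOM survives as U+FEFF
print(data.decode("utf-8-sig"))  # 'hello'       -- 'utf-8-sig' strips the BOM
```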
Best practices:
• Declare encoding explicitly (charset=utf-8)
• Use UTF-8 everywhere
• Validate input encoding
• Test with non-ASCII characters
• Store in databases with utf8mb4 (for full emoji support)
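For example, when reading or writing files in Python, pass the encoding explicitly rather than relying on the platform default (the file name below is hypothetical; a temp directory keeps the sketch self-contained):

```python
import os
import tempfile

# Always pass encoding= explicitly: the platform default is
# locale-dependent and may not be UTF-8.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
original = "café 中文 😀"

with open(path, "w", encoding="utf-8") as f:
    f.write(original)

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(restored == original)  # True
```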
Danger zones:
• Copy-pasting from Word (smart quotes)
• Old email systems
• Legacy database migrations
• File paths with special characters
Conclusion
Character encoding is the bridge between human-readable text and computer-storable bytes. While ASCII served its purpose for English text, Unicode and UTF-8 are essential for our multilingual, emoji-filled modern world. The key takeaway: use UTF-8 everywhere, declare your encoding explicitly, and test with international characters. Use our text encoding tool to explore how different characters are represented in various encodings and debug encoding issues in your data.
Frequently Asked Questions
What is the difference between Unicode and UTF-8?
Unicode is a standard that assigns numbers (code points) to characters. UTF-8 is an encoding that specifies how to store those numbers as bytes. Unicode tells us 'A' is code point 65 (U+0041); UTF-8 tells us to store 65 as the single byte 0x41.