
Text Encoding: ASCII, Unicode, UTF-8

Written by Magnus Silverstream
November 10, 2025
8 min read

Every time you type a character, your computer transforms it into numbers. But which numbers? That depends on the character encoding. From the garbled text of encoding mismatches to the emoji that won't display correctly, encoding issues plague developers everywhere. This guide explains how text encoding works, from the early days of ASCII to modern Unicode, and helps you avoid the most common pitfalls.

What is character encoding?

Character encoding is a system that maps characters to numbers (and vice versa) so computers can store and transmit text.

The basic concept:
• Every character needs a unique number (code point)
• That number must be stored as bytes
• Different systems may use different mappings

Why encoding matters:
• Text files don't contain letters; they contain numbers
• Without knowing the encoding, those numbers are meaningless
• Wrong encoding = garbled text (mojibake)

Common encoding scenarios:
• Web pages declaring a charset
• Database column character sets
• File encodings in editors
• API response encodings
• Email content encoding
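The mapping above is easy to see in Python, whose `str.encode` and `bytes.decode` methods convert between characters and bytes. A minimal sketch:

```python
text = "Hi!"

# Step 1: every character has a unique number (code point)
for ch in text:
    print(ch, ord(ch))  # H 72, i 105, ! 33

# Step 2: those numbers are stored as bytes under a chosen encoding
data = text.encode("utf-8")
print(data)  # b'Hi!'

# Step 3: decoding with the same encoding recovers the original text
print(data.decode("utf-8"))  # Hi!
```

Decoding the same bytes with a different encoding can yield different text, which is exactly why the encoding must be known.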

ASCII: where it all began

ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters.

ASCII includes:
• 0-31: control characters (newline, tab, etc.)
• 32-126: printable characters
• 127: delete

Key ASCII values:
• 'A' = 65, 'Z' = 90
• 'a' = 97, 'z' = 122
• '0' = 48, '9' = 57
• Space = 32

Limitations:
• Only 128 characters total
• English-centric (no accented letters, no other scripts)
• Led to many incompatible 8-bit extensions (Latin-1, etc.)

ASCII's legacy:
• Still the foundation of modern encodings
• The first 128 Unicode code points match ASCII exactly
• URL encoding and email headers still rely on ASCII
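The values in the table above can be checked directly with Python's built-in `ord` and `chr`:

```python
# Uppercase, lowercase, and digit ranges from the ASCII table
assert ord("A") == 65 and ord("Z") == 90
assert ord("a") == 97 and ord("z") == 122
assert ord("0") == 48 and ord("9") == 57
assert ord(" ") == 32

# chr() goes the other way: number -> character
print(chr(65), chr(97), chr(48))  # A a 0

# Lowercase letters sit exactly 32 above their uppercase pair,
# which is why old code "lowercased" by setting a single bit
assert ord("a") - ord("A") == 32
```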

Unicode: the universal solution

Unicode aims to assign a unique number to every character in every language.

Unicode basics:
• Code points are written as U+XXXX (hexadecimal)
• Over 140,000 characters are currently defined
• Includes all modern languages, historical scripts, symbols, and emoji

Code point examples:
• 'A' = U+0041
• 'é' = U+00E9
• '中' = U+4E2D
• '😀' = U+1F600

Unicode planes:
• Plane 0 (BMP): U+0000 to U+FFFF - most common characters
• Plane 1: U+10000 to U+1FFFF - emoji, historic scripts
• Planes 2-16: rare and specialized characters

Important distinction: Unicode defines what number each character gets. UTF-8, UTF-16, and UTF-32 define how to store those numbers as bytes.
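The code point examples above can be reproduced by formatting `ord()` output in the standard U+XXXX notation:

```python
# Print each character's Unicode code point in U+XXXX form
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch} = U+{ord(ch):04X}")
# A = U+0041
# é = U+00E9
# 中 = U+4E2D
# 😀 = U+1F600

# The emoji's code point is above U+FFFF, so it lives outside
# the BMP (Plane 0) in Plane 1
assert ord("😀") > 0xFFFF
```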

UTF-8: the web's encoding

UTF-8 is the dominant encoding on the web, used by over 98% of websites.

How UTF-8 works:
• Variable-length: 1-4 bytes per character
• ASCII characters: 1 byte (backward compatible!)
• Most European characters: 2 bytes
• Most Asian characters: 3 bytes
• Emoji: 4 bytes

UTF-8 byte patterns:
• 0xxxxxxx: 1-byte (ASCII)
• 110xxxxx 10xxxxxx: 2-byte
• 1110xxxx 10xxxxxx 10xxxxxx: 3-byte
• 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte

Why UTF-8 won:
• Backward compatible with ASCII
• No byte-order issues (no BOM needed)
• Self-synchronizing (easy to find character boundaries)
• Efficient for English text
• Works with existing systems that expect ASCII

Best practice: always use UTF-8 unless you have a specific reason not to.
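Both the variable lengths and the byte patterns above are easy to observe by encoding one character of each kind and printing the bytes in binary:

```python
# Show the UTF-8 byte count and bit pattern for each character
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"{ch}: {len(encoded)} byte(s)  {bits}")
# A: 1 byte(s)  01000001
# é: 2 byte(s)  11000011 10101001
# 中: 3 byte(s)  11100100 10111000 10101101
# 😀: 4 byte(s)  11110000 10011111 10011000 10000000
```

Note how every lead byte starts with the length-marker bits from the table (0, 110, 1110, 11110) and every continuation byte starts with 10, which is what makes UTF-8 self-synchronizing.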

Other encodings you'll encounter

UTF-16:
• 2 or 4 bytes per character
• Used internally by JavaScript, Java, and Windows
• Has byte-order issues (BOM: U+FEFF)
• Efficient for Asian-language text

UTF-32:
• Fixed 4 bytes per character
• Simple but wasteful
• Rarely used in practice

Latin-1 (ISO-8859-1):
• 8-bit encoding, 256 characters
• ASCII plus Western European characters
• Common in legacy systems

Windows-1252:
• Microsoft's Latin-1 variant
• Adds smart quotes and the euro sign
• Often mislabeled as Latin-1

Shift-JIS, EUC, GB2312:
• Legacy Asian encodings
• Still found in older systems
• Being replaced by UTF-8
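A quick way to compare these encodings is to encode the same string with each and look at the resulting bytes. A small sketch using Python's codec names:

```python
s = "café"
print(s.encode("utf-8"))    # b'caf\xc3\xa9'  - 'é' takes 2 bytes
print(s.encode("latin-1"))  # b'caf\xe9'      - 'é' fits in 1 byte
print(s.encode("utf-32"))   # 4 bytes per character, plus a BOM

# Python's generic "utf-16" codec prepends a BOM (U+FEFF) so the
# decoder can tell the byte order
bom = s.encode("utf-16")[:2]
print(bom)  # b'\xff\xfe' or b'\xfe\xff' depending on platform

# Latin-1 has only 256 slots, so anything outside them fails
try:
    "中".encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 can't encode:", exc)
```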

Common encoding problems and solutions

Problem: mojibake (garbled text)
Example: 'café' becomes 'cafÃ©'
Cause: UTF-8 bytes interpreted as Latin-1
Solution: ensure a consistent encoding declaration end to end

Problem: question marks or boxes
Example: '中文' becomes '??'
Cause: the characters don't exist in the target encoding
Solution: use Unicode/UTF-8

Problem: BOM issues
Example: invisible character at the start of a file
Cause: UTF-8 with a BOM fed to systems expecting plain ASCII
Solution: save as UTF-8 without a BOM

Best practices:
• Declare the encoding explicitly (charset=utf-8)
• Use UTF-8 everywhere
• Validate input encoding
• Test with non-ASCII characters
• Store database text as utf8mb4 (for full emoji support)

Danger zones:
• Copy-pasting from Word (smart quotes)
• Old email systems
• Legacy database migrations
• File paths with special characters
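The mojibake case above can be reproduced, and often repaired, in a few lines. When UTF-8 bytes are wrongly decoded as Latin-1, re-encoding as Latin-1 and decoding as UTF-8 recovers the original (this works because Latin-1 maps every byte to a character, so no information is lost):

```python
correct = "café"
raw = correct.encode("utf-8")    # b'caf\xc3\xa9'

# Wrong decoder: each UTF-8 byte becomes its own Latin-1 character
garbled = raw.decode("latin-1")
print(garbled)  # cafÃ©

# Reverse the mistake: back to the original bytes, then decode correctly
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # café
assert fixed == correct
```

This round trip is a debugging trick, not a fix: the real solution is to declare and use the same encoding on both ends.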

Conclusion

Character encoding is the bridge between human-readable text and computer-storable bytes. While ASCII served its purpose for English text, Unicode and UTF-8 are essential for our multilingual, emoji-filled modern world. The key takeaway: use UTF-8 everywhere, declare your encoding explicitly, and test with international characters. Use our text encoding tool to explore how different characters are represented in various encodings and debug encoding issues in your data.

Frequently Asked Questions

What's the difference between Unicode and UTF-8?

Unicode is a standard that assigns numbers (code points) to characters. UTF-8 is an encoding that specifies how to store those numbers as bytes. Unicode tells us 'A' is 65; UTF-8 tells us to store 65 as the single byte 0x41.
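The distinction fits in three lines of Python: `ord` gives the Unicode code point, `encode` gives the UTF-8 bytes.

```python
code_point = ord("A")             # Unicode: 'A' is code point 65 (U+0041)
encoded = "A".encode("utf-8")     # UTF-8: stored as the single byte 0x41
print(code_point, encoded)        # 65 b'A'
```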