Ever opened a file and seen é where there should be an é? Or gotten a database full of ??? where Japanese characters used to be? Welcome to the world of character encoding — the invisible layer that turns bytes into letters and letters into chaos when it breaks.
Character encoding is one of those things developers deal with constantly but rarely think about until something goes wrong. And when it goes wrong, it goes really wrong.
This guide breaks down the three encoding systems you'll encounter most: ASCII, UTF-8, and Unicode. You'll learn how they work, how they differ, and how to stop encoding bugs from ruining your day.
What Is Character Encoding?
Computers store everything as numbers. Character encoding is the system that maps those numbers to the characters you see on screen. The letter A isn't stored as a tiny picture of an A — it's stored as the number 65 (or 41 in hexadecimal).
Different encoding systems use different mappings and different amounts of storage per character. That's where things get interesting — and where most encoding problems originate.
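You can see the number-to-character mapping directly in Python, where ord and chr convert between a character and its number:

```python
# The letter A is the number 65 under the hood
print(ord('A'))       # 65
print(hex(ord('A')))  # 0x41
print(chr(65))        # A
```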
What Is ASCII?
ASCII (American Standard Code for Information Interchange) is the grandfather of character encoding. Created in 1963, it maps 128 characters to the numbers 0–127 using 7 bits.
What ASCII Includes
| Range | Characters | Count |
|---|---|---|
| 0–31 | Control characters (newline, tab, null) | 32 |
| 32–126 | Printable characters (letters, digits, symbols) | 95 |
| 127 | DEL (delete) | 1 |
That gives you:
- 26 uppercase letters (A–Z): codes 65–90
- 26 lowercase letters (a–z): codes 97–122
- 10 digits (0–9): codes 48–57
- 33 symbols (!, @, #, etc.)
- 33 control characters (most are legacy)
You can explore the full mapping with a hex to ASCII converter — input 48 65 6C 6C 6F and you get Hello.
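The same conversion is a one-liner in Python (a quick sketch, not the converter's actual code):

```python
# bytes.fromhex ignores spaces, so the hex can be pasted as-is
data = bytes.fromhex('48 65 6C 6C 6F')
print(data.decode('ascii'))  # Hello
```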
ASCII's Limitation
ASCII only covers English. No accented characters (é, ñ, ü), no Chinese, no Arabic, no emoji. 128 characters seemed like enough in 1963 when the primary users were American engineers. It wasn't enough for the global internet.
Extended ASCII (codes 128–255) tried to fix this by using all 8 bits, but different systems mapped those extra 128 slots differently. Windows used Code Page 1252. Mac used Mac Roman. ISO created ISO 8859-1 (Latin-1). Same byte, different character — chaos.
What Is Unicode?
Unicode is the answer to the "too many incompatible encoding tables" problem. Instead of being an encoding itself, Unicode is a universal character catalog — a master list that assigns a unique number (called a code point) to every character in every writing system.
How Unicode Works
Each character gets a code point written as U+ followed by a hex number:
| Character | Code Point | Name |
|---|---|---|
| A | U+0041 | Latin Capital Letter A |
| é | U+00E9 | Latin Small Letter E with Acute |
| 你 | U+4F60 | CJK Unified Ideograph |
| 🔥 | U+1F525 | Fire Emoji |
| ∞ | U+221E | Infinity |
As of Unicode 15.1, there are 149,813 characters covering 161 scripts. Everything from Egyptian hieroglyphs to musical notation to emoji.
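In Python, ord gives you any character's code point and chr reverses it, which makes the table above easy to check:

```python
# Print each character with its Unicode code point
for ch in 'Aé你🔥∞':
    print(f'{ch}  U+{ord(ch):04X}')

assert ord('🔥') == 0x1F525
assert chr(0x221E) == '∞'
```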
Unicode Is Not an Encoding
This is the key distinction most people miss. Unicode defines which number maps to which character. It doesn't define how those numbers are stored as bytes. That's the job of encoding formats like UTF-8, UTF-16, and UTF-32.
Think of Unicode as the dictionary and UTF-8 as the handwriting style you use to write from that dictionary.
What Is UTF-8?
UTF-8 (Unicode Transformation Format – 8-bit) is the dominant encoding on the web. Over 98% of websites use it. It's the default for HTML5, JSON, YAML, TOML, and most modern programming languages.
How UTF-8 Works
UTF-8 is a variable-length encoding. It uses 1 to 4 bytes per character, depending on the code point:
| Code Point Range | Bytes | Bit Pattern | Example |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | A → 41 |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | é → C3 A9 |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 你 → E4 BD A0 |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🔥 → F0 9F 94 A5 |
The genius of UTF-8:
- Backward compatible with ASCII. Any valid ASCII text is also valid UTF-8, byte for byte.
- Self-synchronizing. You can jump into the middle of a byte stream and find the start of the next character by looking at the bit patterns.
- No byte-order issues. Unlike UTF-16, there's no endianness ambiguity.
- Space efficient for Latin text. English content uses exactly the same space as ASCII.
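A quick Python check confirms the byte counts from the table:

```python
# Each character's UTF-8 bytes and length, matching the table
for ch in 'Aé你🔥':
    b = ch.encode('utf-8')
    print(f'{ch}: {b.hex(" ")} ({len(b)} bytes)')
```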
UTF-8 Encoding Example: Step by Step
Let's encode the character é (U+00E9) into UTF-8:
1. Take the code point: 0x00E9 = 0000 0000 1110 1001 in binary
2. It falls in the range U+0080–U+07FF, so it needs 2 bytes
3. Use the template: 110xxxxx 10xxxxxx
4. Fill in the bits: 110 00011 10 101001
5. Result: 0xC3 0xA9
You can verify this with the hex to ASCII converter at hextoascii.co — input C3 A9 in UTF-8 mode and you'll see é.
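If you want to see the bit-twiddling itself, here is a minimal Python sketch of the 2-byte template applied to U+00E9:

```python
# Apply the template 110xxxxx 10xxxxxx to code point U+00E9
cp = 0x00E9
byte1 = 0b11000000 | (cp >> 6)        # marker bits + top 5 bits
byte2 = 0b10000000 | (cp & 0b111111)  # continuation marker + low 6 bits
print(hex(byte1), hex(byte2))         # 0xc3 0xa9
assert bytes([byte1, byte2]) == 'é'.encode('utf-8')
```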
ASCII vs UTF-8 vs Unicode: Key Differences
| Feature | ASCII | Unicode | UTF-8 |
|---|---|---|---|
| Type | Encoding + character set | Character set (catalog) | Encoding (for Unicode) |
| Characters | 128 | 149,813+ | All of Unicode |
| Bytes per character | 1 | N/A (it's not an encoding) | 1–4 |
| English text size | Baseline | — | Same as ASCII |
| CJK text size | Can't represent | — | 3 bytes per character |
| Emoji support | ❌ | ✅ | ✅ (4 bytes each) |
| ASCII compatible | — | — | ✅ |
| Web usage | Legacy | Standard | 98%+ of websites |
| Year created | 1963 | 1991 | 1993 |
When to Use What
- ASCII: Only when you're working with legacy systems that explicitly require it, or when you know your data is 100% English alphanumeric.
- UTF-8: Almost always. It's the default for the modern web, APIs, databases, and file formats. If in doubt, use UTF-8.
- UTF-16: Windows internals, Java strings, JavaScript strings (internally). You'll encounter it, but you rarely need to choose it.
- UTF-32: Almost never in practice. Fixed 4 bytes per character. Simple but wasteful.
Other Encoding Formats Worth Knowing
Latin-1 (ISO 8859-1)
An 8-bit encoding that extends ASCII with Western European characters (codes 128–255). Still found in older databases and HTTP headers. Covers French, German, Spanish, and Portuguese — but not much else.
Windows-1252
Microsoft's variant of Latin-1. Adds "smart quotes," em dashes, and the Euro sign (€) in the 128–159 range where Latin-1 has control characters. The most common source of encoding confusion on the web.
UTF-16
Uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use 2 bytes. Others use surrogate pairs (4 bytes). Used internally by Java, JavaScript, and Windows.
| Encoding | English A | Chinese 你 | Emoji 🔥 |
|---|---|---|---|
| UTF-8 | 41 (1 byte) | E4 BD A0 (3 bytes) | F0 9F 94 A5 (4 bytes) |
| UTF-16 | 00 41 (2 bytes) | 4F 60 (2 bytes) | D8 3D DD 25 (4 bytes) |
| UTF-32 | 00 00 00 41 (4 bytes) | 00 00 4F 60 (4 bytes) | 00 01 F5 25 (4 bytes) |
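The UTF-16 surrogate pair D8 3D DD 25 isn't magic; it can be derived from U+1F525 with a few shifts, as this Python sketch shows:

```python
# Derive the surrogate pair for U+1F525 by hand
cp = 0x1F525 - 0x10000       # offset into the supplementary planes
high = 0xD800 + (cp >> 10)   # high surrogate: top 10 bits
low = 0xDC00 + (cp & 0x3FF)  # low surrogate: bottom 10 bits
print(hex(high), hex(low))   # 0xd83d 0xdd25

# Python agrees (big-endian UTF-16, no BOM)
assert '🔥'.encode('utf-16-be') == bytes.fromhex('D83DDD25')
```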
Base64
Not a character encoding in the traditional sense — Base64 encodes binary data into ASCII text for safe transmission over text-only channels (email, URLs, JSON). It uses 64 ASCII characters (A–Z, a–z, 0–9, +, /) to represent binary data.
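A minimal round trip with Python's standard base64 module shows the idea: arbitrary bytes in, safe ASCII out, and back again.

```python
import base64

# UTF-8 bytes -> Base64 ASCII text -> original string
raw = 'café 🔥'.encode('utf-8')
encoded = base64.b64encode(raw)
print(encoded)  # b'Y2Fmw6kg8J+UpQ=='
decoded = base64.b64decode(encoded).decode('utf-8')
assert decoded == 'café 🔥'
```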
Practical Guide: Detecting and Converting Encodings
Python
```python
# Detect encoding with chardet
import chardet

with open('mystery_file.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Convert between encodings
text = raw.decode(result['encoding'])  # Decode from detected encoding
utf8_bytes = text.encode('utf-8')      # Re-encode as UTF-8

# Manual encoding/decoding
'Hello'.encode('ascii')      # b'Hello'
'café'.encode('utf-8')       # b'caf\xc3\xa9'
'café'.encode('latin-1')     # b'caf\xe9'
b'\xc3\xa9'.decode('utf-8')  # 'é'

# Get code points
[hex(ord(c)) for c in 'Hello 🔥']
# ['0x48', '0x65', '0x6c', '0x6c', '0x6f', '0x20', '0x1f525']
```
JavaScript
```javascript
// Encode string to UTF-8 bytes
const encoder = new TextEncoder(); // Always UTF-8
const bytes = encoder.encode('café');
console.log([...bytes].map(b => b.toString(16)));
// ['63', '61', '66', 'c3', 'a9']

// Decode bytes back to string
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(new Uint8Array([0xc3, 0xa9]));
console.log(text); // 'é'

// Decode Latin-1
const latin1 = new TextDecoder('iso-8859-1');
console.log(latin1.decode(new Uint8Array([0xe9]))); // 'é'

// Get code point
'🔥'.codePointAt(0).toString(16); // '1f525'

// Convert hex to text (what hextoascii.co does)
function hexToText(hex) {
  const raw = hex.match(/.{2}/g).map(b => parseInt(b, 16));
  return new TextDecoder().decode(new Uint8Array(raw));
}
hexToText('48656c6c6f'); // 'Hello'
```
Command Line
```bash
# Check file encoding
file -bi document.txt
# text/plain; charset=utf-8

# Convert encoding with iconv
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt

# List supported encodings
iconv -l

# Convert, transliterating characters ASCII can't represent
iconv -f UTF-8 -t ASCII//TRANSLIT input.txt > ascii_output.txt

# Hex dump to see actual bytes
xxd document.txt | head
# 00000000: 4865 6c6c 6f20 776f 726c 640a  Hello world.
```
Common Encoding Errors and How to Fix Them
Mojibake
Symptom: é instead of é, â€” instead of —
Cause: UTF-8 bytes decoded as Latin-1 or Windows-1252.
Fix:
```python
# The string was decoded wrong — re-encode as Latin-1, decode as UTF-8
broken = "café"
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # "café"
```
Question Marks (???)
Symptom: Characters replaced with ? or � (U+FFFD replacement character).
Cause: The decoder couldn't map the bytes to any valid character in the target encoding — often ASCII trying to handle UTF-8 multi-byte sequences.
Fix: Ensure your pipeline uses UTF-8 end-to-end. Check database collation, HTTP headers, and file read/write modes.
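In Python you can choose the failure mode explicitly: the default errors='strict' raises, while errors='replace' substitutes U+FFFD for each byte it can't map.

```python
data = 'café'.encode('utf-8')  # b'caf\xc3\xa9'

# 'strict' (the default) raises as soon as it hits a non-ASCII byte
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print(e)

# 'replace' swaps each bad byte for U+FFFD instead of crashing
print(data.decode('ascii', errors='replace'))  # caf��
```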
BOM (Byte Order Mark) Issues
Symptom: Invisible  at the start of a file, or JSON parsing fails on a valid-looking file.
Cause: The file starts with a UTF-8 BOM (EF BB BF). Some editors add it; most parsers don't expect it.
Fix:
```bash
# Remove BOM from a file (GNU sed)
sed -i '1s/^\xEF\xBB\xBF//' file.txt
```

```python
# In Python, the utf-8-sig codec strips the BOM for you
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()  # BOM is silently stripped
```
Double Encoding
Symptom: Ã© shows up where é should be, even though everything is set to UTF-8.
Cause: Text was decoded with the wrong encoding and then re-encoded as UTF-8. The é bytes C3 A9 were read as two Latin-1 characters (Ã and ©), and each of those was encoded to UTF-8 again, producing C3 83 C2 A9.
Fix:
```python
# Each latin-1 round trip peels off one spurious encoding layer
double_encoded = "Ã©"
fixed = double_encoded.encode('latin-1').decode('utf-8')
print(fixed)  # 'é'
# Still garbled? Apply the same round trip again;
# the text may have gone through more than one bad layer.
```
Best Practices
- Default to UTF-8. Always. Database, HTTP headers, file I/O, APIs.
- Declare your encoding. Set `<meta charset="UTF-8">` in HTML. Use `Content-Type: application/json; charset=utf-8` in HTTP.
- Don't mix encodings. If one part of your pipeline uses Latin-1 and another uses UTF-8, you will get mojibake.
- Store as Unicode, encode at the boundary. Keep strings as native Unicode objects in memory. Only encode/decode when reading from or writing to bytes (files, network, databases).
- Use tools. When in doubt, paste your hex bytes into hextoascii.co and see what comes out. Check file encodings with `file -bi` before processing.
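The "encode at the boundary" rule in practice, as a small Python sketch (a temp file stands in for any real I/O boundary):

```python
import tempfile

text = 'café 🔥'  # stays a native Unicode str in memory
with tempfile.NamedTemporaryFile('w+', encoding='utf-8') as f:
    f.write(text)        # encoded to UTF-8 bytes here, at the boundary
    f.seek(0)
    restored = f.read()  # decoded back to str on the way in
assert restored == text
```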
FAQ
What is the difference between ASCII and UTF-8?
ASCII is a 7-bit encoding that covers 128 characters (English letters, digits, and basic symbols). UTF-8 is a variable-length encoding that covers all 149,000+ Unicode characters. UTF-8 is backward compatible with ASCII — every ASCII file is also a valid UTF-8 file.
Is Unicode the same as UTF-8?
No. Unicode is a character catalog — a master list of characters and their code points. UTF-8 is one way to encode those code points as bytes. UTF-16 and UTF-32 are other encoding formats for Unicode. When people say "Unicode," they often mean UTF-8, but they're technically different things.
Why is UTF-8 the most popular encoding?
UTF-8 wins because it's backward compatible with ASCII (no migration cost for English content), space-efficient for Latin scripts, self-synchronizing (you can jump into the middle of a stream), and has no byte-order issues. It's the default for HTML5, JSON, and most modern tools.
Can ASCII represent emojis?
No. ASCII only covers 128 characters. Emojis have Unicode code points above U+1F000 and require UTF-8 (4 bytes), UTF-16 (4 bytes via surrogate pairs), or UTF-32 (4 bytes) to encode.
How do I check what encoding a file uses?
On Linux/Mac, use file -bi filename.txt. In Python, use the chardet library: chardet.detect(open('file', 'rb').read()). You can also hex-dump the first few bytes with xxd filename.txt | head and look for BOM markers or multi-byte patterns.
What causes mojibake (garbled text)?
Mojibake happens when text encoded in one format is decoded in another. The most common case: UTF-8 bytes decoded as Latin-1 or Windows-1252. For example, é (UTF-8: C3 A9) decoded as Latin-1 becomes é. Use the hex to ASCII converter to inspect the actual bytes and identify the correct encoding.
Should I use UTF-8 or UTF-16 for my database?
UTF-8 for almost all cases. It's more space-efficient for Latin text, has wider tooling support, and is the standard for web APIs and JSON. UTF-16 only makes sense if you're storing primarily CJK text (where UTF-16 uses 2 bytes vs UTF-8's 3 bytes per character) and storage is a critical concern.
How do I convert between encodings?
In Python: text.encode('utf-8') and bytes.decode('latin-1'). On the command line: iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt. In JavaScript: use TextEncoder and TextDecoder. For quick hex-to-text conversion, use hextoascii.co.