Common Text Encoding Formats: ASCII vs UTF-8 vs Unicode Explained

Ever opened a file and seen Ã© where there should be an é? Or gotten a database full of ??? where Japanese characters used to be? Welcome to the world of character encoding — the invisible layer that turns bytes into letters and letters into chaos when it breaks.

Character encoding is one of those things developers deal with constantly but rarely think about until something goes wrong. And when it goes wrong, it goes really wrong.

This guide breaks down the three encoding systems you'll encounter most: ASCII, UTF-8, and Unicode. You'll learn how they work, how they differ, and how to stop encoding bugs from ruining your day.

What Is Character Encoding?

Computers store everything as numbers. Character encoding is the system that maps those numbers to the characters you see on screen. The letter A isn't stored as a tiny picture of an A — it's stored as the number 65 (or 41 in hexadecimal).

Different encoding systems use different mappings and different amounts of storage per character. That's where things get interesting — and where most encoding problems originate.
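In Python, for instance, ord() and chr() expose this number-to-character mapping directly:

```python
# The letter 'A' is stored as the number 65 (0x41); ord() and chr()
# convert between a character and its number
print(ord('A'))       # 65
print(hex(ord('A')))  # 0x41
print(chr(65))        # A
```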

What Is ASCII?

ASCII (American Standard Code for Information Interchange) is the grandfather of character encoding. Created in 1963, it maps 128 characters to the numbers 0–127 using 7 bits.

What ASCII Includes

Range | Characters | Count
0–31 | Control characters (newline, tab, null) | 32
32–126 | Printable characters (letters, digits, symbols) | 95
127 | DEL (delete) | 1

That gives you:

  • 26 uppercase letters (A–Z): codes 65–90
  • 26 lowercase letters (a–z): codes 97–122
  • 10 digits (0–9): codes 48–57
  • 33 symbols (!, @, #, etc.)
  • 33 control characters (most are legacy)

You can explore the full mapping with a hex to ASCII converter — input 48 65 6C 6C 6F and you get Hello.
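The same conversion takes two lines of Python:

```python
# Interpret the hex byte sequence 48 65 6C 6C 6F as ASCII text
data = bytes.fromhex('48656C6C6F')
print(data.decode('ascii'))  # Hello
```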

ASCII's Limitation

ASCII only covers English. No accented characters (é, ñ, ü), no Chinese, no Arabic, no emoji. 128 characters seemed like enough in 1963 when the primary users were American engineers. It wasn't enough for the global internet.

Extended ASCII (codes 128–255) tried to fix this by using all 8 bits, but different systems mapped those extra 128 slots differently. Windows used Code Page 1252. Mac used Mac Roman. ISO created ISO 8859-1 (Latin-1). Same byte, different character — chaos.
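Python ships codecs for these legacy tables, so you can see the "same byte, different character" problem directly. Here byte 0x80 decodes three different ways:

```python
raw = b'\x80'
print(raw.decode('cp1252'))     # € (Windows-1252: Euro sign)
print(raw.decode('mac_roman'))  # Ä (Mac Roman)
print(raw.decode('latin-1'))    # '\x80' (ISO 8859-1: an unprintable C1 control)
```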

What Is Unicode?

Unicode is the answer to the "too many incompatible encoding tables" problem. Instead of being an encoding itself, Unicode is a universal character catalog — a master list that assigns a unique number (called a code point) to every character in every writing system.

How Unicode Works

Each character gets a code point written as U+ followed by a hex number:

Character | Code Point | Name
A | U+0041 | Latin Capital Letter A
é | U+00E9 | Latin Small Letter E with Acute
你 | U+4F60 | CJK Unified Ideograph-4F60
🔥 | U+1F525 | Fire Emoji
∞ | U+221E | Infinity

As of Unicode 15.1, there are 149,813 characters covering 161 scripts. Everything from Egyptian hieroglyphs to musical notation to emoji.

Unicode Is Not an Encoding

This is the key distinction most people miss. Unicode defines which number maps to which character. It doesn't define how those numbers are stored as bytes. That's the job of encoding formats like UTF-8, UTF-16, and UTF-32.

Think of Unicode as the dictionary and UTF-8 as the handwriting style you use to write from that dictionary.
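You can watch the split in Python: one code point (the dictionary entry), three different byte sequences depending on the "handwriting":

```python
s = 'é'  # one code point: U+00E9
print(s.encode('utf-8'))      # b'\xc3\xa9'          (2 bytes)
print(s.encode('utf-16-be'))  # b'\x00\xe9'          (2 bytes)
print(s.encode('utf-32-be'))  # b'\x00\x00\x00\xe9'  (4 bytes)
```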

What Is UTF-8?

UTF-8 (Unicode Transformation Format – 8-bit) is the dominant encoding on the web. Over 98% of websites use it. It's the default for HTML5, JSON, YAML, TOML, and most modern programming languages.

How UTF-8 Works

UTF-8 is a variable-length encoding. It uses 1 to 4 bytes per character, depending on the code point:

Code Point Range | Bytes | Bit Pattern | Example
U+0000 – U+007F | 1 | 0xxxxxxx | A → 41
U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | é → C3 A9
U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 你 → E4 BD A0
U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🔥 → F0 9F 94 A5

The genius of UTF-8:

  1. Backward compatible with ASCII. Any valid ASCII text is also valid UTF-8, byte for byte.
  2. Self-synchronizing. You can jump into the middle of a byte stream and find the start of the next character by looking at the bit patterns.
  3. No byte-order issues. Unlike UTF-16, there's no endianness ambiguity.
  4. Space efficient for Latin text. English content uses exactly the same space as ASCII.
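Points 1 and 2 are easy to verify in Python. This sketch finds character boundaries by skipping continuation bytes, which always match the pattern 10xxxxxx:

```python
# ASCII compatibility: the byte sequences are identical
assert 'Hello'.encode('ascii') == 'Hello'.encode('utf-8')

# Self-synchronization: a byte starts a character unless its top
# two bits are 10 (the continuation-byte marker)
data = 'é你🔥'.encode('utf-8')
starts = [i for i, b in enumerate(data) if b & 0b11000000 != 0b10000000]
print(starts)  # [0, 2, 5]: é at 0 (2 bytes), 你 at 2 (3 bytes), 🔥 at 5 (4 bytes)
```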

UTF-8 Encoding Example: Step by Step

Let's encode the character é (U+00E9) into UTF-8:

  1. Code point: 0x00E9 = 0000 0000 1110 1001 in binary
  2. Falls in range U+0080–U+07FF → needs 2 bytes
  3. Template: 110xxxxx 10xxxxxx
  4. Fill in the bits: 110 00011 10 101001
  5. Result: 0xC3 0xA9
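The same steps in Python, using only bit operations (a sketch of the 2-byte case, not a full encoder):

```python
cp = 0x00E9                             # code point for é
byte1 = 0b11000000 | (cp >> 6)          # 110xxxxx: top 5 bits
byte2 = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx: low 6 bits
print(hex(byte1), hex(byte2))           # 0xc3 0xa9
assert bytes([byte1, byte2]) == 'é'.encode('utf-8')
```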

You can verify this with the hex to ASCII converter at hextoascii.co — input C3 A9 in UTF-8 mode and you'll see é.

ASCII vs UTF-8 vs Unicode: Key Differences

Feature | ASCII | Unicode | UTF-8
Type | Encoding + character set | Character set (catalog) | Encoding (for Unicode)
Characters | 128 | 149,813+ | All of Unicode
Bytes per character | 1 | N/A (it's not an encoding) | 1–4
English text size | Baseline | N/A | Same as ASCII
CJK text size | Can't represent | N/A | 3 bytes per character
Emoji support | ❌ | ✅ (defines code points) | ✅ (4 bytes each)
ASCII compatible | ✅ (it is ASCII) | N/A | ✅
Web usage | Legacy | Universal standard | 98%+ of websites
Year created | 1963 | 1991 | 1993

When to Use What

  • ASCII: Only when you're working with legacy systems that explicitly require it, or when you know your data is 100% English alphanumeric.
  • UTF-8: Almost always. It's the default for the modern web, APIs, databases, and file formats. If in doubt, use UTF-8.
  • UTF-16: Windows internals, Java strings, JavaScript strings (internally). You'll encounter it, but you rarely need to choose it.
  • UTF-32: Almost never in practice. Fixed 4 bytes per character. Simple but wasteful.

Other Encoding Formats Worth Knowing

Latin-1 (ISO 8859-1)

An 8-bit encoding that extends ASCII with Western European characters (codes 128–255). Still found in older databases and HTTP headers. Covers French, German, Spanish, and Portuguese — but not much else.

Windows-1252

Microsoft's variant of Latin-1. Adds "smart quotes," em dashes, and the Euro sign (€) in the 128–159 range where Latin-1 has control characters. The most common source of encoding confusion on the web.

UTF-16

Uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use 2 bytes. Others use surrogate pairs (4 bytes). Used internally by Java, JavaScript, and Windows.

Encoding | English A | Chinese 你 | Emoji 🔥
UTF-8 | 41 (1 byte) | E4 BD A0 (3 bytes) | F0 9F 94 A5 (4 bytes)
UTF-16 | 00 41 (2 bytes) | 4F 60 (2 bytes) | D8 3D DD 25 (4 bytes)
UTF-32 | 00 00 00 41 (4 bytes) | 00 00 4F 60 (4 bytes) | 00 01 F5 25 (4 bytes)
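The surrogate-pair arithmetic behind the UTF-16 row is mechanical: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves:

```python
cp = 0x1F525                # 🔥
v = cp - 0x10000            # 20-bit value spread across two code units
high = 0xD800 + (v >> 10)   # high surrogate: top 10 bits
low = 0xDC00 + (v & 0x3FF)  # low surrogate: bottom 10 bits
print(hex(high), hex(low))  # 0xd83d 0xdd25
assert '🔥'.encode('utf-16-be') == bytes.fromhex('D83DDD25')
```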

Base64

Not a character encoding in the traditional sense — Base64 encodes binary data into ASCII text for safe transmission over text-only channels (email, URLs, JSON). It uses 64 ASCII characters (A–Z, a–z, 0–9, +, /) to represent binary data.
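A quick illustration with Python's standard base64 module: the raw UTF-8 bytes for é become plain ASCII and round-trip losslessly:

```python
import base64

blob = 'é'.encode('utf-8')                     # b'\xc3\xa9'
text = base64.b64encode(blob).decode('ascii')  # pure ASCII, safe anywhere
print(text)                                    # w6k=
assert base64.b64decode(text) == blob
```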

Practical Guide: Detecting and Converting Encodings

Python

# Detect encoding with chardet
import chardet

with open('mystery_file.txt', 'rb') as f:
    raw = f.read()
    result = chardet.detect(raw)
    print(result)
    # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Convert between encodings
text = raw.decode(result['encoding'])  # Decode from detected encoding
utf8_bytes = text.encode('utf-8')       # Re-encode as UTF-8

# Manual encoding/decoding
'Hello'.encode('ascii')     # b'Hello'
'café'.encode('utf-8')      # b'caf\xc3\xa9'
'café'.encode('latin-1')    # b'caf\xe9'
b'\xc3\xa9'.decode('utf-8') # 'é'

# Get code points
[hex(ord(c)) for c in 'Hello 🔥']
# ['0x48', '0x65', '0x6c', '0x6c', '0x6f', '0x20', '0x1f525']

JavaScript

// Encode string to UTF-8 bytes
const encoder = new TextEncoder(); // Always UTF-8
const bytes = encoder.encode('café');
console.log([...bytes].map(b => b.toString(16)));
// ['63', '61', '66', 'c3', 'a9']

// Decode bytes back to string
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(new Uint8Array([0xc3, 0xa9]));
console.log(text); // 'é'

// Decode Latin-1
const latin1 = new TextDecoder('iso-8859-1');
console.log(latin1.decode(new Uint8Array([0xe9]))); // 'é'

// Get code point
'🔥'.codePointAt(0).toString(16); // '1f525'

// Convert hex to text (what hextoascii.co does)
function hexToText(hex) {
  const clean = hex.replace(/\s+/g, '');  // tolerate spaced input like "48 65"
  const bytes = clean.match(/.{2}/g).map(b => parseInt(b, 16));
  return new TextDecoder().decode(new Uint8Array(bytes));
}
hexToText('48656c6c6f'); // 'Hello'

Command Line

# Check file encoding
file -bi document.txt
# text/plain; charset=utf-8

# Convert encoding with iconv
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt

# List supported encodings
iconv -l

# Convert and strip invalid characters
iconv -f UTF-8 -t ASCII//TRANSLIT input.txt > ascii_output.txt

# Hex dump to see actual bytes
xxd document.txt | head
# 00000000: 4865 6c6c 6f20 776f 726c 640a  Hello world.

Common Encoding Errors and How to Fix Them

Mojibake

Symptom: Ã© instead of é, â€" instead of an em dash.

Cause: UTF-8 bytes decoded as Latin-1 or Windows-1252.

Fix:

# The bytes were decoded with the wrong codec; undo it, then decode as UTF-8
broken = "cafÃ©"
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # "café"

Question Marks (???)

Symptom: Characters replaced with ? or � (U+FFFD replacement character).

Cause: The decoder couldn't map the bytes to any valid character in the target encoding — often ASCII trying to handle UTF-8 multi-byte sequences.

Fix: Ensure your pipeline uses UTF-8 end-to-end. Check database collation, HTTP headers, and file read/write modes.
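Both symptoms are easy to reproduce in Python, where the error handler decides whether you get ? or �:

```python
s = 'café'
# Encoding to ASCII with errors='replace' substitutes '?'
print(s.encode('ascii', errors='replace'))           # b'caf?'
# Decoding UTF-8 bytes as ASCII with 'replace' yields one U+FFFD per bad byte
print(s.encode('utf-8').decode('ascii', 'replace'))  # caf��
```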

BOM (Byte Order Mark) Issues

Symptom: Invisible  at the start of a file, or JSON parsing fails on a valid-looking file.

Cause: The file starts with a UTF-8 BOM (EF BB BF). Some editors add it; most parsers don't expect it.

Fix:

# Remove BOM from a file
sed -i '1s/^\xEF\xBB\xBF//' file.txt

# In Python
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()  # BOM is silently stripped

Double Encoding

Symptom: Ã© shows up even when you set UTF-8 everywhere.

Cause: Text was encoded to UTF-8, the resulting bytes were misread as Latin-1, and that garbled string was encoded to UTF-8 again. The é bytes C3 A9 got treated as two Latin-1 characters (Ã and ©), and each was re-encoded into two bytes of its own.

Fix:

# The string decodes cleanly as UTF-8 but still shows mojibake, because
# the stored bytes themselves were encoded twice
double_encoded = "cafÃ©"
fixed = double_encoded.encode('latin-1').decode('utf-8')
print(fixed)  # "café"
# Text that was double-encoded more than once needs the same step repeated

Best Practices

  1. Default to UTF-8. Always. Database, HTTP headers, file I/O, APIs.
  2. Declare your encoding. Set <meta charset="UTF-8"> in HTML. Use Content-Type: application/json; charset=utf-8 in HTTP.
  3. Don't mix encodings. If one part of your pipeline uses Latin-1 and another uses UTF-8, you will get mojibake.
  4. Store as Unicode, encode at the boundary. Keep strings as native Unicode objects in memory. Only encode/decode when reading from or writing to bytes (files, network, databases).
  5. Use tools. When in doubt, paste your hex bytes into hextoascii.co and see what comes out. Check file encodings with file -bi before processing.
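Point 4 in practice, as a minimal sketch (the temp-file path here is just for the demo):

```python
import os
import tempfile

name = 'café'  # keep strings as native Unicode objects in memory
path = os.path.join(tempfile.gettempdir(), 'encoding_demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write(name)    # encoded to bytes only at the write boundary

with open(path, 'rb') as f:
    print(f.read())  # b'caf\xc3\xa9': the on-disk UTF-8 bytes

os.remove(path)
```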

FAQ

What is the difference between ASCII and UTF-8?

ASCII is a 7-bit encoding that covers 128 characters (English letters, digits, and basic symbols). UTF-8 is a variable-length encoding that covers all 149,000+ Unicode characters. UTF-8 is backward compatible with ASCII — every ASCII file is also a valid UTF-8 file.

Is Unicode the same as UTF-8?

No. Unicode is a character catalog — a master list of characters and their code points. UTF-8 is one way to encode those code points as bytes. UTF-16 and UTF-32 are other encoding formats for Unicode. When people say "Unicode," they often mean UTF-8, but they're technically different things.

Why is UTF-8 the most popular encoding?

UTF-8 wins because it's backward compatible with ASCII (no migration cost for English content), space-efficient for Latin scripts, self-synchronizing (you can jump into the middle of a stream), and has no byte-order issues. It's the default for HTML5, JSON, and most modern tools.

Can ASCII represent emojis?

No. ASCII only covers 128 characters. Emojis have Unicode code points above U+1F000 and require UTF-8 (4 bytes), UTF-16 (4 bytes via surrogate pairs), or UTF-32 (4 bytes) to encode.

How do I check what encoding a file uses?

On Linux/Mac, use file -bi filename.txt. In Python, use the chardet library: chardet.detect(open('file', 'rb').read()). You can also hex-dump the first few bytes with xxd filename.txt | head and look for BOM markers or multi-byte patterns.

What causes mojibake (garbled text)?

Mojibake happens when text encoded in one format is decoded in another. The most common case: UTF-8 bytes decoded as Latin-1 or Windows-1252. For example, é (UTF-8: C3 A9) decoded as Latin-1 becomes Ã©. Use the hex to ASCII converter to inspect the actual bytes and identify the correct encoding.

Should I use UTF-8 or UTF-16 for my database?

UTF-8 for almost all cases. It's more space-efficient for Latin text, has wider tooling support, and is the standard for web APIs and JSON. UTF-16 only makes sense if you're storing primarily CJK text (where UTF-16 uses 2 bytes vs UTF-8's 3 bytes per character) and storage is a critical concern.

How do I convert between encodings?

In Python: text.encode('utf-8') and bytes.decode('latin-1'). On the command line: iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt. In JavaScript: use TextEncoder and TextDecoder. For quick hex-to-text conversion, use hextoascii.co.
