You decoded a hex string, expected clean text, and got cafÃ© or a scatter of � replacement boxes. The bytes are fine; the interpretation is wrong. Nine times out of ten the culprit is the gap between plain ASCII and UTF-8: your data contains multi-byte characters, and you decoded each byte as if it stood alone.
This guide is about decoding hex specifically into UTF-8 text, which is what you actually want most of the time on the modern web. It picks up where a plain hex to ASCII conversion leaves off: ASCII only describes 128 characters, but real-world data — names, currencies, accents, CJK scripts, emoji — lives in Unicode and travels as UTF-8. We'll cover how UTF-8 packs characters into bytes, how to decode hex to UTF-8 in several languages, why "garbage" appears, and exactly how to fix it.
If you only need the answer right now, paste your hex into the hex to ASCII converter — it reads the bytes for you — and then read on to understand what's happening underneath.
ASCII Stops at 127; the World Doesn't
ASCII assigns codes 0–127: the Latin alphabet, digits, punctuation, and control codes. Every ASCII character is exactly one byte, and that byte's top bit is always 0 (00–7F). For English text, hex-to-ASCII and hex-to-UTF-8 give identical results, because UTF-8 was deliberately designed so that all 128 ASCII characters encode to the same single bytes.
The divergence starts at byte value 128 (hex 80). Characters such as é, €, 中, and 😀 cannot fit in one byte. UTF-8 represents them using two, three, or four bytes, and those bytes are not valid characters on their own. Decode them one at a time as Latin-1 (or force them through ASCII) and you get the classic mangled output known as mojibake.
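To make the split concrete, here is a minimal Python sketch. The helper name `has_non_ascii` and the sample strings are illustrative assumptions, not part of any library. It checks whether a hex string contains any byte at or above 80, and confirms that ASCII-only data decodes identically either way.

```python
def has_non_ascii(hex_string: str) -> bool:
    """Return True if any byte in the hex string is 0x80 or higher."""
    return any(b >= 0x80 for b in bytes.fromhex(hex_string))


plain = "68656c6c6f"      # "hello": every byte is below 0x80
accented = "636166c3a9"   # "café": contains the bytes c3 and a9

# For ASCII-only data the two decodings give the same string.
same_result = (
    bytes.fromhex(plain).decode("ascii") == bytes.fromhex(plain).decode("utf-8")
)

print(has_non_ascii(plain), has_non_ascii(accented), same_result)
```

If `has_non_ascii` returns True, a plain ASCII decode cannot succeed and the data needs a real encoding choice.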
How UTF-8 Encodes a Character
UTF-8 is a variable-width encoding. The number of bytes is signaled by the high bits of the first byte:
| Code point range | Bytes | First byte pattern | Continuation bytes |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | none |
| U+0080 – U+07FF | 2 | 110xxxxx | 10xxxxxx |
| U+0800 – U+FFFF | 3 | 1110xxxx | 10xxxxxx × 2 |
| U+10000 – U+10FFFF | 4 | 11110xxx | 10xxxxxx × 3 |
Every continuation byte starts with the bits 10, which is how a decoder knows it's in the middle of a sequence and not at a character boundary. Two concrete examples:
- é is U+00E9, which encodes as the two bytes c3 a9.
- 😀 is U+1F600, which encodes as the four bytes f0 9f 98 80.
So the hex 63 61 66 c3 a9 is five bytes but four characters: c, a, f, then c3 a9 combine into é, giving café. Read those last two bytes separately as Latin-1 and you instead get Ã (c3) and © (a9), which is where cafÃ© comes from.
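The bit layout can be checked by hand. The following is an illustrative sketch only: `encode_code_point` is a hypothetical helper written for this guide, and it skips the validation a real encoder performs (for example, rejecting surrogates). It packs a code point into the lead and continuation patterns described above and compares the result with Python's built-in encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustration only)."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in ("\u00e9", "\u4e2d", "\U0001F600"):   # é, 中, 😀
    manual = encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")          # agrees with the built-in encoder
    print(f"U+{ord(ch):04X} -> {manual.hex(' ')}")
```

In practice `str.encode("utf-8")` does all of this; the manual version only shows where each bit ends up.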
Decoding Hex to UTF-8 in Code
Python
Convert the hex to raw bytes, then decode those bytes as UTF-8 in one step:
hex_string = "636166c3a9"
text = bytes.fromhex(hex_string).decode("utf-8")
print(text) # café
The two operations are distinct and both matter: bytes.fromhex() turns the hex digits into actual bytes, and .decode("utf-8") interprets those bytes as a UTF-8 string. Swap in .decode("ascii") and the same data throws a UnicodeDecodeError, because c3 is above 127.
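Hex copied from logs or packet dumps often carries separators or 0x prefixes. `bytes.fromhex()` skips whitespace but rejects anything else. The sketch below assumes a hypothetical helper named `hex_to_text`; which separators it strips is a choice made for this example, not a standard. It normalizes the input before decoding.

```python
import re


def hex_to_text(raw: str) -> str:
    """Decode loosely formatted hex (spaces, colons, commas, 0x prefixes) as UTF-8."""
    # Drop 0x / 0X prefixes, but only where they start a new token.
    cleaned = re.sub(r"(?<![0-9a-fA-F])0[xX]", "", raw)
    # Drop every remaining character that is not a hex digit.
    cleaned = re.sub(r"[^0-9a-fA-F]", "", cleaned)
    if len(cleaned) % 2:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(cleaned).decode("utf-8")


print(hex_to_text("63 61 66 c3 a9"))             # café
print(hex_to_text("0x63,0x61,0x66,0xC3,0xA9"))   # café
print(hex_to_text("e2:82:ac"))                   # €
```

Raising on an odd digit count is deliberate: a missing nibble usually means the input was cut off, and silently padding it would hide that.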
JavaScript
Build a byte array from the hex, then let TextDecoder apply UTF-8:
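function hexToUtf8(hex) {
  hex = hex.replace(/[^0-9a-fA-F]/g, ""); // strip spaces, colons, and other separators
  if (hex.length % 2 !== 0) {
    throw new Error("Hex string has an odd number of digits");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    // slice() replaces the deprecated substr()
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return new TextDecoder("utf-8").decode(bytes); // "utf-8" is also the default label
}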
console.log(hexToUtf8("f09f9880")); // 😀
Avoid the old String.fromCharCode(parseInt(pair, 16)) loop for anything non-English: it treats each byte as a separate code unit and produces mojibake for multi-byte characters. TextDecoder is the correct, Unicode-aware tool.
Go
In Go, strings are already UTF-8 byte sequences, so decoding hex is just hex.DecodeString:
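package main

import (
	"encoding/hex"
	"fmt"
	"log"
)

func main() {
	b, err := hex.DecodeString("e4b8ad") // the three UTF-8 bytes of 中
	if err != nil {
		log.Fatal(err) // invalid hex digit or odd length
	}
	fmt.Println(string(b)) // prints: 中
}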
Command Line
xxd -r -p reverses hex back to raw bytes; run it in a UTF-8-aware terminal and the characters render directly:
echo "636166c3a9" | xxd -r -p
# café
echo "f09f9880" | xxd -r -p
# 😀
If your terminal shows boxes instead of the emoji, that's a font/locale display issue, not a decoding error — the bytes were reconstructed correctly.
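When rendering is in doubt, one way to confirm the decode without trusting the font is to print code points and names instead of glyphs. This sketch uses Python's standard `unicodedata` module; the `describe` helper is a hypothetical name chosen for the example.

```python
import unicodedata


def describe(hex_string: str) -> list[str]:
    """Decode hex as UTF-8 and list each character's code point and Unicode name."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


for line in describe("f09f9880"):
    print(line)   # U+1F600 GRINNING FACE
```

If the names look right, the bytes decoded correctly and any boxes on screen are a display problem.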
Why You're Seeing Garbage — and How to Fix It
Symptom: Ã©, â€™, Ã¼ (mojibake)
Cause: UTF-8 bytes were decoded as a single-byte encoding (Latin-1 / Windows-1252). Each multi-byte sequence got split into separate "characters."
Fix: Decode as UTF-8, not ASCII or Latin-1.
# wrong: each byte read alone
bytes.fromhex("c3a9").decode("latin-1") # 'Ã©'
# right: bytes read together as one UTF-8 sequence
bytes.fromhex("c3a9").decode("utf-8") # 'é'
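Sometimes only the already-mangled string is available, not the original hex. If the damage came from a single wrong decode, it can often be reversed by re-encoding with the wrong charset and decoding as UTF-8. The sketch below assumes Windows-1252 was the mistaken encoding, and `repair_mojibake` is a hypothetical helper name. Text that was mangled more than once, or through a lossy step, may not be recoverable this way.

```python
def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 bytes decoded as a single-byte charset'."""
    return text.encode(wrong_encoding).decode("utf-8")


mangled_cafe = "caf\u00c3\u00a9"          # displays as caf + A-tilde + copyright sign
mangled_quote = "\u00e2\u20ac\u2122"      # mojibake form of a right single quote

print(repair_mojibake(mangled_cafe))      # café
print(repair_mojibake(mangled_quote))     # ’
```

The literals are written with escapes so the example does not depend on how this page itself was encoded.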
Symptom: � replacement characters
Cause: The byte sequence is not valid UTF-8 — perhaps it's genuinely a different encoding, the data is truncated mid-character, or it's binary that was never text. The � (U+FFFD) is the decoder's way of saying "this isn't valid here."
Fix: First confirm the data really is UTF-8. If a sequence is cut off (you sliced a byte buffer in the middle of a multi-byte character), realign to a character boundary. If it's a legacy encoding, decode with the correct one:
data = bytes.fromhex("e9") # lone 0xE9 — valid Latin-1 'é', invalid UTF-8
data.decode("latin-1") # 'é'
data.decode("utf-8", errors="replace") # '�'
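Truncation usually happens when data arrives in chunks and each chunk is decoded on its own. One way to avoid it, sketched here with Python's standard incremental decoder (the `decode_chunks` helper is a name made up for this example), is to let the decoder hold an incomplete sequence until the remaining bytes arrive.

```python
import codecs


def decode_chunks(chunks) -> str:
    """Decode UTF-8 that may be split mid-character across chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if the stream ends mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


first, second = bytes.fromhex("636166c3"), bytes.fromhex("a9")  # split inside é

print(first.decode("utf-8", errors="replace"))  # caf� because the chunk is cut mid-character
print(decode_chunks([first, second]))           # café
```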
Symptom: UnicodeDecodeError / an exception
Cause: You asked for strict ASCII decoding (or strict UTF-8) on bytes that don't qualify.
Fix: Decode as UTF-8, and choose an error policy if some bytes may be malformed: errors="replace" substitutes �, errors="ignore" drops the bad bytes, and the default raises so you notice corruption early. Use replace for display, strict (the default) for validation.
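A side-by-side view of the three policies, as a small sketch. The `try_policies` name and the returned dictionary shape are assumptions made for illustration.

```python
def try_policies(data: bytes) -> dict:
    """Decode the same bytes under each UTF-8 error policy and collect the results."""
    result = {
        "replace": data.decode("utf-8", errors="replace"),
        "ignore": data.decode("utf-8", errors="ignore"),
    }
    try:
        result["strict"] = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        result["strict"] = f"error at byte offset {exc.start}"
    return result


# "hi" followed by 0xFF, a byte that never appears in valid UTF-8
print(try_policies(bytes.fromhex("6869ff")))
```

The strict branch reports where the bad byte sits, which is usually the fastest route to finding a truncation or a mixed-encoding field.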
Symptom: extra invisible character at the very start
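Cause: A UTF-8 byte-order mark (BOM), ef bb bf, prepended to the data. It's harmless but shows up as an invisible zero-width character (U+FEFF) when decoded as UTF-8, or as a stray ï»¿ when the bytes are read as Latin-1 or Windows-1252.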
Fix: Decode with utf-8-sig in Python to strip a leading BOM automatically, or slice off the first three bytes if present.
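Both fixes in a short sketch. `codecs.BOM_UTF8` is the standard-library constant for the three BOM bytes; `strip_bom` is a hypothetical helper for cases where you want to keep working with raw bytes.

```python
import codecs

data = bytes.fromhex("efbbbf6869")           # BOM followed by "hi"

with_bom = data.decode("utf-8")              # keeps U+FEFF as the first character
without_bom = data.decode("utf-8-sig")       # drops a leading BOM if one is present


def strip_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM from raw bytes, if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


print(repr(with_bom), repr(without_bom), strip_bom(data))
```

`utf-8-sig` is safe to use on data without a BOM too; it simply decodes as ordinary UTF-8.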
A Quick Decision Guide
When a hex string won't decode cleanly, run through this in order:
- Is it ASCII-only? If every byte is 00–7F, ASCII and UTF-8 agree, so the problem is elsewhere (whitespace, delimiters).
- Are there bytes ≥ 80? Then it's almost certainly UTF-8 (or a legacy 8-bit encoding). Try UTF-8 first.
- Does UTF-8 produce �? The data may be Latin-1/Windows-1252, truncated, or not text at all. Try latin-1; it never raises an error, so judge by whether the result reads as sensible text. If it does, you've likely found the encoding.
- Mojibake like Ã©? You decoded UTF-8 as a single-byte encoding. Re-decode as UTF-8.
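The first three checks can be automated roughly as follows. This is a sketch under stated assumptions: `diagnose` and its return labels are invented for this example, and the last check (spotting mojibake) is left out because it requires looking at decoded text, not bytes.

```python
def diagnose(hex_string: str) -> str:
    """Roughly classify hex data following the checklist above."""
    data = bytes.fromhex(hex_string)
    if all(b < 0x80 for b in data):
        return "ascii"                      # ASCII and UTF-8 agree
    try:
        data.decode("utf-8")                # strict: raises on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (try latin-1 or cp1252; check for truncation)"


for sample in ("68656c6c6f", "636166c3a9", "e9"):
    print(sample, "->", diagnose(sample))
```

A "utf-8" result is strong evidence, not proof: short byte strings from other encodings can occasionally happen to be valid UTF-8.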
For curiosity or verification, the full byte-to-character mapping is laid out in our ASCII table reference, and the conceptual differences are covered in ASCII vs UTF-8 vs Unicode.
FAQ
How do I decode hex to UTF-8 in Python?
Use two steps: bytes.fromhex(hex_string).decode("utf-8"). The first call converts hex digits into raw bytes; the second interprets those bytes as a UTF-8 string. For example, bytes.fromhex("e282ac").decode("utf-8") returns €. Add errors="replace" if some bytes might be invalid and you'd rather see � than raise an exception.
Why does my hex decode to Ã© instead of é?
Because the UTF-8 bytes c3 a9 were decoded one at a time as Latin-1 or Windows-1252, splitting a single two-byte character into Ã and ©. This is called mojibake. Decode the bytes as UTF-8 in a single operation and the two bytes combine correctly into é.
What does the � character mean when decoding?
It's U+FFFD, the Unicode replacement character. A UTF-8 decoder emits it when it hits a byte sequence that isn't valid UTF-8 — commonly because the data is actually a different encoding, was truncated in the middle of a multi-byte character, or isn't text at all. It signals where decoding failed, not a character that was in the original data.
Is hex to ASCII different from hex to UTF-8?
For bytes 00–7F they are identical, since UTF-8 was designed to be ASCII-compatible. They differ for any byte 80 or higher: ASCII can't represent those at all, while UTF-8 reads two to four such bytes together as one character. If your data has any accented letters, symbols, or emoji, you need UTF-8.
How many bytes is an emoji in hex?
Most emoji are four bytes in UTF-8. For instance, 😀 (U+1F600) is f0 9f 98 80. Some symbols that predate the emoji block are three bytes, and many emoji are actually sequences of multiple code points: skin-tone variants append a four-byte modifier, flags pair two four-byte regional indicator symbols, and family or profession emoji join several emoji with zero-width joiners (three bytes each). Such sequences take eight bytes or more.
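The byte counts are easy to verify. In this sketch the sample emoji are written as escapes so they survive any copy-paste encoding problems; `utf8_len` and the `samples` labels are illustrative names.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "sun (U+2600)": "\u2600",
    "grinning face": "\U0001F600",
    "thumbs up + skin-tone modifier": "\U0001F44D\U0001F3FD",
    "flag (two regional indicators)": "\U0001F1EF\U0001F1F5",
    "family joined with ZWJ": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for label, text in samples.items():
    print(f"{label}: {utf8_len(text)} bytes, hex {text.encode('utf-8').hex(' ')}")
```

The `hex(' ')` separator argument needs Python 3.8 or later.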
Can I tell which encoding a hex string uses just by looking?
Not with certainty, but there are strong hints. All bytes ≤ 7F means ASCII-compatible. Bytes that follow the UTF-8 lead/continuation pattern (110xxxxx then 10xxxxxx, etc.) strongly suggest UTF-8. A leading ef bb bf is a UTF-8 BOM. When UTF-8 decoding succeeds without �, that's usually confirmation enough.
Conclusion
Decoding hex to UTF-8 is a two-stage process — hex digits to bytes, bytes to characters — and the second stage is where text comes alive or falls apart. Plain ASCII decoding works only for the first 128 characters; everything else on the modern web is UTF-8, where characters span multiple bytes. When output looks wrong, the bytes are usually correct and the encoding choice isn't: mojibake means you read UTF-8 as a single-byte charset, and � means the bytes aren't valid UTF-8 in the first place.
Reach for bytes.fromhex(...).decode("utf-8"), TextDecoder, or xxd -r -p, decide on an error policy, and check for stray BOMs and truncation. When you just need a fast, correct read of a hex string, the hex to ASCII converter handles the byte reconstruction for you so you can focus on the text.