Decode Hex to UTF-8 Text: Fix Garbled Characters


You decoded a hex string, expected clean text, and got cafÃ© or a scatter of � replacement boxes. The bytes are fine; the interpretation is wrong. The usual culprit is the gap between plain ASCII and UTF-8: your data contains multi-byte characters, and you decoded each byte as if it stood alone.

This guide is about decoding hex specifically into UTF-8 text, which is what you actually want most of the time on the modern web. It picks up where a plain hex to ASCII conversion leaves off: ASCII only describes 128 characters, but real-world data — names, currencies, accents, CJK scripts, emoji — lives in Unicode and travels as UTF-8. We'll cover how UTF-8 packs characters into bytes, how to decode hex to UTF-8 in several languages, why "garbage" appears, and exactly how to fix it.

If you only need the answer right now, paste your hex into the hex to ASCII converter — it reads the bytes for you — and then read on to understand what's happening underneath.

ASCII Stops at 127; the World Doesn't

ASCII assigns codes 0–127: the Latin alphabet, digits, punctuation, and control codes. Every ASCII character is exactly one byte, and that byte's top bit is always 0 (hex 00–7F). For English text, hex-to-ASCII and hex-to-UTF-8 give identical results, because UTF-8 was deliberately designed so that all 128 ASCII characters encode to the same single bytes.

The divergence starts at byte value 128 (hex 80). Characters such as é, €, 中, or 😀 cannot fit in one byte. UTF-8 represents them using two, three, or four bytes, and those bytes individually are not valid standalone characters. Decode them one at a time as Latin-1 or another single-byte encoding and you get the classic mangled output known as mojibake.

How UTF-8 Encodes a Character

UTF-8 is a variable-width encoding. The number of bytes is signaled by the high bits of the first byte:

Code point range      Bytes   First byte pattern   Continuation bytes
U+0000 – U+007F       1       0xxxxxxx             none
U+0080 – U+07FF       2       110xxxxx             10xxxxxx
U+0800 – U+FFFF       3       1110xxxx             10xxxxxx × 2
U+10000 – U+10FFFF    4       11110xxx             10xxxxxx × 3

Every continuation byte starts with the bits 10, which is how a decoder knows it's in the middle of a sequence and not at a character boundary. Two concrete examples:

  • é is U+00E9, which encodes as the two bytes c3 a9.
  • 😀 is U+1F600, which encodes as the four bytes f0 9f 98 80.

So the hex 63 61 66 c3 a9 is five bytes but four characters: c, a, f, then c3 a9 combine into é, giving café. Read those last two bytes separately as Latin-1 and you instead get Ã (c3) and © (a9), which is where cafÃ© comes from.
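To make the bit patterns in the table concrete, here is a minimal sketch of an encoder written by hand. The function name utf8_encode is made up for illustration, and the sketch skips details a real codec handles (for example, rejecting surrogate code points); in practice you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if code_point < 0x80:       # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0x10FFFF:  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("code point is outside the Unicode range")


print(utf8_encode(0xE9).hex())     # c3a9
print(utf8_encode(0x1F600).hex())  # f09f9880
```

Each branch copies the lowest six bits into a continuation byte (prefix 10) and puts whatever bits remain into the lead byte, which is why the lead byte alone tells a decoder how many bytes to read.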

Decoding Hex to UTF-8 in Code

Python

Convert the hex to raw bytes, then decode those bytes as UTF-8 in one step:

hex_string = "636166c3a9"
text = bytes.fromhex(hex_string).decode("utf-8")
print(text)  # café

The two operations are distinct and both matter: bytes.fromhex() turns the hex digits into actual bytes, and .decode("utf-8") interprets those bytes as a UTF-8 string. Swap in .decode("ascii") and the same data throws a UnicodeDecodeError, because c3 is above 127.
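Real-world hex often arrives with separators. bytes.fromhex() tolerates whitespace between byte pairs but rejects colons and 0x prefixes, so a small wrapper can help. The helper below is a sketch: the name hex_to_text and the choice to strip every non-hex character are assumptions for illustration, not part of any library.

```python
import re


def hex_to_text(hex_string: str, encoding: str = "utf-8") -> str:
    """Strip common separators from a hex string, then decode the bytes."""
    # Drop "0x"/"0X" prefixes first; "x" is never a hex digit, so this cannot eat data.
    cleaned = re.sub(r"0[xX]", "", hex_string)
    # Drop anything else that is not a hex digit (spaces, colons, commas, newlines).
    cleaned = re.sub(r"[^0-9a-fA-F]", "", cleaned)
    if len(cleaned) % 2:
        raise ValueError("odd number of hex digits; a byte is incomplete")
    return bytes.fromhex(cleaned).decode(encoding)


print(hex_to_text("63:61:66:C3:A9"))  # café
```

Stripping everything that is not a hex digit is forgiving, which suits pasted input; for validation you may prefer to reject unexpected characters instead.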

JavaScript

Build a byte array from the hex, then let TextDecoder apply UTF-8:

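function hexToUtf8(hex) {
  hex = hex.replace(/[^0-9a-fA-F]/g, "");        // strip spaces, colons, and other separators
  if (hex.length % 2 !== 0) {
    throw new Error("Hex string has an odd number of digits");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);  // one byte per hex pair
  }
  return new TextDecoder("utf-8").decode(bytes);   // "utf-8" is also TextDecoder's default
}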

console.log(hexToUtf8("f09f9880")); // 😀

Avoid the old String.fromCharCode(parseInt(pair, 16)) loop for anything non-English: it treats each byte as a separate code unit and produces mojibake for multi-byte characters. TextDecoder is the correct, Unicode-aware tool.

Go

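In Go, strings are byte sequences that are conventionally UTF-8, so decoding hex is just hex.DecodeString followed by a string conversion. The conversion does no validation; call utf8.Valid from unicode/utf8 first if the input may not be UTF-8: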

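package main

import (
    "encoding/hex"
    "fmt"
    "log"
)

func main() {
    b, err := hex.DecodeString("e4b8ad") // the three UTF-8 bytes of 中
    if err != nil {
        log.Fatal(err) // invalid hex digit or odd length
    }
    fmt.Println(string(b)) // prints: 中
}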

Command Line

xxd -r -p reverses plain hex back to raw bytes; send the output to a UTF-8-aware terminal and the characters render directly:

echo "636166c3a9" | xxd -r -p
# café

echo "f09f9880" | xxd -r -p
# 😀

If your terminal shows boxes instead of the emoji, that's a font/locale display issue, not a decoding error — the bytes were reconstructed correctly.

Why You're Seeing Garbage — and How to Fix It

Symptom: Ã©, â€™, Ã¼ (mojibake)

Cause: UTF-8 bytes were decoded as a single-byte encoding (Latin-1 / Windows-1252). Each multi-byte sequence got split into separate "characters."

Fix: Decode as UTF-8, not ASCII or Latin-1.

# wrong: each byte read alone
bytes.fromhex("c3a9").decode("latin-1")   # 'Ã©'
# right: bytes read together as one UTF-8 sequence
bytes.fromhex("c3a9").decode("utf-8")     # 'é'
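If you no longer have the original bytes and only hold the already-garbled string, the mistake can often be reversed: encode the text back with the encoding that was wrongly applied, then decode the result as UTF-8. The sketch below assumes the wrong encoding was Windows-1252 (cp1252); repair_mojibake is a hypothetical helper name, and it will raise if the text contains characters outside that code page or if the recovered bytes are not valid UTF-8.

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a wrong single-byte decode by round-tripping back to the raw bytes."""
    raw = garbled.encode(wrong_encoding)  # recover the original bytes
    return raw.decode("utf-8")            # interpret them as intended


# Simulate the mistake: UTF-8 bytes for "café" decoded as Windows-1252.
garbled = bytes.fromhex("636166c3a9").decode("cp1252")
print(garbled)                   # cafÃ©
print(repair_mojibake(garbled))  # café
```

This only works when the wrong decode was lossless. If replacement characters were already substituted along the way, the original bytes are gone and you need to go back to the source data.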

Symptom: � replacement characters

Cause: The byte sequence is not valid UTF-8: perhaps it's genuinely a different encoding, the data is truncated mid-character, or it's binary that was never text. The � (U+FFFD) is the decoder's way of saying "this isn't valid here."

Fix: First confirm the data really is UTF-8. If a sequence is cut off (you sliced a byte buffer in the middle of a multi-byte character), realign to a character boundary. If it's a legacy encoding, decode with the correct one:

data = bytes.fromhex("e9")  # lone 0xE9 — valid Latin-1 'é', invalid UTF-8
data.decode("latin-1")              # 'é'
data.decode("utf-8", errors="replace")  # '�'
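Truncation usually happens when data arrives in chunks and each chunk is decoded on its own. One way to avoid it, sketched here with Python's standard incremental decoder, is to let the decoder hold back an incomplete sequence until the next chunk arrives. The function name decode_chunks is an assumption for illustration.

```python
import codecs


def decode_chunks(chunks) -> str:
    """Decode UTF-8 that may be split mid-character across chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered inside the decoder, not emitted.
        parts.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if a sequence is still unfinished.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = bytes.fromhex("636166c3a9")
split = [data[:4], data[4:]]          # the cut falls between c3 and a9
print(decode_chunks(split))           # café
print(split[0].decode("utf-8", errors="replace"))  # caf�
```

Decoding the first chunk alone produces a replacement character because c3 promises a continuation byte that has not arrived yet; the incremental decoder simply waits for it.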

Symptom: UnicodeDecodeError / an exception

Cause: You asked for strict ASCII decoding (or strict UTF-8) on bytes that don't qualify.

Fix: Decode as UTF-8, and choose an error policy if some bytes may be malformed: errors="replace" substitutes �, errors="ignore" drops the bad bytes, and the default (errors="strict") raises so you notice corruption early. Use replace for display, strict (the default) for validation.
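A short comparison of the three policies on the same input may help. The sample bytes are made up for illustration: the ASCII text "hi", one invalid byte (ff can never appear in UTF-8), then "!".

```python
data = bytes.fromhex("6869ff21")  # "hi", an invalid 0xFF byte, then "!"

replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = data.decode("utf-8", errors="ignore")    # bad byte silently dropped

try:
    data.decode("utf-8")  # strict is the default
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # byte offset of the first invalid byte

print(repr(replaced))  # 'hi\ufffd!'
print(repr(ignored))   # 'hi!'
print(failed_at)       # 2
```

Note that ignore hides the fact that anything was wrong, so it is rarely the right choice when the text will be stored or compared later.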

Symptom: extra invisible character at the very start

Cause: A UTF-8 byte-order mark (BOM), ef bb bf, prepended to the data. It's harmless but shows up as a zero-width character (U+FEFF) or, when the bytes are read as Latin-1 or Windows-1252, a stray ï»¿.

Fix: Decode with utf-8-sig in Python to strip a leading BOM automatically, or slice off the first three bytes if present.
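Both approaches can be sketched in a few lines. The sample hex is an assumption for illustration: a BOM followed by the ASCII text "hi".

```python
import codecs

data = bytes.fromhex("efbbbf6869")  # BOM followed by "hi"

with_bom = data.decode("utf-8")         # keeps the BOM as U+FEFF
without_bom = data.decode("utf-8-sig")  # strips a leading BOM if one is present

# Manual alternative: slice the three BOM bytes off before decoding.
trimmed = data
if trimmed.startswith(codecs.BOM_UTF8):
    trimmed = trimmed[len(codecs.BOM_UTF8):]

print(repr(with_bom))                 # '\ufeffhi'
print(repr(without_bom))              # 'hi'
print(repr(trimmed.decode("utf-8")))  # 'hi'
```

utf-8-sig is safe to use on data without a BOM as well; it only removes the marker when it is actually there.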

A Quick Decision Guide

When a hex string won't decode cleanly, run through this in order:

  1. Is it ASCII-only? If every byte is 00–7F, ASCII and UTF-8 agree, so the problem is elsewhere (whitespace, delimiters).
  2. Are there bytes ≥ 80? Then it's almost certainly UTF-8 (or a legacy 8-bit encoding). Try UTF-8 first.
  3. Does UTF-8 produce �? The data may be Latin-1/Windows-1252, truncated, or not text at all. Try latin-1; it never fails to decode, so judge by whether the result reads as sensible text. If it does, you've likely found the encoding.
  4. Mojibake like Ã©? You decoded UTF-8 as a single-byte encoding. Re-decode as UTF-8.
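
The first three checks can be roughly automated. The function below is a sketch under simple assumptions: the name diagnose and its return labels are invented for illustration, and it cannot tell a legacy encoding from truncated or binary data, which still needs a human look.

```python
def diagnose(hex_string: str) -> str:
    """Classify hex data as 'ascii', 'utf-8', or 'not-utf8' (hypothetical labels)."""
    data = bytes.fromhex(hex_string)
    # Step 1: every byte in 00-7F means ASCII and UTF-8 agree.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Step 2: high bytes present, so try strict UTF-8 first.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Step 3: legacy 8-bit encoding, truncated data, or not text at all.
        return "not-utf8"
    return "utf-8"


print(diagnose("636166"))      # ascii
print(diagnose("636166c3a9"))  # utf-8
print(diagnose("636166e9"))    # not-utf8
```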

For curiosity or verification, the full byte-to-character mapping is laid out in our ASCII table reference, and the conceptual differences are covered in ASCII vs UTF-8 vs Unicode.

FAQ

How do I decode hex to UTF-8 in Python?

Use two steps: bytes.fromhex(hex_string).decode("utf-8"). The first call converts hex digits into raw bytes; the second interprets those bytes as a UTF-8 string. For example, bytes.fromhex("e282ac").decode("utf-8") returns €. Add errors="replace" if some bytes might be invalid and you'd rather see � than raise an exception.

Why does my hex decode to Ã© instead of é?

Because the UTF-8 bytes c3 a9 were decoded one at a time as Latin-1 or Windows-1252, splitting a single two-byte character into Ã and ©. This is called mojibake. Decode the bytes as UTF-8 in a single operation and the two bytes combine correctly into é.

What does the � character mean when decoding?

It's U+FFFD, the Unicode replacement character. A UTF-8 decoder emits it when it hits a byte sequence that isn't valid UTF-8 — commonly because the data is actually a different encoding, was truncated in the middle of a multi-byte character, or isn't text at all. It signals where decoding failed, not a character that was in the original data.

Is hex to ASCII different from hex to UTF-8?

For bytes 00–7F they are identical, since UTF-8 was designed to be ASCII-compatible. They differ for any byte 80 or higher: ASCII can't represent those at all, while UTF-8 reads two to four such bytes together as one character. If your data has any accented letters, symbols, or emoji, you need UTF-8.

How many bytes is an emoji in hex?

Most emoji are four bytes in UTF-8. For instance, 😀 (U+1F600) is f0 9f 98 80. Some symbols that predate the emoji block are three bytes, and certain emoji are actually sequences of multiple code points: skin-tone variants append a four-byte modifier, flags pair two four-byte regional-indicator symbols, and family or profession emoji join several emoji with three-byte zero-width joiners. Those sequences take eight bytes or more.
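
A quick way to check byte counts yourself is to encode a few samples and measure them. The sample set is an assumption chosen for illustration, and the strings are written as escapes so the code points are explicit.

```python
samples = {
    "grinning face": "\U0001F600",
    "white smiling face (U+263A)": "\u263A",
    "thumbs up + skin-tone modifier": "\U0001F44D\U0001F3FD",
    "flag (two regional indicators)": "\U0001F1EB\U0001F1F7",
    "family (joined with U+200D)": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

# Count the UTF-8 bytes behind each visible symbol.
byte_counts = {name: len(text.encode("utf-8")) for name, text in samples.items()}

for name, count in byte_counts.items():
    print(f"{name}: {count} bytes")
# grinning face: 4 bytes
# white smiling face (U+263A): 3 bytes
# thumbs up + skin-tone modifier: 8 bytes
# flag (two regional indicators): 8 bytes
# family (joined with U+200D): 18 bytes
```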

Can I tell which encoding a hex string uses just by looking?

Not with certainty, but there are strong hints. All bytes ≤ 7F means ASCII-compatible. Bytes that follow the UTF-8 lead/continuation pattern (110xxxxx then 10xxxxxx, etc.) strongly suggest UTF-8. A leading ef bb bf is a UTF-8 BOM. When UTF-8 decoding succeeds without �, that's usually confirmation enough.

Conclusion

Decoding hex to UTF-8 is a two-stage process (hex digits to bytes, then bytes to characters), and the second stage is where text comes alive or falls apart. Plain ASCII decoding works only for the first 128 characters; nearly everything else on the modern web is UTF-8, where characters span multiple bytes. When output looks wrong, the bytes are usually correct and the encoding choice isn't: mojibake means you read UTF-8 as a single-byte charset, and � means the bytes aren't valid UTF-8 in the first place.

Reach for bytes.fromhex(...).decode("utf-8"), TextDecoder, or xxd -r -p, decide on an error policy, and check for stray BOMs and truncation. When you just need a fast, correct read of a hex string, the hex to ASCII converter handles the byte reconstruction for you so you can focus on the text.
