Overview

Convert characters and Unicode codepoints both ways, listing each codepoint with its UTF-8 and UTF-16 byte sequences, character name and block.

📘 How to Use

  1. Choose "Char → Codepoint" or "Codepoint → Char"
  2. Paste characters, or type a codepoint in U+XXXX form
  3. Read the codepoint, UTF-8 and UTF-16 bytes, name and block in the table


Unicode Codepoint Lookup: Map Characters to U+ and Their Bytes

Paste any text to see each character's Unicode codepoint (U+XXXX), or type U+ notation to recover the original glyph. Alongside the codepoint, every row shows the UTF-8 and UTF-16 byte sequences, the character name, and the Unicode block it belongs to.

💡 Track Down Mojibake and Invisible Characters

"Two strings look identical but fail an equality check." "There's a stray symbol in this JSON I can't delete." Most of these bugs come from characters that look the same but carry different codepoints, or from invisible control and zero-width characters. Paste the offending string here and it breaks down character by character, so the unexpected codepoint becomes visible in the table instead of hiding in plain sight.
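The same kind of check can be scripted. The following is a minimal sketch, not the tool's own implementation: the helper name `suspicious_chars` is hypothetical, and it assumes that "invisible" means the Unicode general categories Cf (format) and Cc (control), excluding ordinary newlines and tabs.

```python
import unicodedata

def suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, U+XXXX, category) for each format or control character."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Cf covers zero-width spaces, joiners and the BOM; Cc covers controls.
        # Ordinary line breaks and tabs are skipped (an assumption of this sketch).
        if cat in ("Cf", "Cc") and ch not in "\n\r\t":
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits

a = "price"
b = "pri\u200bce"           # looks the same, but hides a zero-width space
print(a == b)               # False
print(suspicious_chars(b))  # [(3, 'U+200B', 'Cf')]
```

Look-alike letters (for example a Cyrillic "а" in place of a Latin "a") are not flagged by this category test; those still need the per-character codepoint table to spot.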

The reverse direction lives in the same view. When a spec or bug report mentions a codepoint like U+200B or U+FEFF, type it in and the tool reconstructs the actual character so you can confirm what you are dealing with. Because each row carries the UTF-8 byte sequence, it pairs well with reading a binary dump or a network log where you spotted a run like E2 80 8B (the UTF-8 encoding of U+200B, the zero-width space). Characters beyond the Basic Multilingual Plane, such as emoji and rarer CJK ideographs, are shown with their UTF-16 surrogate pair so you can follow the high and low units directly.
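Going from a byte run in a dump back to a codepoint can also be done by hand. This is a small sketch under the assumption that the bytes are valid UTF-8 written as space-separated hex; the function name `codepoints_from_hex_dump` is hypothetical.

```python
def codepoints_from_hex_dump(dump: str) -> list[str]:
    """Decode a hex byte run (e.g. 'E2 80 8B') as UTF-8 and list its codepoints."""
    raw = bytes.fromhex(dump)      # whitespace between byte pairs is ignored
    text = raw.decode("utf-8")     # raises UnicodeDecodeError on invalid input
    return [f"U+{ord(ch):04X}" for ch in text]

print(codepoints_from_hex_dump("E2 80 8B"))  # ['U+200B']
print(codepoints_from_hex_dump("EF BB BF"))  # ['U+FEFF']
```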

🧐 Frequently Asked Questions

Q. Can I look up several characters at once? Yes. Paste a string and each character is split onto its own row. On the codepoint side you can enter several values separated by spaces or commas, such as U+3042 U+0041.

Q. What input formats does the codepoint side accept? It parses U+3042, 0x3042, and bare hexadecimal like 3042. The leading U+ or 0x prefix is optional.
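As an illustration of how such input could be parsed, here is a minimal sketch. It is not the tool's actual parser: the function name `parse_codepoints`, the case-insensitive prefixes and the error handling are all assumptions made for the example.

```python
import re

# Optional "U+" or "0x" prefix followed by 1 to 6 hex digits.
TOKEN = re.compile(r"(?:U\+|0x)?([0-9A-F]{1,6})", re.IGNORECASE)

def parse_codepoints(text: str) -> list[int]:
    """Parse values such as 'U+3042 0x3042, 3042' into integer codepoints."""
    values = []
    for token in re.split(r"[\s,]+", text.strip()):
        if not token:
            continue  # empty input or stray separators
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"not a codepoint: {token!r}")
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF:
            raise ValueError(f"outside the Unicode range: {token!r}")
        values.append(cp)
    return values

cps = parse_codepoints("U+3042 0x3042, 3042")
print([f"U+{cp:04X}" for cp in cps])    # ['U+3042', 'U+3042', 'U+3042']
print("".join(chr(cp) for cp in cps))   # あああ
```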

Q. Does it handle emoji and characters at U+10000 and above? Yes. Characters beyond the Basic Multilingual Plane (above U+FFFF) take 4 bytes in UTF-8 and appear as a surrogate pair (two 16-bit code units) in UTF-16. The UTF-16 column shows both the high and low surrogate.
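The surrogate pair follows from simple arithmetic defined by UTF-16: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A short sketch, with `utf16_units` as a hypothetical helper name:

```python
def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for a codepoint (one unit, or a surrogate pair)."""
    if cp < 0x10000:
        return [cp]                      # BMP: a single 16-bit unit
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

cp = 0x1F600                                   # grinning face emoji
print([f"{u:04X}" for u in utf16_units(cp)])   # ['D83D', 'DE00']
print(len(chr(cp).encode("utf-8")))            # 4
```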

Q. Are the displayed names the official Unicode names? Not always. For common ranges (Latin letters, digits, Hiragana, Katakana, CJK, emoji) it shows a representative name. It does not bundle the full Unicode names database, so some characters display "—" for the name. Codepoints and byte sequences are accurate across the whole range.
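If you need the official name for a character, a full names database can fill the gap. The sketch below uses Python's standard `unicodedata` module, which is an assumption of this example and not what the tool uses; the helper `display_name` is hypothetical and mirrors the "—" fallback for characters that have no name, such as control characters.

```python
import unicodedata

def display_name(ch: str) -> str:
    """Return the Unicode character name, or "—" when the database has none."""
    return unicodedata.name(ch, "—")

print(display_name("A"))        # LATIN CAPITAL LETTER A
print(display_name("\u3042"))   # HIRAGANA LETTER A
print(display_name("\u200b"))   # ZERO WIDTH SPACE
print(display_name("\x00"))     # —
```

The names returned depend on the Unicode version bundled with the Python interpreter, so very recent characters may still fall back to "—".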

📚 A Codepoint Is Not an Encoding

A Unicode codepoint is the number assigned to a character; UTF-8 and UTF-16 are ways of storing that number as bytes. The same U+3042 (あ) is 3 bytes in UTF-8 (E3 81 82) but 2 bytes in UTF-16 (30 42 in big-endian order), so the byte layout changes with the encoding. Blurring that distinction is what leads to "the character count doesn't match the byte count" and "my database column is too short" surprises.
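The difference is easy to reproduce. A minimal sketch using Python's built-in codecs; big-endian UTF-16 without a byte order mark is chosen here only to keep the output short:

```python
ch = "\u3042"                      # あ, codepoint U+3042

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")     # big-endian, no BOM

print(f"U+{ord(ch):04X}")          # U+3042
print(utf8.hex(" ").upper())       # E3 81 82
print(utf16.hex(" ").upper())      # 30 42
print(len(utf8), len(utf16))       # 3 2
```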

UTF-8 is designed so ASCII (U+0000–U+007F) stays a single byte, which is why English-heavy text looks like one character per byte. Mix in Japanese or emoji and a single character swells to 3 or 4 bytes. A glyph that looks the same size on screen can still occupy 4 UTF-8 bytes or 2 UTF-16 units in memory, and seeing that gap laid out per character is exactly what this tool is for.
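To see how the three counts drift apart on mixed text, the following sketch compares them side by side. The helper name `sizes` is hypothetical; the counts themselves come straight from the encodings.

```python
def sizes(text: str) -> dict[str, int]:
    """Count codepoints, UTF-8 bytes and UTF-16 code units for a string."""
    return {
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
    }

for sample in ["A", "\u3042", "\U0001F600", "A\u3042\U0001F600"]:
    print(sample, sizes(sample))
# A        -> 1 codepoint, 1 UTF-8 byte,  1 UTF-16 unit
# あ       -> 1 codepoint, 3 UTF-8 bytes, 1 UTF-16 unit
# 😀       -> 1 codepoint, 4 UTF-8 bytes, 2 UTF-16 units
# Aあ😀    -> 3 codepoints, 8 UTF-8 bytes, 4 UTF-16 units
```

When sizing a database column, check whether its limit is expressed in characters, bytes or code units, since the three numbers above can all differ for the same string.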