Understanding ASCII and Unicode: Character Encoding Explained

February 15, 2026 · 9 min read · Developer

Every piece of text on a computer — this article, your emails, your code — is stored as numbers. Character encoding is the system that maps those numbers to the letters, symbols, and emoji you see on screen. Understanding encoding is fundamental for developers because encoding bugs are some of the most frustrating and hard-to-debug issues in software.

From garbled text in databases to mysterious question marks in emails, encoding problems are everywhere. This guide explains ASCII, Unicode, UTF-8, and everything you need to know about how computers handle text.

What Is ASCII?

ASCII (American Standard Code for Information Interchange) was developed in the 1960s as a standard way to represent text in computers. It defines 128 characters, each mapped to a number from 0 to 127. For example, 'A' is 65, 'a' is 97, '0' is 48, and space is 32; codes 0–31 are non-printing control characters such as tab (9) and newline (10).

ASCII uses 7 bits per character, meaning it can represent exactly 128 characters. This was enough for English text — letters, numbers, basic punctuation. But it couldn't handle accented characters (é, ñ, ü), non-Latin scripts (Chinese, Arabic, Japanese), or any of the thousands of symbols used worldwide.
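The character-to-number mapping can be explored directly with Python's built-in ord and chr functions, a quick sketch:

```python
# ord() maps a character to its numeric code; chr() goes the other way.
assert ord("A") == 65
assert chr(97) == "a"

# Every ASCII character's code fits in 7 bits (i.e. is below 128).
for ch in "Hello, World!":
    assert ord(ch) < 128
```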

Extended ASCII

Since computers work with bytes (8 bits), the extra bit gave room for 128 more characters (128–255). Different systems used this space for different characters, creating incompatible "extended ASCII" standards — ISO 8859-1 (Western European), Windows-1252, KOI8-R (Russian), and many others. A file written in one encoding would display garbage when opened with another. This was the era of mojibake — garbled text caused by encoding mismatches.

Enter Unicode

Unicode was created in the late 1980s to solve the encoding chaos. Its goal was ambitious: assign a unique number (called a code point) to every character in every writing system ever used by humans.

Unicode doesn't just cover the major alphabets. It includes mathematical symbols, musical notation, ancient scripts like Egyptian hieroglyphs, technical symbols, and yes — emoji. As of Unicode 15.0, there are over 149,000 characters defined.

Each character has a code point, written as U+ followed by hexadecimal digits: U+0041 is 'A', U+00E9 is 'é', and U+1F600 is '😀'.

The first 128 Unicode code points are identical to ASCII, which means ASCII text is automatically valid Unicode. This backward compatibility was a critical design decision.
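That backward compatibility is easy to verify in Python: pure-ASCII text produces byte-for-byte identical output whether you encode it as ASCII or as UTF-8.

```python
# ASCII text is valid Unicode, and encodes identically under UTF-8.
text = "Hello, World!"
assert text.encode("ascii") == text.encode("utf-8")

# A character's Unicode code point equals its ASCII number
# for the first 128 characters.
assert ord("A") == 0x41  # U+0041
```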

Unicode Encoding: UTF-8, UTF-16, and UTF-32

Unicode defines what characters exist and their code points. But how do you store those code points as bytes? That's where encoding comes in. The three main Unicode encodings differ in how they convert code points to bytes.

UTF-8

UTF-8 is the most widely used encoding on the web (used by over 98% of websites). It uses a variable number of bytes per character: 1 byte for ASCII (U+0000–U+007F), 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes for everything above that, including emoji.

UTF-8's brilliance is its efficiency: English text uses exactly 1 byte per character (same as ASCII), while other scripts use more bytes as needed. It's also self-synchronizing: any single byte tells you whether it starts a character or continues one, so a decoder can always find the next character boundary.
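Both properties can be checked in a few lines of Python: the byte width grows with the code point, and continuation bytes always fall in a recognizable range.

```python
# Byte counts grow with the code point: 1 byte for ASCII,
# up to 4 bytes for emoji and other supplementary characters.
samples = {"A": 1, "é": 2, "中": 3, "😀": 4}
for ch, nbytes in samples.items():
    assert len(ch.encode("utf-8")) == nbytes

# Self-synchronization: continuation bytes always start with the
# bit pattern 10 (0x80-0xBF), so they can never be mistaken for
# the start of a character.
for b in "é".encode("utf-8")[1:]:
    assert 0x80 <= b <= 0xBF
```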

UTF-16

UTF-16 uses 2 bytes for most common characters and 4 bytes for characters above U+FFFF (like emoji). JavaScript and Java both use UTF-16 for strings internally. Emoji and some rare CJK characters are therefore represented as "surrogate pairs" — two 16-bit units that together represent one character. This is why "😀".length returns 2 in JavaScript, not 1.
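The surrogate-pair mechanics can be reproduced in Python by encoding to UTF-16 and recombining the two 16-bit units by hand:

```python
import struct

# U+1F600 is above U+FFFF, so UTF-16 encodes it as a surrogate
# pair: a high surrogate (0xD800-0xDBFF) followed by a low
# surrogate (0xDC00-0xDFFF).
data = "😀".encode("utf-16-be")
assert len(data) == 4  # two 16-bit units
high, low = struct.unpack(">HH", data)

# Recombining the pair recovers the original code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert code_point == 0x1F600
```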

UTF-32

UTF-32 uses 4 bytes for every character. It's simple (every character is the same size) but wasteful — English text uses 4x the memory compared to UTF-8. It's rarely used for storage or transmission but sometimes used internally for easy character indexing.
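A quick Python check of the fixed width, using the explicit-endianness codec so no byte order mark is added:

```python
# UTF-32 is fixed-width: every character takes exactly 4 bytes.
for text in ("Hello", "café", "😀"):
    encoded = text.encode("utf-32-le")
    assert len(encoded) == 4 * len(text)
```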

⚡ Try it yourself: Explore character code points with the Wootils Unicode Lookup tool — search for any character and see its Unicode details.

Common Encoding Problems

Mojibake (Garbled Text)

When text is decoded with the wrong encoding, you get mojibake. The string "café" encoded as UTF-8 becomes "cafÃ©" when decoded as ISO 8859-1. The fix is simple: always declare your encoding explicitly. In HTML, use <meta charset="UTF-8">. In databases, set the connection encoding to UTF-8.
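The garbling is easy to reproduce, and in this particular case reverse, in Python:

```python
# Reproducing mojibake: encode as UTF-8, then decode with the
# wrong encoding (Latin-1 / ISO 8859-1).
good = "café"
garbled = good.encode("utf-8").decode("latin-1")
assert garbled == "cafÃ©"  # é's bytes (0xC3 0xA9) read as two characters

# Latin-1 maps every byte to a character, so the damage is
# reversible: re-encode and decode with the right codec.
assert garbled.encode("latin-1").decode("utf-8") == good
```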

The BOM (Byte Order Mark)

UTF-8 files sometimes start with a BOM — the bytes EF BB BF. While optional, it can cause problems with shell scripts, CSV parsing, and JSON files. Many editors add it automatically. When in doubt, use UTF-8 without BOM.
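Python's utf-8-sig codec handles a leading BOM transparently, which is useful when reading files of unknown origin. A small sketch:

```python
# A UTF-8 file with a BOM starts with the bytes EF BB BF.
raw = b"\xef\xbb\xbfhello"

# A plain utf-8 decode keeps the BOM as U+FEFF, which can confuse
# parsers that expect the text to start immediately.
assert raw.decode("utf-8") == "\ufeffhello"

# The utf-8-sig codec strips a leading BOM if present,
# and is harmless when it's absent.
assert raw.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"
```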

String Length vs Byte Length

In UTF-8, the character "é" is 2 bytes, and "😀" is 4 bytes. A string's character count and byte count can differ significantly. This matters for database column sizes, API payload limits, and buffer allocation.
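When enforcing a byte limit, one common approach is to slice the bytes and let the decoder drop any trailing partial sequence. A minimal sketch (truncate_utf8 is a hypothetical helper name, not a library function):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes without splitting a character:
    slice the encoded bytes, then discard any trailing partial
    sequence via errors="ignore"."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

s = "café😀"  # 5 characters, but 5 + 4 = 9 bytes in UTF-8
assert len(s.encode("utf-8")) == 9
assert truncate_utf8(s, 6) == "café"  # the cut lands mid-emoji; dropped cleanly
```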

Encoding Best Practices

  1. Use UTF-8 everywhere. For files, databases, APIs, and HTML. There's almost never a reason to use anything else in modern software.
  2. Declare encoding explicitly. Don't rely on defaults — they vary by system, locale, and version.
  3. Be careful with string operations. Splitting, reversing, or truncating UTF-8 strings at arbitrary byte positions can corrupt characters.
  4. Test with non-ASCII data. Use real Chinese, Arabic, emoji, and accented characters in your test data.
  5. Normalize Unicode. The same visual character can sometimes be represented multiple ways (composed vs decomposed). Use Unicode normalization (NFC/NFD) when comparing strings.
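Point 5 can be demonstrated with Python's standard unicodedata module: the composed and decomposed forms of "é" look identical but compare unequal until normalized.

```python
import unicodedata

# "é" can be one code point (U+00E9, composed) or two
# (e + U+0301 combining acute accent, decomposed).
composed = "\u00e9"
decomposed = "e\u0301"
assert composed != decomposed  # visually identical, yet unequal

# Normalizing both to the same form (NFC or NFD) fixes comparison.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```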

ASCII and Unicode in Programming

JavaScript

// String.prototype.charCodeAt returns UTF-16 code unit
'A'.charCodeAt(0)  // 65
'é'.charCodeAt(0)  // 233

// For full Unicode code points, use codePointAt
'😀'.codePointAt(0)  // 128512 (U+1F600)

// Spread operator handles Unicode correctly
[...'Hello 😀'].length  // 7 (not 8)

Python

# Python 3 strings are Unicode by default
len('café')      # 4 characters
len('café'.encode('utf-8'))  # 5 bytes

# Unicode escapes
'\u00e9'         # é
'\U0001F600'     # 😀

Emoji and Modern Unicode

Emoji are a fascinating part of Unicode. They started as carrier-specific pictograms on Japanese phones in the 1990s and were incorporated into Unicode in 2010. Today, there are over 3,600 emoji.

Some emoji are complex: skin tone modifiers combine a base emoji with a modifier character. Family emoji combine multiple characters with Zero Width Joiners (ZWJ). The "👩‍💻" emoji (Woman Technologist) is actually three code points joined together: Woman (U+1F469) + ZWJ (U+200D) + Laptop (U+1F4BB). This is why emoji handling in code requires careful attention.
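A short Python sketch that takes the Woman Technologist emoji apart into its constituent code points:

```python
# The 👩‍💻 emoji, written with its Zero Width Joiner made explicit.
emoji = "👩\u200d💻"
points = [f"U+{ord(ch):04X}" for ch in emoji]
assert points == ["U+1F469", "U+200D", "U+1F4BB"]

# Naive character counts see three code points, not one emoji.
assert len(emoji) == 3
```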

Conclusion

Character encoding is one of those topics that seems simple on the surface but has deep complexity underneath. ASCII gave us the foundation, Unicode gave us universality, and UTF-8 gave us the efficient encoding the web runs on. Always use UTF-8, always declare your encoding, and always test with diverse characters. Your future self (and your international users) will thank you.

🔧 Related Wootils Tools:
Unicode Lookup · Text to Binary · Text to Hex · Base64 Encoder