Pular para o conteúdo principal

Collation Confusion: Demystifying VARBINARY vs VARCHAR in Database Design


Have you ever wondered why your database sorts strings differently than expected, or why a seemingly simple query delivers "quirky" results? If you’ve worked with databases like MySQL, PostgreSQL, or SQL Server, you’ve likely stumbled across the mysterious world of collation and the subtle but critical differences between VARBINARY and VARCHAR data types.

In this article I hope to unravel these concepts for you, explore how collation shapes database behavior, and dive into an experimental case study to reveal surprising insights. Whether you’re a developer or a database engineer, this deep dive will equip you with practical knowledge to avoid common pitfalls and optimize your database designs.

Let’s get started!

What Is Collation and Why Does It Matter?

Collation refers to the set of rules a database uses to compare and sort character strings. It governs how strings are ordered (e.g., is “Apple” less than “apple”?), how accents are treated (e.g., does “é” equal “e”?), and even how special characters are handled. Collation is critical because it directly impacts query results, sorting, and indexing, especially in multilingual applications or systems handling diverse datasets.

Different databases handle collation in different ways:

  • MySQL: Offers a wide range of collations (e.g., utf8mb4_unicode_ci for case-insensitive Unicode or utf8mb4_bin for binary comparisons). You can set collation at the server, database, table, or column level, providing fine-grained control.
  • PostgreSQL: Uses the operating system’s locale settings for collation (e.g., en_US.UTF-8) but allows custom collation definitions for specific columns or queries. It’s flexible but can be sensitive to system-level settings.
  • SQL Server: Relies on collations like SQL_Latin1_General_CP1_CI_AS (case-insensitive, accent-sensitive). It ties collation to code pages, which can affect character encoding and sorting behavior.
Another very popular database that I will leave out of the other comparisons at this article is SQLite but I'd like to leave the note that while it doesn't have native support for collation, it does allow for custom collation functions through application code, offering flexibility at the cost of additional complexity.

Why is collation a “nice thing” to have? It ensures consistent string handling across diverse datasets, supports internationalization (e.g., proper sorting for non-English languages), and prevents unexpected query behavior. Without collation, your database might treat “Café” and “Cafe” as entirely different strings or worse, sort them inconsistently.

VARBINARY vs VARCHAR: The Core Differences

To understand how collation interacts with data types, let’s compare VARBINARY and VARCHAR:

VARCHAR: Stores character data with a specific character set (e.g., UTF-8) and collation. It interprets data as text, applying collation rules during comparisons and sorting. For example, in MySQL’s utf8mb4_general_ci collation, “Apple” and “apple” are treated as equal in a case-insensitive comparison. It can handle the difference even in those situation but you would need to specify that need.

VARBINARY: Stores raw binary data as a sequence of bytes, ignoring character sets and collation. Comparisons are byte-by-byte, making VARBINARY case-sensitive and accent-sensitive by default. For instance, “Apple” and “apple” are distinct because their ASCII byte values differ.

Diferencies on DBMSs

VARBINARY may have different names across database systems, such as BINARY, BYTEA, or even BIN. These types are used to store binary data (i.e., raw bytes), but they are not necessarily bit-specific fields, they store byte sequences, not individual bits.

| Term       | Database System     | Description                                         |
|------------|---------------------|-----------------------------------------------------|
| `VARBINARY`| SQL Server, MySQL   | Variable-length binary data                         |
| `BINARY`   | SQL Server, MySQL   | Fixed-length binary data                            |
| `BYTEA`    | PostgreSQL          | Binary string type for storing raw bytes            |
| `BIN`      | Not a standard type | Sometimes used informally or in proprietary systems |

Storage and Performance

VARCHAR: Stores text in the specified character set, which may use varying byte lengths per character depending on the encoding. For example, latin1 (used in MySQL) uses 1 byte per character, supporting Western European languages with simpler storage needs but limited character range compared to UTF-8 (1–4 bytes for broader Unicode support). Collation-aware comparisons, such as "case-insensitive" or "accent-sensitive" sorting, may incur additional processing overhead, particularly for complex Unicode rules or collations like latin1_swedish_ci that prioritize specific linguistic conventions.

VARBINARY: Stores data as a sequence of raw bytes, with each character occupying exactly the number of bytes required by its binary representation (e.g., a string like “Café” stored as VARBINARY uses the exact byte sequence of its encoding, such as 5 bytes in ASCII or UTF-8). Binary comparisons are faster because they bypass collation rules, directly comparing byte values without linguistic interpretation, but this sacrifices text-specific features like "case-insensitive" searches or "accent normalization", making it unsuitable for text data requiring human-readable sorting or matching.

Database-Specific Behaviors

MySQL: VARBINARY is unaffected by collation, treating data as raw bytes. VARCHAR respects the column’s collation, which can lead to slower queries if complex rules (e.g., Unicode sorting) are involved.

PostgreSQL: Similar to MySQL, BYTEA (the equivalent of VARBINARY) stores raw bytes, while TEXT or VARCHAR uses collation. PostgreSQL’s locale-based collation can cause platform-specific sorting differences.

SQL Server: VARBINARY stores binary data, while NVARCHAR or VARCHAR applies collation rules. SQL Server’s collation choices can impact performance, especially for large datasets with accent-sensitive comparisons.

Experimental Case Study: VARBINARY vs VARCHAR in MySQL 8

Let’s dive into a hands-on experiment to see how VARBINARY and VARCHAR behave differently under collation in MySQL 8. We’ll create a sample dataset, run queries, and observe the results.

Setup

Create a table with both VARCHAR and VARBINARY columns containing identical data:


CREATE TABLE test_collation (
    id INT AUTO_INCREMENT PRIMARY KEY,
    text_varchar VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    text_varbinary VARBINARY(50)
);

INSERT INTO test_collation (text_varchar, text_varbinary) VALUES
('Apple', 'Apple'),
('apple', 'apple'),
('Café', 'Café'),
('Cafe', 'Cafe');

Query 1: Sorting Behavior

Let’s sort the data and see how collation affects the results:
SELECT text_varchar FROM test_collation ORDER BY text_varchar;
SELECT text_varbinary FROM test_collation ORDER BY text_varbinary;
Output for text_varchar (utf8mb4_unicode_ci):
Apple
apple
Café
Cafe
Output for text_varbinary:
Apple
Cafe
Café
apple
Insight: The VARCHAR column, using utf8mb4_unicode_ci, sorts case-insensitively, grouping “Apple” and “apple” together and treating “Café” and “Cafe” similarly due to its collation rules. VARBINARY, ignoring collation, sorts based on raw byte values, resulting in a different order (e.g., uppercase letters have lower ASCII values than lowercase ones). Note that specific MySQL configurations, such as SQL_MODE settings (e.g., PAD_CHAR_TO_FULL_LENGTH), may subtly affect string handling or query behavior, so always verify your database’s configuration for consistent results.

Query 2: Comparison Behavior

Now, let’s compare strings in a WHERE clause:
SELECT text_varchar FROM test_collation WHERE text_varchar = 'apple';
SELECT text_varbinary FROM test_collation WHERE text_varbinary = 'apple';
Output for text_varchar:
Apple
apple
Output for text_varbinary:
apple
The VARCHAR query returns both “Apple” and “apple” due to case-insensitive collation. The VARBINARY query only returns “apple” because it performs a byte-exact match. Again, it may vary depending on the SQL_MODE settings explained above.

An Unusual Twist: Collation in Mixed Queries

Here’s an experimental scenario: what happens when you join a VARCHAR and VARBINARY column? Let’s create a second table and perform a join:
CREATE TABLE test_collation2 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    text_varbinary VARBINARY(50)
);

INSERT INTO test_collation2 (text_varbinary) VALUES ('apple');

SELECT t1.text_varchar, t2.text_varbinary
FROM test_collation t1
JOIN test_collation2 t2 ON t1.text_varchar = t2.text_varbinary;
Output:
text_varchar | text_varbinary
-------------|---------------
Apple        | apple
apple        | apple
MySQL implicitly converts the VARBINARY data to VARCHAR during the join, applying the VARCHAR column’s collation (utf8mb4_unicode_ci). This results in a case-insensitive match, which might surprise developers expecting a binary comparison.

This behavior varies across databases:
PostgreSQL: Joins between TEXT and BYTEA require explicit casting, or the query will fail due to type mismatch.
SQL Server: Similar to MySQL, but collation precedence rules determine which collation applies, potentially leading to unexpected matches.

Collation Pitfalls to Avoid

Mixing VARBINARY and VARCHAR or using the wrong collation can lead to subtle bugs:Inconsistent Sorting: Using VARBINARY for text data can cause unexpected sort orders, especially in multilingual applications.
  • Join Issues: As shown in the experiment, joining VARCHAR and VARBINARY can lead to implicit conversions, skewing results.
  • Performance Traps: Case-insensitive collations (e.g., utf8mb4_unicode_ci) are slower than binary comparisons, especially for large datasets.
  • Cross-Database Portability: Collation behavior varies across databases, so a MySQL query may behave differently in PostgreSQL or SQL Server.

Best Practices for Choosing Between VARBINARY and VARCHAR

  1. Use VARCHAR for Text Data: If you need collation-aware sorting or searching (e.g., for user-facing data), choose VARCHAR with an appropriate collation.
  2. Use VARBINARY for Binary Data: Store raw bytes (e.g., hashes, encrypted data) in VARBINARY to avoid collation overhead.
  3. Test Collation Impact: Always test queries with your chosen collation, especially in multilingual applications.
  4. Explicit Casting in Joins: When mixing types, use explicit casts (e.g., CAST(text_varbinary AS CHAR)) to avoid surprises.
IMPORTANT: Understand Your Database! Research how your database (MySQL, PostgreSQL, SQL Server, etc.) handles collation to ensure consistent behavior.

Try It Yourself!

Want to replicate this experiment? Try the SQL scripts in an online SQL playground like DB Fiddle. Experiment with different collations (e.g., utf8mb4_bin vs. utf8mb4_unicode_ci) to see how they affect results.

Master Collation for Better Database Design

Collation is a powerful feature that shapes how databases handle string data, but it can also be a source of confusion when mixing VARBINARY and VARCHAR. By understanding collation rules, testing their impact, and choosing the right data type for your use case, you can avoid pitfalls and build more robust applications. Whether you’re working with MySQL, PostgreSQL, or SQL Server, a little experimentation goes a long way in demystifying these concepts.

What’s your experience with collation or data type quirks? Have you encountered unexpected sorting or query behavior? Share your thoughts in the comments or try the experiment yourself! If you found this article helpful, share it with your network to spark a discussion on database design.

Comentários

Postagens mais visitadas deste blog

Article: Preventing Database Gridlock: Recognizing and Resolving Deadlock Scenarios

Learn how deadlocks occur in database systems, understand their impact on performance, and discover practical techniques for identifying potential deadlock scenarios in your SQL code.

PHP: Always use an index to add a values to an array

  Demand Always identify the array index when manipulating the array directly Description Whenever you are adding a new item to the array you must add an index so that php does not need to recalculate the size of the array to identify in which position the new content will be added. Examples 1: # Bad code: 2: do { 3: $array[] = $newItem; 4: } while(!feof($fp)); 5: 6: # Good code 7: $array = []; 8: $index = 0; 9: do { 10: $array[$index++] = $newItem; 11: } while(!feof($fp)); Examples Explanation When instantiating a new item in an array in the "bad example" php will identify that you expect the item to be added to the end of the array, since array is a primitive type and has no attributes php always needs to call a calculation function to identify the current size of the array before trying to add a new item to the end of it. By monitoring the growth of the array in an external variable and automatically reference which position will...

General: Prevent arbitrary precision arithmetic

Demand Some simple decimal calculations like 4.6 * 100 may lead in irregularities when dealing with numbers. Description During arithmetic operations like addition, subtraction, multiplication, and division, the numbers are manipulated according to the rules defined by the IEEE 754 standard. However, due to the finite precision of the representation, rounding errors may occur, leading to small discrepancies between the expected and actual results of computations. In simpler terms, the computer operates using powers of 2 (binary-based), so whenever we need to work with a number that cannot be exactly represented in binary (like 0.1, which is a base-10 fraction), it uses an approximation rather than an infinite number and is incapable of presenting the exact correct value. To prevent it we may use some libraries to prevent the problem. In Javascript and Typescript we have the BigNumbers library (and others). PHP has a built-in library for arbitrary-precision arithmetic called BCMath (Bin...