Reverse a string C# extension method (Unicode safe)

Description

This extension method reverses a string. It also handles emojis such as 😀 and 🚀, and accented characters such as é, ü, and ñ, each of which may be stored as a single UTF-16 code unit or as a combination of code units.

Handling emojis and accented characters properly gives a correct reversal: mañana 🚀 becomes 🚀 anañam. A naive reversal of the individual code units can instead produce broken characters, for example mañana 🚀 becoming something like �ana�am�.


Extension method code

using System;
using System.Collections.Generic;
using System.Globalization;

namespace Illumonos.Extensions.Strings;

public static partial class StringExtensions
{
    /// <summary>
    /// Reverses the characters in the specified string while preserving visually combined characters, such as emoji and accented letters.
    /// </summary>
    /// <param name="value">The string to reverse.</param>
    /// <returns>
    /// A new string with the visual characters in reverse order. This includes correct handling of combined characters like emoji and letters with accents.
    /// If the input string is empty or contains only one character, it is returned unchanged.
    /// </returns>
    /// <exception cref="ArgumentNullException">
    /// Thrown when the <paramref name="value"/> parameter is <c>null</c>.
    /// </exception>
    public static string Reverse(this string value)
    {
        ArgumentNullException.ThrowIfNull(value);
        
        if (value.Length <= 1)
        {
            return value;
        }

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);
        List<string> elements = [];

        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        
        elements.Reverse();

        return string.Concat(elements);
    }
}

Code units

Below you will find a description of code units and why reversing them is more complex than it may first appear, along with a detailed explanation of the extension method code.

In C#, accented characters and emojis can be made up of multiple UTF-16 code units. Simply reversing the string by characters can split these sequences, causing invalid or unreadable output.

C# strings are UTF-16, so each char is one 16-bit code unit that holds all or part of a character. Basic characters such as A, B, C, 1, 2, and 3 fit into a single code unit. Characters outside the Basic Multilingual Plane, which includes most emojis, need two code units, known as a surrogate pair. Accented characters may be stored either as a single precomposed code unit or as a base character followed by one or more combining characters, which together form a combining character sequence.
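To make the problem concrete, here is a minimal sketch of the naive approach, which reverses the individual code units. The NaiveReverse helper is a hypothetical name used only for this illustration and is not part of the article's extension method.

```csharp
using System;

// Hypothetical helper: reverses the individual UTF-16 code units (chars).
static string NaiveReverse(string value)
{
    char[] chars = value.ToCharArray();
    Array.Reverse(chars);
    return new string(chars);
}

string rocket = "🚀";                  // One emoji, two code units (a surrogate pair)
string broken = NaiveReverse(rocket);

// The two surrogates are now in the wrong order, so the result is not valid UTF-16.
Console.WriteLine(char.IsLowSurrogate(broken[0]));   // Output: True
Console.WriteLine(char.IsHighSurrogate(broken[1]));  // Output: True

string accented = "e\u0301x";          // e + combining acute accent, then x
string detached = NaiveReverse(accented);

// The combining accent now follows x instead of e.
Console.WriteLine(detached == "x\u0301e");           // Output: True
```

In the first case the surrogate pair is split and swapped, and in the second the accent attaches to the wrong letter. Reversing by text element avoids both problems.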

Accented characters and code units

string singleCodeUnit = "é";                // U+00E9

Console.WriteLine(singleCodeUnit);          // Output: é
Console.WriteLine(singleCodeUnit.Length);   // Output: 1

string combinedCodeUnit = "e\u0301";        // U+0065 + U+0301

Console.WriteLine(combinedCodeUnit.Length); // Output: 2
Console.WriteLine(combinedCodeUnit);        // Output: é

In the example above, the accented character é is defined in two ways. The first variable uses a single code unit; the second combines two code units: the letter e and the combining acute accent U+0301. The combining accent is hard to see on its own, but it looks like a small tick slanting the opposite way to a backtick.

When outputting both variables to the console, they both produce é, and you can see from the example that when checking the length of each, the singleCodeUnit variable is 1 and the combinedCodeUnit is 2.
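As a small sketch of how to count what a reader sees rather than code units, StringInfo exposes a LengthInTextElements property. The variable names follow the example above.

```csharp
using System;
using System.Globalization;

string singleCodeUnit = "\u00E9";      // é as one precomposed code unit
string combinedCodeUnit = "e\u0301";   // e followed by a combining acute accent

// Length counts UTF-16 code units; LengthInTextElements counts text elements.
Console.WriteLine(new StringInfo(singleCodeUnit).LengthInTextElements);   // Output: 1
Console.WriteLine(new StringInfo(combinedCodeUnit).LengthInTextElements); // Output: 1

// The two strings look identical but differ when compared code unit by code unit.
Console.WriteLine(singleCodeUnit == combinedCodeUnit);                    // Output: False
```

Both forms count as one text element, even though their Length values differ and an ordinal comparison treats them as different strings.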

Emojis and code units

string emoji = "🚀";
Console.WriteLine(emoji.Length); // Output: 2

In the example above, you can again see that the length of the emoji is two, showing that two code units are used to represent a single emoji.
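The two code units can be inspected directly. The following sketch uses the static helpers on char to show that they form a surrogate pair and to recover the code point they encode.

```csharp
using System;

string emoji = "🚀";   // U+1F680

Console.WriteLine(char.IsHighSurrogate(emoji[0]));   // Output: True
Console.WriteLine(char.IsLowSurrogate(emoji[1]));    // Output: True
Console.WriteLine(char.IsSurrogatePair(emoji, 0));   // Output: True

// Combine the pair back into the single code point it represents.
int codePoint = char.ConvertToUtf32(emoji, 0);
Console.WriteLine($"U+{codePoint:X}");               // Output: U+1F680
```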


Breaking down the extension method

public static string Reverse(this string value)
{
    ArgumentNullException.ThrowIfNull(value);

    if (value.Length <= 1)
    {
        return value;
    }

The initial part of the extension method will be familiar to most developers. We throw an exception if the input string is null. That makes sense, as the string parameter type has not been annotated with a ? to indicate that it could be null, and throwing an ArgumentNullException is the standard way to handle a consumer passing a null string. The length check that follows returns strings of zero or one code unit unchanged, since there is nothing to reorder.

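If we did not want to throw on null string values, we could remove the ArgumentNullException.ThrowIfNull(value); call and annotate the parameter as nullable, like so: public static string? Reverse(this string? value). That change alone is not enough, though: value.Length would then throw a NullReferenceException for a null input, so the early return must also cover null, for example with if (value is null || value.Length <= 1). The return type becomes string? because a null input is passed straight back.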
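A null-tolerant variant might look like the sketch below. The class name StringExtensionsSketch and the method name ReverseOrNull are hypothetical, chosen so the sketch does not clash with the article's method. It assumes that returning null for a null input is the desired behaviour, and it checks for null before touching Length.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

Console.WriteLine(StringExtensionsSketch.ReverseOrNull(null) is null);  // Output: True
Console.WriteLine("abc".ReverseOrNull());                               // Output: cba

public static class StringExtensionsSketch
{
    // Hypothetical null-tolerant variant: a null input is returned as null.
    public static string? ReverseOrNull(this string? value)
    {
        // The null check must come first, otherwise value.Length would throw.
        if (value is null || value.Length <= 1)
        {
            return value;
        }

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);
        List<string> elements = new List<string>();

        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }

        elements.Reverse();

        return string.Concat(elements);
    }
}
```

Whether to throw or to pass null through is a design choice. Throwing surfaces caller mistakes early, while the nullable variant suits pipelines where a missing value is expected.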

Text Elements and Unicode-safe string enumeration

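You can think of a text element in .NET as a single user-perceived character: either one Unicode character or a sequence that displays as one, such as a base character like e followed by one or more combining marks (for example the combining acute accent mentioned earlier), or the surrogate pair of an emoji. From .NET 5 onwards, a text element corresponds to a Unicode extended grapheme cluster. Processing a string by text element keeps emojis and accented letters intact.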

TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);
List<string> elements = [];

while (enumerator.MoveNext())
{
    elements.Add(enumerator.GetTextElement());
}

StringInfo.GetTextElementEnumerator(string) returns a TextElementEnumerator that allows enumeration of the text elements in the string value.

The difference between using the GetTextElementEnumerator method and just converting the value string to an array and reversing it is that GetTextElementEnumerator supports characters where more than one code unit makes up the character. Therefore, for every enumeration of the enumerator, you receive a correct string containing either a single letter, number, special character, a combined-unit accented character, or an emoji.

Another way to see the difference: enumerating a string yields one char per iteration, whereas a text element enumerator yields a string per iteration, which can hold a whole combining character sequence or surrogate pair. That is what keeps accented characters and emojis intact.

To enumerate our string, we use a while loop with the MoveNext method, and in the body of the loop we add each text element, obtained with GetTextElement, to a List<string> named elements.
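To see what the enumerator yields, the following sketch walks a short sample string and prints where each text element starts and how many code units it spans. The sample string is made up for illustration.

```csharp
using System;
using System.Globalization;

string value = "e\u0301🚀a";   // é (two code units), rocket (two code units), a

TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);

while (enumerator.MoveNext())
{
    string element = enumerator.GetTextElement();

    // ElementIndex is the position of the element's first code unit in the original string.
    Console.WriteLine($"index {enumerator.ElementIndex}, code units {element.Length}");
}

// Output:
// index 0, code units 2
// index 2, code units 2
// index 4, code units 1
```

Five code units are grouped into three text elements, which is exactly the grouping the reversal relies on.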

Finishing up

elements.Reverse();

return string.Concat(elements);

In the final part of the method, we call List<T>.Reverse, from the System.Collections.Generic namespace, to reverse the elements list in place, then use string.Concat to join the text element strings into the result.
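Putting the pieces together, here is a short usage sketch. The method body is repeated from the article, minus the namespace and the partial modifier, so that the sketch compiles on its own. The sample inputs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

Console.WriteLine("abc".Reverse() == "cba");               // Output: True
Console.WriteLine("mañana 🚀".Reverse() == "🚀 anañam");   // Output: True

// The combining accent stays attached to its base letter e.
Console.WriteLine("e\u0301x".Reverse() == "xe\u0301");     // Output: True

public static class StringExtensions
{
    // Same logic as the article's method, repeated here to keep the sketch self-contained.
    public static string Reverse(this string value)
    {
        ArgumentNullException.ThrowIfNull(value);

        if (value.Length <= 1)
        {
            return value;
        }

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);
        List<string> elements = new List<string>();

        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }

        elements.Reverse();

        return string.Concat(elements);
    }
}
```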


Summary

In our version of a string reversal extension method, most of the code exists to handle characters made up of multiple code units correctly. Callers can therefore reverse text containing emojis, which are increasingly common, and accented characters, which appear in many languages other than English, without corrupting it.

