Generic Run-length encoding (RLE) for C#

RLE is a very simple form of data compression that deflates repeated elements in a sequence. For a current project, I needed an implementation of RLE in C#. So I looked around on the internet. I was surprised that most implementations were quite simplistic and didn’t nearly fit my needs. One shortcoming of many implementations was that they exclusively work on strings that must not contain any digit. That’s a strong limitation and not what RLE should be used for. How many texts do you know that contain many repeated (more than 2) characters? So string compression is hardly a good use case for RLE. That’s why I decided to write my own generic implementation, which is quite tiny, but way more flexible.

Run-length encoding

But let’s start by having a closer look at RLE.

As I already stated, RLE deflates repeated characters in a sequence. So if there are no repeated characters, the sequence is preserved:

input:   1 2 3 4 5 6 5 4 3 2 1
rle:     1 2 3 4 5 6 5 4 3 2 1

It gets interesting once we introduce repeated values:

input:   1 2 3 3 3 3 3 4 3 2 1
rle:     1 2 M 5 3 4 3 2 1

M stands for a data-type specific marker value. This marker tells the decoder that the following two values are actually a compressed run. The first value is the run’s length and the second value is the run’s value. We have 5 3s, so this is encoded with M 5 3. The marker is usually the greatest number in the data-type’s range.

One problem comes up when the marker’s value is part of the input data. Assume that the input data is of type byte, then the marker value is usually 255.

input:   1 2 3 4 255 6 5 4 3 2 1

If we don’t take care of this, the encoding and decoding would be as follow:

rle:     1 2 3 4 255 6 5 4 3 2 1
decoded: 1 2 3 4 5 5 5 5 5 5 4 3 2 1

Not what we wanted. That’s why the marker’s value should be escaped with M M. If we can assure that a run’s length is always less than M, this works perfectly. The result would be:

rle:     1 2 3 4 255 255 6 5 4 3 2 1
decoded: 1 2 3 4 255 6 5 4 3 2 1

Ensuring the maximum run length is also quite simple. If we have a run that is longer, we can just split it into several runs.

That’s pretty much all the magic behind RLE.

Implementation

I decided to make my implementation generic in terms of the data’s type. So the implementation can be used for byte data, short data, basically for all integer data types, but also for strings (because they are IEnumerable<char>). Here it is:

/// <summary>
/// Provides the RLE codec for any integer data type.
/// </summary>
/// <typeparam name="T">The data's type. Must be an integer type or an ArgumentException will be thrown</typeparam>
static class RLE<T> where T : struct, IConvertible
{
    /// <summary>
    /// This is the marker that identifies a compressed run
    /// </summary>
    private static T rleMarker;

    /// <summary>
    /// A run can be at most as long as the marker - 1
    /// </summary>
    private static ulong maxLength;

    static RLE()
    {
        GetMaxValues();
    }

    /// <summary>
    /// RLE-Encodes a data set.
    /// </summary>
    /// <param name="data">The data to encode</param>
    /// <returns>Encoded data</returns>
    public static IEnumerable<T> Encode(IEnumerable<T> data)
    {
        var enumerator = data.GetEnumerator();

        if (!enumerator.MoveNext())
            yield break;

        var firstRunValue = enumerator.Current;
        ulong runLength = 1;
        while(enumerator.MoveNext())
        {
            var currentValue = enumerator.Current;
            // if the current value is the value of the current run, don't yield anything, 
            // just extend the run
            if (currentValue.Equals(firstRunValue))
                runLength++;
            else
            {
                // the current value is different from the current run
                // yield what we have so far
                foreach (var item in MakeRun(firstRunValue, runLength))
                    yield return item;
                    
                // and reset the run
                firstRunValue = currentValue;
                runLength = 1;
            }
            // if there are very many identical values, don't exceed the max length
            if (runLength > maxLength)
            {
                foreach (var item in MakeRun(firstRunValue, maxLength))
                    yield return item;
                runLength -= maxLength;
            }
        }
        //yield everything that has been buffered
        foreach (var item in MakeRun(firstRunValue, runLength))
            yield return item;
    }

    /// <summary>
    /// Decodes RLE-encoded data
    /// </summary>
    /// <param name="data">RLE-encoded data</param>
    /// <returns>The original data</returns>
    public static IEnumerable<T> Decode(IEnumerable<T> data)
    {
        var enumerator = data.GetEnumerator();
        if (!enumerator.MoveNext())
            yield break;

        do
        {
            var value = enumerator.Current;
            if (!value.Equals(rleMarker))
            {
                //an ordinary value
                yield return value;
            }
            else
            {
                //might be flag or escape
                //examine the next value
                if (!enumerator.MoveNext())
                    throw new ArgumentException("The provided data is not properly encoded.");
                if (enumerator.Current.Equals(rleMarker))
                {
                    //escaped value
                    yield return value;
                }
                else
                {
                    //rle marker
                    var length = enumerator.Current.ToInt64(CultureInfo.InvariantCulture);
                    if (!enumerator.MoveNext())
                        throw new ArgumentException("The provided data is not properly encoded.");
                    var val = enumerator.Current;
                    for (var j = 0; j < length; ++j)
                        yield return val;
                }
            }
        }
        while (enumerator.MoveNext());
    }

    private static IEnumerable<T> MakeRun(T value, ulong length)
    {
        if ((length <= 3 && !value.Equals(rleMarker)) || length <= 1)
        {
            //don't compress this run, it is just too small
            for (ulong i = 0; i < length; ++i)
            {
                if (value.Equals(rleMarker))
                    yield return rleMarker;
                yield return value;
            }
        }
        else
        {
            //compressed run
            yield return rleMarker;
            yield return (T)(dynamic)length;
            yield return value;
        }
    }

    private static void GetMaxValues()
    {
        object maxValue = default(T);
        TypeCode typeCode = Type.GetTypeCode(typeof(T));
        switch (typeCode)
        {
            case TypeCode.Byte:
                {
                    var limit = byte.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.Char:
                {
                    var limit = char.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.Int16:
                {
                    var limit = short.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.Int32:
                {
                    var limit = int.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.Int64:
                {
                    var limit = long.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.SByte:
                {
                    var limit = sbyte.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.UInt16:
                {
                    var limit = ushort.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.UInt32:
                {
                    var limit = uint.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            case TypeCode.UInt64:
                {
                    var limit = ulong.MaxValue;
                    rleMarker = __refvalue( __makeref(limit),T);
                    maxLength = (ulong)(limit - 1);
                    break;
                }
            default:
                throw new ArgumentException("The provided type parameter is not an integer type");
        }
    }
}

Using it is quite simple:

var data = new byte[] { 1, 2, 3, 4, 5, 5, 5, 5, 5, 4, 3, 2, 1 };
var rle = RLE<byte>.Encode(data);
var decoded = RLE<byte>.Decode(rle);

Trace.WriteLine("RLE-encoded data: " + string.Join(", ", rle.Select(d => d.ToString())));
Trace.WriteLine("Decoded data: " + string.Join(", ", decoded.Select(d => d.ToString())));

Output:

RLE-encoded data: 1, 2, 3, 4, 255, 5, 5, 4, 3, 2, 1
Decoded data: 1, 2, 3, 4, 5, 5, 5, 5, 5, 4, 3, 2, 1

Since the methods take and return IEnumerable<T>, you can provide any iterable data structure, be it an array, a list or an iterator method. With the extension methods ToList() and ToArray() you can transform it to any data structure you need.

Here is the proof of my earlier statement. My implementation can also encode strings:

var data = "This is a simple texxxxxt with artificialllly repeatedddd characterssss.";
var rle = RLE<char>.Encode(data);
var decoded = RLE<char>.Decode(rle);

Trace.WriteLine("RLE-encoded data: " + string.Join("", rle.Select(d => d.ToString())));
Trace.WriteLine("Decoded data: " + string.Join("", decoded.Select(d => d.ToString())));

Output:

RLE-encoded data: This is a simple te�xt with artificia�ly repeate�d character�s.
Decoded data: This is a simple texxxxxt with artificialllly repeatedddd characterssss.

The marker has a quite high value (which gets lost, because I can just input ASCII characters in the WordPress text box). The length value is quite small and does not have a visible representation.

Advertisements

, , ,

  1. #1 by pragmateek on June 9, 2014 - 6:39 pm

    Interesting, thanks for sharing. 🙂
    But shouldn’t your second example rle be “1 2 M 5 3 4 3 2 1” instead of “1 2 M 5 3 5 3 2 1”?…

  2. #2 by nicoschertler on June 9, 2014 - 6:43 pm

    Of course it should. Thanks for spotting the mistake.

  3. #3 by pragmateek on June 9, 2014 - 6:48 pm

    Thanks for the quick fix.

  4. #4 by robertmccabe25 on September 7, 2014 - 11:20 am

    The “Encode” method is never stepped into when I run for some reason and the data is always null, no compression ever takes place. What am I doing wrong?

  5. #5 by nicoschertler on September 7, 2014 - 11:29 am

    Hi Robert. The Encode method is an iterator method which is only evaluated on demand (i.e. when the data are needed). This can be achieved with a for-each loop or the call of ToList() or similar methds on the result.

  6. #6 by Dale Brubaker on March 21, 2015 - 8:15 am

    Thank you, Nico, for sharing this. In case others run into this issue, I wasn’t able to compile the undocumented keywords in the lines like: rleMarker = __refvalue( __makeref(limit),T); So I replaced them with this method:

    private T LoadIntoTypeT(object value)
    {
    var changeType = Convert.ChangeType(value, typeof(T));
    if (changeType != null)
    return (T)changeType;
    return default(T);
    }

  7. #7 by nicoschertler on March 21, 2015 - 10:06 pm

    Hi Dale, thanks for your comment. The strange reference value construct is used to avoid boxing and should compile just fine.

  8. #8 by Armin on January 9, 2016 - 9:31 pm

    Thanks for sharing this Nico! Saved me some hours for an own implementation.
    Like Dale I had problems with __refvalue (seems like Unity doesn’t allow it for some reasons). So thanks fale for your Alternative, even if it’s slightly slower because of boxing.

  9. #9 by Anonymous on February 27, 2017 - 11:06 pm

    Very nice Nico! I am trying to understand the efficiency in this algo…it looks it is O(n) when numbers dont repeat…i am dont follow the efficiency when there are repeats.

  10. #10 by nicoschertler on February 28, 2017 - 8:11 am

    The time complexity of the encoding procedure is indeed linear. The operation is basically the loop that iterates the input. For every input, a constant-time operation is executed – either recording the run or adding the run to the output. And adding a run (method MakeRun()) emits at most three characters and is therefore O(1).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: