Encode Categories into Int Arrays

I had a hard time trying to encode categories into int arrays manually since it can get a bit overwhelming if there are a lot of features.

The basic logic here:

  1. Separate the data into a two-dimensional string array
  2. Find every column that doesn’t parse into a double (this does not work when the category labels themselves parse as numbers, e.g. a category named 1)
  3. Loop over every item in the column and add it to the list if it isn’t already there
  4. Encode each category as an int array by setting a single element to one, i.e. result[the index of the category] = 1;

Let’s get to the code, shall we?

// needs: using System.Collections.Generic; using System.Linq;
void GetStringOptions(string[] lines, char separator)
{
    string[][] splitedData = new string[lines.Count()][]; // jagged array: one row per line, one entry per feature

    for (int i = 0; i < lines.Count(); ++i)
        splitedData[i] = lines[i].Split(separator);

    List<int> stringColumns = new List<int>();

    for (int i = 0; i < splitedData[0].Count(); ++i)
        if (!double.TryParse(splitedData[0][i], out double value)) // if the value does not parse as a double
            stringColumns.Add(i); // treat the column as a category column

    foreach (int i in stringColumns)
    {
        int options = 0;
        int startingIndex = TypeIntsList.Count(); // index of this column's first item

        for (int j = 0; j < splitedData.Count(); ++j)
            if (!TypeIntsList.Any(x => x.Name == splitedData[j][i])) // adds the value to the list, also counts how many total options there are
            {
                TypeIntsList.Add(new TypeInts() { Name = splitedData[j][i], ValueString = "", Index = i });
                ++options;
            }

        for (int option = 0; option < options; option++) // generates an int array with a single 1 to activate different inputs
        {
            int[] test = new int[options];
            test[option] = 1;
            TypeIntsList[startingIndex + option].ValueString = string.Join(",", test);
        }
    }

    TypeToIntGrid.Items.Refresh(); // refreshes the grid (new items won't show without this)
}

Example output:

Iris-setosa       ->   [1,0,0]
Iris-versicolor   ->   [0,1,0]
Iris-virginica    ->   [0,0,1]

[Image: Auto-Encode Categories]
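
The mapping in the example output can also be reproduced outside the GUI. The following is only a minimal sketch of the same idea, not the code from the demo: the class name OneHotSketch and its Encode method are hypothetical, and it takes a plain list of labels instead of working with TypeIntsList.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of the demo project).
public static class OneHotSketch
{
    // Maps each distinct label to an int array with a single 1,
    // numbering the labels in order of first appearance.
    public static Dictionary<string, int[]> Encode(IEnumerable<string> labels)
    {
        var options = new List<string>();
        var seen = new HashSet<string>();
        foreach (string label in labels)
            if (seen.Add(label)) // Add returns false for a label already seen
                options.Add(label);

        var result = new Dictionary<string, int[]>();
        for (int option = 0; option < options.Count; ++option)
        {
            int[] vector = new int[options.Count]; // all zeros by default
            vector[option] = 1;
            result[options[option]] = vector;
        }
        return result;
    }
}

// Usage:
//   var codes = OneHotSketch.Encode(new[] { "Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica" });
//   Console.WriteLine(string.Join(",", codes["Iris-versicolor"])); // prints 0,1,0
```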

I do have to admit that it isn’t the most elegant way of doing it; I will try to improve it in the future. (In other words, tomorrow.)

In addition, it is also quite limited in what it can process. For example, a two-valued category such as male/female is a poor fit for this encoding, as it would output [1,0] and [0,1]. In reality, I would personally use a single input of 1 or -1 instead.
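
As a rough sketch of that alternative (again not code from the demo; BinarySketch and TryEncode are made-up names), a column with exactly two distinct values could be mapped to 1 and -1, with anything else left to the one-hot encoding above:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of the demo project).
public static class BinarySketch
{
    // Succeeds only when the column has exactly two distinct values.
    // The first value seen becomes 1, the second becomes -1.
    public static bool TryEncode(IEnumerable<string> labels, out Dictionary<string, int> codes)
    {
        codes = new Dictionary<string, int>();
        var options = new List<string>();
        foreach (string label in labels)
            if (!options.Contains(label))
                options.Add(label);

        if (options.Count != 2)
            return false; // not a binary column; codes stays empty

        codes[options[0]] = 1;
        codes[options[1]] = -1;
        return true;
    }
}

// Usage:
//   BinarySketch.TryEncode(new[] { "male", "female", "female" }, out var codes);
//   codes["male"] is 1 and codes["female"] is -1
```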

I’ve tried to combine the two for loops within foreach (int i in stringColumns); yet it seems I can’t get the size of the array without actually looping once to find out how many options there are.
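
One possible way around this, which sidesteps the problem rather than truly merging the loops, is to store only each category’s index while scanning and build the int array later, once the total count is known. The sketch below assumes a hypothetical CategoryIndexSketch class that is not part of the demo:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of the demo project).
public class CategoryIndexSketch
{
    private readonly Dictionary<string, int> indexByName = new Dictionary<string, int>();

    public int Count => indexByName.Count;

    // Called once per row: an unseen category gets the next free index.
    public void Add(string name)
    {
        if (!indexByName.ContainsKey(name))
            indexByName[name] = indexByName.Count; // Count is read before the new entry is stored
    }

    // Builds the int array on demand, after all rows have been added.
    public int[] ToVector(string name)
    {
        int[] vector = new int[indexByName.Count];
        vector[indexByName[name]] = 1;
        return vector;
    }
}
```

The trade-off is that the array is created each time it is requested instead of being stored as a string up front, so whether this is actually an improvement depends on how often the values are read.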

This code is part of the Neural Network GUI Demo; its full source code can be found on GitHub.
