Encode Categories into Int Arrays

I had a hard time trying to encode categories into int arrays manually since it can get a bit overwhelming if there are a lot of features.

The basic logic here:

  1. Separates data into a 2 dimension string array
  2. Find every column that doesn’t parse into a double (This would not work when the category can be parsed e.g. Category 1)
  3. Loops every item in the column and adds it if it doesn’t already exist
  4. Encode categories into int arrays by setting an int in an int array to one. i.e. result[the index of the category] = 1;

Let’s get to the code, shall we?

void GetStringOptions(string[] lines, char separator)
        {
            string[][] splitedData = new string[lines.Count()][]; //2 dimension array, total data * features

            for (int i = 0; i < lines.Count(); ++i)
            {
                splitedData[i] = lines[i].Split(separator);
            }

            List<int> stringColumns  = new List<int>();
            

            for (int i = 0; i < splitedData[0].Count(); ++i)
            {
                if(!double.TryParse(splitedData[0][i], out double value)) //if value does not parse as double
                    stringColumns.Add(i);
            }

            Console.WriteLine(stringColumns[0]);

            foreach (int i in stringColumns)
            {
                int options = 0;
                int startingIndex = TypeIntsList.Count(); //index of the first item

                for (int j = 0; j < splitedData.Count(); ++j)
                {
                    if(!TypeIntsList.Any(x => x.Name == splitedData[j][i]))     // adds value to list, also counts how many total options there are
                    {
                        TypeIntsList.Add(new TypeInts() { Name = splitedData[j][i], ValueString = "", Index = i });
                        options++;
                        Console.WriteLine(splitedData[j][i]);
                    }
                }

                for(int option = 0; option < options; option++)     //generates an int array with a single 1 to activate different inputs
                {
                    int[] test = new int[options];
                    test[option] = 1;
                    TypeIntsList[startingIndex + option].ValueString = string.Join(",", test);
                }
            }



            TypeToIntGrid.Items.Refresh();  //refreshes the grid (new items won't show without this)
        }

Example output:

Iris-setosa       ->   [1,0,0]
Iris-versicolor   ->   [0,1,0]
Iris-virginica    ->   [0,0,1]
Auto-Encode Categories

I do have to admit that it isn’t the most elegant way of doing it; I will try to improve it in the future. (In other words, tomorrow.)

In addition, it is also very limited to what it can process. For example, male/female would not fit this encoding as it would output [1,0] and [0,1]. In reality, I would personally use 1 and -1 instead.

I’ve tried to combine the two for loops within foreach (int i in stringColumns); yet, it seems like I can’t get the size of the array without actually looping once and find out how many options there are.

This code is part of Neural Network GUI Demo, it’s full source code can be found here on github.

Neural Network GUI Demo

Neural Network GUI Demo

Recently, I’ve been messing around with neural networks, especially the Iris data set. Yet, building a neural network from scratch seems a bit too much work… So I started this project —— Neural Network GUI Demo. (Yes it is a demo)

I will be updating this as well as Endless Launcher.

Supported Features:

  • User-defined input/node/output number
  • Epoch/learn rate/momentum/weight decay/exit error values
  • Separator and data reading
  • Single layered neural network
  • Automatically encode categorical inputs
  • And a random text box that I use as a console

Note: There are currently no labels for the time being. This is for development purposes as it is much easier to move a text box rather than a text box AND a label.

(I’m totally not being lazy _(:зゝ∠)_)

 

This project is open source under the GPL-3.0 license, source code can be found here on github.

 

Special Thanks:

A lot of the code for the neural network are from http://quaetrix.com/Build2014.html