One Hot Encoding in Power BI

One Hot Encoding in Power BI

In a recent post where I described how to implement multiple linear regression in Power BI, I had to perform one hot encoding to make my categorical variable usable by the model.

So I decided to make another short post to explain what is hot encoding and how to implement it in Power BI using Power Query.

Why do need to perform Hot Encoding?

In machine learning or statistical algorithms, many models require to have only numerical input variables. So one of the most common preprocessing techniques to transform is called “one-hot encoding,” which is used to transform categorical data into a format that machine learning algorithms can understand and use to generate the model.

What is One-Hot Encoding?

One-hot encoding is a technique used to represent categorical data as a series of binary values. Each value in a categorical variable is transformed into a new binary variable, with a value of 1 indicating the presence of the category and a value of 0 indicating its absence.

For example, let’s say we have a dataset of animal types, with the following categories: “dog,” “cat,” and “bird.” Using one-hot encoding, we would transform the categorical variable into three binary variables: “is_dog,” “is_cat,” and “is_bird.” Each variable would have a value of 1 if the animal belongs to that category, and a value of 0 if it does not.

How to implement One-Hot Encoding in Power BI?

To follow up with my post on Multiple linear regression in Power BI I will use the same dataset ElentraSales which contains data about sales of the Elantra car over a period of time and some economic variables.

Now some tools like R are capable of automatically applying hot encoding on the fly but this is not the case in Power BI, so to make sure our model can run we must perform the encoding manually.

This will be done in 4 steps:

  • add an index column if you don’t already have a unique row identifier
    • In theory, the unique row identifier can be a combination of multiple columns, but to keep things simple if you don’t have a unique column add an index column
  • Then add a Dummy column with all values equal to 1
  • Pivot the Month column based on the Dummy column values
  • Replace nulls with 0 (optional step)

Create an Index Column

The only reason we create an index column is to keep the number of rows consistent after the pivot.

Add Dummy column

The dummy column will be used to generate the binary variables for each month, all the rows of the month Jan will have the value 1 in the pivoted column “Jan”.

Pivot the Month column

Now we simply pivot the month column based on the Dummy values previously created.

Replace nulls with zero

After the pivot, all the rows will take the value 1 for their belonging month and otherwise null.

The final step consists of replacing the nulls with 0, or, we can also keep the nulls the model will work just fine, and the linest function will also work with the nulls.
Now if for whatever reason you want to replace all the nulls you can select all the columns, press right click and replace.

Hot Encoding in Power Bi

The M code

Note that there may be a more optimized way to write the M code but since I wanted to use the Power Query UI only with no prerequisite knowledge of M.

   #"Added Index" = Table.AddIndexColumn(#"Previous Step Name", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Dummy", each 1),
    #"Pivoted Column" = Table.Pivot(#"Added Custom", List.Distinct(#"Added Custom"[MonthName]), "MonthName", "Dummy", List.Sum),
    #"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,{"Apr", "Aug", "Dec", "Feb", "Jan", "Jul", "Jun", "Mar", "May", "Nov", "Oct", "Sep"})

Conclusion

As we saw in this post it is pretty straightforward to implement one hot encoding in Power BI and we need to implement it for multiple columns we can simply repeat the same process for the other columns.

The only limitation that I can foresee would be the performance side on a larger dataset or columns with many categorical values.

Leave a Reply

Your email address will not be published. Required fields are marked *