Encoding a column in a dataset
Encoding a column in a dataset during data preprocessing is a crucial step in preparing your data for machine learning models. This process involves converting categorical data, which are often represented as text, into a numerical format that can be understood by the algorithms. There are several methods to do this, each suitable for different types of data and models.
Types of Encoding
Sample dataset:
       color size
    0    red    S
    1  green    M
    2   blue    L
    3  green   XL
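For readers who want to follow along, the sample dataset can be recreated as a pandas DataFrame. This is a minimal sketch; the variable name `data` is an assumption chosen to match the snippets in the sections that follow.

```python
import pandas as pd

# Sample dataset from the table above; `data` is the name used by the later snippets.
data = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "XL"],
})

print(data)
```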
1. Label Encoding:
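Each category is assigned a unique integer. How the integers are chosen depends on the tool: scikit-learn's LabelEncoder sorts the categories (alphabetically for strings), while pandas' factorize numbers them in order of first appearance. Label encoding is best for ordinal data where the order matters (e.g., 'low', 'medium', 'high'), but neither tool knows the intended order, so check that the assigned integers actually follow it. Python Example:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Option 1: pandas. factorize returns a tuple (codes, uniques); the codes follow
    # the order of first appearance, so here S=0, M=1, L=2, XL=3.
    data['size_encoded'] = pd.factorize(data['size'])[0]

    # Option 2: scikit-learn. LabelEncoder sorts the categories,
    # so here L=0, M=1, S=2, XL=3 (the result shown below).
    label_encoder = LabelEncoder()
    data['size_encoded'] = label_encoder.fit_transform(data['size'])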
Result:
       color size  size_encoded
    0    red    S             2
    1  green    M             1
    2   blue    L             0
    3  green   XL             3
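In the result above, the codes (S=2, M=1, L=0, XL=3) do not follow the natural size order S < M < L < XL. When that order matters to the model, one option is to state it explicitly. The following is a minimal sketch using an ordered pandas Categorical; the list `size_order` and the column name `size_ordinal` are assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame({"size": ["S", "M", "L", "XL"]})

# Assumed order for illustration: S < M < L < XL.
size_order = ["S", "M", "L", "XL"]

# An ordered categorical assigns each value a code equal to its position in size_order.
size_cat = pd.Categorical(data["size"], categories=size_order, ordered=True)
data["size_ordinal"] = size_cat.codes

print(data)
```

A value that is missing from `size_order` receives the code -1, so it is worth checking for -1 before training.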
2. One-Hot Encoding:
Creates a new binary column for each category in the original column. Suitable for nominal data where no ordinal relationship exists. Python Example:
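
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # use scikit-learn: one-hot encode 'color' and pass 'size' through unchanged
    column_transformer = ColumnTransformer(
        [("color_one_hot", OneHotEncoder(), ['color'])],
        remainder='passthrough'
    )
    one_hot_encoded_data_sklearn = column_transformer.fit_transform(data[['color', 'size']])

    # use pandas: the prefix yields the columns color_blue, color_green, color_red
    encoded_data = pd.get_dummies(data['color'], prefix='color')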
Result:
       color_blue  color_green  color_red size
    0         0.0          0.0        1.0    S
    1         0.0          1.0        0.0    M
    2         1.0          0.0        0.0    L
    3         0.0          1.0        0.0   XL
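A table with this layout can also be produced with pandas alone. The sketch below is one way to do it, not necessarily how the result above was generated; `dtype=float` is an assumption made to get 0.0/1.0 values, since recent pandas versions return booleans by default.

```python
import pandas as pd

data = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "XL"],
})

# One-hot encode only 'color'; every other column is left as it is.
encoded = pd.get_dummies(data, columns=["color"], prefix="color", dtype=float)

# get_dummies places the new columns after the untouched ones,
# so reorder the columns to put 'size' last, as in the table above.
encoded = encoded[["color_blue", "color_green", "color_red", "size"]]

print(encoded)
```

Each row has exactly one 1.0 among the three color columns, which is the defining property of one-hot encoding.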
3. Binary Encoding:
Converts each category to an integer, writes that integer in binary, and then splits the binary digits into separate columns. Useful for high-cardinality features, because n categories need only roughly log2(n) columns instead of n. Python Example:

    import category_encoders as ce

    # 'column' is a placeholder for the name of the categorical column to encode
    encoder = ce.BinaryEncoder(cols=['column'])
    data = encoder.fit_transform(data)

Considerations for Encoding
Model Type: Tree-based models can work well with label encoding, while models like linear regression or neural networks often require one-hot encoding for nominal features, because they would otherwise treat the integer codes as magnitudes.
Cardinality: For columns with a large number of categories, one-hot encoding can sharply increase the number of columns in the dataset, which might require dimensionality reduction techniques.
Ordinal vs. Nominal: Understand whether the categorical data is ordinal (there is an order) or nominal (no intrinsic order) to choose the appropriate encoding technique.
Steps in Encoding a Column
1. Identify Categorical Columns: Determine which columns in your dataset are categorical and need encoding.
2. Choose the Right Encoding Technique: Based on the type of categorical data and the model you plan to use, choose an appropriate encoding method.
3. Apply Encoding: Use tools like pandas, scikit-learn, or category_encoders in Python to apply the chosen encoding to your data.
4. Use Encoded Data for Modeling: Proceed with the encoded data for further data preprocessing steps (like normalization) and model training.
Example
In the provided example, we have a sample dataset with two columns: color and size. Let's examine the encoding applied to each:
Label Encoding on 'size' Column:
The 'size' column is an example of ordinal data (S < M < L < XL). Each size category (S, M, L, XL) has been assigned a unique integer (2, 1, 0, 3 respectively) based on alphabetical order. Note that these alphabetical codes do not follow the natural size order, so a model that relies on the order needs an explicit mapping instead (for example S=0, M=1, L=2, XL=3).
One-Hot Encoding on 'color' Column:
The 'color' column is nominal, with no intrinsic order among the categories (red, green, blue). One-hot encoding creates a new binary column for each color category. In the transformed dataset, color_blue, color_green, and color_red represent the presence (1) or absence (0) of each color.
The resulting data demonstrates how each technique transforms the categorical data into a format suitable for machine learning models:
The size_encoded column shows the result of label encoding, where each size label is mapped to a unique integer.
The color_blue, color_green, and color_red columns illustrate the result of one-hot encoding, with separate columns for each color category.
Encoding categorical data in this way prepares it for use in a wide range of machine learning algorithms, which can then learn from the categorical features.
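The four steps listed earlier can be strung together in one short script. The following is a minimal sketch using only pandas. The numeric `price` column and the `size_mapping` dictionary are hypothetical additions for illustration: the first shows that numeric columns are left alone, and the second encodes the assumed order S < M < L < XL.

```python
import pandas as pd

# Sample dataset from the article, plus a hypothetical numeric column `price`
# (added only to show that numeric columns are not encoded).
data = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "XL"],
    "price": [10.0, 12.5, 9.0, 15.0],
})

# Step 1: identify categorical columns (here: every non-numeric column).
categorical_cols = [
    col for col in data.columns
    if not pd.api.types.is_numeric_dtype(data[col])
]

# Step 2: choose an encoding per column.
# 'size' is ordinal, so it gets an explicit mapping; 'color' is nominal, so it is one-hot encoded.
size_mapping = {"S": 0, "M": 1, "L": 2, "XL": 3}

# Step 3: apply the encodings to a copy so the original data stays intact.
encoded = data.copy()
encoded["size"] = encoded["size"].map(size_mapping)
encoded = pd.get_dummies(encoded, columns=["color"], dtype=int)

# Step 4: every column is now numeric and ready for scaling or model training.
print(categorical_cols)
print(encoded)
```

Keeping the ordinal mapping in a plain dictionary makes the assumed order easy to review and change.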