---
title: "What is Feature Engineering?"
date: 2025-08-07
draft: false
featured: false
categories: ["Engineering"]
image: /assets/blog/feature-engineering/preview-image.png
meta_image: /assets/blog/feature-engineering/preview-image.png
description: "Learn how to transform raw data into meaningful features that improve machine learning model performance."
author: David Myriel
author_avatar: "/assets/authors/david-myriel.jpg"
author_bio: "Writer."
author_twitter: "davidmyriel"
author_github: "davidmyriel"
author_linkedin: "davidmyriel"
---

If you've spent any time around data science conversations, you've probably heard terms like **feature engineering**, **data transformations**, **ETL**, and **data pipelines**. They get thrown around a lot, sometimes even interchangeably. But what do they actually mean? And more importantly, why should you care?

Let's dive into feature engineering with clear, real-world examples that show exactly how it works in practice.

## From Raw Data to Useful Data

At its core, **feature engineering** is the process of taking raw information from the real world and transforming it into a format that makes your machine learning model perform better.

A machine learning model, no matter how advanced, is only as good as the data you feed it. Raw data is messy: it comes from multiple sources in inconsistent formats, full of missing values, text in multiple languages, and numbers that mean different things depending on the context.

Feature engineering sits between raw data collection and model training. Without it, you might have all the right ingredients but no recipe, and your "AI cake" will flop.

*Diagram: Flow from Raw Data → Feature Engineering → Model Training → Predictions*

![Flow from Raw Data → Feature Engineering → Model Training → Predictions](/assets/blog/feature-engineering/diagram.png)


## Why It's Often Overlooked

In popular discussions about AI, the spotlight is usually on the model itself, whether it's a neural network, decision tree, or a large language model. Deployment gets a lot of attention too, because it's the stage where the AI starts producing real-world results.

But the quiet hero in the background is feature engineering. If you feed your model bad inputs, even the best architecture can't save it. A well-engineered dataset can make a basic model outperform a poorly fed state-of-the-art one.

{{< admonition "warning" "Garbage in, garbage out" >}}
"Garbage in, garbage out" is more than a cliché in data science, it's a measurable truth.
{{< /admonition >}}

## What Counts as a Feature?

![Feature Engineering Process](/assets/blog/feature-engineering/image-1.jpeg)

In machine learning, a **feature** is simply a measurable property of the thing you're trying to predict.

- In a house price prediction model, features could be square footage, number of bedrooms, or the age of the building.
- In a fraud detection system, features might include the time of day a transaction happens, the purchase location, or the distance between consecutive purchases.

Feature engineering is about shaping these properties so the model can make sense of them.

## Common Feature Engineering Techniques

Let's explore some transformations in real-world scenarios.

### 1. Converting Categories into Numbers

Many models can't directly handle text categories like `"Red"`, `"Blue"`, and `"Green"`. If you give a regression model these as strings, it has no idea how to interpret them.

Instead, we might use **one-hot encoding**, where each category becomes its own column with binary values:

| Color | Is_Red | Is_Blue | Is_Green |
|---------|--------|---------|----------|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |

*Example: Side-by-side comparison of original categorical data and its one-hot encoded version*

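Here's a minimal sketch of this encoding with pandas; the toy `color` column and the `is_` prefix are just for illustration:

```python
import pandas as pd

# Toy categorical data for illustration.
df = pd.DataFrame({"color": ["Red", "Blue", "Green"]})

# One-hot encode: each category becomes its own binary indicator column.
encoded = pd.get_dummies(df, columns=["color"], prefix="is", dtype=int)
print(encoded)
#    is_Blue  is_Green  is_Red
# 0        0         0       1
# 1        1         0       0
# 2        0         1       0
```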

### 2. Mathematical Transformations

Sometimes the relationship between a variable and the outcome isn't linear. For example, income might affect spending behavior, but doubling income doesn't exactly double spending.

We might take the **logarithm** of income to better reflect diminishing returns.

Other transformations include:
- Taking the square root to stabilize variance.
- Creating interaction features by multiplying two columns (e.g., `height × width` to get area).
- Normalizing values so they're on the same scale.

{{< admonition "tip" "Performance Impact" >}}
These transformations can dramatically improve model performance, especially for algorithms sensitive to value ranges.
{{< /admonition >}}
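
A small sketch of these transformations with pandas and NumPy, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data for illustration.
df = pd.DataFrame({
    "income": [30_000, 45_000, 120_000, 900_000],
    "height": [2.0, 2.5, 3.0, 1.5],
    "width": [1.0, 4.0, 2.0, 3.0],
})

# Log transform: compresses the long tail of high incomes.
df["log_income"] = np.log1p(df["income"])

# Interaction feature: multiply two columns to get an area.
df["area"] = df["height"] * df["width"]

# Min-max normalization: rescale income to the [0, 1] range.
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

print(df)
```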

### 3. Extracting Information from Text

For unstructured data like documents, PDFs, or emails, feature engineering might involve extracting only the useful pieces:
- Summarizing the text.
- Identifying entities (people, organizations, locations).
- Counting keyword frequencies.
- Converting text into embeddings for semantic understanding.

For example, instead of giving a legal AI the entire 300-page contract, you might extract:
- Number of clauses
- Key parties involved
- Dates of interest

*Flowchart: Text document going through summarization and entity extraction before feeding into a model*

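As a sketch of the keyword-frequency idea, using only the standard library and a toy contract snippet:

```python
import re
from collections import Counter

# Toy contract snippet for illustration.
text = "The Buyer shall pay the Seller. The Seller shall deliver the goods to the Buyer."

# Count keyword frequencies as simple numeric features.
tokens = re.findall(r"[a-z]+", text.lower())
keywords = {"buyer", "seller", "pay", "deliver"}
counts = Counter(t for t in tokens if t in keywords)
print(counts)  # Counter({'buyer': 2, 'seller': 2, 'pay': 1, 'deliver': 1})
```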

### 4. Date and Time Features

Dates often hide useful patterns. The raw date `"2025-08-07"` isn't directly useful, but splitting it into:
- Day of week
- Month
- Whether it's a holiday

…can reveal trends, like sales spikes on weekends or dips during certain seasons.
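
A quick sketch with pandas, using a hand-rolled holiday set as a stand-in for a real holiday calendar:

```python
import pandas as pd

# Toy dates for illustration; HOLIDAYS is a stand-in for a real calendar.
df = pd.DataFrame({"date": pd.to_datetime(["2025-08-07", "2025-12-25"])})
HOLIDAYS = {pd.Timestamp("2025-12-25")}

df["day_of_week"] = df["date"].dt.day_name()
df["month"] = df["date"].dt.month
df["is_holiday"] = df["date"].isin(HOLIDAYS)
print(df)
#         date day_of_week  month  is_holiday
# 0 2025-08-07    Thursday      8       False
# 1 2025-12-25    Thursday     12        True
```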

## How Feature Engineering Fits with ETL and Data Pipelines

You might have heard **ETL** (Extract, Transform, Load) in the context of data engineering. That's the broader process of moving data from one place to another:
1. **Extract** from a source (like a database or API).
2. **Transform** into the desired structure.
3. **Load** into storage or a model.

Feature engineering overlaps with the "Transform" step, but its goal is more specific: **to create predictive power, not just clean data**.

| Aspect | ETL | Feature Engineering |
|--------------------|----------------------------------|--------------------------------------|
| Goal | Move & clean data | Make data useful for prediction |
| Focus | Structure & formatting | Signal extraction & transformation |
| Example | Convert CSV to SQL table | Convert dates to "day of week" |

## When Done Well, It Changes Everything

A real-world example: a retail company tried predicting customer churn using raw transaction history. The model performed poorly. After feature engineering:
- They added **average purchase interval** as a feature.
- They calculated **recency of last purchase**.
- They measured **category diversity** in purchases.

With just these changes, accuracy improved by over 20%, without altering the model itself.
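
Here's one way those three features might be computed with pandas; the transaction table and snapshot date below are invented for illustration:

```python
import pandas as pd

# Toy transaction history for illustration.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2025-01-02", "2025-01-20", "2025-03-01",
        "2025-02-10", "2025-02-12",
    ]),
    "category": ["food", "food", "toys", "books", "books"],
})
snapshot = pd.Timestamp("2025-03-15")  # assumed "as of" date

features = tx.groupby("customer_id").agg(
    # Average gap between consecutive purchases.
    avg_purchase_interval=("date", lambda d: d.sort_values().diff().mean()),
    # Days since the most recent purchase.
    recency_days=("date", lambda d: (snapshot - d.max()).days),
    # Number of distinct product categories purchased.
    category_diversity=("category", "nunique"),
)
print(features)
```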

*Chart: Before/after performance graph showing the same model with raw vs engineered features*


## The Balancing Act

More features aren't always better. Too many can cause **overfitting**, where your model learns the noise rather than the pattern. Part of feature engineering is knowing what to keep, what to drop, and when to combine variables rather than multiply them endlessly.

## Wrapping Up

Feature engineering is the craft of making your data tell its story clearly to the model. It's about:
- Converting raw inputs into meaningful numbers.
- Creating new variables that capture relationships.
- Filtering out the noise.

Without it, you're leaving accuracy, efficiency, and model interpretability on the table. With it, you can turn even a simple algorithm into a powerful predictive tool.

If you remember one thing: models get the glory, but features win the game.


![Feature Engineering Process](/assets/blog/feature-engineering/image-2.jpeg)