Kyle I S Harrington / kyle@eecs.tufts.edu
Some slides adapted from Roni Khardon
Image from Alex Poon
The input to a ML algorithm/model is composed of features (aka attributes)
The output is a class or a value
Can we just take tons of measurements, feed them into our ML algorithm, and start making predictions?
What can we do about this?
Our dataset $D$ has $N$ dimensions, one per feature
Instance transformations reduce the number of dimensions by transforming the features themselves
If we had a dataset with red, green, blue, and yellow features (N=4),
Then we might transform the dataset to $(red - green)/(red + green)$ and $(blue - yellow)/(blue + yellow)$
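A minimal sketch of this transformation, assuming the four channels are columns of a NumPy array:

```python
import numpy as np

# hypothetical dataset: one row per instance, columns = red, green, blue, yellow
X = np.random.rand(100, 4)
red, green, blue, yellow = X.T

# two derived features replace the original four (N goes from 4 to 2)
X_new = np.column_stack([(red - green) / (red + green),
                         (blue - yellow) / (blue + yellow)])
```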
Principal component analysis (PCA) maps from one set of axes to orthogonal axes
Roughly speaking, project onto the axes of highest variation
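A minimal sketch using scikit-learn's PCA (the dataset `X` below is a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)            # stand-in dataset, N = 4 features

# keep the 2 orthogonal axes that capture the most variation
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                # (100, 2)
```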
Eigenfaces reduce faces to a low-dimensional space
(Figure: eigenface bases and faces generated from them)
Datasets can be thought of as manifolds
Embed data in a low-dimensional space, while preserving distances between points
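A hedged sketch using scikit-learn's MDS as one example of a distance-preserving embedding (other manifold methods such as Isomap follow the same interface):

```python
import numpy as np
from sklearn.manifold import MDS

X = np.random.rand(100, 10)                      # stand-in high-dimensional data

# multidimensional scaling: place points in 2-D so pairwise distances are preserved
X_embedded = MDS(n_components=2).fit_transform(X)
```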
Instance transformations reduce the number of dimensions by transforming the features themselves
Most methods of instance transformation are unsupervised
Filter irrelevant features based upon the dataset
Ranking features based upon correlation between feature and class
$Rank(f) = \frac{ E[( X_f - \mu_{X_f} ) ( Y - \mu_Y ) ] }{ \sigma_{X_f} \sigma_Y }$
where $f$ is the feature of interest, and $Y$ is the class
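A minimal sketch of this ranking with NumPy (the feature column and labels are assumed to be numeric arrays):

```python
import numpy as np

def rank_correlation(x_f, y):
    # Pearson correlation between feature values x_f and class labels y
    cov = np.mean((x_f - x_f.mean()) * (y - y.mean()))
    return cov / (x_f.std() * y.std())
```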
Mutual information between a feature and class
$Rank(f) = \displaystyle \sum_{X_f} \displaystyle \sum_Y p(X_f,Y) \log \frac{ p(X_f,Y) }{ p(X_f) p(Y) }$
where $f$ is the feature of interest, and $Y$ is the class
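A sketch of the mutual-information ranking, assuming the feature has already been discretized:

```python
import numpy as np

def rank_mutual_information(x_f, y):
    # estimate p(X_f, Y), p(X_f), p(Y) from counts, then sum the MI terms
    mi = 0.0
    for x in np.unique(x_f):
        p_x = np.mean(x_f == x)
        for c in np.unique(y):
            p_c = np.mean(y == c)
            p_xc = np.mean((x_f == x) & (y == c))
            if p_xc > 0:
                mi += p_xc * np.log(p_xc / (p_x * p_c))
    return mi
```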
What issues could there be?
How necessary is this?
Issues:
What is appealing about this?
Advantages of using validation sets
Features are tailored to the ML algorithm
Considers different ways of combining features
Start with subsets of only 1 feature
Grow subset of features by adding 1 new feature per iteration
Start with the full set of features
Eliminate 1 feature per iteration
Exhaustive search (consider all subsets)
Alternative AI methods (simulated annealing, genetic algorithms, ...)
Feature subset search is NP-hard
Filtering: 1-step process, considers features independently
Wrapper: iterates through subsets of features, selects the subset that works best with the ML algorithm
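A greedy forward-selection wrapper might look like the sketch below (kNN and 5-fold cross-validation are illustrative choices, not prescribed here):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features):
    # start with 1-feature subsets, add the best new feature each iteration
    selected, remaining = [], list(range(X.shape[1]))
    model = KNeighborsClassifier()
    while remaining and len(selected) < max_features:
        scores = [(cross_val_score(model, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```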
When searching/optimizing, provide an incentive to be simple (sparse) by penalizing complexity
Will be covered later (L1 regularization)
The range of values for a given feature can impact an algorithm's performance
Remember using the value of the year directly on assignment 1?
Scale the values into the range $[0,1]$
$x \leftarrow \frac{ x - x_{min} }{ x_{max} - x_{min} }$
Scale based on training set only
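A sketch of min-max scaling, with the min and max computed on the training set only (the splits below are stand-ins):

```python
import numpy as np

X_train = np.random.rand(80, 3) * 100     # stand-in train/test splits
X_test = np.random.rand(20, 3) * 100

x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = (X_train - x_min) / (x_max - x_min)
X_test_scaled = (X_test - x_min) / (x_max - x_min)   # reuse the training min/max
```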
Scale the distribution to have mean=0 and std=1
$x \leftarrow \frac{ x - \mu_X }{ \sigma_X }$
Scale based on training set only
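A corresponding sketch of standardization, again fit on the training set only:

```python
import numpy as np

X_train = np.random.randn(80, 3) * 5 + 10   # stand-in train/test splits
X_test = np.random.randn(20, 3) * 5 + 10

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma        # reuse the training mean/std
```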
Some algorithms only work on discrete features
We may need to discretize real-valued features
Calculate the histogram
This divides the values into bins
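A minimal sketch of histogram-based discretization with NumPy (the feature values are hypothetical):

```python
import numpy as np

values = np.random.normal(size=200)               # hypothetical real-valued feature

counts, edges = np.histogram(values, bins=5)      # equal-width bins
discrete = np.digitize(values, edges[1:-1])       # bin index (0..4) for each value
```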
Alternatively, use a heuristic/ad-hoc method to discretize in a useful way
E.g., build a decision tree, let the DT algorithm choose split points, and use the split values of the fitted tree
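One way to sketch this heuristic with scikit-learn, fitting a shallow tree on a single feature (all data below is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 1)                     # stand-in: one real-valued feature
y = (X[:, 0] > 0.6).astype(int)                # stand-in class labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
is_split = tree.tree_.feature >= 0             # internal nodes (leaves are marked -2)
cut_points = np.sort(tree.tree_.threshold[is_split])
discrete = np.digitize(X[:, 0], cut_points)    # discretize using the learned splits
```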
Some features are unordered (e.g. Browsers = [ Firefox, Chrome, Safari ])
The most common approach is to use unit vectors (one-hot encoding):
| Firefox | Chrome | Safari |
|---------|--------|--------|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
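A minimal sketch of one-hot encoding with NumPy (the browser values are the slide's example):

```python
import numpy as np

browsers = np.array(["Firefox", "Chrome", "Safari", "Firefox"])
categories = np.unique(browsers)                          # columns, alphabetical order
one_hot = (browsers[:, None] == categories).astype(int)   # one unit vector per instance
```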
Similar to assignment 3
Covers: kNN, Decision trees, Naive Bayes, Measuring ML algorithms
Quiz 1
Hands-on with Features