Kyle I S Harrington / kyle@eecs.tufts.edu
Some slides adapted from Roni Khardon
The range of values for a given feature can impact an algorithm's performance
Remember using the value of the year directly on assignment 1?
Scale the values into the range $[0,1]$
$x \leftarrow \frac{ x - x_{min} }{ x_{max} - x_{min} }$
Compute $x_{min}$ and $x_{max}$ on the training set only, then apply the same scaling to the test set
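A minimal sketch of min-max scaling fitted on the training set only. The function names (`fit_min_max`, `min_max_scale`) and the year values are illustrative assumptions, not part of the course material.

```python
def fit_min_max(train_values):
    """Return (x_min, x_max) computed from the training values only."""
    return min(train_values), max(train_values)


def min_max_scale(values, x_min, x_max):
    """Apply x <- (x - x_min) / (x_max - x_min) to each value."""
    span = x_max - x_min
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical year feature, split into training and test values.
train_years = [1990, 2000, 2010]
test_years = [1995, 2020]

x_min, x_max = fit_min_max(train_years)            # statistics from training data only
train_scaled = min_max_scale(train_years, x_min, x_max)
test_scaled = min_max_scale(test_years, x_min, x_max)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

Note that a test value outside the training range (2020 here) lands outside $[0,1]$; that is expected when the scaling is fitted on the training set alone.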
Scale the distribution to have mean=0 and std=1
$x \leftarrow \frac{ x - \mu_X }{ \sigma_X }$
Compute $\mu_X$ and $\sigma_X$ on the training set only, then apply the same scaling to the test set
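A minimal sketch of standardization fitted on the training set only. It assumes the population standard deviation (dividing by $n$), which the slides do not specify; the function names and data are illustrative.

```python
import math


def fit_standardizer(train_values):
    """Return (mu, sigma) estimated from the training values only."""
    n = len(train_values)
    mu = sum(train_values) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((x - mu) ** 2 for x in train_values) / n
    return mu, math.sqrt(variance)


def standardize(values, mu, sigma):
    """Apply x <- (x - mu) / sigma to each value."""
    if sigma == 0:
        # Constant feature: every value equals the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [6.0, 1.0]

mu, sigma = fit_standardizer(train)   # mu = 5.0, sigma = 2.0
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)

print(test_z)  # [0.5, -2.0]
```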
Some algorithms only work on discrete features
We may need to discretize real-valued features
Calculate the histogram
This divides the values into bins
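One way to sketch histogram-based discretization is with equal-width bins, where the bin edges are fixed from the training values and each value is replaced by its bin index. Equal-width binning, the function names, and the data are assumptions for illustration; other binning schemes (e.g. equal-frequency) are equally valid.

```python
import bisect


def fit_equal_width_edges(train_values, n_bins):
    """Return the interior bin edges of an equal-width histogram."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to a bin index in 0 .. len(edges).

    Values below the first edge fall in bin 0 and values at or above the
    last edge fall in the final bin, so out-of-range test values still
    receive a valid index.
    """
    return [bisect.bisect_right(edges, x) for x in values]


train_values = [0, 10, 20, 30, 40]
edges = fit_equal_width_edges(train_values, n_bins=4)
bins = discretize([5, 10, 35, 100, -3], edges)

print(edges)  # [10.0, 20.0, 30.0]
print(bins)   # [0, 1, 3, 3, 0]
```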
Alternatively, use a heuristic/ad-hoc method to discretize in a useful way
E.g., build a decision tree, let the DT algorithm choose the split points, and use the split values of the optimized tree as bin boundaries
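As a rough illustration of the decision-tree idea, the sketch below fits a depth-one tree (a stump) on a single feature by choosing the threshold with the highest information gain, then uses that threshold as the bin boundary. A full decision tree would produce several split values; the single split, the function names, and the toy data are simplifying assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return the threshold with maximal information gain, or None."""
    n = len(labels)
    base = entropy(labels)
    unique = sorted(set(values))
    best_gain, best_threshold = 0.0, None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]

threshold = best_split(values, labels)
discrete = [0 if x <= threshold else 1 for x in values]

print(threshold)  # 6.5
print(discrete)   # [0, 0, 0, 1, 1, 1]
```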
Some features are unordered (e.g. Browsers = [ Firefox, Chrome, Safari ])
Most common approach is to use unit vectors:
Firefox | Chrome | Safari |
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
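The unit-vector encoding in the table can be sketched as follows. The helper name `one_hot` and the choice to map an unseen category to the all-zero vector are assumptions; the slides do not say how unseen values should be handled.

```python
def one_hot(value, categories):
    """Return a unit vector with a 1 in the position of `value`.

    Assumption: a value not present in `categories` is encoded as all zeros.
    """
    return [1 if value == c else 0 for c in categories]


# Category order is fixed once (e.g. from the training set) and reused.
browsers = ["Firefox", "Chrome", "Safari"]

encoded = [one_hot(b, browsers) for b in ["Firefox", "Chrome", "Safari"]]
unseen = one_hot("Opera", browsers)

print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(unseen)   # [0, 0, 0]
```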
Proposal due: 03/07
Write a 300-500 word abstract describing your proposed project. This should include 2-3 references to papers you expect to cite in your final paper.
Project due: 04/25
Turn in an 8-12 page paper. A rough outline is:
B-cell receptors bind specific molecular structures to recruit immune system activity.
The epitope prediction problem:
A protein is a sequence of amino acids that folds into a 3D shape, and only a subset of these amino acids is bound by B-cell receptors. Which ones?
Complex Features in Prediction of Discontinuous B-cell Epitopes
This is a project with James Chin.
There are existing methods for B-cell epitope prediction. They vary in terms of specific feature sets, machine learning algorithms, and training/test sets.
Much of the advancement in B-cell epitope prediction has been driven by improvement in features.
We propose a method for synthesizing complex features from a set of basis features that provides high degrees of nonlinearity not possible with simple ML representations.
Linear Threshold Units