Better features usually help more than a better model. Good features would ideally:

  • Allow learning with few examples, and be hard to overfit with many examples
  • Capture the most important aspects of the problem
  • Reflect invariances (generalize to new scenarios)

Find the features (columns) of $X$ that are important for predicting $y$:

  • What are the relevant factors?
  • Which basis functions should I use among these choices?
  • What types of new data should I collect?
  • How can I speed up computation?

This can help us to remove features. The number of features is also tied to the fundamental tradeoff: increased feature complexity leads to increased overfitting risk. Models (like linear regression) can overfit when the number of features $d$ is large, so reducing to only the useful factors may improve results.

Generally, there are no right answers but there are wrong answers.

Association

For each feature $j$, compute the correlation between the feature values $x_j$ and the target $y$.

Usually gives unsatisfactory results as it ignores variable interactions (e.g. if tacos make you sick, and you often eat tacos on Tuesdays, it will say “Tuesday” is relevant.)
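
A minimal sketch of this scoring in NumPy (the `correlation_scores` name and the 0.5 cutoff are illustrative choices, not part of the method):

```python
import numpy as np

def correlation_scores(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc                                      # shape (d,)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / denom)

# Toy data: only feature 0 actually drives y.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(100)
selected = np.where(correlation_scores(X, y) > 0.5)[0]  # arbitrary threshold
print(selected)  # likely [0]
```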

Regression Weight

Fit linear regression weights $w$ based on all features.

Take all features $j$ where the weight magnitude $|w_j|$ is greater than a threshold.

This has major problems with collinearity: if two features are nearly identical, the weight can be split arbitrarily between them, so both may fall below the threshold even though the underlying factor is relevant.
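
A sketch of weight-based selection, plus a tiny demonstration of the collinearity failure (the function name and threshold are illustrative):

```python
import numpy as np

def select_by_weight(X, y, threshold):
    """Fit least-squares weights on all features; keep features with |w_j| > threshold."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.where(np.abs(w) > threshold)[0]

# Collinearity problem: duplicate a relevant feature and the weight is
# split between the two copies, so both can fall below the threshold.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
X = np.column_stack([x, x])   # two identical (collinear) columns
y = 2 * x                     # the underlying factor has weight 2
print(select_by_weight(X, y, threshold=1.5))
# Prints an empty array: the min-norm solution splits the weight as [1, 1],
# so neither copy is selected even though x clearly predicts y.
```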

Search and Score

  1. Define a score function that measures the quality of a set of features
  2. Search for the set of features with the best score

In the naive approach, we score every possible subset of features.

However, since we compare a very large number of feature sets, we are prone to optimization bias: the best-scoring set may look good partly by chance.

Runtime: $O(2^d)$ score evaluations, since a set of $d$ features has $2^d$ subsets.
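
A sketch of exhaustive search and score, assuming a least-squares model and a simple train/validation split as the score (the function names and the 70/30 split are arbitrary choices):

```python
import itertools
import numpy as np

def val_score(X, y, subset, n_train=70):
    """Score a feature subset by the validation error of least squares."""
    cols = list(subset)
    if not cols:  # empty subset: predict the training mean
        return ((y[n_train:] - y[:n_train].mean()) ** 2).sum()
    w, *_ = np.linalg.lstsq(X[:n_train, cols], y[:n_train], rcond=None)
    return ((X[n_train:, cols] @ w - y[n_train:]) ** 2).sum()

def exhaustive_search(X, y, score=val_score):
    """Score all 2^d subsets and return the best one: O(2^d) evaluations."""
    d = X.shape[1]
    subsets = (s for r in range(d + 1)
               for s in itertools.combinations(range(d), r))
    return min(subsets, key=lambda s: score(X, y, s))
```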

Forward Selection

  1. Start with an empty set of features, $S = \{\}$
  2. For each possible feature $j$, compute the score of the features in $S$ combined with feature $j$
  3. Find the $j$ that has the best score when added to $S$
  4. Check if $S \cup \{j\}$ improves on the best score found so far
  5. Add $j$ to $S$ and go back to step 2
    1. We can stop when no $j$ improves the score

Runtime: $O(d^2)$ score evaluations in the worst case, since there are at most $d$ passes, each scoring up to $d$ candidate features.
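
A sketch of the greedy procedure, reusing a subset-scoring function such as `val_score` from the search-and-score sketch above:

```python
def forward_selection(X, y, score):
    """Greedy forward selection over the columns of X.

    At most d passes, each scoring up to d candidates: O(d^2) evaluations,
    versus O(2^d) for exhaustive search.
    """
    d = X.shape[1]
    S, best = [], score(X, y, [])
    while len(S) < d:
        # Score every feature not yet in S, combined with S.
        scored = [(score(X, y, S + [j]), j) for j in range(d) if j not in S]
        new_best, j_best = min(scored)
        if new_best >= best:   # no j improves the score: stop
            break
        best, S = new_best, S + [j_best]
    return S
```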

Number of Features Penalties

We can again use complexity penalties, this time penalizing the number of features used. This count is also called the $L_0$-norm, $\|w\|_0$: the number of non-zero values in $w$. For least squares, the penalized objective is $f(w) = \|Xw - y\|^2 + \lambda\|w\|_0$.
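
A sketch of an $L_0$-penalized score that can be plugged into the search procedures above (`lam` is the hyperparameter $\lambda$; its default here is arbitrary):

```python
import numpy as np

def l0_penalized_score(X, y, subset, lam=1.0):
    """Training error plus lam * (number of features used).

    Larger lam selects fewer features; lam = 0 recovers plain training error.
    """
    cols = list(subset)
    if not cols:  # empty subset: predict the training mean
        return ((y - y.mean()) ** 2).sum()
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return ((X[:, cols] @ w - y) ** 2).sum() + lam * len(cols)
```

This score can be passed directly as the `score` argument of `exhaustive_search` or `forward_selection` above.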

Text Features

  1. Bag of words: represent sentences/documents by their word counts (word order is discarded)
  2. Bigram: an ordered sequence of two words
  3. Trigram: an ordered sequence of three words
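
A minimal illustration of these representations in plain Python (whitespace tokenization is a simplification; real pipelines also normalize punctuation, casing, etc.):

```python
from collections import Counter

def bag_of_words(text):
    """Word counts: word order is discarded."""
    return Counter(text.lower().split())

def ngrams(text, n):
    """Counts of ordered n-word sequences: n=2 gives bigrams, n=3 trigrams."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(ngrams("the cat sat on the mat", n=2))
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1, ...})
```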

Global vs Local

Global features are shared across all examples, while local features are specific to one ‘person’; combining them allows for “personalized” predictions.

To do this, we add a feature for each ‘person’ in the system.
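
One common way to implement this, sketched here, is to append a one-hot ‘person’ indicator to the shared features so each user gets their own learned parameter (the function name is illustrative):

```python
import numpy as np

def add_user_features(X, user_ids, n_users):
    """Append a one-hot 'person' indicator to the shared (global) features.

    Each user then gets their own learned bias, i.e. a local feature.
    """
    one_hot = np.zeros((X.shape[0], n_users))
    one_hot[np.arange(X.shape[0]), user_ids] = 1.0
    return np.hstack([X, one_hot])

# Two users with the same global feature value get different local columns.
X = np.array([[1.0], [1.0]])
print(add_user_features(X, user_ids=np.array([0, 1]), n_users=2))
# [[1. 1. 0.]
#  [1. 0. 1.]]
```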