Guest Speaker Series: Professor Henry Estrada, Evergreen Valley College

Johnsy Vineela
4 min read · Aug 15, 2018


This blog post was co-authored by Michael Korens

Professor Henry Estrada works in the Computer Science department at Evergreen Valley College in San Jose. After introducing himself to the interns and speaking a little about his experience in the field, he said, “I am a lifelong student, and some of my best teachers are my students.”

Having sparked the students’ enthusiasm with that comment, he gave a brief overview of machine learning, security, and data science and how they all tie together. He then went on to discuss analyzing data in detail.

Professor Henry Estrada lecturing the students at Cyber Defenders


Basics of Data Science

Talking about the basics of data science, Professor Estrada gave a few simple steps to understand the process of leveraging machine learning to do data analysis:

1. Data must be processed and prepared for analysis

  • Data format — tabular form is the most common way to represent data for analysis
  • Each row is a data point representing a single observation (e.g., one student)
  • Each column is a variable describing the data points (features such as name or address)
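As a quick sketch of that tabular layout (the field names here are invented for illustration, not from the talk), rows and columns can be modeled in plain Python:

```python
# Tabular data: each row is one observation, each column is one variable.
rows = [
    {"student": "Ana",  "age": 20, "gpa": 3.5},   # one data point
    {"student": "Ben",  "age": 22, "gpa": 3.1},
    {"student": "Cruz", "age": 21, "gpa": 3.8},
]

# A "column" is the same variable read across every row.
ages = [row["age"] for row in rows]
print(ages)  # [20, 22, 21]
```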

2. Look for algorithms and see the host of choices out there

  • Unsupervised Learning — Let algorithm discover what patterns to look for (k-Means clustering, Principal component analysis, Association rules, Social Network Analysis)
  • Supervised Learning — Predictions based on preexisting patterns (Regression analysis, K-nearest neighbors, Support vector machines, Decision trees, Random forests, Neural networks)
  • Reinforcement Learning — Use the patterns in the data to make predictions, and improve those predictions as more results come in

3. Suitable algorithms should be shortlisted based on requirements

4. Parameters of competing algorithms should be fine-tuned to optimize results

  • Algorithms can generate varying results, depending on how the parameters are tuned

5. Resulting models built are then compared to select the best one

  • Overfitting: an overly sensitive algorithm can mistake random variation in the data for persistent patterns. The resulting model typically makes highly accurate predictions on the existing dataset but does not generalize well to future data.
  • Underfitting: an insensitive algorithm tends to overlook genuine patterns, yielding a model that makes less accurate predictions on both current and future data.
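A small illustration of this trade-off (my own, not from the talk): fitting synthetic linear-plus-noise data with NumPy's `polyfit`, a degree-9 polynomial tracks the training points more closely than a straight line, but performs far worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=20)   # true pattern is linear, plus noise

# Hold out the last 5 points as a stand-in for future data.
x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = fit_and_errors(1)    # matches the true pattern
complex_train, complex_test = fit_and_errors(9)  # flexible enough to chase the noise

print("train errors:", simple_train, complex_train)
print("test errors: ", simple_test, complex_test)
```

The degree-9 model "wins" on training error yet loses badly on the held-out data: the overfitting pattern described above.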

One great tip he gave us: it can be difficult to work with categorical data that has too many categories. When you look at a confusing dataset, the question to ask yourself is whether there is a way to reduce the number of categories.
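One plain-Python way to act on that tip (the function name and threshold are my own, not from the talk) is to collapse rarely seen categories into a single catch-all bucket:

```python
from collections import Counter

def collapse_rare(values, min_count=2, other="other"):
    """Replace categories seen fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

colors = ["red", "red", "blue", "blue", "teal", "mauve"]
print(collapse_rare(colors))
# ['red', 'red', 'blue', 'blue', 'other', 'other']
```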

Validation

Validation is an assessment of how accurate a model is at predictions. Existing datasets are split into two parts:

  • Training: a training dataset is used to generate the prediction model. For large datasets, the training set usually consists of up to 75% of randomly selected data points from the original dataset.
  • Testing: the remaining 25% constitutes the test dataset, which acts as a proxy for new data and is used to assess the model’s accuracy when it makes predictions. This split works well for large datasets; if the dataset is relatively small, you may have to employ cross-validation.
  • Cross-Validation: split the small dataset into several folds. In each round, hold one fold out, train the model on the remaining folds, and evaluate it on the held-out fold. Repeat until every fold has served as the test set once, then average the results.
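The split and fold bookkeeping described above can be sketched in plain Python (the 75/25 ratio comes from the talk; the seed and fold handling are arbitrary choices of mine):

```python
import random

def train_test_split(data, test_frac=0.25, seed=0):
    """Shuffle the data, then carve off the last test_frac as the test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs; each point is held out exactly once."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        if i < k - 1:
            test = indices[i * fold_size:(i + 1) * fold_size]
        else:
            test = indices[(k - 1) * fold_size:]  # last fold absorbs the remainder
        train = [j for j in indices if j not in test]
        yield train, test

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 75 25
```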

We then were given a clear walk-through of how to use K-Means clustering and KNN classification algorithms along with a code rundown.
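The code from that walkthrough isn't reproduced in this post, but minimal from-scratch versions of the two algorithms look roughly like this (the toy points, labels, and parameters are my own, not the professor's):

```python
import math
from collections import Counter

def knn_predict(train_points, labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def kmeans(points, init_centroids, iters=10):
    """Lloyd's algorithm: assign points to nearest centroid, then recompute means."""
    centroids = list(init_centroids)
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

# Two well-separated blobs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))  # a
print(kmeans(points, [(0, 0), (10, 10)]))       # centers near (0.5, 0.5) and (10.5, 10.5)
```

Note the contrast the professor drew: KNN is supervised (it needs the labels), while k-means is unsupervised (it discovers the groups on its own).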

Machine Learning Project Checklist

For doing a project that leverages machine learning, the professor gave us a handy checklist:

  1. Frame the problem and look at the big picture
  2. Get the data
  3. Explore the data to gain insights
  4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
  5. Explore many different models and shortlist the best ones
  6. Fine tune your models and combine them into a good solution
  7. Present your solution
  8. Launch, monitor, and maintain your system

Useful Resources

Some great resources to gain more information from are listed below:

  • Data Science from Scratch by Joel Grus
  • Data Science by John D Kelleher and Brendan Tierney
  • Introduction to Computation and Programming using Python with Applications to Understanding Data by John Guttag
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
  • Introduction to Machine Learning with Python by Andreas C. Muller and Sarah Guido
  • Machine Learning and Security by Clarence Chio and David Freeman

Research and Dataset information

1. KDD Cup 99 dataset for intrusion detection — although the data dates back more than 15 years, it has millions of entries and is still broadly used for academic research. It covers four different types of attacks for intrusion detection.

http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup99.html

2. In the research below, the dataset is used to classify network traffic into two categories: normal and abnormal. Using KNN, the authors explain the procedure and code to build a simple network intrusion detection system.

https://simplyml.com/machine-learning-for-network-anomaly-detection/

3. The following paper uses the dataset for classification using K-Means clustering. It applies a similar concept to the above, clustering the data into normal and abnormal groups, which are then classified.

https://pdfs.semanticscholar.org/0890/621455a8a5bfd79c12168e3b76ca68f547a.pdf

4. Staudemeyer and Omlin have used this dataset to find out which features are most important: “Extracting Salient Features for Network Intrusion Detection Using Machine Learning Methods”

http://sacj.cs.uct.ac.za/index.php/sacj/article/view/200

5. Relevant datasets — the site below lists 18 datasets used in academic research

https://github.com/jivoi/awesome-ml-for-cybersecurity#-datasets
