Week 7 Lab: Unsupervised Learning


This week's assignment focuses on completing a K-Means analysis.

Our Dataset:

Dataset: student-mat.csv (provided in folder assign_wk7)

Remember to take a look at the student.txt file for a better understanding of the dataset. You can also read more about the dataset here

Unsupervised Lab

Objective:

Things to keep in mind:

Deliverables:

Upload your notebook's .ipynb file (This assignment can be done in one or two notebooks. The choice is up to you!)

Important: Make sure you provide complete and thorough explanations for all of your analysis. You need to defend your thought processes and reasoning.

After a little EDA, there appear to be no null values at first glance, though I'm still mildly suspicious given the previous datasets in this course. We will read the txt file for clues and break our columns down into categories, then look to encode the dataset for the ML model. UPDATE: we are not converting columns to the category dtype after all, since our ML models will eventually need all-integer inputs. Looking back, the providers of the data already did much of the encoding work for us. My category calls are commented out.
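A minimal sketch of that null check and column breakdown. The tiny DataFrame here is a stand-in for `pd.read_csv("assign_wk7/student-mat.csv", sep=";")` (the UCI student file is semicolon-delimited), so the column names and values are illustrative only:

```python
import pandas as pd

# Stand-in for the real student-mat.csv frame.
df = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "studytime": [2, 3, 1],   # already integer-encoded 1-4 by the providers
    "G3": [11, 14, 0],
})

# Quick null check -- zero counts here, but stay suspicious of
# sentinel values (e.g. G3 == 0) that read_csv won't flag as NaN.
print(df.isna().sum())

# Separate the object (string) columns that still need encoding
# from the integer columns the data providers encoded for us.
cat_cols = df.select_dtypes(include="object").columns.tolist()
num_cols = df.select_dtypes(include="number").columns.tolist()
print(cat_cols, num_cols)
```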

Reasoning for so many categorical columns

While many of the columns were integers, upon inspection of what they contain, both in Python and in the txt file, the vast majority of those integers were still categorical.

Update: this is no longer true; leaving it in to show progress.

See the markdown comment above for more (all category calls were commented out). We will now encode our dataframe to plug into our models.
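A sketch of the encoding step using one-hot dummies. The three-row frame stands in for the real data; in the notebook this would produce the `df_encoded` frame used below:

```python
import pandas as pd

# Stand-in rows; the real frame comes from student-mat.csv.
df = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "higher": ["yes", "yes", "no"],
    "studytime": [2, 1, 3],
})

# One-hot encode the remaining string columns so every feature is
# numeric; drop_first avoids a redundant dummy for binary columns.
df_encoded = pd.get_dummies(df, columns=["school", "higher"], drop_first=True)
print(df_encoded.dtypes)
```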

Not entirely sure what these warnings are about; sklearn's site likely has info on the new parameter (guessing 'n_jobs' or something), but moving on. We will now try to find our best k at the 'elbow'.
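For reference, the elbow search boils down to fitting KMeans across a range of k and watching inertia flatten out. This sketch uses toy blob data rather than the student frame; setting `n_init` explicitly also silences the FutureWarning newer sklearn versions emit about its changing default:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for df_encoded; 4 true clusters.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit KMeans for each k and record inertia (within-cluster sum of
# squares); the "elbow" is where the improvement flattens out.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia keeps dropping as k grows; look for the sharpest bend.
print([round(i, 1) for i in inertias])
```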

At a quick glance, we see our last real improvement at what looks like k = 13. Let's keep going.

Ew

I have plugged and played a bit, and 5 clusters gets us up to .25, but let's use some PCA to get better results.
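Assuming the ~.25 above is a silhouette score (values near 1 mean tight, well-separated clusters; near 0 means overlap), the PCA step looks roughly like this on stand-in data. Scaling before PCA matters, since the components chase the highest-variance features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=5, n_features=8, random_state=0)

# Score the clustering on the raw features.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
s_raw = silhouette_score(X, labels)

# Reduce to two principal components (after scaling), then recluster.
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels_pca = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_pca)
s_pca = silhouette_score(X_pca, labels_pca)

print(round(s_raw, 3), round(s_pca, 3))
```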

Let's gather our thoughts for a second

So, much like the walkthrough, it does seem as though 2 clusters was the way to go, though the PCA result is still less than convincing, and honestly a bit of a letdown.

A theory

Looking at our data quickly, as well as at the correlation plot, we can see that our G1 and G2 columns appear to be highly correlated with G3. This makes logical sense, as these are the semester 1 and semester 2 grades, while G3 is the final grade, conceivably a combination of the two.
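That correlation check can be done numerically as well as visually. A sketch on made-up grade values (the real ones come from student-mat.csv):

```python
import pandas as pd

# Stand-in grades; in the notebook these columns come from the real frame.
grades = pd.DataFrame({
    "G1": [10, 14, 8, 15, 6],
    "G2": [11, 15, 7, 16, 5],
    "G3": [11, 15, 8, 16, 4],
})

# Pairwise Pearson correlations -- G1/G2 vs G3 come out near 1,
# which is what the correlation plot is showing.
g3_corr = grades.corr().round(2)["G3"]
print(g3_corr)
```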

Let's copy our df_encoded dataframe, drop the G1 & G2 columns, then run it again.
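A sketch of that copy-drop-recluster step, with a small stand-in for `df_encoded`:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for df_encoded with the grade columns present.
df_encoded = pd.DataFrame({
    "studytime": [2, 1, 3, 4, 2, 1],
    "absences":  [4, 10, 2, 0, 6, 8],
    "G1": [10, 5, 14, 15, 9, 7],
    "G2": [11, 4, 15, 16, 10, 6],
    "G3": [11, 5, 15, 16, 10, 6],
})

# Copy first so the original frame is untouched, then drop the two
# highly correlated period grades before clustering again.
df_no_g = df_encoded.copy().drop(columns=["G1", "G2"])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df_no_g)
print(df_no_g.columns.tolist(), labels)
```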

Revisit

I originally submitted this assignment with this as the ending. Below, I'm going to try and implement TPOT to see if we can't do a better job clustering our data, perhaps with a different ML model.
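TPOT itself is supervised AutoML and is slow to run inline, so as a stand-in for the idea, here is a small hand-rolled sweep over KMeans settings scored by silhouette, which is roughly the kind of model search TPOT automates (over full pipelines, via genetic programming). Toy blob data again; this is not the author's TPOT run:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Try a few cluster counts and init schemes, keep whichever
# configuration scores best on silhouette.
best = None
for k in (2, 3, 5, 8):
    for init in ("k-means++", "random"):
        labels = KMeans(n_clusters=k, init=init, n_init=10,
                        random_state=1).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, init)

print(best)
```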

Conclusion

Well, we still didn't get anything good. I've thrown my hands up a couple of times with this dataset at this point, and I refuse to believe this was the intention; I know I'm missing something. I'm submitting this on the last Sunday of the class just to show the continued attempts, but rest assured I will be revisiting this again once I'm less frustrated.