The project I completed for Module 5 of the Data Science bootcamp involved finding a dataset I was unfamiliar with and analyzing it. I browsed Kaggle and found a dataset describing strategy game apps from the Apple App Store. It can be found at this link: https://www.kaggle.com/tristan581/17k-apple-app-store-strategy-games/.
After completing four projects for the Flatiron Data Science course, I discovered that one aspect of data exploration (Exploratory Data Analysis or the E of OSEMN) that I enjoy is feature engineering.
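As a quick illustration of what feature engineering can look like in practice, here is a minimal sketch of deriving a few new columns from raw ones with pandas. The column names and values are invented for illustration and are not taken from any particular project dataset.

```python
import pandas as pd

# Hypothetical raw data; the columns below are made up for illustration.
df = pd.DataFrame({
    "price": [0.0, 2.99, 4.99],
    "in_app_purchases": ["0.99, 1.99", None, "4.99"],
    "original_release_date": ["2014-03-01", "2016-07-15", "2019-01-20"],
})

# Binary flag: is the app free?
df["is_free"] = (df["price"] == 0).astype(int)

# Count how many in-app purchase tiers are listed for each app.
df["n_iap_tiers"] = df["in_app_purchases"].fillna("").apply(
    lambda s: len([p for p in s.split(",") if p.strip()])
)

# App age in years, relative to a fixed reference date.
df["original_release_date"] = pd.to_datetime(df["original_release_date"])
df["age_years"] = (pd.Timestamp("2020-01-01") - df["original_release_date"]).dt.days / 365.25

print(df[["is_free", "n_iap_tiers", "age_years"]])
```

Each engineered column turns a raw value into something a model can use more directly, which is what makes this step of EDA so satisfying.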
One aspect of data science I already had in my daily life before starting this course was clustering.
The pandas package is used frequently in Flatiron's Data Science course. Its dataframe manipulation is very handy, although there are some surprising limitations that further differentiate it from working with plain Python or Java matrices. This post will address workarounds for a few of those limitations as well as the simple, useful methods already available when working with a pandas dataframe. Links appear throughout that go directly to the relevant page of the documentation.
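As a small taste of both sides, here is a sketch (with hypothetical column names, not the specific examples this post covers) of a couple of those handy one-liners alongside one of pandas' better-known surprises, chained-indexing assignment, and its `.loc` workaround.

```python
import pandas as pd

# Hypothetical dataframe for illustration.
df = pd.DataFrame({
    "genre": ["Strategy", "Puzzle", "Strategy", "Board"],
    "avg_rating": [4.5, 4.0, 3.5, 4.8],
})

# Handy one-liners: frequency counts and grouped summaries.
print(df["genre"].value_counts())
print(df.groupby("genre")["avg_rating"].mean())

# Surprising limitation: chained indexing may silently fail to write.
# df[df["genre"] == "Strategy"]["avg_rating"] = 5.0   # may not modify df at all
# Workaround: do the selection and assignment in a single .loc call.
df.loc[df["genre"] == "Strategy", "avg_rating"] = 5.0

# Taking an explicit copy of a filtered slice avoids SettingWithCopyWarning.
strategy_only = df[df["genre"] == "Strategy"].copy()
strategy_only["avg_rating"] = strategy_only["avg_rating"].round()
```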
One important concept of data science that comes up in this course is overfitting and underfitting. Overfitting occurs when a model fits the training data so closely that it fails to generalize to any other data. LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge regression (L1 and L2 norm regularization, respectively) are two related approaches to combating overfitting, and both are used much like an ordinary linear regression model. Each uses a hyperparameter to penalize the size of the coefficients, shrinking some toward zero (or, in LASSO's case, exactly to zero) in order to filter out noise in the data and thereby reduce overfitting.
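As a rough sketch of how that looks in practice, here is a minimal example using scikit-learn's Ridge and Lasso on a synthetic dataset; the data, alpha values, and feature counts are invented for illustration. The `alpha` argument is the penalty hyperparameter described above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 5 informative features plus 45 pure-noise features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))
true_coefs = np.zeros(50)
true_coefs[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_coefs + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters: the penalty treats all coefficients on the same footing.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, model in [
    ("OLS", LinearRegression()),
    ("Ridge (L2)", Ridge(alpha=1.0)),   # shrinks coefficients toward zero
    ("Lasso (L1)", Lasso(alpha=0.1)),   # can set some coefficients exactly to zero
]:
    model.fit(X_train, y_train)
    n_zero = np.sum(np.isclose(model.coef_, 0))
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.3f}, "
          f"zero coefficients = {n_zero}")
```

Running a sketch like this shows the idea: Ridge keeps every feature but with smaller coefficients, while Lasso tends to zero out the noise features entirely, which is why it doubles as a feature-selection tool.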