In today’s class, we discussed cross-validation using a dataset built around three health variables: obesity, inactivity, and diabetes. The dataset contains measurements of all three variables for 354 individuals.
Our main goal is to build predictive models that show how these health variables relate to one another. To do this, we consider a range of polynomial models, from a simple linear model (degree 1) up to degree 4. Comparing this range lets us see how much model complexity the data actually supports.
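To get a feel for how quickly these models grow, here is a minimal sketch that counts the terms in each polynomial, assuming two predictors (say inactivity and obesity) and using scikit-learn's `PolynomialFeatures`. The choice of predictors and the random placeholder array are assumptions for illustration only; they are not the class dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Synthetic placeholder with the same shape as our data: 354 rows, 2 predictors.
# These are random numbers, NOT the real obesity/inactivity measurements.
rng = np.random.default_rng(0)
X = rng.uniform(size=(354, 2))

# Count the polynomial terms (excluding the constant) for degrees 1 through 4.
n_terms = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_terms[degree] = poly.fit_transform(X).shape[1]

print(n_terms)  # {1: 2, 2: 5, 3: 9, 4: 14}
```

With only 354 observations, a degree-4 model already has 14 terms plus an intercept, which is one reason to check it against simpler models rather than assume the most flexible one is best.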
To assess the models and select the most suitable one, we use 5-fold cross-validation. We split the dataset into five sections, or “folds.” We then train and test each polynomial model five times, each time holding out a different fold as the test set and training on the remaining four. This tells us how well each model generalizes to new, unseen data.
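The splitting step can be sketched with scikit-learn's `KFold`. This is an illustration of the mechanics only: the shuffling and the `random_state` value are our assumptions, and the array of zeros simply stands in for 354 rows of data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the dataset: only the number of rows (354) matters here.
placeholder = np.zeros((354, 1))

# Assumed settings: shuffle the rows once, then cut them into 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []      # (train size, test size) for each of the 5 rounds
tested_rows = []     # every row index that appears in some test fold
for train_idx, test_idx in kf.split(placeholder):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested_rows.extend(int(i) for i in test_idx)

print(fold_sizes)
```

Because 354 is not divisible by 5, four folds hold 71 rows and one holds 70. Every row is used for testing exactly once and for training in the other four rounds.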
Our primary objective is to identify the polynomial model that best represents our data. We evaluate model performance with common statistical measures such as mean squared error (MSE) or R-squared, averaged across the five test folds, which indicate how closely each model’s predictions match the actual values.
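Putting the pieces together, one possible way to compare the four models is sketched below. We assume here that diabetes is the response and that inactivity and obesity are the predictors, which is our guess at the setup rather than something fixed by the dataset. The data are synthetic (a made-up linear relationship plus noise), so the printed scores say nothing about the real health data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the class dataset): two predictors and a
# response generated from an invented linear rule plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(10, 40, size=(354, 2))   # hypothetical inactivity, obesity
y = 2.0 + 0.1 * X[:, 0] + 0.15 * X[:, 1] + rng.normal(scale=0.5, size=354)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for degree in range(1, 5):
    # Polynomial expansion followed by ordinary least squares.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_validate(
        model, X, y, cv=cv, scoring=("neg_mean_squared_error", "r2")
    )
    results[degree] = {
        # scikit-learn reports negated MSE, so flip the sign back.
        "mse": -scores["test_neg_mean_squared_error"].mean(),
        "r2": scores["test_r2"].mean(),
    }

# Pick the degree with the lowest average test-fold MSE.
best_degree = min(results, key=lambda d: results[d]["mse"])
for degree, r in results.items():
    print(f"degree {degree}: MSE={r['mse']:.3f}, R^2={r['r2']:.3f}")
print("lowest cross-validated MSE at degree", best_degree)
```

Using the same `KFold` object for every degree means all four models are scored on identical splits, so differences in MSE reflect the models rather than the luck of the partition.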
To see patterns and distributions in the dataset, we also constructed a 3D scatter plot. Each data point is drawn as a black dot, and the three axes show the values of obesity, inactivity, and diabetes. The plot helps us spot any noticeable trends or groupings in the data.
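A plot of this kind could be produced with matplotlib roughly as follows. The three arrays are random placeholders, the axis units (percentages) and the output file name are assumptions, and the non-interactive backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np

# Random placeholder values (NOT the real measurements), one per individual.
rng = np.random.default_rng(0)
obesity = rng.uniform(10, 40, size=354)
inactivity = rng.uniform(10, 40, size=354)
diabetes = rng.uniform(5, 15, size=354)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# One black dot per data point, as in the plot described above.
ax.scatter(obesity, inactivity, diabetes, color="black", s=10)
ax.set_xlabel("Obesity (%)")
ax.set_ylabel("Inactivity (%)")
ax.set_zlabel("Diabetes (%)")

fig.savefig("health_scatter_3d.png", dpi=150)  # hypothetical file name
```

In an interactive session, replacing the `savefig` call with `plt.show()` lets you rotate the plot, which makes clusters and trends much easier to see than in a single fixed view.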
In summary, our analysis centers on selecting the polynomial model that best explains the relationships among obesity, inactivity, and diabetes. We use cross-validation to compare the candidate models, and the 3D scatter plot to get a sense of the underlying patterns in these health-related variables.