Sep 27, 2023

In today’s class, we discussed cross-validation using a dataset that focuses on three key health variables: obesity, inactivity, and diabetes. We examined data from 354 individuals, each with measurements for all three of these health aspects.

Our main goal is to construct predictive models that provide deeper insights into how these health variables interrelate. To achieve this, we’re considering a range of polynomial models, from a simple linear model (degree 1) up to degree 4. This variety enables us to explore and grasp the complexity of our dataset.

To assess and select the model that best fits our data, we’ve adopted a 5-fold cross-validation technique. In this process, we segment our dataset into five sections or “folds.” We then train and test our polynomial models five times, each time holding out a different fold as the test set and training on the remaining four folds. This systematic approach helps us gauge how well our models adapt to new, unseen data.

Our primary objective is to pinpoint the polynomial model that most accurately represents our data. We evaluate model performance using common statistical measures such as mean squared error (MSE) or R-squared, which indicate how well our models match the actual data.
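
A minimal sketch of this degree-selection procedure, assuming scikit-learn; the file name cdc_2018.csv and the column names are placeholders I’ve invented, not fixed by these notes:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical file and column names -- adjust to the actual dataset.
df = pd.read_csv("cdc_2018.csv")
X = df[["%obesity", "%inactivity"]].values
y = df["%diabetes"].values

# Compare polynomial degrees 1 through 4 with 5-fold cross-validation.
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # sklearn reports negated MSE; flip the sign for display.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"degree {degree}: mean CV MSE = {-scores.mean():.3f}")
```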

To delve deeper and visually comprehend patterns and distributions within the dataset, we’ve constructed a 3D scatter plot. In this visualization, every data point is illustrated as a black dot, and the three axes showcase the values for obesity, inactivity, and diabetes. This graphical representation helps us identify any noticeable trends or groupings in the data.
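
A sketch of such a plot with matplotlib, again using the hypothetical file and column names assumed above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cdc_2018.csv")  # assumed file name, as above

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["%obesity"], df["%inactivity"], df["%diabetes"], color="black")
ax.set_xlabel("% obesity")
ax.set_ylabel("% inactivity")
ax.set_zlabel("% diabetes")
plt.show()
```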

In essence, our analysis revolves around selecting the most fitting polynomial model to elucidate the relationships among obesity, inactivity, and diabetes. We utilize the robust approach of cross-validation to meticulously assess these models, while our 3D scatter plot aids in grasping the fundamental patterns and tendencies in these health-related variables.

Sep 25, 2023

In this class I learned about Cross-Validation, the Bootstrap, and k-Fold Cross-Validation.

Cross-Validation- Cross-validation is a method employed in both machine learning and statistical modeling to evaluate how effectively a predictive model will perform on data it hasn’t seen before. Instead of a single train-test split, cross-validation repeats the training and validation process multiple times, each time using a different subset of the data as the testing/validation set and the remaining data for training.
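
A minimal illustration of repeated train/validation splits, assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(size=200)

# Five different random train/validation splits instead of just one.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, val_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    print(f"validation MSE: {mse:.3f}")
```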

Bootstrap- The bootstrap is akin to a clever data illusion, conjured by crafting “simulated” datasets from our original data through random sampling with replacement. This technique proves particularly useful in situations where data is limited. By scrutinizing these simulated datasets, we can determine the level of confidence we can place in our model’s outcomes. It’s comparable to having our model revisit its assignments multiple times, ensuring a comprehensive understanding of the subject matter.
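
A minimal bootstrap sketch in plain NumPy, estimating a confidence interval for a sample mean from synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=50)  # a small original sample

# Resample with replacement many times and record each sample's mean.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

# The 2.5th and 97.5th percentiles give a 95% bootstrap interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```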

k-Fold Cross-Validation- It is a common technique used in machine learning for model evaluation. It involves partitioning the dataset into ‘k’ equally sized folds, where ‘k’ is a specified number. The model is trained ‘k’ times, each time using a different fold as the validation set and the remaining data as the training set. The results from the ‘k’ iterations are then averaged to obtain a single performance metric for the model. This method provides a robust assessment of the model’s performance and is especially useful for optimizing hyperparameters and understanding how the model generalizes to unseen data.
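
A sketch of k-fold evaluation with scikit-learn (k = 5 here; the model and synthetic data are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=2, noise=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

# Average the k per-fold scores into a single performance metric.
print(f"per-fold R^2: {scores.round(3)}")
print(f"mean R^2 across folds: {scores.mean():.3f}")
```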

Sep 22, 2023

In this class I learned about the t-test.

A t-test is like being a detective in statistics. Suppose we have two sets of data, and we are curious whether they’re truly different or whether the difference is just a coincidence. The t-test comes to the rescue. We start with a guess called the “null hypothesis,” assuming there’s no actual difference between the groups. Then we collect data and crunch the numbers to get a special value called the “t-statistic,” which indicates how large the difference between the groups is relative to the variability in the data. If the t-statistic is large and the groups are notably different, we get a small “p-value,” which is like a hint. A small p-value implies the groups are likely genuinely different, not different by chance. If the p-value is less than a threshold we pick (usually 0.05), we can confidently say the groups differ, and we reject the null hypothesis. But if the p-value is large, it suggests the groups might not differ much, and we lack enough evidence to reject the null hypothesis. Essentially, the t-test guides us in determining whether what we observe in our data is a genuine difference or just a random outcome.
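
A minimal worked example with SciPy on two synthetic groups, using the usual 0.05 threshold mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=54, scale=5, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means likely differ.")
else:
    print("Fail to reject the null hypothesis: not enough evidence.")
```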

Sep 20, 2023

In this lecture I learned about the T-Test – A Statistical Tool for Comparing Groups.

The t-test is a statistical hypothesis test used to evaluate whether there is a significant difference between the means of two groups. It is a crucial tool for comparing means and judging the significance of observed differences, and it is widely applied across many domains.

There are two main types of t-tests:

  1. Independent Samples T-Test: This type of t-test is used when we want to compare the means of two independent groups or populations. It’s often applied when comparing two different treatments, groups of people, or any situation where the samples are independent of each other.
  2. Dependent Samples T-Test: This type of t-test is used when the samples are dependent or related in some way (e.g., before and after measurements on the same individuals). It’s often used in situations where we want to determine if there’s a significant difference between two measurements taken on the same individuals or objects; a sketch of this paired case follows the list.
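
A sketch of the dependent (paired) case with SciPy, assuming before-and-after measurements on the same individuals (the data here is synthetic; the independent-samples version, ttest_ind, was sketched in the Sep 22 entry above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=120, scale=10, size=25)
after = before - rng.normal(loc=3, scale=4, size=25)  # simulated change

# Paired t-test: compares the mean per-individual difference against 0.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```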

The t-test assesses whether the observed difference between the means is likely to be due to chance (random variability) or if it’s likely to represent a true difference in the populations. The result of a t-test is usually reported as a p-value, which indicates the probability of observing the given difference (or a more extreme difference) if the null hypothesis is true. If the p-value is below a predetermined significance level (e.g., 0.05), you reject the null hypothesis in favor of the alternative hypothesis, suggesting that there is a significant difference between the groups. If the p-value is above the significance level, you fail to reject the null hypothesis, indicating that there is no significant difference.

Sep 18, 2023

Linear regression with two predictor variables:

Multiple linear regression is the term frequently used to describe linear regression with two predictor variables. In this case, we attempt to predict a single outcome (dependent) variable using two or more independent variables. Multiple linear regression takes the following general form:

y=β0+β1x1+β2x2+ϵ

where:

  • y is the dependent variable (what you’re trying to predict),
  • x1 and x2 are the two independent variables (predictors),
  • β0 is the intercept (the value of y when both x1 and x2 are zero),
  • β1 and β2 are the coefficients associated with x1 and x2, respectively, representing the change in y for a one-unit change in x1 or x2,
  • ϵ is the error term.
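
A minimal sketch of fitting such a model with scikit-learn, using synthetic data in place of real predictors (the true coefficients below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
# True relationship: y = 1.0 + 2.0*x1 - 0.5*x2 + noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=100)

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
print(f"intercept (β0): {model.intercept_:.3f}")
print(f"coefficients (β1, β2): {model.coef_.round(3)}")
```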

Overfitting: Overfitting occurs when a model becomes overly complex and fits too closely to the noise or idiosyncrasies present in the training dataset. This noise is essentially the random variations or errors that naturally exist in real-world data. When a complex model tries to capture this noise along with the actual underlying relationship between variables, it “memorizes” the peculiarities of the training data, including the noise, instead of capturing the true patterns.

In summary, an overfit model essentially fails to generalize well beyond the training data because it has become too complex and has modeled the noise rather than the actual relationship between the variables. This is a significant issue, as the ultimate goal of any regression model is to provide accurate predictions on new data, not just replicate the training data.
To mitigate overfitting, techniques such as regularization (e.g., Lasso, Ridge), cross-validation, early stopping, and using simpler models are applied. These approaches help strike a balance between capturing important patterns and avoiding fitting to noise, resulting in a model that generalizes well to new, unseen data.
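
As a minimal sketch of one such mitigation, here is Ridge regularization applied to a deliberately overfit-prone polynomial fit (synthetic data; the degree and the penalty strength alpha are arbitrary illustration choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=40)

# A degree-10 polynomial on only 40 points invites overfitting.
plain = make_pipeline(PolynomialFeatures(10), StandardScaler(),
                      LinearRegression())
ridged = make_pipeline(PolynomialFeatures(10), StandardScaler(),
                       Ridge(alpha=1.0))

for name, model in [("no penalty", plain), ("ridge", ridged)]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: mean CV MSE = {mse:.3f}")
```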

Sep 13, 2023

In today’s class I learned about the Null Hypothesis and the P-Value.

Null Hypothesis- The null hypothesis is our baseline assumption of no effect. For example, when two groups of students are compared, one using a new teaching method and the other the old one, the null hypothesis says the new approach doesn’t actually affect students’ marks. This baseline is what we test. If our data and analysis provide strong evidence against it, we reject the null hypothesis and declare that something unusual or noteworthy is occurring. If not, we fail to reject the null hypothesis and conclude, “Looks like there’s no real difference or effect.”

P-Value- A p-value helps in determining whether a study’s findings are the result of pure random chance or whether something significant is actually occurring. Consider trialing a new drug: a small p-value indicates that the drug is likely having an effect, whereas a large one indicates that the drug may not be doing much. Scientists use p-values to decide how much to trust their findings, but they must also consider how significant those findings are in the real world, beyond what the p-value alone indicates.

Sep 11, 2023

From the CDC 2018 diabetes dataset I understood that we have data on %diabetes, %obesity, and %inactivity, but only 354 rows contain information for all three variables. Of the 1370 data points for %inactivity, all also include data for %diabetes.

The correlation between %diabetes and %inactivity is about 0.44, indicating a moderate positive relationship. The %diabetes data is slightly skewed, with a kurtosis of approximately 4, while the %inactivity data is skewed in the opposite direction, with a kurtosis of about 2. A linear regression model suggests that around 20% of the variation in %diabetes can be attributed to %inactivity. However, the residuals from the linear model do not follow a normal distribution and exhibit heteroscedasticity, meaning that their variability changes with %inactivity values.

This heteroscedasticity violates a key assumption of linear regression, raising concerns about the reliability of the model. Further analysis or alternative modeling approaches may be needed for more dependable predictions.
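
A sketch of how this could be checked with statsmodels, assuming a DataFrame with hypothetical column names %inactivity and %diabetes (the names and file are my assumptions, not from the dataset itself):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("cdc_2018.csv")  # assumed file name
X = sm.add_constant(df["%inactivity"])
model = sm.OLS(df["%diabetes"], X).fit()

print(f"R^2: {model.rsquared:.3f}")  # ~0.20 per the note above

# Breusch-Pagan test: a small p-value suggests heteroscedastic residuals.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")
```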

Linear regression is a core statistical and machine learning approach that models the relationship between a dependent variable (also known as the target or outcome variable) and one or more independent variables (known as predictors or features). Simple linear regression predicts a quantitative response Y from a single predictor variable X, presuming that the relationship between X and Y is roughly linear.