Cross validation

Cross-validation using the k-fold technique provides a way to evaluate the performance of a machine learning model. We divide our data into k equal portions, or folds, train the model on k − 1 of them, and test it on the remaining fold. This process is repeated k times, with each iteration testing on a different fold.

For instance, with 5-fold cross-validation we split the data into 5 parts, train the model on 4 of them, and test it on the remaining one. We repeat this process 5 times so that each part is used for testing exactly once, then average the results from all 5 tests to assess how well the model performs overall.
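As a rough sketch of what this looks like in code (assuming scikit-learn; the data below is synthetic, standing in for a real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# 5-fold cross-validation: train on 4 folds, test on the 5th, repeat 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores)
print("average R^2:", scores.mean())
```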

Resampling methods 

The video clips covered resampling methods, focusing in particular on estimating prediction error and on using validation sets for model evaluation. One of the techniques discussed was k-fold cross-validation, which is widely used to assess model performance and ensure reliability. The emphasis was on applying cross-validation correctly, highlighting sound approaches and warning against common mistakes (for example, selecting features using the full dataset before splitting) that can bias the results.
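The simplest of these ideas, estimating prediction error with a held-out validation set, might look like the sketch below (scikit-learn and the synthetic data are my assumptions, not details from the videos):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 30% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

# Train on the training portion, estimate prediction error on the validation portion
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print("estimated prediction error (validation MSE):", val_mse)
```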

Understanding these ideas is crucial for measuring and validating models, which plays a central role in effective data analysis and decision making. By mastering these techniques we can gain deeper insight and develop more precise predictive models. Overall, the videos provided a useful framework for comprehending the complexities of resampling methods.

Crab Molt Data

In class today, we looked at data about how crabs shed their hard outer shells (molt) in order to grow. We wanted to predict a crab’s size before molting from its size after molting. We made a graph and found that our predictions were quite close (an R² score of 0.98 out of 1). We also noticed some statistics about the sizes before and after molting: the post-molt data leaned a bit to the left (skewness −2.3469) and had a peaky distribution (kurtosis 13.116), while the pre-molt data was also a bit left-leaning (skewness −2.00349) but less peaky (kurtosis 9.76632).
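Here is a sketch of how statistics like these could be computed, assuming Python with SciPy and scikit-learn; the arrays below are synthetic stand-ins for the real molt measurements, and note that SciPy reports excess kurtosis by default:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the pre- and post-molt sizes (illustrative only)
rng = np.random.default_rng(2)
postmolt = rng.normal(loc=144.0, scale=10.0, size=100)
premolt = postmolt - rng.normal(loc=14.0, scale=2.0, size=100)

# Shape statistics for each distribution
print("post-molt skewness:", stats.skew(postmolt))
print("post-molt kurtosis:", stats.kurtosis(postmolt))  # excess kurtosis by default
print("pre-molt skewness:", stats.skew(premolt))
print("pre-molt kurtosis:", stats.kurtosis(premolt))

# Predict pre-molt size from post-molt size and report R^2
model = LinearRegression().fit(postmolt.reshape(-1, 1), premolt)
print("R^2:", model.score(postmolt.reshape(-1, 1), premolt))
```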

Both distributions looked fairly similar on a graph, but with a slight difference in average size. We ran a t-test to see whether the mean sizes were really different, and it showed that they were. We also used a Monte Carlo method to double-check this difference.
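The exact Monte Carlo procedure isn’t described above, so the sketch below uses one common variant, a label-shuffling (permutation) test, alongside the t-test; the data is again a synthetic stand-in:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the pre- and post-molt sizes (illustrative only)
rng = np.random.default_rng(3)
postmolt = rng.normal(loc=144.0, scale=10.0, size=100)
premolt = postmolt - rng.normal(loc=14.0, scale=2.0, size=100)

# Classic two-sample t-test on the difference in means
t_stat, p_value = stats.ttest_ind(premolt, postmolt)
print("t =", t_stat, "p =", p_value)

# Monte Carlo check: shuffle the group labels many times and count how often
# a difference in means at least as large as the observed one arises by chance
observed = abs(premolt.mean() - postmolt.mean())
combined = np.concatenate([premolt, postmolt])
n = len(premolt)
trials = 10_000
count = 0
for _ in range(trials):
    rng.shuffle(combined)
    diff = abs(combined[:n].mean() - combined[n:].mean())
    if diff >= observed:
        count += 1
print("Monte Carlo p-value estimate:", count / trials)
```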

September 18

In today’s discussion, multiple linear regression served as an indispensable tool for delving into complex relationships between a dependent variable and multiple independent variables. Unlike simple linear regression, which is limited to exploring the relationship between two variables, this method allows for a more detailed understanding by incorporating multiple predictors. By estimating a unique coefficient for each predictor, the model provides invaluable insight into how each variable, individually and collectively, influences the outcome variable.
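In symbols, the model is Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε, with one coefficient per predictor. A minimal sketch of fitting such a model with statsmodels (the data here is synthetic and purely illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: one outcome, three predictors (illustrative only)
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 3))
y = 2.0 + 1.0 * X[:, 0] - 0.5 * X[:, 1] + 0.25 * X[:, 2] + rng.normal(scale=0.4, size=150)

# Fit y = b0 + b1*x1 + b2*x2 + b3*x3 and inspect one coefficient per predictor
X_design = sm.add_constant(X)
model = sm.OLS(y, X_design).fit()
print(model.summary())
```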

P-value

In our last class, we looked at some key ideas in statistics, mainly focused on regression.

First, there’s the null hypothesis, which in this context is the starting assumption that the error variance does not change across the different predictors. Then we have the p-value, which tells us how likely we would be to see results like ours if that assumption were true. If the p-value is very low (usually less than 0.05), it suggests our initial assumption might be wrong. Lastly, we discussed the Breusch-Pagan test, which checks whether the error variance is tied to specific predictors.
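A minimal sketch of running the Breusch-Pagan test with statsmodels (the data here is synthetic and deliberately heteroscedastic, built so the error variance grows with the predictor):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic regression whose error variance grows with x (heteroscedastic)
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2 * x, size=200)

# Fit an ordinary least-squares model
X_design = sm.add_constant(x)
fit = sm.OLS(y, X_design).fit()

# Null hypothesis: constant error variance. A small p-value rejects it.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X_design)
print("Breusch-Pagan p-value:", lm_pvalue)
```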

September 11, 2023

Dataset Summary:

I’ve looked at the 2018 CDC diabetes data, which has 354 records covering diabetes, obesity, and inactivity. I noticed the diabetes data might not follow a typical (normal) distribution. I also plan to study the inactivity data, but I haven’t shared those details yet.
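One quick way to check whether a variable looks normally distributed, sketched here with a synthetic stand-in for the diabetes percentages (the real records would be loaded from the CDC file instead):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the 354 county-level diabetes values (illustrative only)
rng = np.random.default_rng(6)
diabetes = pd.Series(rng.gamma(shape=9.0, scale=1.0, size=354))

# Shape statistics and a formal normality test
print("skewness:", diabetes.skew())
print("excess kurtosis:", diabetes.kurt())
stat, p = stats.shapiro(diabetes)
print("Shapiro-Wilk p-value:", p)  # a small p-value suggests the data is not normal
```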

Linear Regression in Simple Terms:

Linear regression is like drawing a straight line through the dots on a scatter plot to capture a pattern. The line helps us understand how the variables are connected and make predictions about new data. We need to be careful, though, and check that a straight line is really the right model for our data.
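A minimal sketch of “drawing the line” with NumPy’s least-squares fit (the dots here are synthetic):

```python
import numpy as np

# Dots on a graph: synthetic (x, y) points with a roughly linear pattern
rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)

# "Drawing the line": least-squares fit of slope and intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")

# Using the line to guess a future value
x_new = 12.0
print("prediction at x = 12:", slope * x_new + intercept)
```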