NoShade.Vision

Kaggle's 30 Days Of ML (Competition Part-1): Cross Validation & First Submission on Kaggle

Abhishek-Thakur

The video covers the first part of Kaggle's 30 Days of ML competition, which involves a regression problem with RMSE as the evaluation metric. The presenter demonstrates how to use K-Fold Cross Validation to create train-validation splits and shows how to prepare and fit both random forest and XGBoost models. They discuss various issues that may arise during the modeling process and provide solutions, including removing unnecessary columns and copying data frames. Additionally, they explain the submission process on Kaggle and caution against submitting more than 10 times per day, while emphasizing the importance of writing code on one's own and trying it out in the competition.

00:00:00

In this section, the speaker introduces the Kaggle competition for the next 15 days, which is a regression problem with RMSE as the evaluation metric. They begin by creating folds with k-fold cross-validation, initializing a KFold object as kf with the number of splits set to five. Using enumerate and kf.split, the speaker extracts the training and validation indices for each fold and prints them. Finally, they create a new column called 'kfold' in the data frame and fill it in for the rows at the validation indices.
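The fold-creation step described above can be sketched roughly as follows. This is a minimal illustration, not the presenter's exact code: the toy data frame and its column names are made up, and shuffle=True with random_state=42 are assumptions the summary does not confirm.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for the competition's training data (hypothetical columns).
rng = np.random.default_rng(42)
df_train = pd.DataFrame({
    "id": np.arange(20),
    "cont0": rng.random(20),
    "target": rng.random(20),
})

# Five splits, as in the video; shuffle and random_state are assumptions.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_indices, valid_indices) in enumerate(kf.split(X=df_train)):
    # Each iteration yields positional indices for the training and validation rows.
    print(fold, len(train_indices), len(valid_indices))
    print(valid_indices)
    fold_sizes.append((len(train_indices), len(valid_indices)))
```

Every row lands in the validation set of exactly one fold, which is what makes it possible to label each row with a single fold number.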

00:05:00

In this section, the speaker explains how to create train-validation splits using K-Fold Cross Validation on a Pandas DataFrame. They assign the value -1 to the kfold column and get an error, then fix it by adding the .loc indexer. They then demonstrate how to check the distribution of the target and remove unnecessary data. Finally, they save the data to a CSV file and add it as data in a new notebook for the Kaggle competition. The speaker advises viewers to upload the dataset and make it public so others can use it, but to make sure they understand how to create it themselves using K-Fold Cross Validation.
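A minimal sketch of the fold assignment and the save step, under assumptions: the toy data, the shuffle settings, and the file name train_folds.csv are hypothetical, and the file is written to a temporary directory here only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for the training data (hypothetical columns).
rng = np.random.default_rng(42)
df_train = pd.DataFrame({
    "id": np.arange(20),
    "cont0": rng.random(20),
    "target": rng.random(20),
})

# Placeholder fold number for every row.
df_train["kfold"] = -1

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle settings are assumptions
for fold, (_, valid_indices) in enumerate(kf.split(X=df_train)):
    # .loc selects rows by label; with the default RangeIndex the labels
    # coincide with the positional indices returned by kf.split.
    df_train.loc[valid_indices, "kfold"] = fold

# Sanity checks: rows per fold and a summary of the target.
print(df_train["kfold"].value_counts().sort_index())
print(df_train["target"].describe())

# Hypothetical output name; in a notebook this would go to the working directory.
out_path = os.path.join(tempfile.gettempdir(), "train_folds.csv")
df_train.to_csv(out_path, index=False)
```

Saving the fold column once and reusing the file keeps the splits identical across notebooks, so scores from different models stay comparable.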

00:10:00

In this section, the presenter imports the required modules and data for Kaggle's 30 Days of ML competition. They remove the unnecessary imports and keep the ones needed for modeling, then read the train, test, and sample submission CSV files and explain their formats. They also apply an ordinal encoder to the object (categorical) columns and fit a random forest model. Finally, they define a loop for the 5-fold cross-validation and prepare the data for it.
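The data preparation for a single fold might look like the sketch below. The data is a toy stand-in, and picking the categorical columns by a "cat" name prefix is an assumption about how the columns are named; the encoder is fitted on the training rows only and then applied to the validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for a training file that already has a kfold column.
df = pd.DataFrame({
    "id": range(10),
    "cat0": list("ABABABABAB"),
    "cat1": list("XYZXYZXYZX"),
    "cont0": np.linspace(0.0, 1.0, 10),
    "target": np.linspace(5.0, 9.0, 10),
    "kfold": [0, 1, 2, 3, 4] * 2,
})

# Everything except bookkeeping columns and the target is a feature.
useful_features = [c for c in df.columns if c not in ("id", "target", "kfold")]
# Assumption: categorical columns can be recognized by their "cat" prefix.
object_cols = [c for c in useful_features if "cat" in c]

fold = 0
xtrain = df[df.kfold != fold].reset_index(drop=True)
xvalid = df[df.kfold == fold].reset_index(drop=True)

ytrain = xtrain.target
yvalid = xvalid.target

xtrain = xtrain[useful_features].copy()
xvalid = xvalid[useful_features].copy()

# Fit the encoder on training rows, then reuse it for validation rows.
encoder = OrdinalEncoder()
xtrain[object_cols] = encoder.fit_transform(xtrain[object_cols])
xvalid[object_cols] = encoder.transform(xvalid[object_cols])

print(xtrain.head())
```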

00:15:00

In this section, the presenter continues by writing code to fit the model on the training data, check how good the predictions are, and calculate the RMSE score. They also discuss useful features and explain how to remove certain columns from the data sets. Additionally, they show how to define the model and how to speed up training by adding n_jobs=-1. They encounter an "unknown categories in column 0 during transform" error, which they explain is caused by not copying df_test. They make this correction and continue with the training process.
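A rough sketch of the full cross-validation loop with a random forest, on made-up data. The n_estimators and random_state values are assumptions. Taking a fresh copy of the test frame in every fold is what avoids the error mentioned above: if the test frame were encoded in place during fold 0, a later fold's encoder would presumably be handed numbers instead of the original category labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder

# Toy train and test data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "id": np.arange(n),
    "cat0": rng.choice(["A", "B"], size=n),
    "cont0": rng.random(n),
})
df["target"] = 3.0 * df["cont0"] + (df["cat0"] == "B") + rng.normal(0, 0.1, n)
df["kfold"] = np.arange(n) % 5

df_test = pd.DataFrame({
    "id": np.arange(n, n + 20),
    "cat0": rng.choice(["A", "B"], size=20),
    "cont0": rng.random(20),
})

useful_features = ["cat0", "cont0"]
object_cols = ["cat0"]
df_test = df_test[useful_features]

scores = []
for fold in range(5):
    xtrain = df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()  # fresh copy so df_test keeps its raw categories

    ytrain = xtrain.target
    yvalid = xvalid.target
    xtrain = xtrain[useful_features].copy()
    xvalid = xvalid[useful_features].copy()

    encoder = OrdinalEncoder()
    xtrain[object_cols] = encoder.fit_transform(xtrain[object_cols])
    xvalid[object_cols] = encoder.transform(xvalid[object_cols])
    xtest[object_cols] = encoder.transform(xtest[object_cols])

    # n_jobs=-1 uses all available cores to speed up training.
    model = RandomForestRegressor(n_estimators=50, random_state=fold, n_jobs=-1)
    model.fit(xtrain, ytrain)

    preds_valid = model.predict(xvalid)
    rmse = np.sqrt(mean_squared_error(yvalid, preds_valid))
    print(fold, rmse)
    scores.append(rmse)
```

The per-fold RMSE values give a rough idea of how stable the model is across splits before anything is submitted.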

00:20:00

In this section, the speaker switches from the random forest model, which was taking too long to run, to an XGBoost regressor. The XGBoost model is much faster and gives a better fold 0 score. The speaker also shows how to save the final predictions and convert them from a list of lists to an array. The final predictions can be used to generate a submission file for the competition.
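The conversion from a list of per-fold predictions to a single array can be sketched with NumPy alone. The numbers below are made up, and averaging across folds is an assumption here (a common choice); the summary only says the list of lists is converted to an array.

```python
import numpy as np

# One list of test-set predictions per fold (made-up values, 3 folds x 3 rows).
final_predictions = [
    [8.1, 7.9, 8.4],
    [8.3, 7.7, 8.2],
    [8.2, 8.0, 8.3],
]

# Stack so that each row is a test example and each column is a fold.
stacked = np.column_stack(final_predictions)
print(stacked.shape)

# Assumption: combine the folds by taking the mean prediction per test row.
preds = np.mean(stacked, axis=1)
print(preds)
```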

00:25:00

In this section, the speaker discusses their submission process on Kaggle and explains the code used to train the model, calculate predictions on the validation and test sets, and save them in a list. They also caution against submitting more than 10 times per day and explain that the leaderboard score is based on approximately 25% of the data. The speaker emphasizes that the goal is to write the code on your own and try it out in the competition. They also mention that they will publish their notebook for others to reference.
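Writing the submission file could look like the following sketch. The sample submission frame, its id values, the predictions, and the file name submission.csv are all hypothetical stand-ins; the file is written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the competition's sample submission file (hypothetical values).
sample_submission = pd.DataFrame({
    "id": [0, 5, 15],
    "target": [0.5, 0.5, 0.5],
})

# Stand-in for the final test-set predictions, one per row of the sample file.
preds = np.array([8.2, 7.9, 8.3])

# Overwrite the placeholder target with the model's predictions.
sample_submission["target"] = preds

# index=False keeps the file to the two expected columns.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
sample_submission.to_csv(out_path, index=False)

print(pd.read_csv(out_path).head())
```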
