You are developing a training pipeline for a new XGBoost classification model based on tabular data. The data is stored in a BigQuery table. You need to complete the following steps: 1. Randomly split the data into training and evaluation datasets in a 65/35 ratio 2. Conduct feature engineering 3. Obtain metrics for the evaluation dataset 4. Compare models trained in different pipeline executions How should you execute these steps?