Predict Pipeline module
- quick_pp.machine_learning.predict_pipeline.load_data(hash: str) → DataFrame
Load data from the specified directory using a hash to identify the file.
- Parameters:
hash (str) – A unique hash string contained within the target Parquet filename.
- Raises:
FileNotFoundError – If no file is found with the specified hash.
- Returns:
The loaded well log data as a DataFrame.
- Return type:
pd.DataFrame
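
The hash-based file lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the quick_pp implementation: the helper name `find_parquet_by_hash`, the `data_dir` argument, and the `*{hash}*.parquet` glob pattern are all assumptions.

```python
from pathlib import Path


def find_parquet_by_hash(data_dir: str, data_hash: str) -> Path:
    """Return the first Parquet file in data_dir whose name contains data_hash."""
    # Hypothetical pattern: any file name that contains the hash and ends in .parquet.
    matches = sorted(Path(data_dir).glob(f"*{data_hash}*.parquet"))
    if not matches:
        raise FileNotFoundError(
            f"No Parquet file containing hash '{data_hash}' found in {data_dir}"
        )
    return matches[0]
```

The returned path could then be passed to `pandas.read_parquet` to obtain the DataFrame.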
- quick_pp.machine_learning.predict_pipeline.preprocess_data(df: DataFrame) → DataFrame
Preprocess the input DataFrame by generating engineered features.
- Parameters:
df (pd.DataFrame) – The raw input DataFrame.
- Raises:
ValueError – If required columns for feature engineering are missing.
- Returns:
The DataFrame with added feature-engineered columns.
- Return type:
pd.DataFrame
- quick_pp.machine_learning.predict_pipeline.postprocess_data(df: DataFrame) → DataFrame
Postprocess the DataFrame by inverting LOG_PERM to PERM if needed. This function also calculates hydrocarbon volumes and corrects for fluid segregation.
- Parameters:
df (pd.DataFrame) – The DataFrame containing model predictions.
- Returns:
The postprocessed DataFrame with added ‘PERM’, ‘VHC’, ‘VOIL’, and ‘VGAS’ columns.
- Return type:
pd.DataFrame
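
The LOG_PERM inversion step can be sketched as follows. This is a hedged illustration only: it assumes LOG_PERM is a base-10 logarithm (the base is not stated here), it reads "if needed" as "LOG_PERM is present and PERM is not", and the helper name `invert_log_perm` is hypothetical. The hydrocarbon volume and fluid segregation steps are not covered.

```python
import pandas as pd


def invert_log_perm(df: pd.DataFrame) -> pd.DataFrame:
    """Add a PERM column from LOG_PERM when PERM is not already present."""
    out = df.copy()
    # Assumption: LOG_PERM holds log10(PERM); the actual base is not documented.
    if "LOG_PERM" in out.columns and "PERM" not in out.columns:
        out["PERM"] = 10.0 ** out["LOG_PERM"]
    return out


example = pd.DataFrame({"LOG_PERM": [0.0, 1.0, 2.0]})
result = invert_log_perm(example)
```

Working on a copy leaves the caller's DataFrame unchanged, which keeps the step safe to re-run.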
- quick_pp.machine_learning.predict_pipeline.save_predictions(pred_df: DataFrame, output_file_name: str, plot: bool = False)
Save the predictions DataFrame to a Parquet file. Optionally, generate and save individual well log plots.
- quick_pp.machine_learning.predict_pipeline.predict_pipeline(model_config: str, data_hash: str, output_file_name: str, env: str = 'local', plot_predictions: bool = False) → None
Execute the end-to-end prediction pipeline.
This function orchestrates loading data, preprocessing, loading the latest registered MLflow models, making predictions, postprocessing the results, and saving the output.
- Parameters:
model_config (str) – The key for the model configuration (e.g., ‘clastic’, ‘carbonate’).
data_hash (str) – The unique hash identifying the input data file.
output_file_name (str) – The base name for the output predictions file.
env (str, optional) – Environment for MLflow server. Defaults to ‘local’.
plot_predictions (bool, optional) – If True, generate plots for each well. Defaults to False.
Train Pipeline module
- quick_pp.machine_learning.train_pipeline.load_data(hash: str)
Load data from the specified directory using a hash to identify the file.
- Parameters:
hash (str) – A unique hash string contained within the target Parquet filename.
- Raises:
FileNotFoundError – If no file is found with the specified hash.
- Returns:
The loaded well log data as a DataFrame.
- Return type:
pd.DataFrame
- quick_pp.machine_learning.train_pipeline.preprocess_data(df: DataFrame) → DataFrame
Preprocess the DataFrame by generating features and cleaning the data.
- Parameters:
df (pd.DataFrame) – The raw input DataFrame.
- Returns:
The preprocessed DataFrame with engineered features and duplicates removed.
- Return type:
pd.DataFrame
- quick_pp.machine_learning.train_pipeline.split_data(df: DataFrame, target_column: list[str], features: list[str], test_size=0.2, random_state=42) → list
Split the data into training and testing sets.
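- Parameters:
df (pd.DataFrame) – The input DataFrame.
target_column (list[str]) – The names of the target columns.
features (list[str]) – The names of the feature columns.
test_size (optional) – The size of the test set. Defaults to 0.2.
random_state (optional) – The seed for the random split. Defaults to 42.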
- Returns:
A list containing [X_train, X_test, y_train, y_test].
- Return type:
list
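
A split with this signature and return order can be sketched with scikit-learn's `train_test_split`. Whether quick_pp uses that function internally is an assumption; the helper name `split_features_targets` and the column names `GR`, `RHOB`, and `PHIT` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_targets(df, target_column, features, test_size=0.2, random_state=42):
    """Split df into [X_train, X_test, y_train, y_test]."""
    X = df[features]
    y = df[target_column]
    # train_test_split returns X_train, X_test, y_train, y_test in that order.
    return list(train_test_split(X, y, test_size=test_size, random_state=random_state))


# Illustrative data: two feature columns and one target column.
df = pd.DataFrame({"GR": range(10), "RHOB": range(10, 20), "PHIT": range(20, 30)})
X_train, X_test, y_train, y_test = split_features_targets(df, ["PHIT"], ["GR", "RHOB"])
```

Fixing `random_state` makes the split reproducible across runs, which matters when comparing models trained on the same data hash.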
- quick_pp.machine_learning.train_pipeline.train_model(alg, X_train: DataFrame, y_train: DataFrame)
Train the model using the specified algorithm.
- Parameters:
alg (callable) – The scikit-learn model class to instantiate.
X_train (pd.DataFrame) – The training feature DataFrame.
y_train (pd.DataFrame) – The training target DataFrame.
- Returns:
The trained model instance.
- Return type:
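
Passing a model class rather than an instance can be sketched as below. It assumes `alg` is called with no arguments before fitting, which may differ from how quick_pp configures its models; the helper name `fit_model` is hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_model(alg, X_train: pd.DataFrame, y_train: pd.DataFrame):
    """Instantiate alg with default arguments, fit it, and return the instance."""
    model = alg()  # assumption: no constructor arguments are needed
    model.fit(X_train, y_train)
    return model


# Toy data following y = 2x + 1.
X_train = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y_train = pd.DataFrame({"y": [1.0, 3.0, 5.0, 7.0]})
model = fit_model(LinearRegression, X_train, y_train)
```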
- quick_pp.machine_learning.train_pipeline.evaluate_model(model, X_test: DataFrame, y_test: DataFrame) → dict
Evaluate the model using the test data.
- quick_pp.machine_learning.train_pipeline.train_pipeline(model_config: str, data_hash: str, env: str = 'local')
Execute the end-to-end model training pipeline.
This function automates training, evaluating, and logging multiple models as defined in the configuration, leveraging MLflow for experiment tracking and model management.
- Parameters:
- Raises:
TypeError – If the targets or features are not lists of strings.
Utils module
- quick_pp.machine_learning.utils.unique_id(df: DataFrame) → str
Generate a unique ID for the DataFrame based on its content.
- Parameters:
df (pd.DataFrame) – DataFrame to hash.
- Returns:
An 8-character unique hexadecimal ID for the DataFrame.
- Return type:
str
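
One way to derive a short content-based ID is sketched below. The choice of `pd.util.hash_pandas_object` and SHA-256 is an assumption, not necessarily what `unique_id` does; the helper name `content_id` and the column names are illustrative.

```python
import hashlib

import pandas as pd


def content_id(df: pd.DataFrame) -> str:
    """Return an 8-character hex digest derived from the DataFrame's values and index."""
    # hash_pandas_object yields one uint64 per row; hashing those bytes summarises the frame.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:8]


a = pd.DataFrame({"DEPTH": [1000.0, 1000.5], "GR": [45.0, 60.0]})
b = pd.DataFrame({"DEPTH": [1000.0, 1000.5], "GR": [45.0, 61.0]})
```

Identical content gives the same ID, so the value can serve as the hash embedded in a Parquet filename. Truncating to 8 characters trades collision resistance for readability.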
- quick_pp.machine_learning.utils.is_mlflow_server_running(host, port)
Check if the MLflow server is running on the specified host and port.
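
A host-and-port check of this kind is commonly done by attempting a TCP connection. The sketch below takes that approach as an assumption about the technique; the helper name `is_port_open` and the `timeout` argument are hypothetical. An open port only shows that something is listening, not that it is an MLflow server.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```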
- quick_pp.machine_learning.utils.run_mlflow_server(env)
Start an MLflow tracking server if not already running.
This function checks for a running MLflow server based on the environment configuration and sets the MLflow tracking URI accordingly.
- quick_pp.machine_learning.utils.get_model_info(registered_model)
Extract key information from a registered MLflow model version object.
- quick_pp.machine_learning.utils.get_latest_registered_models(client: MlflowClient, experiment_name: str, data_hash: str) → dict
Get the latest versions of registered models from MLflow for a given experiment and data hash.
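- Parameters:
client (MlflowClient) – The MLflow client used to query registered models.
experiment_name (str) – The name of the MLflow experiment.
data_hash (str) – The data hash used to select the models.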
- Returns:
A dictionary where keys are registered model names and values are dicts of their details.
- Return type:
dict
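
The shape of the returned dictionary can be illustrated without MLflow. In this sketch plain dicts stand in for MLflow model version objects, and the helper name `latest_by_name`, the record fields, and the model names are all hypothetical. It shows only the "keep the highest version per model name" step, not the experiment or data hash filtering.

```python
def latest_by_name(versions):
    """Map each model name to the record with the highest version number."""
    latest = {}
    for record in versions:
        name = record["name"]
        # Versions are compared numerically so that "10" ranks above "9".
        if name not in latest or int(record["version"]) > int(latest[name]["version"]):
            latest[name] = record
    return latest


# Stand-in records; real entries would come from the MLflow model registry.
records = [
    {"name": "clastic_PHIT", "version": "1", "run_id": "run-a"},
    {"name": "clastic_PHIT", "version": "10", "run_id": "run-b"},
    {"name": "clastic_PERM", "version": "9", "run_id": "run-c"},
]
latest = latest_by_name(records)
```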