Predict Pipeline module

quick_pp.machine_learning.predict_pipeline.load_data(hash: str) → DataFrame

Load data from the specified directory using a hash to identify the file.

Parameters: hash (str) – A unique hash string contained within the target Parquet filename.
Raises: FileNotFoundError – If no file is found with the specified hash.
Returns: The loaded well log data as a DataFrame.
Return type: pd.DataFrame
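
The hash-based lookup described above could be sketched as follows. This is only an illustration, not the quick_pp implementation: the helper name `find_parquet_by_hash`, the `data_dir` argument, and the `*.parquet` glob pattern are all assumptions.

```python
from pathlib import Path


def find_parquet_by_hash(data_dir: str, hash: str) -> Path:
    """Return the first Parquet file in data_dir whose name contains hash.

    Hypothetical helper: the directory layout and glob pattern are
    assumptions, not taken from quick_pp.
    """
    matches = sorted(Path(data_dir).glob(f"*{hash}*.parquet"))
    if not matches:
        raise FileNotFoundError(
            f"No Parquet file containing '{hash}' found in {data_dir}"
        )
    return matches[0]
```

The returned path could then be handed to `pandas.read_parquet` to obtain the DataFrame.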

quick_pp.machine_learning.predict_pipeline.preprocess_data(df: DataFrame) → DataFrame

Preprocess the input DataFrame by generating engineered features.

Parameters: df (pd.DataFrame) – The raw input DataFrame.
Raises: ValueError – If required columns for feature engineering are missing.
Returns: The DataFrame with added feature-engineered columns.
Return type: pd.DataFrame
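
A required-column check of the kind that raises the ValueError above might look like the sketch below. The column names in `REQUIRED_COLUMNS` and the helper `check_required_columns` are hypothetical; the actual set of required curves is defined by the quick_pp feature-engineering step.

```python
import pandas as pd

# Hypothetical list of required curves; the real set depends on the
# feature-engineering step in quick_pp.
REQUIRED_COLUMNS = ["GR", "RT", "NPHI", "RHOB"]


def check_required_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> None:
    """Raise ValueError naming any required columns absent from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
```

Reporting every missing column at once, rather than failing on the first, lets the caller fix the input in a single pass.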

quick_pp.machine_learning.predict_pipeline.postprocess_data(df: DataFrame) → DataFrame

Postprocess the DataFrame by inverting LOG_PERM to PERM if needed. This function also calculates hydrocarbon volumes and corrects for fluid segregation.

Parameters:

df (pd.DataFrame) – The DataFrame containing model predictions.

Returns:

The postprocessed DataFrame with added ‘PERM’, ‘VHC’, ‘VOIL’, and ‘VGAS’ columns.

Return type:

pd.DataFrame
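
The "invert LOG_PERM to PERM if needed" step could be sketched as below. The sketch assumes that LOG_PERM holds the base-10 logarithm of permeability and that "if needed" means PERM is not already present; neither assumption is confirmed by this reference, and `invert_log_perm` is a hypothetical name. The hydrocarbon volume and fluid segregation steps are not shown.

```python
import numpy as np
import pandas as pd


def invert_log_perm(df: pd.DataFrame) -> pd.DataFrame:
    """Add a PERM column from LOG_PERM when PERM is not already present.

    Assumes LOG_PERM = log10(PERM); quick_pp may use a different base.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    if "PERM" not in out.columns and "LOG_PERM" in out.columns:
        out["PERM"] = np.power(10.0, out["LOG_PERM"])
    return out
```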

quick_pp.machine_learning.predict_pipeline.save_predictions(pred_df: DataFrame, output_file_name: str, plot: bool = False)

Save the predictions DataFrame to a Parquet file. Optionally, generate and save individual well log plots.

Parameters:

pred_df (pd.DataFrame) – DataFrame containing predictions.
output_file_name (str) – The base name for the output Parquet and plot files.
plot (bool, optional) – If True, generate and save well log plots. Defaults to False.

quick_pp.machine_learning.predict_pipeline.predict_pipeline(model_config: str, data_hash: str, output_file_name: str, env: str = 'local', plot_predictions: bool = False) → None

Execute the end-to-end prediction pipeline.

This function orchestrates loading data, preprocessing, loading the latest registered MLflow models, making predictions, postprocessing the results, and saving the output.

Parameters:

model_config (str) – The key for the model configuration (e.g., ‘clastic’, ‘carbonate’).
data_hash (str) – The unique hash identifying the input data file.
output_file_name (str) – The base name for the output predictions file.
env (str, optional) – Environment for MLflow server. Defaults to ‘local’.
plot_predictions (bool, optional) – If True, generate plots for each well. Defaults to False.

Train Pipeline module

quick_pp.machine_learning.train_pipeline.load_data(hash: str)

Load data from the specified directory using a hash to identify the file.

Parameters: hash (str) – A unique hash string contained within the target Parquet filename.
Raises: FileNotFoundError – If no file is found with the specified hash.
Returns: The loaded well log data as a DataFrame.
Return type: pd.DataFrame

quick_pp.machine_learning.train_pipeline.preprocess_data(df: DataFrame) → DataFrame

Preprocess the DataFrame by generating features and cleaning the data.

Parameters:

df (pd.DataFrame) – The raw input DataFrame.

Returns:

The preprocessed DataFrame with engineered features and duplicates removed.

Return type:

pd.DataFrame

quick_pp.machine_learning.train_pipeline.split_data(df: DataFrame, target_column: list[str], features: list[str], test_size=0.2, random_state=42) → list

Split the data into training and testing sets.

Parameters:

df (pd.DataFrame) – The DataFrame to be split.
target_column (list[str]) – List of target column names.
features (list[str]) – List of feature column names.
test_size (float, optional) – Proportion of the dataset to include in the test split. Defaults to 0.2.
random_state (int, optional) – Random seed for reproducibility. Defaults to 42.

Returns:

A list containing [X_train, X_test, y_train, y_test].

Return type:

list
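
A split with this signature and return shape can be sketched with scikit-learn's `train_test_split`. Whether quick_pp uses that function internally is an assumption; `split_data_sketch` is a hypothetical stand-in, not the library function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target_column, features, test_size=0.2, random_state=42):
    """Split df and return [X_train, X_test, y_train, y_test]."""
    X = df[features]        # feature columns only
    y = df[target_column]   # one or more target columns
    # train_test_split returns a list in exactly this order.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)
```

Because `target_column` is a list, `y_train` and `y_test` are DataFrames rather than Series, which matches the types documented for the training and evaluation functions.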

quick_pp.machine_learning.train_pipeline.train_model(alg, X_train: DataFrame, y_train: DataFrame)

Train the model using the specified algorithm.

Parameters:

alg (callable) – The scikit-learn model class to instantiate.
X_train (pd.DataFrame) – The training feature DataFrame.
y_train (pd.DataFrame) – The training target DataFrame.

Returns:

The trained model instance.

Return type:

object

quick_pp.machine_learning.train_pipeline.evaluate_model(model, X_test: DataFrame, y_test: DataFrame) → dict

Evaluate the model using the test data.

Parameters:

model (object) – The trained model to evaluate.
X_test (pd.DataFrame) – The testing feature DataFrame.
y_test (pd.DataFrame) – The testing target DataFrame.

Returns:

A dictionary of evaluation metrics (e.g., ‘f1_score’, ‘r2_score’).

Return type:

dict
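
The train-then-evaluate pattern described by train_model and evaluate_model could look like the sketch below for a regression target. It assumes the model class is instantiated with default hyperparameters and that only an R² metric is reported; the function names, the toy data, and the choice of `LinearRegression` are illustrative assumptions rather than quick_pp behavior.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def train_model_sketch(alg, X_train, y_train):
    """Instantiate the model class and fit it on the training data."""
    model = alg()  # default hyperparameters (an assumption)
    model.fit(X_train, y_train)
    return model


def evaluate_regressor_sketch(model, X_test, y_test) -> dict:
    """Return a metrics dict for a regression model."""
    y_pred = model.predict(X_test)
    return {"r2_score": r2_score(y_test, y_pred)}


# Toy data following y = 2x + 1, so a linear model fits it exactly.
X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = pd.DataFrame({"y": [1.0, 3.0, 5.0, 7.0]})
model = train_model_sketch(LinearRegression, X, y)
metrics = evaluate_regressor_sketch(model, X, y)
```

A classification target would report a metric such as `sklearn.metrics.f1_score` in the same dictionary shape.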

quick_pp.machine_learning.train_pipeline.train_pipeline(model_config: str, data_hash: str, env: str = 'local')

Execute the end-to-end model training pipeline.

This function automates training, evaluating, and logging multiple models as defined in the configuration, leveraging MLflow for experiment tracking and model management.

Parameters:

model_config (str) – The key for the model configuration (e.g., ‘clastic’).
data_hash (str) – The unique hash identifying the data file.
env (str, optional) – The MLflow environment (‘local’ or ‘remote’). Defaults to ‘local’.

Raises:

TypeError – If the targets or features are not lists of strings.
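
The list-of-strings check behind this TypeError could be sketched as follows. The helper name `validate_str_list` and the error message are hypothetical.

```python
def validate_str_list(value, name: str) -> None:
    """Raise TypeError unless value is a list whose items are all strings."""
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise TypeError(f"{name} must be a list of strings")
```

Under this check a bare string such as `"PHIT"` is rejected even though it is iterable, so a single target has to be written as `["PHIT"]`.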

Utils module

quick_pp.machine_learning.utils.unique_id(df: DataFrame) → str

Generate a unique ID for the DataFrame based on its content.

Parameters: df (pd.DataFrame) – DataFrame to hash.
Returns: An 8-character unique hexadecimal ID for the DataFrame.
Return type: str
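
One way to derive an 8-character hexadecimal ID from DataFrame content is sketched below. The hashing scheme (pandas row hashes fed into SHA-256, truncated to 8 characters) is an assumption and will not necessarily reproduce the IDs quick_pp generates.

```python
import hashlib

import pandas as pd


def unique_id_sketch(df: pd.DataFrame) -> str:
    """Return an 8-character hex ID derived from the DataFrame's values and index."""
    # One uint64 hash per row, covering the row values and the index label.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:8]
```

Eight hex characters give about 4.3 billion possible IDs, so collisions are unlikely for a modest number of datasets but not impossible.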

quick_pp.machine_learning.utils.is_mlflow_server_running(host, port)

Check if the MLflow server is running on the specified host and port.

Parameters:

host (str) – Hostname or IP address of the MLflow server.
port (int) – Port number of the MLflow server.

Returns:

True if the server is running, False otherwise.

Return type:

bool
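
A host/port liveness check of this kind is commonly done with a plain TCP connection attempt, as in the sketch below. Whether quick_pp uses a socket or an HTTP request is not stated here; `is_port_open` and its `timeout` parameter are hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A successful connection only shows that something is listening on the port, not that the listener is an MLflow server.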

quick_pp.machine_learning.utils.run_mlflow_server(env)

Start an MLflow tracking server if not already running.

This function checks for a running MLflow server based on the environment configuration and sets the MLflow tracking URI accordingly.

Parameters: env (str) – The environment key to select the MLflow server configuration from MLFLOW_CONFIG.
Raises: KeyError – If the specified environment is not found in MLFLOW_CONFIG.
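
The environment lookup and the resulting KeyError can be illustrated with the sketch below. The contents of `MLFLOW_CONFIG` shown here (keys, hosts, ports) are made up for the example, and `tracking_uri_for` is a hypothetical helper; the real configuration lives in quick_pp.

```python
# Hypothetical configuration; the real MLFLOW_CONFIG keys and values may differ.
MLFLOW_CONFIG = {
    "local": {"host": "localhost", "port": 5000},
    "remote": {"host": "mlflow.example.com", "port": 5000},
}


def tracking_uri_for(env: str) -> str:
    """Build a tracking URI for env; an unknown env raises KeyError."""
    cfg = MLFLOW_CONFIG[env]  # plain dict lookup, so KeyError propagates
    return f"http://{cfg['host']}:{cfg['port']}"
```

The resulting string is the kind of value that `mlflow.set_tracking_uri` accepts.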

quick_pp.machine_learning.utils.get_model_info(registered_model)

Extract key information from a registered MLflow model version object.

Parameters: registered_model (list[mlflow.entities.model_registry.ModelVersion]) – A list containing one or more registered model version objects. This function processes the first one.
Returns: A dictionary containing the model’s name, run ID, version, URI, and stage.
Return type: dict
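
The extraction described above could be sketched as follows. The dictionary keys and the `models:/<name>/<version>` URI form are assumptions about the output, and a `SimpleNamespace` stands in for a ModelVersion object so that the example runs without MLflow installed; the attribute names (`name`, `run_id`, `version`, `current_stage`) mirror those on MLflow's ModelVersion entity.

```python
from types import SimpleNamespace


def get_model_info_sketch(registered_model) -> dict:
    """Summarize the first model version in the list as a plain dict."""
    mv = registered_model[0]  # only the first entry is processed
    return {
        "name": mv.name,
        "run_id": mv.run_id,
        "version": mv.version,
        "uri": f"models:/{mv.name}/{mv.version}",  # assumed URI form
        "stage": mv.current_stage,
    }


# Stand-in for a ModelVersion object; all values are made up.
fake_version = SimpleNamespace(
    name="clastic_phit", run_id="abc123", version="3", current_stage="None"
)
info = get_model_info_sketch([fake_version])
```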

quick_pp.machine_learning.utils.get_latest_registered_models(client: MlflowClient, experiment_name: str, data_hash: str) → dict

Get the latest versions of registered models from MLflow for a given experiment and data hash.

Parameters:

client (MlflowClient) – MLflow client to interact with the tracking server.
experiment_name (str) – Name of the experiment to filter registered models.
data_hash (str) – The unique hash of the data used to train the models.

Returns:

A dictionary where keys are registered model names and values are dicts of their details.

Return type:

dict