- move models to separate repository

Giuseppe Nucifora 2024-12-10 23:29:54 +01:00
commit 56c1c3cc1f
69 changed files with 49903 additions and 0 deletions

.dvc/.gitignore (new file, 3 lines)

@@ -0,0 +1,3 @@
/config.local
/tmp
/cache

.dvc/config (new file, 6 lines)

@@ -0,0 +1,6 @@
[core]
autostage = true
remote = storage
['remote "storage"']
url = s3://olive-oil-dataset
region = eu-west-1
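
The remote above is the S3 bucket that backs the DVC-tracked datasets and model binaries. Below is a minimal sketch (not part of the repository) of reading one of the tracked parquet files through DVC's Python API using the default "storage" remote configured here; the tracked path is an assumption taken from the training script's defaults.

```python
# Sketch only: read a DVC-tracked parquet via the default remote configured above.
# The tracked path is an assumed example based on the training script's defaults.
import io

import dvc.api
import pandas as pd

data = dvc.api.read("sources/weather_data_solarenergy.parquet", mode="rb")
weather = pd.read_parquet(io.BytesIO(data))
print(weather.shape)
```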

.dvcignore (new file, 3 lines)

@@ -0,0 +1,3 @@
!*.h5
!*.keras
!*.parquet

.gitignore (new file, 5 lines)

@@ -0,0 +1,5 @@
/sources
.idea
*.parquet
*.h5
*.keras

README.md (new file, 49 lines)

@@ -0,0 +1,49 @@
# Olive Oil Transformer Model

This repository contains transformer-based models for several forecasting tasks, including olive oil production. The sections below describe the key components of the project.

## Project Structure

### Model Notebooks Location

The model notebooks are located in the `/models` directory, organized by prediction task:

- **Olive Oil Model**: `/models/olive_oli/olive_oil-v2.ipynb`
  - Implementation of the transformer model for olive oil production forecasting
  - Includes model training, evaluation, and visualization components
- **Solar Energy Model**: `/models/solarenergy/solarenergy_model_v1.ipynb`
  - Transformer model for solar energy prediction
- **Solar Radiation Model**: `/models/solarradiation/solarradiation_model.ipynb`
  - Implementation for solar radiation forecasting
- **UV Index Model**: `/models/uv_index/uv_index_model.ipynb`
  - Model for UV index prediction

### Synthetic Data Generation

The script that generates the synthetic training data is located at ```/olive_oil_train_dataset/create_train_dataset.py```. It creates the synthetic data used to train the olive oil production model.
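
The generator can also be driven programmatically. The sketch below is illustrative rather than part of the repository: it assumes the `olive_oil_train_dataset` directory is importable, and it uses the function signature defined in `create_train_dataset.py` (shown later in this commit) with small, arbitrary run sizes.

```python
# Illustrative only: call the generator from Python instead of the CLI.
# Assumes olive_oil_train_dataset/ is importable; run sizes are arbitrary.
import pandas as pd
from olive_oil_train_dataset.create_train_dataset import simulate_olive_production_parallel

weather = pd.read_parquet("./sources/weather_data_solarenergy.parquet")
varieties = pd.read_parquet("./sources/olive_varieties.parquet")

df = simulate_olive_production_parallel(
    weather_data=weather,
    olive_varieties=varieties,
    num_simulations=1_000,   # small smoke-test run
    num_zones=5,
    random_seed=42,
    batch_size=500,
    output_path="./sources/olive_training_dataset.parquet",
)
```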
### Utility Functions

Common utility functions and helper methods are stored in ```/utils/helpers.py```.

## Model Artifacts

Each model directory contains its associated artifacts, including:

- Trained model weights
- Scalers for data normalization
- Training logs
- Model architecture visualizations
- Performance analysis plots

For example, the olive oil model directory contains (see the loading sketch below):

- Model weights in the `weights` subdirectory
- Scalers for static and temporal features
- Training logs in the `logs` subdirectory
- Model architecture and performance visualization plots
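
A hedged sketch of restoring these artifacts is shown below. The file names are assumptions (the real names live next to each notebook); Keras and joblib are assumed because the `.dvcignore` above keeps `*.keras`/`*.h5` files visible to DVC and `utils/helpers.py` already uses joblib.

```python
# Sketch only: restore a trained model and its scalers. File names are assumed.
import joblib
import tensorflow as tf

model = tf.keras.models.load_model("models/olive_oli/olive_oil_model.keras")  # assumed name
static_scaler = joblib.load("models/olive_oli/scaler_static.joblib")          # assumed name
temporal_scaler = joblib.load("models/olive_oli/scaler_temporal.joblib")      # assumed name

model.summary()
```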
## Getting Started

To work with the models:

1. Start with the respective notebook in the `/models` directory
2. For olive oil prediction, first generate synthetic data using the script in `/olive_oil_train_dataset`
3. Use the utility functions from `/utils/helpers.py` as needed (a short sketch follows below)
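
The sketch below illustrates step 3 for the olive oil model, assuming the synthetic dataset from step 2 already exists. It uses `encode_techniques` from `utils/helpers.py`, which relies on the technique mapping written to `./sources/` during dataset generation.

```python
# Sketch of step 3: load the generated dataset and encode cultivation techniques.
import pandas as pd
from utils.helpers import encode_techniques

df = pd.read_parquet("./sources/olive_training_dataset.parquet")  # produced in step 2
df = encode_techniques(df)  # uses ./sources/technique_mapping.joblib created during generation
print(df.filter(like="_tech").head())
```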

Several binary files and oversized diffs in this commit are not rendered below.

@@ -0,0 +1,2 @@
model_checkpoint_path: "weights"
all_model_checkpoint_paths: "weights"
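
This two-line file is TensorFlow's checkpoint index: it records that the checkpoint in this directory uses the `weights` prefix. A hedged example of resolving it follows; the directory name is an assumption.

```python
# Sketch only: resolve the checkpoint prefix recorded in the index file above.
import tensorflow as tf

ckpt_dir = "models/olive_oli/weights"          # assumed checkpoint directory
prefix = tf.train.latest_checkpoint(ckpt_dir)  # resolves to ".../weights"
print(prefix)
```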

@@ -0,0 +1,2 @@
model_checkpoint_path: "weights"
all_model_checkpoint_paths: "weights"

@@ -0,0 +1 @@
["uvindex", "cloudcover", "visibility", "temp", "pressure", "humidity", "solarradiation", "solar_elevation", "solar_angle", "day_length", "hour_sin", "hour_cos", "day_of_year_sin", "day_of_year_cos", "month_sin", "month_cos", "solar_noon", "daylight_correction", "clear_sky_index", "atmospheric_attenuation", "theoretical_radiation", "expected_radiation", "cloud_elevation", "visibility_elevation", "uv_cloud_interaction", "temp_radiation_potential", "air_mass_index", "atmospheric_stability", "vapor_pressure_deficit", "diffusion_index", "atmospheric_transmittance", "temp_humidity_interaction", "clear_sky_factor", "cloud_rolling_12h", "temp_rolling_12h", "uv_rolling_12h", "cloudcover_rolling_mean_6h", "temp_rolling_mean_6h", "energy_rolling_mean_6h", "uv_rolling_mean_6h", "energy_volatility", "uv_volatility", "temp_1h_lag", "cloudcover_1h_lag", "humidity_1h_lag", "energy_lag_1h", "uv_lag_1h", "temp_losses", "soiling_loss_factor", "estimated_efficiency", "production_potential", "system_performance_ratio", "conversion_efficiency_ratio", "clear_sky_duration", "weather_variability_index", "temp_stability", "humidity_stability", "cloudcover_stability", "season_Spring", "season_Summer", "season_Autumn", "season_Winter", "time_period_Morning", "time_period_Afternoon", "time_period_Evening", "time_period_Night"]

@@ -0,0 +1 @@
["uvindex", "cloudcover", "visibility", "temp", "pressure", "humidity", "solar_elevation", "solar_angle", "day_length", "hour_sin", "hour_cos", "day_of_year_sin", "day_of_year_cos", "month_sin", "month_cos", "clear_sky_index", "atmospheric_attenuation", "theoretical_radiation", "expected_radiation", "cloud_elevation", "visibility_elevation", "uv_cloud_interaction", "temp_radiation_potential", "cloud_rolling_12h", "temp_rolling_12h", "uv_rolling_12h", "cloudcover_rolling_mean_6h", "temp_rolling_mean_6h", "temp_1h_lag", "cloudcover_1h_lag", "humidity_1h_lag", "uv_lag_1h", "season_Spring", "season_Summer", "season_Autumn", "season_Winter", "time_period_Morning", "time_period_Afternoon", "time_period_Evening", "time_period_Night"]

@@ -0,0 +1 @@
["uvindex", "cloudcover", "visibility", "temp", "pressure", "humidity", "solar_elevation", "solar_angle", "day_length", "hour_sin", "hour_cos", "day_of_year_sin", "day_of_year_cos", "month_sin", "month_cos", "clear_sky_index", "atmospheric_attenuation", "theoretical_radiation", "expected_radiation", "cloud_elevation", "visibility_elevation", "uv_cloud_interaction", "temp_radiation_potential", "cloud_rolling_12h", "temp_rolling_12h", "uv_rolling_12h", "cloudcover_rolling_mean_6h", "temp_rolling_mean_6h", "temp_1h_lag", "cloudcover_1h_lag", "humidity_1h_lag", "uv_lag_1h", "season_Spring", "season_Summer", "season_Autumn", "season_Winter", "time_period_Morning", "time_period_Afternoon", "time_period_Evening", "time_period_Night"]

@@ -0,0 +1 @@
["temp", "humidity", "cloudcover", "visibility", "clear_sky_index", "atmospheric_transparency", "hour_sin", "hour_cos", "day_of_year_sin", "day_of_year_cos", "solar_angle", "solar_elevation", "day_length", "solar_noon", "solar_cloud_effect", "cloud_temp_interaction", "visibility_cloud_interaction", "temp_humidity_interaction", "solar_clarity_index", "cloud_rolling_12h", "temp_rolling_mean_6h", "season_Autumn", "season_Spring", "season_Summer", "season_Unknown", "season_Winter", "time_period_Afternoon", "time_period_Evening", "time_period_Morning", "time_period_Night"]

@@ -0,0 +1,27 @@
{
"model_params": {
"input_shape": [
24,
30
],
"n_features": 30,
"sequence_length": 24
},
"training_params": {
"batch_size": 128,
"total_epochs": 100,
"best_epoch": 94
},
"performance_metrics": {
"final_loss": 0.06517542153596878,
"final_mae": 0.13557544350624084,
"best_val_loss": 0.06352950632572174,
"out_of_range_predictions": 0
},
"prediction_stats": {
"n_predictions_added": 227879,
"mean_predicted_uv": 0.585634171962738,
"min_predicted_uv": 0.0032156717497855425,
"max_predicted_uv": 3.182884931564331
}
}
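
The metadata above pins the model's input window: `sequence_length` of 24 steps by `n_features` of 30, i.e. `input_shape` [24, 30]. A small sketch of checking a window against it is shown below; the metadata file name is an assumption.

```python
# Sketch only: build a dummy inference window matching the recorded input shape.
import json

import numpy as np

with open("model_metadata.json") as f:  # assumed file name for the JSON above
    meta = json.load(f)

seq_len = meta["model_params"]["sequence_length"]  # 24
n_feat = meta["model_params"]["n_features"]        # 30

window = np.zeros((1, seq_len, n_feat), dtype="float32")
assert list(window.shape[1:]) == meta["model_params"]["input_shape"]
```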

olive_oil_train_dataset/create_train_dataset.py (new file, 516 lines)

@@ -0,0 +1,516 @@
import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor, as_completed
import multiprocessing
import psutil
from tqdm import tqdm
import os
import argparse
import sys
import gc
from utils.helpers import clean_column_name, get_growth_phase, calculate_weather_effect, calculate_water_need, \
create_technique_mapping, preprocess_weather_data
def get_optimal_workers():
"""Calcola il numero ottimale di workers basato sulle risorse del sistema"""
cpu_count = multiprocessing.cpu_count()
memory = psutil.virtual_memory()
available_memory_gb = memory.available / (1024 ** 3)
memory_per_worker_gb = 2
max_workers_by_memory = int(available_memory_gb / memory_per_worker_gb)
optimal_workers = min(
cpu_count - 1,
max_workers_by_memory,
32
)
print(f'CPU count : {cpu_count} - Memory : {memory} = Max Worker by memory : {max_workers_by_memory}')
return max(1, optimal_workers)
def simulate_zone(base_weather, olive_varieties, year, zone, all_varieties, variety_techniques):
"""
Simulate olive production for a single zone.
Args:
base_weather: DataFrame with base weather data for the selected year
olive_varieties: DataFrame with information on the olive varieties
year: year selected for the simulation
zone: zone ID
all_varieties: array with all available varieties
variety_techniques: dict with the cultivation techniques available for each variety
Returns:
Dict with the simulation results for the zone
"""
# Crea una copia dei dati meteo per questa zona specifica
zone_weather = base_weather.copy()
# Genera variazioni meteorologiche specifiche per questa zona
zone_weather['temp_mean'] *= np.random.uniform(0.95, 1.05, len(zone_weather))
zone_weather['precip_sum'] *= np.random.uniform(0.9, 1.1, len(zone_weather))
zone_weather['solarenergy_sum'] *= np.random.uniform(0.95, 1.05, len(zone_weather))
# Genera caratteristiche specifiche della zona
num_varieties = np.random.randint(1, 4) # 1-3 varietà per zona
selected_varieties = np.random.choice(all_varieties, size=num_varieties, replace=False)
hectares = np.random.uniform(1, 10) # Dimensione del terreno
percentages = np.random.dirichlet(np.ones(num_varieties)) # Distribuzione delle varietà
# Inizializzazione contatori annuali
annual_production = 0
annual_min_oil = 0
annual_max_oil = 0
annual_avg_oil = 0
annual_water_need = 0
# Inizializzazione dizionario dati varietà
variety_data = {clean_column_name(variety): {
'tech': '',
'pct': 0,
'prod_t_ha': 0,
'oil_prod_t_ha': 0,
'oil_prod_l_ha': 0,
'min_yield_pct': 0,
'max_yield_pct': 0,
'min_oil_prod_l_ha': 0,
'max_oil_prod_l_ha': 0,
'avg_oil_prod_l_ha': 0,
'l_per_t': 0,
'min_l_per_t': 0,
'max_l_per_t': 0,
'avg_l_per_t': 0,
'olive_prod': 0,
'min_oil_prod': 0,
'max_oil_prod': 0,
'avg_oil_prod': 0,
'water_need': 0
} for variety in all_varieties}
# Simula produzione per ogni varietà selezionata
for i, variety in enumerate(selected_varieties):
# Seleziona tecnica di coltivazione casuale per questa varietà
technique = np.random.choice(variety_techniques[variety])
percentage = percentages[i]
# Ottieni informazioni specifiche della varietà
variety_info = olive_varieties[
(olive_varieties['Varietà di Olive'] == variety) &
(olive_varieties['Tecnica di Coltivazione'] == technique)
].iloc[0]
# Calcola produzione base con variabilità
base_production = variety_info['Produzione (tonnellate/ettaro)'] * 1000 * percentage * hectares / 12
base_production *= np.random.uniform(0.9, 1.1)
# Calcola effetti meteo sulla produzione
weather_effect = zone_weather.apply(
lambda row: calculate_weather_effect(row, variety_info['Temperatura Ottimale']),
axis=1
)
monthly_production = base_production * (1 + weather_effect / 10000)
monthly_production *= np.random.uniform(0.95, 1.05, len(zone_weather))
# Calcola produzione annuale per questa varietà
annual_variety_production = monthly_production.sum()
# Calcola rese di olio con variabilità
min_yield_factor = np.random.uniform(0.95, 1.05)
max_yield_factor = np.random.uniform(0.95, 1.05)
avg_yield_factor = (min_yield_factor + max_yield_factor) / 2
min_oil_production = annual_variety_production * variety_info[
'Min Litri per Tonnellata'] / 1000 * min_yield_factor
max_oil_production = annual_variety_production * variety_info[
'Max Litri per Tonnellata'] / 1000 * max_yield_factor
avg_oil_production = annual_variety_production * variety_info[
'Media Litri per Tonnellata'] / 1000 * avg_yield_factor
# Calcola fabbisogno idrico
base_water_need = (
variety_info['Fabbisogno Acqua Primavera (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Estate (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Autunno (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Inverno (m³/ettaro)']
) / 4
monthly_water_need = zone_weather.apply(
lambda row: calculate_water_need(row, base_water_need, variety_info['Temperatura Ottimale']),
axis=1
)
monthly_water_need *= np.random.uniform(0.95, 1.05, len(monthly_water_need))
annual_variety_water_need = monthly_water_need.sum() * percentage * hectares
# Aggiorna totali annuali
annual_production += annual_variety_production
annual_min_oil += min_oil_production
annual_max_oil += max_oil_production
annual_avg_oil += avg_oil_production
annual_water_need += annual_variety_water_need
# Aggiorna dati varietà
clean_variety = clean_column_name(variety)
variety_data[clean_variety].update({
'tech': clean_column_name(technique),
'pct': percentage,
'prod_t_ha': variety_info['Produzione (tonnellate/ettaro)'] * np.random.uniform(0.95, 1.05),
'oil_prod_t_ha': variety_info['Produzione Olio (tonnellate/ettaro)'] * np.random.uniform(0.95, 1.05),
'oil_prod_l_ha': variety_info['Produzione Olio (litri/ettaro)'] * np.random.uniform(0.95, 1.05),
'min_yield_pct': variety_info['Min % Resa'] * min_yield_factor,
'max_yield_pct': variety_info['Max % Resa'] * max_yield_factor,
'min_oil_prod_l_ha': variety_info['Min Produzione Olio (litri/ettaro)'] * min_yield_factor,
'max_oil_prod_l_ha': variety_info['Max Produzione Olio (litri/ettaro)'] * max_yield_factor,
'avg_oil_prod_l_ha': variety_info['Media Produzione Olio (litri/ettaro)'] * avg_yield_factor,
'l_per_t': variety_info['Litri per Tonnellata'] * np.random.uniform(0.98, 1.02),
'min_l_per_t': variety_info['Min Litri per Tonnellata'] * min_yield_factor,
'max_l_per_t': variety_info['Max Litri per Tonnellata'] * max_yield_factor,
'avg_l_per_t': variety_info['Media Litri per Tonnellata'] * avg_yield_factor,
'olive_prod': annual_variety_production,
'min_oil_prod': min_oil_production,
'max_oil_prod': max_oil_production,
'avg_oil_prod': avg_oil_production,
'water_need': annual_variety_water_need
})
# Appiattisci i dati delle varietà
flattened_variety_data = {
f'{variety}_{key}': value
for variety, data in variety_data.items()
for key, value in data.items()
}
# Restituisci il risultato della zona
return {
'year': year,
'zone_id': zone + 1,
'temp_mean': zone_weather['temp_mean'].mean(),
'precip_sum': zone_weather['precip_sum'].sum(),
'solar_energy_sum': zone_weather['solarenergy_sum'].sum(),
'ha': hectares,
'zone': f"zone_{zone + 1}",
'olive_prod': annual_production,
'min_oil_prod': annual_min_oil,
'max_oil_prod': annual_max_oil,
'avg_oil_prod': annual_avg_oil,
'total_water_need': annual_water_need,
**flattened_variety_data
}
def simulate_olive_production_parallel(weather_data, olive_varieties, num_simulations=5, num_zones=None,
random_seed=None,
max_workers=None, batch_size=500,
output_path='olive_simulation_dataset.parquet'):
"""
Parallelized simulation (corrected version) with batch handling and output-file saving.
Args:
weather_data: DataFrame with weather data
olive_varieties: DataFrame with olive varieties
num_simulations: number of simulations to run (default: 5)
num_zones: number of zones per simulation (default: None, uses num_simulations if not specified)
random_seed: seed for reproducibility (default: None)
max_workers: maximum number of workers (default: None, uses get_optimal_workers)
batch_size: batch size used to limit memory usage (default: 500)
output_path: output file path (default: 'olive_simulation_dataset.parquet')
Returns:
DataFrame with the simulation results
"""
if random_seed is not None:
np.random.seed(random_seed)
# Se num_zones non è specificato, usa num_simulations
if num_zones is None:
num_zones = num_simulations
# Preparazione dati
create_technique_mapping(olive_varieties)
monthly_weather = preprocess_weather_data(weather_data)
all_varieties = olive_varieties['Varietà di Olive'].unique()
variety_techniques = {
variety: olive_varieties[olive_varieties['Varietà di Olive'] == variety]['Tecnica di Coltivazione'].unique()
for variety in all_varieties
}
# Calcolo workers ottimali usando get_optimal_workers
if max_workers is None:
max_workers = get_optimal_workers()
print(f"Utilizzando {max_workers} workers ottimali basati sulle risorse del sistema")
# Calcolo numero di batch
num_batches = (num_simulations + batch_size - 1) // batch_size
print(f"Elaborazione di {num_simulations} simulazioni con {num_zones} zone in {num_batches} batch")
print(f"Totale record attesi: {num_simulations * num_zones:,}")
# Lista per contenere tutti i DataFrame dei batch
all_batches = []
# Elaborazione per batch
for batch_num in range(num_batches):
start_sim = batch_num * batch_size
end_sim = min((batch_num + 1) * batch_size, num_simulations)
current_batch_size = end_sim - start_sim
batch_results = []
# Parallelizzazione usando ProcessPoolExecutor per il batch corrente
with ProcessPoolExecutor(max_workers=max_workers) as executor:
# Calcola il numero totale di task per questo batch
# Ogni simulazione nel batch corrente genererà num_zones zone
total_tasks = current_batch_size * num_zones
with tqdm(total=total_tasks,
desc=f"Batch {batch_num + 1}/{num_batches}") as pbar:
# Dizionario per tenere traccia delle futures e dei loro sim_id
future_to_sim_id = {}
# Sottometti i lavori per tutte le simulazioni e zone nel batch corrente
for sim in range(start_sim, end_sim):
selected_year = np.random.choice(monthly_weather['year'].unique())
base_weather = monthly_weather[monthly_weather['year'] == selected_year].copy()
base_weather.loc[:, 'growth_phase'] = base_weather['month'].apply(get_growth_phase)
# Sottometti i lavori per tutte le zone di questa simulazione
for zone in range(num_zones):
future = executor.submit(
simulate_zone,
base_weather=base_weather,
olive_varieties=olive_varieties,
year=selected_year,
zone=zone,
all_varieties=all_varieties,
variety_techniques=variety_techniques
)
future_to_sim_id[future] = (sim + 1, zone + 1)
# Raccogli i risultati man mano che vengono completati
for future in as_completed(future_to_sim_id.keys()):
sim_id, zone_id = future_to_sim_id[future]
try:
result = future.result()
result['simulation_id'] = sim_id
result['zone_id'] = zone_id
batch_results.append(result)
pbar.update(1)
except Exception as e:
print(f"Errore nella simulazione {sim_id}, zona {zone_id}: {str(e)}")
continue
# Converti batch_results in DataFrame e aggiungi alla lista dei batch
batch_df = pd.DataFrame(batch_results)
all_batches.append(batch_df)
# Stampa statistiche del batch
print(f"\nStatistiche Batch {batch_num + 1}:")
print(f"Righe processate: {len(batch_df):,}")
print(f"Memoria utilizzata: {batch_df.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")
# Libera memoria
del batch_results
del batch_df
gc.collect() # Forza garbage collection
# Concatena tutti i batch e salva
print("\nConcatenazione dei batch e salvataggio...")
final_df = pd.concat(all_batches, ignore_index=True)
# Crea directory output se necessario
os.makedirs(os.path.dirname(output_path) if os.path.dirname(output_path) else '.', exist_ok=True)
# Salva il dataset
final_df.to_parquet(output_path)
# Stampa statistiche finali
print("\nStatistiche Finali:")
print(f"Totale simulazioni completate: {len(final_df):,}")
print(f"Memoria totale utilizzata: {final_df.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")
print(f"\nDataset salvato in: {output_path}")
return final_df
def calculate_production(variety_info, weather, percentage, hectares, seed):
"""Calcola produzione e parametri correlati per una varietà"""
np.random.seed(seed)
base_production = variety_info['Produzione (tonnellate/ettaro)'] * percentage * hectares
base_production *= np.random.uniform(0.8, 1.2)
# Effetti ambientali
temp_effect = calculate_temperature_effect(
weather['temp_mean'],
variety_info['Temperatura Ottimale']
)
water_effect = calculate_water_effect(
weather['precip_sum'],
variety_info['Resistenza alla Siccità']
)
solar_effect = calculate_solar_effect(
weather['solarradiation_mean']
)
actual_production = base_production * temp_effect * water_effect * solar_effect
# Calcolo olio
oil_yield = np.random.uniform(
variety_info['Min % Resa'],
variety_info['Max % Resa']
)
oil_production = actual_production * oil_yield
# Calcolo acqua
base_water_need = (
variety_info['Fabbisogno Acqua Primavera (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Estate (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Autunno (m³/ettaro)'] +
variety_info['Fabbisogno Acqua Inverno (m³/ettaro)']
) / 4 * percentage * hectares
water_need = (
base_water_need *
(1 + max(0, (weather['temp_mean'] - 20) / 50)) *
max(0.6, 1 - (weather['precip_sum'] / 1000))
)
return {
'variety': variety_info['Varietà di Olive'],
'technique': variety_info['Tecnica di Coltivazione'],
'percentage': percentage,
'production': actual_production,
'oil_production': oil_production,
'water_need': water_need,
'temp_effect': temp_effect,
'water_effect': water_effect,
'solar_effect': solar_effect,
'yield': oil_yield
}
# Funzioni di effetto ambientale rimangono invariate
def calculate_temperature_effect(temp, optimal_temp):
temp_diff = abs(temp - optimal_temp)
if temp_diff <= 5:
return np.random.uniform(0.95, 1.0)
elif temp_diff <= 10:
return np.random.uniform(0.8, 0.9)
else:
return np.random.uniform(0.6, 0.8)
def calculate_water_effect(precip, drought_resistance):
if 'alta' in str(drought_resistance).lower():
min_precip = 20
elif 'media' in str(drought_resistance).lower():
min_precip = 30
else:
min_precip = 40
if precip >= min_precip:
return np.random.uniform(0.95, 1.0)
else:
base_factor = max(0.6, precip / min_precip)
return base_factor * np.random.uniform(0.8, 1.2)
def calculate_solar_effect(radiation):
if radiation >= 200:
return np.random.uniform(0.95, 1.0)
else:
base_factor = max(0.7, radiation / 200)
return base_factor * np.random.uniform(0.8, 1.2)
def parse_arguments():
"""
Configure and parse the command-line arguments.
"""
parser = argparse.ArgumentParser(
description='Generatore dataset di training per produzione olive',
formatter_class=argparse.ArgumentDefaultsHelpFormatter # Mostra i valori default nell'help
)
parser.add_argument(
'--random-seed',
type=int,
default=None,
help='Seed per la riproducibilità dei risultati'
)
parser.add_argument(
'--num-simulations',
type=int,
default=100000,
help='Numero totale di simulazioni da eseguire'
)
parser.add_argument(
'--num-zones',
type=int,
default=None,
help='Numero di zone per simulazione (default: uguale a num-simulations)'
)
parser.add_argument(
'--batch-size',
type=int,
default=10000,
help='Dimensione di ogni batch di simulazioni'
)
parser.add_argument(
'--output-path',
type=str,
default='./sources/olive_training_dataset.parquet',
help='Percorso del file di output'
)
parser.add_argument(
'--max-workers',
type=int,
default=None,
help='Quantità di workers (default: usa get_optimal_workers)'
)
return parser.parse_args()
if __name__ == "__main__":
print("Generazione dataset di training...")
# Parsing argomenti
args = parse_arguments()
# Carica dati
try:
weather_data = pd.read_parquet('./sources/weather_data_solarenergy.parquet')
olive_varieties = pd.read_parquet('./sources/olive_varieties.parquet')
except Exception as e:
print(f"Errore nel caricamento dei dati: {str(e)}")
sys.exit(1)
# Stampa configurazione
print("\nConfigurazione:")
print(f"Random seed: {args.random_seed}")
print(f"Numero simulazioni: {args.num_simulations:,}")
print(f"Numero zone per simulazione: {args.num_zones if args.num_zones is not None else args.num_simulations:,}")
print(f"Workers: {args.max_workers if args.max_workers is not None else 'auto'}")
print(f"Dimensione batch: {args.batch_size:,}")
print(f"File output: {args.output_path}")
# Genera dataset
try:
df = simulate_olive_production_parallel(
weather_data=weather_data,
olive_varieties=olive_varieties,
num_simulations=args.num_simulations,
num_zones=args.num_zones,
random_seed=args.random_seed,
batch_size=args.batch_size,
output_path=args.output_path,
max_workers=args.max_workers
)
except Exception as e:
print(f"Errore durante la generazione del dataset: {str(e)}")
sys.exit(1)

utils/__init__.py (new empty executable file)

utils/helpers.py (new executable file, 504 lines)

@@ -0,0 +1,504 @@
import psutil
import multiprocessing
import re
import pandas as pd
import numpy as np
from typing import List, Dict
import os
import joblib
def get_optimal_workers() -> int:
"""
Compute the optimal number of workers based on the available system resources.
Returns
-------
int
Optimal number of workers
"""
# Ottiene il numero di CPU logiche (inclusi i thread virtuali)
cpu_count = multiprocessing.cpu_count()
# Ottiene la memoria totale e disponibile in GB
memory = psutil.virtual_memory()
total_memory_gb = memory.total / (1024 ** 3)
available_memory_gb = memory.available / (1024 ** 3)
# Stima della memoria necessaria per worker (esempio: 2GB per worker)
memory_per_worker_gb = 2
# Calcola il numero massimo di workers basato sulla memoria disponibile
max_workers_by_memory = int(available_memory_gb / memory_per_worker_gb)
# Usa il minimo tra:
# - numero di CPU disponibili - 1 (lascia una CPU libera per il sistema)
# - numero massimo di workers basato sulla memoria
# - un limite massimo arbitrario (es. 32) per evitare troppo overhead
optimal_workers = min(
cpu_count - 1,
max_workers_by_memory,
32 # limite massimo arbitrario
)
# Assicura almeno 1 worker
return max(1, optimal_workers)
def clean_column_name(name: str) -> str:
"""
Remove special characters and spaces, convert to snake_case, and abbreviate common terms.
Parameters
----------
name : str
Column name to clean
Returns
-------
str
Cleaned column name
"""
# Rimuove caratteri speciali
name = re.sub(r'[^a-zA-Z0-9\s]', '', name)
# Converte in snake_case
name = name.lower().replace(' ', '_')
# Abbreviazioni comuni
abbreviations = {
'production': 'prod',
'percentage': 'pct',
'hectare': 'ha',
'tonnes': 't',
'litres': 'l',
'minimum': 'min',
'maximum': 'max',
'average': 'avg'
}
for full, abbr in abbreviations.items():
name = name.replace(full, abbr)
return name
def clean_column_names(df: pd.DataFrame) -> List[str]:
"""
Pulisce tutti i nomi delle colonne in un DataFrame.
Parameters
----------
df : pd.DataFrame
DataFrame con le colonne da pulire
Returns
-------
list
Lista dei nuovi nomi delle colonne puliti
"""
new_columns = []
for col in df.columns:
# Usa regex per separare le varietà
varieties = re.findall(r'([a-z]+)_([a-z_]+)', col)
if varieties:
new_columns.append(f"{varieties[0][0]}_{varieties[0][1]}")
else:
new_columns.append(col)
return new_columns
def to_camel_case(text: str) -> str:
"""
Converte una stringa in camelCase.
Gestisce stringhe con spazi, trattini o underscore.
Se è una sola parola, la restituisce in minuscolo.
Parameters
----------
text : str
Testo da convertire
Returns
-------
str
Testo convertito in camelCase
"""
# Rimuove eventuali spazi iniziali e finali
text = text.strip()
# Se la stringa è vuota, ritorna stringa vuota
if not text:
return ""
# Sostituisce trattini e underscore con spazi
text = text.replace('-', ' ').replace('_', ' ')
# Divide la stringa in parole
words = text.split()
# Se non ci sono parole dopo lo split, ritorna stringa vuota
if not words:
return ""
# Se c'è una sola parola, ritorna in minuscolo
if len(words) == 1:
return words[0].lower()
# Altrimenti procedi con il camelCase
result = words[0].lower()
for word in words[1:]:
result += word.capitalize()
return result
def get_full_data(simulated_data: pd.DataFrame,
olive_varieties: pd.DataFrame) -> pd.DataFrame:
"""
Ottiene il dataset completo combinando dati simulati e varietà di olive.
Parameters
----------
simulated_data : pd.DataFrame
DataFrame con i dati simulati
olive_varieties : pd.DataFrame
DataFrame con le informazioni sulle varietà
Returns
-------
pd.DataFrame
DataFrame completo con tutte le informazioni
"""
# Colonne base rilevanti
relevant_columns = [
'year', 'temp_mean', 'precip_sum', 'solar_energy_sum',
'ha', 'zone', 'olive_prod'
]
# Aggiungi colonne specifiche per varietà
all_varieties = olive_varieties['Varietà di Olive'].unique()
varieties = [clean_column_name(variety) for variety in all_varieties]
for variety in varieties:
relevant_columns.extend([
f'{variety}_olive_prod',
f'{variety}_tech'
])
# Seleziona solo le colonne rilevanti
full_data = simulated_data[relevant_columns].copy()
# Aggiungi feature calcolate
for variety in varieties:
# Calcola efficienza produttiva
if f'{variety}_olive_prod' in full_data.columns:
full_data[f'{variety}_efficiency'] = (
full_data[f'{variety}_olive_prod'] / full_data['ha']
)
# Aggiungi indicatori tecnici
if f'{variety}_tech' in full_data.columns:
technique_dummies = pd.get_dummies(
full_data[f'{variety}_tech'],
prefix=f'{variety}_technique'
)
full_data = pd.concat([full_data, technique_dummies], axis=1)
# Aggiungi feature temporali
full_data['month'] = 1 # Assumiamo dati annuali
full_data['day'] = 1 # Assumiamo dati annuali
# Calcola medie mobili
for col in ['temp_mean', 'precip_sum', 'solar_energy_sum']:
full_data[f'{col}_ma3'] = full_data[col].rolling(window=3, min_periods=1).mean()
full_data[f'{col}_ma5'] = full_data[col].rolling(window=5, min_periods=1).mean()
return full_data
def prepare_static_features_multiple(varieties_info: List[Dict],
percentages: List[float],
hectares: float,
all_varieties: List[str]) -> np.ndarray:
"""
Prepara le feature statiche per multiple varietà.
Parameters
----------
varieties_info : List[Dict]
Lista di dizionari contenenti le informazioni sulle varietà selezionate
percentages : List[float]
Lista delle percentuali corrispondenti a ciascuna varietà selezionata
hectares : float
Numero di ettari totali
all_varieties : List[str]
Lista di tutte le possibili varietà nel dataset originale
Returns
-------
np.ndarray
Array numpy contenente tutte le feature statiche
"""
# Inizializza un dizionario per tutte le varietà possibili
variety_data = {variety.lower(): {
'pct': 0,
'prod_t_ha': 0,
'tech': '',
'oil_prod_t_ha': 0,
'oil_prod_l_ha': 0,
'min_yield_pct': 0,
'max_yield_pct': 0,
'min_oil_prod_l_ha': 0,
'max_oil_prod_l_ha': 0,
'avg_oil_prod_l_ha': 0,
'l_per_t': 0,
'min_l_per_t': 0,
'max_l_per_t': 0,
'avg_l_per_t': 0,
'water_need_spring': 0,
'water_need_summer': 0,
'water_need_autumn': 0,
'water_need_winter': 0,
'annual_water_need': 0,
'optimal_temp': 0,
'drought_resistance': 0
} for variety in all_varieties}
# Aggiorna i dati per le varietà selezionate
for variety_info, percentage in zip(varieties_info, percentages):
variety_name = clean_column_name(variety_info['variet_di_olive']).lower()
technique = clean_column_name(variety_info['tecnica_di_coltivazione']).lower()
if variety_name not in variety_data:
print(f"Attenzione: La varietà '{variety_name}' non è presente nella lista delle varietà conosciute.")
continue
variety_data[variety_name].update({
'pct': percentage / 100,
'prod_t_ha': variety_info['produzione_tonnellateettaro'],
'tech': technique,
'oil_prod_t_ha': variety_info['produzione_olio_tonnellateettaro'],
'oil_prod_l_ha': variety_info['produzione_olio_litriettaro'],
'min_yield_pct': variety_info['min__resa'],
'max_yield_pct': variety_info['max__resa'],
'min_oil_prod_l_ha': variety_info['min_produzione_olio_litriettaro'],
'max_oil_prod_l_ha': variety_info['max_produzione_olio_litriettaro'],
'avg_oil_prod_l_ha': variety_info['media_produzione_olio_litriettaro'],
'l_per_t': variety_info['litri_per_tonnellata'],
'min_l_per_t': variety_info['min_litri_per_tonnellata'],
'max_l_per_t': variety_info['max_litri_per_tonnellata'],
'avg_l_per_t': variety_info['media_litri_per_tonnellata'],
'water_need_spring': variety_info['fabbisogno_acqua_primavera_mettaro'],
'water_need_summer': variety_info['fabbisogno_acqua_estate_mettaro'],
'water_need_autumn': variety_info['fabbisogno_acqua_autunno_mettaro'],
'water_need_winter': variety_info['fabbisogno_acqua_inverno_mettaro'],
'annual_water_need': variety_info['fabbisogno_idrico_annuale_mettaro'],
'optimal_temp': variety_info['temperatura_ottimale'],
'drought_resistance': variety_info['resistenza_alla_siccit']
})
# Crea il vettore delle feature
static_features = [hectares]
# Lista delle feature per ogni varietà
variety_features = ['pct', 'prod_t_ha', 'oil_prod_t_ha', 'oil_prod_l_ha',
'min_yield_pct', 'max_yield_pct', 'min_oil_prod_l_ha',
'max_oil_prod_l_ha', 'avg_oil_prod_l_ha', 'l_per_t',
'min_l_per_t', 'max_l_per_t', 'avg_l_per_t',
'water_need_spring', 'water_need_summer', 'water_need_autumn',
'water_need_winter', 'annual_water_need', 'optimal_temp',
'drought_resistance']
# Appiattisci i dati delle varietà
for variety in all_varieties:
variety_lower = variety.lower()
# Feature esistenti
for feature in variety_features:
static_features.append(variety_data[variety_lower][feature])
# Feature binarie per le tecniche
for technique in ['tradizionale', 'intensiva', 'superintensiva']:
static_features.append(1 if variety_data[variety_lower]['tech'] == technique else 0)
return np.array(static_features).reshape(1, -1)
def get_feature_names(all_varieties: List[str]) -> List[str]:
"""
Genera i nomi delle feature nell'ordine corretto.
Parameters
----------
all_varieties : List[str]
Lista di tutte le varietà possibili
Returns
-------
List[str]
Lista dei nomi delle feature
"""
feature_names = ['hectares']
variety_features = ['pct', 'prod_t_ha', 'oil_prod_t_ha', 'oil_prod_l_ha',
'min_yield_pct', 'max_yield_pct', 'min_oil_prod_l_ha',
'max_oil_prod_l_ha', 'avg_oil_prod_l_ha', 'l_per_t',
'min_l_per_t', 'max_l_per_t', 'avg_l_per_t']
techniques = ['tradizionale', 'intensiva', 'superintensiva']
for variety in all_varieties:
for feature in variety_features:
feature_names.append(f"{variety}_{feature}")
for technique in techniques:
feature_names.append(f"{variety}_tech_{technique}")
return feature_names
def add_controlled_variation(base_value: float, max_variation_pct: float = 0.20) -> float:
"""
Aggiunge una variazione controllata a un valore base.
Parameters
----------
base_value : float
Valore base da modificare
max_variation_pct : float
Percentuale massima di variazione (default 20%)
Returns
-------
float
Valore con variazione applicata
"""
variation = np.random.uniform(-max_variation_pct, max_variation_pct)
return base_value * (1 + variation)
def get_growth_phase(month):
if month in [12, 1, 2]:
return 'dormancy'
elif month in [3, 4, 5]:
return 'flowering'
elif month in [6, 7, 8]:
return 'fruit_set'
else:
return 'ripening'
def calculate_weather_effect(row, optimal_temp):
# Effetti base
temp_effect = -0.1 * (row['temp_mean'] - optimal_temp) ** 2
rain_effect = -0.05 * (row['precip_sum'] - 600) ** 2 / 10000
sun_effect = 0.1 * row['solarenergy_sum'] / 1000
# Fattori di scala basati sulla fase di crescita
if row['growth_phase'] == 'dormancy':
temp_scale = 0.5
rain_scale = 0.2
sun_scale = 0.1
elif row['growth_phase'] == 'flowering':
temp_scale = 2.0
rain_scale = 1.5
sun_scale = 1.0
elif row['growth_phase'] == 'fruit_set':
temp_scale = 1.5
rain_scale = 1.0
sun_scale = 0.8
else: # ripening
temp_scale = 1.0
rain_scale = 0.5
sun_scale = 1.2
# Calcolo dell'effetto combinato
combined_effect = (
temp_scale * temp_effect +
rain_scale * rain_effect +
sun_scale * sun_effect
)
# Aggiustamenti specifici per fase
if row['growth_phase'] == 'flowering':
combined_effect -= 0.5 * max(0, row['precip_sum'] - 50) # Penalità per pioggia eccessiva durante la fioritura
elif row['growth_phase'] == 'fruit_set':
combined_effect += 0.3 * max(0, row['temp_mean'] - (optimal_temp + 5)) # Bonus per temperature più alte durante la formazione dei frutti
return combined_effect
def calculate_water_need(weather_data, base_need, optimal_temp):
# Calcola il fabbisogno idrico basato su temperatura e precipitazioni
temp_factor = 1 + 0.05 * (weather_data['temp_mean'] - optimal_temp) # Aumenta del 5% per ogni grado sopra l'ottimale
rain_factor = 1 - 0.001 * weather_data['precip_sum'] # Diminuisce leggermente con l'aumentare delle precipitazioni
return base_need * temp_factor * rain_factor
def create_technique_mapping(olive_varieties, mapping_path='./sources/technique_mapping.joblib'):
# Estrai tutte le tecniche uniche dal dataset e convertile in lowercase
all_techniques = olive_varieties['Tecnica di Coltivazione'].str.lower().unique()
# Crea il mapping partendo da 1
technique_mapping = {tech: i + 1 for i, tech in enumerate(sorted(all_techniques))}
# Salva il mapping
os.makedirs(os.path.dirname(mapping_path), exist_ok=True)
joblib.dump(technique_mapping, mapping_path)
return technique_mapping
def encode_techniques(df, mapping_path='./sources/technique_mapping.joblib'):
if not os.path.exists(mapping_path):
raise FileNotFoundError(f"Mapping not found at {mapping_path}. Run create_technique_mapping first.")
technique_mapping = joblib.load(mapping_path)
# Trova tutte le colonne delle tecniche
tech_columns = [col for col in df.columns if col.endswith('_tech')]
# Applica il mapping a tutte le colonne delle tecniche
for col in tech_columns:
df[col] = df[col].str.lower().map(technique_mapping).fillna(0).astype(int)
return df
def decode_techniques(df, mapping_path='./sources/technique_mapping.joblib'):
if not os.path.exists(mapping_path):
raise FileNotFoundError(f"Mapping not found at {mapping_path}")
technique_mapping = joblib.load(mapping_path)
reverse_mapping = {v: k for k, v in technique_mapping.items()}
reverse_mapping[0] = '' # Aggiungi un mapping per 0 a stringa vuota
# Trova tutte le colonne delle tecniche
tech_columns = [col for col in df.columns if col.endswith('_tech')]
# Applica il reverse mapping a tutte le colonne delle tecniche
for col in tech_columns:
df[col] = df[col].map(reverse_mapping)
return df
def decode_single_technique(technique_value, mapping_path='./sources/technique_mapping.joblib'):
if not os.path.exists(mapping_path):
raise FileNotFoundError(f"Mapping not found at {mapping_path}")
technique_mapping = joblib.load(mapping_path)
reverse_mapping = {v: k for k, v in technique_mapping.items()}
reverse_mapping[0] = ''
return reverse_mapping.get(technique_value, '')
def preprocess_weather_data(weather_df):
# Calcola statistiche mensili per ogni anno
monthly_weather = weather_df.groupby(['year', 'month']).agg({
'temp': ['mean', 'min', 'max'],
'humidity': 'mean',
'precip': 'sum',
'windspeed': 'mean',
'cloudcover': 'mean',
'solarradiation': 'sum',
'solarenergy': 'sum',
'uvindex': 'max'
}).reset_index()
monthly_weather.columns = ['year', 'month'] + [f'{col[0]}_{col[1]}' for col in monthly_weather.columns[2:]]
return monthly_weather