Code for machine learning algorithms applied to Jungfraujoch datasets.
Time series clustering is an unsupervised method for grouping data points based on their similarity. The goal is to maximize data similarity within clusters and minimize it across clusters.
1.) To clone the repository, copy the URL and run in the command line:

git clone url
2.) The dataset is located on Merlin: '/data/project/general/aerosolretriev/Jungfraujoch_data/Instrument Data/'.
For the analysis we use the file:

'/data/project/general/aerosolretriev/Jungfraujoch_data/Instrument_Data/merged_data/aerosol_data_JFJ_2020.csv'

The data in this file are already merged; it contains 8784 rows × 325 columns.
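Loading could look like this (a sketch; the column names `DateTime`, `scattering`, and `absorption` are invented stand-ins for the real 325 columns, and a tiny inline CSV stands in for the file on Merlin):

```python
import io
import pandas as pd

# Stand-in for aerosol_data_JFJ_2020.csv; the real file is read the same way.
csv = io.StringIO(
    "DateTime,scattering,absorption\n"
    "2020-01-01 00:00,1.2,0.3\n"
    "2020-01-01 01:00,1.5,0.4\n"
)

# Parse the timestamp column as a datetime index.
df = pd.read_csv(csv, index_col="DateTime", parse_dates=True)
print(df.shape)  # the full 2020 file has shape (8784, 325)
```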
The diameters related to the size distribution are in the file:

'/data/project/general/aerosolretriev/Jungfraujoch_data/Instrument_Data/merged_data/midpoint_diameters_size_distr_JFJ_2020.csv'
3.) First, run the Jupyter notebook "Preprocess_Data_Set.ipynb".
The data are read in, and NaN and Inf values are replaced by 0 (check!).
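The replacement step might look like this (a sketch assuming a pandas dataframe; the column names are invented):

```python
import numpy as np
import pandas as pd

# Toy dataframe with the kinds of gaps found in the merged data.
df = pd.DataFrame({
    "scattering": [1.2, np.nan, 3.4],
    "absorption": [np.inf, 0.5, -np.inf],
})

# Map Inf/-Inf to NaN first, then fill every NaN with 0.
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(df)
```

Note that filling with 0 silently turns missing measurements into valid-looking zeros, which is why the "(check!)" above is worth taking seriously.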
The start and end dates of the Saharan dust events are entered there, taken from Table A1 of the paper: https://acp.copernicus.org/articles/21/18029/2021/. I rounded the event hours up (using e.g. 14:00 instead of 13:42) (check!).
Then I used these times to mark the dust events and added them to the pandas dataframe as the columns "sde_event" and "sde_event_nr".
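The labeling step can be sketched as follows (the event windows below are made up for illustration; the real start/end times come from Table A1):

```python
import pandas as pd

# Hypothetical Saharan dust event windows (start, end), already rounded to the hour.
events = [
    ("2020-02-05 14:00", "2020-02-07 02:00"),
    ("2020-03-15 06:00", "2020-03-16 20:00"),
]

# Hourly index for the leap year 2020: 8784 timestamps, as in the dataset.
df = pd.DataFrame(index=pd.date_range("2020-01-01", "2020-12-31 23:00", freq="h"))

# Flag each hour inside an event window and record the event number.
df["sde_event"] = False
df["sde_event_nr"] = 0
for nr, (start, end) in enumerate(events, start=1):
    mask = (df.index >= start) & (df.index <= end)
    df.loc[mask, "sde_event"] = True
    df.loc[mask, "sde_event_nr"] = nr

print(df["sde_event"].sum())
```

Here "sde_event" is a boolean flag and "sde_event_nr" numbers the events (0 outside any event), matching the column names mentioned above.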
The final dataframe including the dust events is stored at:

'/data/project/general/aerosolretriev/Jungfraujoch_data/data/aerosol_data.h5'