I have 2 separate DataFrames that contain pieces of information for around half a million images, summing up to 6+ GB. There are 4 .parquet files which I had to pd.concat() one by one into a new DataFrame imgs, containing the pixels of each 137*236 image (columns labelled 0 through 32331) plus the image's id column.
imgs >>
        image_id       0    1  ...  32330  32331
0       Train_50210  246  253  ...    251    250
1       Train_50211  250  245  ...    241    244
...             ...  ...  ...  ...    ...    ...
453651  Train_50210    0  253  ...    251    250
453652  Train_50211  250  245  ...    241    244
A csv file contains the images' labels: the values of the three different classes that each image belongs to. I imported the csv into a DataFrame train:
train >>
        image_id      class_1  class_2  class_3
0       Train_5            15        9        5
1       Train_1           159        0        0
...              ...      ...      ...      ...
453651  Train_342524        0       15       34
453652  Train_9534         18        0        7
The number of rows in train is equal to the number of rows in imgs: the Y-labels of the images are stored in train, and the corresponding pixel attributes are in imgs.
I tried merging the two DataFrames using pd.merge(imgs, train, on='image_id').drop(columns='image_id'), but it took a long time and my kernel died every time while processing the above 2 steps. Please suggest an alternate approach if there is one.
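One memory-saving alternative to pd.merge is to downcast the pixel columns to uint8 (pixel values fit in a byte, so this cuts memory roughly 8x versus the default int64) and then do an index-aligned join. A minimal sketch on tiny synthetic stand-in frames (the real code would build imgs from the parquet files as above; all data here is illustrative):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-ins for the real imgs/train frames.
imgs = pd.DataFrame({
    "image_id": ["Train_0", "Train_1"],
    "0": [246, 250],
    "1": [253, 245],
})
train = pd.DataFrame({
    "image_id": ["Train_0", "Train_1"],
    "class_1": [15, 159],
    "class_2": [9, 0],
    "class_3": [5, 0],
})

# Pixel values are 0-255, so uint8 is enough; this alone can shrink
# a 6+ GB frame dramatically before any merging happens.
pixel_cols = [c for c in imgs.columns if c != "image_id"]
imgs[pixel_cols] = imgs[pixel_cols].astype(np.uint8)

# Index-aligned join instead of pd.merge: image_id becomes the index,
# so there is no separate key column to drop() afterwards.
merged = imgs.set_index("image_id").join(train.set_index("image_id"))
print(merged.shape)  # (2, 5)
```

With the real data it may still be worth doing this one parquet file at a time and writing each joined piece back to disk rather than holding everything in RAM at once.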
Could somebody please tell me how to make a custom data generator, in Keras or any other library with fast processing, for:
1. producing batches of images
2. augmenting the images for robustness
Alternatively, could someone please tell me how to use ImageDataGenerator().flow() in my case?
This is what I would suggest: load the DataFrame piece by piece; do not load the entirety of it at once, since that can exceed your RAM, hence the dying kernel.
Then iterate through the DataFrame row by row, take the 32332 pixel columns, reshape them into a 137×236 image, and save it to disk under the appropriate name in the folder train_data/class_number/. You can then use Keras' ImageDataGenerator().flow_from_directory().
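The save-to-disk step above can be sketched as follows, here on a tiny synthetic chunk (2 random rows instead of ~450k real ones; the labels dict and file names are illustrative, with class_1 used as the class folder):

```python
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image

HEIGHT, WIDTH = 137, 236

# Synthetic stand-in for one loaded chunk of the pixel DataFrame.
rng = np.random.default_rng(0)
chunk = pd.DataFrame(
    rng.integers(0, 256, size=(2, HEIGHT * WIDTH), dtype=np.uint8)
)
chunk.insert(0, "image_id", ["Train_0", "Train_1"])
labels = {"Train_0": 15, "Train_1": 159}  # image_id -> class_1

out_root = Path("train_data")
for _, row in chunk.iterrows():
    # Drop the id, reshape the flat pixel row back into a 2-D image.
    img = row.drop("image_id").to_numpy(dtype=np.uint8).reshape(HEIGHT, WIDTH)
    class_dir = out_root / str(labels[row["image_id"]])
    class_dir.mkdir(parents=True, exist_ok=True)
    Image.fromarray(img).save(class_dir / f"{row['image_id']}.png")

# The resulting directory tree can then be fed to Keras, e.g.:
# gen = ImageDataGenerator(rescale=1/255, rotation_range=10)
# batches = gen.flow_from_directory("train_data", target_size=(137, 236),
#                                   color_mode="grayscale", batch_size=64)
```

Note that with three label columns per image, a single class-folder layout only captures one of them; flow_from_directory is a natural fit only if one class column (or a combined label) defines the folders.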
As a sanity check on the format: a single-channel 137×236 image has 137*236 = 32,332 pixels, which matches the 32332 pixel columns exactly, so the data is consistent with one grayscale image per row.
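If writing the images to disk is not an option, the custom generator the question asks about can also batch straight from the in-memory arrays. A minimal numpy-only sketch of the batching and augmentation logic (the function name and the random-shift augmentation are illustrative; in Keras proper you would typically wrap this in a keras.utils.Sequence):

```python
import numpy as np

HEIGHT, WIDTH = 137, 236

def batch_generator(pixels, labels, batch_size=64, augment=True, seed=0):
    """Yield (images, labels) batches forever from in-memory arrays.

    pixels: (n, 137*236) uint8 array; labels: (n,) array.
    The augmentation here is just a random horizontal pixel shift,
    a stand-in for whatever transforms robustness actually needs.
    """
    rng = np.random.default_rng(seed)
    n = len(pixels)
    while True:
        idx = rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            take = idx[start:start + batch_size]
            # Reshape flat rows to (batch, H, W, 1) and scale to [0, 1].
            x = pixels[take].reshape(-1, HEIGHT, WIDTH, 1)
            x = x.astype(np.float32) / 255.0
            if augment:
                shift = int(rng.integers(-5, 6))
                x = np.roll(x, shift, axis=2)  # shift along the width axis
            yield x, labels[take]

# Usage with toy data (10 blank images):
pix = np.zeros((10, HEIGHT * WIDTH), dtype=np.uint8)
lab = np.arange(10)
gen = batch_generator(pix, lab, batch_size=4)
xb, yb = next(gen)
print(xb.shape)  # (4, 137, 236, 1)
```

A generator of this shape can be passed directly to model.fit, with steps_per_epoch set to n // batch_size.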