The final dissertation project as part of the MSc Artificial Intelligence program started in September 2022, this is the third month, and I thought to start sharing publically.
The major topic of my research is the healthcare sector, specifically maternal, specifically Maternal & Infant Health. I have created a details project proposal and submitted it to the university.
Here you will get the technical aspects of the work I am doing to complete my research, as it involves artificial intelligence and Machine learning on a huge dataset. I am into an interesting challenge, which almost daily enables me to learn something new.
Today onward, I will update here the progress and will also try to update the key aspects of the last 3 months work. let’s get started
A few tweets will give you a glimpse of the work over the last few weeks.
November 23, 2022: It began with the writing of this blog based on the difficulties of “preprocessing” datasets, I decided to revisit the CSV files and shorten the columns to the required only columns. Earlier, I did all this work after uploading it to the autonomous database.
Challenge: CSV file has approximately 3.6 million records, and MS excel can only support 1,048,576 rows. found some ideas to split it, and here is the method I have used to split the CSV file into multiple files and then work on it.
moved it into the desired folder, where my dataset files are available, and used the terminal window to execute the below command.
split -l 1000000 2021natdata.csv
This has created the files based on the data, so I have 4 files with naming conventions like xaa, xab, xac, xad.
now, these files need to be converted to CSV format, so I have used this statement, and all this found on google research.
for i in *;
do mv "$i" "$i.csv";
done
It gives me all four files in CSV format, and I have to do it for all my source dataset files.
How to convert large CSV files into multiple files.
Nov 24, 2022: Today’s task was to update the column name to some meaningful name so that it is easy to understand while just reading the column’s name.
for example, column name ‘dmar’ to ‘MaritalStatus’ and ‘rf_cesar’ to ‘PreviousCesarean’
It was pretty challenging with the 16 different CSV files, and the average records were 1 million in each CSV file.
Loaded all CSVs data into Oracle Autonomous Datawarehouse in less than 30 minutes
15-March-2023
first of all my apology for not being able to update this, as the initial idea was to document the journey.
However with the extensive work required me to do for this dissertation, I was not able to cope with the pace and was not able to update this post.
I submitted my dissertation during the first week of March, and from now onward I will try to write separate blogs to help others to pursue their career dreams.