My aim is to rank and retrieve the top-k records from a distributed database.
1) Input k (the number of records the user wants to retrieve).
2) There are 4 input CSV files in a folder, each with 5 columns of values: DIR, SPD, P, T, RH.
3) In each file, I need to sort the data by one of these columns, say SPD, in descending order.
4) After sorting, I need to rank each row in every file (1st row gets rank 1, 2nd row gets rank 2, and so on).
5) The middle row of the largest CSV file, along with its rank, needs to be passed to all the other CSV files.
6) From every file, all rows whose rank number is smaller than that of the passed row need to be written to a separate output CSV file.
7) Once all files are processed, the output file needs to be sorted again by the same attribute, SPD.
8) Finally, the top-k records need to be retrieved and stored in another separate CSV file.
9) Once retrieval is done, the program has to check whether a new tuple has been added to any input file. If so, that row has to be checked to see whether it can be a candidate for the top-k result.
10) To check this, the file to which the new tuple was added needs to be sorted and ranked again. Then the new tuple's rank needs to be compared with the kth record's rank (in the final output file).
If its rank number is smaller, the new tuple is added to the output file and the kth record obtained before is discarded.
Otherwise, the result is the same as before.
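
To make the batch part concrete, here is a rough sketch of how I picture the sort, rank, filter and merge steps, using only Python's standard csv module. It assumes each file has a header row naming the five columns. The function names (load_ranked, top_k), the extra "rank" and "source" fields, and the output file names are placeholders of my own, and I have taken "middle row" to mean index len // 2 of the sorted largest file.

```python
import csv
from pathlib import Path

COLUMNS = ["DIR", "SPD", "P", "T", "RH"]


def load_ranked(path, key="SPD"):
    """Read one CSV, sort it by `key` descending, and attach a 1-based rank."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: float(r[key]), reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
        row["source"] = Path(path).name  # remember which file the row came from
    return rows


def write_rows(path, rows):
    """Write rows (dicts) to a CSV with the input columns plus rank and source."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS + ["rank", "source"],
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def top_k(input_dir, k, key="SPD",
          candidates_csv="candidates.csv", topk_csv="topk.csv"):
    """Assumed reading of the steps: filter every file by the pivot rank, merge, take k."""
    ranked = [load_ranked(p, key) for p in sorted(Path(input_dir).glob("*.csv"))]
    ranked = [rows for rows in ranked if rows]  # ignore empty files

    # Pivot: the middle row of the largest file, together with its rank.
    largest = max(ranked, key=len)
    pivot = largest[len(largest) // 2]

    # Keep, from every file, the rows ranked above the pivot's rank.
    candidates = [r for rows in ranked for r in rows if r["rank"] < pivot["rank"]]

    # Sort the merged candidates again by the same attribute and cut at k.
    candidates.sort(key=lambda r: float(r[key]), reverse=True)
    result = candidates[:k]

    write_rows(candidates_csv, candidates)
    write_rows(topk_csv, result)
    return result
```

A call such as top_k("input_folder", 3) would then write both output files. The output paths should point outside the input folder, otherwise the outputs would be picked up as inputs on the next run.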
Kindly help me by providing the code for the above.
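
For the update check, this is roughly what I have in mind, written on in-memory rows (lists of dicts) to keep it short. The function name update_top_k and the "rank" field are my own assumptions, and detecting that a new row has appeared in a file (for example by comparing row counts) is left out. It follows the rank comparison exactly as described above: the new tuple's rank in its re-sorted file is compared with the rank stored on the kth record.

```python
def update_top_k(topk, file_rows, new_row, key="SPD"):
    """Return the top-k list after considering `new_row`.

    topk      -- current result, sorted by `key` descending; each row has a "rank"
    file_rows -- rows of the input file before the new tuple was added
    new_row   -- the newly added tuple (a dict with at least `key`)
    """
    # Sort and rank the affected file again, now including the new tuple.
    ordered = sorted(file_rows + [new_row],
                     key=lambda r: float(r[key]), reverse=True)
    new_rank = next(i for i, r in enumerate(ordered, start=1) if r is new_row)

    # Compare with the kth (last) record of the current result.
    if not topk or new_rank >= topk[-1]["rank"]:
        return topk  # result is the same as before

    # Otherwise the new tuple replaces the kth record; re-sort by the attribute.
    updated = topk[:-1] + [dict(new_row, rank=new_rank)]
    updated.sort(key=lambda r: float(r[key]), reverse=True)
    return updated
```

The returned list could then be written back to the top-k CSV file in place of the old one.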