I ran into the problem of sorting a million fixed-size records on a unix box. The box is often overloaded with many processes, so I had to do the sort with a small amount of user and process memory. As usual, the time available for implementation was short. Here is the method I adopted:
1. Read the input file, which contains all the input values, one big number per line. Each line is the big-integer representation of a certain AES output value; the big integers are created with the open-source GMP library.
2. Read the first 4 digits of each line and append the line to a file named after those digits. For instance, the line 8923523345012545454646 is written out to a file called 8923.ref. Any other number starting with the same 4 digits is appended to the same file.
3. Repeat this process until every line is processed. This creates at most 10K files (one per 4-digit prefix). I wrote this code in C; to make the file writing fast, a large character array is used as a buffer to minimize the number of file writes.
4. Sort each file using any sorting method. If the items are of fixed size, the unix sort utility can be used.
5. All that is left now is to cat the files, in ascending filename order, into a single output file.
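The bucketing in steps 1-3 can be sketched in shell as follows (the original code is C with a large write buffer; the file names and sample data here are made up for illustration):

```shell
# Hypothetical demo of steps 1-3: append each line to a bucket file
# named after its first four digits. Sample data is invented; the real
# input is one GMP big integer per line.
(
  mkdir -p bucket_demo && cd bucket_demo
  printf '8923523345\n1001777777\n8923000001\n' > input.txt
  rm -f *.ref   # start with clean buckets

  while IFS= read -r line; do
    prefix=$(printf '%s' "$line" | cut -c1-4)   # first 4 digits pick the bucket
    printf '%s\n' "$line" >> "$prefix.ref"      # append to that bucket file
  done < input.txt
)
```

A shell loop like this opens and closes streams constantly; the C version avoids that overhead by keeping a large character-array buffer per file, as described above.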
This may not be the most efficient way, but it is quick, flexible, and requires very little memory. The number of leading digits used for the file names should be chosen by analysing the input data; you may want to reduce or increase it depending on the data you have.
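Steps 4 and 5 can be sketched with the standard unix utilities mentioned above; the bucket files and their contents below are assumed sample data. Because the lines are fixed-width, plain lexicographic sort(1) orders them numerically as well.

```shell
# Hypothetical demo of steps 4-5 on two invented bucket files.
(
  mkdir -p sort_demo && cd sort_demo
  printf '89235233\n89230001\n' > 8923.ref
  printf '10019999\n10010000\n' > 1001.ref

  # Step 4: sort each bucket file in place.
  for f in *.ref; do sort -o "$f" "$f"; done

  # Step 5: concatenate buckets in ascending filename order.
  cat $(ls *.ref | sort) > sorted_all.txt
)
```

Since each bucket holds only numbers sharing the same leading digits, concatenating the sorted buckets in filename order yields a fully sorted file.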