Faster than Quicksort for strings?



cronodragon
27-07-2007, 03:32 PM
Hello!

I'm trying to speed up the quicksort algorithm for sorting arrays of strings. As you might know, the algorithm internally compares two members of the array at a time to decide which is smaller and which is greater. Delphi probably does a scan comparison, checking each byte (word, dword, etc.) until it finds a difference; if no difference is found, the scan runs to the end of both strings, so large strings take a lot of processing to compare. One of the ideas I have is compressing the strings using different methods, like:

1. RLE (run-length encoding), but this might not be a good idea since the strings I want to sort are mostly word-like (identifiers, file paths, etc.) with very few repeated characters.

2. Limit the character set, using fewer bits to store each character. The comparison would not need to be bit-wise: as long as the symbols keep their ASCII order and each code has the same size in bits, they can be compared in groups of a byte, word, etc. (see the sketch after this list).

3. Some sort of Huffman compression, which stores the most common characters in fewer bits. But it would have to be designed in such a way that it still allows byte-group comparisons, otherwise it won't help.
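
To make idea 2 concrete, here is roughly what I have in mind (just a sketch - the code table and the chunk size are made up, and anything that isn't a letter or digit gets lumped into one code, so its relative order is lost):

// Map a restricted character set to 6-bit codes that keep ASCII order.
// 0 is reserved for "past the end of the string", so shorter strings
// sort before their longer extensions.
function CharCode(c: AnsiChar): Cardinal;
begin
  case c of
    '0'..'9': Result := 1 + Ord(c) - Ord('0');    // 1..10
    'A'..'Z': Result := 11 + Ord(c) - Ord('A');   // 11..36
    'a'..'z': Result := 37 + Ord(c) - Ord('a');   // 37..62
  else
    Result := 63;                                 // everything else: order lost
  end;
end;

// Pack 5 characters (30 bits) starting at Pos, most significant first,
// padding with 0 past the end of the string.
function PackChunk(const s: AnsiString; Pos: Integer): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to 4 do
  begin
    Result := Result shl 6;
    if Pos + i <= Length(s) then
      Result := Result or CharCode(s[Pos + i]);
  end;
end;

// Lexicographic order (under the mapping above) falls out of comparing
// the packed chunks as plain numbers.
function ComparePacked(const a, b: AnsiString): Integer;
var
  Pos: Integer;
  ca, cb: Cardinal;
begin
  Result := 0;
  Pos := 1;
  while (Pos <= Length(a)) or (Pos <= Length(b)) do
  begin
    ca := PackChunk(a, Pos);
    cb := PackChunk(b, Pos);
    if ca <> cb then
    begin
      if ca < cb then Result := -1 else Result := 1;
      Exit;
    end;
    Inc(Pos, 5);
  end;
end;

Of course, in practice each string would be packed once up front and only the packed chunks would be compared during the sort.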

Does anyone have other ideas to make a faster quicksort? :D

Setharian
27-07-2007, 04:49 PM
Well, quicksort is fast; the problem is string comparison performance. :) Are you sure it is a bottleneck at all? I have an IntroSort implementation (a QuickSort, HeapSort and InsertionSort combination with guaranteed O(n log n) worst-case performance) which can sort 750000 strings in 50-300 ms depending on the input, and I don't think it can get much faster than that.

You could possibly precalculate a numerical value for each string and then compare numbers, however I'd say such an approach would overflow 32 bits quite easily (as well as 64 bits), especially with long file paths.

What most string comparison algorithms do is compare individual characters and, while they are the same, continue comparing. If they encounter an inequality, the string with the character of greater value is the greater one, and that is where the comparison ends. One possible way to speed this process up is to separate the comparison into multiple stages - divide the input by shared long file paths (omitting the filename) and then sort those smaller groups based only on the short filenames. Once those are sorted, the small lists are merged together and you get a sorted list. To illustrate it, let's suppose we have 5 files.

C:\LongFolderA\FileB.txt
C:\LongFolderB\File2A.txt
C:\LongFolderA\FileA.txt
C:\LongFolderB\File2C.txt
C:\LongFolderB\File2B.txt

Now you sort by paths and you get 2 groups.

Group 1
Path - C:\LongFolderA\
Files: (we sort these, no longer need to compare paths, they are the same)
FileA.txt
FileB.txt

Group 2
Path - C:\LongFolderB\
Files: (we sort these, no longer need to compare paths, they are the same)
File2A.txt
File2B.txt
File2C.txt

Result (created by merging Groups 1 and 2)
C:\LongFolderA\FileA.txt
C:\LongFolderA\FileB.txt
C:\LongFolderB\File2A.txt
C:\LongFolderB\File2B.txt
C:\LongFolderB\File2C.txt

Once sorted, we merge the lists and we get the final sorted list. Note that looking for the shared paths also involves string comparisons, but because there are far fewer shared paths than total files, it is a lot faster. Hope you understood. :)
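
In code, the same idea could look roughly like this (only a sketch using TStringList and ExtractFilePath, with TStringList's default case-insensitive compare - not the tuned version I actually use):

uses
  SysUtils, Classes;

procedure SortPathsByPrefixGroups(Paths: TStringList);
var
  Groups: TStringList;   // one entry per distinct directory, kept sorted
  Files: TStringList;    // the short file names of one group
  i, g: Integer;
  Dir: string;
begin
  Groups := TStringList.Create;
  try
    Groups.Sorted := True;
    // Pass 1: bucket the short file names under their (long) directory.
    // The expensive long-prefix comparisons only happen here, once per path.
    for i := 0 to Paths.Count - 1 do
    begin
      Dir := ExtractFilePath(Paths[i]);
      g := Groups.IndexOf(Dir);
      if g < 0 then
        g := Groups.AddObject(Dir, TStringList.Create);
      TStringList(Groups.Objects[g]).Add(ExtractFileName(Paths[i]));
    end;
    // Pass 2: sort each small group by file name only, then merge back.
    Paths.Clear;
    for g := 0 to Groups.Count - 1 do
    begin
      Files := TStringList(Groups.Objects[g]);
      Files.Sort;                          // only the short names get compared
      for i := 0 to Files.Count - 1 do
        Paths.Add(Groups[g] + Files[i]);
      Files.Free;
    end;
  finally
    Groups.Free;
  end;
end;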

The problem here is that it is not a general-purpose approach and will only improve performance when many strings share the same prefix. I'm using this same approach for debug symbol sorting, where many symbols start with the same prefix (method names, i.e. Namespace.Class.Method - Namespace and Class are shared by all the methods of that class).

Compression won't help you because it tampers with the strings and you cannot sort them in the normal order (you'd be comparing compressed data).

jdarling
27-07-2007, 05:16 PM
You explained what you want to do well enough, but how about why you want to do it? I can think of faster ways to index strings than sorting or standard Pascal lists altogether; the problem is that unless I know why you're doing it, I don't know where to tell you to start :).


As a quick example, the best/fastest way to find out if a string is in a given list is to use a modified DAWG instead of a list at all. There are plenty of other advantages to a DAWG, but its BIG downside is that getting the list of words back out of it in sorted order isn't trivial. I have units lying around for DAWGs (though I called it a word dictionary) if it looks like this will help.
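
Just to give the flavour of it, a bare-bones lookup looks something like this (a plain trie, not my unit and not a real DAWG - a DAWG additionally merges identical sub-trees to save memory). The point is that a lookup costs only the length of the word, no matter how many words are stored:

type
  PTrieNode = ^TTrieNode;
  TTrieNode = record
    IsWord: Boolean;
    Next: array['a'..'z'] of PTrieNode;   // lower-case letters only
  end;

function NewNode: PTrieNode;
begin
  New(Result);
  FillChar(Result^, SizeOf(TTrieNode), 0);
end;

procedure AddWord(Root: PTrieNode; const W: string);
var
  i: Integer;
  c: Char;
begin
  for i := 1 to Length(W) do
  begin
    c := W[i];
    if c in ['A'..'Z'] then
      c := Chr(Ord(c) + 32);              // case-insensitive
    if not (c in ['a'..'z']) then
      Exit;                               // reject anything but letters
    if Root^.Next[c] = nil then
      Root^.Next[c] := NewNode;
    Root := Root^.Next[c];
  end;
  Root^.IsWord := True;
end;

function WordExists(Root: PTrieNode; const W: string): Boolean;
var
  i: Integer;
  c: Char;
begin
  Result := False;
  for i := 1 to Length(W) do
  begin
    c := W[i];
    if c in ['A'..'Z'] then
      c := Chr(Ord(c) + 32);
    if not (c in ['a'..'z']) then
      Exit;
    Root := Root^.Next[c];
    if Root = nil then
      Exit;                               // walked off the trie: not present
  end;
  Result := Root^.IsWord;
end;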

As for sorting speed, using PChars instead of strings and writing your own binary sort method is about the fastest you can get. Better still is when you have a list and can perform the sort on insert; this way you get sorted strings and the speed you're looking for. Unique hashing algorithms work very quickly as well, but their downside is the overhead and the cost of the hash calculation.
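
For the sort-on-insert route the standard TStringList already does most of the work; something along these lines (just to show the shape of it, not tuned for speed):

program SortedInsertDemo;

uses
  Classes;

var
  SL: TStringList;
  Idx: Integer;
begin
  SL := TStringList.Create;
  try
    SL.Sorted := True;          // from here on, Add keeps the list in order
    SL.Duplicates := dupIgnore; // optional: silently drop duplicates
    SL.Add('zebra');
    SL.Add('apple');
    // Find does a binary search because the list is sorted
    if SL.Find('apple', Idx) then
      WriteLn('apple is at index ', Idx);
  finally
    SL.Free;
  end;
end.

For raw speed you'd still roll your own with PChars, but this gets you sorted-on-insert and binary-search lookups for free.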

BTW: one thing to keep in mind is non-alphanumeric weighting. If your strings can contain non-alphanumeric characters, you need to weight the character values appropriately.

cronodragon
27-07-2007, 05:29 PM
Setharian, I had never heard of IntroSort before, it seems pretty interesting. Do you know where I can find more information, or better yet, an implementation of it in Delphi? I should also learn more about hashes and hash tables. I understand the concepts of hashes, hash generation and buckets, but all the good hash table implementations I have found are in C or C++, and they use some "extreme" binary operations on types which I'm not sure are fully compatible with Delphi. I'd like to find a good hash implementation in Delphi too. The idea of dividing the string is pretty cool as well, and it's easy for me to implement.

jdarling, what I'm using the sorting for is to index objects by strings... exactly what a hash table should do, but simpler for me to implement, and it also uses less memory. I made a modified version of quicksort to search the tables, and now I'm using it extensively to define dynamic classes for skinnable interface (GUI) elements (I copied the idea of Delphi's form definition files and did it in XML) and for language localization, also to re-use resources like textures, and I hope I'll soon be writing a scripting language similar to ActionScript (Flash), which adds properties to objects at runtime. All that for my engine :D
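
By the way, the table lookup I mean is basically a binary search over the sorted keys - something like this (just a sketch, the record and field names are made up):

type
  TEntry = record
    Key: string;       // the string the object is indexed by
    Obj: TObject;      // whatever is attached to it
  end;

// Table must already be sorted by Key (e.g. by the quicksort).
// Returns the index of Key, or -1 if it is not in the table.
function FindKey(const Table: array of TEntry; const Key: string): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Result := -1;
  Lo := Low(Table);
  Hi := High(Table);
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if Table[Mid].Key = Key then
    begin
      Result := Mid;
      Exit;
    end;
    if Table[Mid].Key < Key then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
end;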

Setharian
27-07-2007, 06:35 PM
IntroSort (or introspective sort) is a combined sorting algorithm whose main part is a QuickSort. However, for some inputs QuickSort degrades to O(n^2) performance, which is unacceptable. IntroSort keeps count of the QuickSort recursion depth and when it hits a certain threshold, it decides the input is bad and switches to HeapSort. Also, if the number of elements is low (usually 16) it uses InsertionSort, and it also uses it to sort the small partitions created during QuickSort (also usually 16 elements), because that is faster than subdividing the partition further.
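
Just to give you an idea of the structure before you look at the real code, a stripped-down sketch of the control flow would be something like this (illustrative only - plain middle-element pivot instead of median-of-three, and none of the other optimizations):

const
  SmallCutoff = 16;  // partitions this small go straight to InsertionSort

procedure InsertionSort(var A: array of string; Lo, Hi: Integer);
var
  i, j: Integer;
  Tmp: string;
begin
  for i := Lo + 1 to Hi do
  begin
    Tmp := A[i];
    j := i - 1;
    while (j >= Lo) and (A[j] > Tmp) do
    begin
      A[j + 1] := A[j];
      Dec(j);
    end;
    A[j + 1] := Tmp;
  end;
end;

procedure HeapSort(var A: array of string; Lo, Hi: Integer);
  procedure SiftDown(Start, Last: Integer);
  var
    Root, Child: Integer;
    Tmp: string;
  begin
    Root := Start;
    Tmp := A[Lo + Root];
    while Root * 2 + 1 <= Last do
    begin
      Child := Root * 2 + 1;
      if (Child < Last) and (A[Lo + Child] < A[Lo + Child + 1]) then
        Inc(Child);
      if Tmp >= A[Lo + Child] then
        Break;
      A[Lo + Root] := A[Lo + Child];
      Root := Child;
    end;
    A[Lo + Root] := Tmp;
  end;
var
  N, i: Integer;
  Tmp: string;
begin
  N := Hi - Lo + 1;
  for i := N div 2 - 1 downto 0 do
    SiftDown(i, N - 1);                    // build the max-heap
  for i := N - 1 downto 1 do
  begin
    Tmp := A[Lo]; A[Lo] := A[Lo + i]; A[Lo + i] := Tmp;
    SiftDown(0, i - 1);                    // restore the heap
  end;
end;

procedure IntroLoop(var A: array of string; Lo, Hi, Depth: Integer);
var
  i, j: Integer;
  Pivot, Tmp: string;
begin
  if Hi - Lo < SmallCutoff then
  begin
    InsertionSort(A, Lo, Hi);              // small partition
    Exit;
  end;
  if Depth = 0 then
  begin
    HeapSort(A, Lo, Hi);                   // too deep: bad input, bail out
    Exit;
  end;
  Pivot := A[(Lo + Hi) div 2];
  i := Lo; j := Hi;
  repeat                                   // plain Hoare-style partition
    while A[i] < Pivot do Inc(i);
    while A[j] > Pivot do Dec(j);
    if i <= j then
    begin
      Tmp := A[i]; A[i] := A[j]; A[j] := Tmp;
      Inc(i); Dec(j);
    end;
  until i > j;
  IntroLoop(A, Lo, j, Depth - 1);
  IntroLoop(A, i, Hi, Depth - 1);
end;

procedure IntroSort(var A: array of string);
var
  N, Depth: Integer;
begin
  if Length(A) < 2 then Exit;
  N := Length(A); Depth := 0;
  while N > 0 do                           // depth limit ~ 2 * log2(n)
  begin
    Inc(Depth, 2);
    N := N shr 1;
  end;
  IntroLoop(A, 0, High(A), Depth);
end;

The Depth parameter is what makes it "introspective" - it is the recursion budget, roughly 2*log2(n), and blowing it is the signal that the input is one of the bad cases.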

The code has been changed a bit by me, but most of the credit goes to its creator. Unfortunately I do not know his/her name.

(Code removed - it was getting mangled by some forum script for an unknown reason once posted; download it from here - http://filebeam.com/921a9b20e23b8d85b7b80e0e122bed0a)

The hash functions are from Paul Hsieh's site (I hope I'm spelling his name correctly).

I hope this helps.

cronodragon
27-07-2007, 07:01 PM
Hey, thanks! I'll study (or copy-paste) that code :D

:think: hmm, I see a repeat without until, maybe the code highlighter is messing up the text.

EDIT: thx for moving the code! :D

jdarling
30-07-2007, 12:43 PM
Well, as long as your strings are unique, this may be of some help to you. It's a unit that I built quite a few years ago and recently rebuilt in FPC to work with DAWGs and attach a data pointer to each word.

Some things of note:
1) Strings are case-insensitive (search the code for UpCase if you want to change that).
2) The method .WordValid is used to check whether words are valid to be placed into the list. You can change it to allow more; currently it only accepts A..Z and a..z. It's a simple change to make it accept anything or whatever you need.
3) It's amazingly fast, but does take more memory than StringLists would.
4) TWordList.AddWord makes a call to WordValid and then another call to WordExists. If you really wanted to, you could remove the call to WordExists; it isn't strictly necessary, but I like the safety of having it in place. Removing it would double the insertion speed.
5) Use WordSymbol as you would Strings.Objects[Index], only instead use WordSymbol[Word]^.Data.
6) A really cool feature is that you can find all words made from a set of characters; don't know if you will need it or not, but it's fun to play with :)
7) If you want to load a word file, make sure the file contains a list of words, each on its own separate line.

Download link: http://www.eonclash.com/PGD/uWordList.pas

Sample Code:

var
  MyWordList : TWordList;
begin
  MyWordList := TWordList.Create;   // create the list first (TWordList is assumed to be a class)
  MyWordList.LoadFromFile('TWL98.TXT');
  // set a data pointer
  MyWordList.WordSymbol['Dog']^.Data := Pointer(123);
  // Write out a few of the data pointers for good measure
  writeln(Integer(MyWordList.WordSymbol['Cat']^.Data));
  writeln(Integer(MyWordList.WordSymbol['Dog']^.Data));
  // Find and list all of the words made up of the given letters, 2 to 5 characters long
  MyWordList.ListWordsFromLetters('DOCAGTRV', Memo1.Lines, 2, 5);
  MyWordList.Free;
end;

cronodragon
23-08-2007, 03:49 PM
I finally implemented the IntroSort and also hash tables, and the speed-up is huge! But I have a doubt: how many buckets should I use? I inserted 1 million sorted strings into the hash table using 128 buckets in one test and 1024 buckets in the other, and got almost the same performance. Should I keep it at 128, since it's improbable I'll ever add more than 1 million elements?

jdarling
23-08-2007, 05:07 PM
The total number of buckets helps to lower your chances of collisions among the hash keys. Thus the higher the bucket count, the lower your chance of a collision between strings (two strings with different content generating the same hash value).
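
As a rough illustration (the function name is made up, any hash will do): the average number of entries sharing a bucket is simply ItemCount / BucketCount, so the bucket count only really starts to pay off once it gets into the same ballpark as the number of items.

// Illustrative only - how a key ends up in a bucket:
function BucketIndex(Hash, BucketCount: Cardinal): Cardinal;
begin
  Result := Hash mod BucketCount;   // or: Hash and (BucketCount - 1) when BucketCount is a power of two
end;

// 1,000,000 items / 128 buckets   -> ~7,800 entries per bucket on average
// 1,000,000 items / 1,024 buckets -> ~1,000 entries per bucket on average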

Hope that helps.