String tokenization



Viro
20-02-2003, 08:15 AM
In Java, we've got the handy StringTokenizer class; in C, we've got strtok. What have we got in Delphi to tokenize strings? I haven't found anything in the docs, and a quick search on Google returns zip.

Are there any library functions that do this already, or do I have to write my own (*shudder*)?

Alimonster
20-02-2003, 09:55 AM
If you want to do comma-separated text, you could check out TStrings.CommaText. Here's a general purpose tokenizer that I've whacked up in a couple of minutes:

procedure Tokenize(const UseMe: TStrings; const str, delimiters: string);
var
  Start: Integer;
  i: Integer;
  StrLength: Integer;
  Delims: set of Char;
begin
  UseMe.Clear;

  // build a set from the delimiter characters for quick membership tests
  Delims := [];
  for i := 1 to Length(delimiters) do
    Include(Delims, delimiters[i]);

  Start := 1;
  StrLength := Length(str);
  for i := 1 to StrLength do
  begin
    if str[i] in Delims then
    begin
      // only add non-empty tokens (skips leading and consecutive delimiters)
      if Start <= i - 1 then
        UseMe.Add(Copy(str, Start, i - Start));

      Start := i + 1;
    end;
  end;

  // handle any trailing stuff (since there may not be a delimiter, it wouldn't
  // be flushed out in the above loop); comparing Start with StrLength also
  // avoids indexing str[0] when str is empty
  if Start <= StrLength then
    UseMe.Add(Copy(str, Start, StrLength - Start + 1));
end;

I'd advise you to test the above for a couple of minutes on general inputs. If you find any bugs, let me know and I'll update the code.

You really shouldn't have a *shudder* in your above post, you know - Delphi's set capabilities make it rock for this sort of task! :king:

EFG also has a parser, but I've not tried it yet. Nonetheless: here ya go (http://homepages.borland.com/efg2lab/Library/Delphi/Strings/Tokens.pas.TXT)
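For the comma-separated case mentioned at the top of this post, here is a minimal sketch of TStrings.CommaText. The program name and the sample string are made up for illustration. Assigning to CommaText splits the text into items, and a double-quoted item may contain spaces or commas. Be aware that in Delphi an unquoted space also acts as a separator, which is why "gamma delta" is quoted here.

```pascal
program CommaTextDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

var
  Parts: TStringList;
  i: Integer;
begin
  Parts := TStringList.Create;
  try
    // assigning CommaText replaces the list contents with the parsed items
    Parts.CommaText := 'alpha,beta,"gamma delta"';

    for i := 0 to Parts.Count - 1 do
      WriteLn(i, ': ', Parts[i]);
  finally
    Parts.Free;
  end;
end.
```

This should list three items, with the quoted one kept whole. If your delimiters go beyond commas (tabs, for instance), a hand-rolled tokenizer like the one above is probably the simpler route.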

Alimonster
20-02-2003, 10:07 AM
Here are a couple more:

http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=828&lngWId=7
http://delphi.about.com/bltip0902.htm

Something else of interest: there's a class called TParser in classes.pas. For some reason, this isn't shown in the help files, so I couldn't say how to use it, but... there you go! If you don't have the VCL source code then let me know.

Viro
20-02-2003, 11:25 AM
Sets! Of course! Why didn't I think of using them (kicks self in behind). I'll try it out when I get back from Uni. The delimiters I'm interested in are :

- Space
- Tab
- Comma
- New Line

How do I get the space/tab/newline characters in Pascal? In C and Java it's \t and \n; what's the equivalent in Pascal?

Alimonster
20-02-2003, 11:42 AM
You can refer to a char by its ASCII value using #number: for example, #13 is carriage return, #9 is tab, #10 is line feed, #32 is space, etc. (handy link (http://www.asciitable.com)) You can also use the Ord() function to get the ASCII value of a char: for example, Ord(' ') -> 32, Ord(',') -> 44, and so on. Chr() does the reverse and turns a byte into a char. If you want really archaic, you can use control-character constants written as ^Letter, which give you the char for Ctrl+Letter (e.g. const tab: Char = ^I; is the same as #9). Nobody uses that notation these days, as far as I can tell.

Here's a quick example to show how you'd check against it:

if MyChar = #9 then // we've found a tab

You might want to adapt the above code I gave with an array of char, instead of a string, for the delimiters.

Note that the above uses one-char-per-delimiter. Technically, a new line on Windows is represented by the two-char sequence #13#10, so you may have to modify it to take that into account (shouldn't take too long, I guess). You could read in your values line-by-line to take account of that, passing each row to the function in turn. You'd have to get rid of UseMe.Clear to keep the old results handy.

EDIT: Come to think of it, the above function would work fine if you included both #13 and #10 in the delimiters string: the #13 ends the current token, and the #10 right after it would only produce an empty token, which the Start <= i - 1 check skips.
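To pull the notations from this post together, here is a small sketch. The program and constant names are only illustrative. It shows #number literals, a ^Letter constant, Ord/Chr, and a set of Char holding the delimiters asked about.

```pascal
program CharLiterals;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  Tab = #9;              // same character as \t in C and Java
  CRLF = #13#10;         // adjacent # literals join into a two-char string
  CtrlI: Char = ^I;      // archaic control-character notation, also a tab

var
  Delims: set of Char;
begin
  Delims := [' ', #9, ',', #13, #10];

  WriteLn(Ord(' '));        // ASCII code of space: 32
  WriteLn(Ord(','));        // ASCII code of comma: 44
  WriteLn(Chr(65));         // the char with code 65: A
  WriteLn(Length(CRLF));    // 2, since CRLF is a string rather than a Char

  if CtrlI = Tab then
    WriteLn('^I and #9 are the same char');
  if #10 in Delims then
    WriteLn('line feed is treated as a delimiter');
end.
```

Because a set of Char can only hold single characters, #13 and #10 go in as two separate members rather than as the CRLF pair.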

Alimonster
20-02-2003, 11:56 AM
The following might do the trick for you. Once again, I've not tested it (I know it compiles though):

type
  TCharSet = set of Char;

// note that this procedure does *not* clear UseMe any more!
procedure Tokenize(const UseMe: TStrings; const str: string; delims: TCharSet);
var
  Start: Integer;
  i: Integer;
  StrLength: Integer;
begin
  Start := 1;
  StrLength := Length(str);
  for i := 1 to StrLength do
  begin
    if str[i] in delims then
    begin
      // only add non-empty tokens (skips leading and consecutive delimiters)
      if Start <= i - 1 then
        UseMe.Add(Copy(str, Start, i - Start));

      Start := i + 1;
    end;
  end;

  // flush the trailing token; comparing Start with StrLength also avoids
  // indexing str[0] when str is empty (e.g. a blank line in the file)
  if Start <= StrLength then
    UseMe.Add(Copy(str, Start, StrLength - Start + 1));
end;

procedure ReadTokensFromFile(const FileName: string);
var
  Tokens: TStringList;
  InFile: TStringList;
  i: Integer;
begin
  if not FileExists(FileName) then
    raise Exception.Create('The file ' + FileName + ' was not found');

  Tokens := TStringList.Create;
  try
    InFile := TStringList.Create;
    try
      InFile.LoadFromFile(FileName);

      for i := 0 to InFile.Count - 1 do
        Tokenize(Tokens, InFile[i], [' ', #9, ',']);
    finally
      InFile.Free;
    end;

    // todo: use tokens somehow
  finally
    Tokens.Free;
  end;
end;
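If you would rather tokenize one string in memory than go through a file, a usage sketch along these lines should work. It repeats the set-based Tokenize so that it compiles on its own, uses Start <= StrLength for the trailing token so an empty string is safe, and passes all four requested delimiters, with new line given as both #13 and #10. The program name and sample text are made up. Newer Unicode Delphi versions may warn about set of Char, but it should still behave for plain ASCII delimiters.

```pascal
program TokenizeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  TCharSet = set of Char;

// appends the tokens found in str to UseMe; does not clear UseMe first
procedure Tokenize(const UseMe: TStrings; const str: string; delims: TCharSet);
var
  Start, i, StrLength: Integer;
begin
  Start := 1;
  StrLength := Length(str);
  for i := 1 to StrLength do
    if str[i] in delims then
    begin
      // skip empty tokens caused by leading or consecutive delimiters
      if Start <= i - 1 then
        UseMe.Add(Copy(str, Start, i - Start));
      Start := i + 1;
    end;

  // flush whatever follows the last delimiter (nothing if str is empty)
  if Start <= StrLength then
    UseMe.Add(Copy(str, Start, StrLength - Start + 1));
end;

var
  Tokens: TStringList;
  i: Integer;
begin
  Tokens := TStringList.Create;
  try
    // space, comma followed by tab, then a #13#10 line break
    Tokenize(Tokens, 'one two,'#9'three'#13#10'four', [' ', #9, ',', #13, #10]);

    for i := 0 to Tokens.Count - 1 do
      WriteLn(Tokens[i]);
  finally
    Tokens.Free;
  end;
end.
```

With the sample string this should yield the four tokens one, two, three and four. The comma-plus-tab and the #13#10 pair each produce an empty token that gets skipped.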

Viro
20-02-2003, 03:31 PM
Alimonster, thanks for the help!! :D I'll work on it later in the evening.