Thread: String tokenization

  1. #1

    In Java we've got the handy StringTokenizer class; in C we've got strtok. What have we got in Delphi to tokenize strings? I haven't found anything in the docs, and a quick search on Google returns zip.

    Are there any library functions that do this already, or do I have to write my own (*shudder*)?

  2. #2

    If you want to do comma-separated text, you could check out TStrings.CommaText. Here's a general purpose tokenizer that I've whacked up in a couple of minutes:

    [pascal]
    // Splits str on any character found in delimiters and stores the
    // non-empty tokens in UseMe (which is cleared first).
    procedure Tokenize(const UseMe: TStrings; const str, delimiters: string);
    var
      Start: Integer;
      i: Integer;
      StrLength: Integer;
      Delims: set of Char;
    begin
      UseMe.Clear;

      // build a set from the delimiter string for quick membership tests
      Delims := [];
      for i := 1 to Length(delimiters) do
        Include(Delims, delimiters[i]);

      Start := 1;
      StrLength := Length(str);
      for i := 1 to StrLength do
      begin
        if str[i] in Delims then
        begin
          // only add non-empty tokens, so runs of delimiters are skipped
          if Start <= i - 1 then
            UseMe.Add(Copy(str, Start, i - Start));

          Start := i + 1;
        end;
      end;

      // handle any trailing stuff (since there may not be a delimiter, it wouldn't
      // be flushed out in the above loop); testing Start rather than
      // str[StrLength] also keeps an empty string from being indexed
      if Start <= StrLength then
        UseMe.Add(Copy(str, Start, StrLength - Start + 1));
    end;[/pascal]

    I'd advise you to test the above for a couple of minutes on general inputs. If you find any bugs, let me know and I'll update the code.
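
As a starting point for that testing, here is a minimal console sketch. It assumes a compiler where Char is a single byte (classic Delphi, or Free Pascal in Delphi mode) and carries its own copy of the routine so it can be compiled on its own. The Joined helper and the sample inputs are made up for illustration and are not part of the thread's code.

```pascal
program TokenizeCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Same scanning logic as the Tokenize procedure in this post; the trailing
// check compares Start with the length so an empty string is never indexed.
procedure Tokenize(const UseMe: TStrings; const str, delimiters: string);
var
  Start, i, StrLength: Integer;
  Delims: set of Char;
begin
  UseMe.Clear;
  Delims := [];
  for i := 1 to Length(delimiters) do
    Include(Delims, delimiters[i]);

  Start := 1;
  StrLength := Length(str);
  for i := 1 to StrLength do
    if str[i] in Delims then
    begin
      if Start <= i - 1 then
        UseMe.Add(Copy(str, Start, i - Start));
      Start := i + 1;
    end;

  if Start <= StrLength then
    UseMe.Add(Copy(str, Start, StrLength - Start + 1));
end;

// Hypothetical helper (not from the thread): tokenizes Input and joins the
// tokens with '|' so each result fits on one line.
function Joined(const Input, Delimiters: string): string;
var
  Tokens: TStringList;
  i: Integer;
begin
  Result := '';
  Tokens := TStringList.Create;
  try
    Tokenize(Tokens, Input, Delimiters);
    for i := 0 to Tokens.Count - 1 do
    begin
      if i > 0 then
        Result := Result + '|';
      Result := Result + Tokens[i];
    end;
  finally
    Tokens.Free;
  end;
end;

begin
  WriteLn('[', Joined('a,b,,c', ','), ']');              // [a|b|c]
  WriteLn('[', Joined('  lead and trail  ', ' '), ']');  // [lead|and|trail]
  WriteLn('[', Joined(',,', ','), ']');                  // []
  WriteLn('[', Joined('', ','), ']');                    // []
  WriteLn('[', Joined('no delimiters', ','), ']');       // [no delimiters]
end.
```

The empty string and the all-delimiters string are the inputs most likely to trip up a hand-rolled tokenizer, so they are worth keeping in any test list.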

    You really shouldn't have a *shudder* in your above post, you know - Delphi's set capabilities make it rock for this sort of task! :king:
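
For the comma-separated case mentioned at the top of this post, a rough sketch of TStrings.CommaText looks like the following. The CommaFields helper and the sample string are invented for illustration. One thing to keep in mind: CommaText also treats unquoted spaces as separators, so items containing spaces need double quotes around them.

```pascal
program CommaTextDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: splits S via CommaText and joins the items with '|'.
function CommaFields(const S: string): string;
var
  Fields: TStringList;
  i: Integer;
begin
  Result := '';
  Fields := TStringList.Create;
  try
    // assigning CommaText replaces the list contents with the parsed items
    Fields.CommaText := S;
    for i := 0 to Fields.Count - 1 do
    begin
      if i > 0 then
        Result := Result + '|';
      Result := Result + Fields[i];
    end;
  finally
    Fields.Free;
  end;
end;

begin
  // the quoted item keeps its embedded space and stays a single field
  WriteLn(CommaFields('apple,banana,"cherry pie"'));
end.
```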

    EFG also has a parse, but I've not tried it yet. Nonetheless: here ya go
    "All paid jobs absorb and degrade the mind."
    -- Aristotle

  3. #3

    Here are a couple more:

    http://www.planet-source-code.com/vb...d=828&lngWId=7
    http://delphi.about.com/bltip0902.htm

    Something else of interest: there's a class called TParser in classes.pas. For some reason, this isn't shown in the help files, so I couldn't say how to use it, but... there you go! If you don't have the VCL source code then let me know.
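
Since TParser is undocumented, the following is only a tentative sketch of how it appears to be used, based on its public members in Classes (Create taking a stream, Token, TokenString, NextToken and the toEOF constant). It assumes the constructor reads the first token. TParser was written for reading form files, so it has its own ideas about quotes, numbers and symbols and is not a drop-in delimiter splitter. The ParserTokens helper is made up for illustration.

```pascal
program ParserDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: runs Source through TParser and joins the token
// texts with '|'.
function ParserTokens(const Source: string): string;
var
  Stream: TStringStream;
  Parser: TParser;
begin
  Result := '';
  Stream := TStringStream.Create(Source);
  try
    Parser := TParser.Create(Stream);
    try
      // assumption: Create has already read the first token at this point
      while Parser.Token <> toEOF do
      begin
        if Result <> '' then
          Result := Result + '|';
        Result := Result + Parser.TokenString;
        Parser.NextToken;
      end;
    finally
      Parser.Free;
    end;
  finally
    Stream.Free;
  end;
end;

begin
  WriteLn(ParserTokens('alpha beta 42'));
end.
```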
    "All paid jobs absorb and degrade the mind."
    -- Aristotle

  4. #4

    Sets! Of course! Why didn't I think of using them (kicks self in behind). I'll try it out when I get back from Uni. The delimiters I'm interested in are:

    - Space
    - Tab
    - Comma
    - New Line

    How do I write the space, tab and newline characters in Pascal? In C and Java it's \t and \n; what's the Pascal equivalent?

  5. #5

    You can refer to a char by its ASCII value using #number: for example, #13 is carriage return, #9 is tab, #10 is line feed, #32 is space, etc. You can also use the Ord() function to get the ASCII value of a char: for example, Ord(' ') -> 32, Ord(',') -> 44, and so on. Chr() does the reverse and turns a byte into a char. If you want really archaic, you can use constants defined as ^Letter, which gives the matching control character (e.g. const tab: Char = ^I, since Ctrl+I is tab). Nobody uses that notation these days, as far as I can tell.

    Here's a quick example to show how you'd check against it:

    if MyChar = #9 then // we've found a tab
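
To put the different notations side by side, here is a small console sketch. The constant names are made up for illustration; everything else is standard Pascal syntax.

```pascal
program CharLiterals;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  Tab = #9;               // character given by its ASCII value
  CrLf = #13#10;          // two control characters joined into one string
  CtrlTab: Char = ^I;     // archaic caret notation, the same character as #9

begin
  WriteLn(Ord(' '));      // 32
  WriteLn(Ord(','));      // 44
  WriteLn(Ord(Tab));      // 9
  WriteLn(Chr(9) = Tab);  // TRUE
  WriteLn(CtrlTab = Tab); // TRUE
  WriteLn(Length(CrLf));  // 2
end.
```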

    You might want to adapt the above code I gave with an array of char, instead of a string, for the delimiters.

    Note that the above uses one-char-per-delimiter. Technically, a new line is represented by #13#10 (on Windows, at least), so you may have to modify it to take that into account (shouldn't take too long, I guess). You could read in your values line-by-line to take account of that, passing each row to the function in turn. You'd have to get rid of UseMe.Clear to keep the old results handy.

    EDIT: Come to think of it, the above function would work fine if you included both #13 and #10 in the delimiters string: the #10 straight after a #13 only produces an empty token, and the function skips those.
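
A quick way to check that claim is the sketch below, which runs the same scanning loop over a string containing #13#10 line breaks, with both characters in the delimiter set. The SplitTokens helper and the Delims constant are invented for illustration, and a single-byte Char is assumed (classic Delphi, or Free Pascal in Delphi mode).

```pascal
program NewlineDelims;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

const
  // space, tab, comma, carriage return and line feed
  Delims = [' ', #9, ',', #13, #10];

// Hypothetical helper: same scanning loop as the tokenizer, with the
// resulting tokens joined by '|'.
function SplitTokens(const S: string): string;
var
  Tokens: TStringList;
  Start, i: Integer;
begin
  Result := '';
  Tokens := TStringList.Create;
  try
    Start := 1;
    for i := 1 to Length(S) do
      if S[i] in Delims then
      begin
        // an empty stretch (such as between #13 and #10) adds nothing
        if Start <= i - 1 then
          Tokens.Add(Copy(S, Start, i - Start));
        Start := i + 1;
      end;
    if Start <= Length(S) then
      Tokens.Add(Copy(S, Start, Length(S) - Start + 1));

    for i := 0 to Tokens.Count - 1 do
    begin
      if i > 0 then
        Result := Result + '|';
      Result := Result + Tokens[i];
    end;
  finally
    Tokens.Free;
  end;
end;

begin
  // prints one|two|three|four
  WriteLn(SplitTokens('one two'#13#10'three,'#9'four'#13#10));
end.
```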
    "All paid jobs absorb and degrade the mind."
    -- Aristotle

  6. #6

    The following might do the trick for you. Once again, I've not tested it (I know it compiles though):

    [pascal]
    // needs Classes (TStrings, TStringList) and SysUtils (FileExists, Exception)
    type
      TCharSet = set of Char;

    // note that this procedure does *not* clear UseMe any more!
    procedure Tokenize(const UseMe: TStrings; const str: string; delims: TCharSet);
    var
      Start: Integer;
      i: Integer;
      StrLength: Integer;
    begin
      Start := 1;
      StrLength := Length(str);
      for i := 1 to StrLength do
      begin
        if str[i] in delims then
        begin
          // only add non-empty tokens, so runs of delimiters are skipped
          if Start <= i - 1 then
            UseMe.Add(Copy(str, Start, i - Start));

          Start := i + 1;
        end;
      end;

      // flush the trailing token; testing Start rather than str[StrLength]
      // avoids indexing an empty string (e.g. a blank line in the file)
      if Start <= StrLength then
        UseMe.Add(Copy(str, Start, StrLength - Start + 1));
    end;

    procedure ReadTokensFromFile(const FileName: string);
    var
      Tokens: TStringList;
      InFile: TStringList;
      i: Integer;
    begin
      if not FileExists(FileName) then
        raise Exception.Create('The file ' + FileName + ' was not found');

      Tokens := TStringList.Create;
      try
        InFile := TStringList.Create;
        try
          // LoadFromFile splits the file into lines, so the line breaks
          // never reach Tokenize
          InFile.LoadFromFile(FileName);

          for i := 0 to InFile.Count - 1 do
            Tokenize(Tokens, InFile[i], [' ', #9, ',']);
        finally
          InFile.Free;
        end;

        // todo: use tokens somehow
      finally
        Tokens.Free;
      end;
    end;[/pascal]
    "All paid jobs absorb and degrade the mind."
    -- Aristotle

  7. #7

    Alimonster, thanks for the help!! I'll work on it later in the evening.
