Page 1 of 3 123 LastLast
Results 1 to 10 of 27

Thread: a text parser

  1. #1

    a text parser

    A simple task: parse a line of text, separated by ",", "tab", "space", ";" into a array of strings, also multiple separator characters can separate values.

    There's a utility function "StripAfter" which strips one-linecomments from a single line string (for stripping // comments from a string)

    This is my implementation, feel free to use it, but can you do it better or faster?

    The data is stored in a global variable, this is a class-free code, if you think using a class would be better give me your reasons and thoughts.

    [pascal]

    { ************************************************** ********************* }
    { }
    { Crazy text parsing Unit }
    { }
    { ************************************************** ********************* }

    unit textparser;

    interface

    uses sysutils;

    procedure Explode(str: string; cleanup: boolean);
    function Parsed(n: integer): string;
    function ParsedInt(n: integer): integer;
    function ParsedLW(n: integer): longword;
    function ParsedBool(n: integer): boolean;
    function ParsedFloat(n: integer): single;

    function StripAfter(const Text, After: string): string;

    var
    Tokens: array of string;
    Items: integer = 0;

    implementation

    procedure Explode(str: string; cleanup: boolean);
    var
    i: integer;
    begin

    if cleanup = True then
    begin
    // cleanup old data
    for i := 0 to Items - 1 do
    begin
    Tokens[i] := '';
    end;

    Items := 0;
    setlength(Tokens, Items + 1);

    end;

    // parse new string
    for i := 1 to length(str) do
    begin
    if ((str[i] = ' ') or (str[i] = ';') or (str[i] = ',')) and (Tokens[Items] <> '') then
    begin
    Items := Items + 1;
    setlength(Tokens, Items + 1);
    end
    else
    Tokens[Items] := Tokens[Items] + trim(string(str[i]));
    end;

    end;

    function Parsed(n: integer): string;
    begin
    if Items = -1 then
    Result := ''
    else
    Result := Tokens[n];
    end;

    function ParsedInt(n: integer): integer;
    begin
    Result := StrToInt(Parsed(n));
    end;

    function ParsedLW(n: integer): longword;
    begin
    {$R-}
    Result := StrToInt(Parsed(n));
    {$R+}
    end;

    function ParsedBool(n: integer): boolean;
    begin
    Result := (Parsed(n) = '1') or (lowercase(Parsed(n)) = 'true') or (lowercase(Parsed(n)) = 'yes');
    end;

    function ParsedFloat(n: integer): single;
    begin
    Result := strtofloat(Parsed(n));
    end;

    function StripAfter(const Text, After: string): string;
    var
    p: integer;
    begin
    p := pos(After, Text);
    if p <> 0 then
    Result := copy(Text, 0, p - 1)
    else
    Result := Text;
    end;

    end.
    [/pascal]
    This is my game project - Top Down City:
    http://www.pascalgamedevelopment.com...y-Topic-Reboot

    My OpenAL audio wrapper with Intelligent Source Manager to use unlimited:
    http://www.pascalgamedevelopment.com...source+manager

  2. #2
    PGD Community Manager AthenaOfDelphi's Avatar
    Join Date
    Dec 2004
    Location
    South Wales, UK
    Posts
    1,245
    Blog Entries
    2

    Re: a text parser

    Hi Delfi,

    Interesting challenge... can you make it faster :-)

    Here's my quick attempt.

    [pascal]
    type
    TStringArray = array of string;

    procedure explode(src:string;var tokens:TStringArray;cleanup:boolean);
    var
    idx : integer;
    len : integer;
    count : integer;
    temp : string;
    aChar : char;
    toklen : integer;
    begin
    if cleanup then
    begin
    finalize(tokens);
    end;

    setlength(tokens,10);
    count:=0;
    idx:=1;
    len:=length(src);
    toklen:=length(tokens);
    temp:='';

    while (idx<=len) do
    begin
    aChar:=src[idx];
    inc(idx);

    if (aChar=' ') or (aChar=';') or
    (aChar=',') or (aChar='.') or
    (aChar=':') or (aChar=#$0d) or
    (aChar=#$0a) then
    begin
    if (temp<>'') then
    begin
    if (count+1>toklen) then
    begin
    toklen:=toklen+(toklen div 2);
    setlength(tokens,toklen);
    end;
    tokens[count]:=temp;
    temp:='';
    inc(count);
    end;
    end
    else
    begin
    temp:=temp+aChar;
    end;
    end;

    if (temp<>'') then
    begin
    if (count+1>toklen) then
    begin
    inc(toklen);
    setlength(tokens,toklen);
    end;
    tokens[count]:=temp;
    inc(count);
    end;

    if (toklen>count) then
    begin
    setlength(tokens,count);
    end;
    end;
    [/pascal]

    Based on my timings, this is I believe a little under twice as fast as your explode routine.

    So, why is it faster? Well, this is why I believe it is faster.

    First up, I've tried to reduce the number of calls to functions that are relatively slow prefering instead to keep track of the values they would return myself. Thats why I have more variables involved and the code is somewhat longer.

    Secondly, setlength is quite a costly routine. So, instead of calling it for each token, I set the token buffer large and then only call it when I run out of room, growing the list by half it's size each time I do (things like TList etc. do something similar I believe). This reduces the time wasted in costly routines.

    And third, although I'm not sure how much difference this makes, instead of having an IF that compares string[ x ] against characters, I read the character I'm interested in, store it in a char and then compare that. I say I'm not sure how much difference it makes because I'm no assembly language expert (at least as far as x86 goes) but the code generated by my version looks a lot cleaner consisting only of cmp's and jnz's I believe.

    In terms of timing, I timed just the explode routine using the high performance counters. Neither routine were using their own cleanup. Parsing the string "the quick brown fox jumps over the lazy dog", the timings were as follows:-


    DescriptionMineYours
    Minimum.028ms0.045ms
    Maximum3.3ms2.4ms
    Average0.032ms0.051ms


    I've also just tried parsing a string with 225 tokens in it. Mine took, on average, 0.73ms with yours taking 1.4ms. All these timings were conducted with a loop which parsed the same string 10,000 times.
    :: AthenaOfDelphi :: My Blog :: My Software ::

  3. #3

    Re: a text parser

    Here is my version, based on AthenaOfDelphi one

    [pascal]function MyExplode(const src: string): TStringArray;
    var
    idx: Integer;
    //len: Integer;
    count: Integer;
    CharPtr: PChar;
    aChar : Char;
    toklen: Integer;
    f : Integer;
    begin
    CharPtr := Pointer(src);
    if CharPtr = nil then Exit;
    idx := 1;
    f := 1;
    //len := 0;

    toklen := 10;
    SetLength(Result, toklen);
    count := 0;

    while CharPtr^ <> #0 do
    begin
    aChar := CharPtr^;
    Inc(CharPtr);

    if (aChar = (' ')) or (aChar = (';')) or
    (aChar = (',')) or (aChar = ('.')) or
    (aChar = (':')) or (aChar = (#$0D)) or
    (aChar = (#$0A)) then
    begin
    if (f <> 0) then
    begin
    if (count + 1 > toklen) then
    begin
    toklen := toklen + (toklen div 2);
    SetLength(Result, toklen);
    end;
    Result[count] := Copy(src, f, idx - f);
    f := idx + 1;
    Inc(count);
    end;
    end;
    Inc(idx);
    //Inc(len);
    end;

    if (idx >= f) then // f < len
    begin
    if (count + 1 > toklen) then
    begin
    Inc(toklen);
    SetLength(Result, toklen);
    end;
    Result[count] := Copy(src, f, MaxInt); // (len - f + 1)
    Inc(count);
    end;

    if toklen > count then
    SetLength(Result, count);
    end;[/pascal]

    - turned it into a function for cleaner code and to avoid passing a non empty tokens array to the procedure...
    - changed src parameter to const
    - used a PChar pointer and Inc() to navigate in src string
    - removed length calculation
    - concatenating a temp string is slow, so i used a "f" variable to track the locations and then Copy() to get the substrings

    4 to 7 times faster than AthenaOfDelphi version

    i don't think it can get much faster than this using only common optimizations

    Arthur.
    From brazil (:

    Pascal pownz!

  4. #4

    Re: a text parser

    Would be nice to know which one is the fastest

    I got 2 functions, text parser with free separation character or string, and another function you can use to put any contained values to variables too:

    [pascal]function ReadStrings(const s,separator: string; var strings: array of string): integer;
    var i,p,l,c2: integer; c,cs,cur: string;
    begin
    c2:=high(strings);
    if c2<1 then begin
    result:=0; exit;
    end;
    if c2>10 then c2:=10;
    for i:=0 to c2 do strings[i]:='';
    p:=0; l:=length(separator); i:=1; cur:='';
    while i<=length(s) do begin
    c:=copy(s,i,1); cs:=copy(s,i,l);
    if cs=separator then begin
    if cur<>'' then begin
    strings[p]:=cur; cur:='';
    inc(p);
    if p>high(strings) then break;
    strings[p]:='';
    end;
    i:=i+l-1;
    end else cur:=cur+c;
    inc(i);
    end;
    if cur<>'' then begin
    strings[p]:=cur; inc(p);
    end;
    result:=p;
    end;[/pascal]

    [pascal]type
    TCustomRead = (crString,crInt,crSingle,crDouble,crByte,crWord,
    crShortInt,crSmallInt,crCardinal,crBool,crInt64);

    function ReadCustom(const s,separator: string; const arr: array of pointer;
    const arrt: array of TCustomRead): integer;
    var p,arrtl: integer; cur: string; defaultStr: boolean;
    procedure SetValue;
    begin
    if arr[p]=nil then exit;
    if defaultStr then string(arr[p]^):=cur
    else case arrt[p mod arrtl] of
    crString: string(arr[p]^):=cur;
    crInt: integer(arr[p]^):=strtointdef(cur,0);
    crSingle: single(arr[p]^):=strtofloat(cur);
    crDouble: double(arr[p]^):=strtofloat(cur);
    crByte: byte(arr[p]^):=strtointdef(cur,0);
    crWord: word(arr[p]^):=strtointdef(cur,0);
    crShortInt: shortint(arr[p]^):=strtointdef(cur,0);
    crSmallInt: smallint(arr[p]^):=strtointdef(cur,0);
    crCardinal: cardinal(arr[p]^):=strtointdef(cur,0);
    crBool: boolean(arr[p]^):=(cur<>'0') and (lowercase(cur)<>'false');
    crInt64: int64(arr[p]^):=strtoint(cur);
    end;
    end;
    var ha,i,l: integer; c,cs: string;
    begin
    ha:=high(arr);
    if ha<1 then begin
    result:=0; exit;
    end;
    defaultStr:=high(arrt)<1; arrtl:=length(arrt);
    p:=0; l:=length(separator); i:=1; cur:='';
    while i<=length(s) do begin
    c:=copy(s,i,1); cs:=copy(s,i,l);
    if cs=separator then begin
    if cur<>'' then begin
    SetValue; cur:=''; inc(p);
    if p>ha then break;
    end;
    inc(i,l-1);
    end else cur:=cur+c;
    inc(i);
    end;
    if cur<>'' then begin
    SetValue; inc(p);
    end;
    result:=p;
    end;[/pascal]

    Edit: There's no button for pascal tags...

  5. #5

    Re: a text parser

    Here's my try.
    [pascal]
    procedure SplitString(const AOutput: TStrings; const AText, ASeparator: String);
    var
    S1, S2: String;
    begin
    AOutput.Clear();
    S2 := AText + ASeparator;
    repeat
    S1 := Copy(S2, 0, Pos(ASeparator, S2) - 1);
    AOutput.Add(S1);
    Delete(S2, 1, Length(S1 + ASeparator));
    until (S2 = '');
    end;
    [/pascal]

  6. #6
    PGD Community Manager AthenaOfDelphi's Avatar
    Join Date
    Dec 2004
    Location
    South Wales, UK
    Posts
    1,245
    Blog Entries
    2

    Re: a text parser

    Ok, I've plugged everyones attempt (so far) into my test harness and got some timings from them all.

    The average results are as follows:-


    UserAverage Time
    arthurps0.011ms
    Brainer0.023ms
    AthenaOfDelphi0.029ms
    Delfi0.049ms
    User1370.057ms


    The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time, and Brainer, you return a string list as opposed to an array of strings.

    But, none the less... nice job everyone

    Howevere... at the moment, we have a clear winner... arthurps, your version of my routine is neat. As you've identified, I'm wasting time with string concatenation, a very wasteful operation as it reallocates memory for the result string. So, nice job.

    Anyone else got anything faster?
    :: AthenaOfDelphi :: My Blog :: My Software ::

  7. #7

    Re: a text parser

    Quote Originally Posted by AthenaOfDelphi
    The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time...
    There may be uses for all different functions, i came from idea that user defines separator and there should never be more than 1. I could for example use '<>' or tab as separator and there are times you don't want ',' or ' ' or anything certain to cut strings.

    Thanks for timing test, there seems to be need for optimize.

  8. #8

    Re: a text parser

    Wow, i didnt realise mine was such rubbish and i didn't actually think there was a lot of room for optimization there, obviously i was very wrong and i'm still learning how to speed up various parts of my game.

    arthurprs's code is really nice, it is similar to mine but he uses pointers which is clearly faster. I have some doubts it would work with widestring, which might be a important factor in the future, especially for translating games.

    This string parser is a integral part of most games which parse text files and i use this in practically every project i make that reads plaintext files.

    I am planning to write a proper token parser soon, which would parse JSON, i kinda dislike XML's syntax and json seems so well human readable and editeable, i'll try to use tricks i see in arthur's code in it.
    This is my game project - Top Down City:
    http://www.pascalgamedevelopment.com...y-Topic-Reboot

    My OpenAL audio wrapper with Intelligent Source Manager to use unlimited:
    http://www.pascalgamedevelopment.com...source+manager

  9. #9
    PGD Community Manager AthenaOfDelphi's Avatar
    Join Date
    Dec 2004
    Location
    South Wales, UK
    Posts
    1,245
    Blog Entries
    2

    Re: a text parser

    Quote Originally Posted by User137
    Quote Originally Posted by AthenaOfDelphi
    The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time...
    There may be uses for all different functions, i came from idea that user defines separator and there should never be more than 1. I could for example use '<>' or tab as separator and there are times you don't want ',' or ' ' or anything certain to cut strings.

    Thanks for timing test, there seems to be need for optimize.
    Oh absolutely hon, I would go down the same line as you and have a single separator. I wasn't saying it was a bad thing, I was just pointing out the subtle differences between everyones code.
    :: AthenaOfDelphi :: My Blog :: My Software ::

  10. #10
    PGD Community Manager AthenaOfDelphi's Avatar
    Join Date
    Dec 2004
    Location
    South Wales, UK
    Posts
    1,245
    Blog Entries
    2

    Re: a text parser

    Quote Originally Posted by Delfi
    arthurprs's code is really nice, it is similar to mine but he uses pointers which is clearly faster. I have some doubts it would work with widestring, which might be a important factor in the future, especially for translating games.

    This string parser is a integral part of most games which parse text files and i use this in practically every project i make that reads plaintext files.

    I am planning to write a proper token parser soon, which would parse JSON, i kinda dislike XML's syntax and json seems so well human readable and editeable, i'll try to use tricks i see in arthur's code in it.
    The main differences between your code and mine is the optimised sizing of the array. That makes quite a difference I believe as setlength is quite slow. And yeah, Arthur's, taken that and tweaked it some more.

    In terms of wide character support, he's used inc, so providing the right variable types are used (widestring and, I think, PWideChar) it should still work.

    As for writing a JSON parser... I don't want to discourage anyone from learning, but I also wouldn't encourage re-inventing the wheel (unless of course the wheel you were looking at was square). There is already a JSON parser for Delphi, so if it were me, I'd look at their code, see if could be optimised... and if not... work on something else

    Anyhow... maybe when I get the competition site sorted we could have a regular... 'see who can do it fastest/smallest competition'... just a thought
    :: AthenaOfDelphi :: My Blog :: My Software ::

Page 1 of 3 123 LastLast

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •