a text parser

**JernejL** · 05-06-2009, 07:19 PM

A simple task: parse a line of text, separated by ",", "tab", "space", ";" into a array of strings, also multiple separator characters can separate values.

There's a utility function "StripAfter" which strips one-linecomments from a single line string (for stripping // comments from a string)

This is my implementation, feel free to use it, but can you do it better or faster?

The data is stored in a global variable, this is a class-free code, if you think using a class would be better give me your reasons and thoughts.

[pascal]

{ ************************************************** ********************* }
{ }
{ Crazy text parsing Unit }
{ }
{ ************************************************** ********************* }

unit textparser;

interface

uses sysutils;

procedure Explode(str: string; cleanup: boolean);
function Parsed(n: integer): string;
function ParsedInt(n: integer): integer;
function ParsedLW(n: integer): longword;
function ParsedBool(n: integer): boolean;
function ParsedFloat(n: integer): single;

function StripAfter(const Text, After: string): string;

var
Tokens: array of string;
Items: integer = 0;

implementation

procedure Explode(str: string; cleanup: boolean);
var
i: integer;
begin

if cleanup = True then
begin
// cleanup old data
for i := 0 to Items - 1 do
begin
Tokens[i] := '';
end;

Items := 0;
setlength(Tokens, Items + 1);

end;

// parse new string
for i := 1 to length(str) do
begin
if ((str[i] = ' ') or (str[i] = ';') or (str[i] = ',')) and (Tokens[Items] <> '') then
begin
Items := Items + 1;
setlength(Tokens, Items + 1);
end
else
Tokens[Items] := Tokens[Items] + trim(string(str[i]));
end;

end;

function Parsed(n: integer): string;
begin
if Items = -1 then
Result := ''
else
Result := Tokens[n];
end;

function ParsedInt(n: integer): integer;
begin
Result := StrToInt(Parsed(n));
end;

function ParsedLW(n: integer): longword;
begin
{$R-}
Result := StrToInt(Parsed(n));
{$R+}
end;

function ParsedBool(n: integer): boolean;
begin
Result := (Parsed(n) = '1') or (lowercase(Parsed(n)) = 'true') or (lowercase(Parsed(n)) = 'yes');
end;

function ParsedFloat(n: integer): single;
begin
Result := strtofloat(Parsed(n));
end;

function StripAfter(const Text, After: string): string;
var
p: integer;
begin
p := pos(After, Text);
if p <> 0 then
Result := copy(Text, 0, p - 1)
else
Result := Text;
end;

end.
[/pascal]

**AthenaOfDelphi** · 06-06-2009, 01:33 AM

Hi Delfi,

Interesting challenge... can you make it faster :-)

Here's my quick attempt.

[pascal]
type
TStringArray = array of string;

procedure explode(src:string;var tokens:TStringArray;cleanup:boolean);
var
idx : integer;
len : integer;
count : integer;
temp : string;
aChar : char;
toklen : integer;
begin
if cleanup then
begin
finalize(tokens);
end;

setlength(tokens,10);
count:=0;
idx:=1;
len:=length(src);
toklen:=length(tokens);
temp:='';

while (idx<=len) do
begin
aChar:=src[idx];
inc(idx);

if (aChar=' ') or (aChar=';') or
(aChar=',') or (aChar='.') or
(aChar=':') or (aChar=#$0d) or
(aChar=#$0a) then
begin
if (temp<>'') then
begin
if (count+1>toklen) then
begin
toklen:=toklen+(toklen div 2);
setlength(tokens,toklen);
end;
tokens[count]:=temp;
temp:='';
inc(count);
end;
end
else
begin
temp:=temp+aChar;
end;
end;

if (temp<>'') then
begin
if (count+1>toklen) then
begin
inc(toklen);
setlength(tokens,toklen);
end;
tokens[count]:=temp;
inc(count);
end;

if (toklen>count) then
begin
setlength(tokens,count);
end;
end;
[/pascal]

Based on my timings, this is I believe a little under twice as fast as your explode routine.

So, why is it faster? Well, this is why I believe it is faster.

First up, I've tried to reduce the number of calls to functions that are relatively slow prefering instead to keep track of the values they would return myself. Thats why I have more variables involved and the code is somewhat longer.

Secondly, setlength is quite a costly routine. So, instead of calling it for each token, I set the token buffer large and then only call it when I run out of room, growing the list by half it's size each time I do (things like TList etc. do something similar I believe). This reduces the time wasted in costly routines.

And third, although I'm not sure how much difference this makes, instead of having an IF that compares string[ x ] against characters, I read the character I'm interested in, store it in a char and then compare that. I say I'm not sure how much difference it makes because I'm no assembly language expert (at least as far as x86 goes) but the code generated by my version looks a lot cleaner consisting only of cmp's and jnz's I believe.

In terms of timing, I timed just the explode routine using the high performance counters. Neither routine were using their own cleanup. Parsing the string "the quick brown fox jumps over the lazy dog", the timings were as follows:-

DescriptionMineYours
Minimum.028ms0.045ms
Maximum3.3ms2.4ms
Average0.032ms0.051ms

I've also just tried parsing a string with 225 tokens in it. Mine took, on average, 0.73ms with yours taking 1.4ms. All these timings were conducted with a loop which parsed the same string 10,000 times.

**arthurprs** · 06-06-2009, 05:42 AM

Here is my version, based on AthenaOfDelphi one

[pascal]function MyExplode(const src: string): TStringArray;
var
idx: Integer;
//len: Integer;
count: Integer;
CharPtr: PChar;
aChar : Char;
toklen: Integer;
f : Integer;
begin
CharPtr := Pointer(src);
if CharPtr = nil then Exit;
idx := 1;
f := 1;
//len := 0;

toklen := 10;
SetLength(Result, toklen);
count := 0;

while CharPtr^ <> #0 do
begin
aChar := CharPtr^;
Inc(CharPtr);

if (aChar = (' ')) or (aChar = (';')) or
(aChar = (',')) or (aChar = ('.')) or
(aChar = (':')) or (aChar = (#$0D)) or
(aChar = (#$0A)) then
begin
if (f <> 0) then
begin
if (count + 1 > toklen) then
begin
toklen := toklen + (toklen div 2);
SetLength(Result, toklen);
end;
Result[count] := Copy(src, f, idx - f);
f := idx + 1;
Inc(count);
end;
end;
Inc(idx);
//Inc(len);
end;

if (idx >= f) then // f < len
begin
if (count + 1 > toklen) then
begin
Inc(toklen);
SetLength(Result, toklen);
end;
Result[count] := Copy(src, f, MaxInt); // (len - f + 1)
Inc(count);
end;

if toklen > count then
SetLength(Result, count);
end;[/pascal]

- turned it into a function for cleaner code and to avoid passing a non empty tokens array to the procedure...
- changed src parameter to const
- used a PChar pointer and Inc() to navigate in src string
- removed length calculation
- concatenating a temp string is slow, so i used a "f" variable to track the locations and then Copy() to get the substrings

4 to 7 times faster than AthenaOfDelphi version

i don't think it can get much faster than this using only common optimizations

Arthur.

**User137** · 06-06-2009, 08:42 AM

Would be nice to know which one is the fastest

I got 2 functions, text parser with free separation character or string, and another function you can use to put any contained values to variables too:

[pascal]function ReadStrings(const s,separator: string; var strings: array of string): integer;
var i,p,l,c2: integer; c,cs,cur: string;
begin
c2:=high(strings);
if c2<1 then begin
result:=0; exit;
end;
if c2>10 then c2:=10;
for i:=0 to c2 do strings[i]:='';
p:=0; l:=length(separator); i:=1; cur:='';
while i<=length(s) do begin
c:=copy(s,i,1); cs:=copy(s,i,l);
if cs=separator then begin
if cur<>'' then begin
strings[p]:=cur; cur:='';
inc(p);
if p>high(strings) then break;
strings[p]:='';
end;
i:=i+l-1;
end else cur:=cur+c;
inc(i);
end;
if cur<>'' then begin
strings[p]:=cur; inc(p);
end;
result:=p;
end;[/pascal]

[pascal]type
TCustomRead = (crString,crInt,crSingle,crDouble,crByte,crWord,
crShortInt,crSmallInt,crCardinal,crBool,crInt64);

function ReadCustom(const s,separator: string; const arr: array of pointer;
const arrt: array of TCustomRead): integer;
var p,arrtl: integer; cur: string; defaultStr: boolean;
procedure SetValue;
begin
if arr[p]=nil then exit;
if defaultStr then string(arr[p]^):=cur
else case arrt[p mod arrtl] of
crString: string(arr[p]^):=cur;
crInt: integer(arr[p]^):=strtointdef(cur,0);
crSingle: single(arr[p]^):=strtofloat(cur);
crDouble: double(arr[p]^):=strtofloat(cur);
crByte: byte(arr[p]^):=strtointdef(cur,0);
crWord: word(arr[p]^):=strtointdef(cur,0);
crShortInt: shortint(arr[p]^):=strtointdef(cur,0);
crSmallInt: smallint(arr[p]^):=strtointdef(cur,0);
crCardinal: cardinal(arr[p]^):=strtointdef(cur,0);
crBool: boolean(arr[p]^):=(cur<>'0') and (lowercase(cur)<>'false');
crInt64: int64(arr[p]^):=strtoint(cur);
end;
end;
var ha,i,l: integer; c,cs: string;
begin
ha:=high(arr);
if ha<1 then begin
result:=0; exit;
end;
defaultStr:=high(arrt)<1; arrtl:=length(arrt);
p:=0; l:=length(separator); i:=1; cur:='';
while i<=length(s) do begin
c:=copy(s,i,1); cs:=copy(s,i,l);
if cs=separator then begin
if cur<>'' then begin
SetValue; cur:=''; inc(p);
if p>ha then break;
end;
inc(i,l-1);
end else cur:=cur+c;
inc(i);
end;
if cur<>'' then begin
SetValue; inc(p);
end;
result:=p;
end;[/pascal]

Edit: There's no button for pascal tags...

**Brainer** · 06-06-2009, 08:55 AM

Here's my try.

[pascal]
procedure SplitString(const AOutput: TStrings; const AText, ASeparator: String);
var
S1, S2: String;
begin
AOutput.Clear();
S2 := AText + ASeparator;
repeat
S1 := Copy(S2, 0, Pos(ASeparator, S2) - 1);
AOutput.Add(S1);
Delete(S2, 1, Length(S1 + ASeparator));
until (S2 = '');
end;
[/pascal]

**AthenaOfDelphi** · 06-06-2009, 11:20 AM

Ok, I've plugged everyones attempt (so far) into my test harness and got some timings from them all.

The average results are as follows:-

UserAverage Time
arthurps0.011ms
Brainer0.023ms
AthenaOfDelphi0.029ms
Delfi0.049ms
User1370.057ms

The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time, and Brainer, you return a string list as opposed to an array of strings.

But, none the less... nice job everyone

Howevere... at the moment, we have a clear winner... arthurps, your version of my routine is neat. As you've identified, I'm wasting time with string concatenation, a very wasteful operation as it reallocates memory for the result string. So, nice job.

Anyone else got anything faster?

**User137** · 06-06-2009, 12:12 PM

Originally Posted by AthenaOfDelphi

The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time...

There may be uses for all different functions, i came from idea that user defines separator and there should never be more than 1. I could for example use '<>' or tab as separator and there are times you don't want ',' or ' ' or anything certain to cut strings.

Thanks for timing test, there seems to be need for optimize.

**JernejL** · 06-06-2009, 12:37 PM

Wow, i didnt realise mine was such rubbish and i didn't actually think there was a lot of room for optimization there, obviously i was very wrong and i'm still learning how to speed up various parts of my game.

arthurprs's code is really nice, it is similar to mine but he uses pointers which is clearly faster. I have some doubts it would work with widestring, which might be a important factor in the future, especially for translating games.

This string parser is a integral part of most games which parse text files and i use this in practically every project i make that reads plaintext files.

I am planning to write a proper token parser soon, which would parse JSON, i kinda dislike XML's syntax and json seems so well human readable and editeable, i'll try to use tricks i see in arthur's code in it.

**AthenaOfDelphi** · 06-06-2009, 01:32 PM

Originally Posted by User137

Originally Posted by AthenaOfDelphi

The one thing I will say is that two of them don't operate in the same way. User137 and Brainer, your routines aren't capable of splitting using a variety of different characters at the same time...

There may be uses for all different functions, i came from idea that user defines separator and there should never be more than 1. I could for example use '<>' or tab as separator and there are times you don't want ',' or ' ' or anything certain to cut strings.

Thanks for timing test, there seems to be need for optimize.

Oh absolutely hon, I would go down the same line as you and have a single separator. I wasn't saying it was a bad thing, I was just pointing out the subtle differences between everyones code.

**AthenaOfDelphi** · 06-06-2009, 01:46 PM

Originally Posted by Delfi

arthurprs's code is really nice, it is similar to mine but he uses pointers which is clearly faster. I have some doubts it would work with widestring, which might be a important factor in the future, especially for translating games.

This string parser is a integral part of most games which parse text files and i use this in practically every project i make that reads plaintext files.

I am planning to write a proper token parser soon, which would parse JSON, i kinda dislike XML's syntax and json seems so well human readable and editeable, i'll try to use tricks i see in arthur's code in it.

The main differences between your code and mine is the optimised sizing of the array. That makes quite a difference I believe as setlength is quite slow. And yeah, Arthur's, taken that and tweaked it some more.

In terms of wide character support, he's used inc, so providing the right variable types are used (widestring and, I think, PWideChar) it should still work.

As for writing a JSON parser... I don't want to discourage anyone from learning, but I also wouldn't encourage re-inventing the wheel (unless of course the wheel you were looking at was square). There is already a JSON parser for Delphi, so if it were me, I'd look at their code, see if could be optimised... and if not... work on something else

Anyhow... maybe when I get the competition site sorted we could have a regular... 'see who can do it fastest/smallest competition'... just a thought

Moderation Process Reminder

Thread: a text parser

Thread Tools

Display

a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Re: a text parser

Bookmarks

Bookmarks

Posting Permissions