Thursday, October 04, 2007

String Splitting and Tokenization Techniques in .NET

At start I'll give the explanation what token and tokenization is.

According to wikipedia article: "A token is a categorized block of text, usually consisting of indivisible characters known as lexemes. A lexical analyzer processes lexemes to categorize them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: English, gibberish symbols, anything; it just needs to be a useful part of the structured text."

In .NET string class has a Split method. It returns an array of strings that are substrings delimited by a specified separator. Split method is convenient, however somewhat inefficient. Let us consider an example.

Input string is "59;21;67;72;111". String.Split(';') will return an array of five strings. But what will happen if we need to find specific token, say "21". String.Split will return an array, and then we'll have to iterate through it and search for "21". Do you notice pitfall here? No?

String.Spilt inefficiency comes from its API contract - it returns an array of string, every item of which consumes memory. But we're searching only for "21", we do not need other tokens.

This example shows inappropriate scenario for String.Split(...) method.

Other approach for string tokenizing is using special tokenizers that are returning one token at a time. An example of such tokenizer can be StringTokenizer class from Java world.
Sample code in Java:

StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}

We can see that StringTokenizer is returning one token at a time. It is extremely efficient for token search scenarios as we may not need to iterate through all the tokens thus allocate less memory and perform faster.

In .NET base class library (BCL) there is no such class. But there are 3-d party analogs or ports of Java StringTokenizer.

.NET BCL actually, has very powerful set of classes that can cope with string tokenization scenario easily. These classes "live" in System.Text.RegularExpressions.
Good example of string tokenization using Regex class is provided by TrackerRealm design team in their blog.

So, there are at least three ways in .NET how strings can be split into tokens. There is a rule of thumb here, if algorithm doesn't need all tokens from the string - do not use String.Split.

No comments:

Post a Comment