Article: Regex performance, removing some characters

Comparing Regex.Replace with explicit string operations.

category 'perfo test', language C#, created 26-Jun-2009, version V1.0, by Luc Pattyn


License: The author hereby grants you a worldwide, non-exclusive license to use and redistribute the files and the source code in the article in any way you see fit, provided you keep the copyright notice in place; when code modifications are applied, the notice must reflect that. The author retains copyright to the article, you may not republish or otherwise make available the article, in whole or in part, without the prior written consent of the author.

Disclaimer: This work is provided as is, without any express or implied warranties or conditions or guarantees. You, the user, assume all risk in its use. In no event will the author be liable to you on any legal theory for any special, incidental, consequential, punitive or exemplary damages arising out of this license or the use of the work or otherwise.


The Regex class is a powerful tool for performing string operations; however, I have always been suspicious of its performance. This little experiment compares several ways of stripping a set of characters from a given string.

The test program

The environment used is the Microsoft .NET Framework (version 2.0 or above) and the C# programming language.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace RegexTest {
	class Program {
		private static int REPEAT=1000000;
		private static Stopwatch sw=new Stopwatch();
		private static List<string> logs=new List<string>();

		static void Main(string[] args) {
			string s="3232323 sdsadsd 171617181 sddsddfe 323243";
			string sremove="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) Regex.Replace(s, "[a-z]", "");
			sw.Stop();
			log("regex(26)="+sw.ElapsedMilliseconds);

			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) Regex.Replace(s, "[A-Za-z]", "");
			sw.Stop();
			log("regex(52)="+sw.ElapsedMilliseconds);

			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) Regex.Replace(s, "[A-Za-z!@#]", "");
			sw.Stop();
			log("regex(55)="+sw.ElapsedMilliseconds);

			// compiled regex, constructor time included in the measurement
			sw.Reset();
			sw.Start();
			Regex regex=new Regex("[A-Za-z!@#]", RegexOptions.Compiled);
			for (int i=0; i<REPEAT; i++) regex.Replace(s, "");
			sw.Stop();
			log("regex(compiled-incl)="+sw.ElapsedMilliseconds);

			// compiled regex, constructed before the stopwatch starts
			Regex regex2=new Regex("[A-Za-z!@#]", RegexOptions.Compiled);
			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) regex2.Replace(s, "");
			sw.Stop();
			log("regex(compiled-excl)="+sw.ElapsedMilliseconds);

			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) RemoveLetters(s);
			sw.Stop();
			log("for+isalpha="+sw.ElapsedMilliseconds);

			sw.Reset();
			sw.Start();
			for (int i=0; i<REPEAT; i++) RemoveLetters(s, sremove);
			sw.Stop();
			log("for+IndexOf="+sw.ElapsedMilliseconds);

			File.WriteAllLines("RegexTest.txt", logs.ToArray());
			log("Done (hit ENTER to exit)");
			Console.ReadLine();
		}
		
		public static void log(string s) {
			Console.WriteLine(s);
			logs.Add(s);
		}

		// Removes every character that char.IsLetter classifies as a letter
		// (this covers all Unicode letters, not only A-Z and a-z).
		public static string RemoveLetters(string original) {
			StringBuilder sb=new StringBuilder();
			for (int i=0; i<original.Length; i++) {
				if (!char.IsLetter(original, i)) sb.Append(original[i]);
			}
			return sb.ToString();
		}

		// Removes every character that occurs in the string 'remove'.
		public static string RemoveLetters(string original, string remove) {
			StringBuilder sb=new StringBuilder();
			foreach (char c in original) {
				if (remove.IndexOf(c)<0) sb.Append(c);
			}
			return sb.ToString();
		}
	}
}
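
The two compiled measurements are meant to differ only in whether the Regex constructor runs inside the timed region. The following is a minimal sketch of that distinction, not the code that produced the results in this article. It assumes .NET 3.5 or later (for the non-generic Action delegate and lambdas); the names TimingSketch and Measure are made up for illustration, and the repeat count is lowered to keep the demo short.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original test program.
public static class TimingSketch {
	// Runs 'body' once and returns the elapsed time in milliseconds.
	public static long Measure(Action body) {
		Stopwatch sw=Stopwatch.StartNew();
		body();
		sw.Stop();
		return sw.ElapsedMilliseconds;
	}

	public static void Main() {
		string s="3232323 sdsadsd 171617181 sddsddfe 323243";
		string pattern="[A-Za-z!@#]";
		int repeat=100000;	// assumption: smaller than the article's 1000000

		// Constructor inside the timed region.
		long incl=Measure(() => {
			Regex r=new Regex(pattern, RegexOptions.Compiled);
			for (int i=0; i<repeat; i++) r.Replace(s, "");
		});

		// Constructor before the timed region.
		Regex prebuilt=new Regex(pattern, RegexOptions.Compiled);
		long excl=Measure(() => {
			for (int i=0; i<repeat; i++) prebuilt.Replace(s, "");
		});

		Console.WriteLine("regex(compiled-incl)="+incl);
		Console.WriteLine("regex(compiled-excl)="+excl);
	}
}
```

Keeping the loop inside the timed delegate avoids adding one delegate call per iteration, which would otherwise distort the comparison.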

Results

This is what got logged; all times are in milliseconds, each for one million iterations:

regex(26)=8620
regex(52)=8698
regex(55)=8820
regex(compiled-incl)=5845
regex(compiled-excl)=5751
for+isalpha=1158
for+IndexOf=3500

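Removing all letters with explicit code based on char.IsLetter is about five times faster than the compiled regex and more than seven times faster than the static Regex.Replace calls. Removing an arbitrary set of characters using IndexOf is roughly 1.6 to 2.5 times faster than any regex attempt, even the compiled-excl run, which is meant to leave the constructor out of the measurement.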
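
The timings only compare like with like if the variants produce the same output. Here is a small sketch that checks this, assuming the three approaches are wrapped in helpers with the hypothetical names ByRegex, ByIsLetter and ByIndexOf (they mirror the test program but are not taken from it).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class for comparing the outputs of the three approaches.
public static class StripCheck {
	public const string Letters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

	// Regex variant: removes ASCII letters only.
	public static string ByRegex(string s) {
		return Regex.Replace(s, "[A-Za-z]", "");
	}

	// char.IsLetter variant: removes every Unicode letter.
	public static string ByIsLetter(string s) {
		StringBuilder sb=new StringBuilder();
		for (int i=0; i<s.Length; i++) {
			if (!char.IsLetter(s, i)) sb.Append(s[i]);
		}
		return sb.ToString();
	}

	// IndexOf variant: removes exactly the characters listed in 'remove'.
	public static string ByIndexOf(string s, string remove) {
		StringBuilder sb=new StringBuilder();
		foreach (char c in s) {
			if (remove.IndexOf(c)<0) sb.Append(c);
		}
		return sb.ToString();
	}

	public static void Main() {
		string ascii="3232323 sdsadsd 171617181 sddsddfe 323243";
		string accented="caf\u00e9 42";	// contains a non-ASCII letter

		Console.WriteLine(ByRegex(ascii)==ByIsLetter(ascii));
		Console.WriteLine(ByRegex(ascii)==ByIndexOf(ascii, Letters));
		Console.WriteLine(ByRegex(accented)==ByIsLetter(accented));
	}
}
```

On the article's sample string all three helpers agree. On input with non-ASCII letters they do not: char.IsLetter also removes the accented letter, while the regex and IndexOf variants keep it.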

Conclusion

When performance is key, I will consider writing some C# code rather than using Regex unless the job at hand is sufficiently complex to benefit from its expressive power.



Perceler

Copyright © 2012, Luc Pattyn

Last Modified 04-May-2025