Parsing search terms from CGI.http_referer using regular expressions



Today someone on the House of Fusion mailing list asked how to parse the search terms out of an incoming referrer link. Several people suggested looping over the referrer, among other options. Then I put forth the idea of using a simple regex to extract the required string. Here's what I came up with. Maybe it'll help you too.

The pattern is [?|&]q=[^&]+. It searches for the literal string 'q=' immediately preceded by a ? or an &, then matches any text after the = up to, but not including, the next &.

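  • [?|&] Either ? or &. (Inside a character class the | is a literal pipe rather than "or", so this also matches a |, and [?&] would be enough.)
  • q= Literal string, used to know where to start the match
  • [^&]+ The actual string to match: one or more characters, none of which is an &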
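
As a rough illustration of the breakdown above, here is a minimal sketch in Python (not CFML, chosen only because it is easy to run; the referrer URL and variable names are made up for the example). It shows that the full match carries the leading delimiter and q= along with the search terms, so they have to be trimmed off afterwards.

```python
import re

# Hypothetical referrer URL, invented for illustration.
referrer = "http://www.google.com/search?hl=en&q=coldfusion+regex&start=10"

# The pattern described above: ? or & (or a literal |), then q=,
# then one or more characters that are not &.
match = re.search(r"[?|&]q=[^&]+", referrer)

full_match = match.group(0)    # includes the delimiter and "q="
search_terms = full_match[3:]  # drop the 3-character prefix ("&q=" here)

print(full_match)    # &q=coldfusion+regex
print(search_terms)  # coldfusion+regex
```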

That worked really well, but I didn't like the fact that the q= (along with the leading ? or &) is also returned in the match, because that meant I'd have to strip it from my final result. So I asked on Twitter and was referred to Ben Nadel's post on REMatchGroup, where I found out about positive look-behinds. That did the trick. So here's the new regex.

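This regular expression, (?<=[?|&]q=)[^&]+, uses a feature called a positive look-behind. It basically says "only match the target string if it's immediately preceded by another string", and that preceding string is not included in the match. (A negative look-behind, written (?<!...), does the opposite: it matches only when the preceding string is absent.) Let's break it down.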

  • ( Opens the look-behind group
  • ?<= Marks the group as a positive look-behind
  • [?|&]q= Matches ? or & followed by the literal string q=
  • ) Closes the look-behind group
  • [^&]+ The actual string to match: one or more characters, none of which is an &
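
To see the look-behind in action, here is a small sketch, again in Python rather than CFML and again with invented URLs. The helper name extract_search_terms is hypothetical. Python's re module only allows fixed-width look-behinds, which this one is (three characters), so the pattern carries over unchanged.

```python
import re

# (?<=[?|&]q=) asserts that "?q=" or "&q=" sits immediately before the
# match without making it part of the match.
SEARCH_TERMS = re.compile(r"(?<=[?|&]q=)[^&]+")

def extract_search_terms(referrer):
    """Return the raw value of the q parameter, or "" if there is none.

    Hypothetical helper, named for this example only.
    """
    match = SEARCH_TERMS.search(referrer)
    return match.group(0) if match else ""

# Made-up referrer URLs for illustration.
print(extract_search_terms("http://www.google.com/search?hl=en&q=coldfusion+regex&start=10"))
# coldfusion+regex

# "aq=" is skipped because the q is not directly preceded by ? or &.
print(extract_search_terms("http://example.com/search?aq=f&q=look+behind"))
# look+behind

# No q parameter at all gives an empty string.
print(extract_search_terms("http://example.com/page?id=7"))
```

The value comes back still URL-encoded (for example + in place of spaces), so it needs decoding before display.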

Hope this helped you out. It was a great challenge, and I learned something new about regex along the way.
