The Problem

Adapted from ICPC Indonesia National Contest (INC) 2014, Problem H

Someone is ranting on his Facebook’s wall.

Notice that there are several blue-colored texts in the rant (which are hashtags, since they begin with a hash character / #). However, not all of hash-beginning words and texts which are not counted as hashtags, such as #)(*, #*C), and ##CM (only the #CM part of the text is considered one).

A valid hashtag (in this case, as a guideline for answering this problem) should begin with a hash character (#), following with an alphabet (a-z, A-Z) and optionally, a set of (one or more) alphanumeric characters (a-z, A-Z, 0-9). Hashtags should not immediately follow another hashtag, hence each adjacent hashtag should be separated by a space or non-alphanumeric character.

Invalid: #1234, #$1-13LL, #alpha#beta (only #alpha is valid), #%#{#[}#hello (only #hello is valid)
Valid: #t0D4y1L3aRnED, #alpha£#beta (both #alpha and #beta are considered separate)

You will be given a T number of test cases (1 <= T <= 100), and for each case you will be given an input string to be processed. Output the number of valid hashtags in the format of “Case #X: A”, where X is the case number from 1 and A is the number of valid hashtags, followed by printing every valid hashtags separated by line breaks (\n).

The Regex-based solution

Behold,

(#[A-Za-z][0-9A-Za-z]*)(#[A-Za-z][0-9A-Za-z]*){0,1}

This regular expression pattern utilizes capture groups to detect individual hashtags. The pattern is structured into

(< Capture Group 1 >)(< Capture Group 2 >){0,1}

The first capture group contains the following pattern.

#[A-Za-z][0-9A-Za-z]*

If you’re familiar with regular expressions, you can clearly understand that the above pattern exactly matches the aforementioned hashtag rules:

Started with #,
then followed with a alphabetic character ([A-Za-z]),
and finally followed with any (0 or more) alphanumeric characters ([0-9A-Za-z]*).

The second capture group simply repeats the first one, and it is intended to capture valid hashtags which follow any valid hashtag, such as #alpha#beta. {0,1} is then appended on the entire regex pattern to ensure that the second group is optional, so any other isolated valid hashtags can still be detected, such as 2847#helloworld:$24. This also prevents strings such as #hello#to#the#world to be detected as one hashtag, since according to the rules the string should be treated as 2 separate ones (#hello and #the).

Competitive programming competitions such as ICPC and INC allows submissions in C, C++, Java or Python. And fortunately, once we have found the valid regex pattern, we can directly implement it into C++, Java and Python code using standard regex libraries. Note that in order to answer the original question, you’ll need to get/extract the result using the first capture group, or else the judging system will consider a wrong answer.

Conclusion

I believe that this problem can also be solved in many ways, perhaps by utilizing certain string matching algorithms to optimize the hashtag-counting process. However, it’s quite interesting that this problem can be solved by a one-liner regular expression pattern, which is acceptable in many programming languages. I wonder whether and how quick the original contestants realized that a regex-based solution is possible for this problem.

This Competitive Programming question can be solved by a simple Regular Expression!

The Problem

The Regex-based solution

Conclusion