Be Careful with String’s Substring Method in Java

Jeremy Grifski - Mar 1 '19 - Dev Community

Every once in a while, I’ll come across a well-established library in a programming language that has its quirks. As an instructor, I have to make sure I’m aware of these quirks when I’m teaching. For instance, last time I talked a bit about the various Scanner input methods and how they don’t all behave the same way. Well, today I want to talk about the substring method from Java’s String library.

Documentation

When using a library for the first time, I find it useful to check out the documentation. But with a library so established, it sometimes feels silly to dig into the documentation. After all, a lot of languages support strings. Personally, all I need to know is the name of the command before I can figure out the rest.

However, every once in a while, I’ll come across a function that is less intuitive than I thought. In this case, I’m talking about Java’s substring method. As you can probably imagine, it grabs a substring from a string and returns it. So, what’s the catch?

Well for starters, the substring method is actually an overloaded method. As a result, there are two different forms of the same method in the documentation. Take a look:

public String substring(int beginIndex)

Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.

Java API, 2019

public String substring(int beginIndex, int endIndex)

Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex.

Java API, 2019

At this point, don’t fixate too much on their descriptions as we’ll get to those. Just be aware that there are two different versions of the same method.

Usage

Now, I’d like to take a moment to show how to use the substring method. If this is your first time poking around the Java API, this would be a good time to follow along.

First, notice that the method header does not contain the static keyword. In other words, substring is an instance method, which makes sense. We need an instance of a string in order to get a substring:

String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);

In this example, we’ve created two new substrings: one from position 7 to the end and the other from position 0 to position 5. Without looking at the documentation, can you figure out what the resulting strings will be?

Interval Notation

Before I give away the answer, I think it’s important to discuss some terminology from mathematics. In particular, I’d like to talk a bit about interval notation.

In interval notation, the goal is to explicitly state the range of some subset. For instance, we may be interested in all integers greater than 0. In interval notation, that would look something like:

(0, +∞)

In this example, we’ve chosen to exclude the value of 0 from the range using parentheses. We could have just as easily defined the interval starting with 1—pay attention to the brackets:

[1, +∞)

In either case, we’re describing the same set: all integers greater than 0.

So, how does this tie into the substring method? As it turns out, a substring is a contiguous run of characters from a string, so we can use interval notation over the character indices to define our substring. Why don’t we try a couple examples? Given “Hello, World!”, determine the substring using the following intervals:

  • [0, 2]
  • (0, 5]
  • (1, 3)
  • (-1, 7]

Once you’re done, check out the answers below:

  • “Hel”
  • “ello,”
  • “l”
  • “Hello, W”

We’ll need to keep this idea in the back of our mind moving forward.
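
To make the mapping from interval notation to character indices concrete, here is a minimal sketch. The slice helper and its boolean flags are hypothetical names made up for this illustration and are not part of the Java API. It turns each bound into the first and last index inside the interval, then copies those characters one at a time:

```java
class IntervalSlice {
    // Hypothetical helper (not part of the Java API): returns the characters
    // of s whose indices fall inside the interval described by the arguments.
    static String slice(String s, int lo, boolean loInclusive, int hi, boolean hiInclusive) {
        int first = loInclusive ? lo : lo + 1; // first index inside the interval
        int last = hiInclusive ? hi : hi - 1;  // last index inside the interval
        StringBuilder sb = new StringBuilder();
        for (int i = first; i <= last; i++) {
            sb.append(s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "Hello, World!";
        System.out.println(slice(str, 0, true, 2, true));   // [0, 2]  prints Hel
        System.out.println(slice(str, 0, false, 5, true));  // (0, 5]  prints ello,
        System.out.println(slice(str, 1, false, 3, false)); // (1, 3)  prints l
        System.out.println(slice(str, -1, false, 7, true)); // (-1, 7] prints Hello, W
    }
}
```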

The Truth

The truth of the matter is the substring method is a bit weird. On one hand, we can use a single index to specify the starting point of our new substring. On the other hand, we can use two indices to grab an arbitrary subset of a string.

However, in practice, I find that the second option gives a lot of students trouble, and I don’t blame them. After all, the bounds are deceptive. For example, let’s revisit some code from above:

String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);

Here, we can confidently predict that subOne has a value of “World!”, and we’d be right. After all, index 7 is ‘W’, and the method automatically grabs the rest of the string.

As for subTwo, we’d probably guess “Hello,” and we’d be incorrect. It’s actually “Hello” because the end index is exclusive (i.e., [0, 5)). In the next section, we’ll take a look at why that is and how I feel about it.
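
If you’d like to see the bounds in action, here is a small sketch you can run. The class name and the isOutOfRange helper are my own assumptions for illustration; the only library method involved is substring itself, which is documented to throw an IndexOutOfBoundsException when the end index is larger than the length of the string:

```java
class SubstringBounds {
    // Hypothetical helper: reports whether substring rejects the given bounds.
    static boolean isOutOfRange(String s, int beginIndex, int endIndex) {
        try {
            s.substring(beginIndex, endIndex);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String str = "Hello, World!";

        System.out.println(str.substring(7));    // World!
        System.out.println(str.substring(0, 5)); // Hello   (indices 0 through 4)
        System.out.println(str.substring(0, 6)); // Hello,  (indices 0 through 5)

        // The end index may equal the length because that position is never read.
        System.out.println(str.substring(0, str.length())); // Hello, World!

        // One past the length is rejected.
        System.out.println(isOutOfRange(str, 0, str.length() + 1)); // true
    }
}
```

Notice that grabbing the comma takes an end index of 6, one past the last character we actually want.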

My Take

From what I understand, the inclusive/exclusive model is the standard for ranges in the Java API. That said, I do occasionally question the design choice.

On one hand, there’s the advantage of being able to use the length of the string as the end point of the substring:

String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
String newtonTheory = jokerQuote.substring(30, jokerQuote.length());

But, is this really necessary? Java already provides an overload to the substring method which captures exactly this behavior.
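
As a quick sanity check, the sketch below compares the two overloads on the same quote. The class and the sameTail helper are hypothetical names used only for this example:

```java
class OverloadEquivalence {
    // Hypothetical helper: true if the one-argument overload matches the
    // two-argument overload when the end index is the length of the string.
    static boolean sameTail(String s, int beginIndex) {
        return s.substring(beginIndex).equals(s.substring(beginIndex, s.length()));
    }

    public static void main(String[] args) {
        String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
        System.out.println(jokerQuote.substring(30)); // gravity, all it takes is a little push.
        System.out.println(sameTail(jokerQuote, 30)); // true
    }
}
```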

That said, there is a nice mathematical explanation for this notation, and part of it has to do with the difference between the starting and ending points. In particular, subtracting the begin index from the end index gives the length of the new substring:

int length = endIndex - beginIndex;

In addition, this particular notation allows adjacent substrings to share a boundary index. The end index of one piece is the begin index of the next, with no overlap and no gap:

String s = "Luck is great, but most of life is hard work.";
String whole = s.substring(0, s.length()/2) + s.substring(s.length()/2, s.length());
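
For anyone who wants to convince themselves, here is a rough brute-force sketch that checks both properties for every valid pair of indices. The class and method names are hypothetical and exist only for this illustration:

```java
class HalfOpenProperties {
    // Hypothetical check: every valid (begin, end) pair yields a substring
    // whose length is exactly end - begin.
    static boolean lengthMatchesDifference(String s) {
        for (int begin = 0; begin <= s.length(); begin++) {
            for (int end = begin; end <= s.length(); end++) {
                if (s.substring(begin, end).length() != end - begin) {
                    return false;
                }
            }
        }
        return true;
    }

    // Hypothetical check: splitting s at any index k and gluing the two
    // pieces back together restores the original string.
    static boolean splitsRejoin(String s) {
        for (int k = 0; k <= s.length(); k++) {
            if (!(s.substring(0, k) + s.substring(k)).equals(s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "Luck is great, but most of life is hard work.";
        System.out.println(lengthMatchesDifference(s)); // true
        System.out.println(splitsRejoin(s));            // true
    }
}
```

The split works at any index, not just the midpoint, including 0 and s.length(), where one of the two pieces is simply the empty string.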

Both of these properties are nice, but I think they're likely a byproduct of indexing by zero (perpetuated by Dijkstra) which isn't all that intuitive either. And for those of you who are going to take exception to that comment, be aware that I'm all for indexing by zero and this inclusive/exclusive subset convention.

All I'm trying to say is that I've seen my own students get tripped up over both conventions, so I feel for them in a way. That's why I went to such lengths to write this article in the first place.

Let me know if you feel the same or if I’m totally off base. Otherwise, thanks for taking some time to read my work. I hope you enjoyed it!
