09 February 2014
Imagine you have a string s = "01234". You want to extract the substring “123” from it.
The example string is deliberately simple: each character represents its own index.
That means that to get the substring “123”, the start index is 1 and the end index is 3, right? Unfortunately,
this is wrong in most programming languages out there. Let me illustrate:
Language | Code | Output |
---|---|---|
Python | s[1:3] | 12 |
Java | s.substring(1, 3); | 12 |
JavaScript | s.slice(1, 3) | 12 |
Observe that we get “12” instead of “123”. The correct end index is in fact 4. Why this strange behavior? Because
the substring method’s end index is exclusive.
Without a rationale, our little brains quickly forget this behavior. I will rationalize it so that it becomes easier to remember.
Two conventions could be used to extract the substring “123” (from here on I will use only Python, for shorter code). The two conventions are:
Convention 1) the end index is exclusive: s[1:4]
Convention 2) the end index is inclusive: s[1:3]
Clearly there are good reasons why the language creators preferred Convention 1).
To support this, I will borrow some of the arguments from Dijkstra’s essay Why numbering should start at zero, since the two topics are closely related. One advantage the essay points out:
the difference between the bounds as mentioned equals the length of the subsequence
How is this related to our problem? Observe the following in Convention 1): s[1:4] yields “123”, and 4 - 1 = 3, which is the length of “123”.
Yes, the difference between the end index and the start index equals the length of the substring.
endIndex - startIndex == len(substring)
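The formula can be checked directly in Python. This is only a small sketch using the string from this post; it walks over every valid pair of indexes and confirms that the slice length always matches the difference.

```python
s = "01234"

# With an exclusive end index, the slice length is end - start.
substring = s[1:4]
assert substring == "123"
assert len(substring) == 4 - 1

# The same holds for every pair of indexes with 0 <= start <= end <= len(s).
for start in range(len(s) + 1):
    for end in range(start, len(s) + 1):
        assert len(s[start:end]) == end - start
```

Note that an empty slice such as s[2:2] fits the formula too: its length is 2 - 2 = 0.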
To understand the value of this formula better, let’s see what our code looks like under both conventions for the use cases below.
Use case | Convention 1) | Convention 2) |
---|---|---|
a) First x chars | s[0:0 + x] | s[0:0 + x - 1] |
b) Last x chars | s[len(s) - x:len(s)] | s[len(s) - x:len(s) - 1] |
c) Index x with length of y | s[x:x + y] | s[x:x + y - 1] |
d) Index x to index of separator y | s[x:s.index(y)] | s[x:s.index(y) - 1] |
e) Index x to index y | s[x:y + 1] | s[x:y] |
Observe that in the first four use cases, Convention 2) looks ugly, while Convention 1) expresses them cleanly, with no stray "- 1". Hence Convention 1), with its exclusive end index, is seen as the more intuitive way to slice a string: it is prettier for the more common use cases.
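Here is how the five use cases look in real Python, which follows Convention 1). This is a sketch only: the sample string, the variable names, and the "-" separator are made up for illustration.

```python
# Sample string: each digit still equals its index; "-" sits at index 5.
s = "01234-6789"
x = 3

first_x = s[0:0 + x]                # a) first x chars
last_x = s[len(s) - x:len(s)]       # b) last x chars
from_x_len_2 = s[x:x + 2]           # c) index x with length 2
up_to_sep = s[1:s.index("-")]       # d) index 1 up to the separator "-"
index_1_to_3 = s[1:3 + 1]           # e) index 1 to index 3, inclusive

print(first_x)       # 012
print(last_x)        # 789
print(from_x_len_2)  # 34
print(up_to_sep)     # 1234
print(index_1_to_3)  # 123
```

Only use case e) needs the "+ 1" adjustment; the other four read naturally.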
Now, you may have noticed that the last use case looks prettier with Convention 2). Use case e) is usually a problem that has already been solved by eye and only needs to be coded into a program. This is exactly the problem posed at the very beginning of this post: we already know that we want to extract “123” before we even write the code. The caret index method will help you find the end index for this particular use case.
Finding the end index for a predetermined character is easy with the caret index method. It requires only a small tweak to the definition of an index: see the index as the position of a caret sitting between characters, not as the position of a character.
Therefore:
|01234 # caret index 0
0|1234 # caret index 1
0123|4 # caret index 4
01234| # caret index 5
Going back to the initial problem, “123” is extracted like so:
0|123|4 # caret start index 1, caret end index 4
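The caret picture maps directly onto a Python slice. Below is a minimal sketch; show_carets and caret_indexes are hypothetical helper names invented for this illustration, not part of any library.

```python
def show_carets(s, start, end):
    """Draw a caret at each caret index; the text between them is s[start:end]."""
    return s[:start] + "|" + s[start:end] + "|" + s[end:]


def caret_indexes(s, sub):
    """Return the (start, end) caret indexes that enclose the first occurrence of sub."""
    start = s.index(sub)          # caret just before the first character of sub
    return start, start + len(sub)  # caret just after the last character of sub


s = "01234"
start, end = caret_indexes(s, "123")

print(start, end)                  # 1 4
print(show_carets(s, start, end))  # 0|123|4
print(s[start:end])                # 123
```

The two caret indexes are exactly the arguments of the slice, so no "+ 1" has to be remembered.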
Be careful when you are using Groovy. If you slice the string the Groovy way, you will get “123”:
s[1..3]
I’m not going to say that Groovy does not adhere to Convention 1); it is more of
a gotcha. The 1..3 represents a range, and it is more intuitive for a range to include 3 when you read it as
“1 until 3”. Do the following to make the end index exclusive:
s[1..<3]