Introduction to Strings
Definition and Key Concepts
Strings:
Strings are sequences of characters used to represent text. They are one of the most fundamental data types in programming and are used for various purposes such as storing and manipulating text data.
Key Concepts:
- String Representation: Strings can be represented using different character encodings such as ASCII and Unicode.
- Immutability: In many programming languages, strings are immutable, meaning once created, their content cannot be changed.
- String Indexing: Strings are typically indexed starting from 0, allowing access to individual characters.
String Operations and Characteristics
Basic Operations:
- Creation: Strings can be created using literal notation or constructors.
- Concatenation: Combining two or more strings using operators or methods.
- Repetition: Repeating a string a specified number of times.
Characteristics:
- Indexing and Slicing: Accessing individual characters and substrings.
- Immutability: Strings cannot be changed after creation in some languages.
- Character Encoding: Understanding encoding schemes such as UTF-8 for multilingual text support.
Differences between String Handling Paradigms
Strings vs. Character Arrays:
- Strings: Higher-level abstraction with built-in methods for manipulation and querying.
- Character Arrays: Lower-level data structure where individual characters can be manipulated directly.
Example:
-
Strings: Java
String
, Pythonstr
-
Character Arrays: C
char[]
Strings vs. StringBuilder/StringBuffer:
- Strings: Immutable, resulting in new string objects for every modification.
- StringBuilder/StringBuffer: Mutable, allowing modification of strings without creating new objects.
Example:
-
StringBuilder/StringBuffer: Java
StringBuilder
, C++std::stringstream
Examples for String Matching Algorithms
String matching algorithms are used to find occurrences of a substring (pattern) within a larger string (text). Below are implementations of three popular string matching algorithms: Brute Force, Knuth-Morris-Pratt (KMP), Boyer-Moore, and Rabin-Karp.
Brute Force String Matching
The Brute Force algorithm checks for the presence of the pattern in the text by sliding the pattern over the text one character at a time and comparing characters.
Here’s a Python implementation:
def brute_force_string_match(text, pattern):
n = len(text)
m = len(pattern)
# Iterate through each possible starting position in text
for i in range(n - m + 1):
# Check for pattern match starting at position i
for j in range(m):
if text[i + j] != pattern[j]:
break
else:
# If inner loop completes without breaking, a match is found
print(f"Pattern found at index {i}")
# Example usage
text = "ababcababcabcabc"
pattern = "abc"
brute_force_string_match(text, pattern)
Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves efficiency by avoiding unnecessary comparisons using a partial match table (also known as the Longest Prefix Suffix or LPS array) to skip over characters that have already been matched.
Here’s a Python implementation:
def get_lps(pattern):
lps = [0] * len(pattern)
length = 0
i = 1
while i < len(pattern):
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
def find_all_matches_kmp(text, pattern):
n, m = len(text), len(pattern)
if m > n:
return []
if m == n:
return [0] if pattern == text else []
lps = get_lps(pattern)
i, j = 0, 0
matches = []
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j - 1]
elif i < n and text[i] != pattern[j]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
# Example usage
text = "abxabcabcabydhjbjabcabyjfd"
pattern = "abcaby"
result = find_all_matches_kmp(text, pattern)
print(result)
Boyer-Moore Algorithm
The Boyer-Moore algorithm improves search efficiency by skipping sections of the text that are guaranteed not to contain the pattern, using two heuristics: the Bad Character Rule and the Good Suffix Rule.
Here’s a Python implementation:
def preprocess_bad_character(pattern):
bad_char = {}
m = len(pattern)
for i in range(m):
bad_char[pattern[i]] = i
return bad_char
def preprocess_good_suffix(pattern):
m = len(pattern)
good_suffix = [0] * m
f = [-1] * m
k = m - 1
for j in range(m - 2, -1, -1):
if pattern[j + 1] != pattern[k]:
k = j + 1
f[j] = k
for i in range(m - 1):
good_suffix[m - f[i] - 1] = i + 1
return good_suffix
def boyer_moore_search(text, pattern):
n, m = len(text), len(pattern)
if m > n:
return []
bad_char = preprocess_bad_character(pattern)
good_suffix = preprocess_good_suffix(pattern)
matches = []
i = 0
while i <= n - m:
j = m - 1
while j >= 0 and pattern[j] == text[i + j]:
j -= 1
if j < 0:
matches.append(i)
i += good_suffix[0] if i + m < n else 1
else:
i += max(good_suffix[j + 1] if j + 1 < len(good_suffix) else 1,
j - bad_char.get(text[i + j], -1))
return matches
# Example usage
text = "abxabcabcabydhjbjabcabyjfd"
pattern = "abcaby"
result = boyer_moore_search(text, pattern)
print(result)
Rabin-Karp Algorithm
The Rabin-Karp algorithm uses hashing to find the pattern in the text. It computes hash values for the pattern and all substrings of the text of the same length, and compares them.
Here’s a Python implementation:
def rabin_karp(text, pattern, d=256, q=101):
n, m = len(text), len(pattern)
if m > n:
return []
# Precompute hash values
p = 0 # Hash value for pattern
t = 0 # Hash value for text
h = 1 # The value of d^(m-1) % q
matches = []
# Compute the hash value of pattern and first window of text
for i in range(m):
p = (d * p + ord(pattern[i])) % q
t = (d * t + ord(text[i])) % q
if i < m - 1:
h = (h * d) % q
# Check for matches
for i in range(n - m + 1):
if p == t:
if text[i:i + m] == pattern:
matches.append(i)
if i < n - m:
t = (d * (t - ord(text[i]) * h) + ord(text[i + m])) % q
t = (t + q) % q # Ensure t is non-negative
return matches
# Example usage
text = "abcabcabcdabcdabc"
pattern = "abcd"
result = rabin_karp(text, pattern)
print(result)