Strings, Design and Analysis of Algorithms

Harsh Mishra - Aug 2 - - Dev Community

Introduction to Strings

Definition and Key Concepts

Strings:
Strings are sequences of characters used to represent text. They are one of the most fundamental data types in programming and are used for various purposes such as storing and manipulating text data.

Key Concepts:

  1. String Representation: Strings can be represented using different character encodings such as ASCII and Unicode.
  2. Immutability: In many programming languages, strings are immutable, meaning once created, their content cannot be changed.
  3. String Indexing: Strings are typically indexed starting from 0, allowing access to individual characters.

String Operations and Characteristics

Basic Operations:

  • Creation: Strings can be created using literal notation or constructors.
  • Concatenation: Combining two or more strings using operators or methods.
  • Repetition: Repeating a string a specified number of times.

Characteristics:

  • Indexing and Slicing: Accessing individual characters and substrings.
  • Immutability: Strings cannot be changed after creation in some languages.
  • Character Encoding: Understanding encoding schemes such as UTF-8 for multilingual text support.

Differences between String Handling Paradigms

Strings vs. Character Arrays:

  • Strings: Higher-level abstraction with built-in methods for manipulation and querying.
  • Character Arrays: Lower-level data structure where individual characters can be manipulated directly.

Example:

  • Strings: Java String, Python str
  • Character Arrays: C char[]

Strings vs. StringBuilder/StringBuffer:

  • Strings: Immutable, resulting in new string objects for every modification.
  • StringBuilder/StringBuffer: Mutable, allowing modification of strings without creating new objects.

Example:

  • StringBuilder/StringBuffer: Java StringBuilder, C++ std::stringstream

Examples for String Matching Algorithms

String matching algorithms are used to find occurrences of a substring (pattern) within a larger string (text). Below are implementations of three popular string matching algorithms: Brute Force, Knuth-Morris-Pratt (KMP), Boyer-Moore, and Rabin-Karp.

Brute Force String Matching

The Brute Force algorithm checks for the presence of the pattern in the text by sliding the pattern over the text one character at a time and comparing characters.

Here’s a Python implementation:

def brute_force_string_match(text, pattern):
    n = len(text)
    m = len(pattern)

    # Iterate through each possible starting position in text
    for i in range(n - m + 1):
        # Check for pattern match starting at position i
        for j in range(m):
            if text[i + j] != pattern[j]:
                break
        else:
            # If inner loop completes without breaking, a match is found
            print(f"Pattern found at index {i}")

# Example usage
text = "ababcababcabcabc"
pattern = "abc"
brute_force_string_match(text, pattern)
Enter fullscreen mode Exit fullscreen mode

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm improves efficiency by avoiding unnecessary comparisons using a partial match table (also known as the Longest Prefix Suffix or LPS array) to skip over characters that have already been matched.

Here’s a Python implementation:

def get_lps(pattern):
    lps = [0] * len(pattern)
    length = 0
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def find_all_matches_kmp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    if m == n:
        return [0] if pattern == text else []

    lps = get_lps(pattern)
    i, j = 0, 0
    matches = []
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1

        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1

    return matches

# Example usage
text = "abxabcabcabydhjbjabcabyjfd"
pattern = "abcaby"
result = find_all_matches_kmp(text, pattern)
print(result)
Enter fullscreen mode Exit fullscreen mode

Boyer-Moore Algorithm

The Boyer-Moore algorithm improves search efficiency by skipping sections of the text that are guaranteed not to contain the pattern, using two heuristics: the Bad Character Rule and the Good Suffix Rule.

Here’s a Python implementation:

def preprocess_bad_character(pattern):
    bad_char = {}
    m = len(pattern)
    for i in range(m):
        bad_char[pattern[i]] = i
    return bad_char

def preprocess_good_suffix(pattern):
    m = len(pattern)
    good_suffix = [0] * m
    f = [-1] * m
    k = m - 1
    for j in range(m - 2, -1, -1):
        if pattern[j + 1] != pattern[k]:
            k = j + 1
        f[j] = k
    for i in range(m - 1):
        good_suffix[m - f[i] - 1] = i + 1
    return good_suffix

def boyer_moore_search(text, pattern):
    n, m = len(text), len(pattern)
    if m > n:
        return []

    bad_char = preprocess_bad_character(pattern)
    good_suffix = preprocess_good_suffix(pattern)
    matches = []
    i = 0

    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            matches.append(i)
            i += good_suffix[0] if i + m < n else 1
        else:
            i += max(good_suffix[j + 1] if j + 1 < len(good_suffix) else 1,
                     j - bad_char.get(text[i + j], -1))

    return matches

# Example usage
text = "abxabcabcabydhjbjabcabyjfd"
pattern = "abcaby"
result = boyer_moore_search(text, pattern)
print(result)
Enter fullscreen mode Exit fullscreen mode

Rabin-Karp Algorithm

The Rabin-Karp algorithm uses hashing to find the pattern in the text. It computes hash values for the pattern and all substrings of the text of the same length, and compares them.

Here’s a Python implementation:

def rabin_karp(text, pattern, d=256, q=101):
    n, m = len(text), len(pattern)
    if m > n:
        return []

    # Precompute hash values
    p = 0  # Hash value for pattern
    t = 0  # Hash value for text
    h = 1  # The value of d^(m-1) % q
    matches = []

    # Compute the hash value of pattern and first window of text
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
        if i < m - 1:
            h = (h * d) % q

    # Check for matches
    for i in range(n - m + 1):
        if p == t:
            if text[i:i + m] == pattern:
                matches.append(i)

        if i < n - m:
            t = (d * (t - ord(text[i]) * h) + ord(text[i + m])) % q
            t = (t + q) % q  # Ensure t is non-negative

    return matches

# Example usage
text = "abcabcabcdabcdabc"
pattern = "abcd"
result = rabin_karp(text, pattern)
print(result)
Enter fullscreen mode Exit fullscreen mode
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .