Advanced Linux Troubleshooting in DevOps and Cloud: Practical Use Cases and Commands

akhil mittal - Oct 14 - Dev Community

In a DevOps or Cloud role, Linux administrators are often responsible for keeping applications and infrastructure running smoothly, so debugging and troubleshooting are essential skills. Below, I’ll walk through detailed examples of Linux system debugging and troubleshooting from a DevOps and cloud operations perspective, focusing on real-world issues you might encounter.

Scenario 1: Application Running Slowly on an EC2 Instance

You’re responsible for maintaining a web application running on an EC2 instance in AWS, and users report that the application has become very slow.

Steps for Debugging and Troubleshooting:


1. Check System Resource Usage (CPU, Memory, Disk)

The first step is to check if the system is running out of resources such as CPU, memory, or disk I/O, which could cause slowdowns.

  1. Monitor CPU and Memory Usage:
    • Use the top or htop command to see the most resource-consuming processes:
   top

Example output:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 www-data  20   0  149880   5140   2956 S   25.0   2.0   0:15.00 apache2

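In this case, Apache (apache2) is using 25% of one CPU core (top reports per-core percentages by default, so a process can exceed 100% on a multi-core instance). That could indicate heavy web traffic or inefficient code execution.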
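For scripts or alerts, a non-interactive snapshot is often easier to work with than top. The sketch below filters ps output for processes at or above a CPU threshold. The function name hot_procs and the default threshold of 20 are illustrative assumptions, not part of any standard tool.

```shell
# Print PID, %CPU, and command name for processes at or above a CPU threshold.
# Reads `ps -eo pid,pcpu,pmem,comm` text on stdin, so it also works on saved output.
# hot_procs is a hypothetical helper name; the default threshold (20) is arbitrary.
hot_procs() {
  local min_cpu="${1:-20}"
  # NR > 1 skips the header row; $2 is %CPU, $4 is the first word of the command name.
  awk -v m="$min_cpu" 'NR > 1 && $2 + 0 >= m { print $1, $2, $4 }'
}
```

Typical use would be ps -eo pid,pcpu,pmem,comm | hot_procs 20. Note that ps reports CPU use averaged over the life of the process, so the numbers can differ from what top shows for a short burst.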

  2. Check Disk I/O: High disk I/O can be a bottleneck in web applications, especially if they read/write data frequently (such as with database operations).

Use iostat (part of the sysstat package) to check disk I/O. The install command below is for Debian/Ubuntu; on Amazon Linux or RHEL, use sudo yum install sysstat instead:

   sudo apt-get install sysstat
   iostat -x 1 5

Example output (basic device report; with -x, iostat prints additional columns such as %util):

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   xvda              9.10        204.00        819.20      1224      49152

High disk I/O may indicate heavy database or log file operations.
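To pick out busy devices from the extended report, one option is to filter on the %util column. This is a minimal sketch: busy_devices is a hypothetical helper name, the limit of 80 is an arbitrary starting point, and the column is located by its header because its position differs between sysstat versions.

```shell
# Print devices whose %util is at or above a limit.
# Reads `iostat -dx` text on stdin. busy_devices is a hypothetical helper name.
busy_devices() {
  local limit="${1:-80}"
  awk -v lim="$limit" '
    # Header row: remember which column holds %util, then skip the row.
    /%util/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # Data rows: only consider rows where that column is numeric.
    col && $col ~ /^[0-9.]+$/ && $col + 0 >= lim { print $1, $col }
  '
}
```

Typical use would be iostat -dx 1 5 | busy_devices 80. A %util value near 100 suggests the device is saturated, although on storage that serves requests in parallel (such as SSD-backed volumes) it is a weaker signal.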

  3. Check Available Disk Space: If the disk is nearly full, it can cause the system to slow down.
   df -h

Example output:

   Filesystem      Size  Used Avail Use% Mounted on
   /dev/xvda1       30G   29G  1.0G  97% /

In this example, the root partition is 97% full, which could cause performance issues. You may need to clean up logs or expand the disk.
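The same check can be scripted so that it flags only the filesystems that need attention. The sketch below assumes the POSIX output format of df -P (usage in the fifth column, mount point in the sixth); the function name disks_over and the default threshold of 90 are illustrative.

```shell
# Print mount points whose usage is at or above a threshold (in percent).
# Reads `df -P` text on stdin. disks_over is a hypothetical helper name.
# Assumption: mount points contain no spaces.
disks_over() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # strip the percent sign before comparing
    if (use + 0 >= t) print $6 " " use "%"
  }'
}
```

Typical use would be df -P | disks_over 90, for example from a cron job that sends a notification when the output is non-empty.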

2. Check Application Logs for Errors

After checking system resources, the next step is to examine application logs to identify any errors or warnings.

  1. Check Web Server Logs (e.g., Apache or NGINX): If your web application is slow, logs may reveal issues like timeouts, connection errors, or misconfigurations.

For NGINX:

   sudo tail -f /var/log/nginx/access.log /var/log/nginx/error.log

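For Apache (the paths below are for Debian/Ubuntu; on Amazon Linux and RHEL, look under /var/log/httpd/ instead):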

   sudo tail -f /var/log/apache2/access.log /var/log/apache2/error.log

Example error in an Apache log:

   [error] [client 192.168.1.100] script timed out before returning headers: index.php

This could indicate a problem with a slow PHP script or database queries.
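When the log contains many of these entries, counting them per script shows where to look first. This is a rough sketch that assumes the script name is the last field on the line, as in the sample entry above; real log lines may carry extra trailing fields (such as a referer), in which case the field selection would need adjusting. The name timeouts_by_script is hypothetical.

```shell
# Count "script timed out" entries per script, most frequent first.
# Reads Apache error log text on stdin. timeouts_by_script is a hypothetical name.
# Assumption: the script name is the last whitespace-separated field on the line.
timeouts_by_script() {
  awk '/script timed out/ { n[$NF]++ }
       END { for (s in n) print n[s], s }' | sort -rn
}
```

Typical use would be sudo cat /var/log/apache2/error.log | timeouts_by_script.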

  2. Check Application-Specific Logs: For web applications running on Node.js, Python, or Java, check the application's own log file (e.g., /var/log/myapp/app.log).

Example:

   sudo tail -f /var/log/myapp/app.log

Look for errors such as:

   Error: Database connection timed out

This might indicate a problem with the database server (e.g., high load, unresponsive database).


3. Troubleshoot Network and Load Balancer Issues

In cloud environments, the network plays a critical role in system performance. If the system is responding slowly, you may have a network bottleneck or load balancer misconfiguration.

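  1. Ping and Network Latency: Check for network latency or packet loss between the application server and its dependencies (e.g., database, external services). Pinging a public address such as 8.8.8.8 tests general outbound connectivity; to test a specific dependency, ping its hostname or IP address instead. The -c 5 option stops after five packets:
   ping -c 5 8.8.8.8

High latency or packet loss might indicate a networking issue between the EC2 instance and the target. Keep in mind that many hosts and security groups block ICMP, so a failed ping alone does not prove that a service is unreachable.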
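To compare several targets or record results over time, the two summary lines that ping prints at the end are usually all you need. The sketch below extracts the packet loss and average round-trip time from them. It assumes the summary format used by the common Linux ping (a "packet loss" line and an "rtt min/avg/max/mdev" line); ping_summary is a hypothetical helper name.

```shell
# Reduce ping output to packet loss and average round-trip time.
# Reads ping output on stdin. ping_summary is a hypothetical helper name.
ping_summary() {
  awk '
    # e.g. "5 packets transmitted, 5 received, 0% packet loss, time 4006ms"
    /packet loss/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i }
    # e.g. "rtt min/avg/max/mdev = 1.123/1.456/2.000/0.300 ms"
    /^(rtt|round-trip)/ { split($4, t, "/"); avg = t[2] }
    END { print "loss=" loss, "avg_ms=" avg }
  '
}
```

Typical use would be ping -c 5 8.8.8.8 | ping_summary. When every packet is lost there is no rtt line, so avg_ms is left empty.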

  2. Check Load Balancer Health: If your application is behind an Application Load Balancer (ALB), ensure that the health checks are passing and traffic is being routed properly.

Check the ALB target group health:

   aws elbv2 describe-target-health --target-group-arn <target-group-arn>

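Example output (abridged, in the YAML form that AWS CLI v2 prints with --output yaml; the output is JSON by default):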

   TargetHealthDescriptions:
   - Target:
       Id: i-0123456789abcdef0
     TargetHealth:
       State: healthy

If an instance is marked as unhealthy, there could be issues with the application itself, causing failed health checks. The Reason and Description fields under TargetHealth explain why a target is failing.
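With many targets, it helps to list only the ones that are not healthy. The sketch below wraps the same command with a --query filter and text output. The function name unhealthy_targets is hypothetical, and the target group ARN is whatever you pass in.

```shell
# List targets in a target group whose state is anything other than "healthy".
# Prints one line per target: instance ID, state, and reason code.
# unhealthy_targets is a hypothetical helper name; requires a configured AWS CLI.
unhealthy_targets() {
  local arn="$1"
  aws elbv2 describe-target-health \
    --target-group-arn "$arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

Typical use would be unhealthy_targets <target-group-arn>. Empty output means every registered target is currently passing its health checks.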


Scenario 2: High Memory Usage on an EC2 Instance

You notice that the memory usage on one of your EC2 instances is constantly high, leading to performance issues and Out of Memory (OOM) kills.

Steps for Debugging and Troubleshooting:


1. Check Memory Usage

  1. Check Overall Memory Usage: Use the free -h command to check total, used, and free memory:
   free -h

Example output:

                 total        used        free      shared  buff/cache   available
   Mem:           7.8G        6.0G        500M        1.2G        1.3G        2.0G

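This shows that 6 GB of memory is used, with only 500 MB completely free. The available column (2.0 GB here) is the better indicator of headroom, because it includes buff/cache memory that the kernel can reclaim when applications need it.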
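For monitoring scripts, the same figures can be read from /proc/meminfo and reduced to a single percentage. A minimal sketch follows; mem_available_pct is a hypothetical helper name, and it relies on the MemAvailable field, which is present on reasonably recent kernels.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads /proc/meminfo-style text on stdin. mem_available_pct is a hypothetical name.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

Typical use would be mem_available_pct < /proc/meminfo. A script could alert when the value stays below a chosen floor, for example 10.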

  2. Check Which Processes Are Consuming Memory: Use top or htop to see which processes are consuming the most memory:
   top -o %MEM

Look for processes consuming excessive memory. For example:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   5678 mysql     20   0    2.5g    1.2g    500m  S  12.5  15.4   2:12.35 mysqld

In this case, mysqld (MySQL) holds 1.2 GB of resident memory (the RES column), which could cause memory pressure.
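Services that run many worker processes (Apache, for instance) can look small per process yet large in total. The sketch below sums resident memory per command name so those add up. The function name rss_by_command is hypothetical. Shared pages are counted once per process, so totals for multi-process services are an upper bound rather than an exact figure.

```shell
# Sum resident memory (RSS) per command name and print it in MiB, largest first.
# Reads `ps -eo rss,comm` text on stdin. rss_by_command is a hypothetical name.
rss_by_command() {
  # ps reports RSS in KiB; divide by 1024 to get MiB.
  awk 'NR > 1 { kb[$2] += $1 }
       END { for (c in kb) printf "%d %s\n", kb[c] / 1024, c }' | sort -rn
}
```

Typical use would be ps -eo rss,comm | rss_by_command | head.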

2. Investigate Memory Leaks or Inefficient Resource Usage

  1. Check Logs for Memory-Related Errors: Review application logs for memory allocation errors. OOM kills are performed by the kernel, so they are recorded in the kernel log, which you can read with sudo dmesg or journalctl -k.

Example OOM message from the kernel log:

   Out of memory: Kill process 12345 (node) score 100 or sacrifice child
  2. Check for Zombie or Hanging Processes: Zombie processes appear as <defunct> in ps output. The brackets in the pattern below keep grep from matching its own command line:
   ps aux | grep '[d]efunct'

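Zombie processes hold only a process-table entry and almost no memory, so they rarely explain high memory usage on their own. A growing number of them does indicate that the parent process is not handling its child processes correctly, which is worth investigating alongside a suspected leak.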
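Both checks in this step can be reduced to short filters. The sketch below is one possible form: oom_victims extracts the PID and process name from OOM messages like the one shown above (it also accepts the "Killed process" wording), and zombies lists processes whose state starts with Z. Both function names are hypothetical.

```shell
# Extract "PID name" from kernel OOM-killer messages.
# Reads kernel log text on stdin. oom_victims is a hypothetical helper name.
oom_victims() {
  sed -nE 's/.*Out of memory: Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}

# Print PID, parent PID, and command for zombie processes (state starts with Z).
# Reads `ps -eo pid,ppid,stat,comm` text on stdin. zombies is a hypothetical name.
zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}
```

Typical use would be sudo dmesg | oom_victims and ps -eo pid,ppid,stat,comm | zombies. The parent PID in the second column tells you which process is failing to reap its children.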

3. Add Swap Space (Temporary Fix)

If memory usage is consistently high, adding swap space can provide temporary relief by extending memory onto disk.

  1. Create a Swap File: If fallocate is not supported on your filesystem, sudo dd if=/dev/zero of=/swapfile bs=1M count=2048 creates an equivalent 2 GB file.
   sudo fallocate -l 2G /swapfile
  2. Set Up Swap Space:
   sudo chmod 600 /swapfile
   sudo mkswap /swapfile
   sudo swapon /swapfile
  3. Verify Swap Space:
   free -h

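The Swap row of the output should now show 2.0G of total swap. This lets the system absorb memory spikes instead of triggering OOM kills, but swap is much slower than RAM, and a swap file enabled with swapon lasts only until the next reboot unless it is also listed in /etc/fstab.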
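To keep the swap file across reboots, it needs an entry in /etc/fstab. The sketch below adds the conventional entry only if it is not already present, so it is safe to run more than once. The function name ensure_swap_in_fstab is hypothetical, and the fstab path is a parameter so the function can be tried against a copy first.

```shell
# Append a swap entry for the given file to fstab unless the exact line already exists.
# ensure_swap_in_fstab is a hypothetical helper name.
# Arguments: swap file path, fstab path (defaults to /etc/fstab).
ensure_swap_in_fstab() {
  local swapfile="$1" fstab="${2:-/etc/fstab}"
  local entry="$swapfile none swap sw 0 0"
  # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string.
  grep -qxF -- "$entry" "$fstab" 2>/dev/null || printf '%s\n' "$entry" >> "$fstab"
}
```

Writing to /etc/fstab requires root, so in practice this would run from a root shell or a provisioning script, for example ensure_swap_in_fstab /swapfile. Back up the file first; a malformed fstab can prevent the instance from booting cleanly.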


Scenario 3: Database Connection Timeout

Your application frequently encounters database connection timeouts, especially during periods of high traffic.

Steps for Debugging and Troubleshooting:


1. Check Database Logs

  1. Check Database Server Logs: If using MySQL, for example, check the logs at /var/log/mysql/error.log:
   sudo tail -f /var/log/mysql/error.log

Look for errors such as:

   [ERROR] Too many connections

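This indicates that the database has reached its maximum number of connections, which is set by max_connections (151 by default in MySQL). Clients see this condition as MySQL error 1040.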
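Before changing any limits, it is worth measuring how close the server actually is to the ceiling. The sketch below computes current connections as a percentage of max_connections from the output of the mysql client. The function name conn_usage_pct is hypothetical, and the usage line assumes the client can already authenticate (for example through ~/.my.cnf).

```shell
# Print current connections as a whole-number percentage of max_connections.
# Reads tab-separated "name value" rows on stdin, as printed by `mysql -N -B`.
# conn_usage_pct is a hypothetical helper name.
conn_usage_pct() {
  awk '$1 == "Threads_connected" { used = $2 }
       $1 == "max_connections"   { max = $2 }
       END { if (max > 0) printf "%d\n", used * 100 / max }'
}
```

Typical use would be mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';" | conn_usage_pct. A value that sits near 100 during traffic peaks confirms that the limit, rather than the network, is causing the timeouts.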

2. Increase Database Connection Limits

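  1. Increase Max Connections: Edit the MySQL configuration file /etc/mysql/my.cnf (on Ubuntu, the [mysqld] section usually lives in /etc/mysql/mysql.conf.d/mysqld.cnf) and increase the max_connections parameter. Each connection consumes memory, so raise the limit in steps and confirm the server has headroom:
   sudo nano /etc/mysql/my.cnf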

Add the following line under the [mysqld] section:

   max_connections = 500
  2. Restart MySQL: Restart the database service to apply the changes:
   sudo systemctl restart mysql

Conclusion:

Linux administrators working in DevOps or cloud environments frequently face challenges related to system performance, application errors, resource bottlenecks, and network issues. Being able to debug and troubleshoot effectively involves checking system logs, monitoring resource usage, and adjusting configurations to optimize performance. The examples provided give practical steps and commands that can be applied in real-world scenarios to diagnose and resolve issues.

Let me know if you'd like more details on any specific scenario!
