Logging is an essential part of the software development process. Debugging application and infrastructure performance has traditionally relied heavily on logs. They help to provide visibility into how our apps function across each infrastructure component.
Log data can include events such as out-of-memory errors and hard drive failures. This information helps us establish the "why" behind an issue, whether it was reported to us or we discovered it ourselves.
Log data frequently contains critical information about your applications, infrastructure, and databases. When compared to the security mechanisms used to control access to a production database, log security may be lacking.
Furthermore, it is tempting to log sensitive client data, such as names and email addresses, as an easy way to see who triggered an application event and to build a comprehensive audit trail when debugging.
As a result, the personal information of millions of people has been exposed, frequently through organizational log files and database backups.
Whether or not you operate in a highly sensitive industry like health tech or finance, recording user PII (Personally Identifiable Information) is a compliance and security risk that has been the basis of numerous major data breaches.
In this post, we will define sensitive data, evaluate the risks of logging it, and demonstrate how to avoid this issue by adhering to best practices for logging sensitive user data.
Let’s dive into it!
What Does Sensitive Data Mean?
Before we go into the best practices, let's talk about what constitutes sensitive data. Sensitive data is private information that must be safeguarded against unauthorized access, such as personal information, passwords, and credentials.
Sensitive data can be broadly categorized as one of the following:
- PII (Personally Identifiable Information) - This includes information such as full names, addresses, email addresses, driver’s license numbers, security PINs, phone numbers, etc.
- Financial Data - Bank Account Number, ATM PIN, etc.
- Healthcare Data - This includes healthcare records and medical history, etc.
- Passwords
- IP Address, etc.
Although the data listed above is considered sensitive and is subject to compliance requirements such as GDPR (General Data Protection Regulation), PCI DSS (Payment Card Industry Data Security Standard), and HIPAA (Health Insurance Portability and Accountability Act of 1996), it is critical to assess data sensitivity in the context of your business and product.
Consider the following question when logging any data: "What will be the likely impact on my organization if this information falls into the wrong hands?"
If disclosing this data may harm your company's brand or consumer trust, you should regard it as sensitive information and avoid logging it.
Now that we know what counts as sensitive data, let's look at why it should be kept out of your logs.
Why Should You Keep Sensitive Data Out of Your Logs?
Compliance and security are the primary reasons for keeping sensitive data out of logs. In terms of compliance, privacy rules give users the right to find out what data has been collected about them and why it is being held, and to request that it be deleted.
If user data is replicated or scattered around the system via logs, database dumps, and backups, complying with any of these requests becomes exceedingly challenging.
Furthermore, logs are frequently the target of data intrusions, which result in unintended data disclosures. Keeping sensitive data out of logs can greatly reduce the impact of any attack.
Now that we understand the importance of keeping sensitive data out of logs, let’s learn about the best practices to follow while logging that can help in achieving this.
Best Practices for Keeping Sensitive Data Out of Your Logs
1. Encrypt Data in Transit and at Rest
Encrypting data in transit and at rest ensures that even if someone steals your log file or database dump, they will still need the key to decrypt and use the data.
Furthermore, because web servers frequently log requests, data in transit must be secured, even between internal systems. This helps keep unencrypted sensitive data out of your logs.
2. Isolate Sensitive Data
When you move sensitive data across your systems, such as a user's name, email, address, and phone number, it is likely that some API will log it or that some system will store it in a database.
A single source of truth, such as a data privacy vault, is a better approach to managing sensitive data.
A data privacy vault helps you isolate and safeguard all sensitive data in one place, so that your application never passes sensitive data through internal APIs or stores sensitive fields in the application database.
Sensitive data then cannot end up in database backups, SQL logs, application logs, or server logs, since it is never present in the systems being monitored or backed up.
3. Tokenize Sensitive Data
When adding logs to an application, including a user identifier such as a name or an email can be very helpful in debugging and can save a lot of time. It is tempting, but you should avoid doing so.
A better solution is to attach a reference to the raw value, rather than the value itself, to the log record through tokenization. In other words, you exchange the sensitive data for a token.
Once your sensitive data has been isolated and stored in a data privacy vault, every reference to it in your application becomes a token. Combining data isolation with tokenization gives you data privacy while keeping the usefulness and convenience of having a form of identifier in your logs.
If necessary, you can detokenize the tokens to get the original sensitive data.
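To make the idea concrete, here is a minimal sketch of tokenization in Python. The `TokenVault` class is a hypothetical, in-memory stand-in for a data privacy vault, not the API of any real product; a real vault would persist its mapping securely and control who may detokenize.

```python
import secrets


class TokenVault:
    """Toy in-memory stand-in for a data privacy vault (hypothetical)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one user always maps to one log identifier.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # In a real system this call would be access-controlled and audited.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("jane.doe@example.com")

# The log record carries only the token, never the raw email address.
log_line = f"password reset requested user={token}"
```

Because the same value always yields the same token in this sketch, you can still follow one user's activity across log lines without ever writing the email itself.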
4. Keep Sensitive Data Out of URLs
It’s a common practice to log URL requests on web servers. If you have a URL pattern such as users/ or users/ that embeds a name or an email address in the path, the names and emails of the individuals are likely to be logged on the server, which leaves that data exposed.
To mitigate this, replace the sensitive data in the URLs with an arbitrary user ID. This might be something like the user's primary key, a UUID (Universally Unique Identifier), or any other form of token.
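As a rough illustration, the sketch below builds a user URL from an opaque UUID instead of a name or email. The `User` class, its `public_id` field, and the `profile_url` helper are hypothetical names chosen for this example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    email: str
    # Random, opaque identifier that reveals nothing about the person.
    public_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def profile_url(user: User) -> str:
    # Build the path from the opaque ID, never from the name or email.
    return f"/users/{user.public_id}"


user = User(name="Jane Doe", email="jane.doe@example.com")
url = profile_url(user)
```

Whatever the web server writes to its access log for this request now contains only the UUID.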
5. Mask or Redact Sensitive Data
In addition to tokenization, combining redaction and masking is an effective way to keep sensitive information out of your logs. Some applications may only need the last four digits of a credit card or social security number (SSN).
Data masking is an irreversible, one-way procedure for securing sensitive data. Masking generates a version of the data that looks structurally identical to the original but conceals the most sensitive parts of a field.
Unlike masking, redaction hides all of the information within a field.
There are also situations where the application doesn’t need even partial information. In such cases, the sensitive data can be redacted instead of masked.
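The difference between the two can be sketched in a few lines of Python. The function names and the `[REDACTED]` placeholder are assumptions made for this example, not part of any particular library.

```python
def mask_digits(value: str, visible: int = 4) -> str:
    """Replace every digit except the last `visible` ones with '*'."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_seen = 0
    masked = []
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            # Keep only the trailing digits; separators are preserved as-is.
            masked.append(ch if digits_seen > total_digits - visible else "*")
        else:
            masked.append(ch)
    return "".join(masked)


def redact(_value: str) -> str:
    """Hide the whole value, regardless of its content."""
    return "[REDACTED]"


masked_card = mask_digits("4111 1111 1111 1234")  # "**** **** **** 1234"
masked_ssn = mask_digits("123-45-6789")           # "***-**-6789"
redacted_ssn = redact("123-45-6789")              # "[REDACTED]"
```

Masking keeps the shape of the field, which is handy when support staff need to confirm the last four digits, while redaction leaves nothing to recover.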
You can effectively keep sensitive data out of your logs if the practices described above are followed correctly; nonetheless, mistakes are unavoidable when people are involved.
Here are some more technical best practices that help limit human error when logging:
6. Code Reviews
Because code reviews are a routine and frequent activity in software development, they are a natural place for reviewers to check that no log statement might reveal sensitive data.
If you're using a Pull Request Template, consider including a checkbox for the reviewer to ensure that they've validated the logging statements in the modifications.
7. Structured Logging
Structured logging records logs as structured data, such as key/value pairs, rather than just text. Structured logs have the advantage of being easy to search and analyze. They can also aid in keeping sensitive information out of your logs.
Because of JSON's simplicity and adaptability, it is an excellent choice for constructing structured log statements; log data can be retrieved and inspected automatically, but the messages remain comprehensible to people.
All major programming languages support JSON logging natively or through libraries.
For example, a line in the combined log format looks like this:
31.22.86.126 - - [17/May/2015:08:05:24 +0000] "GET /downloads/product_1 HTTP/1.1" 304 0 "-" "Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.16)"
By contrast, a log line generated by an Nginx web server and formatted in JSON would look something like this:
{
  "time": "17/May/2015:08:05:24 +0000",
  "remote_ip": "31.22.86.126",
  "remote_user": "-",
  "request": "GET /downloads/product_1 HTTP/1.1",
  "response": 304,
  "bytes": 0,
  "referrer": "-",
  "agent": "Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.16)"
}
In your logging process, you can use heuristics to determine whether any of the keys in a log record correspond to known sensitive data fields. If they do, those fields should not be written to the logs.
For instance, the heuristics can check keys against field names such as name, email, and password. This method isn't flawless, but it adds a layer of automated checking.
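One way such a heuristic might look in Python is sketched below. The key lists are hypothetical and deliberately small, and this version replaces a suspicious value with a placeholder rather than dropping the key, so that it stays visible that something was withheld.

```python
import json

# Hypothetical deny-lists; extend them to match the fields your product handles.
SENSITIVE_EXACT_KEYS = {"name", "full_name", "first_name", "last_name"}
SENSITIVE_KEY_HINTS = ("email", "password", "ssn", "phone")


def is_sensitive(key: str) -> bool:
    key = key.lower()
    return key in SENSITIVE_EXACT_KEYS or any(hint in key for hint in SENSITIVE_KEY_HINTS)


def scrub(event: dict) -> dict:
    """Return a copy of the event with suspicious fields replaced."""
    clean = {}
    for key, value in event.items():
        if is_sensitive(key):
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Nested objects are scrubbed recursively.
            clean[key] = scrub(value)
        else:
            clean[key] = value
    return clean


def to_log_line(event: dict) -> str:
    return json.dumps(scrub(event), sort_keys=True)


event = {
    "action": "signup",
    "user": {"id": "42", "email": "jane.doe@example.com"},
    "password": "hunter2",
}
line = to_log_line(event)
```

A check like this only looks at key names, so a sensitive value stored under an innocent-looking key will still slip through.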
8. Automated Alerts
The last step is to create an automated service that searches existing logs for sensitive information and notifies the team if it finds any. This may appear excessive or unnecessary, yet it can help detect flaws in the system.
The following are some frequent points to be mentioned in alerts:
- Time and date
- Name of the host
- Name of the application
- Customer or account that was affected by the mistake
- The visitor's IP address or other geographical indicators
- Unprocessed exception data
- The line number on which it appeared (if applicable)
- Error classification (fatal, warning, etc)
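A scanner of this kind could start out as simple as the following sketch. The two regular expressions are hypothetical, intentionally loose patterns for email addresses and card-like digit runs; a real detector would need tuning to limit false positives, and the notification step is left out.

```python
import re

# Hypothetical patterns; real detectors need tuning to limit false positives.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan_log_lines(lines):
    """Return (line_number, pattern_name) for every suspicious match."""
    findings = []
    for line_number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


sample = [
    "2024-01-01T10:00:00Z INFO user=tok_ab12 logged in",
    "2024-01-01T10:00:05Z ERROR signup failed for jane.doe@example.com",
    "2024-01-01T10:00:09Z WARN payment declined card=4111 1111 1111 1111",
]
findings = scan_log_lines(sample)
```

Note that the findings record only where a match occurred and what kind it was, not the matched value, so the alert itself does not repeat the leak.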
Conclusion
In a nutshell, it's vital to make every effort to keep your logging system from becoming a weak link in the security and privacy of your infrastructure, whether on purpose or by accident.
In this post, we examined eight best practices that may help you and your team keep sensitive data out of your logs.
Developing an engineering culture that is aware of the hazards of recording sensitive information will help to avoid future problems. Making sure sensitive data is not logged should be a shared duty of the whole technical organization, not the sole responsibility of one person.
I hope you found this post useful. Please let me know if there is anything I missed.