Data Validation

Abstract

Data validation, or more accurately the lack of it, is the single largest cause of vulnerabilities among all the categories of our security frame. From the vulnerability of the nineties, the infamous buffer overflow, to the bane of web applications, cross site scripting, all of these are problems that can be readily mitigated by effective data validation. However, all too often data validation is an afterthought, if it is considered at all. The result can be embarrassing and dangerous vulnerabilities in your applications that, in hindsight, cause developers to think, “… that would have been so easy to fix”.

 

Introduction

For those readers who have followed this column from the beginning, you will note that we are now more than halfway through the categories in our security frame. We first discussed Configuration Management, then Data Protection in Storage and Transit, Authentication and Authorization, and most recently User and Session Management. Continuing the series, we now turn to Data Validation.

 

As was the case with the previous categories, the fundamental cause of problems in this category is that all too often data validation is not treated as part of the software design but rather as something that is almost “common sense” and therefore the responsibility of each individual developer. As someone once said, however, “common sense is not so common”, and “performance” considerations and impending project deadlines, among other factors, often end up trumping the apparent need for data validation. Individual developers then consider best-case scenarios and normal user behavior and overlook the fact that malicious attackers do not play by the rules or follow the norm. Omitting what would have been a simple check, often a single line of code, can then result in a catastrophic vulnerability affecting the entire user population. In our experience, on the other hand, when data validation is embodied in the architecture and is a focus early in the application development lifecycle, more often than not the application is secure by default or easily fixed even when new vulnerability types are discovered.

 

So what is data validation then?

Unfortunately, there are a number of false beliefs when it comes to data validation. The first is the assumption that data validation means input validation. It is extremely important to note, however, that data validation goes beyond input validation and includes output validation as well. Further, even within input validation, as the table below shows, the threats to an application go far beyond merely user inputs. For instance, consider deserialized objects or news feeds from third parties. Could those potentially be inputs? Could they be tampered with to cause harmful effects on your application?

 

Input Validation
  • Injection Flaws: SQL Injection[1], Shell Injection, XPath / XML Injection
  • Overflows: Buffer Overflows, Integer Overflows, Array Bounds Overflows

Output Validation
  • Cross Site Scripting
  • Cross Site Request Forgery

Table 1: Threats that could be realized due to a lack of data validation. This table is not meant to be an exhaustive list but is intended to be indicative of the types of issues that plague various types of applications. Mitigations for these threats are discussed later in this article and will help justify why the issues have been categorized as above.

 

At a very fundamental level, data validation comes down to checking four basic properties of the data: length, range, format and type.

  • Length, as the name would suggest, is the size of the data: for instance, the number of bytes in a string or the number of characters to be copied into a buffer. Note that these two can differ depending on the encoding in use.
  • Range determines which values are valid and which are not. This will often depend on the business logic. For instance, when dealing with the price of an item in an e-commerce store, a negative value should fail a basic sanity check and raise a lot of red flags.
  • Format describes what the data is meant to look like. For instance, is it meant to be a sequence of exactly 9 digits with specific rules for blocks within those digits (as would be a US Social Security Number), or is it meant to be a phone number and, if so, from which country? Is it meant to be an alphanumeric address? Are special characters and punctuation allowed? Once again, valid formats will be dictated by the real world entity the data describes.
  • Type is the raw data type associated with the underlying item. Often the bane of loosely typed languages such as the scripting languages, type mismatches can lead to unpredictable results from an application. For instance, consider what happens when, instead of passing in a numeric field such as your age, you pass a string containing SQL fragments.
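
To make the four properties concrete, here is a minimal sketch in Python of how a single form field might be checked against each of them in turn. The field (an order quantity), the limits and the function name validate_quantity are assumptions made up for this illustration; they do not come from any particular framework.

```python
import re
from typing import Optional

# Hypothetical business rule: an order line may contain 1 to 100 units.
MIN_QUANTITY = 1
MAX_QUANTITY = 100
MAX_DIGITS = 3  # longest string that can still represent MAX_QUANTITY


def validate_quantity(raw: object) -> Optional[int]:
    """Return the quantity as an int, or None if any check fails."""
    # Type: form fields arrive as strings; reject anything else outright.
    if not isinstance(raw, str):
        return None
    # Length: bound the size before doing any further work on the value.
    if not 1 <= len(raw) <= MAX_DIGITS:
        return None
    # Format: ASCII digits only, so signs, spaces and SQL fragments fail here.
    if re.fullmatch(r"[0-9]+", raw) is None:
        return None
    # Range: the business rule decides which numbers are acceptable.
    value = int(raw)
    if not MIN_QUANTITY <= value <= MAX_QUANTITY:
        return None
    return value
```

A call such as validate_quantity("5") yields the integer 5, while "-1", "1 OR 1=1" and the non-string value 5 are all rejected. Returning None rather than a "repaired" value is a deliberate choice in this sketch: the caller has to handle rejection explicitly.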

 

With that basic definition in place, we can now go about defining effective and efficient strategies around data validation.

 

Strategies for Effective Data Validation

First and foremost, as was mentioned above, it is important that data validation is not left to the individual developer but is considered an integral and necessary part of the system architecture. By considering data validation upfront in the application development lifecycle, it is possible to centralize data validation across the application or across components within the application. A coordinated data validation strategy across multiple components has the advantage of being easy to implement, test, maintain and update. But perhaps more importantly, it frees the individual components, and the developers who own those components, from the need to validate the data they handle. Those components can focus purely on application logic and can rest assured that their data will be validated elsewhere. The end result is the reduction or possibly elimination of inconsistencies across modules in the application. Further, common programming errors such as validation performed exclusively on the client side can be much more easily tested for and discovered.

 

Once you have decided to employ a centralized strategy, the obvious follow-up question is how to implement it. A common approach is to employ what is often called a validation funnel. As the figure below illustrates, a validation funnel channels all inputs through a single validation module and handles outputs similarly. It is important to note, however, that the funnels need not be separate physical components; they could, for instance, be shared classes that filter and sanitize all inputs and outputs based on a set of rules. Ideally this set of rules should be configurable declaratively, without having to rebuild the module. Further, the rules must account for varying degrees of access control. For instance, data that can be influenced by an anonymous internet user must be treated with far less trust than data that can only be supplied by an administrator.
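
As an illustration of the funnel idea, the following Python sketch routes every input field through one function driven by a declarative rule table, and encodes every output through one function. The names Rule, RULES, validate_inputs and encode_for_html are hypothetical, and the rule table is hard-coded here only for brevity; it could equally be loaded from a configuration file. This is a sketch of the pattern, not the design of any specific product.

```python
import html
import re
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rule:
    max_length: int  # upper bound on the number of characters
    pattern: str     # regular expression the whole value must match


# Hypothetical rule table; fields not listed here are rejected.
RULES: Dict[str, Rule] = {
    "username": Rule(max_length=32, pattern=r"[A-Za-z0-9_]+"),
    "zip_code": Rule(max_length=10, pattern=r"[0-9]{5}(-[0-9]{4})?"),
}


def validate_inputs(fields: Dict[str, object]) -> Dict[str, str]:
    """Input side of the funnel: map each bad field to a reason.

    An empty result means every field was accepted.
    """
    errors: Dict[str, str] = {}
    for name, value in fields.items():
        rule = RULES.get(name)
        if rule is None:
            errors[name] = "unexpected field"
        elif not isinstance(value, str):
            errors[name] = "wrong type"
        elif len(value) > rule.max_length:
            errors[name] = "too long"
        elif re.fullmatch(rule.pattern, value) is None:
            errors[name] = "bad format"
    return errors


def encode_for_html(value: str) -> str:
    """Output side of the funnel: escape markup characters before rendering."""
    return html.escape(value, quote=True)
```

Because components call validate_inputs and encode_for_html rather than writing their own checks, tightening a rule means editing one table entry instead of hunting through every module.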

 

At Foundstone we created such a proof of concept implementation with Validator.NET[2].

 

Figure 1: Validation Funnel

Once the application architecture has a centralized data validation chokepoint, the next question to tackle is where to perform this centralized validation and how often to validate. Answering these questions accurately usually requires an understanding of the trust boundaries associated with a system. A trust boundary is a logical edge at which one side does not trust the other. For instance, as indicated in the side note, a trust boundary could exist between the client and the server, or at a remoting API that is shared by both internal applications and partner applications. Most often the trust boundary can be defined at the location where the policies associated with a system change. A good way to identify these locations is to look for network devices such as firewalls, for VLANs, or for authentication mechanisms.

 

When identifying trust boundaries it is important to avoid a few common misconceptions. For instance, data validation is not a concern for web applications alone. In today’s ever more connected world, organizations are opening more and more of their internal IT systems to partners and telecommuting employees. The traditional notion of a closed environment is slowly dying away, and legacy applications that made assumptions about such environments are increasingly becoming prime targets as they inadvertently, or perhaps even intentionally, get exposed to the outside world. The other point to remember is that data validation is not restricted to user inputs and outputs but extends to non-conventional data paths such as news feeds and object serialization and deserialization, as well as data read from sockets, inter-process communication channels, environment variables and log files.

 

With that said, developers might also wonder how many levels of data validation they really need. This brings up the most common argument against data validation: performance. In our experience, having looked at hundreds of applications of all hues, the performance impact of necessary and sufficient data validation is rarely significant, especially when compared to other bottlenecks within a typical application architecture such as network bandwidth or encryption overhead. Further, one has to question the value of a highly performant system that has been compromised by a malicious attacker. At the other extreme, however, it is important to understand the cost of having too much validation. The validation funnel architecture described above should mitigate this problem by ensuring that individual components do not have to perform data validation themselves.

 

What To Validate and How?

The answers to these questions are best given, once again, with respect to the four key properties of data defined above.

 

  • Length – The classic protection against buffer overflows is to validate the length of all data before copying it, which ensures that the destination buffer is large enough to hold the data about to be copied into it. However, developers often forget to check the other bound on length: the minimum length of a data element. This is especially important when input obtained from the user is used to index into a memory blob, for instance. Given how long buffer overflows have been around, a number of tried and tested solutions exist. With the newer “managed” languages such as Java and C#, length checks become less of a consideration due to automated memory management. However, even in these cases validating length can help prevent unnecessary reallocations and memory copies. Further, especially when dealing with data that may be sensitive, developers would like to avoid multiple copies being left behind in memory with only the garbage collector controlling when they are cleared. C and C++ development has also seen some of the problems traditionally associated with these languages addressed. For instance, the Standard Template Library provides managed string classes, smart pointers and dynamically allocated data structures such as vectors. Similarly, a number of the unsafe ANSI C functions such as the str* string functions now have safer alternatives that perform extensive bounds checking[3]. However, one source of confusion that remains despite these safer libraries is the non-standard semantics of a number of commonly used APIs. The biggest source of confusion, in our experience, is when the API expects the number of bytes and the developer passes the number of characters, or vice versa. Everything seems to work fine until you encounter a multi-byte encoded string, a UTF-8 string for instance. One such function is MultiByteToWideChar on Windows.
Similarly, confusion also arises over whether the length parameter is meant to indicate the space remaining in the buffer or the total size of the buffer. This last issue is especially a problem with the string concatenation functions available in the ANSI C library.
  • Range – When dealing with an acceptable range of values, ensure that data is validated to be within that range. The canonical example is the price of goods on e-commerce websites. A number of websites and online shopping cart services continue to be vulnerable to “negative price / shipping cost / tax” attacks, wherein attackers can influence the price they pay (or indeed end up being paid themselves). Such logical problems should be trivial to detect and prevent given the business rules implemented by the system. Similarly, especially when dealing with numbers, it is important to understand the range of the base numeric type being used to store the number and the difference between signed and unsigned numbers. For instance, what happens when you increment a number beyond its maximum value or decrement it below its minimum value? How does that impact application logic and security? Similarly, when dealing with values returned from drop-down or list boxes, it is best to implement a data indirection pattern wherein only the option index is obtained from the client, and an error is returned if that index does not fall within the acceptable range. In general, two approaches to range based validation are common: black list and white list data validation. As the name would suggest, black list data validation involves creating a list of “bad” data items that are then blocked. White listing, on the other hand, involves creating a list of items, based on business rules, that are accepted while everything else is dropped. As one would expect, it is much easier to build an all-encompassing white list than it is to build a black list that is effective in blocking all attacks, both current and future. Therein lies the major problem: your black list is only as effective as your current knowledge of attack patterns.
  • Format – This is perhaps the aspect that is most ingrained in the business logic of an application. In most cases format checks entail checking whether the programmatic representation of an entity is consistent with its real world counterpart. There are a number of effective mechanisms for performing such format validation, but perhaps the most efficient and elegant approach is to leverage regular expressions. The .NET and Apache Struts frameworks, for instance, provide out of the box support for validating forms and controls through the use of regular expression masks. In the .NET framework this is done using the asp:RegularExpressionValidator object[4]. For those not very familiar with regular expression syntax, a number of excellent references are easily available on the Internet. Further, tools such as The RegEx Coach[5] and The Regulator[6] can help beginners not only get more comfortable with regular expressions but also query online libraries for tried and tested expressions. When dealing with XML data representations this can be taken even further by using an XSD schema to perform granular validation of the data elements contained within the XML document. For instance, a number of attacks these days attempt to compromise not the application but the XML parser running within the application. The most common attack is what is commonly called XDOS (XML Denial of Service). This attack typically involves feeding the parser XML that contains embedded entity definitions that are recursive in nature. Given these attacks, it is important that XML documents are checked for such malicious streams before any attempt is made to fully parse the document. Format validation, however, has another important dimension that is often forgotten and can be the source of numerous and repeated problems. The source of these problems is the fact that the exact same data can be represented in multiple different formats.
For instance, consider the less than symbol ‘<’: this can be represented as “&lt;” when HTML encoded, or as “&#x3c;” or even “&#60;”. Other common encoding formats on the web include URL encoding and hexadecimal encoding. Given the multitude of encoding formats, canonicalization becomes critical, and all validation must be performed after data has been decoded into its most basic form.
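
The point about canonicalization can be sketched in Python as follows, using the standard library functions html.unescape and urllib.parse.unquote. The helper names canonicalize and contains_markup, and the limit of five decoding rounds, are assumptions chosen for this example. Decoding repeatedly until the value stops changing is one possible policy for catching double encoding; it may be too aggressive for data that legitimately contains encoded text.

```python
import html
from typing import Optional
from urllib.parse import unquote


def canonicalize(value: str, max_rounds: int = 5) -> Optional[str]:
    """Decode URL and HTML encoding until the value stops changing.

    Returns None if the value is still changing after max_rounds, which
    this sketch treats as suspicious (deeply nested encoding).
    """
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            return value
        value = decoded
    return None


def contains_markup(value: str) -> bool:
    """Validate only after canonicalization, never on the raw input."""
    canonical = canonicalize(value)
    return canonical is None or "<" in canonical or ">" in canonical
```

With this in place, the inputs "&lt;", "&#x3c;", "&#60;" and "%3C" all canonicalize to the same single character, so a check written against ‘<’ sees every variant.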

As a matter of fact, dealing with different encodings is a perennial thorn in the side of developers. Besides the fact that the same character can be represented in different ways, the other issue that arises is internationalization, when dealing with languages other than English and especially languages from the Far East. The best way to tackle such issues is to use Unicode, with an encoding such as UTF-8, when building the application. Regular expressions have also been extended to support non-English character sets. However, the most typical problem that arises after internationalization is buffer overflows. It is therefore important to remember that while in the United States an 8-character password is, for the most part, represented in 8 bytes, the same is not true in China. An 8-character Chinese name will typically occupy 24 bytes in UTF-8 (or 16 bytes in UTF-16). Hence, when dealing with buffers, one needs to be very careful with data, especially when operating in non-English locales.
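
The gap between character count and byte count is easy to check directly. The short Python sketch below compares the two for an ASCII string and for a string of eight common Chinese characters (written as escapes so the example does not depend on the file encoding). The helper name fits_in_buffer is hypothetical.

```python
def fits_in_buffer(text: str, buffer_size_bytes: int, encoding: str = "utf-8") -> bool:
    """Compare the encoded byte length, not the character count, to the buffer."""
    return len(text.encode(encoding)) <= buffer_size_bytes


ascii_text = "password"
# Eight Chinese characters, all from the Basic Multilingual Plane.
cjk_text = "\u4e2d\u6587\u5bc6\u7801\u6d4b\u8bd5\u5b57\u7b26"

ascii_chars = len(ascii_text)                 # 8 characters
ascii_bytes = len(ascii_text.encode("utf-8"))  # 8 bytes
cjk_chars = len(cjk_text)                      # 8 characters
cjk_utf8_bytes = len(cjk_text.encode("utf-8"))       # 24 bytes, 3 per character
cjk_utf16_bytes = len(cjk_text.encode("utf-16-le"))  # 16 bytes, 2 per character
```

Both strings are eight characters long, yet only the ASCII one fits in an 8-byte buffer, which is exactly the mismatch that bites when a length parameter is documented in bytes and supplied in characters.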

A special case of format checking arises when dealing with file uploads or downloads. These are explicitly mentioned because, in the authors’ experience, we have very rarely found applications that implement both in a secure manner. For instance, with file uploads, checking MIME types and performing selective virus scanning is regarded as good practice. Similarly, file uploads should be throttled to avoid disk space exhaustion attacks. In the case of downloads, on the other hand, developers must ensure that arbitrary files cannot be downloaded from outside the equivalent of a chroot jail[7]. Developers must decide, for instance, whether path components and relative paths are allowed at all. Similarly, it is important that all access control is performed on the basis of file system access control lists rather than simply on the file name. This is especially significant when 8.3 names are enabled on the system. For instance, ThisIs~1.doc can refer to the same document as ThisIsASecretDocument.doc when they are in the same folder. Hence, if your access control is based on matching the whole name, an attacker could trivially subvert it by using the 8.3 file name. As was mentioned in an earlier article, all access control must therefore be based on handles rather than names.
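
For the download case, a minimal Python sketch of confining requests to a base directory might look like the following. The function name resolve_download and the directory name used in the usage note are hypothetical. The sketch only illustrates rejecting paths that escape the base directory after normalization; it does not replace file system access control lists or handle-based checks, and it does not address 8.3 short names.

```python
import os
from typing import Optional


def resolve_download(base_dir: str, requested: str) -> Optional[str]:
    """Return the absolute path to serve, or None if it escapes base_dir."""
    try:
        base = os.path.realpath(base_dir)
        # realpath collapses ".." components and follows symbolic links,
        # so the comparison below is made on the canonical form of the path.
        candidate = os.path.realpath(os.path.join(base, requested))
        if os.path.commonpath([base, candidate]) != base:
            return None
    except ValueError:
        # Raised for embedded NUL bytes or, on Windows, paths on other drives.
        return None
    return candidate
```

Given a base directory named "downloads", a request for "reports/q1.pdf" resolves inside it, while "../secret.txt" and an absolute path such as "/etc/passwd" are both refused.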

  • Type – This is perhaps the most often ignored and rarely used attribute of the data, especially when dealing with strongly typed languages from C or C++ to C# and Java. However, it continues to be the bane of the weakly typed scripting languages such as Perl or JavaScript. In such cases, it is important to ensure that if the application is expecting a string, then that is indeed what is presented to the application rather than, say, a numeric type. In the scripting languages this is best done by requiring, as a matter of coding standards, that all variables be declared with a type before they are ever used. With languages such as C# and Java, the reflection mechanism provides an effective and efficient way of querying object metadata and identifying data types. This is especially useful when dealing with dynamic code or object serialization attacks. In C++ the capabilities are limited to casting operators[8] such as dynamic_cast, which understand polymorphism and check whether a cast is valid before performing it. For pointer types, dynamic_cast returns NULL if the cast fails (a failed reference cast throws std::bad_cast instead), allowing the developer to take remedial action. Old C style casts, on the other hand, are extremely loose and allow arbitrary conversions between unrelated data types (especially when data is passed around as void*). These conversions can not only cause exceptions themselves but can also result in unpredictable application behavior.
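
The data indirection pattern and the numeric range concerns from the Range item, together with the type check from the Type item, can be sketched in Python as follows. The option list SHIPPING_OPTIONS and the function names resolve_option and checked_add_int32 are hypothetical. Python integers do not overflow, so checked_add_int32 only models the bounds check that code using a fixed-width 32-bit signed type would need to perform.

```python
from typing import Optional

# Hypothetical server-side list; the client only ever sends an index into it.
SHIPPING_OPTIONS = ("standard", "express", "overnight")

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def resolve_option(raw_index: object) -> Optional[str]:
    """Map a client-supplied index to a server-side value, or None."""
    # Type: the raw form value must be a string.
    if not isinstance(raw_index, str):
        return None
    # Format: one or two ASCII digits; this also rejects "-1", which
    # Python would otherwise accept as an index counting from the end.
    if not (1 <= len(raw_index) <= 2 and raw_index.isascii() and raw_index.isdigit()):
        return None
    index = int(raw_index)
    # Range: the index must point at an existing option.
    if not 0 <= index < len(SHIPPING_OPTIONS):
        return None
    return SHIPPING_OPTIONS[index]


def checked_add_int32(a: int, b: int) -> Optional[int]:
    """Add two values, returning None if the sum leaves the int32 range."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        return None
    return result
```

Because the client never sends the option text itself, there is nothing to white list beyond the index, and an out-of-range or malformed index simply produces an error.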

 

Client Side Security

Large numbers of developers make the mistake of trusting the client, assuming that if they wrote it and accounted for security measures it must be secure. The problem with that premise is that the client is typically deployed in an untrusted environment: a browser or thick client running on the attacker’s desktop. In general it is fair to assume that the client desktop is a hostile environment. On the client side, users have access to a variety of tools that can help them circumvent, disable or entirely negate security measures built into the client. These classes of tools include, among others, decompilers, debuggers and proxies. They can allow an attacker to do something as simple as viewing “secrets” in client side application artifacts or something as complex as modifying the data stream and communication channel between the client and the server. The classic example is client side JavaScript based data validation. Using a proxy such as Paros[9], it is trivial to selectively change the JavaScript, change the data field values after they have left the browser, or completely turn the application’s client side security model on its head. For this reason it is important that the client never be trusted. Validation and authorization checks may be performed on the client side, but only for performance reasons, i.e. to avoid an unnecessary server round trip when dealing with an innocent user error. All checks must still be performed on the server side, irrespective of whether they are also performed on the client side. Essentially, developers must assume that a custom client written entirely by an attacker will be used to connect to the server.
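
One way to picture "never trust the client" is a server-side handler that recomputes everything it can and validates the rest, regardless of what the client claims. In the Python sketch below, the catalog, the quantity limit and the function name price_order are assumptions invented for the illustration.

```python
from typing import Optional

# Hypothetical server-side catalog; prices are stored in cents.
CATALOG = {"sku-1001": 1999, "sku-1002": 4950}
MAX_QUANTITY = 100


def price_order(item_id: object, quantity: object,
                client_claimed_total: object = None) -> Optional[int]:
    """Return the server-computed total in cents, or None if invalid.

    client_claimed_total is accepted only to show that it is ignored:
    the price always comes from the server-side catalog.
    """
    if not isinstance(item_id, str) or item_id not in CATALOG:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        return None
    if not 1 <= quantity <= MAX_QUANTITY:
        return None
    return CATALOG[item_id] * quantity
```

A request that claims a total of -500 for two units of "sku-1001" is still priced at 3998 cents, and a negative quantity is rejected outright, even if client side script was supposed to have prevented both.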

 

Conclusion

Data validation is an important aspect of the security of an application. Most common attacks against software, from the age-old buffer overflow to the more recent SQL injection, can be prevented or mitigated by the use of effective data validation. As with most other things in software security, data validation is hard to add on at the end and must be considered from day one as an integral and central part of the application architecture. This ensures that no assumptions are made and that individual developers do not have to be concerned with data validation in each of their components. Once architected properly, data validation simply comes down to basic length, range, format and type checks.

 

Summary

Data validation must be centralized in the architecture and design of an application. This allows all the business rules around validation to be implemented across the application in a consistent and effective manner. This central strategy must account for the notion of trust boundaries, which are useful in defining where and when data validation must be performed. They can also help minimize performance overhead by preventing excessive or repeated validation. Finally, it is important to consider how to validate data, and here it is useful to consider four primary attributes of data: length, the size of the data element or stream; range, the allowed or acceptable values based on both business rules and computer science principles; format, the allowed representations of data, which are governed by the real world entity being modeled in software; and finally type, which as the name suggests is concerned with the raw data types used to store the data element. With these four basic properties accounted for consistently, most data validation attack vectors can be negated or mitigated.



 [1] Mitigating SQL injection goes beyond just data validation and includes database configuration and permissioning, role based access control that adheres to the principle of least privilege, and the use of stored procedures and parameterized queries.

 [2] http://www.foundstone.com/resources/proddesc/validator.htm

 [3] http://msdn2.microsoft.com/en-us/library/ms861501.aspx and https://buildsecurityin.us-cert.gov/daisy/bsi-rules/271.html

 [4] http://msdn2.microsoft.com/en-us/library/868290ew.aspx

 [5] http://weitz.de/regex-coach/

 [6] http://sourceforge.net/projects/regulator/

 [7] http://en.wikipedia.org/wiki/Chroot

 [8] http://msdn2.microsoft.com/en-us/library/5f6c9f8h.aspx

 [9] http://www.parosproxy.org/