Crawling Ajax-driven Web 2.0 Applications
Crawling web applications is one of the key phases of automated web application scanning. The objective of crawling is to collect all possible resources from the server so that vulnerability detection can be automated against each of them. A resource overlooked during this discovery phase can mean a missed vulnerability. The introduction of Ajax poses new challenges for the crawling engine, and these challenges demand new ways of handling the crawling process. The objective of this paper is to address this issue with a practical approach using rbNarcissus, Watir and Ruby.
Problem domain and new approach
Usually crawling engines are “protocol-driven”: they open a socket connection to the target host or IP address and port. Once a connection is in place, the crawler sends HTTP requests and tries to interpret the responses. All these responses are parsed, and resources are collected for future access. This parsing process is crucial: the crawler tries to collect all possible resources by fetching links, scripts, Flash components and other significant data.
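As a baseline, here is a minimal sketch of a protocol-driven crawler in Ruby using only the standard library. The start URL is a placeholder and the link-extraction regex is deliberately simplistic; the point is that such a crawler only ever sees resources present in the raw HTML response.

require 'net/http'
require 'uri'

# Minimal protocol-driven crawler: speaks plain HTTP and mines the raw
# response body for links. Anything generated by JavaScript is invisible.
def crawl(start_url)
  visited = []
  queue   = [URI(start_url)]

  until queue.empty?
    uri = queue.shift
    next if visited.include?(uri.to_s)
    visited << uri.to_s

    response = Net::HTTP.get_response(uri)
    next unless response.is_a?(Net::HTTPSuccess)

    # Collect href targets from the static HTML only.
    response.body.scan(/href=["']([^"']+)["']/i).flatten.each do |link|
      begin
        absolute = URI.join(uri.to_s, link)
        queue << absolute if absolute.host == uri.host
      rescue URI::Error
        next # skip malformed links
      end
    end
  end

  visited
end

puts crawl('http://example.com/')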
The “protocol-driven” approach does not work when the crawler comes across an Ajax-embedded page, because the target resources are part of JavaScript code and are embedded in the DOM context. It is important to both understand and trigger this DOM-based activity. This has led to another approach, called “event-driven” crawling, which has the following three key components (sketched in code after the list):
1. JavaScript analysis and interpretation with linking to Ajax
2. DOM event handling and dispatching
3. Dynamic DOM content extraction
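For component 1, the JavaScript collected over HTTP has to be analyzed for Ajax call targets. The paper's toolchain uses rbNarcissus for full parsing; the sketch below substitutes a simple pattern match for a real parse-tree walk, so the function name and the XHR pattern are illustrative assumptions, not rbNarcissus API.

# Illustrative stand-in for JavaScript analysis: flag URLs handed to
# XMLHttpRequest's open(). A real implementation would walk the
# rbNarcissus parse tree instead of pattern-matching the source text.
def extract_ajax_targets(js_source)
  targets = []

  # Matches calls such as: http.open("GET", "/getnews.aspx?date=today")
  js_source.scan(/\.open\s*\(\s*["'](?:GET|POST)["']\s*,\s*["']([^"']+)["']/i) do |m|
    targets << m.first
  end

  targets.uniq
end

js = <<~JS
  function loadNews() {
    var http = new XMLHttpRequest();
    http.open("GET", "/getnews.aspx?date=today");
    http.send();
  }
JS

puts extract_ajax_targets(js)   # => /getnews.aspx?date=today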
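Components 2 and 3 call for a real browser so that the JavaScript actually executes. The sketch below drives one with Watir; the target URL, the [onclick] selector and the one-second wait are placeholder choices, and a recent Watir API (Watir::Browser) is assumed rather than the IE-only interface of early versions.

require 'watir'

browser = Watir::Browser.new
browser.goto 'http://example.com/ajax-app/'

# Component 2: dispatch DOM events a protocol-driven crawler never fires.
# Clicking elements with onclick handlers triggers their XHR calls.
browser.elements(css: '[onclick]').each do |element|
  element.click
  sleep 1 # crude wait for the Ajax response to update the DOM
end

# Component 3: extract content from the rendered DOM, not the raw HTTP
# body. Links injected by JavaScript are visible at this point.
puts browser.links.map(&:href).compact.uniq

browser.close

Note that clicking can navigate the browser away from the page; a production crawler would record state before each event and restore it afterwards, and would wait on the DOM rather than sleeping.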