webXray is a tool for analyzing webpage traffic and content, extracting legal policies, and identifying the companies which collect user data. A command line user interface makes webXray easy for non-programmers to use, and those with advanced needs may analyze billions of requests by fully leveraging webXray's distributed architecture. webXray has been used to run hundreds of concurrent browser sessions distributed across multiple continents.
webXray performs both "haystack" scans, which give insights into large volumes of network traffic, cookies, local storage, and websockets, and "forensic" scans, which preserve all file contents for use in reverse-engineering scripts, extracting advertising content, and verifying page execution in a forensically sound way. An additional module, policyXray, finds and extracts the text of privacy policies, terms of service, and other related documents in several languages.
For extensive details on how the system works, please reference "Preserving Needles in the Haystack: A search engine and multi-jurisdictional forensic documentation system for privacy violations on the web" by Libert et al.
webXray is easy to use for anybody willing to try out a command line program, and the software can also scale to ingest billions of records. By default, webXray runs a single instance of Chrome, stores data in a SQLite database, and can be used on an average laptop. For many users this is sufficient. Advanced configurations of webXray use Postgres for storing data and leverage a distributed architecture whereby a server sends tasks to remote scanning nodes, each of which may be running many instances of Chrome. The distributed system has been used to control hundreds of browsers concurrently.
webXray uses a custom library of domain ownership data to chart the flow of data from a given third-party domain to its corporate owner and, if applicable, to parent companies. Tracking attribution reports produced by webXray are highly granular. Reports of the average number of third parties and cookies per site, the most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out of the box. A flexible data schema allows for the generation of custom reports as well as the authoring of extensions to add additional data sources.
webXray is open-source, licensed under GPLv3.
webXray works on Linux, macOS, and Windows. Detailed instructions for installing Python 3 and Chrome on Ubuntu and macOS are below. Note that while webXray can be run on Windows, detailed instructions are not currently available. If you already have Python 3 and Chrome installed you may skip to installation of webXray.
Installing on Ubuntu
Step One: Install Google Chrome (if you already have Chrome go to Step Two)
If you are using Ubuntu desktop, download Chrome here: https://www.google.com/chrome/
If you are on Ubuntu server, run the following commands:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
You may get dependency errors from dpkg; if so, run the following to resolve them:
sudo apt -f install
Run the following command to make sure Chrome is installed; if you get an error, try the above steps again or search the web for advice.
google-chrome --version
Step Two: Install pip3
While Ubuntu includes Python 3 by default, it does not include the Python 3 package manager pip3, so you will need to install it using this command:
sudo apt install python3-pip
Run the following command to make sure pip3 is installed; if you get an error, try the above steps again or search the web for advice.
pip3 --version
You are now ready to install webXray.
macOS Specific Directions
Step One: Install Chrome
Install Chrome from the official Google website.
Step Two: Install Homebrew
Homebrew is a command-line tool which helps you install and manage various other command-line tools. To install Homebrew, go to the following site and follow the instructions; note that it may take some time to download and install: https://brew.sh.
By default, Homebrew sends information to Google Analytics; you can disable that with the following command in the terminal (which you should have open after installing Homebrew):
brew analytics off
Step Three: Install Python3
Python 3 is needed to run webXray; enter the following command to install it:
brew install python3
To make sure you have the right version of Python installed, run the following command:
python3 --version
If you see 3.4 or above, you are good to go!
You are now ready to install webXray.
Basic Installation and Use
The basic installation uses Chrome as the browser in 'headless' mode, meaning you will not see the browser open the pages you are analyzing. Data is stored in a SQLite database and you do not need to install a database server; databases are created in the directory './webXray/resources/db/sqlite/'.
Once Python and Chrome are installed you can download webXray from GitHub or you can clone the GitHub repository using the following command:
git clone https://github.com/timlib/webXray.git
Next, from inside the webXray directory, use pip3 to install the needed Python packages:
pip3 install -r requirements.txt
If you want to extract page text (e.g., policies), you must download the file Readability.js from this address and copy it into the directory "webxray/resources/policyxray/". You can also do this via the command line.
Now webXray is ready to go! To use it, enter the following command:
python3 run_webxray.py
This is the interactive mode and will guide you to scanning a list of sample websites.
Important Note: If you are running webXray as the 'root' user in Linux it may not run properly due to limitations in Chrome. If webXray stalls or crashes after the 'Building List of Pages' message, run webXray as a non-root user.
Using webXray to Analyze Your Own List of Pages
The raison d'être of webXray is to allow you to analyze pages of your choosing. To do so, first put all of the page addresses you wish to scan into a text file and place that file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://"; otherwise, webXray will not recognize them as valid addresses. Once you have placed your page list in the proper directory you may run webXray and it will allow you to select your page list.
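For example, a small page list can be created from the terminal like this (the file name "my_pages.txt" is just an example):

```shell
# Create a sample page list; the file name is arbitrary.
mkdir -p page_lists
cat > page_lists/my_pages.txt <<'EOF'
https://example.com
https://example.org
http://example.net
EOF
# Sanity check: every line must start with http:// or https://
grep -cE '^https?://' page_lists/my_pages.txt
# prints: 3
```

If the count printed by grep matches the number of lines in your file, every address is in a form webXray will accept.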
Viewing and Understanding Reports
Once you have completed your data collection, use the interactive mode to guide you through generating an analysis. When the analysis is complete it will be output to the '/reports' directory. This will contain a number of CSV files; they are:
- db_summary.csv: a basic report of what is in the database and how many pages loaded
- aggregated_tracking_attribution.csv: details on percentages of sites tracked by different companies and their subsidiaries
- 3p_domain.csv: most frequently occurring third-party domains
- 3p_request.csv: most frequently occurring third-party requests
- 3p_uses.csv: percentages of pages with third-parties performing specified functions
- per_site_network_report.csv: pairings between page domains and third-party domains; you can import this file into network visualization software
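As a sketch of working with the network report outside of visualization software, the snippet below tallies third-party pairings per page with awk. The two-column layout and header names here are illustrative assumptions, not the documented schema of the CSV; adjust the field numbers to match your actual report file.

```shell
# Hypothetical sample of a per-site network report; real column names may differ.
cat > sample_network_report.csv <<'EOF'
page_domain,third_party_domain
example.com,doubleclick.net
example.com,google-analytics.com
news.test,google-analytics.com
EOF
# Count third-party pairings per page domain (skipping the header row).
awk -F, 'NR>1 {count[$1]++} END {for (p in count) print p, count[p]}' sample_network_report.csv | sort
# prints:
#   example.com 2
#   news.test 1
```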
Analyze a Single Page
Sometimes you just want to run a single quick scan. To do so, use the command below, making sure to replace "http://example.com" with the address of the site you want to scan.
python3 run_webxray.py -s http://example.com
The following sections detail how to leverage webXray's advanced functions. Unlike the directions above, they assume you are capable of light editing of Python 3 code, can set up web servers, and consider yourself a hacker (or an aspiring one).
Using Forensic Mode
webXray has two main modes: haystack and forensic. Haystack mode collects various types of data but does not store file contents, whereas forensic mode collects absolutely everything possible. Haystack is better for large datasets where the contents of specific pages are not known in advance (e.g., those with 'adult' content such as pornography, gambling, and the like), whereas forensic is best for smaller sets of pages where full evidence may need to be collected to prove a privacy violation, or to analyze script contents.
To switch from haystack to forensic mode, open 'run_webxray.py' and change this line:
config = utilities.get_default_config('haystack')
to this:
config = utilities.get_default_config('forensic')
These modes are actually determined by a collection of over 30 different parameters that can be customized for specific needs. If you want to create your own unique configuration (which is recommended when building very large datasets), open '/webxray/Utilities.py' and look at the 'get_default_config' function.
Speed and Scale
The first step in using the most advanced features of webXray is to set up a Postgres server. You can find instructions for this on the web. Once your database server is set up, open the file '/webxray/PostgreSQLDriver.py' and modify your connection settings. Next, open 'run_webxray.py' and change this line:
db_engine = 'sqlite'
to this:
db_engine = 'postgres'
Making the above changes will automatically allow webXray to start a browser instance for each CPU core you have, meaning that you may use the program the same way as with SQLite, and your scans will now be vastly faster.
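Since webXray starts one Chrome instance per core in this configuration, you can preview how many browsers would launch on a given Linux machine by checking the core count (on macOS, `sysctl -n hw.ncpu` gives the same information):

```shell
# Number of CPU cores = number of Chrome instances webXray will launch.
nproc
```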
Distribute Scanning Across Several Machines
To truly unleash the power of webXray you will need to use many machines to conduct scans in parallel. This may be done either in "worker" mode, which requires little additional configuration but relies on manual operation, or in "server" mode, which allows numerous computers to operate 24x7, performing any scanning tasks they are assigned. In distributed mode, scan data is stored in a result queue and you must run a separate process to unpack and preprocess the data.
Both worker and server modes require you to build a task queue of scan jobs to be performed. To create your task queue execute the following command:
python3 run_webxray.py --build_queue NAME_OF_DATABASE TASK URL_ADDRESS_LIST
You must replace NAME_OF_DATABASE with whatever you want your scan to be called. If the database exists, the queue will be flushed and new tasks added; otherwise, a new database will be created. The TASK parameter can be "get_scan" (single page load), "get_random_crawl" (loads several pages in a row from the same website), "get_crawl" (loads a predetermined list of pages stored in a JSON file), or "get_policy" (fully automatic detection and extraction of policies). URL_ADDRESS_LIST must be the name of the file your URLs are stored in, which must be in the 'page_lists' directory.
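As a concrete sketch, the snippet below assembles one such invocation; the database name, task, and list file here are hypothetical example values, not required names:

```shell
# Hypothetical example values -- substitute your own.
DB_NAME="wbxr_demo"        # name for the scan database
TASK="get_scan"            # get_scan | get_random_crawl | get_crawl | get_policy
URL_LIST="my_pages.txt"    # file in the page_lists directory
echo "python3 run_webxray.py --build_queue $DB_NAME $TASK $URL_LIST"
# prints: python3 run_webxray.py --build_queue wbxr_demo get_scan my_pages.txt
```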
Distributed Scanning in Worker Mode
Once a queue is built you can set up machines to process tasks from it. First, install webXray on an additional machine following the instructions above, then update 'run_webxray.py' to use Postgres and update the connection settings in 'PostgreSQLDriver.py' as detailed above. Once that is done you can set up a worker task to run in the background using the command below. Note that you must specify the name of the database; the process will need to be restarted if you want it to process tasks from a different database.
nohup python3 run_webxray.py --worker NAME_OF_DATABASE &
Distributed Scanning in Server Mode
To get the most out of webXray you should use server mode. This mode first requires you to set up a web server capable of running Flask applications. In testing, a combination of NGINX and Gunicorn has performed very well.
Prior to starting the server you must create a new database with the name 'wbxr_server_config' and use the file 'webxray/resources/db/postgresql/wbxr_server_db_init.sql' to instantiate it. Once that is done, you can start the server using the following command:
gunicorn3 --workers=32 -k gevent -t 240 start_server:app --daemon
Now that your server is running you will need to set up clients. Unlike worker mode, the clients do not need the Postgres connection details, meaning you can run the client on a machine in a theoretically lower-trust environment. However, the machine must have a fixed IP address and hostname and must be entered into the 'client_config' table, or the server will silently ignore it. The only change needed to get the client running is to modify the following line with your server URL:
client = Client('YOUR_SERVER_URL')
Now you are ready to start a scan client. Note that while it will run indefinitely, meaning you can essentially deploy it and forget about it, Chrome does crash occasionally, so setting up a regular cron job to kill Chrome and Python and restart the client is advised.
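One way to set up such a restart job is a crontab entry along the following lines; the six-hour schedule, the process patterns, and the /opt/webXray install path are assumptions to adapt to your own deployment:

```shell
# Sketch: write a cron job that restarts the scan client every 6 hours.
# The schedule and the /opt/webXray path are assumptions -- adjust as needed.
cat > webxray_restart.cron <<'EOF'
0 */6 * * * pkill -f chrome; pkill -f run_webxray.py; cd /opt/webXray && nohup python3 run_webxray.py --run_client &
EOF
```

Install the file with `crontab webxray_restart.cron` (note this replaces any existing crontab for the user; use `crontab -e` to merge it by hand instead).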
At first the client will have no jobs to do; for the client to become active, you must assign it to a database in the table client_config. Start the client with the following command:
nohup python3 run_webxray.py --run_client &
Data Preprocessing and Storage in Distributed Modes
When operating in distributed mode, the results of scans are sent to a result queue, and you need dedicated machines to process results from that queue. It is recommended you have roughly 1/3 as many machines dedicated to storage as to scanning (e.g., if you have 20 scan nodes you should have around 6 storage nodes). To set up a storage node, execute this command and it will run indefinitely in the background.
nohup python3 run_webxray.py --store_queue &
If you are having problems installing the software or find bugs, please open an issue on GitHub. This software is produced as a hobby, and personalized support and consulting are not available.