Correcting Survey Measurement Error With Big Data from Road Sensors Through Capture-recapture

Non-probability based sensor data is becoming increasingly popular in social science and official statistics. This development can be explained by decreasing survey response rates, increasing costs for data collection, demand for more frequent and real-time statistics, and a general discussion on the quality of random sample surveys.
However, sensor data is currently rarely used in statistical production due to its unknown data generating process.
The integration of sensor data could be particularly valuable if it can be linked with survey and administrative data. Empirical research on linkable datasets is therefore needed to evaluate whether enhancing survey data with administrative and sensor data improves the accuracy of survey point estimates.

Time-based diary surveys impose a heavy response burden and yield low response rates. In the past, mobility and transport diary surveys have been validated and adjusted using mobile GPS devices.
These validation studies have shown that such surveys often yield considerably downward-biased estimates due to underreporting. Within the Total Survey Error framework, underreporting can be considered a function of measurement and representation errors. Measurement errors are generally hard to quantify because external sources for validation are seldom available.
Representation errors, the other branch of the framework, can be corrected for by comparing the distribution of auxiliary information between sample and population. In this thesis, instead of using mobile GPS devices, external data from permanently installed road sensors is used to estimate underreporting in the Dutch Road Freight Transport Survey.

This thesis uses 2015 data from the Dutch Road Freight Transport Survey and from the Weigh-in-Motion road sensor network operated by the Dutch national road administration. In the survey, a probability sample of vehicle owners must report the trips and transported shipment weights of the sampled vehicle for a specified week.
Eighteen sensor stations on Dutch highways continuously weigh every passing vehicle and identify it with a camera system that scans the license plate. Each vehicle in the survey can be linked one-to-one with the corresponding sensor observation and administrative registers using the combination of the license plate and timestamp as a unique identifier. Since the national vehicle register provides the empty weight of each vehicle and trailer, the shipment weight can be calculated.
Thus, the sensors and survey independently measure the same target variables: the occurrence of journeys and the weight of the shipments. Additional variables are available from administrative registers, such as technical specifications of the vehicles and administrative details of the vehicle owners.
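The linkage and weight calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the use of the calendar day as the timestamp granularity, and all values are hypothetical and are not taken from the thesis data.

```python
from datetime import date

# Hypothetical register: empty weight in kg of vehicle plus trailer, keyed by license plate.
register_empty_kg = {"AB-12-CD": 8000, "EF-34-GH": 9500}

# Hypothetical sensor passes (gross weight measured by a Weigh-in-Motion station).
sensor_passes = [
    {"plate": "AB-12-CD", "day": date(2015, 3, 2), "gross_kg": 20500},
    {"plate": "EF-34-GH", "day": date(2015, 3, 3), "gross_kg": 9200},
]

# Hypothetical survey responses (reported shipment weight per trip day).
survey_trips = [
    {"plate": "AB-12-CD", "day": date(2015, 3, 2), "reported_kg": 12000},
    {"plate": "AB-12-CD", "day": date(2015, 3, 4), "reported_kg": 5000},
]


def link_records(survey_trips, sensor_passes, register_empty_kg):
    """Link each sensor pass to a survey trip via (plate, day) and derive the shipment weight."""
    survey_by_key = {(t["plate"], t["day"]): t for t in survey_trips}
    linked = []
    for p in sensor_passes:
        key = (p["plate"], p["day"])
        empty_kg = register_empty_kg.get(p["plate"])
        # Shipment weight = measured gross weight minus registered empty weight.
        shipment_kg = None if empty_kg is None else p["gross_kg"] - empty_kg
        trip = survey_by_key.get(key)
        linked.append({
            "key": key,
            "sensor_shipment_kg": shipment_kg,
            "in_survey": trip is not None,
            "reported_kg": trip["reported_kg"] if trip else None,
            # Flag implausible results so they can be reviewed during data editing.
            "negative_weight": shipment_kg is not None and shipment_kg < 0,
        })
    return linked


linked = link_records(survey_trips, sensor_passes, register_empty_kg)
for row in linked:
    print(row)
```

In this toy data, the second sensor pass has no matching survey trip and produces a negative derived shipment weight, which is the kind of record a data editing step would need to handle.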

This thesis develops a method to estimate underreporting by applying capture-recapture techniques. Six estimators are applied, compared, and discussed: a post-stratified survey estimator, a naive extension of the survey estimator, two conditional likelihood capture-recapture estimators, and two unconditional likelihood capture-recapture estimators.
The capture-recapture estimators correct for both nonresponse and measurement error, whereas the survey estimator is corrected for selective nonresponse only. A difference between the capture-recapture and survey point estimates can therefore be attributed to measurement error. Violations of the capture-recapture assumption of homogeneous capture probabilities are addressed by modeling heterogeneity in capture probabilities using logistic regression and log-linear models.
The effects of occasional violations of the perfect linkage assumption are evaluated within sensitivity analyses. The flexibility and the limitations of the applied estimators are evaluated in a stratified capture-recapture analysis.
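To make the capture-recapture idea concrete, the following minimal sketch uses the classical two-source Lincoln-Petersen estimator and Chapman's small-sample variant. These are deliberately simpler than the conditional and unconditional likelihood estimators applied in the thesis: they assume homogeneous capture probabilities and perfect linkage. The function names and all counts are invented for illustration.

```python
def lincoln_petersen(n_survey, n_sensor, n_both):
    """Two-source estimate of the total number of journeys (homogeneous capture assumed)."""
    if n_both <= 0:
        raise ValueError("at least one journey must be captured by both sources")
    return n_survey * n_sensor / n_both


def chapman(n_survey, n_sensor, n_both):
    """Chapman's variant of the Lincoln-Petersen estimator, less biased in small samples."""
    return (n_survey + 1) * (n_sensor + 1) / (n_both + 1) - 1


def underreporting_share(n_survey, n_hat):
    """Share of the estimated total that is missing from the survey."""
    return 1 - n_survey / n_hat


# Made-up counts: journeys reported in the survey, journeys seen by sensors, and linked journeys.
n_survey, n_sensor, n_both = 400, 250, 200

n_hat = lincoln_petersen(n_survey, n_sensor, n_both)
n_hat_chapman = chapman(n_survey, n_sensor, n_both)
share = underreporting_share(n_survey, n_hat)

print(f"Lincoln-Petersen total: {n_hat:.1f}")
print(f"Chapman total: {n_hat_chapman:.1f}")
print(f"Estimated underreporting share: {share:.1%}")
```

In this simple estimator, a false positive link increases the overlap count and therefore lowers the estimated total, which illustrates why linkage errors matter for capture-recapture estimates.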

All capture-recapture estimators yield larger estimates for the considered target variables than the survey-based estimator. According to the recommended log-linear estimator, the most likely amount of underreporting is about 18% for the occurrence of journeys and about 23% for the weight of the shipments. The proposed combination of data sources and methods seems to produce reasonable estimates given the literature on underestimation bias in mobility and transport surveys.
The stratified estimates show slightly larger amounts of underreporting in some subgroups, such as smaller companies and vehicles not driven for commercial purposes. Stratification also reveals the limitations of the capture-recapture estimators, for example, when strata are small. Concerning errors in the survey responses, the unconditional likelihood estimators are fairly robust against overreporting and sensitive to underreporting. Regarding errors in the sensor observations, the unconditional likelihood estimators are sensitive to false positive links but robust against OCR failures within the observed data.

This thesis also raises further questions and shows that more research is required; only a few points are mentioned here. First, the probability of false positives needs to be estimated since the results are sensitive to these. Second, the developed capture-recapture models can be improved, e.g., using interaction terms. Third, regarding generalizability, the analysis should be expanded to further years. Fourth, the sensor data editing process needs to be improved since the calculations can lead to negative shipment weights.

This thesis demonstrates a specific use of big data in official statistics for the estimation of underreporting bias. However, this method is not limited to official statistics and can also be used in other disciplines, such as the social sciences.
The method presented is applicable to any validation study in which survey, administrative, and sensor data (or any other external big data source) can be linked on a micro-level using a unique identifier. This research is a new example of multi-source statistics, a promising approach to increase the benefits of sensor data in the field of official statistics.



Use and reproduction:
All rights reserved