Making Sense of Syria's Murky Death Toll

DURHAM, N.C. -- Since it began in 2011, the Syrian Civil War has left hundreds of thousands of people dead. But as the conflict continues, many monitoring groups say they are starting to lose count.
Realities on the ground make it increasingly hard to pinpoint precise numbers -- so much so that in 2014 the United Nations announced it would stop updating its death toll due to accuracy concerns.
But some researchers are determined to do the best they can to keep Syria's death count -- which may have reached half a million -- from getting lost in the fog of war. Duke statistician Rebecca Steorts is one of them.
"This is probably the most important thing I've worked on," Steorts said.
Steorts and colleagues published an analysis in June which concluded that in the first three years of the conflict, between 190,102 and 193,646 named victims were reported.
The numbers themselves aren't new. A similar figure was reported in 2014 by the non-profit Human Rights Data Analysis Group, or HRDAG. What's new is the way the researchers arrived at them.
For the past five years, Steorts has been developing state-of-the-art statistics and machine learning techniques to help human rights groups take on the grim task of tallying Syria's war dead.
To show how their methods work, Steorts and colleagues Beidi Chen and Anshumali Shrivastava at Rice University analyzed roughly 354,000 Syrian death records for the period between March 2011 and April 2014.
Provided by the non-profit HRDAG, the death records data consisted of overlapping casualty lists collected by different groups, each with access to a different snapshot of the violence. Each victim is identified by name, gender, plus the date and location of their death.
But combining these records is complicated by the fact that some victims were recorded more than once. The news of someone's death may come from family members or witnesses, but also from hospital or morgue records, or the information may be obtained from social media.
It sounds like an easy problem. To make sure deaths aren't double-counted, just go through the reports and weed out duplicates. But some of the names are misspelled; prefixes, suffixes and nicknames are inconsistent; dates and locations are inexact.
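To see why exact matching falls short, here is a minimal Python sketch. The names are invented for illustration and do not come from the Syrian data, and the standard-library similarity score used here is only a stand-in for whatever comparison the researchers actually used.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Invented example names: two spellings of one name, plus an unrelated one.
name_a = "Mohammed Al-Hassan"
name_b = "Mohamad Alhasan"
name_c = "Fatima Khoury"

# Exact comparison treats the two spellings as different people.
exact_match = name_a == name_b

# A similarity score ranks the variant spelling far above the unrelated name.
score_variant = similarity(name_a, name_b)
score_unrelated = similarity(name_a, name_c)

print(exact_match)
print(round(score_variant, 2), round(score_unrelated, 2))
```

A score like this still needs a cutoff for what counts as "the same person," which is part of what makes the problem hard.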
There's also a scaling issue. Checking every possible pairing of the 354,000 records in the Syrian data set to determine whether they refer to the same person or not would mean comparing 63 billion pairs.
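The 63 billion figure follows from counting unordered pairs: n records yield n(n - 1)/2 possible pairings. A quick check in Python, using the rounded record count quoted above:

```python
from math import comb

n_records = 354_000  # approximate record count quoted in the article

# Number of unordered pairs: n * (n - 1) / 2
n_pairs = comb(n_records, 2)

print(f"{n_pairs:,}")  # 62,657,823,000, i.e. roughly 63 billion
```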
So Steorts and her team came up with a different approach. To reduce the number of comparisons, they relied on a technique called "locality sensitive hashing." Records with similar names, locations, and dates of death were grouped together, and only records within the same group -- those with a reasonable chance of being duplicates -- were compared.
Out of 63 billion possible pairs, they only dealt with 450,000 -- more than 99 percent fewer pairs.
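The grouping idea can be sketched in a few lines of Python. This is a toy illustration of one common form of locality sensitive hashing (MinHash signatures split into bands), not the researchers' actual implementation. The records, the function names, and the parameters `k`, `bands` and `rows` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from math import comb


def shingles(text, k=3):
    """Break a record into overlapping k-character pieces."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(tokens, num_hashes):
    """One minimum hash value per seed; similar token sets tend to agree."""
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        )
        for seed in range(num_hashes)
    )


def candidate_pairs(records, bands=6, rows=2):
    """Return index pairs that share at least one signature band."""
    buckets = defaultdict(set)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Invented toy records (name | location | date), not real data.
records = [
    "Mohammed Al-Hassan | Aleppo | 2012-07-14",
    "Mohammed Al-Hassan | Aleppo | 2012-07-14",  # exact duplicate report
    "Fatima Khoury | Homs | 2013-02-03",
]
pairs = candidate_pairs(records)
print((0, 1) in pairs)  # True: identical records always share every band

# Reduction implied by the article's figures: 450,000 of all possible pairs.
reduction = 1 - 450_000 / comb(354_000, 2)
print(f"{reduction:.4%}")
```

Only pairs that land in the same bucket are ever compared, so the expensive record-by-record check runs on a small fraction of the data. More bands catch more near-duplicates at the cost of more comparisons.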
The researchers presented their work at the 2018 Joint Statistical Meetings in Vancouver in July.
With their method, they estimate with 95 percent certainty that the number of identified victims killed between March 2011 and April 2014 was 191,874, plus or minus 1,772 -- which closely matches the HRDAG estimate of 191,369.
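The arithmetic behind that comparison is easy to check. The short Python snippet below uses only the figures quoted in this article; the variable names are chosen for illustration.

```python
point_estimate = 191_874  # researchers' estimate of identified victims
margin = 1_772            # plus-or-minus range at 95 percent
hrdag_estimate = 191_369  # HRDAG's earlier figure

lower = point_estimate - margin
upper = point_estimate + margin

print(lower, upper)                      # 190102 193646
print(lower <= hrdag_estimate <= upper)  # True
print(point_estimate - hrdag_estimate)   # 505
```

The interval reproduces the 190,102 to 193,646 range reported in the published analysis, and the HRDAG figure falls inside it, 505 below the new point estimate.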
The estimate produced by HRDAG relied on human experts to review the pairs and decide which were matches, a time-consuming manual task. The machine learning model from Steorts and her colleagues delivered results in as little as two minutes.
Unfortunately, both estimates are likely to be undercounts, the researchers say. Many violent incidents go unreported; bodies are unidentified.
What's more, their work takes the Syrian death toll only to 2014, but the conflict continues. What about the four-plus years since?
One of the more recent reports, issued by the Syrian Center for Policy Research in 2016, put the death toll at more than twice the figure issued by the United Nations in 2014.
In March 2018, the Syrian Observatory for Human Rights said the war had killed more than half a million people.
Numbers matter. Putting a figure on the human price of war can help fuel support for humanitarian assistance, drive political action and hold perpetrators accountable, Steorts says.
Steorts' research won't make it any easier for monitoring groups to collect Syrian casualty data on the ground, where the security situation makes it difficult to access certain areas. But as Syria's death toll continues to rise, she hopes their work will help make sense of the messy data they have, and update Syria's casualty count more efficiently and with greater certainty than previous methods.
"It's amazing what they're doing on the ground," Steorts said. "The groups that are there are doing a phenomenal job. They're risking their lives."
"As time goes on, and the conflict becomes more challenging, I think that the standard error will very likely go up," Steorts said. "But that's not the same as saying it's impossible to estimate how many people have died. It's a difficult problem, but I don't think it's impossible. And we want to get it right."
For her de-duplication research, Steorts received a five-year award from the National Science Foundation in 2017. She also won a three-year NSF grant in 2015, and was named one of the world's top 35 innovators under 35 by MIT Technology Review magazine in 2015, among other honors.
Steorts is also using her expertise at developing methods for merging large, noisy datasets to make sure that nobody is counted twice for the upcoming U.S. Census. She and her team led two on the topic last year.
This work was made possible through a collaboration with the Human Rights Data Analysis Group.
CITATION: "Unique Entity Estimation With Application to the Syrian Conflict," Beidi Chen, Anshumali Shrivastava and Rebecca Steorts. Annals of Applied Statistics, June 2018.