Hate Crime Analysis based on Artificial Intelligence Methods

Hate crimes take a toll on American citizens and harm public safety. It is essential for researchers to explore the factors that lead to hate crimes. This research uses machine learning methods to examine the relationship between hate crimes and factors including income inequality, median household income, and race. Machine learning, an important branch of artificial intelligence, is well suited to finding relationships between variables. The research is based on a dataset of hate crime rates at the time of the 2016 U.S. presidential election as well as hate crime rates in every U.S. state from 2010 to 2015. Simple linear regression and multiple linear regression are used to describe the factors that influence the crime rate and their contributions, such as the share of white poverty, the share of non-white residents, or the median household income. Then, K-means is applied to classify hate crimes into 5 levels according to the crime rate. Furthermore, K-Nearest Neighbors is used to demonstrate the prediction of hate crime. Finally, histograms are used to show how hate crimes vary across states. From linear regression, the four factors with the highest correlation coefficients with hate crime are, in order, income inequality, median household income, the share of non-citizens, and race. Income inequality has the highest correlation coefficient with hate crime. From multiple linear regression, the highest R-squared values are obtained only when income inequality, median household income, and race are all included: 0.44 for the 2010 to 2015 hate crimes and 0.33 for the 2016 hate crimes. With the K-Nearest Neighbors method, hate crimes can be predicted with an accuracy of 40% from median household income. Adding the race factor raises the accuracy to 50%. In summary, income inequality, median household income, and race have a high impact on the crime rate. Median household income and race together can predict the crime rate with an accuracy of about 50%.


Introduction
Since the 1980s, "hate crime" has attracted widespread attention from the international community, including the United States, as a global social problem. The term "hate crime" first appeared in the United States [1]. Journalists and politicians coined it to describe crimes committed against Jews, Asians, and African Americans in American society. It was later recognized and became an official legal term as legislation of the same name advanced. The so-called "hate crime", also known as "prejudice crime" or "prejudice-induced crime", usually refers to criminal offenses against the person, property, or society of others that are motivated, in whole or in part, by the perpetrator's prejudice regarding race, religion, disability, sexual orientation, ethnicity, nationality, etc. [1].
Data from the "U.S. Crime Victims Survey" compiled by the Bureau of Justice Statistics of the U.S. Department of Justice show that from 2004 to 2015, the average number of victims of hate crimes in the U.S. reached 250,000 per year. According to a statistical report released by the FBI in 2017, the number of hate crime cases in the United States rose by approximately 4.6% from 2015 to 2016 [2]. The report said that about 11% of law enforcement agencies across the United States reported hate crime cases to the FBI in 2016; these cases totaled 6,121, with more than 7,500 victims. It is essential for researchers to explore the factors that influence hate crime before and after the election and the reasons for the increase.
Many researchers have studied hate crime from a variety of perspectives. For instance, Yangyu found that, from the beginning of the 20th century to 2016, economic factors were the main driver of the rise of crime in the United States [3]. Holder and Eaven analyzed hate crime in terms of political factors, "utilizing a longitudinal research design to assess the impact of theoretical predictors and political competition measures on hate crime prevalence in counties across three states (Tennessee, Virginia & West Virginia) over a seven-year span (2010-2016)" [4]. Comes and Navarro implemented "state-of-the-art learning algorithms" and machine learning to predict hate crimes in Chicago [5]. Li, Zhang, and Cao used Bayesian optimization to tune four algorithms, namely Bagging, decision tree, random forest, and Fully Connected Neural Network (FCNN), in order to predict terrorism [6].
However, few researchers have compared and analyzed the data from before and after the 2016 U.S. election or built models based on them. Moreover, it is difficult for individual researchers to build a model on their own with a relatively large dataset. Therefore, this paper compares hate crime data from before and after the 2016 U.S. election and builds machine learning models to analyze and predict hate crimes from different factors. The contribution of each factor before the election is compared with its contribution after the election.
This paper uses a dataset of hate crimes at the time of the 2016 U.S. presidential election together with the number of hate crimes committed in every U.S. state from 2010 to 2015 [7]. We analyze this dataset to determine the main factors, and some indirect factors, that contributed to the hate crimes and incidents in that period. For each state, the dataset contains the Gini index (gini_index); the median household income (median_household_income); the share of the population that is unemployed (share_unemployed_seasonal); the share of the population that lives in metropolitan areas (share_population_in_metro_areas); the share of adults 25 and older with a high-school degree (share_population_with_high_school_degree); the share of the population that are not U.S. citizens (share_non_citizen); the share of white residents who are living in poverty (share_white_poverty); the share of the population that is not white (share_non_white); the share of 2016 election voters who voted for Donald Trump (share_votes_voted_trump); hate crimes per 100,000 population in 2016 (hate_crimes_per_100k_splc); and the average annual hate crimes per 100,000 population between 2010 and 2015 (avg_hatecrimes_per_100k_fbi).
This paper uses three machine learning models to analyze and predict hate crimes from a variety of factors: linear regression, K Nearest Neighbors, and clustering [8]. First, correlation coefficients are calculated to analyze the relationship between hate crimes and the factors [9]. Then a simple linear regression model is built to analyze single factors, and a multiple linear regression model is built to analyze the combined contribution of the factors to hate crimes [10,11]. Next, K Nearest Neighbors is implemented to judge, by its accuracy, how well the models can predict hate crimes [12]. Finally, clustering and histograms are used to analyze how hate crime can be divided into levels and how it varies across states in America [13].

Correlation Analysis
To find the most appropriate predictors for our first two research problems, we create a heatmap of the correlation coefficients of all attributes. Figure 1 shows the correlation coefficient between every pair of attributes. The darker the cell, the more negatively correlated the two attributes are; the lighter the cell, the more positively correlated they are. As shown in Figure 1, share_votes_voted_trump is the attribute most negatively correlated with hate_crimes_per_100k_splc and avg_hatecrimes_per_100k_fbi, with coefficients of -0.66 and -0.57, respectively. The Gini index is the attribute most positively correlated with the two values, with coefficients of 0.33 and 0.45, respectively. This means that share_votes_voted_trump and the Gini index are strongly associated with hate crimes both before and after the election. The magnitude of the correlation coefficient of share_votes_voted_trump is 0.09 greater after the election than before. The coefficient of the Gini index is 0.11 lower after the election than before, which means the influence of the Gini index is weaker after the election. Other attributes, such as median_household_income and share_non_citizen, also have a relatively high correlation with hate crimes before and after the election, but they differ little between the two periods.
Figure 1. Heatmap of the correlation coefficients

Analysis
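As a rough sketch of how the correlation coefficients behind Figure 1 could be computed, the snippet below builds a Pearson correlation matrix with NumPy and ranks attributes by the strength of their correlation with a target column. The numbers are made-up toy values for five hypothetical states, not the paper's data, and the helper name `strongest_correlates` is an assumption introduced only for illustration.

```python
import numpy as np

# Toy values for five hypothetical states (NOT the paper's data).
columns = ["gini_index", "share_votes_voted_trump", "avg_hatecrimes_per_100k_fbi"]
data = np.array([
    [0.42, 0.60, 1.2],
    [0.44, 0.55, 1.9],
    [0.46, 0.48, 2.4],
    [0.48, 0.40, 3.1],
    [0.50, 0.35, 3.0],
])

# rowvar=False treats each column as one variable; the result is a
# symmetric matrix of Pearson coefficients with ones on the diagonal.
corr = np.corrcoef(data, rowvar=False)


def strongest_correlates(corr, columns, target):
    """Return (name, r) pairs for `target`, sorted by |r| in descending order."""
    t = columns.index(target)
    pairs = [(c, float(corr[i, t])) for i, c in enumerate(columns) if i != t]
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


ranking = strongest_correlates(corr, columns, "avg_hatecrimes_per_100k_fbi")
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

The matrix `corr` is what a heatmap would display; a plotting library could color each cell by its coefficient.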

Linear Regression
In the first case, we focus on the relationship between income inequality and hate crimes. According to Figure 1, gini_index has an average correlation coefficient of 0.39 with hate_crimes_per_100k_splc and avg_hatecrimes_per_100k_fbi, which is greater than median_household_income's average correlation coefficient of 0.33. Thus, we believe that the Gini index reflects the income differences among states better than median household income does. The Gini index ranges from 0 to 1. A Gini index greater than 0.4 indicates significant polarization between the rich and the poor and a wide income gap; a Gini index close to 0 usually indicates a small income gap and a trend towards equality of per capita household income. A linear regression model (Figure 2) is constructed to predict hate crimes from the Gini index and to compare whether hate crimes in the two periods differ for the same Gini index (income difference). Figure 2(a) shows that as gini_index ranges from 0.42 to 0.52, the hate crime rate before the election increases linearly from 1 to 5. Figure 2(b) shows a linear increase in the hate crime rate after the election from 0.18 to 0.6 over the same range of gini_index. The results indicate a positive linear relationship between hate crimes and the Gini index both before and after the election. The fitted regression lines are given by Equations (1) and (2). With gini_index as the independent variable, the model for hate crimes before the election has a slope of 37.36, and the model for hate crimes after the election has a slope of 4.02, which agrees with the ranges read from Figure 2 (increases of roughly 4 and 0.42, respectively, per 0.1 increase in gini_index). This means that after the election, hate crimes are significantly less sensitive to variation in income inequality. For the same increase in income inequality, hate crimes seem to increase much less after the election than before.
We also constructed another linear regression model to predict hate crimes from the racial composition of the population. In this case, we use share_non_white as the predictor of hate crimes and incidents. Figure 3(a) shows that as share_non_white ranges from 0.1 to 0.6, the hate crime rate before the election increases from 2 to 3. Figure 3(b) shows an increase in the hate crime rate after the election from 0.3 to 0.4 over the same range of share_non_white. As Equations (3) and (4) show, with share_non_white as the independent variable, a linear regression model with a slope of 0.8 could be constructed for the hate crimes before the election, and a linear regression model with a slope of 1.192 could be constructed for the hate crimes after the election. This means that after the election, hate crimes are less sensitive to the variation of race. For the same increase in share_non_white, hate crimes seem to increase less after the election than before. However, there is not a strong linear relationship between race and crime.
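The single-predictor fits described above can be sketched with an ordinary least-squares line. The example below is a minimal illustration on made-up toy values (not the paper's data); the function name `fit_line` is hypothetical, and `np.polyfit` is used here as one convenient way to obtain the slope and intercept, not necessarily the tool used in the study.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept; also returns R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy values (NOT the paper's data): one predictor, two outcome columns.
gini = [0.42, 0.44, 0.46, 0.48, 0.50, 0.52]
rate_before = [1.1, 1.8, 2.9, 3.2, 4.4, 4.9]
rate_after = [0.20, 0.25, 0.36, 0.41, 0.52, 0.58]

slope_b, intercept_b, r2_b = fit_line(gini, rate_before)
slope_a, intercept_a, r2_a = fit_line(gini, rate_after)
print(f"before: slope={slope_b:.2f}, R^2={r2_b:.2f}")
print(f"after:  slope={slope_a:.2f}, R^2={r2_a:.2f}")
```

Comparing the two slopes for the same predictor is the kind of before/after comparison made in the text; the R-squared value indicates how much of the variation the single predictor explains.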

Multiple linear regression
To explore the combined effect of population characteristics on hate crimes, multiple linear regression is implemented [14]. The characteristics of the population are described by income inequality (expressed by gini_index), median household income, the share of the population that lives in metropolitan areas, the share of the population that is not white, the share of the population that are not U.S. citizens, and the share of the population that is unemployed (seasonally adjusted). The attributes are added one by one to construct the regression model. For each one, we judge whether the variable should be retained according to the p-value of its t-test. When the p-value is less than 0.05, the attribute passes the t-test and is retained. Then the next attribute is considered.
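The add-one-attribute-at-a-time procedure can be sketched as below. This is a dependency-light illustration under several assumptions: the data are synthetic standardized toy columns (not the paper's data), the helper names (`t_two_sided_p`, `ols_p_values`, `forward_select`) are hypothetical, and the t-test p-value is obtained by numerically integrating the Student's t density rather than by calling a statistics library.

```python
import math
import numpy as np


def t_two_sided_p(t_stat, dof, steps=20000):
    """Two-sided Student's t p-value via trapezoidal integration of the pdf."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    grid = np.linspace(0.0, abs(t_stat), steps + 1)
    pdf = np.exp(log_c - (dof + 1) / 2 * np.log1p(grid ** 2 / dof))
    h = grid[1] - grid[0]
    area = h * (pdf.sum() - 0.5 * (pdf[0] + pdf[-1]))  # integral over [0, |t|]
    return max(0.0, 1.0 - 2.0 * area)


def ols_p_values(X, y):
    """Fit y = b0 + X @ b; return (coefficients, p-values, R-squared)."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof                # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    p = np.array([t_two_sided_p(b / s, dof) for b, s in zip(beta, se)])
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, p, r2


def forward_select(features, y, order, alpha=0.05):
    """Add attributes in `order`; keep one only if its own p-value < alpha."""
    kept = []
    for name in order:
        trial = kept + [name]
        X = np.column_stack([features[c] for c in trial])
        _, p, _ = ols_p_values(X, y)
        if p[-1] < alpha:       # p-value of the attribute just added
            kept = trial
    return kept


# Synthetic standardized toy columns (NOT the paper's data), 51 rows.
rng = np.random.default_rng(0)
n = 51
features = {name: rng.normal(size=n) for name in
            ["gini_index", "median_household_income", "share_unemployed_seasonal"]}
y = (1.0 * features["gini_index"] + 0.8 * features["median_household_income"]
     + rng.normal(scale=0.5, size=n))

kept = forward_select(features, y, order=list(features))
print("retained attributes:", kept)
```

In this sketch only the p-value of the newly added attribute is checked; as the text goes on to note, the p-values of attributes already in the model can also change when a correlated attribute is introduced, so a fuller implementation might re-check them.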
For the data from before the election, gini_index, which has the highest correlation coefficient, is used first to construct a regression model whose dependent variable is avg_hatecrimes_per_100k_fbi.
As shown in Table 1, the variable passes the t-test with a p-value of 0.000, which is less than 0.05, so we retain it and consider adding further variables. Then, median_household_income, which has the second-highest correlation coefficient, is added to the multiple regression model. This variable also passes the t-test, with a p-value of 0.007, which is less than 0.05, so we again consider adding further variables. Similarly, share_unemployed_seasonal, which has the third-highest correlation coefficient, is added to the multiple regression model. Its t-test fails with a p-value of 0.762, which is higher than 0.05, so we eliminate the variable.

Table 1. OLS Regression Results Before Election
Then, we introduce share_population_in_metro_areas into the multiple regression model. Its t-test also fails, with a p-value of 0.102, which is higher than 0.05, so we eliminate the variable. Next, we introduce share_non_citizen. Its t-test fails with a p-value of 0.295, which is higher than 0.05, so we eliminate the variable. Finally, share_non_white passes the t-test with a p-value of 0.044.
When adding the attributes one by one, we find that the p-values of other variables can suddenly increase when a certain variable is introduced. The likely reason is a strong correlation between the two variables. If this situation occurs, we should check the correlation heatmap. After adding and eliminating variables, we obtain a regression model with three predictors.
For the data after the election, gini_index, which has the highest correlation coefficient, is used first to construct a multiple regression model whose dependent variable is hate_crimes_per_100k_splc. As shown in Table 2, the variable passes the t-test with a p-value of 0.001, which is less than 0.05, so we retain it and consider adding further variables. Then, median_household_income, which has the second-highest correlation coefficient, is added to the multiple regression model. This variable also passes the t-test, with a p-value of 0.007, which is less than 0.05, so we again consider adding further variables. Similarly, share_non_citizen, which has the third-highest correlation coefficient, is added to the multiple regression model. Its t-test fails with a p-value of 0.363, which is higher than 0.05, so we eliminate the variable. Then, we introduce share_unemployed_seasonal. Its t-test also fails, with a p-value of 0.950, which is higher than 0.05, so we eliminate the variable. Next, we introduce share_population_in_metro_areas. Its t-test fails with a p-value of 0.418, which is higher than 0.05, so we eliminate the variable. Finally, share_non_white passes the t-test with a p-value of 0.044.

Table 2. OLS Regression Results After Election
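\[
\text{avg\_hatecrimes\_per\_100k\_fbi} = 7.339 \cdot \text{gini\_index} + 1.128 \times 10^{-5} \cdot \text{median\_household\_income} - 0.4433 \cdot \text{share\_non\_white} + \beta_0 \tag{6}
\]
Equation (6), in which \beta_0 denotes the intercept, summarizes the parameters and their coefficients for predicting hate crime before the election.
It shows that the Gini index, median_household_income, and share_non_white together build a multiple linear regression model for the hate crimes before the election, with a coefficient of 7.339 for the Gini index, 1.128e-05 for median_household_income, and -0.4433 for share_non_white.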
After the election, the model likewise shows that income inequality, median household income, and race have a predictive effect on hate crime and can help explain it.

K Nearest Neighbors
In this part, we predict hate crimes from the characteristics of the population, including race. Since the dataset has no categorical value that represents the level of hate crime directly, we use clustering to assign labels to the values. We use 5 labels, 0 to 4, representing low to high crime levels. We then add the labels to the dataset as a new attribute and use it as the y value in KNN. The scatter plot in Figure 4 shows the hate crime data for the two periods. In this graph, we choose K=5 clusters. The red dots are the centroids of the clusters. Each cluster has a different color and represents a level of hate crime from 0 to 4.
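The labeling step can be sketched as follows: cluster the two hate crime rates with K-means (K=5) and renumber the clusters so that label 0 is the lowest level and label 4 the highest. This is a minimal NumPy version of Lloyd's algorithm on made-up toy points, not the paper's data. The deterministic initialization and the ordering of levels by the sum of centroid coordinates are assumptions made for this illustration, since the text does not specify either.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Deterministic start: spread initial centroids over points sorted by row sum.
    order = np.argsort(points.sum(axis=1))
    centroids = points[order[np.linspace(0, len(points) - 1, k).astype(int)]]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(points, centroids), centroids


def to_levels(labels, centroids):
    """Renumber clusters so level 0 has the smallest centroid and k-1 the largest."""
    rank = np.argsort(np.argsort(centroids.sum(axis=1)))
    return rank[labels]


# Toy (splc rate, fbi rate) pairs for ten hypothetical states (NOT the paper's data).
points = np.array([
    [0.05, 0.30], [0.06, 0.35], [0.20, 1.50], [0.22, 1.60], [0.40, 2.90],
    [0.42, 3.00], [0.60, 5.00], [0.62, 5.10], [1.50, 10.90], [1.52, 11.00],
])
labels, centroids = kmeans(points, k=5)
levels = to_levels(labels, centroids)
print(levels)
```

The resulting `levels` array is what would be appended to the dataset as the new y attribute for KNN.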
For the predictors, we choose median_household_income, share_unemployed_seasonal, share_population_in_metro_areas, and share_non_white. We add them one by one to make predictions, as we did with the t-test. With median_household_income alone, the highest accuracy is around 0.4, and when we add share_unemployed_seasonal, it rises to 0.5. The accuracy does not change much when the last two variables are added, but fewer neighbors are then needed to reach a high accuracy. However, this approach has a drawback: there are two sources of error, clustering and classification. So, we came up with another approach.
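A minimal sketch of the classification step is shown below: a K Nearest Neighbors majority vote written directly in NumPy, scored with the same `np.where` accuracy expression quoted in the text. The data are made-up toy values (not the paper's data), the function names are hypothetical, and standardizing the predictors before computing distances is an assumption of this sketch, since the text does not say whether the predictors were scaled.

```python
import numpy as np


def standardize(train, test):
    """Scale both sets with the training mean and standard deviation."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std


def knn_predict(X_train, y_train, X_test, k):
    """Predict each test label by majority vote among the k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for row in X_test:
        dist = np.linalg.norm(X_train - row, axis=1)
        nearest = y_train[np.argsort(dist, kind="stable")[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[counts.argmax()])   # ties go to the lower level
    return np.array(preds)


def accuracy(prediction, y_test):
    """Share of exact matches, using the expression quoted in the text."""
    correct = np.where(prediction == y_test, 1, 0).sum()
    return correct / len(y_test)


# Toy rows: (median_household_income, share_non_white); NOT the paper's data.
X_train = [[40000, 0.10], [42000, 0.15], [44000, 0.12],
           [70000, 0.40], [72000, 0.45], [74000, 0.42]]
y_train = [0, 0, 0, 4, 4, 4]            # crime levels from the clustering step
X_test = [[41000, 0.11], [73000, 0.43]]
y_test = np.array([0, 4])

Xs_train, Xs_test = standardize(X_train, X_test)
preds = knn_predict(Xs_train, y_train, Xs_test, k=3)
for k in (1, 3, 5):
    acc = accuracy(knn_predict(Xs_train, y_train, Xs_test, k), y_test)
    print(f"k={k}: accuracy={acc:.2f}")
```

Looping over `k` mirrors the search for the number of neighbors that gives the highest accuracy; adding predictors one at a time would amount to widening the columns of `X_train` and `X_test`.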
In the second approach, we again take the splc and fbi hate crime rates as the y value, but we use them directly instead of the cluster labels. We multiply the values by 100 and then convert them into integers. This raises another problem. We usually use the code "correct = np.where(prediction==y_test,1,0).sum()" to calculate the accuracy of the prediction, but it does not work in this case, because the predicted values will almost never exactly equal y_test when the y values are not restricted to a fixed set of levels [15].
So, we instead use an error-rate function, which is commonly used to measure the error between predicted values and real data. When the error rate is less than or equal to 0.3, we count the prediction as accurate. The results are roughly the same as with the former approach, but this approach is not as stable as the previous one. Sometimes the highest accuracy is only 25%.
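The tolerance-based scoring can be sketched as below. The exact error-rate function is not reproduced in the text, so this example assumes a relative error, |prediction - actual| / |actual|, and counts a prediction as accurate when that value is at most 0.3. The function name `tolerance_accuracy` and the toy numbers are hypothetical.

```python
import numpy as np


def tolerance_accuracy(prediction, y_test, max_error=0.3):
    """Share of predictions whose relative error is at most `max_error`.

    Assumes relative error as the error rate and no zero values in y_test.
    """
    prediction = np.asarray(prediction, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    error_rate = np.abs(prediction - y_test) / np.abs(y_test)
    correct = np.where(error_rate <= max_error, 1, 0).sum()
    return correct / len(y_test)


# Toy integer targets (rates multiplied by 100); NOT the paper's data.
y_true = [120, 100, 200]
y_pred = [100, 50, 250]
print(tolerance_accuracy(y_pred, y_true))
```

With a tolerance instead of exact equality, a prediction of 100 for a true value of 120 counts as accurate, whereas the exact-match expression would score it as wrong.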

Clustering
Lastly, the third case examines how the number of hate crimes varies by state. We draw two histograms to show the data. The splc data show that the District of Columbia has the most hate crimes. Oregon and Washington are also higher than the other states. Arkansas and Mississippi have the fewest. The fbi data show that the District of Columbia has the most hate crimes and Wyoming the fewest.
Figure: The number of hate crimes before the election varies across states.
We then use clustering to combine the two variables.

Centroids and Labels
The scatter plot in Figure 9 shows the hate crime data for the two periods. In this graph, we choose K=5. The red dots are the centroids of the clusters, whose coordinates are shown in Figure 10. Each cluster represents a level of hate crime from 0 to 4. This graph also gives us a simple way to spot possible outliers.

Conclusion
This article uses linear regression, K Nearest Neighbors, and clustering to analyze the factors behind hate crime before and after the 2016 U.S. election.
The data analysis suggests that the increase in hate crime in 2016 is mainly correlated with income inequality, median household income, and race. The District of Columbia has the most hate crimes both before and after the election. From linear regression, the four factors with the highest correlation coefficients with hate crime are, in order, income inequality, median household income, the share of non-citizens, and race. Income inequality has the highest correlation coefficient with hate crime. From multiple linear regression, we find that the highest R-squared values are obtained only when income inequality, median household income, and race are all included: 0.44 for the 2010 to 2015 hate crimes and 0.33 for the 2016 hate crimes. With the K-Nearest Neighbors method, we can predict hate crimes with an accuracy of 40% from median household income. Adding the race factor raises the accuracy to 50%. In summary, we find that income inequality, median household income, and race have a high impact on the crime rate. Median household income and race together can predict the crime rate with an accuracy of about 50%.
The results show that the correlations, the R-squared values, and the accuracies are not high. This is because the dataset is very small. Refining our models with multiple linear regression and additional variables improves the results, but the size of the dataset remains a limitation. The numbers in the dataset are also tricky: the data are limited by how they were collected and cannot definitively tell us whether there were more hate incidents in the days after the election than is typical. What we can do, however, is look for trends within the numbers, such as how hate crimes vary by state and what factors within those states might be tied to hate crime rates.