Dual-Stage Social Friend Recommendation System Based on User Interests

The use of online social networks (OSNs) has become essential to people's lives, whether for entertainment, business, or shopping. One system that is used extensively in this setting is the friend recommendation system (FRS), which recommends users to other users in professional or entertainment-oriented online social networks. In this paper, a Dual-Stage Friend Recommendation (FR) model is proposed. The model applies a dual-stage methodology to unlabeled data of 1241 users, collected from OSN users via an online survey platform and featuring user interests and activities, based on which users with similar social behavioral patterns are recommended to each other. The model employs a user-based collaborative filtering (UBCF) approach in stage one and a graph-based friend-of-friend (FOF) recommendation approach in stage two. The model offers a solution to common problems of FRSs, such as data sparsity, by using a dimensionality reduction technique called non-negative matrix factorization (NMF) to create a dense representation of the collected data and reduce its sparsity, in addition to providing seamless integration with other FRSs. The evaluation of the FR model shows a positive Pearson correlation coefficient (PCC) between the outcomes of cosine similarity and Euclidean distance, which are used as a baseline.

interest, and U2 and U3 are friends with similar interest too, then U1 and U3 are possibly friends as a result.

Related Work
Many studies have been carried out in the area of friend recommendation. Different friend suggestion systems have been proposed based on similarity, using users' profile information, the geographical location of users, and graph-based similarity. This section briefly presents some studies from recent years that used various techniques. In 2013, Akbar et al. proposed a FRS method that uses the Artificial Bee Colony (ABC) algorithm on a dataset of 1000 users extracted from YouTube to recommend users to each other by analyzing the topological features of the graph network generated from the dataset. The ABC algorithm is used to learn the graph weights by optimizing four parameters. The method was compared to classic machine learning algorithms such as K-nearest Neighbor, Support Vector Machine, and Multilayer Perceptron, and it yielded an accuracy of 77% [11].
In 2014, Eirinaki et al. introduced a trust-aware framework for user suggestion, which examined the dynamics and semantics of implicit and explicit friend-enemy connections among users based on a reputation mechanism. The framework is divided into three phases. First, after preparation of the data that is distributed on the network, the explicit and implicit connections that provide the signs of trust among the users are established. Second, reputation degrees are calculated. Finally, positive and negative user recommendations are formed. Two datasets are used: the Epinion dataset and the Wikipedia vote network. The Epinion dataset has product review interactions collected from 132000 users and over 136 million user-to-user positive and negative statements. The introduced method achieved an accuracy of 0.9 and 0.7 for the Epinion and Wikipedia datasets, respectively. The advantage of such a system is that it brings similar users together while keeping dissimilar users apart [12].
In 2015, Zhao et al. presented a study of social network user relationships and behaviors. The authors introduced a FRS that uses hybrid algorithms (a clustering algorithm and a Factorization Machine (FM)). The FM algorithm was used to address the data sparseness problem. It also classified users, which made it simple to recognize their characteristics and interests. The model was trained using the Markov Chain Monte Carlo (MCMC) algorithm. The data used in that study is the Tencent Weibo dataset, which contains information on 2320895 users. The algorithm achieved a root mean square error (RMSE) value of 0.5015 [13].
In 2015, Huang et al. presented a two-stage FRS by using multimedia information, Flickr tags feature, and friendship network. In stage one, it produces a friend list based on the relationship of different OSN. In stage two, a co-clustering method on the user, tag, and image information is applied to create groups. Then, it builds a more precise RS to improve the system outcomes from stage one. The authors collected data of 10000 users from Flickr with their photos and tags. The total number of photos collected was 543,754. The co-clustering method achieved a consistent precision value of 0.28 in comparison to other methods [14].
In 2015, Hasan et al. suggested a novel FRS based on the behavior of individual users on social network sites. First, it measured the frequency of the activities performed by users and updated the data according to those activities. Then, a behavior classification strategy was applied using the Frequent Pattern Growth (FP-Growth) algorithm to identify the required behavior (common and exceptional). Finally, a multilayer threshold for friend recommendation was used. The data used is a collection of users and their relationships extracted from the Facebook social network. The model achieved an accuracy of 94% [15].
In 2016, Wu et al. introduced a FRS based on location preference, in which temporal, spatial, and social relationships are considered. First, the Markov chain algorithm was used to calculate users' friendship similarity on the social network. Then, users' location preference similarity in the real world was calculated based on historical check-in information. The experiments used a dataset consisting of 604138 user relationships. The results on the check-in data showed that it is possible to suggest friends with both similar friendships and similar location preferences to users at large scale [16].
In 2017, Ding et al. proposed a FRS based on a matrix factorization approach, which extracts latent structural features from the input network using a convolutional neural network. After that, it uses a Bayesian ranking algorithm to make the user suggestions. The work uses two datasets, the Epinion dataset and the Slashdot dataset, each of which has approximately 3000 user reviews. After evaluation, the results showed an AUC value of 0.97 for the Epinion dataset and 0.96 for the Slashdot dataset [17].
In 2018, Kumar et al. presented a graph-based FRS using two CF strategies: the number of mutual users and an influence factor. It then assigned a score to each possible friend, and the users with the highest scores were taken as the most similar. The datasets used are from Stanford SNAP and consist of 4039 and 81,306 users from Facebook and Twitter, respectively. The accuracy of the model was 97.2% [18].

Dual-Stage FR Model Design
The FR model is a combination of UBCF and graph-based FR. The framework consists of two stages. In the first stage, an initial graph is constructed based on the similarity of users' interests. In the second stage, a graph-based FOF method is applied to the initial graph, and a final friend recommendation list is built. The dual-stage model can suggest friends to users who have similar interests, based on users' behavioral characteristics. The following steps show the general process of the proposed model, while Figure-1 illustrates these steps as a block diagram.
Step_1. Filling in the Online Survey: users are asked to fill in biographic information (name, age, field of study, email, etc.) and interest information (movie and music genres, sport type, favorite book, etc.) to create the user profiles and the database.
Step_2. Preprocessing Data: the dataset is converted into a binary (0, 1) multi-dimensional sparse matrix, where "zero" indicates that the user dislikes or is not interested in a particular feature or activity, while "one" indicates that the user likes or is interested in it.
Step_3. Reducing Dimensionality: the sparse vectors are converted into denser vectors using the NMF dimensionality reduction algorithm. Reducing the dimensionality reduces the sparsity while keeping the important, relevant features.
Step_4. Stage one, the UBCF approach, begins by computing the similarity among all users to find the most similar users according to the threshold (delta = 0.5); then an initial graph is built. The threshold is an adjustable value that can range from 0 to 1. The chosen value of delta means that users with a similarity of 50% or more are considered similar. Nodes represent users, and the edges between nodes carry the similarity scores.
Step_5. Stage two, the graph-based approach, begins when a user does not get more than five recommendations from Step_4 (where five is a threshold called gamma). Such a user gets recommendations from the friends of a friend that has a high similarity score in the user's friend list, using the FOF algorithm.
Step_6. A new score is calculated for each user recommended in stage two by taking the maximum of the stage-one similarity score between the user and the friend (S1) and the similarity score between that friend and the friend of friend (S2), i.e., new score = max(S1, S2). Finally, the final friend recommendation graph is created accordingly.
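Step_4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a precomputed symmetric similarity matrix, and the function name `build_initial_graph`, the dictionary-based graph layout, and the toy scores are all hypothetical.

```python
def build_initial_graph(user_ids, similarity, delta=0.5):
    """Build an undirected weighted graph as {user: {neighbour: score}}.

    similarity[i][j] is the similarity between user_ids[i] and user_ids[j];
    an edge is kept only when the score is at least delta.
    """
    graph = {user: {} for user in user_ids}
    n = len(user_ids)
    for i in range(n):
        for j in range(i + 1, n):  # skip the diagonal: no self-recommendation
            score = similarity[i][j]
            if score >= delta:
                graph[user_ids[i]][user_ids[j]] = score
                graph[user_ids[j]][user_ids[i]] = score
    return graph


# Hypothetical similarity scores for three users (not taken from the survey data).
ids = ["U1", "U2", "U3"]
sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
initial_graph = build_initial_graph(ids, sim, delta=0.5)
```

With delta = 0.5, U1 and U3 (score 0.3) are not connected, while U2 is connected to both U1 and U3. Raising delta makes the initial graph sparser, which pushes more users into stage two.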

Figure 1-Block diagram of the proposed model (recoverable labels: Users Profile D.B, Data Preprocessing, Dimensionality Reduction, Stage One (UBCF), Initial FR graph)

Non-negative Matrix Factorization (NMF)
The Vector Space Model (VSM) is a common information retrieval model that describes a data collection as a matrix. Vector space matrices are initially high dimensional and sparse. Applying a model to a dataset with a high-dimensional space usually incurs high time and space complexity and often leads to overfitting problems. Reducing the dimensions can therefore reduce the noise. Dimensionality reduction also removes the unnecessary parts of the data and keeps the closely related, relevant parts in a smaller subspace, in which a simple learning algorithm can then be applied easily [19].
The NMF decomposes a non-negative matrix (A) into two non-negative matrices (W and H); one of the decomposed matrices (W) can be viewed as the basis vectors. The dimensionality reduction is achieved by projecting the input vectors onto the lower-dimensional space formed by these basis vectors, as shown in Figure-2. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are also popular techniques for dimensionality reduction based on matrix decomposition [20]; however, their computational cost becomes a limiting factor when the matrices become large. The NMF method is distinguished from the other methods, e.g., PCA and SVD, by its non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. Also, the NMF computation is based on a simple iterative algorithm; it is, therefore, advantageous for applications involving large matrices. The mathematical calculation in NMF [20] can be described as follows. Let matrix A be approximated by the product of matrices W and H:

$$A \approx W H \tag{1}$$

To factorize matrix (A), matrices (W) and (H) are randomly initialized with non-negative values. Next, we define a squared error function (Cost_function) that measures how closely the product of both matrices matches the matrix (A):

$$\text{Cost\_function} = \lVert A - W H \rVert^{2} = \sum_{i,j} \left( A_{ij} - (W H)_{ij} \right)^{2} \tag{2}$$

Next, a multiplicative update rule is applied to both of them iteratively, until W and H are stable, as follows:

$$H_{kj} \leftarrow H_{kj} \, \frac{(W^{T} A)_{kj}}{(W^{T} W H)_{kj}} \tag{3}$$

And,

$$W_{ik} \leftarrow W_{ik} \, \frac{(A H^{T})_{ik}}{(W H H^{T})_{ik}} \tag{4}$$
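The multiplicative update scheme described above can be sketched in plain Python. This is a minimal, dependency-free illustration using the standard multiplicative updates for the squared-error cost; the helper names (`matmul`, `transpose`, `squared_error`, `nmf`), the small constant `eps` that guards against division by zero, the fixed iteration count, and the toy matrix are all assumptions made for the example and are not taken from the paper.

```python
import random


def matmul(X, Y):
    """Matrix product of two list-of-lists matrices."""
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]


def transpose(X):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*X)]


def squared_error(A, W, H):
    """Squared error between A and the product WH."""
    WH = matmul(W, H)
    return sum((a - p) ** 2 for ra, rp in zip(A, WH) for a, p in zip(ra, rp))


def nmf(A, k, iterations=200, eps=1e-9, seed=0):
    """Factorize non-negative A (m x n) into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Random non-negative initialization of both factors.
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Hypothetical 4-user x 5-feature binary interest matrix (not the survey data).
A = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
W, H = nmf(A, k=2)
```

Each row of W is the reduced, denser representation of one user (here 2 values instead of 5). In practice a library implementation with a convergence test would normally replace the fixed iteration count used in this sketch.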

Cosine Similarity
One of the common similarity measures is cosine similarity, which treats the item/user interests as a feature vector in an n-dimensional space and calculates similarity as the cosine of the angle between two users' vectors [2,21]. The smaller the angle between two users' vectors ($\vec{a}$, $\vec{b}$), the higher the cosine similarity. To understand the concept of cosine-based similarity, the definition of the dot product is explained first:

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i \tag{5}$$

The dot product is the sum of the element-wise products of the two vectors. For instance, the dot product of vectors $\vec{a} = (a_1, a_2, a_3)$ and $\vec{b} = (b_1, b_2, b_3)$ is:

$$\vec{a} \cdot \vec{b} = a_1 b_1 + a_2 b_2 + a_3 b_3$$

It can be noted that the input is two vectors and the output is a single value, not another vector. The geometric description of the dot product is:

$$\vec{a} \cdot \vec{b} = \lVert \vec{a} \rVert \, \lVert \vec{b} \rVert \cos\theta \tag{6}$$

where $\lVert \vec{a} \rVert$ and $\lVert \vec{b} \rVert$ are the norms of vector $\vec{a}$ and vector $\vec{b}$, respectively, and $\theta$ is the angle between them. In this context, each user is represented as a vector that holds the user's interest ratings, and the cosine similarity equation is then utilized as follows:

$$\text{sim}(\vec{a}, \vec{b}) = \cos\theta = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}} \tag{7}$$

and the equation of cosine distance is as follows:

$$\text{dist}(\vec{a}, \vec{b}) = 1 - \text{sim}(\vec{a}, \vec{b}) \tag{8}$$
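The dot product, norm, cosine similarity, and cosine distance can be computed directly. The snippet below is a small sketch under the assumption that both vectors are non-zero; the function names and the two example vectors are hypothetical.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    return dot(a, b) / (norm(a) * norm(b))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Two hypothetical binary interest vectors that share two of three interests.
u = [1, 0, 1, 1]
v = [1, 1, 1, 0]
similarity_uv = cosine_similarity(u, v)  # dot = 2, norms = sqrt(3) each
distance_uv = cosine_distance(u, v)
```

Here the dot product is 2 and each norm is the square root of 3, so the similarity is 2/3 and the distance is 1/3.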

Pearson's Correlation Coefficient (PCC)
The second similarity metric is correlation-based, in which the similarity between two users' vectors, $\vec{a}$ and $\vec{b}$, is measured by Pearson's correlation coefficient (PCC). PCC can be helpful in data analysis and modeling to better estimate the relationships between variables [22]. PCC ranges from -1 to +1. The statistical relationship between the two vectors is quantified by their correlation, using PCC. The association can be positive, indicating that both variables move in the same direction, or negative, indicating that the variables move in opposite directions. The correlation can also be zero, indicating that the variables are uncorrelated [23]. The user-based similarity using PCC between vectors $\vec{a}$ and $\vec{b}$ is described as follows:

$$\text{PCC}(\vec{a}, \vec{b}) = \frac{\text{cov}(\vec{a}, \vec{b})}{\sigma_{a} \, \sigma_{b}}$$

where $\text{cov}(\vec{a}, \vec{b})$ is the covariance of vectors $\vec{a}$ and $\vec{b}$, while $\sigma_{a}$ and $\sigma_{b}$ represent the standard deviations of vectors $\vec{a}$ and $\vec{b}$, respectively. More formally, the PCC formula can be written as follows:

$$\text{PCC}(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^{2}} \, \sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^{2}}} \tag{9}$$

where $\bar{a}$ and $\bar{b}$ are the means of the entries of $\vec{a}$ and $\vec{b}$.
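As an illustration, PCC can be computed from its definition as the covariance divided by the product of the standard deviations. The sketch below uses only the standard library; the function name `pearson` and the choice to raise an error for constant vectors (whose standard deviation is zero) are assumptions made for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("PCC is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


r_same = pearson([1, 2, 3], [2, 4, 6])            # same direction
r_opposite = pearson([1, 2, 3], [3, 2, 1])        # opposite direction
r_none = pearson([1, 0, 1, 0], [1, 1, 0, 0])      # uncorrelated
```

The three calls cover the three cases described in the text: a value near +1, a value near -1, and a value of 0.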

Euclidean Distance
The third similarity metric is the Euclidean distance. The formula that measures the distance between vectors $\vec{a}$ and $\vec{b}$ is as follows:

$$d(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^{2}} \tag{10}$$

where $d(\vec{a}, \vec{b})$ represents the distance between the two users' vectors $\vec{a}$ and $\vec{b}$, used here as the measure of their similarity; it is the square root of the sum of squared differences between corresponding features of the two vectors.
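A direct transcription of this distance into Python is shown below as a short sketch; the function name and the example vectors are hypothetical. Python 3.8 and later also provide `math.dist`, which computes the same quantity.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared differences between corresponding features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical binary interest vectors that differ in two features.
u = [1, 0, 1, 1]
v = [1, 1, 1, 0]
d_uv = euclidean_distance(u, v)
```

For binary vectors the squared distance equals the number of features on which the two users differ, so here the distance is the square root of 2.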

Graph Measurement
In graph theory, a graph can be either dense or sparse depending on how close its number of edges is to the maximal number of edges. A graph with a number of edges close to the maximum is called a dense graph. On the contrary, a graph with few edges is called a sparse graph [24]. Graph density can be measured for an undirected graph using:

$$D = \frac{2\,|E|}{|V|\,(|V| - 1)} \tag{11}$$

and for a directed graph using:

$$D = \frac{|E|}{|V|\,(|V| - 1)} \tag{12}$$

where E represents the number of edges and V is the number of nodes or vertices in the graph.

Experimental Result
In this section, experiments are conducted on the collected dataset (1241 users) to evaluate the proposed Dual-Stage friend recommendation system based on users' interests. The experimental details are as follows:

Dataset
The data used for the dual-stage FR model is obtained by surveying active social media users. An electronic survey adopted from a previous study [3] is used to collect inputs from users on their social media activities and preferences. For example, users are asked about their favorite movie or music genres, books, video games, or the social communities to which they belong. The questions provide binary numerical values of 0 or 1, where 1 represents the presence of a particular feature (like) and 0 represents its absence (dislike). There are seven main questions that yield 99 features. These features serve as a binary (0, 1) vector representation of users' interests. The total number of users surveyed is 1241, as shown in Table-1. A summary of the current dataset for user interests is given in Figure-3.

Preprocessing
The goal of this step is to create a binary (0, 1) 99-dimensional vector representation of the features. Hence, we use find-and-replace to map the strings "Yes" or "Like" to 1 and "No" or "Dislike" to 0. Table-2 shows an example of users' data collected from the online survey platform after preprocessing.
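The mapping from survey answers to binary values can be sketched as follows. This is an illustrative snippet only: the function names, the case-insensitive matching, and the choice to reject unexpected answers are assumptions, and the sample row is made up rather than taken from the survey.

```python
POSITIVE = {"yes", "like"}
NEGATIVE = {"no", "dislike"}


def to_binary(answer):
    """Map a survey answer to 1 (Yes/Like) or 0 (No/Dislike)."""
    key = answer.strip().lower()
    if key in POSITIVE:
        return 1
    if key in NEGATIVE:
        return 0
    raise ValueError(f"unexpected survey answer: {answer!r}")


def encode_row(answers):
    """Convert one user's list of answers into a binary feature vector."""
    return [to_binary(answer) for answer in answers]


# Hypothetical answers of one user to five features.
row = encode_row(["Yes", "No", "Like", "Dislike", "yes"])
```

Rejecting unknown strings, instead of silently mapping them to 0, makes data-entry problems visible before the similarity computation.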

Data Sparsity Avoidance
One of the challenges that face RS and FRS is data sparsity. Data coming from the online survey platform is feature-rich, yet it is sparse. After data preprocessing, the measured sparsity of the data was 86%. To mitigate sparsity, non-negative matrix factorization (NMF) is used, as shown in algorithm (1). The role of the NMF algorithm is to reduce the sparsity of the data by replacing it with a more compact, denser representation. NMF reduces the sparsity from 86% to 44.9%. Figure-4 shows the results before and after using NMF.
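One common way to quantify sparsity is the fraction of zero entries in the matrix. The sketch below assumes that definition; the paper does not state exactly how its 86% and 44.9% figures were computed, so the function name `sparsity`, the tolerance parameter `tol` (useful for factorized values that are tiny but not exactly zero), and the toy matrix are assumptions.

```python
def sparsity(matrix, tol=0.0):
    """Fraction of entries whose absolute value is at most tol."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if abs(value) <= tol)
    return zeros / total


# Hypothetical 2-user x 3-feature binary matrix: 4 of its 6 entries are zero.
example = [
    [1, 0, 0],
    [0, 0, 1],
]
example_sparsity = sparsity(example)
```

Applying the same function to the matrix before and after dimensionality reduction gives a before/after comparison of the kind reported above.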

Stage One FR model: Initial Graph Construction
The NMF algorithm produces denser 20-dimensional vectors instead of the sparse 99-dimensional ones. Thus, in order to start recommending users based on interests, we need to calculate the similarity between their corresponding vectors to determine which users are to be recommended to each other. The cosine distance in equation (8) is used for this purpose.

It can be noticed from Figure-6 that node U35, for example, represents user 35, who has four friends (user 37, user 3, user 42, and user 16). Similarly, user 50 has two friends (user 38 and user 27). Another example is user 32, who has five friend suggestions: user 23, user 39, user 34, user 15, and user 9. Edges between users represent similarity scores that satisfy the threshold of 0.5. The similarity weights determine whether or not the connected users are recommended by the algorithm, as illustrated in algorithm (2), stage one. Users whose similarity scores satisfy the threshold are treated as nodes in the recommendation graph, while the similarity scores serve as weights on the edges. Table-3 shows a sample user-to-user adjacency matrix of the first 10 users. It can be noticed that the diagonal of the matrix is always zero because each diagonal entry compares a user with itself, for which the cosine distance in equation (8) is zero. For the purposes of recommendation, a threshold of 0.5 is imposed on the cosine similarity values, so that users with 50% similarity or greater are recommended to each other.

One issue that may arise from stage one is that some users may not get a sufficient number of friend recommendations due to low similarity scores. To resolve this issue, a FOF method is applied to recommend to such a user the most similar friend of a friend (the one with maximum cosine similarity with a friend), as shown in algorithm (3), stage two. The closing steps of algorithm (3) are:

        While user not in Friends_of_most_similar_user_list:
            Add user and FOF to Final_Graph
            Calculate New_Score(user, FOF) = max(most_similar_friend, FOF)
        End While
    End IF
End For
For example, as shown in Figure-6, User1 (U1) has four friends (U2, U3, U4, and U5), with scores of 0.4, 0.2, 0.4, and 0.5, respectively. Reading these scores as cosine distances (equation (8)), where a smaller value means greater similarity, U3 is the most similar friend of U1. First, the algorithm fetches U3's friends (U22, U9, U15, U8, U12, and U11), whose scores are 0.3, 0.29, 0.25, 0.27, 0.19, and 0.1, respectively. Next, it recommends U11 to U1 because U11 is the most similar friend of U3. Finally, the algorithm assigns a new score between U1 and U11 equal to the maximum of the two scores, max(0.2, 0.1) = 0.2. A new similarity score is required to differentiate between the users that were recommended during stage one and those recommended during stage two.
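The worked example can be reproduced with a short sketch of stage two. It is an interpretation rather than the authors' algorithm (3): it assumes the edge weights are cosine distances (smaller means more similar), that stage two applies to users with at most gamma friends, and that a single friend of friend is recommended per user. The function name `recommend_fof` and the dictionary layout are hypothetical.

```python
def recommend_fof(graph, user, gamma=5):
    """Return (friend_of_friend, new_score) for user, or None if stage two does not apply.

    graph maps each user to {friend: cosine distance}; smaller means more similar.
    """
    friends = graph.get(user, {})
    if not friends or len(friends) > gamma:
        return None
    # Most similar friend of the user (smallest distance), with score S1.
    best_friend = min(friends, key=friends.get)
    s1 = friends[best_friend]
    # Candidates: that friend's friends who are not the user or already the user's friends.
    candidates = {
        fof: score
        for fof, score in graph.get(best_friend, {}).items()
        if fof != user and fof not in friends
    }
    if not candidates:
        return None
    # Most similar friend of friend, with score S2; the new score is max(S1, S2).
    best_fof = min(candidates, key=candidates.get)
    s2 = candidates[best_fof]
    return best_fof, max(s1, s2)


# Scores from the worked example in the text.
graph = {
    "U1": {"U2": 0.4, "U3": 0.2, "U4": 0.4, "U5": 0.5},
    "U3": {"U1": 0.2, "U22": 0.3, "U9": 0.29, "U15": 0.25,
           "U8": 0.27, "U12": 0.19, "U11": 0.1},
}
result = recommend_fof(graph, "U1")
```

For U1 the sketch selects U3 as the most similar friend, then U11 as U3's most similar friend, and assigns the new score max(0.2, 0.1) = 0.2, which matches the example.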

Evaluation
For evaluation, the recommendation outcomes are compared using the Pearson correlation coefficient (PCC), equation (9), computed between the outcomes of the cosine-based measure, equation (8), and the Euclidean distance, equation (10), which serve as the baseline for testing the users' relationships. PCC helps in data modeling and analysis to predict the relationships between variables. Figure-7 shows that the PCC obtained for the current system is positive (0.47), which indicates that the users' relationships are positively correlated.

Figure 7-PCC between Cosine Similarity and Euclidean
Since the Dual-Stage FR model relies mainly on graph-based methods, summarizing the information of the initial graph is useful. As shown in Table-4, the initial graph generated by the stage one algorithm (2) has a total of 1241 vertices or nodes. The graph also has 379776 edges, which represent the connections between the users and their friend suggestions. The table also shows that the initial graph has an average degree of 306.0242 for both Indegree and Outdegree. The degree of a vertex is the number of edges connected to the vertex. Finally, the density of the graph can be calculated using equation (11) because the initial graph is undirected. Table-4 reports both Indegree and Outdegree, which suggests that the 379776 edges count each undirected connection once in each direction; under that reading, the density is computed by dividing the number of existing edges (379776) by the maximum possible number of such edges, n × (n − 1) = 1241 × 1240 = 1538840, which gives approximately 0.247.
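The figures reported for the initial graph can be checked with a few lines of arithmetic. The sketch below assumes that the 379776 edges count each undirected connection once per direction, so that the same density results from the directed formula on 379776 edges or the undirected formula on half that number; the function names are hypothetical.

```python
def density_undirected(num_edges, num_vertices):
    """Density of an undirected graph: 2|E| / (|V| (|V| - 1))."""
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


def density_directed(num_edges, num_vertices):
    """Density of a directed graph: |E| / (|V| (|V| - 1))."""
    return num_edges / (num_vertices * (num_vertices - 1))


n = 1241          # vertices reported for the initial graph
edges = 379776    # edges reported for the initial graph

max_edges = n * (n - 1)                 # 1538840
density = density_directed(edges, n)    # approximately 0.247
average_degree = edges / n              # approximately 306.0242
```

Both the reported density (0.247) and the reported average degree (306.0242) follow from the vertex and edge counts.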

Conclusions and Future Work
a. Conclusions
In this study, a Dual-Stage FR model is designed and implemented. The model brings several advantages to friend recommendation. First, it is less affected by data sparsity due to the incorporation of the NMF algorithm. Second, it works with unlabeled data. Third, it combines the advantages of the UBCF approach and the graph-based (FOF) approach to obtain more accurate friend recommendations. Furthermore, the simplicity of the Dual-Stage model allows it to be integrated seamlessly with other FR systems. One possible drawback of this model is that it can be difficult to scale to OSNs with an enormous number of users (millions), as it can be computationally inefficient and time-consuming. The model is evaluated using PCC (equation (9)) between cosine similarity and Euclidean distance as a baseline, and it yielded a positive correlation (0.47). Also, the lack of available data featuring social behavior information led to the design of a data collection system, dedicated to friend recommendation purposes, which allows OSN users to include information about their interests and activities.
b. Future Work
1. Conducting quantitative and qualitative studies that analyze the activities of OSN users. This can help to generate datasets featuring information on social behavior.
2. Examining the usefulness and applicability of clustering methods (e.g., K-means, hierarchical, etc.), instead of calculating user similarity in stage one of the FR model, to build the initial graph, and comparing the results to user-based CF.