This section introduces the proposed review selection approach. A feature taxonomy was constructed by applying the LDA topic model and the Word2vec model. Based on the feature taxonomy, similarities between features, opinions, and reviews were calculated and the overall quality of individual online reviews was estimated. Thereafter, based on the calculated similarities and review quality, reviews can be heuristically selected by the proposed approach, whose framework is shown in Fig. 1.
Feature taxonomy construction
When commenting on a product, some consumers discuss various features (e.g., price performance and appearance, each of which represents a key feature of a mobile phone), while others pay close attention to one feature from different detailed angles (e.g., duration, charging time and battery life, all of which concern the feature “battery” of a phone). Thus, a hierarchical semantic structure is required to reveal and differentiate the relations among all the features of one product. If the semantic structure among the features is ignored, features extracted at the same semantic level may cause semantic overlap and duplicated information within the selected subset, which affects the cognition of consumers. A feature taxonomy, by contrast, provides the semantic information of features at different levels, which is in line with the target of this study. Although Tian et al. (2014) proposed a feature taxonomy construction method based on association rules, they neglected the word relations in various contexts, which contain more semantic information. LDA uses global documents and words to calculate the topic distribution of each document and the word distribution of each topic, while Word2vec relies on local words (a context window) to capture context relations. Integrating the two methods could improve the quality of the feature similarity measurement for constructing a feature taxonomy that fits consumer cognition. Therefore, we propose a method that considers both global-level and local-level semantic information, based upon topic relations and word context relations.
As for topic relations, we employ the LDA topic model to calculate the relations between features in different topics. Blei et al. (2003) proposed the LDA topic model, a three-level hierarchical Bayesian model comprising a document level, a topic level and a word level. We assume that the extracted features belong to the word level. Let F = (f_{1}, f_{2}, …, f_{m}) be a set of extracted features appearing in a review corpus, and Z = (Z_{1}, Z_{2}, …, Z_{h}) be a set of hidden topics. LDA is applied to generate topic models that represent the review collection as a whole. At the review collection level, each topic Z_{t} is represented by a probability distribution over features, where P(f_{j} | Z_{t}) is the probability of feature f_{j} appearing in reviews on topic Z_{t}. Based on the probability P(f_{j} | Z_{t}), we can choose the top features to represent the specific topic Z_{t}.
Definition 1 (Topic features): Let φ_{t} = (P(f_{1} | Z_{t}), P(f_{2} | Z_{t}), …, P(f_{m} | Z_{t})) be the topic representation for topic Z_{t} calculated by LDA and 0 ≤ σ ≤ 1 be a threshold. The topic features for Z_{t}, denoted as TF(Z_{t}), are defined as TF(Z_{t}) = {f_{j} | P(f_{j} | Z_{t}) > σ, f_{j} ∈ F}.
Topic features for specific topic Z_{t} are the features whose appearance probabilities are larger than the threshold σ. Then we define single topic relation to measure the topic relation of two features.
Definition 2 (Single topic relation): For f_{i}, f_{j} ∈ F, the topic relation between two features with respect to a certain topic Z_{t} is defined as Equation (4):
$$ {STR}_{Z_t}\left({f}_i,{f}_j\right)=\left\{\begin{array}{ll}1-\left|P\left({f}_i\mid {Z}_t\right)-P\left({f}_j\mid {Z}_t\right)\right|, & {f}_i,{f}_j\in TF\left({Z}_t\right);\\ 0, & \mathrm{otherwise};\end{array}\right. $$
(4)
where |P(f_{i} | Z_{t}) − P(f_{j} | Z_{t})| is the absolute value of the difference between the appearance probabilities of feature f_{i} and feature f_{j}. A single topic relation reflects the semantic relation of two features on one particular topic. When feature f_{i} and feature f_{j} are both topic features of hidden topic Z_{t} and P(f_{i} | Z_{t}) is close to P(f_{j} | Z_{t}), the value of \( {STR}_{Z_t}\left({f}_i,{f}_j\right) \) is close to 1, so the two features share similar semantic meaning on this single topic. Then we can calculate the topic relation between two features on all topics.
Definition 3 (Topic relation): Let f_{i}, f_{j} ∈ F be two features appearing in review corpora, and Z(f_{i}, f_{j}) be a set of topics that contain both features. The topic relation between two features with respect to all topics is defined in Equation (5):
$$ TR\left({f}_i,{f}_j\right)=\frac{\sum \limits_{Z_t\in Z\left({f}_i,{f}_j\right)}{STR}_{Z_t}\left({f}_i,{f}_j\right)}{\left|Z\left({f}_i,{f}_j\right)\right|}, $$
(5)
where |Z(f_{i}, f_{j})| represents the number of topics containing both feature f_{i} and feature f_{j}. If the topic relation between two features is large, the two features share similar semantic meaning across all topics.
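Definitions 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: each topic is assumed to be a plain dictionary mapping a feature to P(f | Z_t), the function names are hypothetical, and a topic is taken to "contain" a feature when that feature is one of its topic features.

```python
from typing import Dict, List, Set

# Assumption: a topic is a dict {feature: P(feature | topic)}, e.g. read
# from a fitted LDA model. All names below are illustrative.
Topic = Dict[str, float]


def topic_features(topic: Topic, sigma: float) -> Set[str]:
    """Definition 1: features whose probability under the topic exceeds sigma."""
    return {f for f, p in topic.items() if p > sigma}


def single_topic_relation(topic: Topic, fi: str, fj: str, sigma: float) -> float:
    """Definition 2, Equation (4): 1 - |P(fi|Z) - P(fj|Z)| if both are topic features."""
    tf = topic_features(topic, sigma)
    if fi in tf and fj in tf:
        return 1.0 - abs(topic[fi] - topic[fj])
    return 0.0


def topic_relation(topics: List[Topic], fi: str, fj: str, sigma: float) -> float:
    """Definition 3, Equation (5): mean single topic relation over shared topics."""
    shared = [t for t in topics if {fi, fj} <= topic_features(t, sigma)]
    if not shared:
        return 0.0  # assumption: no shared topic means no topic relation
    return sum(single_topic_relation(t, fi, fj, sigma) for t in shared) / len(shared)
```

With two toy topics in which "battery" and "charging" are both topic features of only the first one, `topic_relation` averages over that single topic.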
Moreover, as for word context relations, we utilize the Word2vec model to train word vectors and then calculate word distances. Mikolov et al. (2013) proposed the Word2vec model, which can train high-quality word vectors with a much lower computational complexity. After the full training of Word2vec, we acquire the vector representation of feature f_{j} as v_{j} = (v_{j1}, v_{j2}, …, v_{jd}). We can then calculate the cosine similarity of two features, which is deemed as the word context relation.
Definition 4 (Word context relation): Let f_{i}, f_{j} ∈ F be two features appearing in the review corpus, and let v_{i} and v_{j} be the vector representations of feature f_{i} and feature f_{j}. The word context relation is defined as Equation (6):
$$ WCR\left({f}_i,{f}_j\right)=\frac{v_i\bullet {v}_j}{\left\Vert {v}_i\right\Vert \times \left\Vert {v}_j\right\Vert }, $$
(6)
where v_{i} ∙ v_{j} is the inner product of the vector representations of feature f_{i} and feature f_{j}, and ‖v_{i}‖ represents the Euclidean norm of v_{i}.
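Equation (6) is the cosine of the angle between two word vectors. A minimal sketch follows, assuming the trained vectors are available as plain sequences of floats; the function name and the handling of zero vectors are illustrative choices, not part of the original method.

```python
import math
from typing import Sequence


def word_context_relation(vi: Sequence[float], vj: Sequence[float]) -> float:
    """Equation (6): inner product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(vi, vj))
    norm_i = math.sqrt(sum(a * a for a in vi))
    norm_j = math.sqrt(sum(b * b for b in vj))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: a zero vector carries no context information
    return dot / (norm_i * norm_j)
```

For example, the vectors (3, 4) and (4, 3) have inner product 24 and norms of 5 each, so their word context relation is 24/25 = 0.96.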
Based on the topic relations and word context relations, the feature taxonomy can be constructed. First, we set some constraints for the feature taxonomy:

(1) The root of the taxonomy represents the feature with the largest appearance probability in reviews. The root appears most frequently in review corpora.
(2) The taxonomy is structured as a tree, which means that each feature has only one parent feature except for the root.
(3) A parent feature is more general than its sub features. The proportion of a parent feature is larger than its sub features.
The input of the proposed feature taxonomy construction method is the review matrix A, the feature set F = (f_{1}, f_{2}, …, f_{m}), the topic relations TR(f_{i}, f_{j}), i, j = 1…m, i ≠ j, the word context relations WCR(f_{i}, f_{j}), i, j = 1…m, i ≠ j, and the proportion vector θ = (θ_{1}, θ_{2}, …, θ_{m}), whose element θ_{j} equals the number of appearances of feature f_{j}, j = 1…m, divided by the number of reviews. Initially, the feature taxonomy is an empty tree. We then sort the proportion vector in descending order as \( \boldsymbol{\theta}^{\prime }=\left({\theta}_1^{\prime },{\theta}_2^{\prime },\dots, {\theta}_m^{\prime}\right) \), whose corresponding features are \( \boldsymbol{f}^{\prime }=\left({f}_1^{\prime },{f}_2^{\prime },\dots, {f}_m^{\prime}\right) \). First, we add the feature f_{1}^{′} as the root of the feature taxonomy, using the function addroot. Second, the feature f_{2}^{′} is directly added to the feature taxonomy as the sub feature of f_{1}^{′}, using the function addsubfeature. Third, we process the remaining features in the descending order of θ′, and add feature f_{i}^{′} as the sub feature of the feature f_{j}^{′} for which TR(f_{i}^{′}, f_{j}^{′}) + WCR(f_{i}^{′}, f_{j}^{′}) is the largest. The method proceeds until no feature remains in θ′. The pseudo code of the feature taxonomy construction method is shown in Table 5 in the Appendix.
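The construction procedure described above can be sketched as a greedy attachment loop. The sketch below is one reading of the text, not the pseudo code from the Appendix: it assumes that the candidate parents of a feature are the features already placed in the tree (which all have an equal or larger proportion), that ties go to the earlier-placed feature, and that TR and WCR are supplied as callables. The function name and the child-to-parent dictionary are hypothetical.

```python
from typing import Callable, Dict, Optional


def build_taxonomy(
    proportions: Dict[str, float],
    tr: Callable[[str, str], float],
    wcr: Callable[[str, str], float],
) -> Dict[str, Optional[str]]:
    """Return a child -> parent map; the root maps to None."""
    # Sort features by proportion, largest first (the vector theta').
    ordered = sorted(proportions, key=lambda f: proportions[f], reverse=True)
    if not ordered:
        return {}
    parent: Dict[str, Optional[str]] = {ordered[0]: None}  # addroot
    for feature in ordered[1:]:
        # Assumption: only features already in the tree can be parents.
        # For the second feature the only candidate is the root.
        best = max(parent, key=lambda placed: tr(feature, placed) + wcr(feature, placed))
        parent[feature] = best  # addsubfeature
    return parent
```

Because features are processed in descending order of proportion, every parent chosen this way satisfies constraint (3) up to ties, and the single-parent map guarantees the tree structure required by constraint (2).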
Review quality estimation
To estimate the quality of a review, following Chen and Tseng (2011) and Jindal and Liu (2008), we adopt a quality measurement capturing four dimensions of a review: completeness, objectivity, believability and deviation, which can be assessed from its content, sentiment and feedback information. For each review r_{i}, we extract the number of features, denoted as f_{i}, to reflect its completeness. The objectivity is measured by the number of opinion words o_{i}. In light of previous literature (Ghose and Ipeirotis 2011; Korfiatis et al. 2012; Lee 2018; Tian et al. 2015), a helpfulness vote means that people find the review helpful, and it is commonly used as a proxy that fully or partly represents the believability of a review. Thus, the believability of a review is estimated by its helpfulness vote s_{i}. The deviation of a review is calculated as the difference between the product rating of the review and the overall average rating, denoted as d_{i}. Each dimension value is normalized separately, as shown in Equation (7):
$$ scl\left({x}_i\right)=\frac{x_i-\min (x)}{\max (x)-\min (x)}. $$
(7)
Afterwards, we use the average of all the normalized values to measure the overall quality of a review, as shown in Equation (8). It is worth noting that other quality measurements can also be adopted according to actual requirements:
$$ {q}_i=\frac{scl\left({f}_i\right)+ scl\left({o}_i\right)+ scl\left({s}_i\right)+ scl\left({d}_i\right)}{4}. $$
(8)
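Equations (7) and (8) amount to a min-max scaling of each dimension followed by an unweighted mean. A minimal sketch, assuming each dimension is given as a list with one value per review; the function names and the treatment of a constant dimension are illustrative assumptions.

```python
from typing import List, Sequence


def scl(values: Sequence[float]) -> List[float]:
    """Equation (7): min-max normalization of one quality dimension."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant dimension scales to 0
    return [(v - lo) / (hi - lo) for v in values]


def review_quality(
    features: Sequence[float],
    opinions: Sequence[float],
    votes: Sequence[float],
    deviations: Sequence[float],
) -> List[float]:
    """Equation (8): mean of the four normalized dimensions for each review."""
    dims = [scl(features), scl(opinions), scl(votes), scl(deviations)]
    return [sum(dim[i] for dim in dims) / 4 for i in range(len(features))]
```

For three reviews with feature counts (1, 3, 5), opinion-word counts (2, 2, 4), helpfulness votes (0, 10, 5) and deviations (0, 1, 0.5), the resulting quality scores are 0, 0.625 and 0.75.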
Review selection approach
The similarity between two extracted features can be defined based on the established feature taxonomy. Let f_{i}, f_{j} ∈ F be two features appearing in the review corpus, H be the height of the taxonomy and d(f_{i}, f_{j}) be the shortest path length in the taxonomy. The similarity between features is defined as Equation (9):
$$ sim\left({f}_i,{f}_j\right)=\left\{\begin{array}{ll}-\log \frac{d\left({f}_i,{f}_j\right)}{2H}, & i\ne j;\\ 1, & i=j.\end{array}\right. $$
(9)
In the review selection problem, two typical sentiment polarities, positive and negative, are considered for each feature (Hu and Liu 2004; Tsaparas et al. 2011). Therefore, there are at most 2m opinions collectively. Let O = (o_{1}, o_{2}, …, o_{p}), p = 2m, be the complete set of opinions for all the features. Considering the semantic polarity, we define the similarity of two opinions in the review corpus. Let o_{i}, o_{j} ∈ O be two opinions on features f_{i} and f_{j}, respectively. The similarity between the two opinions is defined in Equation (10):
$$ sim\left({o}_i,{o}_j\right)= sim\left({f}_i,{f}_j\right)\times \left\{\begin{array}{ll}1, & {o}_i,{o}_j\ \mathrm{have}\ \mathrm{the}\ \mathrm{same}\ \mathrm{polarity};\\ 0, & \mathrm{otherwise}.\end{array}\right. $$
(10)
Let r_{i}, r_{j} ∈ R be two reviews in the review corpus. Review r_{i} has m opinions, whose opinion vector is \( \overset{\rightharpoonup }{o_i}=\left({o}_{i1},{o}_{i2},\dots, {o}_{im}\right) \) and r_{j} has n opinions with the opinion vector \( \overset{\rightharpoonup }{o_j}=\left({o}_{j1},{o}_{j2},\dots, {o}_{jn}\right) \). The similarity between two reviews is defined as Equation (11):
$$ sim\left({r}_i,{r}_j\right)=\frac{\sum_{t=1}^m{\sum}_{p=1}^n sim\left({o}_{it},{o}_{jp}\right)}{\left\Vert \overset{\rightharpoonup }{o_i}\right\Vert \times \left\Vert \overset{\rightharpoonup }{o_j}\right\Vert }, $$
(11)
where the numerator represents the sum of similarities between the opinions appearing in reviews r_{i} and r_{j}, and the denominator equals the product of the numbers of opinions in reviews r_{i} and r_{j}. Thus, the similarity matrix of the review corpus S = (sim_{ij})_{n × n}, which records the similarity between each pair of reviews, can be calculated. The similarity matrix is a symmetric matrix, whose diagonal elements are 1.
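The three similarity levels of Equations (9) to (11) can be sketched on top of a taxonomy stored as a child-to-parent map. Several points here are assumptions rather than statements from the text: Equation (9) is read as the negative natural logarithm of d/(2H) so that closer features score higher, H is counted in edges on the longest root-to-leaf path, an opinion is a (feature, polarity) pair, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Taxonomy = Dict[str, Optional[str]]  # child -> parent, root -> None
Opinion = Tuple[str, str]            # (feature, polarity), e.g. ("battery", "+")


def path_to_root(parent: Taxonomy, f: str) -> List[str]:
    path = [f]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def path_length(parent: Taxonomy, fi: str, fj: str) -> int:
    """Shortest path d(fi, fj): edges via the lowest common ancestor."""
    steps_j = {node: k for k, node in enumerate(path_to_root(parent, fj))}
    for steps_i, node in enumerate(path_to_root(parent, fi)):
        if node in steps_j:
            return steps_i + steps_j[node]
    raise ValueError("features are not in the same tree")


def height(parent: Taxonomy) -> int:
    return max(len(path_to_root(parent, f)) - 1 for f in parent)


def sim_feature(parent: Taxonomy, fi: str, fj: str) -> float:
    """Equation (9), read as -ln(d / 2H) for distinct features (assumption)."""
    if fi == fj:
        return 1.0
    return -math.log(path_length(parent, fi, fj) / (2 * height(parent)))


def sim_opinion(parent: Taxonomy, oi: Opinion, oj: Opinion) -> float:
    """Equation (10): feature similarity, zeroed when polarities differ."""
    return sim_feature(parent, oi[0], oj[0]) if oi[1] == oj[1] else 0.0


def sim_review(parent: Taxonomy, ri: Sequence[Opinion], rj: Sequence[Opinion]) -> float:
    """Equation (11): summed opinion similarity over the product of opinion counts."""
    if not ri or not rj:
        return 0.0  # assumption: a review without opinions is similar to nothing
    total = sum(sim_opinion(parent, a, b) for a in ri for b in rj)
    return total / (len(ri) * len(rj))
```

Recomputing the height on every call keeps the sketch short; for a large corpus it would be cached once per taxonomy.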
The proposed review selection approach aims to select a review subset whose reviews are high-quality and, at the same time, diversified in terms of low similarities. In light of previous literature on review selection, we assume that the selected subset has k reviews. The aggregated similarity of reviews in the selected subset is \( {\sum}_{i=1}^{k-1}{\sum}_{j=i+1}^k sim\left({sr}_i,s{r}_j\right)/{C}_k^2 \), where sr_{i}, sr_{j} are selected reviews in the subset. Therefore, the proposed review selection problem can also be formulated as Equation (3).
Because the proposed review selection problem is NP-complete, a simulated annealing-like method is proposed to solve this single objective optimization problem. Simulated annealing is a compact and robust algorithm that provides excellent solutions to single and multiple objective optimization problems with a great reduction in computation time (Suman and Kumar 2006). It is a stochastic search algorithm based on Monte Carlo iterations, inspired by the heating and controlled cooling of a material. The algorithm normally begins from a very high initial temperature, changes the initial values of the variables and then obtains a new solution. If the objective function value of the new solution becomes better, the new solution is kept unconditionally. If the objective function value of the new solution becomes worse, the new solution is kept with a probability as shown in Equation (12):
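The aggregated similarity is the mean pairwise similarity over the selected reviews. A minimal sketch, assuming the subset is encoded as a 0/1 indicator vector over the n reviews and S is given as nested lists; the function name and the value returned for fewer than two selected reviews are illustrative assumptions.

```python
from itertools import combinations
from typing import Sequence


def aggregated_similarity(S: Sequence[Sequence[float]], x: Sequence[int]) -> float:
    """Mean of sim(sr_i, sr_j) over all C(k, 2) pairs of selected reviews."""
    selected = [i for i, flag in enumerate(x) if flag]  # indices of nonzero entries
    pairs = list(combinations(selected, 2))             # all unordered pairs
    if not pairs:
        return 0.0  # assumption: fewer than two selected reviews
    return sum(S[i][j] for i, j in pairs) / len(pairs)
```

Only the k selected rows and columns of S are touched, so the cost per evaluation is proportional to k(k − 1)/2 rather than to n².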
$$ p={e}^{\frac{\delta }{T}}, $$
(12)
where δ is the difference between the objective function values of two consecutive iterations and T is the current temperature. By making use of this stochastic search strategy as the temperature drops, the algorithm can escape local optima and move towards the global optimum.
The input of the proposed approach SADRS is the similarity matrix S = (sim_{ij})_{n × n}, the quality vector q = (q_{1}, q_{2}, …, q_{n}), the initial vector X_{0}, the lowest temperature T_{min}, the initial temperature T_{0}, and the cooling parameter ε. In the initialization stage, the temperature is high, and we calculate the aggregated similarity only over the combinations of nonzero elements in X, using the functions combination and nonzero. By doing so, we obtain the initial objective function value V(X_{0}). Then, a new vector X_{N + 1} is generated from the current state X_{N}. The function randomint generates a random integer. After that, the difference δ between the objective values of the new solution and the current state is calculated. If δ ≥ 0, the new solution achieves an objective function value at least as large and is accepted as the next state. Otherwise, rather than being rejected directly, it is accepted with a certain probability (acceptProb), which is determined by the difference δ and the current temperature T. As the iterations continue, the temperature drops and acceptProb becomes smaller, so a worse solution becomes increasingly unlikely to be accepted.
At the end, to avoid losing the best solution found during the search, the solution with the largest objective function value is returned as the final output, rather than the result of the last iteration. Thus, the final output is the optimal review vector X. The pseudo code of the SADRS approach is shown in Table 6 in the Appendix.
To further illustrate the performance of SADRS, we apply it to find the optimal subset for Example 1. We set the initial temperature T_{0} = 10,000, the cooling parameter ε = 0.999, the initial vector X_{0} = [1, 1, 1, 0, 0, 0], and the lowest temperature T_{min} = 0.001. After running more than 30,000 iterations, the approach terminates with the largest objective value of 6.6316. The 1st, 3rd and 4th reviews are selected, which is the same as the result in Section 3. This shows that SADRS could find the subset with the largest diversity. The same parameter setting is also used in the following experiments with real-world data.
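The search loop can be sketched as follows. This is a hedged illustration rather than the pseudo code of Table 6: the objective of Equation (3) is not reproduced here and is passed in as a callable, the neighbour move (swapping one selected review with one unselected review so that the subset size stays k) is an assumption, the temperature is assumed to be multiplied by ε after every iteration, and the function and parameter names are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional


def sadrs(
    objective: Callable[[List[int]], float],  # stands in for Equation (3)
    x0: List[int],
    t0: float = 10_000.0,
    t_min: float = 0.001,
    eps: float = 0.999,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Simulated-annealing-like search that returns the best vector seen."""
    rng = rng or random.Random()
    current, v_current = list(x0), objective(x0)
    best, v_best = list(current), v_current
    t = t0
    while t > t_min:
        ones = [i for i, flag in enumerate(current) if flag]
        zeros = [i for i, flag in enumerate(current) if not flag]
        if not ones or not zeros:
            break  # no swap is possible
        # Assumed neighbour move: swap one selected and one unselected review.
        candidate = list(current)
        candidate[ones[rng.randrange(len(ones))]] = 0
        candidate[zeros[rng.randrange(len(zeros))]] = 1
        v_candidate = objective(candidate)
        delta = v_candidate - v_current
        # Equation (12): a worse candidate is kept with probability e^(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, v_current = candidate, v_candidate
            if v_current > v_best:
                best, v_best = list(current), v_current
        t *= eps  # assumed geometric cooling schedule
    return best
```

Tracking `best` separately from `current` mirrors the point made above: the last state visited at low temperature is not necessarily the best one encountered.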