NURAIN BATRISYIA BINTI BIDIN, UNIVERSITI TEKNOLOGI MARA
Malaysia’s island tourism continues to grow, as reflected in the extensive reviews of destinations such as Langkawi, Perhentian, and Pangkor on platforms like YouTube. YouTube was selected as the primary data source for this study because its large repository of user-generated video reviews and comments offers valuable insight into tourist perspectives and satisfaction for sentiment analysis (SA). However, most existing SA research in the tourism sector relies on lexicon-based or machine learning (ML) techniques, which frequently overlook the variety of contextual meanings found in text reviews, such as sarcasm, local idioms, and mixed-language expressions. This research addressed three main challenges in tourism-related SA: the linguistic complexity of online reviews, the limited capacity of existing techniques to detect cultural and contextual subtleties, and the lack of labelled data for supervised learning. To bridge this gap, this research developed a hybrid SA approach that integrates Descriptive-Semantic Analysis (DSA) techniques with both a lexicon-based approach, the Valence Aware Dictionary and Sentiment Reasoner (VADER), and an ML approach, the Support Vector Machine (SVM). For the DSA implementation, significant words were identified using Term Frequency-Inverse Document Frequency (TF-IDF), while dominant topics and thematic insights were extracted using Latent Dirichlet Allocation (LDA). The island reviews were extracted using the YouTube Data API, followed by pre-processing steps such as translation, cleaning, and normalisation. The findings indicate that the hybrid SA approach achieved accuracy rates of 98.15% for Langkawi, 98.5% for Perhentian, and 97.19% for Pangkor. Moreover, model validation using the Area Under the Curve (AUC) metric showed strong performance. Perhentian had a validation accuracy of 100% and a testing accuracy of 98.84%. Langkawi followed with 99.36% validation and 98.72% testing accuracy, while Pangkor recorded 95.77% validation and 97.22% testing accuracy.
Overall, this approach bridged the gap between descriptive and semantic insights and overcame limitations of standalone SA techniques, such as difficulty in accurately interpreting user content and reliance on basic techniques that overlook deeper meanings.
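As an illustration of the TF-IDF step used to identify significant words, the following is a minimal sketch, not the implementation used in this study. It assumes the classic weighting tf × log(N / df) with whitespace tokenisation; library implementations often use smoothed or normalised variants. The function name `tfidf` and the sample comments are hypothetical and are not data from the study.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: comments are already translated, cleaned and normalised,
    # so lower-casing and whitespace splitting are enough for this sketch.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-processed comments (illustrative only).
comments = [
    "beach clean water clear",
    "beach crowded ferry late",
    "beach snorkelling coral beautiful",
]
scores = tfidf(comments)
```

In this sketch a word that appears in every comment, such as "beach", receives a weight of zero, while words confined to a single comment receive the highest weights, which is why TF-IDF surfaces the terms that distinguish one review from another.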