|
|
Sono un Ricercatore Postdottorato presso il Dipartimento di Matematica e Informatica dell'Università di Cagliari in Italia. La mia ricerca si estende su vari ambiti tra cui Data Mining, Computer Vision, Intelligenza Artificiale e Sicurezza Informatica. Progetto e sviluppo soluzioni innovative per una vasta gamma di applicazioni come Recommender Systems, Credit Scoring, Fraud Detection, Intrusion Detection, Image Recognition, Blockchain-based Systems e Cybersecurity. |
I am a Postdoctoral Researcher in the Department of Mathematics and Computer Science at the University of Cagliari, Italy. My research spans various domains including Data Mining, Computer Vision, Artificial Intelligence, and Data Security. I design and develop innovative solutions for a wide range of applications such as Recommender Systems, Credit Scoring, Fraud Detection, Intrusion Detection, Image Recognition, Blockchain-based Systems, and Cybersecurity. |
|
|
|
|
|
|
|
Number of Publications |
Number of Citations |
H-index |
|
|
|
|
Percentage of publications included in the top 25% most cited documents worldwide |
Updated as of 2024-10-04
|
|
|
|
|
|
|
PUBBLICAZIONI SCIENTIFICHE / SCIENTIFIC PUBLICATIONS |
|
|
Enhancing IDS with Ensemble LSTM Networks Using Real and GAN Data
(view on RG)
R. Saia, S. Carta, G. Fenu, S. Podda, L. Pompianu
Proceedings of the 34th IEEE International Telecommunication Networks and Applications Conference (ITNAC-2024), Sydney, Australia |
[Show/Hide Abstract]
Today's computer system security is critical at every operational level and device, as the compromise of a single element can propagate to other connected network elements, causing unpredictable and dangerous effects. To counter unauthorized access and evolving malicious strategies, researchers have intensified efforts to develop effective Intrusion Detection Systems (IDSs) that monitor and analyze network traffic to detect illegitimate activities. This is a difficult challenge given the growing sophistication of malicious tactics, which often mimic legitimate behavior. In such a context, this work proposes the HYDRA-LNNE (Hybrid Data Real and Artificial LSTM Neural Network Ensemble) approach, which involves feature selection and data quantization to reduce data complexity, and an ensemble of three Long Short-Term Memory (LSTM) neural networks trained on real data, GAN-generated synthetic data, and a combination of both, with the aim of maximizing the strengths of each data type and effectively discriminating normal from malicious network activities. The validation process performed on the UNSW-NB15 dataset, well known for its comprehensive representation of modern cyber threats, shows that our approach outperforms state-of-the-art solutions across multiple metrics.
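To illustrate the ensemble logic described above (three LSTM classifiers trained on real, GAN-generated, and mixed data, combined by averaging their predicted probabilities), here is a minimal Python/Keras sketch; the GAN training, feature selection, and quantization steps are omitted, and all names, shapes, and hyper-parameters are illustrative assumptions rather than the paper's actual configuration:

import numpy as np
import tensorflow as tf

def build_lstm(timesteps, n_features):
    # One binary LSTM classifier (normal vs. malicious traffic).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, n_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def train_members(datasets, timesteps, n_features, epochs=10):
    # datasets: [(X_real, y_real), (X_gan, y_gan), (X_mixed, y_mixed)]
    members = []
    for X, y in datasets:
        m = build_lstm(timesteps, n_features)
        m.fit(X, y, epochs=epochs, batch_size=128, verbose=0)
        members.append(m)
    return members

def soft_vote(members, X_test, threshold=0.5):
    # Average the per-member attack probabilities and threshold the mean.
    probs = np.mean([m.predict(X_test, verbose=0).ravel() for m in members], axis=0)
    return (probs >= threshold).astype(int)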
|
|
|
EEG Biometrics with GAN Integration for Secure Smart City Data Access
(view on RG)
R. Saia, R. Balia, S. Podda, L. Pompianu, S. Carta, A. Pisu
Proceedings of the 8th International Conference on Computer-Human Interaction Research and Applications (CHIRA-2024), Porto, Portugal |
[Show/Hide Abstract]
Biometric systems leveraging ElectroEncephaloGram (EEG) data for user authentication present significant potential in diverse contexts, especially in Smart City ecosystems where secure access to sensitive data is crucial (e.g., healthcare systems, intelligent transportation, smart grids, public safety, and citizen services). However, the complexity and variability of EEG data raise challenges in developing effective solutions. In this context, after a preliminary series of experiments, performed on the Biometric EEG Dataset (BED) to find the best feature extraction method for the input, this paper proposes a novel EEG-based user verification framework. It utilizes Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction, followed by feature selection via the Boruta strategy and automated data quantization. An important aspect of this approach is the integration of Generative Adversarial Networks (GANs) to generate synthetic EEG data, which, along with real data, is employed to train an ensemble of Artificial Neural Networks (ANNs). The ensemble decision is made using soft voting mechanisms, promising a robust and competitive solution compared to current state-of-the-art techniques. Initial experiments suggest that this framework has significant potential for further development and optimization.
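As a concrete hint of the MFCC step mentioned above, a minimal sketch of per-channel MFCC extraction from an EEG recording with librosa; the number of coefficients, the frame summarization, and the variable names are illustrative assumptions, not the settings adopted in the paper:

import numpy as np
import librosa

def eeg_mfcc_features(eeg, fs, n_mfcc=13):
    # eeg: (n_channels, n_samples) array; fs: sampling rate in Hz.
    # Returns one concatenated vector (mean MFCCs over time frames, per channel).
    feats = []
    for channel in eeg:
        mfcc = librosa.feature.mfcc(y=channel.astype(np.float32), sr=fs, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))
    return np.concatenate(feats)

# Example: a 14-channel recording of 10 seconds sampled at 256 Hz
# x = eeg_mfcc_features(np.random.randn(14, 256 * 10), fs=256)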
|
|
|
Enhancing EEG-Based User Verification with a Normalized Neural Network Ensemble Approach
(view on RG)
R. Saia, R. Balia, S. Podda, L. Pompianu, S. Carta, A. Pisu
Proceedings of the 8th International Conference on Computer-Human Interaction Research and Applications (CHIRA-2024), Porto, Portugal |
[Show/Hide Abstract]
The development of user identity verification approaches using biometric systems based on EEG data holds significant promise across various domains. However, the inherent complexity and variability of this data make designing reliable solutions challenging. In response to these challenges, this work introduces a Normalized Neural Network Ensemble (NNNE) approach for EEG-based user verification. It leverages neural networks to enhance the current state-of-the-art performance, aiming to overcome the problems associated with EEG data by capturing spatial and temporal patterns in EEG signals more effectively. In detail, the proposed approach relies on an architecture centered around an ensemble of Multi-Layer Perceptron artificial neural networks regulated by a soft voting criterion. As part of the preprocessing steps, the input data is normalized by transforming features based on quantile information. Additionally, the MLP hyperparameters and the number of MLP evaluators in the ensemble are automatically optimized. Since state-of-the-art works in this field are highly heterogeneous, with a wide variability in the choices of components, approaches, and strategies that makes comparisons between their performances difficult and sometimes impossible, this paper exploits the opportunity offered by the Biometric EEG Dataset (BED), which provides benchmark values that facilitate comparisons with widely adopted approaches in the literature in terms of stimuli and feature extraction techniques. The experimental results show that the proposed NNNE approach significantly improves on the state-of-the-art approach (Hidden Markov Model) used by the authors of the dataset to define the reference values.
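A compact scikit-learn sketch of the pipeline outlined above (quantile-based normalization followed by a soft-voting ensemble of Multi-Layer Perceptrons); the ensemble size and the MLP hyper-parameters below are fixed placeholders, whereas the paper optimizes them automatically:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

n_members = 5  # placeholder: the paper selects this number automatically
ensemble = VotingClassifier(
    estimators=[(f"mlp{i}", MLPClassifier(hidden_layer_sizes=(128, 64),
                                          max_iter=500, random_state=i))
                for i in range(n_members)],
    voting="soft",  # average the predicted class probabilities
)
model = make_pipeline(QuantileTransformer(output_distribution="normal"), ensemble)
# model.fit(X_train, y_train); y_pred = model.predict(X_test)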
|
|
|
Investigating the Effectiveness of 3D Monocular Object Detection Methods for Roadside Scenarios
(view on RG)
S. Barra, M. Marras, M. Sondos, A. Podda, R. Saia
Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (SAC-2024), Cagliari, Italy |
[Show/Hide Abstract]
Urban environments demand effective and efficient 3D detection of objects from monocular cameras, e.g., for intelligent monitoring or decision support. Unfortunately, the limited availability of large-scale roadside camera datasets and the exclusive focus of existing 3D object detection methods on autonomous driving scenarios pose significant challenges for their practical adoption. In this paper, we conduct a systematic analysis of 3D object detection methods, originally designed for autonomous driving scenarios, on monocular roadside images. Under a common evaluation protocol, based on a synthetic dataset with images from monocular roadside cameras located at intersection areas, we analyze the detection quality achieved by these methods in the roadside context and the influence of key operational parameters. Our study finally highlights open challenges and future directions in this field.
|
|
|
CARgram: CNN-based Accident Recognition from Road Sounds through Intensity-Projected Spectrogram Analysis
(view on RG)
R. Balia, L. Pompianu, S. Carta, G. Fenu, R. Saia
Published in Digital Signal Processing Journal (DSP), Elsevier. |
[Show/Hide Abstract]
Road surveillance systems play an important role in traffic monitoring and detecting hazardous events. In recent years, several artificial intelligence-based approaches have been proposed for this purpose, typically based on the analysis of the acquired video streams. However, occlusions, poor lighting conditions, and the heterogeneity of the events may often reduce their effectiveness and reliability. To overcome these limitations, scientific and industrial research has therefore focused on integrating such solutions with audio recognition methods. By automatically identifying anomalous traffic sounds, e.g., car crashes and skids, they help reduce false positives and missed alarms. Following this trend, in this work we propose an innovative pipeline for the analysis of intensity-projected audio spectrograms from streams of traffic sounds, which exploits both (i) a visual approach based on a custom, special-purpose Convolutional Neural Network for the identification of anomalous events in the sound signal; and (ii) a novel multi-representational encoding of the input, which proved to significantly improve the recognition accuracy of the neural models. The validation of the proposed pipeline on the public MIVIA dataset, with a 0.96% false positive rate, showed the best performance against the state-of-the-art competitors. Notably, following such findings, a prototype implementation has been deployed on a real-world video surveillance infrastructure.
|
|
|
Can Existing 3D Monocular Object Detection Methods Work in Roadside Contexts? A Reproducibility Study
(view on RG)
S. Barra, M. Marras, S. Mohamed, S. Podda, R. Saia
Proceedings of the 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA-2023), Rome, Italy. |
[Show/Hide Abstract]
Detecting 3D objects in images from urban monocular cameras is essential to enable intelligent monitoring applications for local municipalities' decision-support systems. However, existing detection methods in this domain are mainly focused on autonomous driving and limited to frontal views from sensors mounted on the vehicle. In contrast, to monitor urban areas, local municipalities rely on streams collected from fixed cameras, especially at intersections and in particularly dangerous areas. Such streams represent a rich source of data for applications focused on traffic patterns, road conditions, and potential hazards. In this paper, given the lack of large-scale datasets of images from roadside cameras, and the time-consuming process of generating real labelled data, we first proposed a synthetic dataset created with the CARLA simulator, which makes dataset creation efficient yet acceptable in quality. The dataset consists of 7,481 development images and 7,518 test images. Then, we reproduced state-of-the-art models for monocular 3D object detection proven to work well in autonomous driving (e.g., M3DRPN, Monodle, SMOKE, and Kinematic) and tested them on the newly generated dataset. Our results show that our dataset can serve as a reference for future experiments and that state-of-the-art models from the autonomous driving domain do not always generalize well to monocular roadside camera images.
|
|
|
Brain Waves combined with Evoked Potentials as Biometric Approach for User Identification: a Survey
(view on RG)
Roberto Saia, Salvatore Carta, Gianni Fenu, Livio Pompianu
Proceedings of the Intelligent Systems Conference (INTELLISYS-2023), Amsterdam, The Netherlands. |
[Show/Hide Abstract]
The growing availability of low-cost devices able to perform an Electroencephalography (EEG) has opened stimulating scenarios in the security field, where such data could be exploited as a biometric approach for user identification. However, a series of problems, first of all the difficulty of obtaining unique and stable EEG patterns over time, has made this type of research a hard challenge that has forced researchers to design ever more efficient solutions. In this context, one of the approaches that has proved most effective is based on the application of external stimuli to the user during EEG data collection, a stimulation method named Evoked Potentials (EPs), long used for other purposes in the clinical setting and employed here to increase the stability of the EEG patterns. The combination of EEG and EP has generated an ever-increasing number of works in the literature, but their heterogeneity makes it difficult to take stock of the state of the art; this work therefore analyzes the literature of the last six years, providing information useful for directing the research of those who work in this field. |
|
|
Leveraging the Training Data Partitioning to Improve Events Characterization in Intrusion Detection Systems
(view on RG)
Roberto Saia, Salvatore Carta, Gianni Fenu, Livio Pompianu
Proceedings of the 16th International Conference on Computer Science and Information Technology (ICCSIT-2023), Paris, France. |
[Show/Hide Abstract]
The ever-increasing use of services based on computer networks, even in crucial areas unthinkable until a few years ago, has made the security of these networks a crucial element for everyone, also in consideration of the increasingly sophisticated techniques and strategies available to attackers. In this context, Intrusion Detection Systems (IDSs) play a very important role, since they are responsible for analyzing and classifying each network activity as legitimate or illegitimate, allowing us to take the necessary countermeasures at the appropriate time. However, these systems are not infallible for several reasons, the most important of which are the constant evolution of attacks (e.g., zero-day attacks) and the fact that many attacks exhibit behavior similar to that of legitimate activities, and are therefore very hard to identify. This work relies on the hypothesis that the subdivision of the training data used for the definition of the IDS classification model into a certain number of partitions, in terms of events and features, can improve the characterization of the network events, improving the system performance. All the non-overlapping data partitions train independent classification models, and each event is classified according to a majority-voting rule. A series of experiments conducted on a benchmark real-world dataset supports the initial hypothesis, showing a performance improvement compared to a canonical training approach. |
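As an illustration of the partitioning idea (the actual splitting criterion and classifiers used in the paper may differ), a minimal sketch that trains one model per event/feature partition and combines the verdicts by majority vote:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_partitioned(X, y, n_event_parts=3, n_feature_parts=2, seed=0):
    rng = np.random.default_rng(seed)
    event_parts = np.array_split(rng.permutation(len(X)), n_event_parts)
    feature_parts = np.array_split(rng.permutation(X.shape[1]), n_feature_parts)
    models = []
    for rows in event_parts:
        for cols in feature_parts:
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            clf.fit(X[np.ix_(rows, cols)], y[rows])
            models.append((clf, cols))
    return models

def majority_vote(models, X):
    # Binary labels (0 = legitimate, 1 = intrusion) are assumed.
    votes = np.array([clf.predict(X[:, cols]) for clf, cols in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)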
|
|
Influencing Brain Waves by Evoked Potentials as Biometric Approach: Taking Stock of the Last Six Years of Research
(view on RG)
Roberto Saia, Salvatore Carta, Gianni Fenu, Livio Pompianu
Published in Neural Computing and Applications (NCAA) Journal, Springer. |
[Show/Hide Abstract]
The scientific advances of recent years have made available to anyone affordable hardware devices capable of doing something unthinkable until a few years ago: reading brain waves. This means that through small wearable devices it is possible to perform an Electroencephalography (EEG), albeit with less potential than that offered by high-cost professional devices. Such devices allow researchers to carry out a huge number of experiments that were once impossible in many areas due to the high costs of the necessary hardware. Many studies in the literature explore the use of EEG data as a biometric approach for people identification but, unfortunately, it presents problems mainly related to the difficulty of extracting unique and stable patterns from users, despite the adoption of sophisticated techniques. An approach to face this problem is based on Evoked Potentials (EPs), external stimuli applied during the EEG reading: a non-invasive technique used for many years in clinical routine, in combination with other diagnostic tests, to evaluate the electrical activity of some areas of the brain and spinal cord in order to diagnose neurological disorders. In consideration of the growing number of works in the literature that combine the EEG and EP approaches for biometric purposes, this work aims to evaluate the practical feasibility of such approaches as reliable biometric instruments for user identification by surveying the state of the art of the last six years, also providing an overview of the elements and concepts related to this research area. |
|
|
Brain Waves and Evoked Potentials as Biometric User Identification Strategy: an Affordable Low-Cost Approach
(view on RG)
Roberto Saia, Salvatore Carta, Gianni Fenu, Livio Pompianu
Proceedings of the 19th International Conference on Security and Cryptography (SECRYPT-2022), Lisbon, Portugal. |
[Show/Hide Abstract]
The relatively recent introduction on the market of low-cost devices able to perform an Electroencephalography (EEG) has opened a stimulating research scenario that involves a large number of researchers previously excluded due to the high costs of such hardware. In this regard, one of the most stimulating research fields is focused on the use of such devices in the context of biometric systems, where the EEG data are exploited for user identification purposes. Based on the current literature, which reports that many of these systems are designed by combining the EEG data with a series of external stimuli (Evoked Potentials) to improve the reliability and stability over time of the EEG patterns, this work aims to formalize a biometric identification system based on low-cost EEG devices and simple stimulation instruments, such as images and sounds generated by a computer. In other words, our objective is to design a low-cost EEG-based biometric approach exploitable in a large number of real-world scenarios. |
|
|
A Region-based Training Data Segmentation Strategy to Credit Scoring
(view on RG)
Roberto Saia, Salvatore Carta, Gianni Fenu, Livio Pompianu
Proceedings of the 19th International Conference on Security and Cryptography (SECRYPT-2022), Lisbon, Portugal. |
[Show/Hide Abstract]
The rating of users requesting financial services is a task of growing importance, especially in this historical period of the COVID-19 pandemic, characterized by a dramatic increase in online activities, mainly related to e-commerce. This kind of assessment was performed manually in the past, but today it needs to be carried out by automatic credit scoring systems, due to the enormous number of requests to process. It follows that such systems play a crucial role for financial operators, as their effectiveness is directly related to gains and losses of money. Despite the huge investments in terms of financial and human resources devoted to the development of such systems, the state-of-the-art solutions are transversally affected by some well-known problems that make the development of credit scoring systems a challenging task, mainly related to the imbalance and heterogeneity of the involved data, to which the scarcity of public datasets is added. The Region-based Training Data Segmentation (RTDS) strategy proposed in this work revolves around a divide-and-conquer approach, where the user classification depends on the results of several sub-classifications. In more detail, the training data is divided into regions that bound different users and features, which are used to train several classification models that lead toward the final classification through a majority-voting rule. Such a strategy relies on the consideration that the independent analysis of different users and features can lead to a more accurate classification than that offered by a single evaluation model trained on the entire dataset. The validation process, carried out using three public real-world datasets with different numbers of features, samples, and degrees of data imbalance, demonstrates the effectiveness of the proposed strategy, which outperforms the canonical training one on all the datasets. |
|
|
A Blockchain-based Distributed Paradigm to Secure Localization Services
(view on RG)
Roberto Saia, Alessandro Sebastian Podda, Livio Pompianu, Diego Reforgiato Recupero, Gianni Fenu
Published in Sensors journal, MDPI. |
[Show/Hide Abstract]
In recent decades, modern societies have been experiencing an increasing adoption of interconnected smart devices. This revolution involves not only canonical devices such as smartphones and tablets, but also simple objects like light bulbs. Known as the Internet of Things (IoT), this ever-growing scenario offers enormous opportunities in many areas of modern society, especially if joined by other emerging technologies such as, for example, the blockchain. Indeed, the latter allows users to certify transactions publicly, without relying on central authorities or intermediaries. This work aims to exploit the scenario above by proposing a novel blockchain-based distributed paradigm to secure localization services, here defined as the Internet of Entities (IoE). It represents a mechanism for the reliable localization of people and things, and it exploits the increasing number of existing wireless devices and the blockchain-based distributed ledger technology. Moreover, unlike most canonical localization approaches, it is strongly oriented towards the protection of users' privacy. Finally, its implementation requires minimal effort, since it employs existing infrastructures and devices, thus giving life to a new and wide data environment, exploitable in many domains, such as e-health, smart cities, and smart mobility. |
|
|
Wireless Internet, Multimedia, and Artificial Intelligence: New Applications and Infrastructures
(view on RG)
Roberto Saia, Salvatore Carta, Olaf Bergmann
Published in Future Internet journal, MDPI. |
[Show/Hide Abstract]
The potential offered by the Internet, combined with the enormous number of connectable devices, offers benefits in many areas of our modern societies, both public and private. The possibility of making heterogeneous devices communicate with each other through the Internet has given rise to a constantly growing scenario, which was unthinkable not long ago. This unstoppable growth takes place thanks to the continuous availability of increasingly sophisticated device features, an ever-increasing bandwidth and reliability of the connections, and the ever-lower consumption of the devices, which grants them long autonomy. This scenario of exponential growth also involves other sectors such as, for example, that of Artificial Intelligence (AI), which offers us increasingly sophisticated approaches that can be synergistically combined with wireless devices and the Internet in order to create powerful applications for everyday life. Precisely for the aforementioned reasons, the community of researchers, year by year, dedicates more time and resources to this direction. It should be observed that this happens in a way that is atypical with respect to other research fields, because the progress achieved and the applications developed have practical uses in numerous different domains. |
|
|
Decomposing Training Data to Improve Network Intrusion Detection Performance
(view on RG)
Roberto Saia, Alessandro Sebastian Podda, Gianni Fenu, Riccardo Balia
Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2021), Online streaming due to COVID-19 emergency. |
[Show/Hide Abstract]
Anyone working in the field of network intrusion detection has been able to observe how it involves an ever-increasing number of techniques and strategies aimed at overcoming the issues that affect the state-of-the-art solutions. Data imbalance and heterogeneity are only some representative examples of them, and each misclassification made in this context could have enormous repercussions in crucial areas such as, for instance, finance, privacy, and public reputation. This happens because the current scenario is characterized by a huge number of public and private network-based services. The idea behind the proposed work is to decompose the canonical classification process into several sub-processes, where the final classification depends on all the sub-process results, plus the canonical one. The proposed Training Data Decomposition (TDD) strategy is applied to the training datasets, where it performs a decomposition into regions, according to a defined number of events and features. The rationale that guides this process is the observation that the same network event could be evaluated in a different manner when it is evaluated in different time periods and/or when it involves different features. According to this observation, the proposed approach adopts different classification models, each of them trained on a different data region characterized by different time periods and features, classifying the event both on the basis of all model results and on the basis of the canonical strategy that involves all data. |
|
|
From Payment Services Directive 2 (PSD2) to Credit Scoring: A Case Study on an Italian Banking Institution
(view on RG)
Roberto Saia, Alessandro Giuliani, Livio Pompianu, Salvatore Carta
Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2021), Online streaming due to COVID-19 emergency. |
[Show/Hide Abstract]
The Payment Services Directive 2 (PSD2), recently issued by the European Union, allows banks to share their customers' data, provided that the customers authorize the operation. On the one hand, this opportunity offers interesting perspectives to financial operators, allowing them to evaluate customers' reliability (Credit Scoring) even in the absence of the canonical information typically used (e.g., age, current job, total income, or previous loans). On the other hand, the state-of-the-art approaches and strategies still train their Credit Scoring models using the canonical information. This scenario is further worsened by the scarcity of proper datasets needed for research purposes and the class imbalance between the reliable and unreliable cases, which biases the reliability of the classification models trained using this information. The proposed work is aimed at experimentally investigating the possibility of defining a Credit Scoring model based on the bank transactions of a customer, instead of the canonical information, comparing the performance of the two models (canonical and transaction-based), and proposing an approach to improve the performance of the transaction-based model through the introduction of meta-features. The performed experiments show the feasibility of a Credit Scoring model based only on banking transactions and the possibility of improving its performance by introducing simple meta-features. |
|
|
Credit Scoring by Leveraging an Ensemble Stochastic Criterion in a Transformed Feature Space
(view on RG)
Salvatore Carta, Anselmo Ferreira, Diego Reforgiato Recupero, Roberto Saia
Published in Progress in Artificial Intelligence (PRAI) Journal, Springer. |
[Show/Hide Abstract]
Credit scoring models are aimed at assessing the capability of repaying a loan by evaluating user reliability in several financial contexts, representing a crucial instrument for a large number of financial operators such as banks. Literature solutions offer many approaches designed to evaluate users' reliability on the basis of information about them, but they share some well-known problems that reduce their performance, such as data imbalance and heterogeneity. In order to face these problems, this paper introduces an ensemble stochastic criterion that operates in a discretized feature space, extended with some meta-features, in order to perform efficient credit scoring. Such an approach uses several classification algorithms in such a way that the final classification is obtained by a stochastic criterion applied to a new feature space, obtained by a two-fold preprocessing technique. We validated the proposed approach by using real-world datasets with different data imbalance configurations, and the obtained results show that it outperforms some state-of-the-art solutions. |
|
|
A Two-Step Feature Space Transforming Method to Improve Credit Scoring Performance
(view on RG)
Salvatore Carta, Gianni Fenu, Anselmo Ferreira, Diego Reforgiato Recupero, Roberto Saia
Published in Communications in Computer and Information Science (CCIS), Springer. |
[Show/Hide Abstract]
The increasing amount of credit offered by financial institutions has required intelligent and efficient methodologies of credit scoring. Therefore, the use of different machine learning solutions for this task has been growing in recent years. Such procedures have been used in order to identify customers who are reliable or unreliable, with the intention to counterbalance financial losses due to loans offered to the wrong customer profiles. Notwithstanding, such an application of machine learning suffers from several limitations when put into practice, such as unbalanced datasets and, especially, the absence of sufficient information from the features that can be useful to discriminate reliable and unreliable loans. To overcome such drawbacks, we propose in this work a Two-Step Feature Space Transforming approach, which operates by evolving feature information in a twofold operation: (i) data enhancement; and (ii) data discretization. In the first step, additional meta-features are used in order to improve data discrimination. In the second step, the goal is to reduce the diversity of features. Experimental results on real-world datasets with different levels of unbalancing show that such a step can improve, in a consistent way, the performance of the best machine learning algorithm for such a task. With such results we aim to open new perspectives for novel efficient credit scoring systems. |
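A minimal sketch of the two-step idea (enrich each record with simple meta-features, then discretize the result to reduce feature diversity); the chosen meta-features and the number of bins are illustrative assumptions, not necessarily those used in the paper:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def enrich(X):
    # Step 1 (data enhancement): append per-record meta-features.
    meta = np.column_stack([X.mean(axis=1), X.std(axis=1), X.min(axis=1), X.max(axis=1)])
    return np.hstack([X, meta])

def two_step_transform(X_train, X_test, n_bins=10):
    # Step 2 (data discretization): reduce the diversity of the enriched features.
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
    return disc.fit_transform(enrich(X_train)), disc.transform(enrich(X_test))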
|
|
A Local Feature Engineering Strategy to Improve Network Anomaly Detection
(view on RG)
Salvatore Carta, Alessandro Sebastian Podda, Diego Reforgiato Recupero, Roberto Saia
Published in Future Internet journal, MDPI. |
[Show/Hide Abstract]
The dramatic increase in devices and services that has characterized modern societies in recent decades, boosted by the exponential growth of ever faster network connections and the predominant use of wireless connection technologies, has given rise to a very crucial challenge in terms of security. The anomaly-based Intrusion Detection Systems, which for a long time have represented one of the most efficient solutions for detecting intrusion attempts on a network, therefore have to face this new and more complicated scenario. Well-known problems, such as the difficulty of distinguishing legitimate activities from illegitimate ones due to their similar characteristics and their high degree of heterogeneity, have today become even more complex, considering the increase in network activity. After providing an extensive overview of the scenario under consideration, this work proposes a Local Feature Engineering (LFE) strategy aimed at facing such problems through the adoption of a data preprocessing strategy that reduces the number of possible network event patterns, increasing at the same time their characterization. Unlike the canonical feature engineering approaches, which take into account the entire dataset, it operates locally in the feature space of each single event. The experiments conducted on real-world data have shown that this strategy, which is based on the introduction of new features and the discretization of their values, improves the performance of the canonical state-of-the-art solutions. |
|
|
Popularity Prediction of Instagram Posts
(view on RG)
Salvatore Carta, Alessandro Sebastian Podda, Diego Reforgiato Recupero, Roberto Saia, Giovanni Usai
Published in Computation Journal, MDPI. |
[Show/Hide Abstract]
Predicting the popularity of posts on social networks has taken on significant importance in recent years, and several social media management tools now offer solutions to improve and optimize the quality of published content and to enhance the attractiveness of companies and organizations. Scientific research has recently moved in this direction, with the aim of exploiting advanced techniques such as machine learning, deep learning, natural language processing, etc., to support such tools. In light of the above, in this work we aim to address the challenge of predicting the popularity of a future post on Instagram, by defining the problem as a classification task and by proposing an original approach based on Gradient Boosting and feature engineering, which led us to promising experimental results. The proposed approach exploits big data technologies for scalability and efficiency and is general enough to be applied to other social media as well. |
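As a toy illustration of the classification set-up described above, a sketch with scikit-learn's gradient boosting; the engineered features, the popularity threshold, and the column names are assumptions for the example only (the paper also relies on big data technologies for scalability):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def popularity_model(posts: pd.DataFrame, like_threshold: int = 1000):
    # posts: engineered features (e.g., follower count, posting hour, hashtag count,
    # caption length) plus the observed number of likes in a "likes" column.
    X = posts.drop(columns=["likes"])
    y = (posts["likes"] >= like_threshold).astype(int)  # popular vs. not popular
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)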
|
|
A Feature Space Transformation to Intrusion Detection Systems
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Gianni Fenu
Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2020), Online streaming due to COVID-19 emergency. |
[Show/Hide Abstract]
Anomaly-based Intrusion Detection Systems (IDSs) represent one of the most efficient methods of countering intrusion attempts against the ever-growing number of network-based services. Despite the central role they play, their effectiveness is jeopardized by a series of problems that reduce the IDS effectiveness in a real-world context, mainly due to the difficulty of correctly classifying attacks with characteristics very similar to normal network activity or, again, due to the difficulty of countering novel forms of attacks (zero-days). Such problems have been faced in this paper by adopting a Twofold Feature Space Transformation (TFST) approach aimed at gaining a better characterization of the network events and a reduction of their potential patterns. The idea behind such an approach is based on: (i) the addition of meta-information, improving the event characterization; (ii) the discretization of the new feature space in order to join together patterns that lead back to the same events, reducing the number of false alarms. The validation process performed by using a real-world dataset indicates that the proposed approach is able to outperform the canonical state-of-the-art solutions, improving their intrusion detection capability. |
|
|
Analysis of a Consensus Protocol for Extending Consistent Subchains on the Bitcoin Blockchain
(view on RG)
Riccardo Longo, Alessandro Sebastian Podda, Roberto Saia
Published in Computation Journal, MDPI. |
[Show/Hide Abstract]
Nowadays, an increasing number of third-party applications exploit the Bitcoin blockchain to immutably store tamper-proof records of their executions. To this purpose, they leverage the few extra bytes available for encoding custom metadata in Bitcoin transactions. A sequence of records of the same application can thus be abstracted as a stand-alone subchain inside the Bitcoin blockchain. However, several existing approaches do not make any assumptions about the consistency of their subchains, either (i) neglecting the possibility that this sequence of messages can be altered, mainly due to unhandled concurrency, network malfunctions, application bugs, or malicious users; or (ii) giving weak guarantees about their security. To tackle this issue, in this paper we propose an improved version of a consensus protocol formalized in our previous work, built on top of the Bitcoin protocol, to incentivize third-party nodes to extend their subchains consistently. Besides, we perform an extensive analysis of this protocol, both defining its properties and presenting some real-world attack scenarios, to show how its specific design choices and parameter configurations can be crucial to prevent malicious practices. |
|
|
A General Framework for Risk Controlled Trading Based on Machine Learning and Statistical Arbitrage
(view on RG)
Salvatore Carta, Diego Reforgiato Recupero, Roberto Saia, Maria Stanciu
Proceedings of the 6th Annual Conference on Machine Learning, Optimization and Data Science (LOD-2020), Siena, Italy. |
[Show/Hide Abstract]
Nowadays, the use of machine learning has gained significant interest in financial time series prediction, hence representing a promised land for financial applications such as algorithmic trading. In this setting, this paper proposes a general framework based on an ensemble of regression algorithms and dynamic asset selection applied to the well-known statistical arbitrage trading strategy. Several extremely heterogeneous state-of-the-art machine learning algorithms, exploiting different feature selection processes in input, are used as base components of the ensemble, which is in charge of forecasting the return of each of the considered stocks. Before being used as an input to the arbitrage mechanism, the final ranking of the assets also takes into account a quality assurance mechanism that prunes the stocks with poor forecasting accuracy in the previous periods. The framework has a general application for any risk-balanced trading strategy aiming to exploit different financial assets. It was evaluated by implementing an intra-day statistical arbitrage strategy on the stocks of the S&P 500 index. Our approach outperforms each single base regressor we adopted, which we considered as baselines. More importantly, it also outperforms a Buy-and-Hold strategy on the S&P 500 index, both during financial turmoil such as the global financial crisis and during the massive market growth of recent years. |
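The core ranking-and-pruning step can be pictured with the following sketch, in which per-stock return forecasts from several regressors are averaged, stocks with poor recent forecasting accuracy are discarded, and the remainder is ranked into long and short legs; the thresholds and array layouts are illustrative assumptions, not the paper's exact procedure:

import numpy as np

def select_assets(forecasts, past_errors, max_error=0.02, top_k=10):
    # forecasts: (n_models, n_stocks) predicted returns for the next period.
    # past_errors: (n_stocks,) mean absolute forecast error over recent periods.
    mean_forecast = forecasts.mean(axis=0)             # ensemble forecast per stock
    eligible = np.where(past_errors <= max_error)[0]   # quality-assurance pruning
    ranked = eligible[np.argsort(mean_forecast[eligible])]
    longs, shorts = ranked[-top_k:], ranked[:top_k]    # buy the best, short the worst
    return longs, shorts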
|
|
Dissecting Ponzi schemes on Ethereum: identification, analysis, and impact
(view on RG)
Massimo Bartoletti, Salvatore Carta, Tiziana Cimoli, Roberto Saia
Published in Future Generation Computer Systems (FGCS) Journal, Elsevier |
[Show/Hide Abstract]
Ponzi schemes are financial frauds which lure users under the promise of high profits. In reality, users are repaid only with the investments of new users joining the scheme: consequently, a Ponzi scheme implodes soon after users stop joining it. Originated in the offline world 150 years ago, Ponzi schemes have since migrated to the digital world, approaching first the Web and, more recently, hanging over cryptocurrencies like Bitcoin. Smart contract platforms like Ethereum have provided a new opportunity for scammers, who now have the possibility of creating "trustworthy" frauds that still make users lose money, but at least are guaranteed to execute "correctly". We present a comprehensive survey of Ponzi schemes on Ethereum, analysing their behaviour and their impact from various viewpoints. |
|
|
A Holistic Auto-Configurable Ensemble Machine Learning Strategy for Financial Trading
(view on RG)
Salvatore Carta, Andrea Corriga, Anselmo Ferreira, Diego Reforgiato Recupero, Roberto Saia
Published in Computation Journal, MDPI. |
[Show/Hide Abstract]
Financial market forecasting represents a challenging task for a series of reasons, such as the irregularity, high fluctuation, and noise of the involved data, and the peculiarly high unpredictability of the financial domain. Moreover, the literature does not offer a proper methodology to systematically identify the intrinsic and hyper-parameters, input features, and base algorithms of a forecasting strategy in order to automatically adapt itself to the chosen market. To tackle these issues, this paper introduces a fully automated optimized ensemble approach, where an optimized feature selection process has been combined with an automatic ensemble machine learning strategy, created by a set of classifiers with intrinsic and hyper-parameters learned in each market under consideration. A series of experiments performed on different real-world futures markets demonstrates the effectiveness of such an approach with regard both to the Buy-and-Hold baseline strategy and to several canonical state-of-the-art solutions. |
|
|
A Combined Entropy-based Approach for a Proactive Credit Scoring
(view on RG)
Salvatore Carta, Anselmo Ferreira, Diego Reforgiato Recupero, Marco Saia, Roberto Saia
Published in Engineering Applications of Artificial Intelligence (EAAI) Journal, Elsevier |
[Show/Hide Abstract]
Lenders, such as credit card companies and banks, use credit scores to evaluate the potential risk posed by lending money to consumers and, therefore, to mitigate losses due to bad debt. Within the financial technology domain, an ideal approach should be able to operate proactively, without the need of knowing the behavior of non-reliable users. In practice, this does not happen, because the most used techniques need to train their models with both reliable and non-reliable data in order to classify new samples. Such a scenario might be affected by the cold-start problem in datasets, where there is a scarcity or total absence of non-reliable examples, which is further worsened by the potentially unbalanced distribution of the data that reduces the classification performance. In this paper, we overcome the aforementioned issues by proposing a proactive approach, composed of a combined entropy-based method that is trained considering only reliable cases and the sample under investigation. Experiments performed on different real-world datasets show competitive performances with several state-of-the-art approaches that use the entire dataset of reliable and unreliable cases. |
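As a toy illustration only (not the paper's exact formulation), one way to score a new applicant against reliable-only data is to measure how much appending that applicant perturbs the entropy of each feature of the reliable set:

import numpy as np

def column_entropy(col, bins=10):
    counts, _ = np.histogram(col, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def entropy_score(reliable, candidate, bins=10):
    # reliable: (n_samples, n_features) matrix of known-reliable users.
    # candidate: (n_features,) sample under investigation.
    # Larger scores mean the candidate perturbs the reliable distribution more.
    diffs = []
    for j in range(reliable.shape[1]):
        before = column_entropy(reliable[:, j], bins)
        after = column_entropy(np.append(reliable[:, j], candidate[j]), bins)
        diffs.append(abs(after - before))
    return float(np.sum(diffs))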
|
|
A Discretized Extended Feature Space (DEFS) Model to Improve the Anomaly Detection Performance in Network Intrusion Detection
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Gianni Fenu, Maria Madalina Stanciu
Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2019), Vienna, Austria. |
[Show/Hide Abstract]
The unbreakable bond that exists today between devices and network connections makes the security of the latter a crucial element for our society. For this reason, in recent decades we have witnessed an exponential growth in research efforts aimed at identifying increasingly efficient techniques able to tackle this type of problem, such as the Intrusion Detection System (IDS). If, on the one hand, an IDS plays a key role, since it is designed to classify the network events as normal or intrusion, on the other hand it has to face several well-known problems that reduce its effectiveness. The most important of them is the high number of false positives related to its inability to detect event patterns that have not occurred in the past (i.e., zero-day attacks). This paper introduces a novel Discretized Extended Feature Space (DEFS) model that presents a twofold advantage: first, through a discretization process it reduces the event patterns by grouping those that are similar in terms of feature values, reducing the issues related to the classification of unknown events; second, it balances such a discretization by extending the event patterns with a series of meta-information able to well characterize them. The approach has been evaluated by using a real-world dataset (NSL-KDD) and by adopting both the in-sample/out-of-sample and time series cross-validation strategies, in order to avoid an evaluation biased by over-fitting. The experimental results show how the proposed DEFS model is able to improve the classification performance in the most challenging scenarios (unbalanced samples), with regard to the canonical state-of-the-art solutions. |
|
|
A Supervised Multi-class Multi-label Word Embeddings Approach for Toxic Comment Classification
(view on RG)
Roberto Saia, Salvatore Carta, Andrea Corriga, Riccardo Mulas, Diego Reforgiato Recupero
Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2019), Vienna, Austria. |
[Show/Hide Abstract]
Nowadays, communications made by using modern Internet-based opportunities have revolutionized the way people exchange information, allowing real-time discussions among a huge number of people. However, the advantages offered by such powerful instruments of communication are sometimes jeopardized by the dangers related to personal attacks that lead many people to leave a discussion in which they were participating. Such a problem is related to the so-called toxic comments, i.e., personal attacks, verbal bullying and, more generally, an aggressive way in which many people participate in a discussion, which brings some participants to abandon it. By exploiting the Apache Spark big data framework and several word embeddings, this paper presents an approach able to operate a multi-class multi-label classification of a discussion within a range of six classes of toxicity. We evaluate such an approach by classifying a dataset of comments taken from Wikipedia's talk pages, according to a Kaggle challenge. The experimental results prove that, through the adoption of different sets of word embeddings, our supervised approach outperforms the state-of-the-art ones that operate by exploiting the canonical bag-of-words model. In addition, the adoption of word embeddings defined in a similar scenario (i.e., discussions related to e-learning videos) proves that it is possible to improve the performance with respect to the state-of-the-art word embedding solutions. |
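A minimal sketch of the multi-class multi-label set-up with averaged word embeddings and a one-vs-rest classifier; the paper works at a larger scale on Apache Spark, so the embedding lookup, the classifier choice, and the label names below are illustrative assumptions:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def comment_vector(text, embeddings, dim=300):
    # Average the embeddings of the words of a comment that are in the vocabulary.
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_toxicity_model(comments, y, embeddings):
    # y: (n_comments, len(LABELS)) binary indicator matrix.
    X = np.vstack([comment_vector(c, embeddings) for c in comments])
    return OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)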
|
|
A Discretized Enriched Technique to Enhance Machine Learning Performance in Credit Scoring
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Gianni Fenu, Marco Saia
Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2019), Vienna, Austria. |
[Show/Hide Abstract]
Automated credit scoring tools play a crucial role in many financial environments, since they are able to perform a real-time evaluation of a user (e.g., a loan applicant) on the basis of several solvency criteria, without the aid of human operators. Such automation allows those who work and offer services in the financial area to make quick decisions with regard to different services, first and foremost those concerning consumer credit, requests for which have increased exponentially over the last years. In order to face some well-known problems related to the state-of-the-art credit scoring approaches, this paper formalizes a novel data model that we called Discretized Enriched Data (DED), which operates by transforming the original feature space in order to improve the performance of credit scoring machine learning algorithms. The idea behind the proposed DED model revolves around two processes, the first one aimed at reducing the number of feature patterns through a data discretization process, and the second one aimed at enriching the discretized data by adding several meta-features. The data discretization faces the problem of heterogeneity, which characterizes such a domain, whereas the data enrichment works on the related loss of information by adding meta-features that improve the data characterization. Our model has been evaluated in the context of real-world datasets with different sizes and levels of data unbalance, which are considered a benchmark in the credit scoring literature. The obtained results indicate that it is able to improve the performance of one of the best-performing machine learning algorithms largely used in this field, opening up new perspectives for the definition of more effective credit scoring solutions. |
|
|
Fraud Detection for E-commerce Transactions by Employing a Prudential Multiple Consensus Model
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Gianni Fenu
Published in the Journal of Information Security and Applications (JISA), Elsevier |
[Show/Hide Abstract]
More and more financial transactions through different E-commerce platforms have appeared nowadays, within the big data era, bringing plenty of opportunities but also challenges and risks of information theft and potential frauds that need to be faced. This is due to the massive use of tools such as credit cards for electronic payments, which are targeted by attackers to steal sensitive information and perform fraudulent operations. Although intelligent fraud detection systems have been developed to face the problem, they still suffer from some well-known problems due to the imbalance of the used data. Therefore, this paper proposes a novel data intelligence technique based on a Prudential Multiple Consensus model, which combines the effectiveness of several state-of-the-art classification algorithms by adopting a twofold criterion, probabilistic and majority based. The goal is to maximize the effectiveness of the model in detecting fraudulent transactions regardless of the presence of any data imbalance. Our model has been validated with a set of experiments on a large real-world dataset characterized by a high degree of data imbalance, and results show how the proposed model outperforms several state-of-the-art solutions, both in terms of ensemble models and classification approaches. |
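A toy sketch of how a probabilistic criterion and a majority criterion could be combined over a pool of classifiers; the specific combination rule below (flagging a transaction when either criterion fires) is an assumption of this sketch, not necessarily the rule adopted by the Prudential Multiple Consensus model:

import numpy as np

def multiple_consensus(classifiers, X, threshold=0.5):
    # classifiers: fitted binary models exposing predict_proba (class 1 = fraud).
    probas = np.array([c.predict_proba(X)[:, 1] for c in classifiers])
    probabilistic = probas.mean(axis=0) >= threshold      # averaged-probability criterion
    majority = (probas >= threshold).mean(axis=0) > 0.5   # majority of individual verdicts
    return (probabilistic | majority).astype(int)         # sketch-specific combination rule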
|
|
Forecasting E-commerce Products Prices by Combining an Autoregressive Integrated Moving Average (ARIMA) Model and Google Trends Data
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, Andrea Medda, and Alessio Pili
Published in Future Internet Journal, MDPI |
[Show/Hide Abstract]
E-commerce is becoming more and more the main instrument for selling goods to the mass market. This has led to a growing interest in algorithms and techniques able to predict products' future prices, since they allow us to define smart systems able to improve the quality of life by suggesting more affordable goods and services. The joint use of time series, reputation, and sentiment analysis clearly represents one important approach to this research issue. In this paper we present Price Probe, a suite of software tools developed to perform forecasting on product prices. Its primary aim is to predict the future price trend of products, generating a customized forecast through the exploitation of the autoregressive integrated moving average (ARIMA) model. We experimented with the effectiveness of the proposed approach on one of the biggest E-commerce infrastructures in the world: Amazon. We used specific APIs and dedicated crawlers to extract and collect information about products and their related prices over time and, moreover, we extracted information from social media and Google Trends that we used as exogenous features for the ARIMA model. We fine-tuned ARIMA's parameters, tried different combinations of the exogenous features, and noticed through experimental analysis that the presence of Google Trends information significantly improved the predictions. |
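The core of such a forecaster can be sketched with statsmodels, where the Google Trends series enters the ARIMA model as an exogenous regressor; the order and the variable names are illustrative assumptions:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_price(prices: pd.Series, trends: pd.Series,
                   future_trends: pd.Series, order=(1, 1, 1)):
    # prices/trends: aligned historical series; future_trends: the next values of the
    # exogenous series, one per step of the forecast horizon.
    fitted = ARIMA(endog=prices, exog=trends, order=order).fit()
    return fitted.forecast(steps=len(future_trends), exog=future_trends)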
|
|
Internet of Entities (IoE): a Blockchain-based Distributed Paradigm for Data Exchange Between Wireless-based Devices
(view on RG)
Roberto Saia, Salvatore Carta, Diego Reforgiato Recupero, and Gianni Fenu
Proceedings of the 8th International Conference on Sensor Networks (SENSORNETS-2019), Prague, Czech Republic |
[Show/Hide Abstract]
The exponential growth of wireless-based solutions, such as those related to mobile smart devices (e.g., smartphones and tablets) and Internet of Things (IoT) devices, has led to countless advantages in every area of our society. Such a scenario has transformed the world of a few decades ago, dominated by latency, into a new world based on an efficient real-time interaction paradigm. Recently, cryptocurrencies have contributed to this technological revolution, the fulcrum of which is a decentralization model and a certification function offered by the so-called blockchain infrastructure, which makes it possible to certify financial transactions anonymously. This paper aims to indicate a possible approach able to exploit this challenging scenario synergistically by introducing a novel blockchain-based distributed paradigm for data exchange between wireless-based devices, defined as the Internet of Entities (IoE). It is based on two core elements with interchangeable roles, entities and trackers, which can be implemented by using existing infrastructures and devices, such as those related to smartphones, tablets, and IoT systems. The employment of the blockchain-based distributed paradigm allows our approach to ensure the anonymization and immutability of the involved data, which is key in many scenarios and domains (e.g., financial applications, health and legal applications dealing with personal and sensitive data), requirements more and more sought after in recent innovations. The possibility of exchanging data between a huge number of devices gives rise to a novel and widely exploitable data environment, whose applications are possible in different domains, such as, for instance, security, eHealth, and smart cities. |
|
|
Evaluating the Benefits of Using Proactive Transformed-domain-based Techniques in Fraud Detection Tasks
(view on RG)
Roberto Saia, Salvatore Carta
Published in Future Generation Computer Systems (FGCS) Journal, Elsevier |
[Show/Hide Abstract]
The exponential growth in the number of E-commerce transactions indicates a radical change in the way people buy and sell goods and services: a new opportunity offered by a huge global market, where they may choose sellers or buyers on the basis of multiple criteria (e.g., economic, logistical, ethical, sustainability, etc.), without being forced to use the traditional brick-and-mortar criterion. If, on the one hand, such a scenario offers enormous control to people, both at the private and corporate level, allowing them to filter their needs by adopting a large range of criteria, on the other hand, it has contributed to the growth of fraud cases related to the involved electronic instruments of payment, such as credit cards. Big Data Information Security for Sustainability is a research branch aimed at facing these issues in relation to their potential implications in the field of sustainability, proposing effective solutions to design safe environments in which people can operate, exploiting the benefits related to new technologies. Fraud detection systems are a significant example of such solutions, although the techniques adopted by them are typically based on retroactive strategies, which are incapable of preventing fraudulent events. In this perspective, this paper aims to investigate the benefits related to the adoption of proactive fraud detection strategies, instead of the canonical retroactive ones, theorizing those solutions that can lead toward practical effective implementations. We evaluate two previously experimented novel proactive strategies, one based on the Fourier transform and one based on the Wavelet transform, which are used to move the data (i.e., financial transactions) into a new domain, where they are analyzed and an evaluation model is defined. Such strategies allow a fraud detection system to operate by using a proactive approach, since they do not exploit previous fraudulent transactions, overcoming some important problems that reduce the effectiveness of the canonical retroactive state-of-the-art solutions. Potential benefits and limitations of the proposed proactive approach have been evaluated in a real-world credit card fraud detection scenario, by comparing its performance to that of one of the most widely used and best-performing retroactive state-of-the-art approaches (i.e., Random Forests). |
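To give the flavour of the transformed-domain idea (illustrative only; the paper defines its own Fourier- and Wavelet-based evaluation models), a spectral reference can be built from legitimate transactions alone and new transactions scored by their distance from it:

import numpy as np

def spectral_profile(transactions):
    # transactions: (n, n_features) numeric matrix of legitimate transactions only.
    spectra = np.abs(np.fft.rfft(transactions, axis=1))  # magnitude spectrum per transaction
    return spectra.mean(axis=0)

def fraud_score(transaction, reference):
    # Larger spectral distance from the legitimate profile = more suspicious.
    spectrum = np.abs(np.fft.rfft(transaction))
    return float(np.linalg.norm(spectrum - reference))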
|
|
A Probabilistic-driven Ensemble Approach to Perform Event Classification in Intrusion Detection System
(view on RG)
Roberto Saia, Diego Reforgiato Recupero, Salvatore Carta
Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2018), Seville, Spain. |
[Show/Hide Abstract]
Nowadays, it is clear how network services represent a widespread element, absolutely essential for each category of users, professional and non-professional ones. Such a scenario needs constant research activity aimed at ensuring the security of the involved services, so as to prevent any fraudulent exploitation of the related network resources. This is not a simple task, because day by day new threats arise, forcing the research community to face them by developing new specific countermeasures. The Intrusion Detection System (IDS) plays a central role in this scenario, since its task is the detection of intrusion attempts through an evaluation model designed to classify each new network event as normal or intrusion. This paper introduces a Probabilistic-Driven Ensemble (PDE) approach that operates by using several classification algorithms, whose effectiveness has been improved on the basis of a probabilistic criterion. A series of experiments, performed by using real-world data, shows how such an approach outperforms the state-of-the-art ones in terms of specificity, proving its better capability to detect intrusion events compared to the canonical solutions. |
|
|
Recommending Friends by Identifying Latent Similarities in Social Environments
(view on RG)
Roberto Saia, Luca Piras, Salvatore Carta
Proceedings of the 40th European Conference on Information Retrieval (ECIR-2018), Social Aspects in Personalization and Search (SoAPS-2018) Workshop, Grenoble, France. |
[Show/Hide Abstract]
When browsing social media, we get in touch with great amounts of content that, if well exploited, can provide valuable knowledge about our preferences. On the one hand, the opportunities offered by the interaction of users with social content are great, since such interactions represent implicit feedback that the users provide on what they like. On the other hand, the resulting information is very sparse, since users do not interact in any form with a lot of content (e.g., by liking, commenting, or clicking on an item). Therefore, finding similarities between users in order to recommend friends with similar preferences is a challenging but important task. This paper introduces a novel technique able to discover the latent spaces shared between users by moving the data analysis into the frequency domain, where the spectral patterns of the users are compared. By identifying non-explicit similarities between users, friend recommendations can be performed. |
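The general idea of comparing users through their spectral patterns, rather than their raw interaction vectors, can be sketched as follows; the binary interaction vectors, the use of the FFT magnitude and the cosine scoring are illustrative assumptions, not the paper's exact procedure.

    # Minimal sketch: users are compared through the spectral pattern of their
    # item-interaction vectors instead of the raw vectors.
    import numpy as np

    def spectral_pattern(interactions):
        # Magnitude spectrum of a user's binary item-interaction vector.
        return np.abs(np.fft.rfft(np.asarray(interactions, dtype=float)))

    def spectral_similarity(u, v):
        a, b = spectral_pattern(u), spectral_pattern(v)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Toy catalogue of 8 items; 1 = the user interacted with the item.
    alice = [1, 0, 1, 0, 1, 0, 1, 0]
    bob   = [0, 1, 0, 1, 0, 1, 0, 1]   # disjoint items, but the same periodic pattern
    carol = [1, 1, 0, 0, 1, 1, 0, 0]

    print(spectral_similarity(alice, bob))    # high: shared latent pattern
    print(spectral_similarity(alice, carol))  # lower: different spectral shape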
|
|
Proactivity or Retroactivity? Evaluating the Benefits of Using Proactive Transformed-domain-based Techniques in Fraud Detection Tasks
(view on RG)
Roberto Saia, Salvatore Carta
Proceedings of the High Quality Journal Forum - 3rd International Conference on Internet of Things, Big Data and Security (IoTBDS-2018), Funchal, Madeira, Portugal |
[Show/Hide Abstract]
The exponential growth in the number of E-commerce transactions indicates a radical change in the way people buy and sell goods and services, a new opportunity offered by a huge global market, where they can choose their sellers or buyers on the basis of multiple criteria (e.g., economic, logistical, ethical, sustainability, etc.), instead of being forced to rely on the traditional brick-and-mortar criterion. If, on the one hand, such a scenario gives people enormous control, both at the private and corporate level, allowing them to filter their needs by adopting a wide range of criteria, on the other hand, it has contributed to the growth of fraud cases related to the electronic instruments of payment involved, such as credit cards. Fraud detection systems aim to face this problem, although the techniques they adopt are typically based on retroactive strategies, which are incapable of preventing fraudulent events. This work investigates the benefits of adopting proactive fraud detection strategies instead of the canonical retroactive ones, evaluating two novel proactive strategies experimented in previous work, one based on the Fourier transform and one based on the Wavelet transform. The financial transaction data are moved into a new domain, where they are analyzed in order to define an evaluation model. Such strategies allow us to operate proactively, since they do not use previous fraudulent examples, overcoming some important problems that reduce the effectiveness of the canonical retroactive state-of-the-art solutions. The advantages of these strategies have been evaluated in a real-world credit card fraud detection scenario, by comparing their performance to that of one of the most used and best performing retroactive state-of-the-art approaches. |
|
|
A Wavelet-based Data Analysis to Credit Scoring
(view on RG)
Roberto Saia, Salvatore Carta, and Gianni Fenu
Proceedings of the 2nd International Conference on Digital Signal Processing (ICDSP-2018), Tokyo, Japan |
[Show/Hide Abstract]
Nowadays, the dramatic growth in consumer credit has made ineffective the methods based on human intervention that were used to assess the potential solvency of loan applicants. For this reason, the development of approaches able to automate this operation represents an active and important research area named Credit Scoring. In such a scenario, the design of effective approaches represents a hard challenge, due to a series of well-known problems, such as, for instance, data imbalance, data heterogeneity, and the cold start. The Centroid wavelet-based approach proposed in this paper faces these issues by moving the data analysis from its canonical domain to a new time-frequency one, where the evaluation is performed through three different metrics of similarity. Its main objective is to achieve a better characterization of the loan applicants on the basis of the information previously gathered by the Credit Scoring system. The performed experiments show that such an approach outperforms the state-of-the-art solutions. |
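A rough sketch of a centroid-style, wavelet-domain evaluation is shown below, using the PyWavelets library; the Haar wavelet, the single cosine metric (the paper combines three similarity metrics) and the toy features are illustrative assumptions.

    # Minimal sketch: past non-default applicants are mapped to the wavelet
    # domain, averaged into a centroid, and a new applicant is scored by its
    # similarity to that centroid.
    import numpy as np
    import pywt  # PyWavelets

    def wavelet_features(x):
        # Concatenate approximation and detail coefficients of a 1-level DWT.
        cA, cD = pywt.dwt(np.asarray(x, dtype=float), 'haar')
        return np.concatenate([cA, cD])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Toy applicant feature vectors (e.g., income, exposure, age, past credits).
    non_default = [[30, 5, 40, 2], [32, 4, 38, 3], [28, 6, 42, 2]]
    centroid = np.mean([wavelet_features(v) for v in non_default], axis=0)

    new_applicant = [31, 5, 39, 2]
    print(cosine(wavelet_features(new_applicant), centroid))  # close to 1: similar profile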
|
|
Unbalanced Data Classification in Fraud Detection by Introducing a Multidimensional Space Analysis
(view on RG)
Roberto Saia
Proceedings of the 3rd International Conference on Internet of Things, Big Data and Security (IoTBDS-2018), Funchal, Madeira, Portugal |
[Show/Hide Abstract]
The problem of fraud is becoming increasingly important in this E-commerce age, where an enormous number of financial transactions are carried out by using electronic instruments of payment such as credit cards. Given the impossibility of adopting human-driven solutions, due to the huge number of operations involved, the only possible way to face this kind of problem is the adoption of automatic approaches able to discern legitimate transactions from fraudulent ones. For this reason, today the development of techniques capable of carrying out this task efficiently represents a very active research field that involves a large number of researchers around the world. Unfortunately, this is not an easy task, since the definition of effective fraud detection approaches is made difficult by a series of well-known problems, the most important of which is the non-balanced class distribution of data, which leads to a significant reduction in the performance of machine learning approaches. This limitation is addressed by the approach proposed in this paper, which exploits three different metrics of similarity in order to define a three-dimensional space of evaluation. Its main objective is a better characterization of the financial transactions in terms of the two possible target classes (legitimate or fraudulent), facing the information asymmetry that gives rise to the problem previously exposed. A series of experiments conducted by using real-world data with different sizes and imbalance levels demonstrates the effectiveness of the proposed approach with regard to the state-of-the-art solutions. |
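The projection of a transaction onto a three-dimensional space of similarity scores can be sketched as follows; the three metrics chosen here (cosine, Pearson correlation, inverse Euclidean distance) and the legitimate-centroid reference are illustrative assumptions, not necessarily those used in the paper.

    # Minimal sketch: a transaction becomes a point in a 3D space whose axes are
    # three similarity scores computed against the centroid of legitimate data.
    import numpy as np

    def similarity_point(x, centroid):
        x, c = np.asarray(x, dtype=float), np.asarray(centroid, dtype=float)
        cos = np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
        pearson = np.corrcoef(x, c)[0, 1]
        inv_euclid = 1.0 / (1.0 + np.linalg.norm(x - c))
        return np.array([cos, pearson, inv_euclid])  # a point in the 3D space

    legit = np.array([[10.0, 2.0, 5.0], [11.0, 2.5, 5.5], [9.5, 1.8, 4.8]])
    centroid = legit.mean(axis=0)

    print(similarity_point([10.2, 2.1, 5.1], centroid))  # all three scores high
    print(similarity_point([2.0, 50.0, 1.0], centroid))  # all three scores drop for a dissimilar transaction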
|
|
Dissecting Ponzi schemes on Ethereum: identification, analysis, and impact
(view on RG)
Roberto Saia, Massimo Bartoletti, Salvatore Carta, Tiziana Cimoli
Proceedings of the P2P Financial Systems International Workshop (P2PFISY-2017), London, United Kingdom |
[Show/Hide Abstract]
Ponzi schemes are financial frauds where, under the promise of high profits, users invest their money, recovering their investment and interest only if enough users after them continue to invest. Originated in the offline world 150 years ago, Ponzi schemes have since migrated to the digital world, appearing first on the Web and more recently on cryptocurrencies like Bitcoin. Smart contract platforms like Ethereum have provided a new opportunity for scammers, who now have the possibility of creating "trustworthy" frauds that still make users lose money, but at least are guaranteed to execute "correctly". We present a comprehensive survey of Ponzi schemes on Ethereum, analysing their behaviour and their impact from various viewpoints. Perhaps surprisingly, we identify a remarkably high number of Ponzi schemes, despite the fact that the hosting platform has been operating for less than two years. |
|
|
A Discrete Wavelet Transform Approach to Fraud Detection
(view on RG)
Roberto Saia
Proceedings of the 11th International Conference on Network and System Security (NSS-2017), International Workshop on Security Measurements of Cyber Networks (SMCN-2017), Helsinki, Finland |
[Show/Hide Abstract]
The exponential growth in the number of operations carried out in the e-commerce environment is directly related to the growth in the number of operations performed through credit cards, since practically all commercial operators allow their customers to make payments by using them. Such a scenario leads to a high level of risk related to the potential fraudulent activities that fraudsters can perform by exploiting this powerful instrument of payment illegitimately. A large number of state-of-the-art approaches have been designed to face the fraud detection problem, adopting different solutions, although some common issues reduce their effectiveness. The most important of them are the imbalanced distribution and the heterogeneity of data, i.e., there is a large number of legitimate cases and a small number of fraudulent ones, and the transactions are characterized by large variations in the feature values. This paper presents a novel fraud detection approach based on the Discrete Wavelet Transform, which is exploited in order to define an evaluation model able to address the aforementioned issues. This goal is reached by using only legitimate transactions during the model definition process, thanks to the more stable representation of data offered by the new domain, which is less influenced by data variation. The experimental results show that our approach achieves performance comparable to that of one of the best approaches at the state of the art, such as random forests. The relevant aspect of this result is that our model is trained without using previous fraudulent cases, adopting a proactive strategy, with the positive side effect of solving the cold-start problem. |
|
|
Evaluating Credit Card Transactions in the Frequency Domain for a Proactive Fraud Detection Approach (view on RG)
Roberto Saia and Salvatore Carta
Proceedings of the 14th International Conference on Security and Cryptography (SECRYPT-2017), Madrid, Spain |
[Show/Hide Abstract]
The massive increase in financial transactions made in the e-commerce field has led to an equally massive increase in the risks related to fraudulent activities. It is a problem directly correlated with the use of credit cards, considering that almost all the operators that offer goods or services in the e-commerce space allow their customers to use them for making payments. The main disadvantage of these powerful methods of payment is that they can be used not only by legitimate users (cardholders) but also by fraudsters. The literature reports a considerable number of techniques designed to face this problem, although their effectiveness is jeopardized by a series of common problems, such as the imbalanced distribution and the heterogeneity of the involved data. The approach presented in this paper takes advantage of a novel evaluation criterion based on the analysis, in the frequency domain, of the spectral pattern of the data. Such a strategy allows us to obtain a more stable model for representing information, with respect to the canonical ones, reducing both the imbalance and the heterogeneity problems of the data. Experiments show that the performance of the proposed approach is comparable to that of its state-of-the-art competitor, although the model definition does not use any previous fraudulent case, adopting a proactive strategy able to counteract the well-known cold-start issue. |
|
|
A Fourier Spectral Pattern Analysis to Design Credit Scoring Models (view on RG)
Roberto Saia and Salvatore Carta
Proceedings of the International Conference on Internet of Things and Machine Learning (IML-2017), Liverpool, United Kingdom |
[Show/Hide Abstract]
The increase in consumer credit has made it necessary to research more and more effective models for credit scoring. Such models are usually trained by using past loan applications, evaluating the new ones on the basis of certain criteria. Although the state of the art offers several different approaches for their definition, this process represents a hard challenge for several reasons. The most important ones are the data imbalance between default and non-default cases, which reduces the effectiveness of almost all techniques, and the data heterogeneity, which makes it difficult to define a model able to effectively evaluate all the new loan applications. The approach proposed in this paper faces the aforementioned problems by moving the evaluation process from the canonical time domain to the frequency domain, using a model based on the past non-default loan applications. This allows us to overcome the data imbalance problem by exploiting only one class of data, also defining a model that is less influenced by data heterogeneity. The performed experiments show interesting results, since the proposed approach achieves performance close to or better than that of one of the best state-of-the-art approaches for credit scoring, such as random forests, although it operates in a proactive way, only exploiting the past non-default cases. |
|
|
A Frequency-domain-based Pattern Mining for Credit Card Fraud Detection (view on RG)
Roberto Saia and Salvatore Carta
Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security (IoTBDS-2017), Porto, Portugal |
[Show/Hide Abstract]
Nowadays, the prevention of credit card fraud represents a crucial task, since almost all the operators in the E-commerce environment accept payments made through credit cards, aware that some of them could be fraudulent. The development of approaches able to face this problem effectively represents a hard challenge due to several issues. The most important among them are the heterogeneity and the imbalanced class distribution of data, problems that reduce the effectiveness of the most used techniques, making it difficult to define effective models able to evaluate the new transactions. This paper proposes a new strategy to face the aforementioned problems, based on a model defined by using the Discrete Fourier Transform conversion in order to exploit frequency patterns, instead of the canonical ones, in the evaluation process. Such an approach presents some advantages, since it allows us to face the imbalanced class distribution and the cold-start issues by involving only the past legitimate transactions, reducing the data heterogeneity problem thanks to the frequency-domain-based data representation, which is less influenced by data variation. A practical implementation of the proposed approach is given by presenting an algorithm able to classify a new transaction as reliable or unreliable on the basis of the aforementioned strategy. |
|
|
Semantics-Aware Content-Based Recommender Systems: Design and Architecture Guidelines (view on RG)
Roberto Saia, Ludovico Boratto, Salvatore Carta, and Gianni Fenu
Accepted for publication in Neurocomputing (NEUCOM) Journal, Elsevier. |
[Show/Hide Abstract]
Recommender systems suggest items by exploiting the interactions of the users with the system (e.g., the choice of the movies to recommend to a user is based on those she previously evaluated). In particular, content-based systems suggest items whose content is similar to that of the items evaluated by a user. An emerging trend in content-based recommender systems is the consideration of the semantics behind an item description, in order to disambiguate the words in the description and improve the recommendation accuracy. However, different phenomena, such as changes in the preferences of a user over time or the use of her account by third parties, might harm the accuracy, since the system would consider items that do not reflect the actual user preferences. Starting from an analysis of the literature and of an architecture proposed in a recent survey, in this paper we first highlight the current limits in this research area, then we propose design guidelines and an improved architecture to build semantics-aware content-based recommendations. |
|
|
An Entropy Based Algorithm for Credit Scoring (view on RG)
Roberto Saia and Salvatore Carta
Proceedings of the 10th International Conference on Research and Practical Issues of Enterprise Information Systems (CONFENIS-2016), Vienna, Austria. Published in Lecture Notes in Business Information Processing (LNBIP), Springer. |
[Show/Hide Abstract]
The demand for effective credit scoring models has been rising in recent decades, due to the growth of consumer lending. Their objective is to divide loan applicants into two classes, reliable or non-reliable, on the basis of the available information. Linear discriminant analysis is one of the most common techniques used to define these models, although this simple parametric statistical method does not overcome some problems, the most important of which is the imbalanced distribution of data by classes. This happens because the number of default cases is much smaller than that of non-default ones, a scenario that reduces the effectiveness of machine learning approaches, e.g., neural networks and random forests. The Difference in Maximum Entropy (DME) approach proposed in this paper leads to two interesting results: on the one hand, it evaluates new loan applications in terms of the maximum entropy difference between their features and those of the non-default past cases, using only these last cases for model training and thus overcoming the imbalanced learning issue; on the other hand, it operates proactively, overcoming the cold-start problem. Our model has been evaluated by using two real-world data sets with an imbalanced distribution of data, comparing its performance to that of the best performing state-of-the-art approach: random forests. |
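A loose illustration of an entropy-difference score is sketched below: the new application is scored by how much it perturbs the entropy of the non-default history. The per-feature histogram discretization and the aggregation by maximum are illustrative assumptions and do not reproduce the exact DME formulation.

    # Minimal sketch: a small entropy difference suggests the applicant fits the
    # known reliable (non-default) profile; only non-default cases are used.
    import numpy as np

    def shannon_entropy(values, bins=5):
        counts, _ = np.histogram(values, bins=bins)
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def entropy_difference(history, new_row):
        # Per-feature difference of entropy with and without the new application,
        # aggregated with the maximum over the features.
        history = np.asarray(history, dtype=float)
        new_row = np.asarray(new_row, dtype=float)
        diffs = []
        for j in range(history.shape[1]):
            base = shannon_entropy(history[:, j])
            extended = shannon_entropy(np.append(history[:, j], new_row[j]))
            diffs.append(abs(extended - base))
        return max(diffs)

    non_default = [[30, 5], [32, 4], [28, 6], [31, 5], [29, 5]]
    print(entropy_difference(non_default, [30, 5]))    # small: fits the reliable profile
    print(entropy_difference(non_default, [90, 40]))   # larger: deviates from the history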
|
|
A Linear-dependence-based Approach to Design Proactive Credit Scoring Models (view on RG)
Roberto Saia and Salvatore Carta
Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2016), Porto, Portugal |
[Show/Hide Abstract]
The main aim of a credit scoring model is the classification of loan applicants into two classes, reliable and non-reliable customers, on the basis of their potential capability to keep up with their repayments. Nowadays, credit scoring models are increasingly in demand, due to the growth of consumer credit. Such models are usually designed on the basis of past loan applications and used to evaluate the new ones. Their definition represents a hard challenge for different reasons, the most important of which is the imbalanced class distribution of data (i.e., the number of default cases is much smaller than that of the non-default cases), which reduces the effectiveness of the most widely used approaches (e.g., neural networks, random forests, and so on). The Linear Dependence Based (LDB) approach proposed in this paper offers a twofold advantage: it evaluates a new loan application on the basis of the linear dependence of its vector representation in the context of a matrix composed of the vector representations of the non-default application history, thus using only one class of data and overcoming the imbalanced class distribution issue; furthermore, it does not exploit the defaulting loans, allowing us to operate in a proactive manner and addressing also the cold-start problem. We validate our approach on two real-world data sets characterized by a strongly unbalanced distribution of data, by comparing its performance with that of one of the best state-of-the-art approaches: random forests. |
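The linear-dependence idea can be sketched with a least-squares residual: the closer the new application vector is to the span of the non-default history, the more "dependent" (and reliable) it is considered; the residual measure and the toy data are illustrative assumptions.

    # Minimal sketch: project the new application onto the span of the
    # non-default history and use the relative residual as a (non-)reliability score.
    import numpy as np

    def dependence_residual(history, new_vector):
        # Rows of `history` are past non-default applications; the new vector is
        # projected onto their span (0 residual = exact linear combination).
        A = np.asarray(history, dtype=float).T          # columns = history vectors
        b = np.asarray(new_vector, dtype=float)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        return float(np.linalg.norm(A @ w - b) / (np.linalg.norm(b) + 1e-12))

    # Two past non-default applications described by four features each
    # (with fewer history vectors than features, the span is a proper subspace).
    non_default = [[30, 5, 40, 2], [32, 4, 38, 3]]
    print(dependence_residual(non_default, [31, 4.5, 39, 2.5]))   # ~0: inside the span
    print(dependence_residual(non_default, [31, 4.5, 39, 40]))    # clearly > 0: atypical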
|
|
Exploiting a Determinant-based Metric to Evaluate a Word-embeddings Matrix of Items (view on RG)
Roberto Saia, Ludovico Boratto, Salvatore Carta, and Gianni Fenu
Proceedings of the IEEE International Conference on Data Mining series (ICDM), Workshop on Semantics-Enabled Recommender System (SERecSys 2016), Barcelona, Spain |
[Show/Hide Abstract]
In order to generate effective results, it is essential for a recommender system to model the information about the user interests (user profiles). A profile usually contains preferences that reflect the recommendation technique, so collaborative systems represent a user with the ratings given to items, while content-based approaches assign a score to semantic/text-based features of the evaluated items. Even though semantic technologies are rapidly evolving and word embeddings (i.e., vector representations of the words in a corpus) are effective in numerous information filtering tasks, at the moment collaborative approaches (such as SVD) still generate more accurate recommendations. However, this might happen because, by employing classic profiles in form of vectors that collect all the preferences of a user, the power of word embeddings at modeling texts could be affected. In this paper we represent a profile as a matrix of word-embedding vectors of the items a user evaluated, and present a novel determinant-based metric that measures the similarity between an unevaluated item and those in the matrix-based user profile, in order to generate effective content-based recommendations. Experiments performed on three datasets show the capability of our approach to perform a better ranking of the items w.r.t. collaborative filtering, both when compared to a latent-factor-based approach (SVD) and to a classic neighborhood user-based system. |
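One way to picture a determinant-based similarity is through the Gram-matrix volume of the profile with and without the candidate item; this specific formulation is an illustrative assumption, not necessarily the metric defined in the paper.

    # Minimal sketch: append the candidate embedding to the profile matrix and
    # compare the Gram determinant (squared spanned volume) before and after.
    # A small growth in volume means the candidate adds little new direction,
    # i.e. it is similar to the profile.
    import numpy as np

    def gram_volume(rows):
        M = np.asarray(rows, dtype=float)
        return float(np.sqrt(max(np.linalg.det(M @ M.T), 0.0)))

    def determinant_score(profile, candidate):
        base = gram_volume(profile)
        extended = gram_volume(list(profile) + [candidate])
        # Lower added volume per unit of base volume -> higher similarity.
        return extended / (base + 1e-12)

    # Toy 3-dimensional "word embeddings" of items the user evaluated.
    profile = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1]]
    print(determinant_score(profile, [0.95, 0.25, 0.05]))  # ~0: lies in the profile span
    print(determinant_score(profile, [0.0, 0.0, 1.0]))     # larger: a very different item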
|
|
Representing Items as Word-Embedding Vectors and Generating Recommendations by Measuring their Linear Independence (view on RG)
Roberto Saia, Ludovico Boratto, Salvatore Carta, and Gianni Fenu
Proceedings of the ACM Recommender Systems conference (RecSys-2016), Boston, MA, USA |
[Show/Hide Abstract]
In order to generate effective results, it is essential for a recommender system to model the information about the user interests in a profile. Even though word embeddings (i.e., vector representations of textual descriptions) have proven to be effective in many contexts, a content-based recommendation approach that employs them is still less effective than collaborative strategies (e.g., SVD). In order to overcome this issue, this paper introduces a novel criterion to evaluate the word-embedding representation of the items a user evaluated. The proposed approach defines a vector space in which the similarity between an unevaluated item and those in a user profile is measured in terms of linear independence. Experiments show its effectiveness to perform a better ranking of the items, w.r.t. collaborative filtering, both when compared to a latent-factor-based approach (SVD) and to a classic neighborhood user-based system. |
|
|
Improving the Accuracy of Latent-space-based Recommender Systems by Introducing a Cut-off Criterion (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the Workshop on Engineering Computer-Human Interaction in Recommender Systems (EnCHIReS), Brussels, Belgium |
[Show/Hide Abstract]
Recommender systems filter the items a user did not evaluate, in order to acquire knowledge on those that might be suggested to her. To accomplish this objective, they employ the preferences the user expressed in the form of explicit ratings or of implicit values collected through the browsing of the items. However, users have different rating behaviors (e.g., users might use just the ends of the rating scale to express whether they loved or hated an item), while the system assumes that the users employ the whole scale. Over the last few years, Singular Value Decomposition (SVD) became the most popular and accurate form of recommendation, because of its capability of working with sparse data, exploiting latent features. This paper presents an approach that pre-filters the items a user evaluated and removes those she did not like. In other words, by analyzing a user's rating behavior and the rating scale she used, we capture and employ in the recommendation process only the items she really liked. Experimental results show that our form of filtering leads to more accurate recommendations. |
|
|
Using Neural Word Embeddings to Model User Behavior and Detect User Segments (view on RG)
Roberto Saia, Ludovico Boratto, Salvatore Carta, and Gianni Fenu
Published in Knowledge-Based Systems (KBS) Journal, Elsevier |
[Show/Hide Abstract]
Modeling user behavior to detect segments of users to target and to whom address ads (behavioral targeting) is a widely studied problem in the literature. Various sources of data are mined and modeled in order to detect these segments, such as the queries issued by the users. In this paper, we first show the need for a user segmentation system to employ reliable user preferences, since nearly half of the time users reformulate their queries in order to satisfy their information need. Then we propose a method that analyzes the descriptions of the items positively evaluated by the users, and extracts a vector representation of the words in these descriptions (word embeddings). Since it is widely known that users tend to choose items of the same categories, our approach is designed to avoid the so-called preference stability, which would associate the users to trivial segments. Moreover, we make sure that the interpretability of the generated segments is a characteristic offered to the advertisers who will use these segments. We performed different sets of experiments on a large real-world dataset, which validated our approach and showed its capability to produce effective segments. |
|
|
Binary Sieves: Toward a Semantic Approach to User Segmentation for Behavioral Targeting (view on RG)
Roberto Saia, Ludovico Boratto, Salvatore Carta, and Gianni Fenu
Published in Future Generation Computer Systems (FGCS) Journal, Elsevier |
[Show/Hide Abstract]
Behavioral targeting is the process of addressing ads to a specific set of users. The set of target users is detected from a segmentation of the user set, based on their interactions with the website (pages visited, items purchased, etc.). Recently, in order to improve the segmentation process, the semantics behind the user behavior has been exploited, by analyzing the queries issued by the users. However, nearly half of the time users need to reformulate their queries in order to satisfy their information need. In this paper, we tackle the problem of semantic behavioral targeting considering reliable user preferences, by performing a semantic analysis on the descriptions of the items positively rated by the users. We also consider widely known problems, such as the interpretability of a segment, and the fact that user preferences are usually stable over time, which could lead to a trivial segmentation. In order to overcome these issues, our approach allows an advertiser to automatically extract a user segment by specifying the interests that she/he wants to target, by means of a novel boolean algebra; the segments are composed of users whose evaluated items are semantically related to these interests. This leads to interpretable and non-trivial segments, built by using reliable information. Experimental results confirm the effectiveness of our approach at producing user segments. |
|
|
A semantic approach to remove incoherent items from a user profile and improve the accuracy of a recommender system (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Published in Journal of Intelligent Information Systems (JIIS) Journal, Springer |
[Show/Hide Abstract]
Recommender systems usually suggest items by exploiting all the previous interactions of the users with a system (e.g., in order to decide the movies to recommend to a user, all the movies she previously purchased are considered). This canonical approach sometimes could lead to wrong results due to several factors, such as a change in user preferences over time, or the use of her account by third parties. This kind of incoherence in the user profiles defines a lower bound on the error the recommender systems may achieve when they generate suggestions for a user, an aspect known in the literature as the magic barrier. This paper proposes a novel dynamic coherence-based approach to define the user profile used in the recommendation process. The main aim is to identify and remove from the previously evaluated items those not semantically adherent to the others, in order to make a user profile as close as possible to the user's real preferences, solving the aforementioned problems. Moreover, reshaping the user profile in such a way leads to great advantages in terms of computational complexity, since the number of items considered during the recommendation process is highly reduced. The performed experiments show the effectiveness of our approach in removing incoherent items from a user profile, increasing the recommendation accuracy. |
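The coherence-based pruning can be sketched as follows, with plain item vectors standing in for the semantic representation used in the paper; the cosine measure and the threshold are illustrative assumptions.

    # Minimal sketch: each profile item gets a coherence score (average similarity
    # to the other items) and items below a threshold are removed.
    import numpy as np

    def prune_incoherent(profile_vectors, threshold=0.5):
        V = np.asarray(profile_vectors, dtype=float)
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        U = V / (norms + 1e-12)
        sims = U @ U.T                                   # pairwise cosine similarities
        n = len(V)
        coherence = (sims.sum(axis=1) - 1.0) / (n - 1)   # exclude self-similarity
        keep = coherence >= threshold
        return [i for i in range(n) if keep[i]]

    # Toy profile: three coherent "action movie" items and one unrelated item.
    profile = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05], [0.0, 0.1, 0.95]]
    print(prune_incoherent(profile))   # expected: [0, 1, 2]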
|
|
A Class-based Strategy to User Behavior Modeling (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Published in Studies in Computational Intelligence (SCI) Journal, Springer |
[Show/Hide Abstract]
A recommender system is a tool employed to filter the huge amounts of data that companies have to deal with, and to produce effective suggestions for the users. The estimation of the interest of a user toward an item, however, is usually performed at the level of a single item, i.e., for each item not evaluated by a user, canonical approaches look for the rating given by similar users for that item, or for an item with similar content. Such an approach leads to the so-called overspecialization/serendipity problem, in which the recommended items are trivial and users do not come across surprising items. This work first shows that the user preferences are actually distributed over a small set of classes of items, leading the recommended items to be too similar to the ones already evaluated; then we propose a novel model, named Class Path Information (CPI), able to represent the current and future preferences of the users in terms of a ranked set of classes of items. The proposed approach is based on a semantic analysis of the items evaluated by the users, in order to extend the ground truth and infer the future preferences of the users. The performed experiments show that our approach, by including in the CPI model the same classes predicted by a state-of-the-art recommender system, is able to accurately model the user preferences in terms of classes, instead of in terms of single items, making it possible to recommend non-trivial items. |
|
|
A Proactive Time-frame Convolution Vector (TFCV) Technique to Detect Frauds Attempts in E-commerce Transactions (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the International Conference on Communication and Information Processing (ICCIP), Tokyo, Japan. Published in International Journal of e-Education, e-Business, e-Management and e-Learning (IJEEEE) |
[Show/Hide Abstract]
Any business that carries out activities on the Internet and accepts payments through debit or credit cards also implicitly accepts all the risks related to them, such as the possibility that some transactions are fraudulent. Although these risks can lead to significant economic losses, nearly all companies continue to use these powerful instruments of payment, since the benefits derived from them outweigh the potential risks involved. The design of effective strategies able to face this problem is however particularly challenging, due to several factors, such as the heterogeneity and the non-stationary distribution of the data stream, as well as the presence of an imbalanced class distribution. The problem is further complicated by the scarcity of public datasets, due to confidentiality issues, which does not allow researchers to verify new strategies in many data contexts. Differently from almost all strategies at the state of the art, instead of producing a unique model based on the past transactions of the users, in this paper we present an approach that generates a set of models (behavioral patterns) that allow us to evaluate a new transaction by considering the behavior of the user in different temporal frames of her/his history. The size of the temporal frames and the number of levels (granularity) used to discretize the values in the behavioral patterns can be adjusted in order to adapt the system sensitivity to the operating environment. Considering that our models do not need to be trained with both the past legitimate and fraudulent transactions of a user, since they use only the legitimate ones, we can operate in a proactive manner, detecting fraudulent transactions that have never occurred in the past. Such a way of proceeding also overcomes the data imbalance problem that afflicts the machine learning approaches at the state of the art. The evaluation of the proposed approach is performed by comparing it with one of the best performing approaches at the state of the art, such as Random Forests, using a real-world credit card dataset. |
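The frame-based behavioral patterns can be sketched as follows; frame size, number of levels and the acceptance rule are illustrative parameters, not the paper's exact settings.

    # Minimal sketch: the user's legitimate history is split into temporal frames,
    # each frame's amounts are discretized into levels, and a new transaction is
    # accepted only if its level appears in enough frames.

    def quantize(amount, lo, hi, levels):
        # Map an amount onto one of `levels` discrete levels; None when the value
        # falls outside the range observed in the user's history.
        if amount < lo or amount > hi:
            return None
        return min(int((amount - lo) / (hi - lo) * levels), levels - 1)

    def build_patterns(amounts, frame_size=4, levels=4):
        lo, hi = min(amounts), max(amounts)
        patterns = []
        for start in range(0, len(amounts), frame_size):
            frame = amounts[start:start + frame_size]
            patterns.append({quantize(a, lo, hi, levels) for a in frame})
        return patterns, lo, hi, levels

    def is_consistent(amount, patterns, lo, hi, levels, min_frames=2):
        level = quantize(amount, lo, hi, levels)
        return level is not None and sum(level in p for p in patterns) >= min_frames

    history = [20, 25, 22, 30, 24, 26, 21, 28, 23, 27, 25, 29]   # legitimate amounts only
    patterns, lo, hi, levels = build_patterns(history)
    print(is_consistent(24, patterns, lo, hi, levels))    # True: a typical spending level
    print(is_consistent(500, patterns, lo, hi, levels))   # False: never observed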
|
|
A Latent Semantic Pattern Recognition Strategy for an Untrivial Targeted Advertising (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the 4th IEEE International Congress (BigData), New York, United States of America |
[Show/Hide Abstract]
Target definition is a process aimed at partitioning the potential audience of an advertiser into several classes, according to specific criteria. Almost all the existing approaches take into account only the explicit preferences of the users, without considering the hidden semantics embedded in their choices, so the target definition is affected by widely-known problems. One of the most important is that easily understandable segments are not effective for marketing purposes due to their triviality, whereas more complex segmentations are hard to understand. In this paper we propose a novel segmentation strategy able to uncover the implicit preferences of the users, by studying the semantic overlapping between the classes of items positively evaluated by them and the rest of classes. The main advantages of our proposal are that the desired target can be specified by the advertiser, and that the set of users is easily described by the class of items that characterizes them; this means that the complexity of the semantic analysis is hidden to the advertiser, and we obtain an interpretable and non-trivial user segmentation, built by using reliable information. Experimental results confirm the effectiveness of our approach in the generation of the target audience. |
|
|
Introducing a Weighted Ontology to Improve the Graph-based Semantic Similarity Measures (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the 6th International Conference on Networking and Information Technology (ICNIT), Tokyo, Japan.
Published in International Journal of Signal Processing Systems (IJSPS) |
[Show/Hide Abstract]
Semantic similarity measures are designed to compare terms that belong to the same ontology. Many of them are based on a graph structure, such as the well-known lexical database for the English language named WordNet, which groups words into sets of synonyms called synsets. Each synset represents a unique vertex of the WordNet semantic graph, through which it is possible to get information about the relations between the different synsets. The literature shows several ways to determine the similarity between words or sentences through WordNet (e.g., by measuring the distance among the words, or by counting the number of edges between the corresponding synsets), but almost all of them do not take into account the peculiar aspects of the used dataset. In some contexts this strategy could lead to bad results, because it considers only the relationships between vertexes of the WordNet semantic graph, without giving them a different weight based on the synset frequency within the considered dataset. In other words, common synsets and rare synsets are valued equally. This could create problems in some applications, such as recommender systems, where WordNet is exploited to evaluate the semantic similarity between the textual descriptions of the items positively evaluated by the users and the descriptions of the other ones not yet evaluated. In this context, we need to identify the user preferences as accurately as possible and, by not taking into account the synset frequency, we risk not recommending certain items to the users, since the semantic similarity generated by the most common synsets present in the descriptions of other items could prevail. This work faces this problem by introducing a novel criterion for evaluating the similarity between words (and sentences) that exploits the WordNet semantic graph, adding to it the weight information of the synsets. The effectiveness of the proposed strategy is verified in the recommender systems context, where the recommendations are generated on the basis of the semantic similarity between the items stored in the user profiles and the items not yet evaluated. |
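A weighted variant of a WordNet-based similarity can be sketched with NLTK as follows; the IDF-style weighting formula and the toy synset counts are illustrative assumptions, not the exact weighting proposed in the article (requires the WordNet corpus, e.g. via nltk.download('wordnet')).

    # Minimal sketch: similarities produced by synsets that are very frequent in
    # the working dataset are down-weighted, so rarer, more informative synsets dominate.
    import math
    from nltk.corpus import wordnet as wn

    def weighted_similarity(word_a, word_b, synset_counts, total_docs):
        best = 0.0
        for s1 in wn.synsets(word_a):
            for s2 in wn.synsets(word_b):
                sim = s1.path_similarity(s2)
                if sim is None:
                    continue
                # IDF-like weight: frequent synsets in the dataset contribute less.
                w1 = math.log((1 + total_docs) / (1 + synset_counts.get(s1.name(), 0)))
                w2 = math.log((1 + total_docs) / (1 + synset_counts.get(s2.name(), 0)))
                best = max(best, sim * w1 * w2)
        return best

    # Toy synset frequencies gathered from the item descriptions of the dataset.
    counts = {'car.n.01': 40, 'dog.n.01': 2}
    print(weighted_similarity('car', 'automobile', counts, total_docs=50))
    print(weighted_similarity('dog', 'puppy', counts, total_docs=50))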
|
|
Multiple Behavioral Models: a Divide and Conquer Strategy to Fraud Detection in Financial Data Streams (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the 7th International Conference on Knowledge Discovery and Information Retrieval (KDIR), Lisbon, Portugal |
[Show/Hide Abstract]
The exponential and rapid growth of E-commerce, based both on the new opportunities offered by the Internet and on the spread of the use of debit or credit cards in online purchases, has strongly increased the number of frauds, causing large economic losses to the involved businesses. The design of effective strategies able to face this problem is however particularly challenging, due to several factors, such as the heterogeneity and the non-stationary distribution of the data stream, as well as the presence of an imbalanced class distribution. The problem is further complicated by the scarcity of public datasets, due to confidentiality issues, which does not allow researchers to verify new strategies in many data contexts. Differently from the canonical state-of-the-art strategies, instead of defining a unique model based on the past transactions of the users, we follow a Divide and Conquer strategy, by defining multiple models (user behavioral patterns), which we exploit to evaluate a new transaction, in order to detect potential attempts of fraud. We can act on some parameters of this process, in order to adapt the sensitivity of the models to the operating environment. Considering that our models do not need to be trained with both the past legitimate and fraudulent transactions of a user, since they use only the legitimate ones, we can operate in a proactive manner, detecting fraudulent transactions that have never occurred in the past. Such a way of proceeding also overcomes the data imbalance problem that afflicts the machine learning approaches. The evaluation of the proposed approach is performed by comparing it with one of the best performing approaches at the state of the art, such as Random Forests, using a real-world credit card dataset. |
|
|
Popularity Does Not Always Mean Triviality: Introduction of Popularity Criteria to Improve the Accuracy of a Recommender System (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the International Conference on Computer Science and Information Technology (ICCSIT), Amsterdam, Netherlands. Published in Journal of Computers (JCP) |
[Show/Hide Abstract]
The main goal of a recommender system is to provide suggestions, by predicting a set of items that might interest the users. In this paper, we focus on the role that the popularity of the items can play in the recommendation process. The main idea behind this work is that if an item with a high predicted rating for a user is very popular, this information about its popularity can be effectively employed to select the items to recommend. Indeed, by merging a high predicted rating with a high popularity, the effectiveness of the produced recommendations increases with respect to a case in which a less popular item is suggested. The proposed strategy aims to employ in the recommendation process new criteria based on the items' popularity, by measuring how much each item is preferred by users. Through a post-processing approach, we use this metric to extend one of the best performing state-of-the-art recommendation techniques, i.e., SVD++. The effectiveness of this hybrid strategy of recommendation has been verified through a series of experiments, which show strong improvements in terms of accuracy w.r.t. SVD++. |
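The post-processing idea can be sketched as a simple re-ranking that combines a predicted rating (e.g., produced by SVD++) with an item-popularity score; the linear combination and its weight are illustrative assumptions.

    # Minimal sketch: candidate items already scored by a rating predictor are
    # re-ranked by blending the predicted rating with an item-popularity score.
    def rerank(predictions, popularity, alpha=0.8, top_n=3):
        # predictions: {item: predicted rating in [1, 5]}
        # popularity:  {item: fraction of users who evaluated the item, in [0, 1]}
        def score(item):
            return alpha * (predictions[item] / 5.0) + (1 - alpha) * popularity.get(item, 0.0)
        return sorted(predictions, key=score, reverse=True)[:top_n]

    predicted = {'A': 4.6, 'B': 4.5, 'C': 4.4, 'D': 3.0}
    popular   = {'A': 0.05, 'B': 0.60, 'C': 0.20, 'D': 0.90}
    print(rerank(predicted, popular))   # ['B', 'A', 'C']: 'B' overtakes 'A' thanks to its popularity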
|
|
Exploiting the Evaluation Frequency of the Items to Enhance the Recommendation Accuracy (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the International Conference on Computer Applications & Technology (ICCAT), Rome, Italy |
[Show/Hide Abstract]
The main task of a recommender system is to suggest a list of items that users may be interested in. In this paper, we focus on the role that the popularity of the items plays in the recommendation process. If, on the one hand, considering only the most popular items generates trivial recommendations, on the other hand, not taking into consideration the item popularity could lead to non-optimal performance of a system, since it does not differentiate the items, giving them the same weight during the recommendation process. Therefore, we risk excluding from the recommendations some popular items that would have a high probability of being preferred by the users, suggesting instead others that, despite meeting the selection criteria, have less chance of being preferred. The proposed strategy aims to employ in the recommendation process new criteria based on the items' popularity, by introducing two novel metrics. Through the first metric we evaluate the semantic relevance of an item with respect to the user profile, while through the second metric we measure how much it is preferred by users. Through a post-processing approach, we use these metrics in order to extend one of the best performing state-of-the-art recommendation techniques: SVD++. The effectiveness of this hybrid strategy of recommendation has been verified through a series of experiments, which show strong improvements in terms of accuracy w.r.t. SVD++. |
|
|
A New Perspective on Recommender Systems: a Class Path Information Model (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the Science and Information Conference (SAI), London, United Kingdom |
[Show/Hide Abstract]
Recommender systems perform suggestions for items that might interest the users. The recommendation process is usually performed at the level of a single item, i.e., for each item not evaluated by a user, classic approaches look for the rating given by similar users for that item, or for an item with similar content. This leads to the so-called overspecialization/serendipity problem, in which the recommended items are trivial and users do not come across surprising items. In this paper we first show that the preferences of the users are actually distributed over a small set of classes of items, leading the recommended items to be too similar to the ones already evaluated. We also present a novel representation model, named Class Path Information (CPI), able to express the current and future preferences of the users in terms of a ranked set of classes of items. Our approach to user preference modeling is based on a semantic analysis of the items evaluated by the users, in order to extend the ground truth and predict where the future preferences of the users will go. Experimental results show that our approach, by including in the CPI model the same classes predicted by a state-of-the-art recommender system, is able to accurately model the preferences of the users in terms of classes rather than single items, allowing recommender systems to suggest non-trivial items. |
|
|
Semantic Coherence-based User Profile Modeling in the Recommender Systems Context (view on RG)
Roberto Saia, Ludovico Boratto, and Salvatore Carta
Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR), Rome, Italy |
[Show/Hide Abstract]
Recommender systems usually produce their results based on the interpretation of the whole history of the user interactions. This canonical approach sometimes could lead to wrong results due to several factors, such as a change in the user preferences over time, or the use of her account by third parties. This work proposes a novel dynamic coherence-based approach that analyzes the information stored in the user profiles on the basis of its coherence. The main aim is to identify and remove from the previously evaluated items those not adherent to the average preferences, in order to make a user profile as close as possible to the user's real tastes. The conducted experiments show the effectiveness of our approach in removing incoherent items from a user profile, increasing the recommendation accuracy. |
|
|
|
|
ARTICOLI SU RIVISTE / ARTICLES IN MAGAZINES |
|
Hping - Il coltellino svizzero della sicurezza
Il tool perfetto per analizzare e forgiare pacchetti TCP/IP
Pubblicato sulla rivista "Linux Pro", numero 124 del mese di dicembre 2012 |
Introduction: thanks to its ability to analyze and forge TCP/IP packets, the hping software is a true Swiss Army knife of security, since it makes it possible to verify the security of protection devices, becoming an irreplaceable companion both for those in charge of administering network security and, unfortunately, for their adversaries.... |
|
|
"Creating a Fake Wi-Fi Hotspot to Capture Connected Users Information" & "Deceiving Defenses with Nmap Camouflaged Scanning"
Republished in 'Hakin9 Exploiting Software Bible', June 2012 |
|
|
Tecniche di mappatura delle reti wireless
A caccia di reti Wi-Fi
Pubblicato sulla rivista "Linux Pro", numero 118 del mese di giugno 2012 |
Introduction: wireless network mapping is an activity that cuts across the whole field of information security, since it provides valuable information both to those whose task is to defend their networks from illegitimate access and to their adversaries who, conversely, work to breach them... |
|
|
Metti al sicuro la tua LAN
Testiamo la rete evidenziandone tutte le vulnerabilità
Pubblicato sulla rivista "Win Magazine", numero 166 del mese di giugno 2012 |
Introduction: although the canonical tools used to defend against cyber attacks on our computers or local networks (first and foremost, firewalls) in many cases offer adequate protection against threats coming from the outside, it must be pointed out that, in certain circumstances, the use of these tools may prove ineffective... |
|
|
Deceiving Networks Defenses with Nmap Camouflaged Scanning
Published in 'Hakin9 Exploiting Software', April 2012 |
Overview: Nmap (contraction of 'Network Mapper') is open-source software designed to rapidly scan both single hosts and large networks. To perform its functions, Nmap uses particular IP packets (raw packets) in order to probe which hosts are active on the target network: about these hosts, it is able to ... |
|
|
Creating a Fake Wi-Fi Hotspot to Capture Connected Users Information
Use a standard laptop to create a fake open wireless access point
Published in 'Hakin9 Exploiting Software', March 2012 |
Overview: we can use a standard laptop to create a fake open wireless access point that allows us to capture a large amount of information about connected users; in certain environments, such as airports or meeting areas, this kind of operation can represent an enormous security threat but, on the other hand, the same approach is a powerful way to check the wireless activity in certain areas ... |
|
|
Proactive Network Defence through Simulated Network
How to use some techniques and tools in order to deceive the potential intruders in our network
Published in 'Hakin9 Extra', February 2012 |
Overview: a honeypot-based solution realizes a credible simulation of a complete network environment where we can add and activate one or more virtual hosts (the honeypots) in various configurations: a network of honeypot systems is named a honeynet... |
|
|
From the Theory of Prime Numbers to Quantum Cryptography
The history of a successful marriage between theoretical mathematics and modern computer science
Published in 'Hakin9 Extra', January 2012 |
Overview: the typical ‘modus operandi’ of the computer science community is certainly more oriented to pragmatism than to fully understanding what underlies the techniques and tools used. This article will try to fill one of these gaps by showing the close connection between the mathematics and modern cryptographic systems. Without claiming to achieve full completeness, the goal here is to expose some of the most important mathematical theories that regulate the operation of modern cryptography... |
|
|
Rsyslog: funzioni avanzate e grande affidabilità
Logging avanzato con Rsyslog
Pubblicato sulla rivista "Linux&C", numero 75 del mese di novembre 2011 |
Introduction: Rsyslog has been chosen by the major distributions to replace the glorious syslogd, compared to which it offers greater flexibility and new features... |
|
|
La rete è sotto controllo - Regolamentazione e filtraggio dei contenuti
Come implementare un efficiente sistema di gestione basato su Squid e DanSGuardian
Pubblicato sulla rivista "Linux Pro", numero 106 del mese di luglio 2011 |
Introduction: the issues concerning the filtering and regulation of access to an external network by the users of a local network are nowadays highlighted by the recent legal provisions issued by the Italian Data Protection Authority, new rules that require administrators to regulate this activity rigorously... |
|
|
|
COLLABORAZIONI / COOPERATIONS |
|
La sicurezza delle reti aziendali ai tempi di Facebook
Wiki project promoted by IBM to discuss the topic of information security and, specifically, how the use of social networks affects the security of corporate infrastructures, as well as the resulting risks.
|
Document protected by a Creative Commons "Attribution-NonCommercial-NoDerivatives" license |
|
Authors: Mario Mazzolin, Simone Riccetti, Cristina Berta, Raoul Chiesa, Angelo Iacubino, Roberto Marmo, Roberto Saia |
|
|
La sicurezza delle informazioni nell'era del Web 2.0
Wiki project promoted by IBM to discuss the topic of information security and, specifically, how the tools offered by Web 2.0 can be administered without jeopardizing the security of the systems. |
Document protected by a Creative Commons "Attribution-NonCommercial-NoDerivatives" license |
|
Authors: Luca Cavone, Gaetano Di Bello, Angelo Iacubino, Armando Leotta, Roberto Marmo, Mario Mazzolini, Daniele Pauletto, Roberto Saia |
|
|
|
LIBRI / BOOKS |
|
SIMILARITY AND DIVERSITY
Two Sides of the Same Coin in Data Analysis |
Language: English
Pages: 168
Author: Roberto Saia
ISBN-13: 978-3-659-88315-6
ISBN-10: 3659883158
EAN: 9783659883156
Year of publication: 2016
Publisher: LAP LAMBERT Academic Publishing |
|
|
|
Language: Italian
Pages: 362 - 17x24
Author: Roberto Saia
ISBN: 9788882338633
Year of publication: 2010
Publisher: FAG Milano
Series: Pro DigitalLifeStyle |
|
|
|
Language: Italian
Pages: 336 - 17x24
Author: Roberto Saia
ISBN: 9788882337742
Year of publication: 2009
Publisher: FAG Milano
Series: Pro DigitalLifeStyle |
|
|
|
Language: Italian
Pages: 448
Author: Roberto Saia
ISBN: 9788882336912
Year of publication: 2008
Publisher: FAG Milano
Series: Pro DigitalLifeStyle |
|
|
|
|
LIBRI DIGITALI / EBOOKS |
|
|
Language: Italian
Pages: 446
Author: Roberto Saia
Year of publication: 2011
Publisher: Manuali.net
Format: E-Book |
|
|
|
Language: Italian
Pages: 100
Author: Roberto Saia
Year of publication: 2010
Publisher: Manuali.net
Format: E-Book |
|
|
|
Language: Italian
Pages: 86
Author: Roberto Saia
Year of publication: 2010
Publisher: Manuali.net
Format: E-Book |
|
|
|
|
ARTICOLI VARI / MISCELLANEOUS ARTICLES |
2012 |
Dalla teoria dei numeri primi alla crittografia quantistica |
|
2010 |
Approccio euristico nella sicurezza nel Web semantico |
|
2010 |
Vulnerabilità di tipo Cross Site Scripting |
|
2010 |
Sicurezza proattiva nel Web di seconda generazione |
|
2010 |
Rischi derivanti dall’analisi aggregata dei dati a scarsa valenza individuale |
|
2008 |
Information Technology e sicurezza |
|
2008 |
Introduzione alla sicurezza informatica |
|
2004 |
SQL Injection Attack Technique |
|
|
|
|
|
GUIDE / TUTORIALS |
|
#1: Il framework Metasploit
|
Introduction: the Metasploit project was born with the goal of creating a software product able to provide information about the vulnerabilities of computer systems, both in order to carry out analyses of the operational scenario (penetration testing) and to support the development of tools designed for their defense... (read the full article) |
|
#2: La gestione dei permessi in ambiente Linux
|
Introduction: permission management in multi-user operating systems such as Linux is of great importance and, precisely for this reason, each system provides some commands specifically designed to handle these operations... (read the full article) |
|
#3: La maschera dei permessi in ambiente Linux
|
Introduction: a rather valuable tool in the field of system security is the so-called permission mask, through which it is possible to administer the privileges on files and folders. It can be used by means of the umask command, a command that... (read the full article) |
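The effect of the permission mask can be reproduced with a few lines of Python: the default permissions (666 for files, 777 for directories) are combined with the bitwise complement of the umask value.

    # Minimal sketch of how the permission mask works.
    def effective_permissions(default_octal, umask_octal):
        return default_octal & ~umask_octal & 0o777

    umask = 0o022
    print(oct(effective_permissions(0o666, umask)))   # 0o644 for new files
    print(oct(effective_permissions(0o777, umask)))   # 0o755 for new directories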
|
#4: Penetration Test con Nmap
|
Introduction: one of the most important activities in the field of information security is undoubtedly what the computer-science literature calls a "Penetration Test", a name given to all those activities whose purpose is to verify, in more or less depth, the security of an IT infrastructure... (read the full article) |
|
#5: Introduzione al Social Engineering: Phishing e Pharming
|
Introduction: the term Social Engineering (in Italian, Ingegneria Sociale) indicates a way of operating in which the attacker relies on deception and/or persuasion aimed at obtaining confidential information that usually allows those who employ these actions to illegally access one or more systems... (read the full article) |
|
#6: Rilevare le intrusioni in una rete wireless con AirSnare
|
Introduction: in this article we discuss AirSnare, a piece of software that, unlike other products with similar functionality, does not require particular technical skills to use, allowing anyone to carry out checks aimed at identifying unauthorized activity on their own network... (read the full article) |
|
#7: Puntatori nel linguaggio di programmazione C
|
Introduction: one of the aspects that newcomers to the C language find hardest to grasp is certainly that of pointers, a powerful tool made available by the language with which it is possible to perform numerous operations in an unusual way compared to... (read the full article) |
|
#8: Principi di guerra elettronica: attacco, protezione e supporto
|
Introduction: this article is a sort of curious digression on wireless technologies, which are considered here in a very particular sense that certainly belongs more to the world of intelligence or, more generally... (read the full article) |
|
#9: Principi di guerra elettronica: tecnologia dei sistemi Tempest |
Introduction: the term Tempest identifies a particular field concerned with the study of the electromagnetic emissions of certain hardware parts of a computer, emissions (the electromagnetic fields generated by the oscillations of the signals processed by the circuits) which, if... (read the full article) |
|
#10: La Subnet Mask |
Introduction: the Subnet Mask is used to distinguish, through an operation known as "ANDing", the portion of an IP address that identifies the network (Network) from the portion that identifies the machine (Host)... (read the full article) |
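The AND operation between address and mask can be reproduced directly, for example in Python with the standard ipaddress module:

    # Minimal sketch of the AND between an IP address and its subnet mask.
    import ipaddress

    ip   = ipaddress.ip_address('192.168.1.130')
    mask = ipaddress.ip_address('255.255.255.0')

    # Bitwise AND between address and mask isolates the network part.
    network_part = ipaddress.ip_address(int(ip) & int(mask))
    print(network_part)                                    # 192.168.1.0

    # The same result through the ipaddress module's own network handling.
    print(ipaddress.ip_network('192.168.1.130/255.255.255.0', strict=False).network_address)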
|
#11: Introduzione alla tecnica del Buffer Overflow |
Introduction: the technique known as Buffer Overflow is based on the exploitation of a certain type of vulnerability present in some software; the vulnerability in question consists of... (read the full article) |
|
#12: Web of Things: introduzione a Paraimpu |
Introduction: the limits that long characterized the world wide web have recently been overcome by the so-called "Web of Things", an innovative interaction paradigm that, in addition to the canonical users, sites and services, brings a huge number of simple and complex devices onto the network... (read the full article) |
|
#13: Autocostruzione di un Firewall Hardware |
Introduction: the goal of this project is to build a hardware firewall with characteristics similar to those found in commercial devices which, despite being very convenient and effective, are extremely expensive... (read the full article) |
|
|
|
|
|