Authors - Rakesh Agrawal, Ramakrishnan Srikant
Summary - The paper presents two association rules mining algorithm for large datasets. Association rule mining is an interesting area of data mining which discovers the relations among the items in transactions. The new algorithms execute at least three times faster than existing algorithms. The first algorithm is called Apriori which incrementally build the large itemsets (an itemset is a set of items, the support of an itemset is calculated by the count the set appear in the database and a large itemset has support greater than minimum support) until no new bigger large itemset can be added, but this algorithm requires many passes over the database to discover all large itemsets. To tackle this problem this paper also presents another algorithm called AprioriTid, which does not require many passes over the database. AprioriTid stores the list of transactions that requires passing again to compute bigger large itemsets in the subsequent steps. However, the list of transactions that requires passing again may grow larger than the original size of the database which deteriorates the performance instead of improving it. This problem is severe for greater number of large itemsets but drastically disappear for smaller number of large itemset. Hence this paper proposes the use of a hybrid algorithm composed of Apriori and AprioriTid called AprioriHybrid. AprioriHybrid uses Apriori at the beginning when the number of large itemsets is high and switched to AprioriTid when the number of large itemsets became small. An empirical study confirms their claim at the end of the paper.
Further thoughts - (1) In order to obtain meaningful rules how does a user choose minimum support and minimum confidence? (2) Attribute data are fundamentally different from transactional data, how does the association rule mining work for attribute data?
Summary - The paper presents two association rules mining algorithm for large datasets. Association rule mining is an interesting area of data mining which discovers the relations among the items in transactions. The new algorithms execute at least three times faster than existing algorithms. The first algorithm is called Apriori which incrementally build the large itemsets (an itemset is a set of items, the support of an itemset is calculated by the count the set appear in the database and a large itemset has support greater than minimum support) until no new bigger large itemset can be added, but this algorithm requires many passes over the database to discover all large itemsets. To tackle this problem this paper also presents another algorithm called AprioriTid, which does not require many passes over the database. AprioriTid stores the list of transactions that requires passing again to compute bigger large itemsets in the subsequent steps. However, the list of transactions that requires passing again may grow larger than the original size of the database which deteriorates the performance instead of improving it. This problem is severe for greater number of large itemsets but drastically disappear for smaller number of large itemset. Hence this paper proposes the use of a hybrid algorithm composed of Apriori and AprioriTid called AprioriHybrid. AprioriHybrid uses Apriori at the beginning when the number of large itemsets is high and switched to AprioriTid when the number of large itemsets became small. An empirical study confirms their claim at the end of the paper.
Further thoughts - (1) In order to obtain meaningful rules how does a user choose minimum support and minimum confidence? (2) Attribute data are fundamentally different from transactional data, how does the association rule mining work for attribute data?
No comments:
Post a Comment
Please, no abusive word, no spam.