Matches in SemOpenAlex for { <https://semopenalex.org/work/W2785651640> ?p ?o ?g. }
Showing items 1 to 85 of 85 with 100 items per page.
- W2785651640 endingPage "135" @default.
- W2785651640 startingPage "125" @default.
- W2785651640 abstract "In 2015, Cdiscount challenged the community to predict the correct category of its products from some of their attributes such as their title, description, price or associated image. The candidates had access to the whole catalogue of active products as of May 2015, which accounts for about 15.8 million items distributed over 5,789 categories, a subset of which served as the testing set. The data suffers from inconsistencies typical of large, real-world databases and the distribution of categories is extremely uneven, thereby complicating the classification task. The five winning algorithms, selected amongst more than 3,500 contributions, are able to predict the correct category of 66-68% of the testing set's products. Most of them are based on simple linear models such as logistic regressions, which suggests that preliminary steps such as text preprocessing, vectorization and data set rebalancing are more crucial than resorting to complex, non-linear models. In particular, the winning contributions all carefully cope with the strong imbalance of the categories, either through random sampling or sample weighting. A distinguishing feature of the two highest-scoring algorithms is their blending of large ensembles of models trained on random subsets of the data. The data set is released to the research and teaching communities, as we hope it will prove a valuable help in improving text and image-based classification algorithms in the context of a very large number of classes. Keywords: classification, e-commerce, big data, public data set. The categorization challenge organized by Cdiscount on datascience.net in 2015: analysis of the released data set and winning contributions. In 2015, Cdiscount challenged the community to predict the correct category of its products from some of their attributes, such as the title, description, price or associated image. The candidates had access to the entire catalogue of active products as of May 2015, about 15.8 million items spread over 5,789 categories, except for a small portion that served as the test set. The quality of the data is far from homogeneous and the distribution of categories is extremely unbalanced, which complicates the categorization task. The five winning algorithms, selected from more than 3,500 contributions, reach a correct prediction rate of 66-68% on the test set. Most use simple linear models such as logistic regressions, which suggests that preliminary steps such as text preprocessing, vectorization and data resampling are more crucial than the choice of complex non-linear models. In particular, the winners all correct the imbalance of the categories through random sampling or through weighting according to the importance of the categories. The two best algorithms stand out by aggregating large numbers of models trained on random subsets of the data. The product catalogue is made available to the research and teaching community, which will thus have real e-commerce data with which to benchmark and improve text and image-based classification algorithms in the context of a very large number of classes. Keywords: classification, e-commerce, big data, public data set." @default.
- W2785651640 created "2018-02-23" @default.
- W2785651640 creator A5007306218 @default.
- W2785651640 creator A5015727229 @default.
- W2785651640 creator A5016892940 @default.
- W2785651640 creator A5063787879 @default.
- W2785651640 creator A5089187388 @default.
- W2785651640 date "2017-12-15" @default.
- W2785651640 modified "2023-09-24" @default.
- W2785651640 title "The categorization challenge organized by Cdiscount on datascience.net in 2015: analysis of the released data set and winning contributions" @default.
- W2785651640 hasPublicationYear "2017" @default.
- W2785651640 type Work @default.
- W2785651640 sameAs 2785651640 @default.
- W2785651640 citedByCount "0" @default.
- W2785651640 crossrefType "journal-article" @default.
- W2785651640 hasAuthorship W2785651640A5007306218 @default.
- W2785651640 hasAuthorship W2785651640A5015727229 @default.
- W2785651640 hasAuthorship W2785651640A5016892940 @default.
- W2785651640 hasAuthorship W2785651640A5063787879 @default.
- W2785651640 hasAuthorship W2785651640A5089187388 @default.
- W2785651640 hasConcept C119857082 @default.
- W2785651640 hasConcept C124101348 @default.
- W2785651640 hasConcept C126838900 @default.
- W2785651640 hasConcept C154945302 @default.
- W2785651640 hasConcept C166957645 @default.
- W2785651640 hasConcept C177264268 @default.
- W2785651640 hasConcept C183115368 @default.
- W2785651640 hasConcept C199360897 @default.
- W2785651640 hasConcept C205649164 @default.
- W2785651640 hasConcept C23123220 @default.
- W2785651640 hasConcept C2779343474 @default.
- W2785651640 hasConcept C34736171 @default.
- W2785651640 hasConcept C41008148 @default.
- W2785651640 hasConcept C58489278 @default.
- W2785651640 hasConcept C71924100 @default.
- W2785651640 hasConcept C75684735 @default.
- W2785651640 hasConcept C94124525 @default.
- W2785651640 hasConceptScore W2785651640C119857082 @default.
- W2785651640 hasConceptScore W2785651640C124101348 @default.
- W2785651640 hasConceptScore W2785651640C126838900 @default.
- W2785651640 hasConceptScore W2785651640C154945302 @default.
- W2785651640 hasConceptScore W2785651640C166957645 @default.
- W2785651640 hasConceptScore W2785651640C177264268 @default.
- W2785651640 hasConceptScore W2785651640C183115368 @default.
- W2785651640 hasConceptScore W2785651640C199360897 @default.
- W2785651640 hasConceptScore W2785651640C205649164 @default.
- W2785651640 hasConceptScore W2785651640C23123220 @default.
- W2785651640 hasConceptScore W2785651640C2779343474 @default.
- W2785651640 hasConceptScore W2785651640C34736171 @default.
- W2785651640 hasConceptScore W2785651640C41008148 @default.
- W2785651640 hasConceptScore W2785651640C58489278 @default.
- W2785651640 hasConceptScore W2785651640C71924100 @default.
- W2785651640 hasConceptScore W2785651640C75684735 @default.
- W2785651640 hasConceptScore W2785651640C94124525 @default.
- W2785651640 hasIssue "2" @default.
- W2785651640 hasLocation W27856516401 @default.
- W2785651640 hasOpenAccess W2785651640 @default.
- W2785651640 hasPrimaryLocation W27856516401 @default.
- W2785651640 hasRelatedWork W106724427 @default.
- W2785651640 hasRelatedWork W2094252110 @default.
- W2785651640 hasRelatedWork W2223286428 @default.
- W2785651640 hasRelatedWork W2275863571 @default.
- W2785651640 hasRelatedWork W2406008833 @default.
- W2785651640 hasRelatedWork W2409537812 @default.
- W2785651640 hasRelatedWork W2607273208 @default.
- W2785651640 hasRelatedWork W2689651934 @default.
- W2785651640 hasRelatedWork W2786213716 @default.
- W2785651640 hasRelatedWork W2912708577 @default.
- W2785651640 hasRelatedWork W2942124016 @default.
- W2785651640 hasRelatedWork W2951437415 @default.
- W2785651640 hasRelatedWork W2981406182 @default.
- W2785651640 hasRelatedWork W3081577193 @default.
- W2785651640 hasRelatedWork W3083200146 @default.
- W2785651640 hasRelatedWork W3106355232 @default.
- W2785651640 hasRelatedWork W3127672287 @default.
- W2785651640 hasRelatedWork W399049322 @default.
- W2785651640 hasRelatedWork W582716883 @default.
- W2785651640 hasRelatedWork W2740536177 @default.
- W2785651640 hasVolume "8" @default.
- W2785651640 isParatext "false" @default.
- W2785651640 isRetracted "false" @default.
- W2785651640 magId "2785651640" @default.
- W2785651640 workType "article" @default.
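
The abstract of this record notes that the winning contributions all handle the strong category imbalance, either through random sampling or through sample weighting. The record does not say which weighting scheme the winners used, so the following is only a minimal sketch of one common choice (inverse class frequency). The function name `balanced_sample_weights` and the toy labels are hypothetical and are not taken from the paper or the data set.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per sample, inversely proportional to class frequency.

    Each sample of class c gets n_samples / (n_classes * count(c)), so every
    class contributes the same total weight regardless of its size.
    This is an assumed scheme, not the one documented for the challenge.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Made-up toy labels: one frequent category and one rare category.
labels = ["books", "books", "books", "phones"]
weights = balanced_sample_weights(labels)
```

Weights of this kind can typically be passed to a linear classifier that accepts per-sample weights during training, so that rare categories are not drowned out by frequent ones.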