Matches in SemOpenAlex for { <https://semopenalex.org/work/W2895839426> ?p ?o ?g. }
- W2895839426 abstract "Analyzing source code using computational linguistics and exploiting the linguistic properties of source code have recently become popular topics in the domain of software engineering. In the first part of the thesis, we study the predictability of source code and determine how well source code can be represented using language models developed for natural language processing. In the second part, we study how well English discussions of source code can be aligned with code elements to create parallel corpora for English-to-code statistical machine translation. This work is organized as a “manuscript” thesis whereby each core chapter constitutes a submitted paper. The first part replicates recent works that have concluded that software is more repetitive and predictable, i.e. more natural, than English texts. We find that much of the apparent “naturalness” is artificial and is the result of language-specific tokens. For example, the syntax of a language, especially the separators (e.g., semicolons and brackets), makes up 59% of all uses of Java tokens in our corpus. Furthermore, 40% of all 2-grams end in a separator, implying that a model for autocompleting the next token would have a trivial separator as its top suggestion 40% of the time. By using the standard NLP practice of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is less repetitive and predictable than was suggested by previous work. We replicate this result across 7 programming languages. Continuing this work, we find that unlike the code written for a particular project, API code usage is similar across projects. For example, a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API, we find that the entropy for 2-grams is significantly lower than that of the English corpus. This repetition perhaps explains the successful literature on API usage suggestion and autocompletion. We then study the impact of the representation of code on repetition. The n-gram model assumes that the current token can be predicted by the sequence of n previous tokens. When we extract program graphs of size 2, 3, and 4 nodes, we see that the abstract graph representation is much more concise and repetitive than the n-gram representations of the same code. This suggests that future work should focus on graphs that include control and data flow dependencies and not linear sequences of tokens. The second part of this thesis focuses on cleaning English and code corpora to aid in machine translation. Generating source code API sequences from an English query using Machine Translation (MT) has gained much interest in recent years. For any kind of MT, the model needs to be trained on a parallel corpus. We clean StackOverflow, one of the most popular online discussion forums for programmers, to generate a parallel English-code corpus. We contrast three data cleaning approaches: standard NLP, title only, and software task. We evaluate the quality of each corpus for MT. We measure the corpus size, percentage of unique tokens, and per-word maximum likelihood alignment entropy. While many works have shown that code is repetitive and predictable, we find that English discussions of code are also repetitive. Creating a maximum likelihood MT model, we find that English words map to a small number of specific code elements, which partially explains the success of using StackOverflow for search and other tasks in the software engineering literature and paves the way for MT. Our scripts and corpora are publicly available." @default.
- W2895839426 created "2018-10-26" @default.
- W2895839426 creator A5007723765 @default.
- W2895839426 date "2018-03-23" @default.
- W2895839426 modified "2023-09-22" @default.
- W2895839426 title "Analyzing the Predictability of Source Code and its Application in Creating Parallel Corpora for English-to-Code Statistical Machine Translation" @default.
- W2895839426 hasPublicationYear "2018" @default.
- W2895839426 type Work @default.
- W2895839426 sameAs 2895839426 @default.
- W2895839426 citedByCount "0" @default.
- W2895839426 crossrefType "dissertation" @default.
- W2895839426 hasAuthorship W2895839426A5007723765 @default.
- W2895839426 hasConcept C154945302 @default.
- W2895839426 hasConcept C195324797 @default.
- W2895839426 hasConcept C199360897 @default.
- W2895839426 hasConcept C204321447 @default.
- W2895839426 hasConcept C41008148 @default.
- W2895839426 hasConcept C43126263 @default.
- W2895839426 hasConcept C540372491 @default.
- W2895839426 hasConcept C548217200 @default.
- W2895839426 hasConcept C60048249 @default.
- W2895839426 hasConceptScore W2895839426C154945302 @default.
- W2895839426 hasConceptScore W2895839426C195324797 @default.
- W2895839426 hasConceptScore W2895839426C199360897 @default.
- W2895839426 hasConceptScore W2895839426C204321447 @default.
- W2895839426 hasConceptScore W2895839426C41008148 @default.
- W2895839426 hasConceptScore W2895839426C43126263 @default.
- W2895839426 hasConceptScore W2895839426C540372491 @default.
- W2895839426 hasConceptScore W2895839426C548217200 @default.
- W2895839426 hasConceptScore W2895839426C60048249 @default.
- W2895839426 hasLocation W28958394261 @default.
- W2895839426 hasOpenAccess W2895839426 @default.
- W2895839426 hasPrimaryLocation W28958394261 @default.
- W2895839426 hasRelatedWork W1582459552 @default.
- W2895839426 hasRelatedWork W1783519389 @default.
- W2895839426 hasRelatedWork W1994573369 @default.
- W2895839426 hasRelatedWork W2176369193 @default.
- W2895839426 hasRelatedWork W2585840180 @default.
- W2895839426 hasRelatedWork W2604143972 @default.
- W2895839426 hasRelatedWork W2889775767 @default.
- W2895839426 hasRelatedWork W2915346886 @default.
- W2895839426 hasRelatedWork W2981321698 @default.
- W2895839426 hasRelatedWork W3017697027 @default.
- W2895839426 hasRelatedWork W3026852836 @default.
- W2895839426 hasRelatedWork W3046805850 @default.
- W2895839426 hasRelatedWork W3084946713 @default.
- W2895839426 hasRelatedWork W3090025768 @default.
- W2895839426 hasRelatedWork W3105398568 @default.
- W2895839426 hasRelatedWork W3108387573 @default.
- W2895839426 hasRelatedWork W3161997752 @default.
- W2895839426 hasRelatedWork W3176125721 @default.
- W2895839426 hasRelatedWork W3177649442 @default.
- W2895839426 hasRelatedWork W1499025187 @default.
- W2895839426 isParatext "false" @default.
- W2895839426 isRetracted "false" @default.
- W2895839426 magId "2895839426" @default.
- W2895839426 workType "dissertation" @default.
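
The abstract above reports 2-gram statistics (the share of 2-grams ending in a separator, and 2-gram entropy with and without separators). The following is a minimal sketch of how such measurements could be computed on a token list. It is not the thesis's own script: the separator set, the function names, and the choice of conditional entropy H(next | previous) with maximum likelihood estimates are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical separator set; the thesis's actual list is not given in the abstract.
SEPARATORS = {";", ",", ".", "(", ")", "{", "}", "[", "]"}


def bigrams(tokens):
    """Return the list of adjacent token pairs (2-grams)."""
    return list(zip(tokens, tokens[1:]))


def bigram_conditional_entropy(tokens):
    """Estimate H(next | previous) in bits from maximum likelihood counts."""
    pairs = bigrams(tokens)
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (prev, _), count in pair_counts.items():
        p_pair = count / total                    # P(previous, next)
        p_next_given_prev = count / prev_counts[prev]  # P(next | previous)
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


def separator_ending_share(tokens):
    """Fraction of 2-grams whose second token is a separator."""
    pairs = bigrams(tokens)
    if not pairs:
        return 0.0
    return sum(1 for _, nxt in pairs if nxt in SEPARATORS) / len(pairs)


def strip_separators(tokens):
    """Drop separator tokens, analogous to removing punctuation in NLP."""
    return [t for t in tokens if t not in SEPARATORS]
```

Comparing `bigram_conditional_entropy(tokens)` with `bigram_conditional_entropy(strip_separators(tokens))` on a tokenized source file would give a rough, small-scale analogue of the comparison the abstract describes; real experiments would also need smoothing and a held-out corpus.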