Code-based Plagiarism Detection Techniques

  • Biraj Upadhyaya and Dr. Samarjeet Borah

 

Abstract- The copying of programming assignments by students, at the undergraduate as well as the postgraduate level, is a common practice. Efficient mechanisms for detecting plagiarised code are therefore needed. Text-based plagiarism detection techniques do not work well with source code. In this paper we analyse a code-based plagiarism detection technique which is employed by various plagiarism detection tools like JPlag, MOSS, CodeMatch etc.

  1. Introduction

The word plagiarism is derived from the Latin word plagiarie, which means to kidnap or to abduct. In academia or industry, plagiarism refers to the act of copying material without properly acknowledging the original source [1]. Plagiarism is considered an ethical offence which may incur serious disciplinary action, such as a sharp reduction in marks and even expulsion from the university in severe cases. Student plagiarism primarily falls into two categories: text-based plagiarism and code-based plagiarism. Instances of text-based plagiarism include word-for-word copying, paraphrasing, plagiarism of secondary sources, plagiarism of ideas, blunt plagiarism or source plagiarism etc. Plagiarism is considered code-based when a student copies or modifies a program required to be submitted for a programming assignment. Code-based plagiarism includes verbatim copying, changing comments, changing white space and formatting, renaming identifiers, reordering code blocks, changing the order of operators/operands in expressions, changing data types, adding redundant statements or variables, replacing control structures with equivalent structures etc. [2].

  2. Background

Text-based plagiarism detection techniques do not work well with coded input or a program. Experiments have suggested that text-based systems ignore coding syntax, an essential part of any programming construct, which is a serious drawback. To overcome this problem, code-based plagiarism detection techniques were developed. Code-based plagiarism detection techniques can be classified into two categories, viz. attribute-oriented plagiarism detection and structure-oriented plagiarism detection.

Attribute-oriented plagiarism detection systems measure properties of assignment submissions [3]. The following attributes are considered:

  • Number of distinct operators
  • Number of distinct operands
  • Total number of occurrences of operators
  • Total number of occurrences of operands

Based on the above attributes, the degree of similarity of two programs can be estimated.
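The four attributes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operator set and the regular expression are hypothetical and deliberately tiny, and every identifier, keyword or number is counted as an operand. It is not the method of any particular tool.

```python
import re

# Hypothetical, deliberately small operator set (an assumption for illustration).
OPERATORS = {"=", "<", ">", "+", "-", "*", "/", "++", "--", "==", "<=", ">="}

# Identifiers and numbers (treated as operands), or one of the operators above.
# Two-character operators are listed before their one-character prefixes.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|==|<=|>=|[=<>+\-*/]")

def attribute_counts(source):
    """Return (distinct operators, distinct operands, total operators, total operands)."""
    lexemes = LEXEME.findall(source)
    operators = [x for x in lexemes if x in OPERATORS]
    operands = [x for x in lexemes if x not in OPERATORS]
    return (len(set(operators)), len(set(operands)), len(operators), len(operands))

original = attribute_counts("x = a + b; y = x * 2;")
renamed = attribute_counts("p = q + r; s = p * 2;")
print(original, renamed)  # both are (3, 5, 4, 6)
```

The two snippets differ only in identifier names, so their attribute vectors are identical, which is the kind of agreement an attribute-oriented system looks for.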

Structure-oriented plagiarism detection systems deliberately ignore easily modifiable programming elements such as comments, additional white space and variable names. This makes such systems less susceptible to the addition of redundant information than attribute-oriented plagiarism detection systems. A student who is aware that this kind of plagiarism detection system is deployed at his or her institution would rather complete the assignment independently than work on a tedious and time-consuming modification task.

  3. Scalable Plagiarism Detection

Steven Burrows, in his paper “Efficient and Effective Plagiarism Detection for Large Code Repositories” [3], provided an algorithm for code-based plagiarism detection. The algorithm comprises the following steps:

  3.1 Tokenization

Figure: 1.0

Let us consider a simple C program:

#include <stdio.h>

int main( ) {
    int var;
    for (var=0; var<5; var++)
    {
        printf("%d\n", var);
    }
    return 0;
}

Programming Construct    Token
int                      S
main                     N
for                      R
return                   g
(                        A
)                        B
{                        j
}                        l
=                        K
<                        J
+                        D
,                        E
ALPHANUM                 N
STRING                   5

Table 1.0: Token list for the program in Figure 1.0.

Here ALPHANUM refers to any function name, variable name or variable value. STRING refers to character(s) enclosed in double quotes.

The corresponding token stream for the program in Figure 1.0 is given as

SNABjSNRANKNNJNNDDBjNA5ENBlgNl
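The mapping from Figure 1.0 to this stream can be reproduced with a short Python sketch. The token codes are copied from Table 1.0; the regular expression and the decision to skip semicolons and preprocessor lines are assumptions inferred from the stream itself, not a description of Burrows's tokenizer.

```python
import re

# Token codes copied from Table 1.0. Anything matched by the lexer but not
# listed here (function names, variable names, numbers) falls back to ALPHANUM.
CODES = {"int": "S", "main": "N", "for": "R", "return": "g",
         "(": "A", ")": "B", "{": "j", "}": "l",
         "=": "K", "<": "J", "+": "D", ",": "E"}

# A quoted string, a word/number, or one of the punctuation marks in Table 1.0.
# Semicolons are not matched, so they produce no token (assumption).
LEXEME = re.compile(r'"[^"]*"|\w+|[(){}=<+,]')

def tokenize(source):
    stream = []
    for line in source.splitlines():
        if line.lstrip().startswith("#"):   # assumption: preprocessor lines are skipped
            continue
        for lexeme in LEXEME.findall(line):
            if lexeme.startswith('"'):
                stream.append("5")                      # STRING
            else:
                stream.append(CODES.get(lexeme, "N"))   # ALPHANUM fallback
    return "".join(stream)

PROGRAM = r'''#include <stdio.h>
int main( ) {
    int var;
    for (var=0; var<5; var++)
    {
        printf("%d\n", var);
    }
    return 0;
}
'''

print(tokenize(PROGRAM))  # SNABjSNRANKNNJNNDDBjNA5ENBlgNl
```

Because every name and literal collapses to the same ALPHANUM token, renaming `var` or changing the loop bound leaves the stream unchanged.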

Now the above token stream is converted to an N-gram representation. In our case the value of N is chosen as 4. The corresponding 4-grams of the above token stream are shown below:

SNAB NABj ABjS BjSN jSNR SNRA NRAN RANK ANKN NKNN KNNJ NNJN NJNN JNND NNDD NDDB DDBj DBjN BjNA jNA5 NA5E A5EN 5ENB ENBl NBlg BlgN lgNl

These 4-grams are generated using the sliding window technique. The sliding window technique generates N-grams by moving a “window” of size N across the token stream from left to right, one token at a time.
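The sliding window can be written as a one-line list comprehension. The function name `ngrams` is only an illustrative choice.

```python
def ngrams(stream, n=4):
    """Slide a window of size n over the stream, one token at a time."""
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

stream = "SNABjSNRANKNNJNNDDBjNA5ENBlgNl"   # token stream of Figure 1.0
grams = ngrams(stream)

print(len(stream), len(grams))   # 30 tokens give 30 - 4 + 1 = 27 four-grams
print(" ".join(grams[:3]))       # SNAB NABj ABjS
```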

The use of N-grams is an appropriate method of performing structural plagiarism detection because any change to the source code will only affect a few neighbouring N-grams. The modified version of the program will have a large percentage of unchanged N-grams, hence it will be easy to detect plagiarism in this program.
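A small experiment illustrates this locality. Suppose, hypothetically, that a student inserts a redundant declaration such as `int tmp;` after `int var;` in Figure 1.0, which adds the tokens `SN` to the stream. The sketch below counts how many of the original 4-grams survive; the edit is an invented example, not one taken from the cited work.

```python
from collections import Counter

def ngrams(stream, n=4):
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

original = "SNABjSNRANKNNJNNDDBjNA5ENBlgNl"
# Hypothetical edit: tokens "SN" (a redundant `int tmp;`) inserted after `int var;`.
modified = original[:7] + "SN" + original[7:]

# Multiset intersection: each 4-gram counts as often as it occurs in both programs.
shared = sum((Counter(ngrams(original)) & Counter(ngrams(modified))).values())
total = len(ngrams(original))

print(shared, "of", total, "original 4-grams are still present")  # 26 of 27
```

Only the 4-gram that straddles the insertion point disappears; the rest of the program still matches.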

  3.2 Index Construction

The second step is to create an inverted index of these N-grams. An inverted index consists of a lexicon and an inverted list. It is shown below:

Lexicon    Inverted List
Apple      1: 25,3
Orange     1: 26,2
Banana     1: 22,5
Mango      3: 31,1 33,3 15,2
Grapes     2: 24,6 26,1

Table 2.0: Inverted Index

Referring to the above inverted index for mango, we can conclude that mango occurs in three documents in the collection. It occurs once in document no. 31, thrice in document no. 33 and twice in document no. 15. Similarly, we can represent the 4-grams of Figure 1.0 with the help of an inverted index. The inverted index for five of the 4-grams is shown below in Table 3.0.

Lexicon    Inverted List
5ENB       2: 1,1 2,2
A5EN       2: 1,1 2,2
ABjS       2: 1,1 2,1
ANKN       2: 1,1 2,1
BgNl       1: 2,1
………        ………

Table 3.0: Inverted Index
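An inverted index in the notation of Tables 2.0 and 3.0 can be sketched as follows. Document 1 is the token stream of Figure 1.0; document 2 is a made-up stream used only to show a posting with a frequency above one, since the second program behind Table 3.0 is not given in this paper.

```python
from collections import Counter, defaultdict

def ngrams(stream, n=4):
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

def build_index(streams, n=4):
    """Map each n-gram (lexicon entry) to a list of (document id, frequency) pairs."""
    index = defaultdict(list)
    for doc_id, stream in sorted(streams.items()):
        for gram, freq in Counter(ngrams(stream, n)).items():
            index[gram].append((doc_id, freq))
    return dict(index)

def format_postings(postings):
    """Render postings as 'number of documents: doc,freq doc,freq ...'."""
    return f"{len(postings)}: " + " ".join(f"{doc},{freq}" for doc, freq in postings)

streams = {
    1: "SNABjSNRANKNNJNNDDBjNA5ENBlgNl",  # Figure 1.0
    2: "ABjSABjS",                        # hypothetical second program
}
index = build_index(streams)

print("ABjS", format_postings(index["ABjS"]))  # ABjS 2: 1,1 2,2
print("SNAB", format_postings(index["SNAB"]))  # SNAB 1: 1,1
```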

  3.3 Querying

The next step is to query the index. Each query is the N-gram representation of a program. For a token stream of t tokens, we require (t − n + 1) N-grams, where n is the length of the N-gram. Each query returns the ten most similar programs matching the query program, organised from most similar to least similar. If the query program is one of the indexed programs, we would expect it to produce the highest score. We assign a similarity score of 100% to the exact or top match [3]. All other programs are given a similarity score relative to the top score.
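The querying step can be sketched as below. The raw score used here is simply the number of 4-grams a program shares with the query, which is an assumption made for illustration and not necessarily the similarity measure used in [3]. The file names and the second and third token streams are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(stream, n=4):
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

def build_index(streams, n=4):
    """Map each n-gram to {program name: frequency}."""
    index = defaultdict(dict)
    for name, stream in streams.items():
        for gram, freq in Counter(ngrams(stream, n)).items():
            index[gram][name] = freq
    return index

def query(index, stream, n=4, k=10):
    """Return the k best-matching programs, most similar first."""
    scores = Counter()
    for gram, q_freq in Counter(ngrams(stream, n)).items():
        for name, d_freq in index.get(gram, {}).items():
            scores[name] += min(q_freq, d_freq)   # assumed raw score: shared n-grams
    return scores.most_common(k)

streams = {
    "a.c": "SNABjSNRANKNNJNNDDBjNA5ENBlgNl",  # Figure 1.0
    "b.c": "SNABjSNRANKNNKNNDDBjNA5ENBlgNl",  # hypothetical: one token changed
    "c.c": "SNABjgNl",                        # hypothetical: an almost empty main
}
results = query(build_index(streams), streams["a.c"])
print(results)  # [('a.c', 27), ('b.c', 23), ('c.c', 2)]
```

As expected, the query program scored against itself comes first, and the lightly modified copy follows closely behind.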

Table 4.0 presents the top ten results of Burrows's experiment, in which one N-gram program file (0020.c) was queried against an index of 296 programs. In this example, the file scored against itself generates the highest relative score of 100.00%. This result is ignored, but it is used to generate a relative similarity score for all other results. We can also see that the program 0103.c is very similar to program 0020.c, with a score of 93.34%.

Rank  Query File  Index File  Raw Score  Similarity Score
1     0020.c      0020.c      369.45     100%
2     0020.c      0103.c      344.85     93.34%
3     0020.c      0092.c      189.38     51.26%
4     0020.c      0151.c      185.05     50.09%
5     0020.c      0267.c      167.82     45.43%
6     0020.c      0150.c      164.67     44.57%
7     0020.c      0137.c      158.67     42.93%
8     0020.c      0139.c      154.31     41.76%
9     0020.c      0269.c      129.17     34.96%
10    0020.c      0241.c      126.87     34.33%

Table 4.0: Results of the program 0020.c compared to an index of 296 programs.
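The relative similarity score is the raw score divided by the top raw score. The short sketch below applies this to the first three rows of Table 4.0; the function name is an illustrative choice.

```python
def relative_scores(raw_scores):
    """Express every raw score as a percentage of the highest raw score."""
    top = max(raw_scores.values())
    return {name: round(100 * raw / top, 2) for name, raw in raw_scores.items()}

# First three rows of Table 4.0.
raw = {"0020.c": 369.45, "0103.c": 344.85, "0092.c": 189.38}
print(relative_scores(raw))  # {'0020.c': 100.0, '0103.c': 93.34, '0092.c': 51.26}
```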

  4. Comparison of Various Plagiarism Detection Tools
  4.1 JPlag

The salient features of this tool are presented below:

  • JPlag was developed in 1996 by Guido Malpohl
  • It currently supports C, C++, C#, Java, Scheme and natural language text
  • It is a free plagiarism detection tool
  • It is used to detect software plagiarism among multiple sets of source code files.
  • JPlag uses the Greedy String Tiling algorithm, which produces matches ranked by average and maximum similarity.
  • It can be used to compare programs which have a large difference in size, which is probably the result of inserting dead code into the program to disguise its origin.
  • Obtained results are displayed as a set of HTML pages in the form of a histogram which presents the statistics for the analysed files
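To give a feel for Greedy String Tiling, here is a simplified textbook-style sketch. It is not JPlag's implementation: it omits the hashing optimisations real tools use, the minimum match length of 3 is an arbitrary choice, and the coverage-based similarity ratio is an assumption for illustration.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start in a, start in b, length) covering common substrings."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find the longest common runs of still-unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_len:
                    matches.append((i, j, k))
                elif k > max_len:
                    matches = [(i, j, k)]
                    max_len = k
        if not matches:
            break
        # Turn the maximal matches into tiles unless they overlap an earlier tile.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
    return tiles

def similarity(a, b, min_match=3):
    """Assumed measure: share of both streams covered by tiles."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))

print(greedy_string_tiling("ABCDEFG", "EFGABCD"))  # [(0, 3, 4), (4, 0, 3)]
print(similarity("ABCDEFG", "EFGABCD"))            # 1.0
```

The second stream is the first with its two blocks swapped, and the tiling still covers it completely, which is why reordering code does little to hide copying from this kind of matcher.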
  4.2 CodeMatch

The salient features of this tool are presented below:

  • It was developed in 2003 by Bob Zeidman and is under the licence of SAFE Corporation
  • This program is available as a standalone application.
  • It supports 26 different programming languages including C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, SQL etc.
  • It has a free version which allows only one trial comparison, where the total of all files being examined doesn’t exceed 1 megabyte of data
  • It is mostly used as forensic software in copyright infringement cases
  • It determines the most highly correlated files placed in multiple directories and subdirectories by comparing their source code.
  • Four types of matching algorithms are used: Statement Matching, Comment Matching, Instruction Sequence Matching and Identifier Matching.
  • The results come in the form of an HTML basic report that lists the most highly correlated pairs of files.
  4.3 MOSS

The salient features of this plagiarism detection tool are as follows:

  • The full form of MOSS is Measure of Software Similarity
  • It was developed by Alex Aiken in 1994
  • It is provided as a free Internet service hosted by Stanford University and it can be used only if a user creates an account
  • The program can analyse source code written in 26 programming languages including C, C++, Java, C#, Python, Pascal, Visual Basic, Perl etc.
  • Files are submitted through the command line and the processing is performed on the Internet server
  • The current form of the program is available only for UNIX platforms
  • MOSS uses the Winnowing algorithm based on code-sequence matching, and it analyses the syntax or the structure of the observed files
  • MOSS maintains a database that stores an internal representation of programs and then looks for similarities between them
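The selection rule at the heart of winnowing can be sketched briefly. In each window of w consecutive k-gram hashes, the minimum hash is kept as a fingerprint, taking the rightmost one on ties. The hash values and the window size below are illustrative assumptions, and this sketch says nothing about how the MOSS service is actually implemented.

```python
def winnow(hashes, w=4):
    """Select (hash, position) fingerprints: the rightmost minimum of every window."""
    fingerprints = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Position of the rightmost occurrence of the minimum within the window.
        position = start + (w - 1 - window[::-1].index(smallest))
        # Selected positions never move left, so comparing with the last
        # fingerprint is enough to avoid recording the same one twice.
        if not fingerprints or fingerprints[-1] != (smallest, position):
            fingerprints.append((smallest, position))
    return fingerprints

# Illustrative sequence of k-gram hash values (assumed, not taken from a real file).
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(hashes))  # [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

Two documents are then compared by looking for fingerprints they have in common, so only a small subset of all hashes has to be stored.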
  4.4 Comparative Analysis Table
 

                     JPlag                                               MOSS                 CodeMatch
Birth                1996                                                1994                 2003
Inventor             Guido Malpohl                                       Alex Aiken           Bob Zeidman
Availability         Free                                                Free                 Free (up to 1 MB of data)
Algorithm used       Greedy String Tiling                                Winnowing algorithm  Statement/Comment/Instruction/Identifier matching
Languages supported  C, C++, C#, Java, Scheme and natural language text  26 languages         26 languages
Results displayed    HTML histogram                                      HTML basic report    HTML pair code matching

  5. Conclusion

In this paper we studied a structure-oriented, code-based plagiarism detection technique known as Scalable Plagiarism Detection. Its steps, namely tokenization, index construction and querying, were also studied. We also examined the salient features of various code-based plagiarism detection tools like JPlag, CodeMatch and MOSS.

References

  1. Gerry McAllister, Karen Fraser, Anne Morris, Stephen Hagen, Hazel White, http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/
  2. Georgina Cosma, “An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis”, University of Warwick, Department of Computer Science, July 2008
  3. Steven Burrows, “Efficient and Effective Plagiarism Detection for Large Code Repositories”, School of Computer Science and Information Technology, Melbourne, Australia, October 2004
  4. Vedran Juric, Tereza Juric and Marija Tkalec, “Performance Evaluation of Plagiarism Detection Method Based on the Intermediate Language”, University of Zagreb
