/* concordiff.c - Compare data files by the concordance method

   This work is licensed under a Creative Commons Attribution 2.5 License.
   Details at: http://creativecommons.org/licenses/by/2.5/

   A summary of this and other HPC projects is in high-performance.rhtf

DESCRIPTION

   This tool compares one textfile ("file A") against one or more blocks
   of text in a second file (each of which takes its turn as "file B")
   and performs a "concordance matching" calculation between the former
   and each of the latter.

   When comparing two entities (the text of "file A" and the block of
   text "file B"), all keywords of a given size are recognized and
   indexed in both files, and a "concordance score" is compiled. The
   score is based on the frequency of keywords that appear in file A and
   also appear in file B, with appropriate weighting to remove bias
   towards or against larger files (which would otherwise score higher
   simply by having more keywords). A reduced arithmetic sketch of the
   score appears just after the type definitions below.

   The full definition of the matching algorithm and some examples are
   given in the webpage: mrob.com/pub/math/Sloandora

COMMAND SYNTAX and EXAMPLES

   concordiff [-n#] [-v] [-h] [-d divider] file1 file2

   -n specifies the number of characters in a token ("keyword"). If
      omitted the token size is 5 bytes. The token size is also the
      expected length of the divider (if any) given by -d.

   -d followed by a string (which must be of the length specified by
      -n), indicates that file2 is to be treated as a sequence of blocks
      of text, each of which is considered to be its own "file2" for the
      purposes of the algorithm. file1 will be compared to each of these
      "file2"s separately, giving a concordance score for each.

   -h (or -help, --help or --h) prints command arguments and options and
      exits.

   -v (verbose) adds performance statistics to the end of the output.

   Example invocations appear in a comment at the end of this file.

REVISION HISTORY

 20091024 First version
 20091025 census() finished, seems to return reasonable results
 20091119 -v displays CPU time used
 20100118 Add "Read from stream2" statistic.

TO DO

   Modify my_alloc/bt_insert/etc to support deallocating an entire tree
   all at once. (needed for the next item)

   Performance can be optimized somewhat by pre-computing the token
   counts for file2 (or each constituent block within file2, as
   appropriate) by generating a content-addressable index and storing
   that index into a file. This could boost performance in cases where a
   token occurs multiple times -- rather than scanning the token
   multiple times and incrementing its count by 1 at a time, the total
   count can be retrieved instantly from the index. Consider this ratio
   (the average length of an article divided by the average number of
   distinct tokens in that article) and call it the "token redundancy".
   Suppose the number of distinct tokens in file A is P, the number of
   records in the database is N, the average number of distinct tokens
   per record in the entire database is Q, and the average token
   redundancy is K. Then the current algorithm is performing
   log(P)*N*Q*K operations, whereas a reverse-index would permit
   computing exactly the same results in P*N*log(Q) operations. We get a
   new worst-case domain (namely, when the current article diversity P
   is abnormally large), and we trade off one log article size for
   another (log(P) becomes P and Q becomes log(Q)), but most importantly
   a factor of K goes away. The linear factor of N remains, and results
   from the fact that we effectively have to re-rank every article
   whenever the user rates an article.
   An additional (small) speedup might be possible by scanning both
   files' keywords in the same order: in other words, instead of using
   bt_next to traverse the file A keywords and a "bt_locate"-type
   function to find the linked list of citations in file B, traverse
   both trees with bt_next simultaneously. (A disabled sketch of this
   appears after bt_next below.)

   In order for this performance improvement to really provide benefit,
   there should be a way to keep the content-addressable index in memory
   for use in subsequent operations (e.g. for Sloandora, when comparing
   the next "file1" to the same set of articles represented by "file2").
   This implies an interface driven by stdio blocking (using pipes), or
   with a sigaction(2) handler and semaphores, or even perhaps adding a
   higher-level user interface or command-line shell to implement the
   full functionality of the #Sloandora# script (in which case the
   concordiff functionality becomes just a thread subroutine). At the
   minimum there would be three new bits of functionality: create index,
   save index to a file, load index from a file (and use it in lieu of
   scanning through a "file2"). The first two (create and save)
   naturally go together and would be a new command. The other (load
   index) would be an alternative to the current normal operating mode,
   with the "stay resident and take signals" being a further option
   added later.

   There should be a way to indicate that an article is "relevant but
   boring". This is to address the common problem of lots of articles
   that are virtually identical. This happens often when searching for
   news, and in the OEIS for the "continued fraction expansion for
   sqrt(K)" sequences, etc. In Pandora, it would happen when a listener
   gets tired of hearing a particular artist, or when there are multiple
   versions of the same album (all with the same set of songs) in the
   database.

   To implement that, I'll need to explore different metrics. The basic
   idea is for overlap to always count towards the concordance score,
   but to count more or less towards the "novelty score" depending on
   whether or not the larger-scale structure exhibits concordance. Thus,
   two articles that use all the same words and in the same order would
   have a very low "novelty score" with respect to each other, because
   they are probably the same article. If the words appear in a
   different order, they are probably different articles talking about
   the same thing. Implementing such a metric would probably involve
   making a scatter-plot of the position of each keyword in file A vs.
   its position in file B. A keyword that occurs X times in A and Y
   times in B would produce X*Y points on the scatter-plot; these could
   be given weights to level the playing field for less common keywords;
   alternatively each occurrence in file A could be matched with the
   nearest available occurrence in file B, yielding only X points on the
   scatter-plot (with X <= X*Y).
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Static definitions */

#define MAX_TOK_SIZE 8

/* Data types */

typedef signed char s8;
typedef unsigned char u8;
typedef short s16;
typedef unsigned short u16;
typedef long s32;
typedef unsigned long u32;
typedef long long int s64;
typedef unsigned long long int u64;
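/* The score computed by census() (below), reduced to its final
   arithmetic. This disabled sketch is illustrative only: the helper name
   is not part of the program and nothing calls it. Given the three
   tallies census() produces -- token occurrences matched in both files
   (count_AB), surplus occurrences found only in A (count_A) and surplus
   occurrences found only in B (count_B) -- the score is the smaller of
   the two coverage ratios, which is the "weighting" mentioned in the
   DESCRIPTION above: it keeps a large file from scoring well against a
   small one merely by containing many keywords. */
#if 0
static float concordance_score(u32 count_AB, u32 count_A, u32 count_B)
{
  float AcoverB, BcoverA;

  count_AB++;  /* ensure the denominators are nonzero */
  AcoverB = ((float) count_AB) / ((float) (count_AB + count_B));
  BcoverA = ((float) count_AB) / ((float) (count_AB + count_A));
  return (AcoverB < BcoverA) ? AcoverB : BcoverA;
}
#endif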
/* Simple binary tree node format, used for the normal (full scan)
   operating mode. */
typedef struct bt_node {
  struct bt_node *left;   /* left child tree or 0 if none */
  struct bt_node *up;     /* parent node, or 0 if we're the head */
  struct bt_node *right;  /* right child tree or 0 if none */
  u32 population;         /* 1 plus number of nodes below this node */
  u32 count_f1;           /* Number of times this token was seen in file 1 */
  u32 count_f2;           /* Number of times this token was seen in file 2 */
  char tok[MAX_TOK_SIZE]; /* The token */
} bt_node;

/* Node format for content-addressable index. This index format allows us
   to perform lots of searches against a (assumed static) target dataset
   without rescanning and retokenizing the dataset. (On the downside, this
   index will typically take up more memory than the original dataset and
   so it should be retained in memory across multiple queries to provide
   any real benefit.) */
typedef struct citeref {
  u32 file_id;   /* Identifies a file (or segment of "file2") that this
                    token was seen in */
  u32 count_tok; /* Number of times this token was seen in that
                    file/segment */
} citeref;

typedef struct ix_node {
  struct ix_node *left;   /* left child tree or 0 if none */
  struct ix_node *up;     /* parent node, or 0 if we're the head */
  struct ix_node *right;  /* right child tree or 0 if none */
  u32 population;         /* 1 plus number of nodes below this node */
  char tok[MAX_TOK_SIZE]; /* The token */
  u32 count_f1;           /* Number of times this token was seen in file 1 */
  u32 numhits;            /* Number of files/segments in which this token
                             appears */
  citeref hits[1];        /* (placeholder for) first citation (array size
                             varies; more follow if numhits>1) */
} ix_node;
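/* Disabled sketch of how the planned content-addressable index might be
   consulted once it exists: given an ix_node that has already been
   located by its token, return the number of times that token was seen
   in one particular file/segment. The function name and its use are
   hypothetical; only the node layout above is defined so far. */
#if 0
static u32 ix_count_for_file(ix_node * node, u32 file_id)
{
  u32 i;

  for (i = 0; i < node->numhits; i++) {
    if (node->hits[i].file_id == file_id) {
      return node->hits[i].count_tok; /* found a citation for this file */
    }
  }
  return 0; /* token does not occur in that file/segment */
}
#endif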
/* Function prototypes */
char * my_alloc(u32 size);
void bt_insert(int f1, char * token, bt_node * * result);
bt_node * bt_first(bt_node * it);
bt_node * bt_next(bt_node *it);
float census(void);
void clear_scores(void);
void help(void);
void do_scan(void);
int get_input(void);
int main(int argc, char** argv);

void help(void)
{
  printf("%s",
    "Parameter description:\n"
    "  -n3     Select 3-character token size.\n"
    "  file1   First file containing data to compare\n"
    "  file2   Second file containing data to compare\n"
    "  -d ---  Set divider for multi-record matching in file2.\n"
    "  -v      Verbose (displays statistics)\n");
}

char * g_f1name;
char * g_f2name;
char * g_divider;
int g_verbose;
s32 g_toksize;
u32 g_count_B;
float g_result;

#define CHUNK_SIZE 256

/* Memory allocation */
#define ALLOC_SIZE 65536L /* %%% should be at least 8x the VM page size */
char *freepool;    /* pointer to current alloc block */
u32 freesize;      /* and bytes left in current alloc block */
u64 mem_total;     /* Total amount of memory used, in bytes */
u64 g_mem_limit;
s32 g_total_nodes;

/* All allocation is done in fixed-size blocks, and nothing is ever
 * deallocated, so it's very easy to manage. This is your typical
 * cached blocked allocate function */
char * my_alloc(u32 size)
{
  s16 do_alloc;
  char * rv;

  /* assume the worst */
  rv = 0;

  if (mem_total > g_mem_limit) {
    printf("concordiff: g_mem_limit exceeded (%d nodes; %lld bytes used).\n",
      ((int)g_total_nodes), mem_total);
    exit(1);
  }

  /* first, see if there's a block with enough space */
  do_alloc = 0;
  if (freepool) {
    if (freesize >= size) {
      /* we're okay */
    } else {
      do_alloc = 1;
    }
  } else {
    do_alloc = 1;
  }

  /* if we need to allocate, do so */
  if (do_alloc) {
    freesize = 0;
    freepool = (char *) malloc((size_t) ALLOC_SIZE);
    /* did we get any? */
    if (freepool) {
      freesize = ALLOC_SIZE;
      mem_total += ALLOC_SIZE;
    }
  }

  /* if we have a block now, we can allocate from it */
  if (freepool) {
    rv = freepool;
    freepool += size; /* %%% how do I find out the alignment requirement
                         for pointers to structs? */
    freesize -= size;
  }

  return(rv);
}

bt_node * the_tree;
s32 g_rebal_count;
s32 g_tree_height;

/*
            3
          .' `.
         1     5
        / \   / \
       0   2 4   6

   bt_insert adds a node to the tree. It copies the given token into the
   new node and initializes that node's occurrence counts. */
void bt_insert(int f1, char * token, bt_node * * result)
{
  bt_node * it;
  s16 going, insert, fillin, report, i;
  s32 how_deep;   /* How many nodes we have descended */
  int anc, rebal; /* Type of ancestry */
  bt_node * parent;
  bt_node * gr_par;
  bt_node * gg_par;
  int cr;

  how_deep = 0;
  anc = 0;
  parent = gr_par = gg_par = 0;
  rebal = 1;
  if (result) {
    *result = 0;
  }

  fillin = 0;
  going = 0;
  insert = 0;
  it = the_tree;

  /* If there's a tree to descend, descend it. */
  if (it) {
    going = 1;
    insert = 1;
  } else {
    /* insert and copy first node */
    the_tree = (bt_node *) my_alloc(sizeof(bt_node));
    fillin = 1;
    if (the_tree == 0) {
      printf("bt_insert: unable to allocate root node\n");
      exit(1);
    }
    how_deep++;
    anc = 0;
    it = the_tree;
    it->up = 0;
    going = 0;  /* no descending to do */
    insert = 0; /* and we just inserted */
  }

  while(going) {
    cr = strncmp(token, it->tok, g_toksize);
    if (cr == 0) {
      /* Exact match. */
      going = 0;
      insert = 0;
      if (f1) {
        it->count_f1++;
      } else {
        it->count_f2++;
      }
    } else {
      /* no match yet: descend. */
      if (cr < 0) {
        /* go to left child */
        if (it->left) {
          gg_par = gr_par; gr_par = parent; parent = it;
          it = it->left;
          anc = (anc<<1) & 7;
        } else {
          /* no left: that means we insert here. */
          insert = -1;
          going = 0;
        }
      } else {
        /* go to right child */
        if (it->right) {
          gg_par = gr_par; gr_par = parent; parent = it;
          it = it->right;
          anc = ((anc<<1)+1) & 7;
        } else {
          /* no right: that means we insert here. */
          insert = 1;
          going = 0;
        }
      }
      how_deep++;

      /* check for unbalance */
      if (going && rebal && gr_par) {
        if ( (it->population + 1)*2 > gr_par->population ) {
          /* Here are the four cases (unbalanced state on top, rebalanced
             on bottom):

                anc=0          anc=1          anc=2          anc=3
                  |              |              |              |
                  G              G              G              G
                .' \           .' \            / `.           / `.
               P    d         P    d          a    P         a    P
              / \            / \                  / \            / \
            it   c          a   it              it   d          b   it
            / \                 / \             / \                 / \
           a   b               b   c           b   c               c   d

                  |              |              |              |
                  P             it             it              P
                .' `.          .' `.          .' `.          .' `.
              it     G        P     G        G     P        G     it
              / \   / \      / \   / \      / \   / \      / \    / \
             a   b c   d    a   b c   d    a   b c   d    a   b  c   d
          */
          bt_node * la;
          bt_node * lb;
          bt_node * lc;
          bt_node * ld;
          bt_node * head;

          /* printf("Rebal %d (depth %d, us %d, grp %d)\n",
               anc, how_deep, it->population, gr_par->population); */
          switch(anc & 3) {
          case 0: default:
            la=it->left; lb=it->right; lc=parent->right; ld=gr_par->right;
            /* children and parent of "it" stay the same */
            /* c becomes gr_par's left child */
            gr_par->left=lc; if (lc) { lc->up=gr_par; }
            gr_par->population = 1 + (lc ? lc->population : 0)
                                   + (ld ? ld->population : 0);
            /* it and gr_par become children of parent */
            parent->left=it; it->up=parent;
            parent->right=gr_par; gr_par->up=parent;
            parent->population = 1 + it->population + gr_par->population;
            /* parent is new head */
            head=parent;
            break;

          case 1:
            la=parent->left; lb=it->left; lc=it->right; ld=gr_par->right;
            /* a and b go to parent */
            parent->left=la; if (la) { la->up=parent; }
            parent->right=lb; if (lb) { lb->up=parent; }
            parent->population = 1 + (la ? la->population : 0)
                                   + (lb ? lb->population : 0);
            /* c and d go to gr_par */
            gr_par->left=lc; if (lc) { lc->up=gr_par; }
            gr_par->right=ld; if (ld) { ld->up=gr_par; }
            gr_par->population = 1 + (lc ? lc->population : 0)
                                   + (ld ? ld->population : 0);
            /* parent and gr_par go to "it" */
            it->left=parent; parent->up=it;
            it->right=gr_par; gr_par->up=it;
            it->population = 1 + parent->population + gr_par->population;
            /* "it" is new head */
            head=it;
            break;

          case 2:
            la=gr_par->left; lb=it->left; lc=it->right; ld=parent->right;
            /* a and b go to gr_par */
            gr_par->left=la; if (la) { la->up=gr_par; }
            gr_par->right=lb; if (lb) { lb->up=gr_par; }
            gr_par->population = 1 + (la ? la->population : 0)
                                   + (lb ? lb->population : 0);
            /* c and d go to parent */
            parent->left=lc; if (lc) { lc->up=parent; }
            parent->right=ld; if (ld) { ld->up=parent; }
            parent->population = 1 + (lc ? lc->population : 0)
                                   + (ld ? ld->population : 0);
            /* gr_par and parent go to "it" */
            it->left=gr_par; gr_par->up=it;
            it->right=parent; parent->up=it;
            it->population = 1 + gr_par->population + parent->population;
            /* "it" is new head */
            head=it;
            break;

          case 3:
            la=gr_par->left; lb=parent->left; lc=it->left; ld=it->right;
            /* children and parent of "it" stay the same */
            /* b becomes gr_par's right child */
            gr_par->right=lb; if (lb) { lb->up=gr_par; }
            gr_par->population = 1 + (la ? la->population : 0)
                                   + (lb ? lb->population : 0);
            /* gr_par and it become children of parent */
            parent->left=gr_par; gr_par->up=parent;
            parent->right=it; it->up=parent;
            parent->population = 1 + gr_par->population + it->population;
            /* parent is new head */
            head=parent;
            break;
          }

          if (gr_par == the_tree) {
            head->up = 0;
            the_tree = head;
          } else if (gg_par) {
            head->up=gg_par;
            if (anc < 4) {
              gg_par->left = head;
            } else {
              gg_par->right = head;
            }
          } else {
            printf("bt_insert: gg_par nil and head != the_tree\n");
            exit(1);
          }
          rebal = 0;
          g_rebal_count++;
        }
      }
    }
  }

  if (f1 == 0) {
    if (insert) {
      g_count_B++; /* The token is not in file 1's tree: count it as being
                      uniquely B */
    }
    insert = 0;    /* We don't insert f2 tokens */
  }

  if (insert) {
    bt_node * new_node;

    new_node = (bt_node *) my_alloc(sizeof(bt_node));
    fillin = 1;
    if (new_node == 0) {
      printf("bt_insert: unable to allocate new node\n");
      exit(1);
    }
    if (insert > 0) {
      /* Insert it as right child */
      it->right = new_node;
    } else {
      it->left = new_node;
    }
    how_deep++;
    new_node->up = it;
    while(it) {
      it->population += 1;
      it = it->up;
    }
    it = new_node;
    if (result) {
      *result = it;
    }
  }

  if (fillin) {
    /* Keep stats */

    /* copy the supplied data into the new node */
    /* First the tree management data */
    /* it->up is already filled in */
    it->right = it->left = 0;
    it->population = 1;

    /* Then the application-specific data */
    strncpy(it->tok, token, g_toksize);
    it->count_f1 = 1; /* Every node we insert is an f1 token */
    it->count_f2 = 0;

    g_total_nodes++;
  }

  if (how_deep > g_tree_height) {
    g_tree_height = how_deep;
  }
}
/* bt_first returns the first item in the tree. This is used with bt_next
   for in-order traversals. */
bt_node * bt_first(bt_node * it)
{
  while (it && (it->left)) {
    it = it->left;
  }
  return it;
}

/* bt_next traverses the tree to the next-greater item in tree order. It
   returns a pointer to the next node, or 0 if we were already at the
   last node. */
bt_node * bt_next(bt_node *it)
{
  bt_node * old;

  if (it) {
    /* if it has a right child it's relatively easy */
    if (it->right) {
      /* go down right, then down left to dead end */
      it = it->right;
      while (it->left) {
        it = it->left;
      }
    } else {
      /* here we have to traverse up until we get to a node from which we
         were the left. We also need to worry about going all the way off
         the top, which would mean we're done. */
      old = 0;
      while(it && (old == it->right)) {
        old = it;
        it = it->up;
      }
    }
  }
  return it;
}
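/* Disabled sketch of the "scan both files' keywords in the same order"
   item from the TO DO list above: if file 2's tokens were kept in a
   second tree of bt_nodes, the two trees could be walked in lockstep
   with bt_first and bt_next, the way a merge step compares two sorted
   lists. The second tree ("tree_b") and this helper are hypothetical;
   only "the_tree" exists in the program as written. */
#if 0
static void lockstep_walk(bt_node * tree_a, bt_node * tree_b)
{
  bt_node * a = bt_first(tree_a);
  bt_node * b = bt_first(tree_b);

  while (a && b) {
    int cr = strncmp(a->tok, b->tok, g_toksize);
    if (cr == 0) {
      /* token present in both files: tally the overlap here */
      a = bt_next(a);
      b = bt_next(b);
    } else if (cr < 0) {
      a = bt_next(a); /* token only in file 1 */
    } else {
      b = bt_next(b); /* token only in file 2 */
    }
  }
}
#endif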
float census(void)
{
  bt_node * it;
  u32 c1, c2;
  float AcoverB, BcoverA;
  u32 count_A, count_AB, count_B;

  count_A = count_AB = count_B = 0;
  it = bt_first(the_tree);
  while(it) {
    c1 = it->count_f1;
    c2 = it->count_f2;
    if (c1 == c2) {
      /* Same number of occurrences are found in A and in B */
      count_AB += c1;
    } else if (c1 < c2) {
      /* There is some text in B not covered by A */
      count_AB += c1;
      count_B += (c2 - c1);
    } else {
      /* There is some text in A not covered by B */
      count_AB += c2;
      count_A += (c1 - c2);
    }
    it = bt_next(it);
  }
  count_B += g_count_B;

  /* Ensure denominator is nonzero */
  count_AB++;

  AcoverB = ((float) count_AB) / ((float) (count_AB + count_B));
  BcoverA = ((float) count_AB) / ((float) (count_AB + count_A));
  if (AcoverB > BcoverA) {
    return(BcoverA);
  }
  return(AcoverB);
}

/* Clears all f2 counts */
void clear_scores(void)
{
  bt_node * it;

  it = bt_first(the_tree);
  while(it) {
    it->count_f2 = 0;
    it = bt_next(it);
  }
  g_count_B = 0;
}

s64 g_f2_bytes_read;

void do_scan(void)
{
  FILE * in1;
  FILE * in2;
  s64 f1len, f2len;
  char dbuf[MAX_TOK_SIZE+CHUNK_SIZE];
  char * db2;
  s64 bytes_read;
  s32 req, got, i;
  int flag_label, ign_label, saw_label;

  flag_label = ign_label = saw_label = 0;
  db2 = dbuf + (g_toksize - 1);
  g_result = 0; /* Default */

  in1 = fopen(g_f1name, "r");
  if (in1 == 0) {
    fprintf(stderr, "concordiff: Failed open on first input file.\n");
    exit(-1);
  }
  fseek(in1, 0, SEEK_END);
  f1len = ftell(in1);
  fclose(in1);
  if (g_verbose > 1) {
    printf("File 1: %lld octets.\n", f1len);
  }
  if (f1len < g_toksize) {
    return;
  }

  in2 = fopen(g_f2name, "r");
  if (in2 == 0) {
    fprintf(stderr, "concordiff: Failed open on second input file.\n");
    exit(-1);
  }
  fseek(in2, 0, SEEK_END);
  f2len = ftell(in2);
  fclose(in2);
  if (g_verbose > 1) {
    printf("File 2: %lld octets.\n", f2len);
  }
  if (f2len < g_toksize) {
    return;
  }

  /* ================ FILE 1 ============================================= */

  /* Read initial few characters necessary to get first token */
  in1 = fopen(g_f1name, "r");
  req = g_toksize - 1;
  got = fread((void *) dbuf, 1, req, in1);
  if (got < req) {
    fprintf(stderr, "concordiff: Error reading input 1 at byte %d\n",
      ((int)got));
    exit(1);
  }
  bytes_read = got;

  while (bytes_read < f1len) {
    req = CHUNK_SIZE;
    if (bytes_read + ((s64) req) > f1len) {
      req = f1len - bytes_read;
    }
    got = fread((void *) db2, 1, req, in1);
    if (got < req) {
      fprintf(stderr, "concordiff: Error reading input 1 at byte %ld\n",
        (long) (bytes_read+got));
      exit(1);
    }

    /* %%% Scan buffer and add nodes to binary tree */
    for(i=0; i<got; i++) {
      /* A token ends at each newly read byte; its first byte is
         (g_toksize-1) positions earlier, i.e. at dbuf+i. */
      bt_insert(1, dbuf+i, 0);
    }

    /* Slide the last (g_toksize-1) bytes to the front of the buffer so
       tokens that straddle a chunk boundary are still seen. */
    memmove(dbuf, dbuf+got, g_toksize - 1);
    bytes_read += got;
  }
  fclose(in1);

  /* ================ FILE 2 ============================================= */

  /* Read initial few characters necessary to get first token */
  in2 = fopen(g_f2name, "r");
  req = g_toksize - 1;
  got = fread((void *) dbuf, 1, req, in2);
  if (got < req) {
    fprintf(stderr, "concordiff: Error reading input 2 at byte %d\n",
      ((int)got));
    exit(1);
  }
  bytes_read = got;
  g_f2_bytes_read += got;

  while (bytes_read < f2len) {
    req = CHUNK_SIZE;
    if (bytes_read + ((s64) req) > f2len) {
      req = f2len - bytes_read;
    }
    got = fread((void *) db2, 1, req, in2);
    if (got < req) {
      fprintf(stderr, "concordiff: Error reading input 2 at byte %ld\n",
        (long) (bytes_read+got));
      exit(1);
    }
    g_f2_bytes_read += got;

    /* %%% Scan buffer and compare to binary tree */
    for(i=0; i<got; i++) {
      if (g_divider && (strncmp(dbuf+i, g_divider, g_toksize) == 0)) {
        /* Divider token: the block that just ended is one "file2".
           Report its score, then reset the file 2 counts for the next
           block. */
        g_result = census();
        if (g_verbose > 0) {
          printf("%s", "Concordance:");
        }
        printf(" %f\n", g_result);
        clear_scores();
      } else {
        bt_insert(0, dbuf+i, 0);
      }
    }

    memmove(dbuf, dbuf+got, g_toksize - 1);
    bytes_read += got;
  }
  fclose(in2);

  /* Score for the last (or only) block of file 2 */
  g_result = census();
  if (g_verbose > 0) {
    printf("%s", "Concordance:");
  }
  printf(" %f\n", g_result);
}

/* Command-line argument parsing */
int g_argc;
int g_argn;
char * * g_argv;
char * g_curarg;

int get_input(void)
{
  char *res;
  int ilen;

  if (g_argn < g_argc) {
    g_curarg = g_argv[g_argn];
    g_argn++;
    ilen = strlen(g_curarg);
    return(ilen);
  }

  /* No arguments left. */
  ilen = 0;
  return(ilen);
}
int main(int argc, char** argv)
{
  int cmd, num;
  int going, ilen;
  int p1;
  struct rusage ru_before, ru_after;
  struct timeval tv_diff;

  g_argc = argc;
  g_argn = 1;
  g_argv = argv;

  /* Initialize memory allocation */
  freepool = 0;
  mem_total = 0;
  g_mem_limit = 512LL * 1048576LL; /* 512 MiB */
  g_total_nodes = 0;

  /* Initialize data structure */
  the_tree = 0;
  g_rebal_count = 0;
  g_tree_height = 0;
  g_count_B = 0;

  /* Default argument values */
  g_toksize = 5;
  g_f1name = 0;
  g_f2name = 0;
  g_verbose = 1; /* Normal amount of printing */
  g_divider = 0;

  /* printf("concordiff: starting.\n"); */

  /* Get our arguments */
  going = 1;
  while(going) {
    ilen = get_input();
    if (ilen == 0) {
      going = 0;
    } else if (strncmp(g_curarg, "-q", ilen)==0) {
      g_verbose = 0;
    } else if (strncmp(g_curarg, "-v", ilen)==0) {
      g_verbose = 2;
    } else if (   (strncmp(g_curarg, "-help", ilen)==0)
               || (strncmp(g_curarg, "-h", ilen)==0)
               || (strncmp(g_curarg, "--help", ilen)==0)
               || (strncmp(g_curarg, "--h", ilen)==0) ) {
      help();
      return(0);
    } else if (num = sscanf(g_curarg, "-n%d", &p1),
               ((num == 1) && (p1 > 0))) {
      if (p1 > MAX_TOK_SIZE) {
        printf("concordiff: Tokensize cannot be greater than %d\n",
          MAX_TOK_SIZE);
        p1 = MAX_TOK_SIZE;
      }
      if (g_verbose > 1) {
        printf("Selected %d-character token size.\n", p1);
      }
      g_toksize = p1;
    } else if (strncmp(g_curarg, "-d", ilen)==0) {
      ilen = get_input();
      if (ilen) {
        g_divider = g_curarg;
        if (g_verbose > 1) {
          printf("Using divider string '%s'.\n", g_divider);
        }
      } else {
        fprintf(stderr, "concordiff: -d requires a string argument.\n");
        help();
        exit(1);
      }
    } else if (g_f1name == 0) {
      g_f1name = g_curarg;
      if (g_verbose > 1) {
        printf("First filename: '%s'\n", g_f1name);
      }
    } else if (g_f2name == 0) {
      g_f2name = g_curarg;
      if (g_verbose > 1) {
        printf("Second filename: '%s'\n", g_f2name);
      }
    } else {
      printf("Unknown parameter '%s'\n", g_curarg);
    }
  }

  if (g_f2name == 0) {
    fprintf(stderr, "concordiff: Need two filename arguments.\n");
    help();
    return(1);
  }
  if (g_divider && (strlen(g_divider) != g_toksize)) {
    fprintf(stderr,
      "concordiff: Divider should be the same length as the token size.\n");
    help();
    return(1);
  }

  {
    getrusage(RUSAGE_SELF, &ru_before);

    do_scan();

    getrusage(RUSAGE_SELF, &ru_after);
    tv_diff.tv_sec = ru_after.ru_utime.tv_sec - ru_before.ru_utime.tv_sec;
    tv_diff.tv_usec = ru_after.ru_utime.tv_usec - ru_before.ru_utime.tv_usec;
    if (tv_diff.tv_usec < 0) {
      tv_diff.tv_usec += 1000000;
      tv_diff.tv_sec --;
    }
  }

  if (g_verbose > 1) {
    printf("Statistics:\n");
    printf("Used %lld bytes for %ld nodes\n", mem_total,
      ((long)g_total_nodes));
    printf("Tree height %d, rebalanced %ld times.\n",
      (int)g_tree_height, (long)g_rebal_count);
    printf("Read from stream2: %lld octets\n", g_f2_bytes_read);
    printf("Time: %ld.%06d s\n",
      (long) tv_diff.tv_sec, (int) tv_diff.tv_usec);
  }

  return(0);
}
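/* Example invocations (illustrative; the filenames are hypothetical):

     concordiff fileA.txt fileB.txt
         Compare the two files with the default 5-byte token size and
         print a single concordance score.

     concordiff -n4 -v fileA.txt fileB.txt
         The same, with 4-byte tokens and performance statistics.

     concordiff -d @@@@@ fileA.txt corpus.txt
         Treat corpus.txt as a sequence of blocks separated by the 5-byte
         divider "@@@@@", printing one concordance score per block.
 */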