Is there a bug in the SCWS scws_get_words function?


SCWS is a very good Chinese word segmentation library, and its PHP extension makes Chinese word segmentation easy to handle. I have run into a problem with scws_get_words, the function used to get the segmentation result. Its second argument specifies which parts of speech the returned result should cover. Here is its description in the C API documentation (the PHP version is similar).

scws_top_t scws_get_words(scws_t s, char *xattr);
Description: returns the keyword list for the specified parts of speech; words are inserted in the order in which they appear. The xattr parameter describes the parts of speech to exclude from or include in the statistics, separated by commas. If it begins with ~, the result does not contain these parts of speech; otherwise the result must contain them. Passing NULL means all parts of speech.
Return value: returns the head pointer of the word list, which must be released by calling scws_free_tops.
Error: none
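The include/exclude semantics described above can be sketched in a few lines of C. This is only an illustration of the documented behavior, not SCWS code: `attr_in_list` and `keep_word` are hypothetical helpers, and the `exclude` flag stands in for an xattr that begins with ~.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of SCWS): is attr one of the listed parts of speech? */
static bool attr_in_list(const char *attr, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(attr, list[i]) == 0)
            return true;
    }
    return false;
}

/* Hypothetical helper (not part of SCWS): decide whether a word is kept.
 * list == NULL models xattr == NULL (all parts of speech);
 * exclude models an xattr that begins with '~'. */
static bool keep_word(const char *attr, const char *const *list, size_t n, bool exclude)
{
    if (list == NULL || n == 0)
        return true;
    bool listed = attr_in_list(attr, list, n);
    return exclude ? !listed : listed;
}
```

The filter check inside scws_get_words follows the same pattern: in include mode a word is skipped when its attribute is not in the list, and in exclude mode it is skipped when the attribute is in the list.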

In other words, I just need to pass a comma-separated list as the second argument. For example, I pass '~Ag,~a,~ad,~b,~c,~Dg,~d,~e' to indicate the parts of speech I want filtered out of the result.

But in practice, no matter how many filter conditions I add, the filtering does not work. If I add only one filter condition, such as '~a', the result is filtered correctly. So I think there is a bug here. Here is the C implementation of this function; let's take a look.

// get words by attr (rand order)
scws_top_t scws_get_words(scws_t s, char *xattr)
{
    int off, cnt, xmode = SCWS_NA;
    xtree_t xt;
    scws_res_t res, cur;
    scws_top_t top, tail, base;
    char *word;
    word_attr *at = NULL;

    if (!s || !s->txt || !(xt = xtree_new(0,1)))
        return NULL;

    __PARSE_XATTR__;

    // save the offset.
    off = s->off;
    s->off = 0;
    base = tail = NULL;
    while ((cur = res = scws_get_result(s)) != NULL)
    {
        do
        {
            /* check attribute filter */
            if (at != NULL)
            {
                if ((xmode == SCWS_NA) && !_attr_belong(cur->attr, at))
                    continue;
                if ((xmode == SCWS_YEA) && _attr_belong(cur->attr, at))
                    continue;
            }

            /* put to the stats */
            if (!(top = xtree_nget(xt, s->txt + cur->off, cur->len, NULL)))
            {
                top = (scws_top_t) malloc(sizeof(struct scws_topword));
                top->weight = cur->idf;
                top->times = 1;
                top->next = NULL;
                top->word = (char *)_mem_ndup(s->txt + cur->off, cur->len);
                strncpy(top->attr, cur->attr, 2);
                // add to the chain
                if (tail == NULL)
                    base = tail = top;
                else
                {
                    tail->next = top;
                    tail = top;
                }
                xtree_nput(xt, top, sizeof(struct scws_topword), s->txt + cur->off, cur->len);
            }
            else
            {
                top->weight += cur->idf;
                top->times++;
            }
        }
        while ((cur = cur->next) != NULL);
        scws_free_result(res);
    }

    // free at & xtree
    if (at != NULL)
        free(at);
    xtree_free(xt);

    // restore the offset
    s->off = off;
    return base;
}

I found a problem in its __PARSE_XATTR__ macro. Here is the macro, together with the definition of word_attr.

/* macro to parse xattr -> xmode, at */
#define __PARSE_XATTR__ do { \
    if (xattr == NULL) break; \
    if (*xattr == '~') { xattr++; xmode = SCWS_YEA; } \
    if (*xattr == '\0') break; \
    cnt = ((strlen(xattr)/2) + 2) * sizeof(word_attr); \
    at = (word_attr *) malloc(cnt); \
    memset(at, 0, cnt); \
    cnt = 0; \
    for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
        strncpy(at[cnt], xattr, 2); \
        xattr = word + 1; \
    } \
    strncpy(at[cnt], xattr, 2); \
} while (0)

typedef char word_attr[4];

As written, xattr is parsed correctly only when every part of speech is exactly 2 characters long, because the loop uses strncpy(at[cnt], xattr, 2);. That is too strict: for a 1-character part of speech, the comma that follows it is copied along with it.
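The effect can be reproduced outside SCWS. The sketch below copies the loop of __PARSE_XATTR__ into a standalone function; the name `parse_xattr_old` and the `count` output parameter are made up for this illustration, and the leading ~ handling is left out.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical standalone copy of the original parsing loop.
 * Returns a zero-initialized array of entries; the caller must free() it. */
static word_attr *parse_xattr_old(const char *xattr, int *count)
{
    const char *word;
    int cnt;
    size_t size = ((strlen(xattr) / 2) + 2) * sizeof(word_attr);
    word_attr *at = malloc(size);

    if (at == NULL)
        return NULL;
    memset(at, 0, size);

    for (cnt = 0; (word = strchr(xattr, ',')) != NULL; cnt++) {
        /* Always copies 2 characters, so "a,ad" stores "a," in this entry. */
        strncpy(at[cnt], xattr, 2);
        xattr = word + 1;
    }
    /* Last entry: strncpy stops at the terminating NUL, so it is stored correctly. */
    strncpy(at[cnt], xattr, 2);

    *count = cnt + 1;
    return at;
}
```

For the input "a,ad,b" the entries come out as "a,", "ad" and "b". Only the last entry is safe for a 1-character part of speech, which is consistent with a single filter such as '~a' working while a longer list fails.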

I tried filtering with only 2-character parts of speech, and sure enough it worked. Now I am considering how to fix it.

1 Answer


After I contacted the author, hightman provided a patch that changes the macro definition.

diff -c -r1.28 -r1.29
*** libscws/scws.c 5 Aug 2011 04:39:33 -0000 1.28
--- libscws/scws.c 26 Oct 2011 08:41:44 -0000 1.29
***************
*** 1278,1284 ****
      memset(at, 0, cnt); \
      cnt = 0; \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!         strncpy(at[cnt], xattr, 2); \
          xattr = word + 1; \
      } \
      strncpy(at[cnt], xattr, 2); \
--- 1278,1285 ----
      memset(at, 0, cnt); \
      cnt = 0; \
      for (cnt = 0; (word = strchr(xattr, ',')); cnt++) { \
!         at[cnt][0] = *xattr++; \
!         at[cnt][1] = xattr == word ? '\0' : *xattr; \
          xattr = word + 1; \
      } \
      strncpy(at[cnt], xattr, 2); \
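To see what the patched loop does, it can be lifted into a standalone function. This is a sketch for illustration only: the name `parse_xattr_patched` and the `count` output parameter are assumptions, and the leading ~ handling is omitted.

```c
#include <stdlib.h>
#include <string.h>

typedef char word_attr[4];

/* Hypothetical standalone copy of the patched parsing loop.
 * Returns a zero-initialized array of entries; the caller must free() it. */
static word_attr *parse_xattr_patched(const char *xattr, int *count)
{
    const char *word;
    int cnt;
    size_t size = ((strlen(xattr) / 2) + 2) * sizeof(word_attr);
    word_attr *at = malloc(size);

    if (at == NULL)
        return NULL;
    memset(at, 0, size);

    for (cnt = 0; (word = strchr(xattr, ',')) != NULL; cnt++) {
        at[cnt][0] = *xattr++;
        /* If the next character is already the comma, the entry has 1 character. */
        at[cnt][1] = (xattr == word) ? '\0' : *xattr;
        xattr = word + 1;
    }
    strncpy(at[cnt], xattr, 2);

    *count = cnt + 1;
    return at;
}
```

With the input "a,ad,b" the entries are now "a", "ad" and "b", so 1-character and 2-character parts of speech can be mixed in one list.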
...