Skip to main content
Previous sectionNext section

Dominance and Proximity

The iKnow semantics package consists of two classes: %iKnow.Semantics.DominanceAPI and %iKnow.Semantics.ProximityAPI. These classes with their parameters and methods are described in the InterSystems Class Reference.

This chapter describes:

Semantic Dominance

Semantic dominance is the overall importance of an entity within a source. iKnow determines semantic dominance by performing the following tests and obtaining statistical results:

  • The number of times that an entity appears in the source (the frequency).

  • The number of times that each component word of an entity appears in the source.

  • The number of words in the entity.

  • The type of entity (concept or relation).

  • The diversity of entities in the source.

  • The diversity of component words in the source.

Note:

Japanese uses a different, language-specific set of statistics to calculate the semantic dominance of an entity. See iKnow Japanese.

iKnow generates these values when the sources are loaded as part of iKnow indexing. It combines these values to produce a dominance score for each entity. The higher the dominance score, the more dominant the entity is within the source.

For example, for the concept “cardiovascular surgery” to be semantically dominant in a document, a statistical analysis would be performed. It determines that this concept appears 20 times in the source. A dominant concept is not the same as a top concept. In this source the concepts “doctor” (60 times), “surgery” (50 times), “operating room” (40 times), and “surgical procedure” (30 times) are far more common. However, the component words of “cardiovascular surgery” appear over twice as many times as the concept itself: “cardiovascular” (50 times) and “surgery” (80 times), which lends support to this being a dominant concept. In contrast, the concept “operating room” appears 40 times, but its component words appear barely more often than the concept: “operating” 60 times and “room” only 45 times. This indicates that this source is much more concerned with cardiovascular matters and surgery than it is with rooms.

iKnow gives these frequency counts greater or lesser weight based on the number of words in the original entity and whether that entity is a concept or a relation. (There is a much smaller number of commonly-occurring relations than concepts.)

However, to determine how dominant a concept really is, iKnow has to compare it to the total number of concepts in the source. If 5% of the concepts in the source contain the words “cardiovascular” and “surgery”, and these words do not combine in other concepts nearly as frequently as they do together, we know that these words not only appear frequently in the source, but that the source does not have a wide range of subject matter. If, however, the source contains a nearly equal occurrence of the word “surgery” in concepts with “hand”, “kidney” and “brain” and the word “cardiovascular” appears nearly as often with the words “exercise” and “diet” it is apparent that the source contains a wide range of subject matter. The concept “cardiovascular surgery” and its component words may appear more frequently than others, but may not significantly dominate the subject matter of the source.

By performing these statistical calculations, iKnow can determine the dominant concepts in a source — the subjects that are of greatest interest to you. iKnow performs this analysis without using an external reference corpus (such as pre-existing table of the relative frequency of words in a “typical” medical text). iKnow determines dominance using only the contents of the actual source text, and thus can be used on sources on any topic without any prior knowledge of the subject matter.

Dominance in Context

iKnow calculates the dominance of entities (concepts and relations) within a source and assigns each an integer value. It assigns the most dominant concept(s) in a source a dominance value of 1000. You can use the Indexing Results tool to list the dominance values for concepts in a single source.

iKnow calculates the dominance of CRCs within a source. The algorithm uses the dominance values of the entities within the CRC. CRC dominance values are intended only for comparison with other CRC dominance values; CRC dominance values should not be compared with entity dominance values. You can use the Indexing Results tool to list the dominance values for CRCs in a single source.

iKnow calculates the dominance of concepts across all loaded sources as a weighted average. These concept dominance scores are fractional numbers, with the largest possible number being 1000. You can use the Knowledge Portal tool Dominant Concepts option to list the dominance scores for concepts in all loaded sources. You can also use the %iKnow.Semantics.DominanceAPI GetTop() method to list the dominance scores for concepts in all loaded sources.

Concepts of Semantic Dominance

The following are the key elements of semantic dominance:

  • Profile: the counts of elements that are used to calculate a dominance score.

  • Typical: a typical source is a source in which the dominant entities in that source are most similar to the dominant elements of the group of sources. This is the opposite of Breaking.

  • Breaking: a breaking source is a source in which the dominant entities in that source are least similar to the dominant elements of the group of sources. This is the opposite of Typical. For example, a breaking news story would likely be least similar to the dominant entities in all of the news stories from the previous month.

  • Overlap: the number of occurrences of an entity in different sources.

  • Correlation: a comparison of the entities in a source with a list of entities, returning a correlation percentage for each source.

Semantic Dominance Examples

This chapter describes and provides examples for the following semantic dominance queries:

Dominance Profile Counts

The following example uses the GetProfileCountByDomain() method to return the total count of unique values for an entity type in the specified domain. The available entity types are: 0=concept ($$$SDCONCEPT or $$$SDENTITY); 1=relation ($$$SDRELATION); 2=CRC ($$$SDCRC); 4=aggregate ($$$SDAGGREGATE, the default) which is the total of all concepts, relations, and CRCs.

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
ProfileCounts
  WRITE ##class(%iKnow.Semantics.DominanceAPI).GetProfileCountByDomain(domId,$$$SDCONCEPT)," total concepts",!
  WRITE ##class(%iKnow.Semantics.DominanceAPI).GetProfileCountByDomain(domId,$$$SDRELATION)," total relations",!
  WRITE ##class(%iKnow.Semantics.DominanceAPI).GetProfileCountByDomain(domId,$$$SDCRC)," total CRCs",!
  WRITE ##class(%iKnow.Semantics.DominanceAPI).GetProfileCountByDomain(domId)," aggregate total",!
Copy code to clipboard

Concepts with Top Dominance Scores in the Domain

The GetTop() method returns the top concepts (or relations) by dominance scores for all loaded sources.

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCount
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!!
DominantConcepts
  DO ##class(%iKnow.Semantics.DominanceAPI).GetTop(.profresult,domId,1,50)
  WRITE "Top Concepts in Domain by Dominance Score",!
  SET j=1
  WHILE $DATA(profresult(j),list) {
     WRITE $LISTGET(list,2)
     WRITE ": ",$LISTGET(list,3),!
     SET j=j+1 }
  WRITE !,"Printed ",j-1," dominant concepts"
Copy code to clipboard

CRC Dominance Scores in the Domain

The GetProfileByDomain() method returns the dominance scores for CRCs in a domain. For each entity returns the following:

  • The entity source Id, a unique integer.

  • The entity value (the external Id). Because this method can return not only single concepts, but CRCs made up of multiple concepts and relations, this value is returned as a list structure.

  • The entity type, an integer code: 0=concept ($$$SDCONCEPT or $$$SDENTITY, the default); 1=relation ($$$SDRELATION); 2=CRC ($$$SDCRC). The value 4=aggregate ($$$SDAGGREGATE) returns first all concepts, then all relations, then all CRCs.

    Note:

    Paths are not supported at this time

  • The calculated dominance value. Dominance values for concepts and relations are always integers. Dominance values for CRCs may be integers, or fractional numbers, such as 3480.5, that are half of an integer.

The following example uses the GetProfileByDomain() method to return the dominant profile entities, in this case CRCs. For each CRC it returns the value, the source Id, the type, and the dominance score. Because they are all CRCs, the type is always 2:

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
DominanceProfile
  DO ##class(%iKnow.Semantics.DominanceAPI).GetProfileByDomain(.profresult,domId,1,20,$$$SDCRC)
  SET j=1
  WHILE $DATA(profresult(j),list) {
      WRITE " ",$LISTTOSTRING($LISTGET(list,2))
      WRITE " [Id:",$LISTGET(list,1)
      WRITE " type:",$LISTGET(list,3)
      WRITE " Dominance:",$LISTGET(list,4),"]",!
      SET j=j+1 }
  WRITE !,"Printed ",j-1," dominant profile CRCs"
Copy code to clipboard

Dominance Score for a Specified Entity

iKnow uses the GetDomainValue() method to return the dominance value for a specified entity. You specify the entity by its entity Id (a unique integer), and specify the entity type by a numeric code. The default entity type is 0 (concept).

A single set of unique entity Ids is used for concepts and relations; therefore, no concept has the same entity Id as a relation. A separate set of unique entity Ids is used for CRCs; therefore, a CRC may have the same entity Id as a concept or a relation. This numbering is entirely coincidental; there is no connection between the entities.

The following example takes the top 12 entities and determines the dominance score for each entity. As one can see from the results of this example, the top (most frequently occurring) entities do not necessarily correspond to the entities with the highest dominance scores:

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
TopEntitiesDominanceScores
  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,12)
  SET i=1
  WHILE $DATA(result(i)) {
       SET topstr=$LISTTOSTRING(result(i),",",1)
       SET topid=$PIECE(topstr,",",1)
       SET val=$PIECE(topstr,",",2)
         SET spc=25-$LENGTH(val)
       WRITE val
       WRITE $JUSTIFY("top=",spc),i
       WRITE " dominance="
       WRITE ##class(%iKnow.Semantics.DominanceAPI).GetDomainValue(domId,topid,0),!
       SET i=i+1 }
  WRITE "Top ",i-1," entities and their dominance scores"
Copy code to clipboard

Overlap Between Sources

The following example uses the GetOverlap() method to return the overlap score for each entity. This score is the count of overlapping occurrences in all of sources in the domain for that entity. Note that before you can invoke GetOverlap(), you must first invoke BuildOverlap(). When displaying the results, type=0 (concepts) for all of the returned values in this example and is therefore omitted from the display; the $JUSTIFY function is used in this example to align display of the overlap counts:

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
Overlap
  DO ##class(%iKnow.Semantics.DominanceAPI).BuildOverlap(domId)
  DO ##class(%iKnow.Semantics.DominanceAPI).GetOverlap(.result,domId,1,17)
DisplayOverlapResults
  SET i=1
  WHILE $DATA(result(i)) {
       SET datastr=$LISTTOSTRING(result(i),",",1)
          SET spc=40-$LENGTH(datastr)
       WRITE $PIECE(datastr,",",1)," "
       WRITE $LISTTOSTRING($PIECE(datastr,",",2))
       /* WRITE $PIECE(datastr,",",3)," " */
          WRITE $JUSTIFY(" overlap=",spc)
       WRITE $PIECE(datastr,",",4),!
       SET i=i+1 }
Copy code to clipboard

Typical Sources

The following example lists the ten most typical sources with their dominance scores. In order to retrieve typical sources you must issue the GetOverlap() method, followed by the FindMostTypicalSources() and GetTypicalSources() methods.

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQueries
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
DominanceTypicalSources
  SET dstat = ##class(%iKnow.Semantics.DominanceAPI).BuildOverlap(domId)
  IF dstat '=1 {WRITE "BuildOverlap() error"  QUIT}
  SET dstat = ##class(%iKnow.Semantics.DominanceAPI).FindMostTypicalSources(domId)
  IF dstat '=1 {WRITE "FindMostTypicalSources() error"  QUIT}
  SET dstat = ##class(%iKnow.Semantics.DominanceAPI).GetTypicalSources(.srcresult,domId,1,10)
  IF dstat '=1 {WRITE "GetTypicalSources() error"  QUIT}
  SET k=1
  WHILE $DATA(srcresult(k)) {
        WRITE $LISTTOSTRING(srcresult(k)),!
        SET k=k+1 }
Copy code to clipboard

Breaking Sources

Breaking sources are those sources that differ significantly from the other sources in the domain (or filtered subset of the domain). In order to retrieve breaking sources you must issue the GetOverlap() method, followed by the FindBreakingSources() and GetBreakingSources() methods.

In the following program, after identifying each breaking source, the program uses GetBySource() to return the dominant entities for each breaking source. For each entity it returns:

  • the entity Id, which is a unique integer for that entity type. However, because GetBySource() returns multiple entity types (concepts, relations, CRCs), this entity Id is only unique and meaningful within the specified type.

  • the entity value, which can be a single concept, or a list of concepts and relations. This list can be a CRC. (To simplify the display, the entity value is here commented out.)

  • the entity type: 0 for a concept, 1 for a relation, 3 for a CRC, or an odd number > 3 for a path. The path value corresponds to the number of elements in the path. If the number of elements in the path is an even number, the entity type is the next larger odd number.

  • the dominance value, which is a calculated positive number greater than 1. The higher the dominance value, the more dominant the entity is within that source. The dominance value for a concept or relation is an integer; the dominance value for a CRC or path may contain a decimal fraction.

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQueries
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
DominanceBreakingSources
  SET dstat = ##class(%iKnow.Semantics.DominanceAPI).BuildOverlap(domId)
  IF dstat '=1 {WRITE "BuildOverlap() error"  QUIT}
  SET dstat = ##class(%iKnow.Semantics.DominanceAPI).FindBreakingSources(domId)
  IF dstat '=1 {WRITE "FindBreakingSources() error"  QUIT}
  DO ##class(%iKnow.Semantics.DominanceAPI).GetBreakingSources(.profresult,domId,1,10)
  SET j=1,k=1
  WHILE $DATA(profresult(j),srclist) {
    SET src = $LISTGET(srclist)
    WRITE !,"Source id: ",src,!
    DO ##class(%iKnow.Semantics.DominanceAPI).GetBySource(.srcresult,domId,src,1,10)
    WHILE $DATA(srcresult(k),list) {
          WRITE "id:",$LISTGET(list)
          /* WRITE "  values:",$LISTTOSTRING($LISTGET(list,2)) */
          WRITE "  type:",$LISTGET(list,3),"  dominance:",$LISTGET(list,4),!
          SET k=k+1 }
    SET k=1
    SET j=j+1 }
  WRITE !!,"Printed ",j-1," breaking sources"
Copy code to clipboard

Sources By Correlation

GetSourcesByCorrelation() returns those sources that correlate to a list of entities, calculating the a percentage of correlation between the source’s entities and the list entities. A correlation percentage of 1 (100%) means that there are as many correlations in the source as there are listed entities; it does not mean that every listed entity appears in that source. For this reason, it is possible to return a correlation percentage greater that 1. Sources with 0% correlation are not listed.

In order to retrieve sources by correlation you must issue the GetOverlap() method, followed by the GetSourcesByCorrelation() method. You can supply a filter to GetOverlap() to limit the set of sources to be tested for correlation.

The following example tests sources by comparing them to a list of concepts that refer to helicopters. It returns the source ID, the external Id, and the percentage of correlation:

#Include %IKPublic
  ZNSPACE "Samples"
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
SourcesByCorrelation
  DO ##class(%iKnow.Semantics.DominanceAPI).BuildOverlap(domId)
  SET heliterms=$LB(796,1048,1706,2484,2571,2637,2642,2653,3284,3987,4037,4038,4054,4957)
  DO ##class(%iKnow.Semantics.DominanceAPI).GetSourcesByCorrelation(.result,domId,heliterms)
DisplaySourcesByCorrelation
  SET i=1
  WHILE $DATA(result(i)) {
       WRITE $LISTTOSTRING(result(i),",",1),!
       SET i=i+1 }
Copy code to clipboard

Semantic Proximity

Semantic proximity is a calculation of the semantic “distance” between two entities within a sentence. The higher the proximity integer, the closer the entities.

As a demonstration of this semantic distance, given the sentence:

“The giraffe walked with long legs to the base of the tree, then stretched his long neck
 up to reach the lowest leaves.”

the proximity of the concept “giraffe” might be as follows: long legs=64, base=42, tree=32, long neck=25, lowest leaves=21.

Semantic proximity is calculated for each entity in each sentence, then these generated proximity scores are added together producing an overall proximity score for each entity for the entire set of source texts. For example, given the sentences:

“The giraffe walked with long legs to the base of the tree, then stretched his long neck
 up to reach the lowest leaves. Having eaten, the giraffe bent his long legs and stretched
 his long neck down to drink from the pool.”

the proximity of the concept “giraffe” might be as follows: long legs=128, long neck=67, base=42, tree=32, pool=32, lowest leaves=21.

Entity proximity is commutative; this means that the proximity of entity1 to entity2 is the same as the proximity of entity2 to entity1. iKnow does not calculate a semantic proximity of an entity to itself. For example, the sentence “The boy told a boy about another boy.” would not generate any proximity scores, but the sentence “The boy told a younger boy about another small boy.” generates the proximity scores younger boy=64, small boy=42. If the same entity appears multiple times in a sentence, the proximity score is additive. For example, the proximity for the concept “girl” in the sentence “The girl told the boy about another boy.” is boy=106, the total the two proximity scores 64 and 42.

Japanese Semantic Proximity

iKnow semantic analysis of Japanese uses an algorithm to create Entity Vectors. An entity vector is an ordering of entities in the sentence that follow a predefined logical sequence. When iKnow converts a Japanese sentence into an entity vector it commonly rearranges the order of entities. Semantic proximity for Japanese uses the entity vector entity order, not the original sentence entity order.

Proximity Examples

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following example uses the GetProfile() method to return the proximity of the concept “student pilot” to other concepts in sentences in all of the sources in the domain. GetProfile() supports filters and blacklists:

#Include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
ProximityForEntity
    SET entity="student pilot"
    DO ##class(%iKnow.Semantics.ProximityAPI).GetProfile(.eresult,domId,entity,1,20)
       SET k=1
       WHILE $DATA(eresult(k)) {
          SET item=$LISTTOSTRING(eresult(k))
          WRITE $PIECE(item,",",1)," ^ "
          WRITE $PIECE(item,",",2)," ^ "
          WRITE $PIECE(item,",",3),!
          SET k=k+1 }
    WRITE !,"all done"
Copy code to clipboard

The following example uses the GetProfileBySourceId() method to list the concepts with the greatest proximity to a given entity for each source. Each concept is listed by entity Id, value, and proximity score:

#Include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET totsrc=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
GetEntityID
  SET entId=##class(%iKnow.Queries.EntityAPI).GetId(domId,"student pilot")
QueryBySource
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,totsrc)
  SET j=1,k=1
  WHILE $DATA(result(j),srclist) {
    SET src = $LISTGET(srclist)
    WRITE !,"Source id: ",src,!
    SET entity="student pilot"
    DO ##class(%iKnow.Semantics.ProximityAPI).GetProfileBySourceId(.srcresult,domId,entId,src,1,totsrc)
       WHILE $DATA(srcresult(k)) {
          SET item=$LISTTOSTRING(srcresult(k))
          WRITE $PIECE(item,",",1)," ^ "
          WRITE $PIECE(item,",",2)," ^ "
          WRITE $PIECE(item,",",3),!
          SET k=k+1 }
    SET k=1
    SET j=j+1 }
  WRITE !!,"Printed all ",j-1," sources"
Copy code to clipboard