Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. You may also have to click the broom icon in the upper right-hand corner of the window. This clears any old data sets and gives us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options message=FALSE and echo=FALSE to avoid cluttering your solutions with all the output from this code.

Associated Press

We are going to look at two different data sets today. The first is a set of news articles from the Associated Press. There is no predictive task; our only goal is to apply unsupervised techniques to the data to understand the structure and themes of the collection. Start by loading the data:

docs <- read_csv(file.path("..", "data", "ap.csv.bz2"))
anno <- read_csv(file.path("..", "data", "ap_tokens.csv.bz2"))

Now, create a dataset (give it a name; you’ll need it later) containing the first two principal components computed from the nouns and verbs in the data. Set min_df to zero to avoid errors. Finally, add k-means clustering with 20 clusters, and add a column called train_id that is always equal to “train”.

# Question 01
dt <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_pca(min_df = 0) %>%
  dsst_kmeans(n_clusters = 20) %>%
  mutate(train_id = "train")
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
dt
## # A tibble: 2,226 × 5
##    doc_id              v1         v2 cluster train_id
##    <chr>            <dbl>      <dbl>   <dbl> <chr>   
##  1 AP881218-0003 -0.0159  -0.0205         12 train   
##  2 AP880224-0195 -0.0265  -0.00433         1 train   
##  3 AP881017-0144 -0.0208  -0.0176         12 train   
##  4 AP881017-0219 -0.0162  -0.00470         6 train   
##  5 AP900117-0022 -0.00764 -0.00522        14 train   
##  6 AP880405-0167 -0.0323  -0.0197         18 train   
##  7 AP880825-0239 -0.0108  -0.0000505      17 train   
##  8 AP880325-0232 -0.0187  -0.0118          2 train   
##  9 AP881105-0097 -0.0281  -0.0174          3 train   
## 10 AP880716-0112 -0.0117  -0.0102         10 train   
## # … with 2,216 more rows
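
The internals of dsst_pca and dsst_kmeans are not shown in this notebook, so the following is only a rough base-R sketch of the same idea on invented data: build a document-term count matrix, keep the first two principal components, and cluster the documents on those two coordinates. The toy tokens, the use of raw counts rather than any weighting, and the choice of two clusters are all assumptions made for illustration.

```r
# Toy token table: one row per (document, lemma) occurrence.
# All names and values here are invented for illustration.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2",
             "d3", "d3", "d3", "d4", "d4", "d4"),
  lemma  = c("stock", "price", "market", "stock", "market", "share",
             "police", "arrest", "trial", "police", "trial", "charge")
)

# Document-term matrix of raw counts (rows = documents, columns = lemmas).
dtm <- unclass(table(tokens$doc_id, tokens$lemma))

# First two principal components of the centered count matrix.
pca <- prcomp(dtm, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# k-means on the PCA scores; set.seed makes the random starts reproducible.
set.seed(1)
fit <- kmeans(scores, centers = 2, nstart = 10)

# Assemble a table shaped like the one printed above.
result <- data.frame(
  doc_id = rownames(scores),
  v1 = scores[, 1],
  v2 = scores[, 2],
  cluster = fit$cluster,
  train_id = "train"
)
print(result)
```

In this toy example the two finance documents and the two crime documents should land in separate clusters, which is the same kind of grouping the real pipeline is meant to find across the collection.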

Next, compute the size of each cluster. Notice that these are not equally sized, but the variation should be within an order of magnitude.

# Question 02
dt %>%
  group_by(cluster) %>%
  summarize(n = n())
## # A tibble: 20 × 2
##    cluster     n
##      <dbl> <int>
##  1       1    90
##  2       2   170
##  3       3   153
##  4       4    57
##  5       5    53
##  6       6   122
##  7       7    75
##  8       8    44
##  9       9    82
## 10      10   227
## 11      11   119
## 12      12   167
## 13      13    47
## 14      14   236
## 15      15    39
## 16      16    61
## 17      17   130
## 18      18    82
## 19      19   194
## 20      20    78
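
One quick way to check the "within an order of magnitude" claim is to compare the largest and smallest cluster sizes. The sketch below copies the counts from the output above into a vector; the variable names are just for illustration.

```r
# Cluster sizes copied from the Question 02 output above.
sizes <- c(90, 170, 153, 57, 53, 122, 75, 44, 82, 227,
           119, 167, 47, 236, 39, 61, 130, 82, 194, 78)

# Ratio of the largest to the smallest cluster; a value below 10 means
# the sizes stay within one order of magnitude.
size_ratio <- max(sizes) / min(sizes)
print(round(size_ratio, 2))  # 6.05

# Sanity check: the sizes should add up to the number of documents.
print(sum(sizes))            # 2226
```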

Using the dsst_metrics function, compute the nouns and verbs most associated with each cluster. Print out the top 10 words for each cluster. Take a few moments to look at the results and try to make sense of them.

# Question 03
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(dt, label_var = "cluster") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 10) %>%
  summarize(words = paste0(token, collapse = " | ")) %>%
  getElement('words')
##  [1] "aid | budget | contract | bill | industry | poll | spending | cut | campaign | voter"                                 
##  [2] "violation | employer | arbitration | platform | subcommittee | plank | contractor | inspector | plant | reimbursement"
##  [3] "election | party | campaign | union | aid | vote | state | candidate | senator | governor"                            
##  [4] "police | government | policeman | arrest | kill | rebel | violence | people | soldier | attack"                       
##  [5] "cent | future | price | quarter | soybean | bushel | percent | market | rate | oil"                                   
##  [6] "facility | hormone | massage | milk | recycling | waste | record | cigarette | nicotine | pesticide"                  
##  [7] "trade | trust | tax | percent | payment | subsidy | farm | export | asbestos | farmer"                                
##  [8] "deficit | tax | budget | cut | company | trade | plan | taxis | spending | percent"                                   
##  [9] "percent | company | customer | business | bushel | sale | debt | stock | price | farmland"                            
## [10] "child | tunnel | parent | water | pilot | foot | training | chess | boat | lake"                                      
## [11] "police | arrest | kill | student | Police | death | wound | soldier | trial | murder"                                 
## [12] "police | charge | trial | kill | hijacker | man | hostage | militia | fire | prison"                                  
## [13] "percent | sale | share | price | company | rise | oil | analyst | offer | economist"                                  
## [14] "shuttle | spacecraft | telescope | mall | test | planet | scientist | transplant | theater | play"                    
## [15] "percent | rate | price | rise | market | cent | month | inflation | point | index"                                    
## [16] "rating | estate | arthritis | accident | coverage | company | mpg | auction | aspirin | venture"                      
## [17] "inch | snow | temperature | dress | thunderstorm | bird | turtle | jackpot | inheritance | shower"                    
## [18] "party | campaign | president | leader | republic | summit | delegate | talk | meeting | reporter"                     
## [19] "fire | pope | abortion | firefighter | convoy | school | tape | rocket | woman | burn"                                
## [20] "dollar | yen | market | index | stock | price | rise | trading | share | close"
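
The filter count > expected keeps only words that occur in a cluster more often than chance would predict. The exact formula used by dsst_metrics is not shown here, so the following is a sketch under the assumption that it follows the usual definition: the expected count comes from the margins of a 2 x 2 table, and the G-score is the log-likelihood ratio statistic. All counts below are hypothetical.

```r
# Hypothetical counts for one (word, cluster) pair; none of these numbers
# come from the AP data.
count_in_cluster  <- 30    # occurrences of the word inside the cluster
count_elsewhere   <- 20    # occurrences of the word in all other clusters
tokens_in_cluster <- 1000  # all noun/verb tokens inside the cluster
tokens_elsewhere  <- 9000  # all noun/verb tokens in the other clusters

# Observed 2 x 2 table, filled column by column:
# rows = this word / other words, columns = inside / outside the cluster.
observed <- matrix(
  c(count_in_cluster, tokens_in_cluster - count_in_cluster,
    count_elsewhere,  tokens_elsewhere - count_elsewhere),
  nrow = 2
)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# G statistic (log-likelihood ratio): 2 * sum(O * log(O / E)).
g_score <- 2 * sum(observed * log(observed / expected))

print(expected[1, 1])  # expected count of the word inside the cluster
print(g_score)
```

Here the word makes up 0.5% of all tokens, so 5 occurrences would be expected among the cluster's 1000 tokens; the observed 30 is well above that, so the word would pass the filter.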

Next, use the function slice_sample with n = 1 to randomly select one article from each cluster. Explore a few examples and see how they line up with the words above.

# Question 04
docs %>%
  inner_join(dt, by = c("doc_id", "train_id")) %>%
  group_by(cluster) %>%
  slice_sample(n = 1) %>%
  dsst_print_text()
## AP880329-0257; train; -0.0256938512918124; -0.00802977774437425; 1
## Sundstrand Corp., a major aerospace contractor, has agreed in
## principle to plead guilty to filing false statements with the
## government and pay $12.5 million for overcharging on military work,
## a newspaper reported. The proposed plea agreement covers allegations
## that the company's Seattle-based Sundstrand Data Control unit
## improperly included patent-litigation expenses in overhead costs
## on government military contracts, the Wall Street Journal reported
## Tuesday, quoting unidentified sources. Sundst
## 
## AP881228-0045; train; -0.0219452772594927; -0.0111203927759508; 2
## President Reagan and other U.S. officials marked the ninth anniversary
## of Soviet intervention in Afghanistan stressing that they expect
## Moscow to honor its commitment to complete military withdrawal by Feb.
## 15. Failure to meet the deadline would get Moscow ``off to a very bad
## start'' with President-elect Bush's incoming administration, warned
## Michael Armacost, the State Department's undersecretary for political
## affairs. The State Department also predicted Tuesday that the Afghan
## national army wi
## 
## AP881019-0093; train; -0.0229445066395743; -0.0141084826813256; 3
## The Food and Drug Administration today announced new drug approval
## procedures to speed development of treatments for life-threatening
## diseases like AIDS. In essense, the new regulation would create a
## mechanism for the FDA to work with drug companies in the earliest
## stages of a drug's development to refine preliminary tests in animals
## and humans to get the most useful data in the shortest possible time.
## While that clearly could shorten the approval time for drugs that
## subsequently prove to be saf
## 
## AP900613-0115; train; -0.0244686538410052; -0.0265504770588531; 4
## Soldiers shot at anti-government demonstrators in central Bucharest
## tonight after the protesters occupied state-run television and stormed
## and burned police headquarters, witnesses said. One witness reported
## seeing at least two bodies after the shooting, but this could not
## be immediately confirmed. The most serious violence in the Romanian
## capital since December's revolution that toppled the Ceausescu
## regime was caused by a pre-dawn police raid that ended a 53-day anti-
## Communist protest in Unive
## 
## AP900713-0202; train; -0.0362597532122714; 0.0370566222205621; 5
## Treasury Secretary Nicholas Brady says he will oppose further
## reductions of the Commodity Futures Trading Commission's powers if
## Congress passes legislation stripping the agency's authority over
## stock-index futures. Brady offered the olive branch at a Senate
## committee hearing Thursday on the Bush administration's plan to
## transfer control over stock-index futures to the rival Securities and
## Exchange Commission. One futures industry official said he saw Brady's
## remarks as an indication that the se
## 
## AP900611-0036; train; -0.0140184293373013; -0.00471537928116314; 6
## Using too much water to douse the fire aboard the Mega Borg could sink
## the oil supertanker, but firefighters using foam to smother the flames
## today faced another problem, salvage experts say. ``You have to get
## on that ship to extinguish it,'' said Les Williams, whose Port Arthur,
## Texas, salvage company has fought nearly two dozen offshore oil fires.
## ``If the fire's below deck, it's like trying to walk on a hot skillet.
## You can't do that.'' The fire burned since Saturday aboard the 853-
## foot Norwe
## 
## AP900426-0003; train; -0.0203735205543821; 0.00125218125354586; 7
## Unocal Corp., joining a growing contingent of oil companies with clean
## air ideas, announced Thursday it would offer to buy 7,000 older gas-
## eating cars in an effort to reduce smog in the Los Angeles basin.
## Unocal has allocated $5 million to purchase the cars, which must be
## at least 20 years old, and scrap them, said Richard J. Stegemeier,
## Unocal's chairman and chief executive. The plan would eliminate an
## estimated 6 million pounds of pollutant gases, he said. Owners would
## be offered $700 apiece f
## 
## AP881111-0034; train; -0.0304767419653186; -0.000867360619911944; 8
## President-elect George Bush has been given the glum news that the
## budget deficit in the next fiscal year will be $21 billion higher than
## the Reagan administration had previously estimated. That information
## means that Bush will be facing an even bigger budget headache when
## he takes office on Jan. 20. The new deficit estimate was presented to
## Bush and President Reagan during a Cabinet briefing Thursday on the
## administration's final budget submission to Congress. Joseph Wright
## Jr., director of the
## 
## AP900906-0222; train; -0.0228574695362764; 0.00988692051740968; 9
## American textile and fiber makers are talking tougher. They have
## dumped the celebrities who grinned as they flashed the ``Made in
## U.S.A.'' labels for the cameras in those old buy-American commercials.
## In a new round of ads that debut this weekend on NBC's telecast of
## the Miss America Pageant, they show shoppers making excuses for buying
## imports while workers cart their belongings outside a plant closing
## for good. In another commercial, a mother explains to her bewildered
## young son that they are
## 
## AP900328-0151; train; -0.0156485089183749; -0.00853107155638839; 10
## Prime Minister Margaret Thatcher underlined the need for restraint
## on all sides in the Lithuania crisis during a 50-minute telephone
## conversation Wednesday with Soviet President Mikhail Gorbachev, her
## office said. The phone call, arranged in advance through diplomatic
## contacts, was timed to precede Mrs. Thatcher's meetings Thursday and
## Friday with West German Chancellor Helmut Kohl in Britain and with
## President Bush in Bermuda on April 13. Mrs. Thatcher's office gave no
## details on Gorbachev's re
## 
## AP900409-0050; train; -0.0249232087312242; -0.021936156930724; 11
## Three Israeli soldiers have testified that they were told army orders
## to break the bones of Palestinian detainees came from the upper
## echelons of the military. The soldiers _ reserve Sgts. Guy Neeman,
## Amiram Avirosh and Ronen Ferber _ were called as defense witnesses in
## the court martial of Col. Yehuda Meir, the former commander of Israeli
## troops in the Nablus area of the West Bank. Meir, 38, who left the
## army last year, is charged with ordering soldiers to break the limbs
## of Palestinians taken
## 
## AP900521-0095; train; -0.0175926878553265; -0.021223054124039; 12
## A gunman shot two people this morning and took one of his victims
## hostage in a church, police said. The man shot the people about 8:45
## a.m., said Julie Wolinksi, a police spokeswoman in Melbourne Beach. He
## then barricaded himself inside the Episcopal church, Saint Sebastian-
## By-The-Sea. No one else was believed to be inside the church, she
## said. But reporters at the scene heard a man and a woman screaming,
## leading them to believe two people were being held hostage. The
## gunman's identity was not i
## 
## AP881013-0333; train; -0.0274984257820136; 0.0227024201427486; 13
## Air Wis Services Inc. rejected as inadequate a $121 million
## takeover bid from a Connecticut investment firm, but left the door
## open for a higher offer. The offer of $16.365 a share from Cove
## Capital Associates Inc. of Greenwich, Conn., is the fourth takeover
## bid for the airline holding company since last November. Cove
## represents Transmark USA Inc., an insurance holding company based in
## Jacksonville, Fla. The offer rejected Wednesday includes $14 a share
## in cash and $2.36{ in preferred stock. Th
## 
## AP900424-0155; train; -0.0118650656989766; -0.00418254792847363; 14
## President Bush said Tuesday he will nominate James S. Halpern, a
## partner in a Washington law firm, as a judge on the U.S. Tax Court.
## If confirmed by the Senate, Halpern would succeed Meade Whitaker. In
## other personnel decisions, Bush: _Said he will nominate Ming Hsu to be
## a federal maritime commissioner for the remainder of a term expiring
## June 30, 1991, succeeding Elaine L. Chao. She is director of the New
## Jersey Commerce Department's division of international trade, and is
## the governor's speci
## 
## AP901114-0225; train; -0.0256352070079461; 0.0625624007852754; 15
## Shares closed lower on London's Stock Exchange Wednesday as investors
## assessed the news that Prime Minister Margaret Thatcher is being
## challenged as leader of the governing Conservative Party. Former
## Defense Secretary Michael Heseltine said he will challenge Mrs.
## Thatcher in a ballot among Conservative legislators Tuesday. The
## Financial Times-Stock Exchange 100-share index fell 10 points, or 0.5
## percent, to close at 2,046.0. The Financial Times 300-share index was
## down 10.4 points at 1,583.2. Th
## 
## AP880415-0173; train; -0.0133057419973123; 0.0100201856445268; 16
## American rice farmers may benefit in the early months of the 1988-89
## season from a shortage of exportable crops in competing nations, an
## Agriculture Department report said Friday. World rice production from
## the 1987-88 harvests is forecast at 304 million metric tons, milled
## equivalent, down 4 percent from last year, the department's Economic
## Research Service said. Weak monsoons left Asian producers with limited
## supplies. Meanwhile, global rice consumption is expected to increase 3
## percent, leavi
## 
## AP900409-0201; train; -0.0051619558199858; -0.00356399031585741; 17
## He's a composer for our time, a man who writes operas about the social
## condition of the modern world. Sir Michael Tippett, though better
## known in his native England than in America, has been called ``one
## of the three or four giant composers in the last half of the 20th
## century.''
## 
## AP901124-0011; train; -0.0313683091096418; -0.019339763971344; 18
## It was apparently late Wednesday night, unseasonably chilly even for
## November, when Prime Minister Margaret Thatcher faced the cold reality
## that most of her party wanted her to quit. Only hours before, she had
## strode confidently from 10 Downing Street, the official residence, to
## say she was intent on beating Michael Heseltine, the former defense
## secretary who had challenged her for leadership of the governing
## Conservative Party. ``I fight on, I fight to win!'' she had declared.
## According to stat
## 
## AP901014-0026; train; -0.0146273212285791; -0.0116505632219637; 19
## The first black ever to make a runoff for mayor in Shreveport says he
## thinks racial hatred may have been the motive of whomever festooned
## his front yard with toilet paper. ``This makes me more determined to
## be mayor of this town,'' said C.O. Simpkins, looking at the yards of
## toilet paper draped through trees and bushes Saturday. ``I think that
## this speaks for just a small number of people in Shreveport,'' said
## Simpkins, a Democrat. ``In my talking to people, especially white
## people, I feel that
## 
## AP881003-0328; train; -0.0369357614917165; 0.0721409614953838; 20
## Stock prices fell in light trading today, starting off the final
## quarter of 1988 on a wary note. The Dow Jones average of 30
## industrials, down 6.40 on Friday, dropped 21.52 to 2,091.29 by 2
## p.m. today on Wall Street. Losers outnumbered gainers by 3 to 1 in
## nationwide trading of New York Stock Exchange-listed issues, with
## 359 up, 1,079 down and 414 unchanged. Wall Streeters were generally
## resigning themselves to a sluggish week, expecting investors to back
## away from stocks before the government i

Now, fit a topic model with 20 topics to the nouns and verbs in the data. As above, set min_df = 0.

# Question 05
model <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_lda_build(num_topics = 20, min_df = 0)

Modifying the code from the notes, compute the 10 words most associated with each topic. Do you see patterns similar to those from the PCA + G-score analysis? What are some notable differences? Do any topics line up well with clusters?

# Question 06
model$terms %>%
  group_by(topic) %>%
  arrange(desc(beta)) %>%
  slice_head(n = 10) %>%
  summarise(words = paste(token, collapse = "; "))
## # A tibble: 20 × 2
##    topic words                                                              
##    <int> <chr>                                                              
##  1     1 country; talk; official; government; agreement; meeting; trade; ai…
##  2     2 market; dollar; price; stock; rate; point; index; trading; share; …
##  3     3 people; area; mile; water; state; city; building; fire; year; wind 
##  4     4 child; year; film; show; movie; time; rating; network; art; week   
##  5     5 percent; year; rate; month; tax; sale; increase; income; budget; g…
##  6     6 court; case; attorney; trial; lawyer; charge; judge; state; year; …
##  7     7 cent; price; oil; farmer; year; market; future; ton; farm; soybean 
##  8     8 worker; union; job; plant; contract; strike; company; employee; ye…
##  9     9 year; abortion; state; museum; water; _; bird; foot; plant; time   
## 10    10 troop; force; war; soldier; rebel; official; government; today; ar…
## 11    11 police; people; man; Police; death; officer; bus; city; night; year
## 12    12 student; school; year; president; church; teacher; people; member;…
## 13    13 plane; flight; pilot; accident; airline; official; passenger; air;…
## 14    14 hospital; patient; study; disease; doctor; health; treatment; hear…
## 15    15 government; party; election; leader; year; people; member; country…
## 16    16 campaign; bill; state; vote; candidate; president; year; election;…
## 17    17 year; family; woman; child; home; people; wife; son; man; time     
## 18    18 program; computer; system; defense; report; member; year; official…
## 19    19 drug; prison; year; law; crime; charge; state; cocaine; death; age…
## 20    20 company; bank; year; share; stock; offer; business; plan; firm; sa…
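
To judge whether topics line up with clusters, one option is to cross-tabulate each document's cluster against its highest-weight topic. The structure of the object returned by dsst_lda_build is not shown in this notebook, so the sketch below uses two invented vectors in place of the real labels; how you extract the dominant topic per document from the model is left as an assumption.

```r
# Hypothetical per-document labels; in practice these would come from `dt`
# and from the document-level output of the topic model.
cluster        <- c(20, 20, 15, 15, 4, 4, 12, 12)
dominant_topic <- c(2, 2, 2, 5, 10, 10, 6, 11)

# Rows are clusters, columns are topics; each cell counts documents.
xt <- table(cluster = cluster, topic = dominant_topic)
print(xt)

# Share of each cluster that falls into its single most common topic.
# Values near 1 suggest the cluster maps cleanly onto one topic.
purity <- apply(xt, 1, max) / rowSums(xt)
print(purity)
```

Keep in mind that a topic model assigns each document a mixture of topics, so reducing it to one dominant topic discards information that the hard cluster labels never had in the first place.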

Export your topic model to a JSON file.

# Question 07
dsst_json_lda(model, docs)

Load the file into the web interface and explore the data. Think about the kinds of things you can and cannot learn by this method and how it compares to the clustering analysis.

Nematode Abstracts

As a second task, we will look at a collection of abstracts from the study of nematodes. Read the data in with the following:

docs <- read_csv(file.path("..", "data", "nematode_abs.csv.bz2"))
anno <- read_csv(file.path("..", "data", "nematode_abs_tokens.csv.bz2"))

The questions below use the same code as the section above; only the output will be different.

Now, create a dataset (give it a name; you’ll need it later) containing the first two principal components computed from the nouns and verbs in the data. Set min_df to zero to avoid errors. Finally, add k-means clustering with 20 clusters, and add a column called train_id that is always equal to “train”.

# Question 08
dt <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_pca(min_df = 0) %>%
  dsst_kmeans(n_clusters = 20) %>%
  mutate(train_id = "train")
## Warning: did not converge in 10 iterations
dt
## # A tibble: 5,947 × 5
##    doc_id         v1        v2 cluster train_id
##    <chr>       <dbl>     <dbl>   <dbl> <chr>   
##  1 doc00001 -0.00890 -0.00470       17 train   
##  2 doc00002 -0.00530  0.00110       19 train   
##  3 doc00003 -0.00527  0.00325        9 train   
##  4 doc00004 -0.00612 -0.00473       17 train   
##  5 doc00005 -0.0157   0.0231        13 train   
##  6 doc00006 -0.00845  0.000699       5 train   
##  7 doc00008 -0.00497 -0.00111       19 train   
##  8 doc00009 -0.00344 -0.00309       19 train   
##  9 doc00010 -0.00478 -0.00422       19 train   
## 10 doc00011 -0.0171   0.00205       15 train   
## # … with 5,937 more rows
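
The "did not converge in 10 iterations" warning has the same wording as the one issued by base R's stats::kmeans, whose iter.max argument defaults to 10. It means the algorithm stopped at its iteration cap while cluster assignments were still changing, so the clusters are usable but not fully settled. Whether dsst_kmeans exposes an iteration setting is not shown here; the sketch below uses base R directly, on invented data, to illustrate raising the cap.

```r
set.seed(1)

# Toy 2-D points standing in for PCA scores (invented data).
x <- matrix(rnorm(400), ncol = 2)

# stats::kmeans stops after iter.max iterations (default 10) and warns if
# it has not converged by then. Raising iter.max gives it room to finish;
# nstart tries several random starts and keeps the best one.
fit <- kmeans(x, centers = 5, iter.max = 100, nstart = 5)

print(fit$iter)            # iterations used by the returned fit
print(table(fit$cluster))  # resulting cluster sizes
```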

Next, compute the size of each cluster. Notice that these are not equally sized, but the variation should be within an order of magnitude.

# Question 09
dt %>%
  group_by(cluster) %>%
  summarize(n = n())
## # A tibble: 20 × 2
##    cluster     n
##      <dbl> <int>
##  1       1   199
##  2       2   100
##  3       3   194
##  4       4    72
##  5       5   517
##  6       6   310
##  7       7   262
##  8       8   124
##  9       9   405
## 10      10   192
## 11      11   216
## 12      12   473
## 13      13   181
## 14      14   182
## 15      15   392
## 16      16   332
## 17      17   573
## 18      18   121
## 19      19   629
## 20      20   473

Using the dsst_metrics function, compute the nouns and verbs most associated with each cluster. Print out the top 10 words for each cluster. Take a few moments to look at the results and try to make sense of them.

# Question 10
anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_metrics(dt, label_var = "cluster") %>%
  filter(count > expected) %>%
  group_by(label) %>%
  slice_head(n = 10) %>%
  summarize(words = paste0(token, collapse = " | ")) %>%
  getElement('words')
##  [1] "sequence | intron | tc1 | element | site | repeat | splice | insertion | amino | genome"                     
##  [2] "sequence | element | tc1 | region | nucleotide | intron | amino | bp | repeat | acid"                        
##  [3] "cell | division | embryo | rotation | germline | caspase | blastomere | apoptosis | fate | axis"             
##  [4] "cell | death | fate | vulval | program | precursor | lineage | anchor | induction | division"                
##  [5] "vesicle | temperature | response | hypoxia | food | stress | synaptobrevin | endocytosis | antigen | model"  
##  [6] "sequence | splicing | tran | chromosome | map | subunit | site | duplication | genome | transposition"       
##  [7] "gene | protein | domain | x | family | sequence | mutation | region | isoform | expression"                  
##  [8] "cell | fate | death | vulval | lineage | division | precursor | program | vulva | specification"             
##  [9] "neuron | axon | microtubule | sperm | spectrin | cytokinesis | centrosome | spindle | bundle | cone"         
## [10] "pathway | signal | receptor | regulate | function | expression | neuron | protein | differentiation | kinase"
## [11] "gene | expression | protein | muscle | function | mutation | subelement | sex | phenotype | domain"          
## [12] "mutant | life | span | muscle | mutation | allele | longevity | resistance | phenotype | stress"             
## [13] "cell | fate | lineage | vulval | division | death | induction | germ | vulva | precursor"                    
## [14] "cell | signal | kinase | ras | development | embryo | fate | require | act | specify"                        
## [15] "neuron | function | dauer | expression | mutant | regulate | insulin | pathway | defect | control"           
## [16] "cell | migration | embryo | centrosome | axon | polarity | axis | cortex | germ | endoderm"                  
## [17] "concentration | medium | % | nematode | metal | toxicity | halothane | culture | sensitivity | use"          
## [18] "sequence | gene | region | domain | motif | protein | promoter | cdna | predict | clone"                     
## [19] "nematode | soil | habituation | biomass | toxicity | medium | training | sterol | concentration | food"      
## [20] "chromosome | acid | x | strain | myosin | codon | filament | bind | duplication | span"

Next, use the function slice_sample with n = 1 to randomly select one article from each cluster. Explore a few examples and see how they line up with the words above.

# Question 11
docs %>%
  inner_join(dt, by = c("doc_id", "train_id")) %>%
  group_by(cluster) %>%
  slice_sample(n = 1) %>%
  dsst_print_text()
## doc04047; Krieser RJ;Eastman A; Gene 252: 155-162 2000; NA; 4242;
## 20363736; Deoxyribonuclease II: structure and chromosomal localization
## of the murine gene, and comparison with the genomic structure of
## the human and three C. elegans; ARTICLE; train; -0.0159677168263459;
## -0.0262398296726539; 1
## Deoxyribonuclease II (DNase II) has been implicated in diverse
## functions including degradation of foreign DNA, genomic instability,
## and in mediating the DNA digestion associated with apoptosis. The
## production of a mouse deleted for DNase II would clearly help to
## discriminate these functions. We have cloned and sequenced the mouse
## gene encoding DNase II. It was found to have a similar intron/exon
## structure to the human gene, although introns 3 and 5 are considerably
## shorter. The gene is located o
## 
## doc01074; Moerman DG;Waterston RH; "Mobile DNA." Berg DE and Howe
## MM (eds), American Society for Microbiology. : 537-556 1989; ced-4
## dpy-5 dpy-19 lin-12 mut-2 mut-4 mut-5 sma-1 tra-2 unc-15 unc-22
## unc-37 unc-54 unc-105; 1155; NA; Mobile elements in Caenorhabditis
## elegans and other nematodes.; REVIEW; train; -0.0145054362586276;
## -0.039345604145116; 2
## Transposable elements have recently been described in several species:
## Caenorhabditis elegans, Caenorhabditis briggsae, Ascaris lumbricoides,
## and Panagrellus redivivus. Because of the intense interest in C.
## elegans as an experimental organism for developmental genetic studies
## and the availability of sophisticated genetics, most is know about
## transposons in this species. This review focuses principally on
## Tc1 (Tc=transposon) of C. elegans, the best understood element in
## nematodes. Other element
## 
## doc04465; Delattre M;Felix MA; Developmental Biology 232:
## 362-371 2001; bar-1 egl-5 lag-2 lin-3 lin-12 lin-17 lin-44; 4662;
## 21294970; Development and evolution of a variable left-right
## asymmetry in nematodes: The handedness of P11/P12; ARTICLE; train;
## -0.0113133746215426; 0.02058146760575; 3
## In Caenorhabditis elegans, two lateral blast cells called P(11/12)L
## and P(11/12)R are symmetric left-right homologs at hatching, migrate
## subsequently in opposite anteroposterior directions during the
## first larval stage, and adopt two different fates, thus breaking the
## symmetry between them. Our results show that, unlike most other cell
## fate decisions in C. elegans, the orientation of P(11/12)L/R migration
## is highly biased, but not fixed. The handedness of their migration is
## linked to whole body
## 
## doc02127; Maine EM; Seminars in Developmental Biology 2: 295-304 1995;
## dig-1 let-23 let-60 lin-1 lin-2 lin-3 lin-7 lin-8 lin-9 lin-10 lin-11
## lin-12 lin-13 lin-15 lin-17 lin-18 lin-24 lin-25 lin-26 lin-31 lin-33
## lin-34 lin-35 lin-36 lin-37 lin-38; 2219; NA; Cell-signaling events
## regulate vulval development in the nematode, Caenorhabditis elegans.;
## REVIEW; train; -0.0271581578315179; 0.044131096917042; 4
## Several distinct cell-signaling events are responsible for
## determination of cell fates during vulval development in C. elegans. A
## gonadal cell, the anchor cell, induces three hypodermal cells to adopt
## vulval cell fates. This signal overrides an inhibitory influence,
## probably exerted by the surrounding hypodermal syncytium. Interactions
## among the induced cells ensure that they will adopt the appropriate
## vulval fates. The gonad signals proper attachment and innervation of
## muscle cells to the d
## 
## doc05029; Agostoni E;Gobessi S;Petrini E;Monte M;Schneider C;
## Biochemica et Biophysica Acta - Gene Structure & Expression 1574: 1-9
## 2002; ced-1 phas-1; 5230; NA; Cloning and characterization of the C.
## elegans gas1 homolog: phas-1.; ARTICLE; train; -0.0123515955169332;
## -0.00176930953152622; 5
## Among the set of genes expressed during the quiescent G0 phase of
## the cell cycle (gas genes), gas1 encodes for a GPI anchor protein
## associated to the plasma membrane, which is able to induce growth
## arrest when overexpressed in proliferating fibroblasts. In this
## report we describe the isolation and characterization of a gas1
## Cetenorhabditis elegans homolog, phas-1, that seems to be transcribed
## as an operon together with a gene encoding for a protein similar to
## human acid ceramidases. Phas-1 struc
## 
## doc02350; Winter CE;Penha C;Blumenthal T; Molecular Biology and
## Evolution 13: 674-684 1996; vit-2 vit-5 vit-6; 2443; 96212989;
## Comparison of a vitellogenin gene between two distantly related
## Rhabditid nematode species.; ARTICLE; train; -0.0102505511712119;
## -0.0173019458528632; 6
## Three vitellogenin genes from the free-living nematode Caenorhabditis
## elegans have previously been characterized at the molecular level.
## In order to study evolutionary relationships within this poorly
## understood taxon, we have cloned a vitellogenin gene, CEW1-vit-6
## from a distantly related species belonging to the same family as C.
## elegans. Screening of a genomic library with a probe to total poly(A+)
## RNA yielded three clones that hybridized more intensely than all
## others, and all three correspo
## 
## doc01983; Akerib CC;Meyer BJ; Genetics 138: 1105-1125 1994; dpy-21
## dpy-26 dpy-27 dpy-28 dpy-30 sdc-1 sdc-2 sdc-3 unc-2 unc-9 xol-1
## meDf5 meDf6 mnDp8 mnDp10 mnDp57 mnDp66 mnDp73 nDf19 stDp2 yDf13
## yDf14 yDp4 yDp5 yDp6 yDp7 yDp8 yDp9 yDp10 yDp11 yDp12 yDp13 yDp14
## yDp15 yDp16; 2074; 95203681; Identification of X chromosome regions
## in Caenorhabditis elegans that contain sex-determination signal
## elements.; ARTICLE; train; -0.0177176979439467; -0.0140030853264545; 7
## The primary sex-determination signal of Caenorhabditis elegans is the
## ratio of X chromosomes to sets of autosomes (X/A ratio). This signal
## coordinately controls both sex determination and X chromosome dosage
## compensation. To delineate regions of X that contain counted signal
## elements, we examined the effect on the X/A ratio of changing the
## dose of specific regions of X, using duplications in XO animals and
## deficiencies in XX animals. Based on the mutant phenotypes of genes
## that are controlled
## 
## doc02447; Newman AP;Sternberg PW; Proceedings of the National Academy
## of Sciences USA 93: 9329-9333 1996; ksr-1 lag-2 let-23 let-60 lin-1
## lin-3 lin-11 lin-12 lin-15 lin-25 lin-31 lin-45 mek-2 mpk-1 sem-5
## sur-2; 2540; 96382465; Coordinated morphogenesis of epithelia during
## development of the Caenorhabditis elegans uterine-vulva connection.;
## ARTICLE; train; -0.0188766607048349; 0.0314545327128794; 8
## Development of the nematode egg-laying system requires the formation
## of a connection between the uterine lumen acid the developing vulval
## lumen, thus allowing a passage for eggs and sperm. This relatively
## simple process serves as a model for certain aspects of organogenesis.
## Such a connection demands that cells in both tissues become
## specialized to participate in the connection, and that the specialized
## cells are brought in register, A single cell, the anchor cell, acts to
## induce and to organize
## 
## doc05024; Sumiyoshi E;Sugimoto A;Yamamoto M; Journal of Cell Science
## 115: 1403-1410 2002; fem-1 fem-2 pph-4.1 pph-4.2 rde-1; 5225;
## 11896188; Protein phosphatase 4 is required for centrosome maturation
## in mitosis and sperm meiosis in C. elegans.; ARTICLE; train;
## -0.0111360000521579; 0.00555937028041766; 9
## The centrosome consists of two centrioles surrounded by the
## pericentriolar material (PCM). In late G2 phase, centrosomes enlarge
## by recruiting extra PCM, and concomitantly its microtubule nucleation
## activity increases dramatically. The regulatory mechanisms of this
## dynamic change of centrosomes are not well understood. Protein
## phosphatase 4 (PP4) is known to localize to mitotic centrosomes in
## mammals and Drosophila. An involvement of PP4 in the mitotic spindle
## assembly has been implicated in Dro
## 
## doc01417; Miller DM;Shen MM;Shamu CE;Burglin TR;Ruvkun G;Dubois
## ML;Ghee M;Wilson L; Nature 355: 841-845 1992; unc-4 eDf21 mnDf14
## mnDf16 mnDf24 mnDf25 mnDf26 mnDf56 mnDf59 mnDf60 mnDf61; 1502;
## 92168139; C. elegans unc-4 gene encodes a homeodomain protein that
## determines the pattern of synaptic input to specific motor neurons.;
## ARTICLE; train; -0.0184918598215987; 0.00512396591632017; 10
## The creation of neural circuits depends on the formation of synapses
## between specific sets of neurons. Little is known, however, of the
## molecular mechanisms governing synaptic choice. A mutation in the
## unc- 4 gene alters the pattern of synaptic input to one class of motor
## neurons in the Caenorhabditis elegans ventral nerve cord. In unc-
## 4(e120), the presynaptic partners of VA motor neurons are replaced
## with interneurons appropriate to motor neurons of the VB class. This
## change in neural specific
## 
## doc03500; Carmi I;Meyer BJ; Genetics 152: 990-1015 1999; dpy-21
## dpy-26 dpy-27 dpy-28 dpy-30 fem-1 fem-2 fem-3 fox-1 her-1 mix-1
## sdc-1 sdc-2 sdc-3 sex-1 tra-1 tra-2 tra-3 xol-1 meDf5 meDf6 yDf17
## yDf19 yDf20 mnDp66 stDp2 yDp13 yDp14; 3597; 99318835; The primary
## sex determination signal of Caenorhabditis elegans.; ARTICLE; train;
## -0.0176106486927205; -0.00562226728281115; 11
## An X chromosome counting process determines sex in Caenorhabditis
## elegans. The dose of X chromosomes is translated into sexual fate by
## a set of X-linked genes that together control the activity of the sex-
## determination and dosage-compensation switch gene, xol-1. The double
## dose of X elements in XX animals represses xol-1 expression, promoting
## the hermaphrodite fate, while the single dose of X elements in XO
## animals permits high xol-1 expression at two levels, transcriptional
## and post-transcripti
## 
## doc02803; De Stasio E;Lephoto C;Azuma L;Holst C;Stanislaus D;Uttam
## J; Genetics 147: 597-608 1997; sup-9 sup-10 sup-11 unc-93; 2898;
## 97476311; Characterization of revertants of unc-93(e1500) in
## Caenorhabditis elegans induced by N-ethyl-N-nitrosourea.; ARTICLE;
## train; -0.0112234793221923; -0.00759526548409915; 12
## Phenotypic reversion of the rubber-hand, muscle-defective phenotype
## conferred by unc-93(e1500) aas used to determine the utility of N-
## ethyl-N-nitrosourea (ENU) as a mutagen for genetic research in
## Caenorhabditis elegans. In this system, ENU produces revertants at
## a frequency of 3 x 10(-4), equivalent to that of the commonly used
## mutagen, EMS. The gene identity of 154 ENU-induced revertants shows
## that the distribution of alleles between three possible suppressor
## genes differs from that induced by
## 
## doc02010; Stern MJ;DeVore DL; Developmental Biology 166: 443-459
## 1994; egl-15 egl-17 glp-1 let-23 let-60 let-341 lin-1 lin-2 lin-3
## lin-7 lin-10 lin-12 lin-15 lin-31 lin-45 mpk-1 sem-5 sur-1 unc-6
## unc-34 unc-40 unc-71 unc-76; 2101; 95113166; Extending and connecting
## signaling pathways in C. elegans.; REVIEW; train; -0.0203644608126139;
## 0.0198773797702205; 13
## The development of the nematode Caenorhabditis elegans is known
## to depend extensively on reproducible cell-cell interactions. The
## analysis of many of these signaling events has revealed that, in most
## cases, the mechanisms that mediate them have been conserved throughout
## metazoan evolution. Thus, the analysis of signaling pathways in C.
## elegans can aid in the understanding of signal transduction mechanisms
## in general. In this review we focus on signaling events that occur
## during the developmen
## 
## doc05023; Wallenfang MR;Seydoux G; Proceedings of the National Academy
## of Sciences USA 99: 5527-5532 2002; cdk-7 cey-2 end-1 pes-10 tDf3;
## 5224; NA; cdk-7 is required for mRNA transcription and cell cycle
## progression in Caenorhabditis elegans embryos.; ARTICLE; train;
## -0.0176056038236199; 0.0125572963300051; 14
## CDK7 is a cyclin-dependent kinase proposed to function in two
## essential cellular processes: transcription and cell cycle regulation.
## CDK7 is the kinase subunit of the general transcription factor TFIIH
## that phosphorylates the C-terminal domain (CTD) of RNA polymerase
## 11, and has been shown to be broadly required for transcription
## in Saccharomyces cerevisiae. CDK7 can also phosphorylate CDKs that
## promote cell cycle progression, and has been shown to function as a
## CDK-activating kinase (CAK) in Sc
## 
## doc00397; Hosono R;Sato Y;Aizawa SI;Mitsui Y; Experimental Gerontology
## 15: 285-289 1980; NA; 469; 81004191; Age-dependent changes in
## mobility and separation of the nematode C. elegans.; ARTICLE; train;
## -0.0125150280594183; 0.0013791512940098; 15
## It has generally been believed that a senescent state is brought
## about by the loss of division ability of essential cells or the
## disappearance of irreplaceable components. Since aging shows
## considerable individual variations, it is difficult to pursue the
## problem in any single species. Analysis of population aging have been
## mainly done by measuring a decrease in the population size. However,
## the decrease in population size involves not only senescent death but
## also death due to other changes, a
## 
## doc01620; Graham PL;Kimble J; Genetics 133: 919-931 1993; fem-1
## fem-2 fem-3 fog-1 fog-2 her-1 mog-1 tra-1 tra-2 tra-3 sup-7 qC1 nDf40
## qDp3; 1710; 93216089; The mog-1 gene is required for the switch from
## spermatogenesis to oogenesis in Caenorhabditis elegans.; ARTICLE;
## train; -0.0126439290918452; 0.0105741864672936; 16
## Caenorhabditis elegans hermaphrodites make first sperm, then oocytes.
## By contrast, animals homozygous for any of six loss-of-function
## mutations in the gene mog-1 (for masculinization of the germ line)
## make sperm continuously and do not switch into oogenesis. Therefore,
## in mog-1 mutants, germ cells that normally would become oocytes
## are transformed into sperm. By contrast, somatic sexual fates are
## normal, suggesting that mog-1 plays a germ line-specific role in sex
## determination. Analyses of doub
## 
## doc01381; Williamson VM;Long M;Theodoris G; Biochemical Genetics 29:
## 313-323 1991; NA; 1466; 92082452; Isolation of Caenorhabditis elegans
## mutants lacking alcohol dehydrogenase activity.; ARTICLE; train;
## -0.00557258943706707; -0.00635204556750313; 17
## Alcohol dehydrogenase (ADH) and the genes encoding this enzyme have
## been studied intensively in a broad range of organisms. Little,
## however, has been reported on ADH in the free-living nematode
## Caenorhabditis elegans. Extracts of wild-type C. elegans contain
## ADH activity and display a single band of activity on a native
## polyacrylamide gel. Reaction rate for alcohol oxidation is more rapid
## with higher molecular weight alcohols as substrate than with ethanol.
## Primary alcohols are preferred to seco
## 
## doc03802; Otto E;Kispert A;Schatzle S;Lescher B;Rensing C;Hildebrandt
## F; Journal of the American Society of Nephrology 11: 270-282 2000;
## NA; 3905; 20127700; Nephrocystin: gene expression and sequence
## conservation between human, mouse, and Caenorhabditis elegans.;
## ARTICLE; train; -0.0194936813254559; -0.0164673162464167; 18
## Juvenile nephronophthisis, an autosomal recessive cystic kidney
## disease, is the primary genetic cause for chronic renal failure in
## children. The gene (NPHP1) for nephronophthisis type 1 has recently
## been identified. Its gene product, nephrocystin, is a novel protein
## of unknown function, which contains a src-homology 3 domain. To study
## tissue expression and analyze amino acid sequence conservation of
## nephrocystin, the full-length murine Nphp1 cDNA sequence was obtained
## and Northern and in situ hy
## 
## doc00482; Shulkin DJ;Zuckerman BM; Age 5: 50-53 1982; NA; 557; NA;
## Spectrofluorometric analysis of the effect of centrophenoxine on
## lipofuscin accumulation in the nematode C. elegans.; ARTICLE; train;
## -0.00442888326742019; -0.00476931516294347; 19
## A 41.3% mean decrease in lipofuscin was found in the nematode
## Caenorhabditis elegans following treatment of 6.8 mM centrophenoxine
## for 21 days. It is proposed that the spectrofluorometric technique is
## a convenient and more accurate method for determining cellular content
## of lipofuscin than planimetric and histochemical methodologies. This
## study further demonstrates the similarity of nematode lipofuscin to
## mammalian age pigment and provides a rapid, inexpensive method for
## evaluating the effects
## 
## doc02520; Li X;Greenwald I; Neuron 17: 1015-1021 1996; sel-12; 2613;
## 97092712; Membrane topology of the C. elegans SEL-12 presenilin.;
## ARTICLE; train; -0.0130516920957489; -0.0106285291850725; 20
## Mutant presenilins cause Alzheimer's disease. Presenilins have
## multiple hydrophobic regions that could theoretically span a membrane,
## and a knowledge of the membrane topology is crucial for deducing,the
## mechanism of presenilin function. By analyzing the activity of beta-
## galactosidase hybrid proteins expressed in C. elegans, we show that
## the C. elegans SEL-12 presenilin has eight transmembrane domains and
## that there is a cleavage site after the sixth transmembrane domain. We
## examine the presenili

Now, fit a topic model with 20 topics using the nouns and verbs from the data. As above, set min_df = 0.

# Question 12
model <- anno %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  dsst_lda_build(num_topics = 20, min_df = 0)

Modifying the code from the notes, compute the 10 words most associated with each topic. Do you see patterns similar to those from the PCA + G-Score analysis? What are some notable differences? Do any topics line up well with clusters?

# Question 13
model$terms %>%
  group_by(topic) %>%
  arrange(desc(beta)) %>%
  slice_head(n = 10) %>%
  summarise(words = paste(token, collapse = "; "))
## # A tibble: 20 × 2
##    topic words                                                              
##    <int> <chr>                                                              
##  1     1 gene; sequence; elegan; acid; region; protein; amino; intron; %; s…
##  2     2 gene; protein; family; member; function; elegan; expression; role;…
##  3     3 chromosome; sex; x; male; hermaphrodite; determination; dosage; an…
##  4     4 protein; acid; enzyme; activity; elegan; extract; nematode; peptid…
##  5     5 muscle; body; cuticle; protein; filament; actin; structure; collag…
##  6     6 cell; embryo; division; microtubule; cleavage; spindle; protein; p…
##  7     7 gene; mutation; mutant; phenotype; function; allele; defect; type;…
##  8     8 germ; line; sperm; germline; cell; oocyte; RNAi; development; herm…
##  9     9 protein; domain; kinase; activity; receptor; alpha; beta; subunit;…
## 10    10 expression; gene; cell; stage; development; pattern; larval; prote…
## 11    11 life; span; rate; stress; mutant; elegan; longevity; insulin; agin…
## 12    12 effect; elegan; nematode; concentration; heat; growth; worm; %; st…
## 13    13 organism; elegan; study; system; model; specie; analysis; nematode…
## 14    14 cell; migration; neuron; axon; junction; guidance; motor; nerve; p…
## 15    15 cell; fate; death; vulval; development; lineage; pathway; signal; …
## 16    16 element; site; sequence; dna; tc1; repeat; insertion; copy; deleti…
## 17    17 channel; receptor; membrane; cell; protein; subunit; elegan; vesic…
## 18    18 gene; map; region; chromosome; %; deficiency; elegan; duplication;…
## 19    19 neuron; response; behavior; dauer; touch; receptor; animal; temper…
## 20    20 nematode; elegan; disease; model; plant; specie; activity; host; s…
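
One way to check whether topics line up with clusters is to cross-tabulate each document's strongest topic against its cluster label. The sketch below runs on small made-up tables, not the lab data. The names doc_topics, clusters, and prob are hypothetical stand-ins, so you would need to swap in whatever per-document topic table and cluster column you actually have.

```r
library(dplyr)
library(tibble)

# Toy stand-ins (hypothetical names and values): per-document topic
# weights and k-means cluster labels.
doc_topics <- tribble(
  ~doc_id, ~topic, ~prob,
  "d1",    1L,     0.8,
  "d1",    2L,     0.2,
  "d2",    1L,     0.7,
  "d2",    2L,     0.3,
  "d3",    1L,     0.1,
  "d3",    2L,     0.9,
  "d4",    1L,     0.4,
  "d4",    2L,     0.6
)
clusters <- tribble(
  ~doc_id, ~cluster,
  "d1",    5,
  "d2",    5,
  "d3",    9,
  "d4",    5
)

# Keep the single highest-weight topic for each document.
dominant <- doc_topics %>%
  group_by(doc_id) %>%
  slice_max(prob, n = 1, with_ties = FALSE) %>%
  ungroup()

# Count documents for every (cluster, topic) pair, largest counts first.
overlap <- dominant %>%
  inner_join(clusters, by = "doc_id") %>%
  count(cluster, topic, sort = TRUE)

overlap
```

If most of a cluster's documents fall under a single topic, the two methods have probably picked up the same theme. A cluster spread across many topics suggests they are cutting the collection differently.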

Export your topic model to a JSON file.

# Question 14
dsst_json_lda(model, docs)

Load the file into the web interface and explore the data. Think about the kinds of things you can and cannot learn by this method and how it compares to the clustering analysis.

Further Exploration

If you have remaining time, return to the Associated Press articles data. Use UMAP in place of PCA and take a much larger (50? 100?) number of clusters. How do the clusters compare to the previous results? Then, fit a topic model with 50 topics. Do these provide a better or worse understanding of the data?