Step 1: Tokenizing the Texts
Please refer to the table below for all of the texts we started with. We were unable to chunk The Moonstone as a whole because it repeatedly crashed the tokenizer. As a result, we broke The Moonstone up into seven pieces. We based the organization on Project Gutenberg's divisions for ease. The only difference is that we combined everything after the sixth narrative through the epilogue into one grouping. After tokenizing the texts, we moved on to chunking, merging chunksets and treeview.
Step 2: Overcoming Failures with Dendrograms
We first chunked each text by number of chunks, into 1000 chunks each, and tried four times to merge the chunksets. It crashed, which was our first failure.
Then we rechunked and relabeled (please see the provided table) and used advanced chunking by chunk size rather than by number of chunks. This successfully chunked everything and merged the chunksets into one set, which we called WholeCorpus, but the result was useless because it was too difficult to read. We tried three times to run it through treeview by clicking get dendro. First we used PDF clustering, which never uploaded. Then we tried dendro phyloxml and got an OOPS message. The third option was download XML, which never successfully downloaded.
The failures continued until we ran some sample tests using Moonstone1stperiod (114 chunks) and CaskAmon (2 chunks).
After running the Moonstone first period and "The Cask of Amontillado" as a test to see whether it would fail, we output the result as a PDF.
We then rechunked Moonstone1stperiod to see whether the number of chunks would drop below 114. A chunk size of 2000 gave 57 chunks, and a chunk size of 4000 gave 28 chunks.
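For readers who want to see why doubling the chunk size roughly halves the number of chunks, here is a minimal Python sketch of chunking by size. It is not the code the tokenizer ran. The function name `chunk_by_size`, the rule that a short leftover is folded into the previous chunk, and the 114,000-token placeholder text are all our own assumptions, chosen only to illustrate the arithmetic.

```python
def chunk_by_size(tokens, chunk_size):
    """Split a list of tokens into chunks of chunk_size tokens each.

    Assumption (ours, for illustration): a leftover shorter than
    chunk_size is folded into the previous chunk instead of being
    kept as a separate, undersized chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size:
        leftover = chunks.pop()
        chunks[-1].extend(leftover)
    return chunks


if __name__ == "__main__":
    # Hypothetical stand-in for a long text: 114,000 placeholder tokens.
    placeholder_text = ["word"] * 114_000
    for size in (1000, 2000, 4000):
        print(size, len(chunk_by_size(placeholder_text, size)))
```

Under these assumptions, a larger chunk size always yields fewer chunks, which is the behavior we relied on when we raised the chunk size until each text fell to a manageable number of chunks.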
Step 3: Success!
After numerous failed attempts we decided to use advanced chunking and break up each text into 30 chunks or fewer. Although seemingly arbitrary, this produced readable results. Following is a graph of the most useful dendrograms. Altogether we produced fifteen usable dendrograms and countless useless ones. For the purposes of this blog we are choosing to post only the most useful. Aside from the dendrogram below, all others will be reproduced in our individual analysis sections.
This dendrogram is called “Everything,” and it is the first readable dendrogram we made.
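To give a sense of what a dendrogram is built from, here is a rough Python sketch of the underlying idea: turn each text into word proportions, measure how far apart the texts are, and repeatedly join the two closest groups. We do not know which distance measure or linkage method the tool we used applies, so Euclidean distance and average linkage are assumptions here, and the function names and toy sentences are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def word_proportions(text):
    """Map each word in text to its share of the text's total word count."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: n / len(words) for word, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two word-proportion profiles."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))


def merge_order(docs):
    """Return the merges (left, right, distance) in the order they happen.

    docs maps a label to a text. Average linkage is an assumption.
    """
    profiles = {label: word_proportions(text) for label, text in docs.items()}
    clusters = {label: [label] for label in docs}

    def average_distance(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(euclidean(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters and join them under a new name.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: average_distance(*pair))
        merges.append((a, b, average_distance(a, b)))
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


if __name__ == "__main__":
    toy_docs = {  # made-up sentences, not taken from our corpus
        "A": "the cat sat on the mat",
        "B": "the cat sat on the rug",
        "C": "stocks fell sharply in early trading",
    }
    for left, right, dist in merge_order(toy_docs):
        print(left, right, round(dist, 3))
```

Texts that join early, at a small distance, are the ones a dendrogram draws as close neighbors. Texts that join last sit on the most distant branches.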
Because of problems with tokenizing, we could not compare the entirety of The Moonstone to the various texts. As a result, we separated The Moonstone into seven parts, using Project Gutenberg’s labeling system.
For a complete breakdown please see the graph:
Using the dendrograms as a starting point, we decided to narrow our focus even more and limited ourselves to the five texts that were the most similar and best exemplified what we found: The Moonstone, “The Murders in the Rue Morgue,” “The Purloined Letter,” The Sign of Four and A Study in Scarlet. Tara focused on the end of The Moonstone with “The Murders in the Rue Morgue.” Sinead focused on “The Murders in the Rue Morgue” and “The Purloined Letter.” Kendra focused on The Sign of Four and “The Purloined Letter.”
Our group decided to use the Topic Flowers as another way to visualize the texts in order to find more connections. We first found each text on Project Gutenberg and copied it into the text tokenizer. Then we copied and pasted the text into a Word document. From Word we transferred it to the Topic Flower text box and clicked create. Then we used the snipping tool to capture the image and saved it as a “JPEG” file so we could post it onto WordPress easily. One problem we encountered with the topic flowers was that putting all of our texts into the text box at once overwhelmed the web page, so entering multiple texts at the same time proved impossible.
We hoped that by looking at these two topic flowers we would find a connection between “The Sign of Four” and “A Study in Scarlet.” There were similarities between colors (purple, yellow and blue), but “A Study in Scarlet” focused more on society and science, while “The Sign of Four” focused more on economy and society. This research did not lead to any major conclusions.
Above are the topic flowers for Edgar Allan Poe’s “The Purloined Letter” and “The Murders in the Rue Morgue.” We found that again the overall topic flower colors were similar (blue, yellow and orange), but “The Murders in the Rue Morgue” focuses on society, then a bit of science, while “The Purloined Letter” focuses more on science and then a bit of society (both with a touch of recreation). “The Purloined Letter” had more ‘hairs’ than “The Murders in the Rue Morgue,” which only showed us that it used more personal pronouns and did not generate any new insight into our texts.
This was the topic flower for The Moonstone. This flower had sharper and more elongated petals, which is indicative of negative terms like death, murder and idiot. Although all the other stories were also about mysteries or murders, this topic flower stood out because it registered a larger usage of negative terms, making The Moonstone the only topic flower to match the dark contents of its text.
Our next step was to use Many-Eyes. We first created an account on Many-Eyes and from there decided to use the text visualization tools to find more similarities within our texts. We then decided to see which words would be the most visible in the word cloud. We were hoping to find words associated with the detective mystery genre, like murder, blood, clue and detective. We found that the word connections we wanted were not there, but instead we found other similarities that we would never have thought of. For example, we found that every text had the words “upon” and “one” except for The Moonstone, which had the word “one” but not “upon.” This finding didn’t lead us to any new information or insight regarding our texts. The last similarity between all the texts was the use of the main characters’ names. The only issue we encountered with Many-Eyes was that some of the applications were impossible to use with the free-form data set. Those applications wanted the text transformed into an Excel sheet, so that the two texts would be seen as separate. Our group decided that using the applications which required that process was not a viable option in the end.
These word clouds from Many-Eyes suggest that the word “upon” was used the most in all texts except The Moonstone. Each text also used the names of its main characters the most: Sergeant, Mr., Franklin, Holmes, and Dupin. The word “one” is also used in all five texts, which we concluded suggests that each text had only one murder!
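A word cloud is essentially a picture of word counts, so the same comparison can be sketched in a few lines of Python. This is not how Many-Eyes works internally. The function name `top_words`, the short stop-word list, and the sample sentence are our own assumptions for illustration. Whether words such as “upon” and “one” show up depends heavily on which stop words are filtered out, which may explain why they dominated our clouds.

```python
import re
from collections import Counter

# A deliberately short, made-up stop-word list. Real lists are much longer,
# and many of them would also remove words such as "upon" and "one".
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "that"}


def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sentence, not a quotation from any of our texts.
    sample = "Upon the table lay one letter, and upon the letter one seal."
    print(top_words(sample, n=3))
```

Running `top_words` on each text separately and comparing the lists side by side gives the same kind of evidence as the word clouds, without needing to reformat the texts into a spreadsheet.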