Another month, another project in the books. This one was the most difficult but the most rewarding. I had to build a model that recommends certain types of weed strains to patients. This is a part of product called “MedCabinet”

The Data

There was preliminary data given from a Kaggle set last updated two years ago. This was already an impressive dataset with about 2500 rows, but it still contained a ton of NaN values and missing information. Using various sources, I cleaned and added relevant info and cleaned up all of NaN values.

Hopefully, people can find this updated info more useful.

The idea was to find a way to recommend strains based on just a few words from the user. Here’s the link to the Python notebook

As the Machine Learning Engineer, I deployed an NLP model to help the user decide what kind of cannabis strain is right for their current situation. Heres a quick rundown of how this works

Import the necessary tools

For this project, it’s necessary to import the panda’s library as well as the spacy for our NLP processing.

Read the cleaned CSV

The most time-consuming part of this project was manually cleaning up the given dataset. As I mentioned before, there were a ton of NaN values, that I personally cleaned with extra research. The easier part was importing the CSV into the notebook.

Tokenize and Vectorize

Tokenization is the process of splitting the string into the list of tokens. For example, a full sentence can be tokenized into words. This process makes it easier to find patterns. Using the “bag of words” model and NLP’s tokenizer method, I created a nice column of tokens. Now, using sklearn’s TfidfVectorizer, we can now convert these tokens into numerical data.

Use a KNN Model

The reason we convert the words into numerical data is to help run our Nearest Neighbors Model. The idea is to take the user given words and match them to closest in meaning with ours. The algorithm calculates the “closeness” and provides its “nearest neighbors”

Pickle the Model

The model should be employed on our main website. A way to do this pickling the model, which is basically saving and loading the model.

Create a search function

I also wanted to test the machine learning model locally. To do so, I created a user text search function. I can run the method and it will display the top five matches below.

Heres a small demo: